fix(graphql): prevent memory leak and deadlock in subscription resolvers #5397
Sanchit2662 wants to merge 7 commits into litmuschaos:master
Conversation
- Add proper cleanup in `GetInfraEvents` to remove channels on disconnect
- Use non-blocking sends in `SendInfraEvent` to prevent mutex deadlock
- Add mutex protection to map deletes in `GetPodLog`, `GetKubeObject`, `GetKubeNamespace`

Signed-off-by: Sanchit2662 <[email protected]>
Hi @PriteshKiri, @amityt, @SarthakJain26, whenever you get a chance I’d really appreciate a review. Thanks!
Pull request overview
This PR addresses concurrency problems in the ChaosCenter GraphQL subscription layer, focusing on preventing blocked publishers and cleaning up subscription listeners to avoid leaked channels and map access hazards.
Changes:
- Made `SendInfraEvent` publish using a non-blocking channel send to avoid indefinitely blocking while holding the shared mutex.
- Added `GetInfraEvents` subscription cleanup to remove the subscriber channel on `ctx.Done()`.
- Wrapped several subscription cleanup `delete(...)` operations (`ExperimentLog`, `KubeObjectData`, `KubeNamespaceData`) with the shared mutex.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| chaoscenter/graphql/server/pkg/chaos_infrastructure/service.go | Switches infra event fan-out to non-blocking sends to prevent deadlocks. |
| chaoscenter/graphql/server/graph/chaos_infrastructure.resolvers.go | Adds disconnect cleanup for infra event subscriptions and mutex-protects cleanup deletes for several subscription maps. |
```diff
 	r.Mutex.Lock()
 	if r.InfraEventPublish != nil {
 		for _, observer := range r.InfraEventPublish[infra.ProjectID] {
-			observer <- &newEvent
+			// Use non-blocking send to prevent deadlock if channel buffer is full
+			select {
+			case observer <- &newEvent:
+			default:
+				// Channel full or no receiver, skip to prevent blocking
+			}
 		}
```
```go
			data_store.Store.InfraEventPublish[projectID] = append(channels[:i], channels[i+1:]...)
			break
		}
	}
```
```go
logrus.Print("CLOSED LOG LISTENER: ", request.InfraID, request.PodName)
data_store.Store.Mutex.Lock()
delete(data_store.Store.ExperimentLog, reqID.String())
data_store.Store.Mutex.Unlock()
```
```go
	delete(data_store.Store.KubeObjectData, reqID.String())
	data_store.Store.Mutex.Unlock()
}()
go r.chaosExperimentHandler.GetKubeObjData(reqID.String(), request, *data_store.Store)
```
```go
<-ctx.Done()
logrus.Println("Closed KubeNamespace Listener")
data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeNamespaceData, reqID.String())
data_store.Store.Mutex.Unlock()
```
@Sanchit2662 please check the comments from Copilot.
…eanup in subscription resolvers

- Copy observer slice under lock before fan-out in `SendInfraEvent` to avoid holding mutex during channel sends
- Delete `InfraEventPublish` map key when last subscriber disconnects to prevent map growth
- Guard `ExperimentLog`, `KubeObjectData`, `KubeNamespaceData` map reads with mutex in service.go (`PodLog`, `KubeObj`, `KubeNamespace`)
- Guard `ConnectedInfra` and data map reads with mutex in handler.go (`GetLogs`, `GetKubeObjData`, `GetKubeNamespaceData`)

Signed-off-by: Sanchit2662 <[email protected]>
Hi @SarthakJain26, @PriteshKiri, I have made the requested changes.
Pull request overview
This PR hardens ChaosCenter’s GraphQL subscription plumbing to avoid leaks and deadlocks in infra event/log subscriptions by improving cleanup behavior and reducing lock contention during publish.
Changes:
- Adds context-cancel cleanup for `GetInfraEvents` subscribers and deletes empty per-project publish lists.
- Makes infra event publishing non-blocking and avoids holding the shared mutex while delivering events.
- Adds mutex protection around several subscription map reads/deletes to prevent concurrent map access panics.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| chaoscenter/graphql/server/pkg/chaos_infrastructure/service.go | Copies infra-event observers under lock and publishes via non-blocking sends; adds locking around request-channel map reads. |
| chaoscenter/graphql/server/pkg/chaos_experiment/handler/handler.go | Adds mutex protection around ConnectedInfra and per-request channel map reads before sending responses/closing channels. |
| chaoscenter/graphql/server/graph/chaos_infrastructure.resolvers.go | Adds disconnect cleanup for infra event subscribers and guards request-map deletes with the shared mutex. |
```go
channels := data_store.Store.InfraEventPublish[projectID]
for i, ch := range channels {
	if ch == infraEvent {
		data_store.Store.InfraEventPublish[projectID] = append(channels[:i], channels[i+1:]...)
```
When removing `infraEvent` from the channels slice, the current `append(channels[:i], channels[i+1:]...)` can retain a reference to the removed channel in the underlying array (notably when removing the last element). That keeps the channel (and its buffered data) reachable and can undermine the memory-leak fix under churn. Consider shifting elements with `copy`, explicitly nil-ing the last slot, and then reslicing (or otherwise ensuring the removed element is cleared) before storing back into the map.
```diff
-data_store.Store.InfraEventPublish[projectID] = append(channels[:i], channels[i+1:]...)
+copy(channels[i:], channels[i+1:])
+channels[len(channels)-1] = nil
+channels = channels[:len(channels)-1]
+data_store.Store.InfraEventPublish[projectID] = channels
```
```go
logrus.Println("Closed KubeObj Listener")
data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeObjectData, reqID.String())
data_store.Store.Mutex.Unlock()
```
Even with the map delete guarded by the mutex, `kubeObjData` is created as an unbuffered channel (`make(chan *model.KubeObjectResponse)` earlier in this resolver). If the subscriber responds right around disconnect, the infra service can still obtain the channel before this delete and then block forever on `reqChan <- &resp` after the GraphQL layer stops receiving. Consider making these per-request channels buffered (size 1, like `GetPodLog`) and/or making the send in `KubeObj` non-blocking to avoid goroutine leaks on client disconnects.
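A minimal sketch of the buffered-channel suggestion. The names here (`respond`, plain `string` payloads in place of `*model.KubeObjectResponse`) are illustrative, not the actual ChaosCenter types:

```go
package main

import "fmt"

// respond attempts to deliver resp on reqChan without ever blocking:
// with a buffer of 1, a late response still lands even if the subscriber
// has just disconnected, and the select's default branch covers the case
// where the buffer is already occupied.
func respond(reqChan chan string, resp string) bool {
	select {
	case reqChan <- resp:
		return true
	default:
		// buffer full and nobody receiving; drop instead of leaking a goroutine
		return false
	}
}

func main() {
	// Buffered with size 1, mirroring the GetPodLog pattern the review cites.
	reqChan := make(chan string, 1)
	fmt.Println(respond(reqChan, "kube-object payload")) // true: buffered send succeeds with no receiver
	fmt.Println(respond(reqChan, "second payload"))      // false: buffer occupied, send is dropped
}
```

Either half of the suggestion (buffering or the non-blocking send) removes the hang; combining both also tolerates a subscriber that stops reading mid-stream.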
```go
logrus.Println("Closed KubeNamespace Listener")
data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeNamespaceData, reqID.String())
data_store.Store.Mutex.Unlock()
```
Same disconnect race exists for namespace subscriptions: `kubeNamespaceData` is also unbuffered in this resolver, so a late subscriber response can block indefinitely on `reqChan <- &resp` if it grabs the channel before this delete and the GraphQL runtime has stopped receiving. Buffering the channel (size 1) and/or making the send in `KubeNamespace` non-blocking would prevent goroutine leaks on disconnect.
Hey @Sanchit2662, could you please review the comments from Copilot?
Summary
This PR fixes a critical concurrency issue in the ChaosCenter GraphQL subscription layer that could lead to unbounded memory growth and a process-wide deadlock under normal UI usage.
Specifically, `GetInfraEvents` subscriptions were leaking channels after client disconnects, and `SendInfraEvent` could block indefinitely while holding a shared mutex. Over time, this caused the GraphQL server to become unresponsive with no crash logs or clear error signals.

The fix ensures proper subscription cleanup, prevents blocking sends, and hardens related cleanup paths against concurrent map access.
Fix
1. Proper subscription cleanup on disconnect
Channels are now removed from the publisher slice when the subscription context is cancelled:
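The snippet itself did not survive extraction; the following is a minimal, runnable sketch of the cleanup pattern under hypothetical names (`store`, `removeObserver`, `string` channels), not the actual ChaosCenter types:

```go
package main

import (
	"fmt"
	"sync"
)

// store mimics a shared data store: project ID -> observer channels,
// guarded by a mutex.
type store struct {
	mu      sync.Mutex
	publish map[string][]chan string
}

// removeObserver runs when a subscription's context is cancelled: it
// removes ch from the project's slice under the lock, clears the vacated
// slot so the channel is garbage-collectable, and drops the map key once
// the last subscriber is gone.
func (s *store) removeObserver(projectID string, ch chan string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	channels := s.publish[projectID]
	for i, c := range channels {
		if c == ch {
			copy(channels[i:], channels[i+1:])
			channels[len(channels)-1] = nil // clear the slot; no lingering reference
			s.publish[projectID] = channels[:len(channels)-1]
			break
		}
	}
	if len(s.publish[projectID]) == 0 {
		delete(s.publish, projectID) // prevent unbounded map growth
	}
}

func main() {
	s := &store{publish: map[string][]chan string{}}
	ch := make(chan string, 1)
	s.publish["p1"] = append(s.publish["p1"], ch)
	s.removeObserver("p1", ch)
	fmt.Println(len(s.publish)) // 0: the empty project key was deleted
}
```

In the resolver this would run after `<-ctx.Done()`, so disconnected clients no longer leave channels behind in the publish map.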
2. Non-blocking event delivery to prevent deadlocks
Event publishing no longer blocks on slow or disconnected subscribers:
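The original code block is not captured here; a runnable sketch of the non-blocking fan-out, using illustrative names (`broadcast`, plain `string` channels in place of the event type):

```go
package main

import "fmt"

// broadcast delivers event to each observer without blocking: if a
// subscriber's buffer is full (or it has stopped receiving), the select
// falls through to default and that subscriber is skipped, so one stalled
// channel cannot stall the publisher. Returns how many sends succeeded.
func broadcast(observers []chan string, event string) int {
	delivered := 0
	for _, observer := range observers {
		select {
		case observer <- event:
			delivered++
		default:
			// channel full or no receiver; skip to keep the fan-out moving
		}
	}
	return delivered
}

func main() {
	ready := make(chan string, 1)   // has buffer space
	stalled := make(chan string, 1) // already full: simulates a slow subscriber
	stalled <- "old"
	fmt.Println(broadcast([]chan string{ready, stalled}, "new")) // 1
}
```

The trade-off of `select`/`default` is that a slow subscriber may miss events, which is generally acceptable for UI event streams and far safer than blocking while a shared mutex is held.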
This ensures one stalled subscription cannot block the entire system.
3. Thread-safe cleanup in related subscriptions
Cleanup paths in `GetPodLog`, `GetKubeObject`, and `GetKubeNamespace` now properly guard map deletes with the shared mutex, preventing concurrent map access panics.

Impact
Types of changes
Checklist