fix(graphql): prevent memory leak and deadlock in subscription resolvers #5397
Sanchit2662 wants to merge 7 commits into litmuschaos:master
Conversation
- Add proper cleanup in `GetInfraEvents` to remove channels on disconnect
- Use non-blocking sends in `SendInfraEvent` to prevent mutex deadlock
- Add mutex protection to map deletes in `GetPodLog`, `GetKubeObject`, `GetKubeNamespace`

Signed-off-by: Sanchit2662 <[email protected]>
Hi @PriteshKiri, @amityt, @SarthakJain26, whenever you get a chance I’d really appreciate a review. Thanks!
Pull request overview
This PR addresses concurrency problems in the ChaosCenter GraphQL subscription layer, focusing on preventing blocked publishers and cleaning up subscription listeners to avoid leaked channels and map access hazards.
Changes:
- Made `SendInfraEvent` publish using a non-blocking channel send to avoid indefinitely blocking while holding the shared mutex.
- Added `GetInfraEvents` subscription cleanup to remove the subscriber channel on `ctx.Done()`.
- Wrapped several subscription cleanup `delete(...)` operations (`ExperimentLog`, `KubeObjectData`, `KubeNamespaceData`) with the shared mutex.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| chaoscenter/graphql/server/pkg/chaos_infrastructure/service.go | Switches infra event fan-out to non-blocking sends to prevent deadlocks. |
| chaoscenter/graphql/server/graph/chaos_infrastructure.resolvers.go | Adds disconnect cleanup for infra event subscriptions and mutex-protects cleanup deletes for several subscription maps. |
```diff
 	r.Mutex.Lock()
 	if r.InfraEventPublish != nil {
 		for _, observer := range r.InfraEventPublish[infra.ProjectID] {
-			observer <- &newEvent
+			// Use non-blocking send to prevent deadlock if channel buffer is full
+			select {
+			case observer <- &newEvent:
+			default:
+				// Channel full or no receiver, skip to prevent blocking
+			}
 		}
```
```go
			data_store.Store.InfraEventPublish[projectID] = append(channels[:i], channels[i+1:]...)
			break
		}
	}
```
```go
logrus.Print("CLOSED LOG LISTENER: ", request.InfraID, request.PodName)
data_store.Store.Mutex.Lock()
delete(data_store.Store.ExperimentLog, reqID.String())
data_store.Store.Mutex.Unlock()
```
```go
	delete(data_store.Store.KubeObjectData, reqID.String())
	data_store.Store.Mutex.Unlock()
}()
go r.chaosExperimentHandler.GetKubeObjData(reqID.String(), request, *data_store.Store)
```
```go
<-ctx.Done()
logrus.Println("Closed KubeNamespace Listener")
data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeNamespaceData, reqID.String())
data_store.Store.Mutex.Unlock()
```
@Sanchit2662 please check the comments from Copilot.
…eanup in subscription resolvers

- Copy observer slice under lock before fan-out in `SendInfraEvent` to avoid holding mutex during channel sends
- Delete `InfraEventPublish` map key when last subscriber disconnects to prevent map growth
- Guard `ExperimentLog`, `KubeObjectData`, `KubeNamespaceData` map reads with mutex in service.go (`PodLog`, `KubeObj`, `KubeNamespace`)
- Guard `ConnectedInfra` and data map reads with mutex in handler.go (`GetLogs`, `GetKubeObjData`, `GetKubeNamespaceData`)

Signed-off-by: Sanchit2662 <[email protected]>
Hi @SarthakJain26, @PriteshKiri, I have made the requested changes.
Pull request overview
This PR hardens ChaosCenter’s GraphQL subscription plumbing to avoid leaks and deadlocks in infra event/log subscriptions by improving cleanup behavior and reducing lock contention during publish.
Changes:
- Adds context-cancel cleanup for `GetInfraEvents` subscribers and deletes empty per-project publish lists.
- Makes infra event publishing non-blocking and avoids holding the shared mutex while delivering events.
- Adds mutex protection around several subscription map reads/deletes to prevent concurrent map access panics.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| chaoscenter/graphql/server/pkg/chaos_infrastructure/service.go | Copies infra-event observers under lock and publishes via non-blocking sends; adds locking around request-channel map reads. |
| chaoscenter/graphql/server/pkg/chaos_experiment/handler/handler.go | Adds mutex protection around ConnectedInfra and per-request channel map reads before sending responses/closing channels. |
| chaoscenter/graphql/server/graph/chaos_infrastructure.resolvers.go | Adds disconnect cleanup for infra event subscribers and guards request-map deletes with the shared mutex. |
```go
channels := data_store.Store.InfraEventPublish[projectID]
for i, ch := range channels {
	if ch == infraEvent {
		data_store.Store.InfraEventPublish[projectID] = append(channels[:i], channels[i+1:]...)
```
When removing `infraEvent` from the channels slice, the current `append(channels[:i], channels[i+1:]...)` can retain a reference to the removed channel in the underlying array (notably when removing the last element). That keeps the channel (and its buffered data) reachable and can undermine the memory-leak fix under churn. Consider shifting elements with `copy`, explicitly nil-ing the last slot, and then reslicing (or otherwise ensuring the removed element is cleared) before storing back into the map.
```diff
-data_store.Store.InfraEventPublish[projectID] = append(channels[:i], channels[i+1:]...)
+copy(channels[i:], channels[i+1:])
+channels[len(channels)-1] = nil
+channels = channels[:len(channels)-1]
+data_store.Store.InfraEventPublish[projectID] = channels
```
```go
logrus.Println("Closed KubeObj Listener")
data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeObjectData, reqID.String())
data_store.Store.Mutex.Unlock()
```
Even with the map delete guarded by the mutex, `kubeObjData` is created as an unbuffered channel (`make(chan *model.KubeObjectResponse)` earlier in this resolver). If the subscriber responds right around disconnect, the infra service can still obtain the channel before this delete and then block forever on `reqChan <- &resp` after the GraphQL layer stops receiving. Consider making these per-request channels buffered (size 1, like `GetPodLog`) and/or making the send in `KubeObj` non-blocking to avoid goroutine leaks on client disconnects.
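A minimal sketch of the buffered-channel suggestion. The names here (`respond`, plain `string` payloads in place of `*model.KubeObjectResponse`) are illustrative, not the actual ChaosCenter types:

```go
package main

import "fmt"

// respond attempts to deliver resp on reqChan without ever blocking:
// with a buffer of 1, a late response still lands even if the subscriber
// has just disconnected, and the select's default branch covers the case
// where the buffer is already occupied.
func respond(reqChan chan string, resp string) bool {
	select {
	case reqChan <- resp:
		return true
	default:
		// buffer full and nobody receiving; drop instead of leaking a goroutine
		return false
	}
}

func main() {
	// Buffered with size 1, mirroring the GetPodLog pattern the review cites.
	reqChan := make(chan string, 1)
	fmt.Println(respond(reqChan, "kube-object payload")) // true: buffered send succeeds with no receiver
	fmt.Println(respond(reqChan, "second payload"))      // false: buffer occupied, send is dropped
}
```

Either half of the suggestion (buffering or the non-blocking send) removes the hang; combining both also tolerates a subscriber that stops reading mid-stream.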
```go
logrus.Println("Closed KubeNamespace Listener")
data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeNamespaceData, reqID.String())
data_store.Store.Mutex.Unlock()
```
Same disconnect race exists for namespace subscriptions: `kubeNamespaceData` is also unbuffered in this resolver, so a late subscriber response can block indefinitely on `reqChan <- &resp` if it grabs the channel before this delete and the GraphQL runtime has stopped receiving. Buffering the channel (size 1) and/or making the send in `KubeNamespace` non-blocking would prevent goroutine leaks on disconnect.
Hey @Sanchit2662, could you please review the comments from Copilot?
Summary
This PR fixes a critical concurrency issue in the ChaosCenter GraphQL subscription layer that could lead to unbounded memory growth and a process-wide deadlock under normal UI usage.
Specifically, `GetInfraEvents` subscriptions were leaking channels after client disconnects, and `SendInfraEvent` could block indefinitely while holding a shared mutex. Over time, this caused the GraphQL server to become unresponsive with no crash logs or clear error signals.

The fix ensures proper subscription cleanup, prevents blocking sends, and hardens related cleanup paths against concurrent map access.
Fix
1. Proper subscription cleanup on disconnect
Channels are now removed from the publisher slice when the subscription context is cancelled:
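The snippet itself did not survive extraction; the following is a minimal, runnable sketch of the cleanup pattern under hypothetical names (`store`, `removeObserver`, `string` channels), not the actual ChaosCenter types:

```go
package main

import (
	"fmt"
	"sync"
)

// store mimics a shared data store: project ID -> observer channels,
// guarded by a mutex.
type store struct {
	mu      sync.Mutex
	publish map[string][]chan string
}

// removeObserver runs when a subscription's context is cancelled: it
// removes ch from the project's slice under the lock, clears the vacated
// slot so the channel is garbage-collectable, and drops the map key once
// the last subscriber is gone.
func (s *store) removeObserver(projectID string, ch chan string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	channels := s.publish[projectID]
	for i, c := range channels {
		if c == ch {
			copy(channels[i:], channels[i+1:])
			channels[len(channels)-1] = nil // clear the slot; no lingering reference
			s.publish[projectID] = channels[:len(channels)-1]
			break
		}
	}
	if len(s.publish[projectID]) == 0 {
		delete(s.publish, projectID) // prevent unbounded map growth
	}
}

func main() {
	s := &store{publish: map[string][]chan string{}}
	ch := make(chan string, 1)
	s.publish["p1"] = append(s.publish["p1"], ch)
	s.removeObserver("p1", ch)
	fmt.Println(len(s.publish)) // 0: the empty project key was deleted
}
```

In the resolver this would run after `<-ctx.Done()`, so disconnected clients no longer leave channels behind in the publish map.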
2. Non-blocking event delivery to prevent deadlocks
Event publishing no longer blocks on slow or disconnected subscribers:
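The original code block is not captured here; a runnable sketch of the non-blocking fan-out, using illustrative names (`broadcast`, plain `string` channels in place of the event type):

```go
package main

import "fmt"

// broadcast delivers event to each observer without blocking: if a
// subscriber's buffer is full (or it has stopped receiving), the select
// falls through to default and that subscriber is skipped, so one stalled
// channel cannot stall the publisher. Returns how many sends succeeded.
func broadcast(observers []chan string, event string) int {
	delivered := 0
	for _, observer := range observers {
		select {
		case observer <- event:
			delivered++
		default:
			// channel full or no receiver; skip to keep the fan-out moving
		}
	}
	return delivered
}

func main() {
	ready := make(chan string, 1)   // has buffer space
	stalled := make(chan string, 1) // already full: simulates a slow subscriber
	stalled <- "old"
	fmt.Println(broadcast([]chan string{ready, stalled}, "new")) // 1
}
```

The trade-off of `select`/`default` is that a slow subscriber may miss events, which is generally acceptable for UI event streams and far safer than blocking while a shared mutex is held.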
This ensures one stalled subscription cannot block the entire system.
3. Thread-safe cleanup in related subscriptions
Cleanup paths in `GetPodLog`, `GetKubeObject`, and `GetKubeNamespace` now properly guard map deletes with the shared mutex, preventing concurrent map access panics.

Impact
Types of changes
Checklist