fix(api): race conditions with multiple APIs and fresh orchestrators by jakubno · Pull Request #2191 · e2b-dev/infra

jakubno · 2026-03-20T16:51:23Z

Fixes the following race:
1. A new orchestrator node is running.
2. Nomad service discovery knows about it.
3. API 1 creates a sandbox on the node.
4. API 2 receives a request to manipulate a sandbox on that node, but API 2 doesn't know about the node yet.
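
A minimal sketch of the cache-miss fallback that addresses this sequence, assuming a simplified in-memory node cache. `Pool`, `Node`, `get`, `getOrConnect`, and the `discover` hook are hypothetical names for illustration; they are not the PR's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// Node is a hypothetical stand-in for a connected orchestrator node.
type Node struct{ ID string }

// Pool is a hypothetical per-API cache of connected nodes.
type Pool struct {
	mu    sync.RWMutex
	nodes map[string]*Node
	// discover is a hypothetical hook that returns the node IDs currently
	// known to service discovery.
	discover func() []string
}

// get returns the cached node, or nil on a cache miss.
func (p *Pool) get(id string) *Node {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.nodes[id]
}

// getOrConnect tries the cache first and, on a miss, runs one discovery
// pass before looking the node up again.
func (p *Pool) getOrConnect(id string) *Node {
	if n := p.get(id); n != nil {
		return n
	}
	discovered := p.discover()
	p.mu.Lock()
	for _, d := range discovered {
		if _, ok := p.nodes[d]; !ok {
			p.nodes[d] = &Node{ID: d}
		}
	}
	p.mu.Unlock()
	return p.get(id)
}

func main() {
	// API 2 starts with an empty cache; service discovery already knows node-a.
	api2 := &Pool{
		nodes:    map[string]*Node{},
		discover: func() []string { return []string{"node-a"} },
	}
	fmt.Println(api2.get("node-a") != nil)          // false: cache-only lookup misses
	fmt.Println(api2.getOrConnect("node-a") != nil) // true: fallback discovers the node
	fmt.Println(api2.getOrConnect("node-b") != nil) // false: unknown node still misses
}
```

The fallback only runs on a miss, so the common cached path stays a single read-locked map lookup.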

Note

Medium Risk
Medium risk because it changes orchestrator node lookup/connection behavior under concurrency and adds on-demand discovery paths; mistakes could cause missed nodes or extra load during cache misses.

Overview
Reduces race conditions when multiple API instances interact with newly joined orchestrators. The change deduplicates concurrent node connection attempts via singleflight, adds an on-demand getOrConnectNode fallback that triggers targeted discovery (a Nomad list or a cluster instance resync) on cache misses, and wires this fallback into sandbox operations that previously assumed the node was already cached. It also exposes Cluster.SyncInstances for immediate instance resync and makes the shared synchronization helper’s Sync context-cancellable with a semaphore guard. New tests cover cache-miss discovery, singleflight deduplication, and sync cancellation behavior.

Written by Cursor Bugbot for commit ad5d1ff. This will update automatically on new commits.

packages/shared/pkg/synchronization/synchronization.go

packages/api/internal/orchestrator/client.go

packages/shared/pkg/synchronization/synchronization_test.go

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


packages/api/internal/orchestrator/client.go

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3426452a0

packages/api/internal/orchestrator/client.go

sitole · 2026-03-23T21:10:14Z

packages/api/internal/orchestrator/client.go

+//   - Gap 2 (0–20 s): the node is in the local instance map but has not yet been
+//     promoted into o.nodes by keepInSync.


We can simplify by removing the two layers of node sync. This should remove duplicated logic, but it should be done separately, as it will introduce a lot of changes.

sitole · 2026-03-23T21:16:13Z

packages/api/internal/orchestrator/client.go

+//
+// discoveryGroup ensures that concurrent requests targeting the same missing
+// node share a single discovery attempt rather than fanning out.
+func (o *Orchestrator) getOrConnectNode(ctx context.Context, clusterID uuid.UUID, nodeID string) *nodemanager.Node {


Let's add some tracing here so we can see why some sandbox requests will be slower.
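
A rough sketch of the kind of signal such tracing could expose, using only the standard library: record which path served the lookup and how long it took. `lookupTrace` and `tracedLookup` are hypothetical names, and a real implementation would more likely attach this information to a span in whatever tracing library the service already uses.

```go
package main

import (
	"fmt"
	"time"
)

// lookupTrace is a hypothetical record of one node lookup.
type lookupTrace struct {
	Path     string        // "cache" or "discovery"
	Duration time.Duration // total time spent in the lookup
}

// tracedLookup runs the fast path first and the slow path only on a miss,
// recording which one served the request.
func tracedLookup(fast, slow func() bool) (bool, lookupTrace) {
	start := time.Now()
	if fast() {
		return true, lookupTrace{Path: "cache", Duration: time.Since(start)}
	}
	found := slow()
	return found, lookupTrace{Path: "discovery", Duration: time.Since(start)}
}

func main() {
	cacheMiss := func() bool { return false }
	discovery := func() bool {
		time.Sleep(5 * time.Millisecond) // simulate a service discovery round trip
		return true
	}

	found, tr := tracedLookup(cacheMiss, discovery)
	fmt.Println(found, tr.Path) // true discovery
	fmt.Println(tr.Duration >= 5*time.Millisecond)
}
```

Tagging the path makes slow sandbox requests attributable to discovery rather than to the sandbox operation itself.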

sitole · 2026-03-23T21:22:25Z

packages/api/internal/orchestrator/client.go

+
+// discoverNomadNode lists all ready Nomad nodes and connects any that are not yet in the pool.
+// Once a new node is connected its orchestrator ID becomes the map key, making subsequent GetNode calls succeed.
+func (o *Orchestrator) discoverNomadNode(ctx context.Context) {


Suggested change

-func (o *Orchestrator) discoverNomadNode(ctx context.Context) {
+func (o *Orchestrator) discoverNomadNodes(ctx context.Context) {

sitole · 2026-03-23T21:26:31Z

packages/api/internal/orchestrator/client.go

+	for _, n := range nomadNodes {
+		if o.GetNodeByNomadShortID(n.NomadNodeShortID) == nil {
+			wg.Go(func() {
+				if err := o.connectToNode(ctx, n); err != nil {


I think we can hit a race between the manual connect and the natural periodic one?

Same for discoverClusterNode.

That's why there's a connectGroup in connectToNode, which handles exactly this situation.
The same applies to discoverClusterNode: there's a connectGroup in connectToClusterNode.

e2b-request-same-site-reviewers bot assigned sitole Mar 20, 2026

claude bot reviewed Mar 20, 2026

packages/shared/pkg/synchronization/synchronization.go (resolved)

packages/api/internal/orchestrator/client.go (outdated, resolved)

cursor bot reviewed Mar 20, 2026

packages/api/internal/orchestrator/client.go (outdated, resolved)

cursor bot reviewed Mar 23, 2026

packages/shared/pkg/synchronization/synchronization_test.go (outdated, resolved)

cursor bot reviewed Mar 23, 2026

packages/api/internal/orchestrator/client.go (resolved)

jakubno added 7 commits March 23, 2026 15:53

fix(api): race conditions with multiple APIs and fresh orchestrators

7b7b576

fix: lint

278593a

fix: context issues

28ef543

chore: use semaphore with context

dba2377

chore: clean up the logic

a0c2968

chore: fix small issues

b9726e4

chore: add nodes oppurtunistically

b342645

jakubno force-pushed the fix/race-condition-for-new-nodes branch from 5751fe3 to b342645 Compare March 23, 2026 14:53

jakubno marked this pull request as ready for review March 23, 2026 14:54

jakubno requested review from ValentaTomas and dobrac as code owners March 23, 2026 14:54

chatgpt-codex-connector bot reviewed Mar 23, 2026

packages/api/internal/orchestrator/client.go (resolved)

jakubno added 2 commits March 23, 2026 17:17

chore: optimize the loop

d91442f

chore: add tests

3e8e22f

sitole self-requested a review March 23, 2026 16:44

fix: lint

ad5d1ff

sitole reviewed Mar 23, 2026


jakubno wants to merge 10 commits into main from fix/race-condition-for-new-nodes


