fix(api): race conditions with multiple APIs and fresh orchestrators#2191
Conversation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Force-pushed from 5751fe3 to b342645.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3426452a0
```go
// - Gap 2 (0–20 s): the node is in the local instance map but has not yet been
//   promoted into o.nodes by keepInSync.
```
We can simplify by removing two layers of node sync. This should remove duplicated logic, but it should be done separately as it will introduce a lot of changes.
```go
// discoveryGroup ensures that concurrent requests targeting the same missing
// node share a single discovery attempt rather than fanning out.
func (o *Orchestrator) getOrConnectNode(ctx context.Context, clusterID uuid.UUID, nodeID string) *nodemanager.Node {
```
Let's add some tracing here so we can see why some sandbox requests will be slower.
```go
// discoverNomadNode lists all ready Nomad nodes and connects any that are not yet in the pool.
// Once a new node is connected its orchestrator ID becomes the map key, making subsequent GetNode calls succeed.
func (o *Orchestrator) discoverNomadNode(ctx context.Context) {
```
Suggested change:

```diff
-func (o *Orchestrator) discoverNomadNode(ctx context.Context) {
+func (o *Orchestrator) discoverNomadNodes(ctx context.Context) {
```
```go
for _, n := range nomadNodes {
	if o.GetNodeByNomadShortID(n.NomadNodeShortID) == nil {
		wg.Go(func() {
			if err := o.connectToNode(ctx, n); err != nil {
```
I think we can hit a race between a manual connect and the natural periodic one?
That's why there's a connectGroup in connectToNode handling exactly this situation. Same for discoverClusterNode: there's a connectGroup in connectToClusterNode.
Fixes:
1. A new orchestrator node is running
2. Nomad service discovery knows about it
3. API 1 creates a sandbox on the node
4. API 2 receives a request to manipulate a sandbox on the node, which it doesn't know about yet
Note
Medium risk: this changes orchestrator node lookup/connection behavior under concurrency and adds on-demand discovery paths; mistakes could cause missed nodes or extra load during cache misses.
Overview
Reduces race conditions when multiple API instances interact with newly joined orchestrators by deduplicating concurrent node connection attempts via singleflight, adding an on-demand `getOrConnectNode` fallback that triggers targeted discovery (a Nomad node list or a cluster instance resync) on cache misses, and wiring this fallback into the sandbox operations that previously assumed the node was already cached. It also exposes `Cluster.SyncInstances` for immediate instance resync and makes the shared `synchronization` helper's `Sync` context-cancellable with a semaphore guard, with new tests covering cache-miss discovery, singleflight deduplication, and sync cancellation behavior.

Written by Cursor Bugbot for commit ad5d1ff.