Skip to content

fix(api): race conditions with multiple APIs and fresh orchestrators#2191

Open
jakubno wants to merge 10 commits intomainfrom
fix/race-condition-for-new-nodes
Open

fix(api): race conditions with multiple APIs and fresh orchestrators#2191
jakubno wants to merge 10 commits intomainfrom
fix/race-condition-for-new-nodes

Conversation

@jakubno
Copy link
Member

@jakubno jakubno commented Mar 20, 2026

Fixes:
// 1. A new orchestrator node is running
// 2. Nomad service discovery knows about it
// 3. API 1 creates a sandbox on the node
// 4. API 2 receives a request to manipulate sandbox on the node, which it doesn't know about yet

Note

Medium Risk
Medium risk because it changes orchestrator node lookup/connection behavior under concurrency and adds on-demand discovery paths; mistakes could cause missed nodes or extra load during cache misses.

Overview
Reduces race conditions when multiple API instances interact with newly joined orchestrators by deduplicating concurrent node connection attempts via singleflight, adding an on-demand getOrConnectNode fallback that triggers targeted discovery (Nomad list or cluster instance resync) on cache misses, and wiring this fallback into sandbox operations that previously assumed the node was already cached. It also exposes Cluster.SyncInstances for immediate instance resync and makes the shared synchronization helper’s Sync context-cancellable with a semaphore guard, with new tests covering cache-miss discovery, singleflight deduplication, and sync cancellation behavior.

Written by Cursor Bugbot for commit ad5d1ff. This will update automatically on new commits. Configure here.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@jakubno jakubno force-pushed the fix/race-condition-for-new-nodes branch from 5751fe3 to b342645 Compare March 23, 2026 14:53
@jakubno jakubno marked this pull request as ready for review March 23, 2026 14:54
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3426452a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@sitole sitole self-requested a review March 23, 2026 16:44
Comment on lines +136 to +137
// - Gap 2 (0–20 s): the node is in the local instance map but has not yet been
// promoted into o.nodes by keepInSync.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify by removing two layers os nodes sync. This should remove duplicated logic, but it should be done separately as it will introduce lot of changes.

//
// discoveryGroup ensures that concurrent requests targeting the same missing
// node share a single discovery attempt rather than fanning out.
func (o *Orchestrator) getOrConnectNode(ctx context.Context, clusterID uuid.UUID, nodeID string) *nodemanager.Node {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some tracing here so we can see why some sandbox requests will be slower.


// discoverNomadNode lists all ready Nomad nodes and connects any that are not yet in the pool.
// Once a new node is connected its orchestrator ID becomes the map key, making subsequent GetNode calls succeed.
func (o *Orchestrator) discoverNomadNode(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (o *Orchestrator) discoverNomadNode(ctx context.Context) {
func (o *Orchestrator) discoverNomadNodes(ctx context.Context) {

for _, n := range nomadNodes {
if o.GetNodeByNomadShortID(n.NomadNodeShortID) == nil {
wg.Go(func() {
if err := o.connectToNode(ctx, n); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can hit the race of manual connect and natural periodic one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for discoverClusterNode

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's why there's connectGroupin connectToNode handling exactly this situation
same for discoverClusterNode, there's connectGroup in connectToClusterNode

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants