refactor(fdv2): add server FDv2 heuristic fallback and recovery#531
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 0bb7c75. Configure here.
| OnConditionFired(*cond_future.GetResult()); | ||
| } else { | ||
| OnSynchronizerResult(*next_future.GetResult()); | ||
| } |
There was a problem hiding this comment.
Unbounded continuation accumulation on conditions future in orchestration loop
Medium Severity
Each iteration of RunSynchronizerNext calls WhenAny(cond_future, next_future). The variadic WhenAny registers a Then continuation on cond_future, which shares the same PromiseInternal across all iterations (obtained via active_conditions_->GetFuture()). When the synchronizer result wins the race (normal path), the continuation on cond_future remains in the PromiseInternal::continuations_ vector, each holding a shared_ptr to its WhenAny's promise. For a stable primary connection where the fallback condition never fires, these continuations accumulate for the lifetime of the synchronizer — potentially days or weeks in a server SDK — causing linear memory growth proportional to the number of processed results.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 0bb7c75. Configure here.
There was a problem hiding this comment.
Ugh, I gotta think about this. Normally, I would cancel the promise that wasn't resolved, but for conditions, we re-use the same promise.
There was a problem hiding this comment.
Okay, so...
- Having the conditions have a single long-lived future that multiple callers wait on inherently has this problem.
- This isn't a problem with our Promise implementation -- every promise implementation would have this issue, even if the language had GC.
- AFAICT, our Java FDv2 implementation has this same issue, using
CompletableFutureandanyOf, and no one has noticed an issue. - The only reasonable solution that I can see here would be creating some kind of "subscriber" interface for conditions so that callers could "unsubscribe". But it's a bunch more code, and kind of messy.
Given that there's no simple workaround, and this hasn't been an issue for Java, I think we should just punt on this and maybe clean it up in the future.
There was a problem hiding this comment.
Yeah, I don't think this needs to block this PR. But it is something that we will want to think of a remediation for.
…163) ## Summary `FDv2DataSource.Conditions.getFuture()` returned the same shared `CompletableFuture<Object>` instance to every caller. The run loop does `CompletableFuture.anyOf(getFuture(), synchronizer.next()).get()` on every iteration, which attaches a new `OrRelay` `Completion` to the shared future's `stack` each time. `CompletableFuture` has no deregister path for the loser of an `anyOf` race, so those `Completion` nodes stay on the stack until the shared future itself completes. On a healthy primary streaming ChangeSets without ever firing fallback/recovery, the shared future never completes — the `stack` grows monotonically for the synchronizer's entire tenure (effectively the SDK's uptime on a stable server). **Per-iteration cost: ~200 B** (OrRelay + anyOf result CF + chain references). **At 10 ChangeSets/sec sustained: ~150 MB/day per active synchronizer.** ## The fix A single permanent `whenComplete` listener on the underlying aggregate fans out completion to every fresh future handed out by `getFuture()`. Pending fresh futures are tracked via `WeakReference`, so a fresh future whose only strong references were the caller's local variables (typical lifetime: one loop iteration) becomes garbage-collectable once that iteration ends. Pending entries whose referent has been collected are pruned opportunistically on each `getFuture()` call and on `close()`. `Conditions` is now package-private (was `private`) so direct unit tests can reach it. A test-only `pendingSize()` helper is added. ## Test plan Adds `FDv2DataSourceConditionsAggregateTest` with five tests: - **`getFutureReturnsDistinctInstancesPerCall`** — bug-prover. Fails on the pre-fix shared-instance behavior, passes after the fix. - **`getFutureReturnsDistinctInstancesEvenWithNoConditions`** — bug-prover. Covers the empty-conditions case (single-synchronizer configuration), which is exactly where per-iteration accumulation would be most damaging. - **`allFreshFuturesCompleteWhenAggregateFires`** — verifies fan-out via the single permanent listener actually delivers to multiple fresh futures handed out before the aggregate fires. - **`getFutureAfterAggregateFiresReturnsCompletedFuture`** — verifies the fast path: callers arriving after completion get an already-completed future synchronously. - **`pendingListDoesNotGrowUnboundedlyWhenFreshFuturesAreDropped`** — 10k-iteration soak test that simulates the run-loop pattern (race a fresh future against a fast-resolving sibling, drop the result) and asserts the pending list stays bounded via GC + opportunistic pruning. Caveat in the test docstring about `System.gc()` not being guaranteed — if it ever flakes on CI we can migrate to `-XX:+UseSerialGC` or relax the ceiling. Verified bug-proving discipline: the two distinctness tests fail on the pre-fix shared-instance behavior and pass after the fix. The full server-sdk test suite (1857 tests across 109 classes) is clean. ## Context This was identified during a multi-agent review of the analogous cpp-sdks PR (launchdarkly/cpp-sdks#531), which mirrors this Java implementation's `Conditions` design. The cpp version has the same structural leak; this Java fix shape is what was prototyped there. Filing here first since the runtime impact on a long-running JVM-based server SDK is more pronounced. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches the FDv2 synchronizer condition-aggregation logic used in the main run loop; mistakes could cause missed fallback/recovery signals or incorrect exceptional completion behavior, though changes are localized and covered by new unit tests. > > **Overview** > Prevents a long-lived memory leak in `FDv2DataSource.Conditions` by changing `getFuture()` to return a *fresh* `CompletableFuture` per call until the underlying condition aggregate completes, rather than returning the same shared pending future each iteration. > > Adds a single `whenComplete` fan-out from the aggregate to complete all outstanding per-call futures (and to propagate exceptional completion), tracks pending futures in a `WeakHashMap`-backed set for GC cleanup, and makes `Conditions` package-private to allow direct testing. > > Introduces `FDv2DataSourceConditionsAggregateTest` to assert per-call distinctness, correct completion fan-out, and correct behavior on exceptional and post-completion paths. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a6c7c36. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY -->


Summary
Heuristic FDv2 fallback (interrupted ≥ 2 min) and recovery (valid ≥ 5 min) for the server SDK, modeled on Java's event-driven
Conditiondesign. The orchestrator races a condition future againstsynchronizer.Next(); whichever resolves first wins. Synchronizer rotation is cyclic withAvailable/Blockedstate, mirroring Java'sSourceManager.What's in
IFDv2Condition+ concreteFallbackCondition/RecoveryConditionwith internal timers and anInform/Executeinterface.Conditionsaggregator over multiple conditions, built on a newasync::WhenAnyvector overload.SourceManagerowns the synchronizer factory list with per-factoryAvailable/Blockedstate. Iteration is cyclic (wraps + skips blocked).BlockCurrentSynchronizeronTerminalError;ResetSourceIndexon recovery.FDv2DataSystembuilds conditions keyed offAvailableSynchronizerCount/IsPrimeSynchronizer(no conditions if only one available; prime gets fallback only; others get fallback + recovery).timeoutargument fromIFDv2Synchronizer::Next()and removes the unreachableFDv2SourceResult::Timeoutvariant.Deferred to future PRs
X-LD-FD-Fallbackresponse header) — server-driven, one-way switch to an FDv1 synchronizer.IFDv2Synchronizer.Test plan
Unit tests for each new component and integration tests for fallback advance, wrap-around on last synchronizer, recovery reset, and exhaustion via
block.
Note
Medium Risk
Changes FDv2 orchestration and synchronizer interfaces (removing
Next()timeouts/Timeoutresults) while introducing timer-driven fallback/recovery and cyclic source rotation, which could affect update reliability and shutdown behavior.Overview
Adds a new FDv2 condition abstraction (
IFDv2Condition) with timedFallbackCondition/RecoveryConditionimplementations plus aConditionsaggregator that races condition futures againstsynchronizer.Next()to trigger source transitions.Refactors FDv2 synchronizer rotation into a new
SourceManagerthat cycles through synchronizer factories, skips blocked sources, blocks onTerminalError, and resets to the most-preferred source on recovery.Removes FDv2 per-
Next()timeouts by dropping the timeout parameter fromIFDv2Synchronizer::Next()and deleting theFDv2SourceResult::Timeoutvariant; updates polling/streaming synchronizers and tests accordingly. Also adds a vector overload ofasync::WhenAnyand comprehensive unit/integration tests for the new rotation and condition behavior.Reviewed by Cursor Bugbot for commit 1f7f26e. Bugbot is set up for automated code reviews on this repo. Configure here.