Skip to content

refactor(fdv2): add server FDv2 heuristic fallback and recovery#531

Merged
beekld merged 10 commits into
mainfrom
beeklimt/SDK-2099
May 15, 2026
Merged

refactor(fdv2): add server FDv2 heuristic fallback and recovery#531
beekld merged 10 commits into
mainfrom
beeklimt/SDK-2099

Conversation

@beekld
Copy link
Copy Markdown
Contributor

@beekld beekld commented May 11, 2026

Summary

Heuristic FDv2 fallback (interrupted ≥ 2 min) and recovery (valid ≥ 5 min) for the server SDK, modeled on Java's event-driven Condition design. The orchestrator races a condition future against synchronizer.Next(); whichever resolves first wins. Synchronizer rotation is cyclic with Available/Blocked state, mirroring Java's SourceManager.

What's in

  • IFDv2Condition + concrete FallbackCondition / RecoveryCondition with internal timers and an Inform / Execute interface.
  • Conditions aggregator over multiple conditions, built on a new async::WhenAny vector overload.
  • SourceManager owns the synchronizer factory list with per-factory Available/Blocked state. Iteration is cyclic (wraps + skips blocked). BlockCurrentSynchronizer on TerminalError; ResetSourceIndex on recovery.
  • FDv2DataSystem builds conditions keyed off AvailableSynchronizerCount/IsPrimeSynchronizer (no conditions if only one available; prime gets fallback only; others get fallback + recovery).
  • Drops the timeout argument from IFDv2Synchronizer::Next() and removes the unreachable FDv2SourceResult::Timeout variant.

Deferred to future PRs

  • FDv1 fallback directive (X-LD-FD-Fallback response header) — server-driven, one-way switch to an FDv1 synchronizer.
  • FDv1 synchronizer adapter wrapping the existing FDv1 sources as IFDv2Synchronizer.
  • Most of the spec's orchestration-logging requirements.

Test plan

Unit tests for each new component and integration tests for fallback advance, wrap-around on last synchronizer, recovery reset, and exhaustion via
block.


Note

Medium Risk
Changes FDv2 orchestration and synchronizer interfaces (removing Next() timeouts/Timeout results) while introducing timer-driven fallback/recovery and cyclic source rotation, which could affect update reliability and shutdown behavior.

Overview
Adds a new FDv2 condition abstraction (IFDv2Condition) with timed FallbackCondition/RecoveryCondition implementations plus a Conditions aggregator that races condition futures against synchronizer.Next() to trigger source transitions.

Refactors FDv2 synchronizer rotation into a new SourceManager that cycles through synchronizer factories, skips blocked sources, blocks on TerminalError, and resets to the most-preferred source on recovery.

Removes FDv2 per-Next() timeouts by dropping the timeout parameter from IFDv2Synchronizer::Next() and deleting the FDv2SourceResult::Timeout variant; updates polling/streaming synchronizers and tests accordingly. Also adds a vector overload of async::WhenAny and comprehensive unit/integration tests for the new rotation and condition behavior.

Reviewed by Cursor Bugbot for commit 1f7f26e. Bugbot is set up for automated code reviews on this repo. Configure here.

@beekld beekld marked this pull request as ready for review May 12, 2026 22:30
@beekld beekld requested a review from a team as a code owner May 12, 2026 22:30
Comment thread libs/server-sdk/src/data_systems/fdv2/conditions.cpp
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0bb7c75. Configure here.

OnConditionFired(*cond_future.GetResult());
} else {
OnSynchronizerResult(*next_future.GetResult());
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded continuation accumulation on conditions future in orchestration loop

Medium Severity

Each iteration of RunSynchronizerNext calls WhenAny(cond_future, next_future). The variadic WhenAny registers a Then continuation on cond_future, which shares the same PromiseInternal across all iterations (obtained via active_conditions_->GetFuture()). When the synchronizer result wins the race (normal path), the continuation on cond_future remains in the PromiseInternal::continuations_ vector, each holding a shared_ptr to its WhenAny's promise. For a stable primary connection where the fallback condition never fires, these continuations accumulate for the lifetime of the synchronizer — potentially days or weeks in a server SDK — causing linear memory growth proportional to the number of processed results.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0bb7c75. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I gotta think about this. Normally, I would cancel the promise that wasn't resolved, but for conditions, we re-use the same promise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so...

  1. Having the conditions have a single long-lived future that multiple callers wait on inherently has this problem.
  2. This isn't a problem with our Promise implementation -- every promise implementation would have this issue, even if the language had GC.
  3. AFAICT, our Java FDv2 implementation has this same issue, using CompletableFuture and anyOf, and no one has noticed an issue.
  4. The only reasonable solution that I can see here would be creating some kind of "subscriber" interface for conditions so that callers could "unsubscribe". But it's a bunch more code, and kind of messy.

Given that there's no simple workaround, and this hasn't been an issue for Java, I think we should just punt on this and maybe clean it up in the future.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think this needs to block this PR. But it is something that we will want to think of a remediation for.

@beekld beekld merged commit 64b98e3 into main May 15, 2026
47 checks passed
@beekld beekld deleted the beeklimt/SDK-2099 branch May 15, 2026 21:55
kinyoklion added a commit to launchdarkly/java-core that referenced this pull request May 26, 2026
…163)

## Summary

`FDv2DataSource.Conditions.getFuture()` returned the same shared
`CompletableFuture<Object>` instance to every caller. The run loop does
`CompletableFuture.anyOf(getFuture(), synchronizer.next()).get()` on
every iteration, which attaches a new `OrRelay` `Completion` to the
shared future's `stack` each time. `CompletableFuture` has no deregister
path for the loser of an `anyOf` race, so those `Completion` nodes stay
on the stack until the shared future itself completes.

On a healthy primary streaming ChangeSets without ever firing
fallback/recovery, the shared future never completes — the `stack` grows
monotonically for the synchronizer's entire tenure (effectively the
SDK's uptime on a stable server).

**Per-iteration cost: ~200 B** (OrRelay + anyOf result CF + chain
references).
**At 10 ChangeSets/sec sustained: ~150 MB/day per active synchronizer.**

## The fix

A single permanent `whenComplete` listener on the underlying aggregate
fans out completion to every fresh future handed out by `getFuture()`.
Pending fresh futures are tracked via `WeakReference`, so a fresh future
whose only strong references were the caller's local variables (typical
lifetime: one loop iteration) becomes garbage-collectable once that
iteration ends. Pending entries whose referent has been collected are
pruned opportunistically on each `getFuture()` call and on `close()`.

`Conditions` is now package-private (was `private`) so direct unit tests
can reach it. A test-only `pendingSize()` helper is added.

## Test plan

Adds `FDv2DataSourceConditionsAggregateTest` with five tests:

- **`getFutureReturnsDistinctInstancesPerCall`** — bug-prover. Fails on
the pre-fix shared-instance behavior, passes after the fix.
- **`getFutureReturnsDistinctInstancesEvenWithNoConditions`** —
bug-prover. Covers the empty-conditions case (single-synchronizer
configuration), which is exactly where per-iteration accumulation would
be most damaging.
- **`allFreshFuturesCompleteWhenAggregateFires`** — verifies fan-out via
the single permanent listener actually delivers to multiple fresh
futures handed out before the aggregate fires.
- **`getFutureAfterAggregateFiresReturnsCompletedFuture`** — verifies
the fast path: callers arriving after completion get an
already-completed future synchronously.
- **`pendingListDoesNotGrowUnboundedlyWhenFreshFuturesAreDropped`** —
10k-iteration soak test that simulates the run-loop pattern (race a
fresh future against a fast-resolving sibling, drop the result) and
asserts the pending list stays bounded via GC + opportunistic pruning.
Caveat in the test docstring about `System.gc()` not being guaranteed —
if it ever flakes on CI we can migrate to `-XX:+UseSerialGC` or relax
the ceiling.

Verified bug-proving discipline: the two distinctness tests fail on the
pre-fix shared-instance behavior and pass after the fix. The full
server-sdk test suite (1857 tests across 109 classes) is clean.

## Context

This was identified during a multi-agent review of the analogous
cpp-sdks PR (launchdarkly/cpp-sdks#531), which mirrors this Java
implementation's `Conditions` design. The cpp version has the same
structural leak; this Java fix shape is what was prototyped there.
Filing here first since the runtime impact on a long-running JVM-based
server SDK is more pronounced.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches the FDv2 synchronizer condition-aggregation logic used in the
main run loop; mistakes could cause missed fallback/recovery signals or
incorrect exceptional completion behavior, though changes are localized
and covered by new unit tests.
> 
> **Overview**
> Prevents a long-lived memory leak in `FDv2DataSource.Conditions` by
changing `getFuture()` to return a *fresh* `CompletableFuture` per call
until the underlying condition aggregate completes, rather than
returning the same shared pending future each iteration.
> 
> Adds a single `whenComplete` fan-out from the aggregate to complete
all outstanding per-call futures (and to propagate exceptional
completion), tracks pending futures in a `WeakHashMap`-backed set for GC
cleanup, and makes `Conditions` package-private to allow direct testing.
> 
> Introduces `FDv2DataSourceConditionsAggregateTest` to assert per-call
distinctness, correct completion fan-out, and correct behavior on
exceptional and post-completion paths.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
a6c7c36. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants