Skip to content

feat(scaling): fan-out debounce (#7756 lever 3 prototype)#7766

Closed
JohnMcLear wants to merge 2 commits into
developfrom
feat/fanout-debounce
Closed

feat(scaling): fan-out debounce (#7756 lever 3 prototype)#7766
JohnMcLear wants to merge 2 commits into
developfrom
feat/fanout-debounce

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

Summary

The scaling-dive doc (#7765) identifies fan-out as the dominant cost at high concurrency:

  • `etherpad_changeset_apply_duration_seconds` stays under 5 ms even at 200 authors — server-side apply is not the bottleneck.
  • `etherpad_socket_emits_total{type=NEW_CHANGES}` scales O(N²): 1 160 emits/dwell at 20 authors becomes 66 032 at 200 authors.
  • Baseline cliffs at ~250 authors (errorRate flag fires).

This is the prototype for spec lever 3 — coalesce per-pad fan-out within a debounce window. With `fanoutDebounceMs: 50` (or any value > 0), rapid commits batch into a single `updatePadClients` invocation per pad per window. With `fanoutDebounceMs: 0` (the default) the legacy behaviour is preserved byte-for-byte.

Implementation

  • New setting `fanoutDebounceMs: 0` (default 0). Production deployments are not affected.
  • New module `src/node/handler/FanoutScheduler.ts` (~30 lines) holding the per-pad `Map<padId, Timeout>` and the public `scheduleFanout(padId, callback)`. Lives in its own file so it can be unit-tested without standing up the full pad/DB/socket.io stack — extracting it avoided a module-load cycle when the test imports `PadMessageHandler` directly.
  • `handleUserChanges` replaces `await exports.updatePadClients(pad)` with `scheduleFanout(pad.id, runFanout)`. Errors from the deferred fan-out route through `messageLogger` via `setErrorHandler`.
  • Initial-join catch-up at line ~1377 keeps the synchronous call — a joining client must not miss any revisions, no batching there.

Tests

`src/tests/backend-new/specs/fanout-debounce.test.ts` — 5 passing tests:

  • debounce=0 passthrough fires every call
  • debounce>0 coalesces N rapid calls into one
  • per-pad independence (one pad's window doesn't gate another)
  • after-flush rescheduling starts a fresh window
  • error routing through the configurable handler

Scoring this

The dive workflow doesn't have a `fanout-debounce` matrix entry yet — that's a small follow-up in ether/etherpad-load-test. For now, the lever can be scored manually:

```
gh workflow run "Scaling dive" --repo ether/etherpad-load-test \
--ref main \
-f core_ref=feat/fanout-debounce \
-f sweep='authors=20..200:step=20:dwell=10s:warmup=2s'
```

(Plus a sed step in the workflow to inject `fanoutDebounceMs: 50` — I'll send that as a separate ether/etherpad-load-test PR.)

Test plan

🤖 Generated with Claude Code

Per the scaling-dive doc (docs/scaling-dive-2026-05.md):
- etherpad_changeset_apply_duration_seconds stays under 5ms even at
  200 concurrent authors — server-side apply is not the bottleneck.
- etherpad_socket_emits_total{type=NEW_CHANGES} scales O(N^2): 1160
  emits/dwell at 20 authors -> 66032 at 200 authors.
- Cliff in baseline appears at ~250 authors (errorRate flag).

Fan-out is the dominant cost. This prototype debounces the per-pad
fan-out so rapid commits coalesce into a single updatePadClients
invocation per pad per window.

Implementation:

- New `settings.fanoutDebounceMs` (default 0 = legacy behaviour).
  When > 0, scheduleFanout queues a setTimeout per pad; further
  commits arriving inside the window don't reschedule, they just
  ride the existing timer.

- Extracted scheduling into src/node/handler/FanoutScheduler.ts
  rather than nesting it in PadMessageHandler. The wrapper has no
  pad/DB dependency so it's unit-testable in isolation.

- handleUserChanges replaces its direct
  `await exports.updatePadClients(pad)` with
  `scheduleFanout(pad.id, runFanout)`. Errors during the deferred
  fan-out are surfaced via the existing messageLogger.

- Initial-join catch-up (line ~1377) keeps the synchronous call —
  a joining client must not miss revisions, no batching there.

Tests: 5/5 in src/tests/backend-new/specs/fanout-debounce.test.ts
— covers debounce=0 passthrough, N-into-1 coalescing, per-pad
independence, post-flush rescheduling, and error routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

Review Summary by Qodo

Implement fan-out debounce to reduce socket.io emit volume

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Implement fan-out debounce scheduler to coalesce rapid commits
  - New FanoutScheduler.ts module batches per-pad broadcasts within configurable window
  - Reduces O(N²) socket.io emit volume under high concurrency
• Add fanoutDebounceMs setting (default 0 for backward compatibility)
  - When > 0, defers pad-wide updates by N milliseconds
  - Rapid commits arriving during window batch into single fan-out pass
• Integrate scheduler into PadMessageHandler.handleUserChanges
  - Replace synchronous updatePadClients with scheduleFanout call
  - Route deferred errors through existing messageLogger
• Add comprehensive test coverage for debounce behavior
  - 5 tests verify passthrough, coalescing, per-pad independence, rescheduling, error handling
Diagram
flowchart LR
  A["USER_CHANGES<br/>accepted"] --> B["scheduleFanout<br/>called"]
  B --> C{debounceMs<br/>= 0?}
  C -->|Yes| D["updatePadClients<br/>fires immediately"]
  C -->|No| E["setTimeout<br/>queued per pad"]
  F["Further commits<br/>arrive"] --> G{Timer<br/>pending?}
  G -->|Yes| H["Coalesce into<br/>existing window"]
  G -->|No| E
  E --> I["Timer fires<br/>after debounceMs"]
  H --> I
  I --> D
  D --> J["Broadcast to<br/>all clients"]
Loading

Grey Divider

File Changes

1. src/node/handler/FanoutScheduler.ts ✨ Enhancement +43/-0

New fan-out scheduler module with debounce logic

• New module implementing per-pad fan-out scheduling with debounce support
• Maintains Map<padId, Timeout> to track pending fan-outs per pad
• scheduleFanout() either fires immediately (debounce ≤ 0) or queues setTimeout
• Provides setErrorHandler() for custom error routing and _state test helper

src/node/handler/FanoutScheduler.ts


2. src/node/handler/PadMessageHandler.ts ✨ Enhancement +15/-1

Integrate fan-out scheduler into message handler

• Import scheduleFanout and setErrorHandler from new FanoutScheduler module
• Install error handler that routes fan-out failures through messageLogger
• Create runFanout helper function that retrieves pad and calls updatePadClients
• Replace direct await exports.updatePadClients(pad) with scheduleFanout(pad.id, runFanout) in
 handleUserChanges

src/node/handler/PadMessageHandler.ts


3. src/node/utils/Settings.ts ⚙️ Configuration changes +14/-0

Add fanoutDebounceMs configuration setting

• Add fanoutDebounceMs: number to SettingsType interface
• Add fanoutDebounceMs: 0 default setting with comprehensive documentation
• Document behavior: 0 = legacy synchronous fan-out, > 0 = deferred batching
• Note that setting is gated behind loadTest and scalingDiveMetrics flags

src/node/utils/Settings.ts


View more (1)
4. src/tests/backend-new/specs/fanout-debounce.test.ts 🧪 Tests +78/-0

Comprehensive test coverage for fan-out debounce

• New test suite with 5 tests covering debounce scheduler behavior
• Test debounce=0 passthrough fires synchronously per call
• Test debounce>0 coalesces N rapid calls into single fan-out
• Test per-pad independence and post-flush rescheduling
• Test error routing through setErrorHandler callback

src/tests/backend-new/specs/fanout-debounce.test.ts


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

qodo-free-for-open-source-projects Bot commented May 15, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. fanoutDebounceMs=0 no longer sync ✓ Resolved 📘 Rule violation ≡ Correctness
Description
When fanoutDebounceMs <= 0 (the disabled/default state), handleUserChanges() no longer awaits
updatePadClients(), making fan-out asynchronous so errors no longer flow through the existing
try/catch and later queued USER_CHANGES can advance while the prior fan-out is still running. This
breaks the requirement that disabling the feature flag preserves pre-change behavior and risks races
on sessioninfo.rev that can produce duplicate or out-of-order NEW_CHANGES messages to clients.
Code

src/node/handler/PadMessageHandler.ts[R950-953]

   socket.emit('message', {type: 'COLLABROOM', data: {type: 'ACCEPT_COMMIT', newRev}});
   thisSession.rev = newRev;
   if (newRev !== r) thisSession.time = await pad.getRevisionDate(newRev);
-    await exports.updatePadClients(pad);
+    scheduleFanout(pad.id, runFanout);
Evidence
Rule 6 requires new functionality to be gated behind a flag and to match legacy behavior when
disabled, but the PR changes handleUserChanges() from await exports.updatePadClients(pad) to a
non-awaited scheduleFanout(...). In the scheduler, the debounceMs <= 0 path triggers the fan-out
with a void ... async call (not awaited), so padChannels.enqueue() serialization only covers the
synchronous portion of handleUserChanges() and the queue can process subsequent commits while the
earlier updatePadClients() is still iterating sockets and mutating sessioninfo.rev, which is not
concurrency-safe; this changes ordering and error propagation even when the flag is
disabled/default.

src/node/handler/PadMessageHandler.ts[950-953]
src/node/handler/FanoutScheduler.ts[27-32]
src/node/handler/PadMessageHandler.ts[580-593]
src/node/handler/PadMessageHandler.ts[943-954]
src/node/handler/PadMessageHandler.ts[964-1021]
src/node/handler/FanoutScheduler.ts[26-32]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
When `fanoutDebounceMs <= 0` (feature-flag-disabled/default), `handleUserChanges()` must preserve the legacy semantics by awaiting fan-out completion so that ordering remains consistent, the per-pad queue doesn’t advance while `updatePadClients()` is still emitting/mutating `sessioninfo.rev`, and failures continue to be handled by the existing `try/catch`.
## Issue Context
Compliance requires new features to be disabled by default and to match pre-change behavior when disabled. `padChannels.enqueue()` serializes `handleUserChanges()` per pad only for awaited work; by switching from `await exports.updatePadClients(pad)` to `scheduleFanout(...)` (which doesn’t propagate/return an awaited Promise in the `debounceMs <= 0` path), later USER_CHANGES can be applied while a prior `updatePadClients()` is still running, risking duplicate/out-of-order `NEW_CHANGES` due to concurrent mutation/observation of `sessioninfo.rev` and changing error propagation.
## Fix Focus Areas
- src/node/handler/PadMessageHandler.ts[59-64]
- src/node/handler/PadMessageHandler.ts[943-954]
- src/node/handler/PadMessageHandler.ts[950-953]
- src/node/handler/FanoutScheduler.ts[27-32]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Overlapping fanouts per pad ✓ Resolved 🐞 Bug ☼ Reliability
Description
For fanoutDebounceMs>0, the scheduler deletes the per-pad entry before running the async fan-out
callback, so a commit that arrives while fan-out is still executing can schedule another timer and
trigger concurrent updatePadClients() runs for the same pad.
Code

src/node/handler/FanoutScheduler.ts[R33-37]

+  if (pendingFanouts.has(padId)) return;
+  const t = setTimeout(() => {
+    pendingFanouts.delete(padId);
+    void fanout(padId).catch((err) => onError(padId, err));
+  }, debounceMs);
Evidence
The scheduler drops the padId from pendingFanouts before starting async fanout(), which allows a
second schedule during execution. Concurrent fan-outs are unsafe because updatePadClients() reads
and writes sessioninfo.rev as it loops and emits.

src/node/handler/FanoutScheduler.ts[33-39]
src/node/handler/PadMessageHandler.ts[964-1021]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`FanoutScheduler.scheduleFanout()` can start multiple overlapping fan-outs for the same pad because it removes the `pendingFanouts` entry before invoking the async callback. If `updatePadClients()` is slow (high concurrency), later commits can schedule additional fan-outs that run concurrently, racing on shared mutable session state.
### Issue Context
The scheduler currently tracks only a pending timer, not an in-flight fan-out. `updatePadClients()` updates `sessioninfo.rev` while emitting; concurrent runs can interleave and duplicate/out-of-order emits.
### Fix Focus Areas
- src/node/handler/FanoutScheduler.ts[15-40]
### Suggested direction
- Track an in-flight state per pad (e.g., `Map<padId, {timeout?: Timeout, running: boolean, dirty: boolean}>`).
- If `scheduleFanout()` is called while a fan-out is running, set `dirty=true`.
- When the running fan-out completes, if `dirty` is set then schedule a new timer/window (or run immediately) and clear `dirty`.
- Do not allow two `fanout(padId)` executions concurrently for the same pad.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. fanoutDebounceMs undocumented setting ✓ Resolved 📘 Rule violation ⚙ Maintainability
Description
A new user-facing configuration option fanoutDebounceMs is added but is not documented in the
repository’s configuration documentation (for example settings.json.template). This increases
upgrade/configuration risk because operators will not discover or understand the new setting’s
behavior.
Code

src/node/utils/Settings.ts[R662-674]

+  /**
+   * Coalesce NEW_CHANGES fan-out within a debounce window of N milliseconds
+   * (#7756 lever 3). When > 0, after a USER_CHANGES is accepted the pad-wide
+   * broadcast is deferred by this many ms; further commits arriving during
+   * the window are batched into the same fan-out pass.
+   *
+   * 0 (default) = legacy behaviour: fan-out fires synchronously after every
+   * accepted commit. Increase to trade a few ms of latency for a quadratic
+   * reduction in socket.io emit volume under high concurrency. Gated behind
+   * `loadTest` AND `scalingDiveMetrics` — production deployments are not
+   * affected by default.
+   */
+  fanoutDebounceMs: 0,
Evidence
Rule 2 requires documentation updates when behavior/configuration changes are introduced. The PR
adds a new config key (fanoutDebounceMs) in Settings.ts, but the shipped settings documentation
template (settings.json.template) does not include this key near related settings such as
loadTest.

src/node/utils/Settings.ts[662-674]
settings.json.template[729-740]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new configuration option (`fanoutDebounceMs`) is introduced but is not documented alongside other settings, leaving operators without guidance.
## Issue Context
The project’s settings are commonly documented via `settings.json.template` (and optionally other `doc/` pages). This PR adds a new knob that changes fan-out behavior and should be documented with usage guidance.
## Fix Focus Areas
- src/node/utils/Settings.ts[662-674]
- settings.json.template[729-740]
- doc/docker.md[1-60]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Test leaks error handler ✓ Resolved 🐞 Bug ☼ Reliability
Description
The new unit test overrides the module-level FanoutScheduler error handler in beforeEach() but
never restores it in afterEach(), which can cause later tests to silently swallow scheduler
errors.
Code

src/tests/backend-new/specs/fanout-debounce.test.ts[R14-25]

+  beforeEach(() => {
+    vi.useFakeTimers();
+    _state.pendingFanouts.clear();
+    calls = [];
+    fanout = async (padId) => { calls.push(padId); };
+    setErrorHandler(() => {/* swallow in tests */});
+  });
+
+  afterEach(() => {
+    vi.useRealTimers();
+    settings.fanoutDebounceMs = originalDebounce;
+  });
Evidence
The test sets a global handler (setErrorHandler) but afterEach does not restore it, and the
scheduler keeps the handler in a module-level onError variable.

src/tests/backend-new/specs/fanout-debounce.test.ts[14-25]
src/node/handler/FanoutScheduler.ts[15-24]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`setErrorHandler()` changes a module-level variable, but the test suite never resets it after each test. This can pollute global state across test files.
### Issue Context
`FanoutScheduler` stores `onError` in module scope; the test sets it to a no-op.
### Fix Focus Areas
- src/tests/backend-new/specs/fanout-debounce.test.ts[14-25]
- src/node/handler/FanoutScheduler.ts[15-24]
### Suggested direction
- In the test file, capture the previous handler and restore it in `afterEach()`.
- If the previous handler is not accessible, add a `resetErrorHandler()` export (or allow `setErrorHandler(null)` to restore the default) and use it in `afterEach()`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

5. Misleading gating comment ✓ Resolved 🐞 Bug ⚙ Maintainability
Description
Settings.ts claims fanoutDebounceMs is gated behind loadTest and scalingDiveMetrics, but the
scheduler applies fanoutDebounceMs unconditionally, so the comment is incorrect and can mislead
operators/config reviewers.
Code

src/node/utils/Settings.ts[R670-672]

+   * reduction in socket.io emit volume under high concurrency. Gated behind
+   * `loadTest` AND `scalingDiveMetrics` — production deployments are not
+   * affected by default.
Evidence
The Settings comment explicitly states a gate, but scheduleFanout() computes debounceMs solely
from settings.fanoutDebounceMs and applies it regardless of other flags.

src/node/utils/Settings.ts[662-674]
src/node/handler/FanoutScheduler.ts[26-33]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The documentation for `fanoutDebounceMs` says the behavior is gated behind `loadTest` AND `scalingDiveMetrics`, but the implementation uses `settings.fanoutDebounceMs` directly with no such gating. This is a misleading contract.
### Issue Context
Defaults are safe (`0`), but the comment currently asserts a stronger constraint than the code enforces.
### Fix Focus Areas
- src/node/utils/Settings.ts[662-674]
- src/node/handler/FanoutScheduler.ts[27-33]
### Suggested direction
Choose one:
- Update the Settings comment to describe the real behavior (purely opt-in via `fanoutDebounceMs`).
- Or enforce the intended gate in code (treat as `0` unless `settings.loadTest && settings.scalingDiveMetrics`) and optionally log a warning if misconfigured.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

JohnMcLear added a commit to ether/etherpad-load-test that referenced this pull request May 15, 2026
- Adds fanout-debounce matrix entry. Sets settings.fanoutDebounceMs
  to 50ms via sed-append; requires the core_ref to include
  ether/etherpad#7766 (the FanoutScheduler). Fails fast with a
  clear error if the setting isn't picked up.

- Bumps job timeout 30 -> 60 minutes. Cliff-finding sweeps at
  authors=100..500 take long enough that nodemem and websocket-only
  were getting cancelled mid-sweep at 30 min on run 25937943898.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/node/handler/PadMessageHandler.ts Outdated
Comment thread src/node/handler/FanoutScheduler.ts Outdated
Five issues raised on the initial PR:

1. **Disabled state regressed to async.** Caller went through the
   scheduler even when debounceMs<=0, dropping the legacy
   `await updatePadClients(...)` semantics — errors no longer
   flowed through handleUserChanges' try/catch. Fix: only call
   scheduleFanout when debounceMs>0; otherwise the existing
   synchronous await path runs unchanged.

2. **Overlapping fan-outs per pad.** The scheduler deleted the
   pad's entry before the async callback resolved, so a schedule
   arriving mid-fanout could spawn a concurrent updatePadClients.
   Now tracked via running/dirty state: schedules during a run
   set `dirty`, and the running fan-out re-arms itself in its
   finally block to catch up. Late commits aren't missed.

3. **fanoutDebounceMs undocumented.** Added to
   settings.json.template alongside loadTest with the same prose
   as Settings.ts.

4. **Test leaked error handler.** beforeEach overrode the module-
   level handler but afterEach didn't restore it, polluting later
   test files. Added a resetErrorHandler() export and an afterEach
   that calls it.

5. **Misleading gating comment.** Settings.ts claimed the debounce
   was gated behind loadTest + scalingDiveMetrics, but the
   scheduler applied it unconditionally. Updated comment to
   reflect actual behaviour.

10/10 tests green (5 fanout + 5 prom-instruments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear
Copy link
Copy Markdown
Member Author

Closing. Scored against the dive workflow on run 25940112728 — the lever does not meaningfully reduce emit volume.

Why: this PR debounces calls to `updatePadClients` but the per-socket `while (sessioninfo.rev < head)` loop still fires one emit per revision per socket regardless of how many revs are pending. So a debounced call fans out more revs per socket but the total emit count stays the same. Comparison at 200 authors:

p95 emits
baseline 22 ms 66 006
this PR (fanoutDebounceMs=50) 20 ms 66 014

The real fix is a NEW_CHANGES_BATCH message type that carries an array of revisions in a single emit per socket. That requires client-side handling and is a bigger change — to be opened as its own PR.

`FanoutScheduler.ts` and the test scaffolding are still useful as primitives but I don't want to ship a setting that operators would set without benefit. Closing without merging.

@JohnMcLear JohnMcLear closed this May 15, 2026
JohnMcLear added a commit to ether/etherpad-load-test that referenced this pull request May 15, 2026
Scored against run 25940112728: this lever does not reduce emit
volume (the per-socket while-loop still emits one message per
revision per socket regardless of how many revs are batched into
one updatePadClients call). ether/etherpad#7766 closed.

Keep the 60-min timeout — that fix stands.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnMcLear added a commit that referenced this pull request May 16, 2026
Captures everything learned since the first draft:

- The "250-author cliff" was a measurement artefact from per-IP
  commitRateLimiting + colocated harness. Fixed via the
  etherpad-load-test#105 workflow patch. Real ceiling is ~350-400
  authors on a 4-vCPU GitHub runner.

- apply_mean ballooning at the cliff isn't slow code — it's OS
  preemption (7+ cores of work on 4 vCPU). Application-level JS
  rearrangement can't reach it.

- Two changes hold up under the dive: fan-out serialization
  + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and
  historicalAuthorData cache (#7769, neutral on dive but real
  production thundering-herd fix at join time).

- Four directions didn't pan out: WebSocket-only transport, heap
  bump, message-level batching alone (#7766 closed), and
  rebase-loop prefetch (#7770 closed). Each has a one-line cause
  documented for the record.

- Engine.io transport-level packing (#7767) is the meatiest
  untouched lever — sending multiple packets per WebSocket frame
  the way polling already does via encodePayload.

Qodo-flagged corrections incorporated:
1. The new instruments are Histogram + Counter + Gauge, not
   "three counters" — labelled correctly.
2. The lever-3 line reference now points at updatePadClients
   (lines 985-999) where NEW_CHANGES actually emits, not the
   wrong line 627 (handleSaveRevisionMessage).
3. Lever 3's results are written up against measured data, not
   "deferred".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant