feat(scaling): fan-out debounce (#7756 lever 3 prototype)#7766
feat(scaling): fan-out debounce (#7756 lever 3 prototype)#7766JohnMcLear wants to merge 2 commits into
Conversation
Per the scaling-dive doc (docs/scaling-dive-2026-05.md):
- etherpad_changeset_apply_duration_seconds stays under 5ms even at
200 concurrent authors — server-side apply is not the bottleneck.
- etherpad_socket_emits_total{type=NEW_CHANGES} scales O(N^2): 1160
emits/dwell at 20 authors -> 66032 at 200 authors.
- Cliff in baseline appears at ~250 authors (errorRate flag).
Fan-out is the dominant cost. This prototype debounces the per-pad
fan-out so rapid commits coalesce into a single updatePadClients
invocation per pad per window.
Implementation:
- New `settings.fanoutDebounceMs` (default 0 = legacy behaviour).
When > 0, scheduleFanout queues a setTimeout per pad; further
commits arriving inside the window don't reschedule, they just
ride the existing timer.
- Extracted scheduling into src/node/handler/FanoutScheduler.ts
rather than nesting it in PadMessageHandler. The wrapper has no
pad/DB dependency so it's unit-testable in isolation.
- handleUserChanges replaces its direct
`await exports.updatePadClients(pad)` with
`scheduleFanout(pad.id, runFanout)`. Errors during the deferred
fan-out are surfaced via the existing messageLogger.
- Initial-join catch-up (line ~1377) keeps the synchronous call —
a joining client must not miss revisions, no batching there.
Tests: 5/5 in src/tests/backend-new/specs/fanout-debounce.test.ts
— covers debounce=0 passthrough, N-into-1 coalescing, per-pad
independence, post-flush rescheduling, and error routing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Review Summary by QodoImplement fan-out debounce to reduce socket.io emit volume
WalkthroughsDescription• Implement fan-out debounce scheduler to coalesce rapid commits - New FanoutScheduler.ts module batches per-pad broadcasts within configurable window - Reduces O(N²) socket.io emit volume under high concurrency • Add fanoutDebounceMs setting (default 0 for backward compatibility) - When > 0, defers pad-wide updates by N milliseconds - Rapid commits arriving during window batch into single fan-out pass • Integrate scheduler into PadMessageHandler.handleUserChanges - Replace synchronous updatePadClients with scheduleFanout call - Route deferred errors through existing messageLogger • Add comprehensive test coverage for debounce behavior - 5 tests verify passthrough, coalescing, per-pad independence, rescheduling, error handling Diagramflowchart LR
A["USER_CHANGES<br/>accepted"] --> B["scheduleFanout<br/>called"]
B --> C{debounceMs<br/>= 0?}
C -->|Yes| D["updatePadClients<br/>fires immediately"]
C -->|No| E["setTimeout<br/>queued per pad"]
F["Further commits<br/>arrive"] --> G{Timer<br/>pending?}
G -->|Yes| H["Coalesce into<br/>existing window"]
G -->|No| E
E --> I["Timer fires<br/>after debounceMs"]
H --> I
I --> D
D --> J["Broadcast to<br/>all clients"]
File Changes1. src/node/handler/FanoutScheduler.ts
|
Code Review by Qodo
1.
|
- Adds fanout-debounce matrix entry. Sets settings.fanoutDebounceMs to 50ms via sed-append; requires the core_ref to include ether/etherpad#7766 (the FanoutScheduler). Fails fast with a clear error if the setting isn't picked up. - Bumps job timeout 30 -> 60 minutes. Cliff-finding sweeps at authors=100..500 take long enough that nodemem and websocket-only were getting cancelled mid-sweep at 30 min on run 25937943898. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five issues raised on the initial PR: 1. **Disabled state regressed to async.** Caller went through the scheduler even when debounceMs<=0, dropping the legacy `await updatePadClients(...)` semantics — errors no longer flowed through handleUserChanges' try/catch. Fix: only call scheduleFanout when debounceMs>0; otherwise the existing synchronous await path runs unchanged. 2. **Overlapping fan-outs per pad.** The scheduler deleted the pad's entry before the async callback resolved, so a schedule arriving mid-fanout could spawn a concurrent updatePadClients. Now tracked via running/dirty state: schedules during a run set `dirty`, and the running fan-out re-arms itself in its finally block to catch up. Late commits aren't missed. 3. **fanoutDebounceMs undocumented.** Added to settings.json.template alongside loadTest with the same prose as Settings.ts. 4. **Test leaked error handler.** beforeEach overrode the module- level handler but afterEach didn't restore it, polluting later test files. Added a resetErrorHandler() export and an afterEach that calls it. 5. **Misleading gating comment.** Settings.ts claimed the debounce was gated behind loadTest + scalingDiveMetrics, but the scheduler applied it unconditionally. Updated comment to reflect actual behaviour. 10/10 tests green (5 fanout + 5 prom-instruments). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Closing. Scored against the dive workflow on run 25940112728 — the lever does not meaningfully reduce emit volume. Why: this PR debounces calls to `updatePadClients` but the per-socket `while (sessioninfo.rev < head)` loop still fires one emit per revision per socket regardless of how many revs are pending. So a debounced call fans out more revs per socket but the total emit count stays the same. Comparison at 200 authors:
The real fix is a NEW_CHANGES_BATCH message type that carries an array of revisions in a single emit per socket. That requires client-side handling and is a bigger change — to be opened as its own PR. `FanoutScheduler.ts` and the test scaffolding are still useful as primitives but I don't want to ship a setting that operators would set without benefit. Closing without merging. |
Scored against run 25940112728: this lever does not reduce emit volume (the per-socket while-loop still emits one message per revision per socket regardless of how many revs are batched into one updatePadClients call). ether/etherpad#7766 closed. Keep the 60-min timeout — that fix stands. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures everything learned since the first draft: - The "250-author cliff" was a measurement artefact from per-IP commitRateLimiting + colocated harness. Fixed via the etherpad-load-test#105 workflow patch. Real ceiling is ~350-400 authors on a 4-vCPU GitHub runner. - apply_mean ballooning at the cliff isn't slow code — it's OS preemption (7+ cores of work on 4 vCPU). Application-level JS rearrangement can't reach it. - Two changes hold up under the dive: fan-out serialization + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and historicalAuthorData cache (#7769, neutral on dive but real production thundering-herd fix at join time). - Four directions didn't pan out: WebSocket-only transport, heap bump, message-level batching alone (#7766 closed), and rebase-loop prefetch (#7770 closed). Each has a one-line cause documented for the record. - Engine.io transport-level packing (#7767) is the meatiest untouched lever — sending multiple packets per WebSocket frame the way polling already does via encodePayload. Qodo-flagged corrections incorporated: 1. The new instruments are Histogram + Counter + Gauge, not "three counters" — labelled correctly. 2. The lever-3 line reference now points at updatePadClients (lines 985-999) where NEW_CHANGES actually emits, not the wrong line 627 (handleSaveRevisionMessage). 3. Lever 3's results are written up against measured data, not "deferred". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The scaling-dive doc (#7765) identifies fan-out as the dominant cost at high concurrency:
This is the prototype for spec lever 3 — coalesce per-pad fan-out within a debounce window. With `fanoutDebounceMs: 50` (or any value > 0), rapid commits batch into a single `updatePadClients` invocation per pad per window. With `fanoutDebounceMs: 0` (the default) the legacy behaviour is preserved byte-for-byte.
Implementation
Tests
`src/tests/backend-new/specs/fanout-debounce.test.ts` — 5 passing tests:
Scoring this
The dive workflow doesn't have a `fanout-debounce` matrix entry yet — that's a small follow-up in ether/etherpad-load-test. For now, the lever can be scored manually:
```
gh workflow run "Scaling dive" --repo ether/etherpad-load-test \
--ref main \
-f core_ref=feat/fanout-debounce \
-f sweep='authors=20..200:step=20:dwell=10s:warmup=2s'
```
(Plus a sed step in the workflow to inject `fanoutDebounceMs: 50` — I'll send that as a separate ether/etherpad-load-test PR.)
Test plan
🤖 Generated with Claude Code