docs: scaling dive 2026-05 (closes Phase 2 of #7756) #7765
Conversation
Phase 2 deliverable from the scaling-dive program. Documents:

- Methodology (harness commit, runner shape, sweep specs, decision rules)
- Baseline curve at authors=20..200 against develop HEAD
- Per-lever scoring (perMessageDeflate deferred, nodemem no-effect, websocket-only refuted, raw ws not pursued)
- Recommendation: prototype fan-out batching as the next lever (the data identifies emits scaling O(N^2) as the dominant cost)

Closes Phase 2 of #7756. Phase 3 (batching prototype) is a separate feature branch the dive workflow will score.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review Summary by Qodo

Scaling dive 2026-05: Phase 2 analysis with fan-out bottleneck findings
Walkthroughs

Description

- Documents Phase 2 scaling analysis with empirical measurements
- Identifies fan-out as dominant bottleneck, not server-side apply
- Refutes websocket-only hypothesis; polling fallback beneficial
- Recommends fan-out batching as next concrete optimization lever

Diagram

```mermaid
flowchart LR
baseline["Baseline measurements<br/>20-200 authors"]
analysis["Decision rule analysis<br/>CPU/network bound"]
levers["Per-lever evaluation<br/>4 configurations tested"]
findings["Key findings:<br/>Fan-out O(N²) cost<br/>Websocket-only refuted"]
recommendation["Recommendation:<br/>Prototype batching<br/>Verify perMessageDeflate"]
baseline --> analysis
analysis --> levers
levers --> findings
findings --> recommendation
```

File Changes

1. docs/scaling-dive-2026-05.md
Code Review by Qodo
Context used

1. Fan-out batching lacks results
> ## Lever 3 — fan-out batching
>
> **Deferred.** Requires a code change in `PadMessageHandler.ts` (specifically the per-socket loop in `updatePadClients` and/or the broadcast emit at line 627). Recommended as the next concrete code change. The harness is ready to score it as soon as a candidate branch exists — point the workflow's `core_ref` input at the branch.
>
> The `emits_new_changes` column on the curve table above is the direct measurement target. At 200 authors we're producing 66 032 emits per 10 s dwell. Halving the emit rate (by coalescing two changesets per emit on a sub-50 ms window) would directly reduce CPU.
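The coalescing idea in that excerpt, as a minimal sketch; the names (`queueNewChanges`, `WINDOW_MS`, the `emitBatch` callback) are illustrative assumptions, not Etherpad's actual API:

```ts
// Minimal sketch of the sub-50 ms coalescing window described in the excerpt above.
type Batch = {changesets: string[]; timer: ReturnType<typeof setTimeout> | null};

const pending = new Map<string, Batch>(); // padId -> changesets waiting to be emitted
const WINDOW_MS = 50;

const queueNewChanges = (
  padId: string,
  changeset: string,
  emitBatch: (changesets: string[]) => void,
): void => {
  let batch = pending.get(padId);
  if (batch == null) {
    batch = {changesets: [], timer: null};
    pending.set(padId, batch);
  }
  const b = batch;
  b.changesets.push(changeset);
  if (b.timer == null) {
    // The first changeset in the window arms the timer; later ones ride along in the same
    // emit, so two changesets inside 50 ms cost one fan-out instead of two.
    b.timer = setTimeout(() => {
      pending.delete(padId);
      emitBatch(b.changesets);
    }, WINDOW_MS);
  }
};
```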
1. Fan-out batching lacks results 📎 Requirement gap ➹ Performance
The doc recommends fan-out batching but explicitly defers running/recording any batching experiment, so there are no load-test findings on message batching/buffering impact. This fails the requirement to validate and report findings for batching/message-size reduction under load.
Agent Prompt
## Issue description
The PR proposes/recommends fan-out batching but provides no implemented or documented batching/buffering experiment and no load-test findings for it.
## Issue Context
PR Compliance ID 4 requires an implementation or documented experiment for batching/buffering (or other message-size reduction) plus reported load-test findings demonstrating the effect.
## Fix Focus Areas
- docs/scaling-dive-2026-05.md[77-82]
- docs/scaling-dive-2026-05.md[108-109]
Captures everything learned since the first draft:

- The "250-author cliff" was a measurement artefact from per-IP commitRateLimiting + colocated harness. Fixed via the etherpad-load-test#105 workflow patch. Real ceiling is ~350-400 authors on a 4-vCPU GitHub runner.
- apply_mean ballooning at the cliff isn't slow code — it's OS preemption (7+ cores of work on 4 vCPU). Application-level JS rearrangement can't reach it.
- Two changes hold up under the dive: fan-out serialization + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and historicalAuthorData cache (#7769, neutral on dive but real production thundering-herd fix at join time).
- Four directions didn't pan out: WebSocket-only transport, heap bump, message-level batching alone (#7766 closed), and rebase-loop prefetch (#7770 closed). Each has a one-line cause documented for the record.
- Engine.io transport-level packing (#7767) is the meatiest untouched lever — sending multiple packets per WebSocket frame the way polling already does via encodePayload.

Qodo-flagged corrections incorporated:

1. The new instruments are Histogram + Counter + Gauge, not "three counters" — labelled correctly.
2. The lever-3 line reference now points at updatePadClients (lines 985-999) where NEW_CHANGES actually emits, not the wrong line 627 (handleSaveRevisionMessage).
3. Lever 3's results are written up against measured data, not "deferred".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
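For the first of those two changes, a hedged sketch of what "fan-out serialization" means mechanically; this is illustrative, not the code in #7768:

```ts
// Keep one promise chain per pad so a new NEW_CHANGES fan-out starts only after the previous
// one has finished, instead of overlapping with it and contending for CPU. Names are invented.
const fanOutChains = new Map<string, Promise<void>>();

const serializedFanOut = (padId: string, broadcast: () => Promise<void>): Promise<void> => {
  const prev = fanOutChains.get(padId) ?? Promise.resolve();
  // A rejected broadcast must not wedge the chain, or later revisions would never be sent.
  const next = prev.catch(() => {}).then(broadcast);
  fanOutChains.set(padId, next);
  return next;
};
```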
Addressing Qodo's three items (commit …).

Bonus rewrite: the doc now reflects everything we learned after the original draft — the rate-limit artefact, the real ~350-400 author ceiling on a 4-vCPU runner, the OS-preemption diagnosis of apply_mean ballooning at the cliff, and the negative results on websocket-only and the closed levers.
Three identical sweeps against develop quantify the runner-noise envelope. Same workload, same code, same workflow → p95 at step 350 ranged 39-122 ms on baseline (3.1x spread). At step 300, 1.9x spread.

What this means for prior conclusions in this doc:

- websocket-only-is-worst HOLDS at the cliff: its envelope min (2463) equals baseline's max (2463), envelopes don't overlap. The single contradicting run was an outlier.
- lever-3 (#7768) "70% p95 drop at 200" was a single-run outlier comparison. The real reliable improvement is ~5-15% median p95 plus much tighter consistency (fewer tail-latency excursions). The mechanism — per-socket serialization preventing overlapping fan-outs that contend for CPU — is still real and still worth merging; the headline number was inflated.
- Below the cliff, all four levers' noise envelopes overlap. No clear winner.

Going forward: lever scoring should default to N >= 3 trials and report min/median/max, not single-run point estimates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
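A small sketch of that reporting rule: collapse per-run p95s into a min/median/max envelope per sweep step. The field names are assumed, not the harness's actual `report.json` schema:

```ts
// Assumed shape: each run's report reduces to an array of {step, p95Ms} entries.
type RunSummary = {step: number; p95Ms: number}[];

const envelope = (runs: RunSummary[]) => {
  const byStep = new Map<number, number[]>();
  for (const run of runs) {
    for (const {step, p95Ms} of run) {
      const values = byStep.get(step) ?? [];
      values.push(p95Ms);
      byStep.set(step, values);
    }
  }
  // Report min/median/max per step rather than a single-run point estimate.
  return [...byStep.entries()].map(([step, values]) => {
    const sorted = [...values].sort((a, b) => a - b);
    return {
      step,
      min: sorted[0],
      median: sorted[Math.floor(sorted.length / 2)],
      max: sorted[sorted.length - 1],
    };
  });
};
```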
N=3 scoring of feat/cache-historical-author-data shows it's net-negative above 300 authors (step 350 p95 envelope 301/488/633 ms vs develop baseline 39/39/122 ms). Two compounding issues:

- The motivating hypothesis (250-cliff is a join thundering herd) was falsified — that cliff was the per-IP rate-limit artefact.
- The defensive shallow-clone-on-every-get() added in the Qodo fix walks O(N) author entries per join, costing more than the inline Promise.all it replaced.

Updated recommendations: lever 3 (#7768) is now the only PR worth merging. Lever 6 (#7769) added to the do-not-merge list with honest data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
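The second issue in one line (illustrative shape only, not the PR's actual code):

```ts
// A defensive shallow copy on every read means each CLIENT_READY walks every cached author
// entry, so the "cache" costs O(N) per join, more than the inline Promise.all it replaced.
const getHistoricalAuthorData = (cache: Record<string, unknown>) => ({...cache});
```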
Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-like matrix entries (develop-baseline vs PR-baseline), the per-socket serialization is slightly net-negative across the curve. My earlier "70% drop" was a single-run outlier; the subsequent "tighter envelope" was a cross-matrix-entry comparison confounded by noise. The serialization is still a real correctness fix (race on concurrent fan-outs + lost revisions on emit error), so the PR stays open, but the recommendation is now correctness-only.

2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to the closed lever 8 that actually patches Socket.sendPacket instead of just transport.send. A queueMicrotask-coalesced flush gives the transport multi-packet batches to work with at last. N=3 shows a tighter tail at steps 300-350 (122 → 110 ms max at 350, 71 → 58 ms max at 300). Not a cliff-mover. The only PR in this program with an N=3-confirmed perf benefit.

Final disposition:

- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU runner, not code-bound. Production with more cores per host scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
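A hedged sketch of the lever-8b mechanism: defer the flush to a microtask so every packet queued in the same tick goes out in one transport write. The class loosely mirrors engine.io's shape; this is not the diff in #7774:

```ts
// Illustrative only. sendPacket just queues; the flush runs once per tick.
class CoalescingSocket {
  private writeBuffer: unknown[] = [];
  private flushScheduled = false;

  constructor(private readonly transportSend: (packets: unknown[]) => void) {}

  sendPacket(packet: unknown): void {
    this.writeBuffer.push(packet);
    if (this.flushScheduled) return;
    this.flushScheduled = true;
    queueMicrotask(() => {
      // Every sendPacket call made during this tick lands in one multi-packet write, which
      // finally gives the transport batches to pack (as polling already gets via encodePayload).
      this.flushScheduled = false;
      const batch = this.writeBuffer;
      this.writeBuffer = [];
      this.transportSend(batch);
    });
  }
}
```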
CPU profile of develop at the 100-400 author dive sweep (load-test run 25956384097) identified a ~6% process-CPU win in SessionManager: throw-as-control-flow on every CLIENT_READY session lookup. Add a lever 9 section with the profile evidence, link the open PR (#7775), and add an "Other CPU hotspots surfaced" subsection documenting findings not yet acted on (Changeset internals, appendRevision, ueberdb/dirty backing as a test-harness artifact, esbuild __name overhead). Update Recommendation to include #7775 as the highest-priority merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
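The lever-9 pattern, reduced to an illustrative before/after (not SessionManager's actual code); the point is that a hot-path "not found" should return null rather than constructing and unwinding an Error on every CLIENT_READY:

```ts
// Before (hypothetical shape): the lookup throws when the session is absent, so the common
// path pays for Error construction and stack capture as control flow.
const getSessionOrThrow = (sessions: Map<string, object>, id: string): object => {
  const session = sessions.get(id);
  if (session == null) throw new Error(`session ${id} does not exist`);
  return session;
};

// After: the hot path checks for null; throwing stays reserved for genuinely exceptional cases.
const getSessionOrNull = (sessions: Map<string, object>, id: string): object | null =>
  sessions.get(id) ?? null;
```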
Replace the "score pending" placeholder under lever 9 with the actual numbers from runs 25957107195/25957108328/25957109418 (perf branch) vs 25954537767/25954538807/25954540108 (develop), both at authors=100..500:step=50:dwell=8s:warmup=2s. Result: consistent -1.4% to -5.3% CPU reduction across all 9 steps, matching profile direction at 2-5% (vs 6% profile-attributed upper bound). Latency delta sits inside the noise envelope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three combined-branch runs (perf/dive-combined = #7776 cherry-picked onto the #7775 base; runs 25960003164/25960004223/25960005248) vs the same three develop baselines: -12% to -20% CPU% across all 9 sweep steps, with the p95 cliff effectively moving from ~400 to ~500 authors (at step 400, two of three combined runs land below the cliff at 45 ms and 112 ms p95 vs develop's [1758, 2275, 2463]).

Adds:

- Lever 10 section for #7776 with its own N=3 numbers (-3.6 to -8.9% alone).
- "Stacking" section showing super-additive interaction.
- Local vCPU experiment showing the cliff is single-event-loop-bound, not total-CPU-bound: 4-core and 8-core pinned SUTs hit the same cliff at the same step.
- Updated TL;DR, Recommendation order (merge both #7775+#7776 first), and "Where to take this next" with worker-thread offload as the smallest next architectural step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-#7775/#7776 profile shows applyToAText splits cleanly:

- applyToText (Changeset.ts:404) is pure (cs, text) -> text; trivially offloadable to a worker via worker_threads structured-clone postMessage.
- applyToAttribution (Changeset.ts:684) mutates AttributePool; not trivially offloadable.

Document the obvious first-pass design (run them in parallel via Promise.all inside applyToAText, as sketched below) and the realistic estimate (~6-8% CPU moved off the main event loop). putAttrib is only 0.26% in the post-fix profile, confirming the bulk of applyToAText's cost is in the string-manipulation half.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
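A sketch of that first-pass design under stated assumptions: the worker plumbing and file name are invented for illustration, only `applyToText`/`applyToAttribution` come from the doc, and the next comment records that the microbenchmark falsified this exact approach.

```ts
// Hypothetical wiring, not Etherpad code: dispatch the pure (cs, text) -> text half to a
// worker_thread via structured-clone postMessage, keep applyToAttribution on the main thread,
// and join the two with Promise.all inside applyToAText.
import {Worker} from 'node:worker_threads';

const worker = new Worker(new URL('./applyToTextWorker.js', import.meta.url)); // assumed file
let nextId = 0;
const inflight = new Map<number, (text: string) => void>();
worker.on('message', ({id, text}: {id: number; text: string}) => {
  inflight.get(id)?.(text);
  inflight.delete(id);
});

const applyToTextInWorker = (cs: string, text: string): Promise<string> =>
  new Promise((resolve) => {
    const id = nextId++;
    inflight.set(id, resolve);
    worker.postMessage({id, cs, text}); // two strings structured-cloned per call
  });

// Inside applyToAText (sketch only):
//   const [newText, newAttribs] = await Promise.all([
//     applyToTextInWorker(cs, atext.text),
//     Promise.resolve(applyToAttribution(cs, atext.attribs, pool)),
//   ]);
```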
Microbenchmark on branch experiment/worker-thread-applytotext shows that dispatching applyToText to a worker is net-NEGATIVE at every realistic pad size (+11% to +326% overhead). The string postMessage serialization cost exceeds the per-call applyToText work for our workload (typical pads are 1-10 KB, sync calls take 17-86 µs, dispatch overhead is 40-90 µs). Replace the earlier "Concrete first-pass design" recommendation (which assumed worker offload would win) with the actual numbers, and reframe the architectural next step as per-pad worker isolation (handoff serialization paid once at pad ownership transfer rather than per changeset).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
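For the record, the comparison is the obvious one: time N synchronous calls against N round-trips through a worker wrapper like the sketch above. This is the assumed shape of the harness, not the actual experiment/worker-thread-applytotext code:

```ts
// Average per-call latency in microseconds of a (possibly async) operation over n iterations.
const perCallMicros = async (n: number, fn: () => unknown | Promise<unknown>): Promise<number> => {
  const t0 = process.hrtime.bigint();
  for (let i = 0; i < n; i++) await fn();
  return Number(process.hrtime.bigint() - t0) / 1000 / n;
};

// Usage sketch: on 1-10 KB pads the sync call sits in the tens of microseconds, so the
// 40-90 µs dispatch/serialization overhead dominates and the worker path loses.
//   const syncUs = await perCallMicros(1000, () => Changeset.applyToText(cs, text));
//   const workerUs = await perCallMicros(1000, () => applyToTextInWorker(cs, text));
```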
… after worker-thread falsification
Add a "Roadmap for future effort" section ahead of Reproducing, ranking the next concrete options by impact-per-time-spent. Tier 1 (mechanical / <1 day each): - merge ready perf PRs (#7775+#7776+#7774) - implement #7780 room-broadcast fan-out - additional post-fix profile pass Tier 2 (medium, real cliff moves): - selective fan-out / viewport-based broadcast (~2 weeks; cliff ~500 → 1000-1500) - per-pad worker isolation PoC (~1-2 weeks PoC, 1-2 months prod) Tier 3 (large bets): - sticky-session cluster mode (~2-4 weeks PoC) - CRDT migration (months; anti-recommended) Tier 4 (operational): - production telemetry hookup (~3-5 days) - nightly dive in CI (~1 day) Records the recommended sequence (Tier 1.2 → Tier 2.4) so the next person picking this up doesn't need to re-derive it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRDT migration's blast radius is wider than "plugins" — the changeset wire format is the lingua franca for the Electron/Capacitor desktop app, the mobile app, etherpad-cli, MCP servers, and server plugins. A CRDT switch means parallel reimplementation in every consumer, not just core. Upgrade the wording from "anti-recommended" to "strongly anti-recommended" with the actual list of affected projects. Cluster mode, in contrast, is mostly transparent to wire-protocol consumers; the only deployment shape that needs care is in-process embeds (Electron/Capacitor bundle a single Node process for a single user — they can skip cluster mode entirely).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…6b (cluster)

My original Tier 3 option 6 conflated two distinct shapes the operator audience would treat separately:

- 6a (reverse-proxy pad sharding): the L7-proxy answer the operator ecosystem already runs in production. No core changes; cost is deployment. Solves "more pads across many boxes".
- 6b (Node cluster module with sticky padId routing): single host, multi-worker. Solves "more pads per box". Pick this or Tier 2 option 5 (worker_threads), not both — same problem shape, different isolation boundary.

Both are transparent to wire-protocol clients (desktop, mobile, terminal/CLI, MCP) — same as the original note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
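A minimal sketch of shape 6b under assumptions: the sticky rule (hash the padId onto a worker) is the substance; the rest of the wiring is illustrative, not a concrete plan:

```ts
// Illustrative only. All editors of a given pad must land on the same worker so the pad keeps
// a single event loop; a front proxy (or the primary) routes sockets by workerForPad(padId).
import cluster from 'node:cluster';
import {createHash} from 'node:crypto';
import os from 'node:os';

const WORKERS = os.availableParallelism();

const workerForPad = (padId: string): number =>
  createHash('sha1').update(padId).digest().readUInt32BE(0) % WORKERS;

if (cluster.isPrimary) {
  for (let i = 0; i < WORKERS; i++) cluster.fork({WORKER_INDEX: String(i)});
} else {
  // Each worker runs an ordinary Etherpad instance; wire-protocol clients (desktop, mobile,
  // CLI, MCP) see no difference; only the deployment shape changes.
}
```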
Summary
Numbers-backed answer to #7756: "how many editors can be on one pad, and where is the bottleneck?" Every claim links to a CI run whose `report.json` is downloadable for re-analysis.

This PR description is intentionally brief; the doc itself (`docs/scaling-dive-2026-05.md`) is the source of truth and contains the full methodology, every lever scored, the negative results, and the honest re-evaluations.

Headline findings
Real ceiling on a 4-vCPU GitHub runner: ~350-400 authors per pad. Beyond that, p95 cliffs from ~80 ms to ~2000 ms within one step, and the CPU demand is 7+ cores of work per wall-second — OS preemption, not slow code paths.
The "250-author cliff" earlier reports kept hitting was a measurement artefact. Per-IP
commitRateLimiting+ harness colocation (all simulated authors share127.0.0.1). Fixed in load-test#105. Production deployments with many client IPs are not affected.Runner noise is significant. Each GHA matrix entry runs on a potentially different physical runner. The same
developbaseline produced p95 between 38 ms and 122 ms at step 350 across three identical runs. All single-run lever conclusions sit inside the noise envelope below the cliff. The doc adopts N=3-per-lever methodology and reports min/median/max.Two changes hold up under N=3 scoring:
Five directions did not pan out, documented in full:
--max-old-space-sizeheap bump — no effect.The cliff is hardware-bound, not code-bound, on the runner we measure. Production with more cores per host sees proportionally higher ceilings without code changes. Remaining application-level surface explored end-to-end.
What changed since the first draft of this doc
Test plan
This is documentation only — no code changes. Reviewers should verify:
`report.json` files.

🤖 Generated with Claude Code