feat(pad): cache historicalAuthorData (remove join-time thundering herd) #7769

JohnMcLear wants to merge 2 commits into
Conversation
The connect-handshake cliff at ~250 authors isn't a fan-out problem (see #7768); it's a join hot-path problem. Each CLIENT_READY did `Promise.all(pad.getAllAuthors().map(authorManager.getAuthor))`: O(N) ueberdb cache lookups plus async/await scheduling per joiner. At 200 existing authors and 50 simultaneous joiners that's 10 000 lookups racing against the existing authors' USER_CHANGES traffic for event-loop time. The new joiners' CLIENT_VARS responses then exceed the harness's 10s connect timeout and the cliff appears.

This PR adds a per-pad HistoricalAuthorDataCache (TTL 5s, defaults overridable for tests) that collapses simultaneous joiners onto a single shared computation. 50 joiners now share 1 fetch per author instead of doing 50 independent N-author fans. A sketch of the intended shape follows below.

Implementation:

- `src/node/db/HistoricalAuthorDataCache.ts`: pure helper, no DB or Pad dependency; takes the author-id list and the per-id fetcher as constructor args. Lives in its own module so the test can exercise the real production code without standing up the full pad / DB stack (importing Pad transitively pulls DB).
- `Pad.getHistoricalAuthorData()`: lazy-inits a cache instance per pad, fetches via `authorManager.getAuthor`.
- `PadMessageHandler.handleClientReady`: replaces the inline `Promise.all(authors.map(authorManager.getAuthor))` with a single `await pad.getHistoricalAuthorData()` call.

Tests (6/6) pin:

- correct `{name, colorId}` per author
- N simultaneous `get()` calls → 1 fetch per author (the actual win)
- TTL refetch
- falsy fetch results omitted
- explicit `invalidate()`
- a failed fetch leaves the cache clearable; the next call retries

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
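For orientation, a minimal sketch of the coalescing shape described above (the class and member names beyond the two constructor arguments are assumptions, not the PR's exact code; TTL and invalidation are omitted for brevity):

```ts
// Sketch only: assumed names, not the PR's exact code. The key property is
// that every caller arriving during a computation awaits the same promise,
// so 50 simultaneous joiners cost 1 fetch per author.
type AuthorRecord = {name?: string; colorId?: number};

class HistoricalAuthorDataCacheSketch {
  private inFlight: Promise<{[id: string]: AuthorRecord}> | null = null;

  constructor(
    private readonly getAuthorIds: () => string[],           // author-id list
    private readonly fetchAuthor:
        (id: string) => Promise<AuthorRecord | null>,        // per-id fetcher
  ) {}

  async get(): Promise<{[id: string]: AuthorRecord}> {
    if (this.inFlight == null) this.inFlight = this.compute();
    return this.inFlight;
  }

  private async compute(): Promise<{[id: string]: AuthorRecord}> {
    const ids = this.getAuthorIds();
    const records = await Promise.all(ids.map((id) => this.fetchAuthor(id)));
    const data: {[id: string]: AuthorRecord} = {};
    ids.forEach((id, i) => {
      const record = records[i];
      if (record) data[id] = record;  // falsy fetch results are omitted
    });
    return data;
  }
}
```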
Review Summary by Qodo

Per-pad cache for historical author data to fix the join cliff
Walkthrough

Description

- Adds a per-pad cache for historical author data to eliminate O(N²) lookups during simultaneous client joins
- Collapses 50 simultaneous joiners' author fetches into a single shared computation
- Reduces the connect-handshake cliff at ~250 authors by caching with a 5-second TTL
- Replaces the inline Promise.all loop in handleClientReady with a cached getHistoricalAuthorData call

Diagram

```mermaid
flowchart LR
  A["50 Simultaneous<br/>CLIENT_READY"] -->|Before: 50×N<br/>ueberdb lookups| B["Event Loop<br/>Contention"]
  B --> C["10s Timeout<br/>Exceeded"]
  A -->|After: 1 shared<br/>computation| D["HistoricalAuthorDataCache<br/>TTL 5s"]
  D --> E["Shared Author<br/>Data Map"]
  E --> F["Fast CLIENT_VARS<br/>Response"]
```
File Changes

1. src/node/db/HistoricalAuthorDataCache.ts
Code Review by Qodo
```ts
async get(): Promise<{[id: string]: AuthorRecord}> {
  const now = this.now();
  const cached = this.cached;
  if (cached && now - cached.builtAt < this.ttlMs) {
    return cached.promise ?? cached.data;
  }
  const promise = this.compute();
  this.cached = {data: {}, promise, builtAt: now};
  try {
    const data = await promise;
    this.cached = {data, builtAt: now};
    return data;
    // … excerpt ends here in the review
```
1. Cache recompute race 🐞 Bug ☼ Reliability
HistoricalAuthorDataCache.get() can start a second compute() while a previous compute promise is still in flight if the first run exceeds ttlMs, causing duplicate lookups and allowing an older promise to overwrite a newer cached value.
Agent Prompt
### Issue description
`HistoricalAuthorDataCache.get()` uses `builtAt` (set before `compute()` resolves) for the TTL check, and it does not prioritize an in-flight `cached.promise`. If `compute()` takes longer than `ttlMs`, a subsequent `get()` will start a new `compute()`, and whichever promise resolves last will overwrite `this.cached`, potentially reverting the cache to older data.
### Issue Context
This cache is used on the `CLIENT_READY` join path, where slow author fetches under load are plausible; the race undermines the PR’s primary goal (coalescing work under concurrency).
### Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[31-46]
### Suggested fix
- If `this.cached?.promise` exists, return it **regardless of TTL** (TTL should apply to completed `data`, not to in-flight work).
- Set `builtAt` when the compute finishes (e.g., `builtAt: this.now()` after `await promise`).
- Ensure only the “current” promise can commit results (compare stored promise identity or use a monotonically increasing generation counter):
- Create `const promise = this.compute(); this.cached = {data: {}, promise, builtAt: now};`
- After `await promise`, only assign `this.cached = {data, builtAt}` if `this.cached?.promise === promise`.
- Add a unit test that simulates a slow fetch exceeding TTL and asserts only one compute is used and the cache is not overwritten by an older completion.
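A self-contained sketch of the suggested guard (field names assumed, mirroring the excerpt above): the TTL applies only to completed data, an in-flight promise is always reused, and only the still-current promise may commit its result.

```ts
// Sketch of the suggested fix, not the PR's final code.
type AuthorRecord = {name?: string; colorId?: number};
type AuthorMap = {[id: string]: AuthorRecord};
type CacheState = {data: AuthorMap; promise?: Promise<AuthorMap>; builtAt: number};

class GuardedCache {
  private cached: CacheState | null = null;

  constructor(
    private readonly compute: () => Promise<AuthorMap>,
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  async get(): Promise<AuthorMap> {
    const cached = this.cached;
    if (cached?.promise) return cached.promise;  // always reuse in-flight work
    if (cached && this.now() - cached.builtAt < this.ttlMs) return cached.data;
    const promise = this.compute();
    this.cached = {data: {}, promise, builtAt: this.now()};
    try {
      const data = await promise;
      // Only the still-current promise may commit (covers invalidate() races).
      if (this.cached?.promise === promise) {
        this.cached = {data, builtAt: this.now()};  // builtAt set at completion
      }
      return data;
    } catch (err) {
      if (this.cached?.promise === promise) this.cached = null;  // next call retries
      throw err;
    }
  }

  invalidate(): void { this.cached = null; }
}
```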
Four issues raised on the first push:
1. **Cache recompute race.** If compute() ran past ttlMs, a second
get() would start a duplicate compute and the older resolution
could clobber the newer cached value. Fixed by piggybacking on
the in-flight promise regardless of TTL — never start a second
compute while one is running.
2. **Unexplained conditional type.** `AuthorRecord extends never ?
never : ...` was a leftover that confused future readers
without earning anything. Replaced with a plain `CacheState |
null` typed field with documented members.
3. **Shared cache object.** Cache hits returned the same object
reference; handleClientReady embeds it in clientVars exposed to
the clientVars hook, where a mutation by one join could bleed
into the next. Now defensively shallow-clones on every get().
4. **Missing author logs.** The previous inline Promise.all loop
in handleClientReady emitted `messageLogger.error('There is no
author for authorId...')` per missing id; the refactor dropped
that. Cache now takes an optional onMissingAuthor callback;
Pad.ts wires it to the same log4js logger so the error still
surfaces.
Plus: commit-state-promise-identity guard in refresh() — only
commit the resolved data if the in-flight promise is still the
one the state references (covers invalidate() racing a compute).
8/8 tests green: original 6 + new fresh-object guarantee + slow-
compute-past-TTL guarantee.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
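One of the two new guarantees above, as a mocha-style test sketch against the GuardedCache sketch shown earlier (harness details and names are assumptions):

```ts
// Sketch of the slow-compute-past-TTL test: a compute that outlives the TTL
// must not trigger a duplicate fetch, and both callers see the same result.
import assert from 'node:assert';

type AuthorMap = {[id: string]: {name?: string; colorId?: number}};

it('reuses a compute that runs past the TTL', async () => {
  let fetches = 0;
  let release!: (v: AuthorMap) => void;
  const cache = new GuardedCache(
      () => {
        fetches++;
        return new Promise<AuthorMap>((resolve) => { release = resolve; });
      },
      5 /* ttlMs */);
  const first = cache.get();
  await new Promise((r) => setTimeout(r, 20));  // well past the 5ms TTL
  const second = cache.get();  // must piggyback, not start a second compute
  release({a1: {name: 'Alice'}});
  assert.deepStrictEqual(await first, {a1: {name: 'Alice'}});
  assert.strictEqual(await second, await first);
  assert.strictEqual(fetches, 1);
});
```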
Captures everything learned since the first draft:

- The "250-author cliff" was a measurement artefact from per-IP commitRateLimiting plus a colocated harness. Fixed via the etherpad-load-test#105 workflow patch. The real ceiling is ~350-400 authors on a 4-vCPU GitHub runner.
- apply_mean ballooning at the cliff isn't slow code; it's OS preemption (7+ cores of work on 4 vCPUs). Application-level JS rearrangement can't reach it.
- Two changes hold up under the dive: fan-out serialization + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and the historicalAuthorData cache (#7769, neutral in the dive but a real production thundering-herd fix at join time).
- Four directions didn't pan out: WebSocket-only transport, a heap bump, message-level batching alone (#7766, closed), and rebase-loop prefetch (#7770, closed). Each has a one-line cause documented for the record.
- Engine.io transport-level packing (#7767) is the meatiest untouched lever: sending multiple packets per WebSocket frame the way polling already does via encodePayload.

Qodo-flagged corrections incorporated:

1. The new instruments are a Histogram + Counter + Gauge, not "three counters"; now labelled correctly.
2. The lever-3 line reference now points at updatePadClients (lines 985-999), where NEW_CHANGES actually emits, not the wrong line 627 (handleSaveRevisionMessage).
3. Lever 3's results are written up against measured data, not "deferred".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addressing Qodo's four items from the initial review (commit message above). 8/8 tests green.
Closing. Triple-run N=3 scoring against this PR shows it's net-negative above ~300 authors vs develop. p95 envelope comparison (min/med/max across 3 identical sweeps each):

| step | develop p95 (min/med/max) | this PR p95 (min/med/max) |
|------|---------------------------|---------------------------|
| 350  | 39 / 39 / 122 ms          | 301 / 488 / 633 ms        |
Probable cause: the defensive shallow-clone-on-every-get() added in the Qodo-feedback fix walks O(N) author entries per joining client. With a burst of 50 new joiners × N existing authors × clone allocations at each step ramp, plus the GC pressure that creates, the cache becomes more expensive than the inline Promise.all it was meant to collapse.

The hypothesis this PR addressed (that the 250-author cliff is a join-thundering-herd problem) was also falsified: the 250 cliff turned out to be a per-IP commitRateLimiting artefact from harness colocation (fixed in load-test#105). The cache PR was built on that wrong diagnosis.

What might still be salvageable for production (where many real users from many IPs DO join simultaneously, which the dive can't measure) is the join-coalescing idea itself, but without the defensive clone, and without the TTL refresh that adds another N-object-allocation footgun. If anyone wants to revisit, the 8 unit tests still pin the intended coalescing semantics and are the place to start.
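If anyone does pick it up, a sketch of the salvage shape hinted at above: coalescing kept, the per-get clone and the TTL refresh both dropped, with explicit invalidation instead (all names assumed):

```ts
// Clone-free, invalidation-driven coalescer (assumed names, sketch only).
// Returns one frozen shared map instead of shallow-cloning per joiner;
// callers treat it as read-only, and pad code calls invalidate() whenever
// an author is added, instead of relying on a TTL.
type AuthorMap = {[id: string]: {name?: string; colorId?: number}};

class CoalescingAuthorData {
  private current: Promise<Readonly<AuthorMap>> | null = null;

  constructor(private readonly compute: () => Promise<AuthorMap>) {}

  get(): Promise<Readonly<AuthorMap>> {
    if (this.current == null) {
      // Freeze once at build time; no O(N) copy on subsequent hits.
      this.current = this.compute().then((data) => Object.freeze(data));
    }
    return this.current;
  }

  invalidate(): void {  // wire to the author-add event
    this.current = null;
  }
}
```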
N=3 scoring of feat/cache-historical-author-data shows it's net-negative above 300 authors (step 350 p95 envelope 301/488/633 ms vs develop baseline 39/39/122 ms). Two compounding issues:

- The motivating hypothesis (the 250 cliff is a join thundering herd) was falsified; that cliff was the per-IP rate-limit artefact.
- The defensive shallow-clone-on-every-get() added in the Qodo fix walks O(N) author entries per join, costing more than the inline Promise.all it replaced.

Updated recommendations: lever 3 (#7768) is now the only PR worth merging. Lever 6 (#7769) is added to the do-not-merge list with honest data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Per-pad cache for the `{authorId → {name, colorId}}` map that `handleClientReady` builds for every CLIENT_READY message. Eliminates a thundering-herd condition on join when multiple users connect at once to a long-lived pad.

Origin: the dive's 250-author cliff turned out to be two stacked artifacts
I built this PR thinking it would move the dive's 250-author cliff (#7756). It didn't, and the data is interesting:
The 250-author cliff was a measurement artifact.
`NODE_ENV=production` enables Etherpad's per-IP `commitRateLimiting`. `handleMessage` runs the rate limiter on every message, including `CLIENT_READY`. With the dive harness and SUT colocated on one runner, all simulated authors share `127.0.0.1` = one rate-limit bucket = trips at N × edit-rate. Fixed in etherpad-load-test#105.
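To make the bucket arithmetic concrete, a toy calculation; the limiter numbers here are illustrative assumptions, not Etherpad's shipped defaults:

```ts
// Illustrative only: why colocating every simulated author behind one IP
// trips a per-IP commit rate limiter at N × edit-rate.
const pointsPerWindow = 10;   // assumed: messages allowed per bucket per window
const windowSeconds = 1;      // assumed window length
const authors = 250;          // simulated authors, all sharing 127.0.0.1
const editRateHz = 0.5;       // edits per author per second

const bucketLoad = authors * editRateHz;          // 125 msg/s into ONE bucket
const allowed = pointsPerWindow / windowSeconds;  // 10 msg/s allowed
console.log(`limiter trips: ${bucketLoad > allowed}`);  // true
```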
`p95` flag at step 400 with `p95 ≈ 2000 ms` and CPU at 54-65 CPU-seconds per 8-second dwell. That's steady-state fan-out CPU saturation, not join-path cost.

So: this PR doesn't move the dive's cliff. It does fix a real thundering-herd condition that only appears when many users join one pad at the same time, which is exactly the case worth caching for; just not what limits concurrency at the saturation boundary.
Change
Each CLIENT_READY previously did:
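```ts
// The former inline fan-out, as quoted in the PR description above; pad and
// authorManager are the surrounding handleClientReady context.
await Promise.all(pad.getAllAuthors().map(authorManager.getAuthor));
```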
At 200 existing authors × 50 simultaneous joiners that's 10 000 ueberdb cache lookups + async/await bookkeeping for a result the joiners share. New per-pad cache (5-second TTL) collapses that to one shared computation per burst.
- `src/node/db/HistoricalAuthorDataCache.ts`: pure helper, no DB or Pad dependency. Takes the author-id list and the per-id fetcher as constructor args.
- `Pad.getHistoricalAuthorData()`: lazy-inits a cache per pad.
- `PadMessageHandler.handleClientReady`: replaces the inline Promise.all loop with a single await, as sketched below.
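A sketch of the replacement site (surrounding variable names assumed):

```ts
// One await per joiner; all simultaneous joiners share the per-pad
// computation behind pad.getHistoricalAuthorData().
const historicalAuthorData = await pad.getHistoricalAuthorData();
// The {authorId: {name, colorId}} map is then embedded in the clientVars
// object sent back in the CLIENT_VARS response.
```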
Tests (6/6)

- Correct `{name, colorId}` per author.
- N simultaneous `get()` calls produce 1 fetch per author (the win).
- TTL refetch.
- Falsy fetch results omitted.
- Explicit `invalidate()` works (hookable later for author-add events).
- A failed fetch leaves the cache clearable; the next call retries.
When this matters

Production-significant when many users join the same pad in a short window: classroom-style "everyone open the link now", live-event onboarding bursts, integration tests, etc. The cost it removes is CPU-bound on the event loop right at the moment new users are waiting on `CLIENT_VARS`, so removing it improves perceived join latency under that specific load.

The dive doesn't exercise this scenario at a level where it dominates; the cliff at 350-400 in the dive is true steady-state CPU saturation. Moving that ceiling further is a separate investigation (engine.io transport-level packing per #7767, or bigger hardware).
Test plan
- `getAuthor` fans should disappear.

🤖 Generated with Claude Code