feat(pad): cache historicalAuthorData (remove join-time thundering herd) by JohnMcLear · Pull Request #7769 · ether/etherpad

JohnMcLear · 2026-05-16T00:20:54Z

Summary

Per-pad cache for the {authorId → {name, colorId}} map that handleClientReady builds for every CLIENT_READY message. Eliminates a thundering-herd condition on join when multiple users connect at once to a long-lived pad.

Origin: the dive's 250-author cliff turned out to be two stacked artifacts

I built this PR thinking it would move the dive's 250-author cliff (#7756). It didn't, and the data is interesting:

The 250-author cliff was a measurement artifact. NODE_ENV=production enables Etherpad's per-IP commitRateLimiting. handleMessage runs the rate limiter on every message including CLIENT_READY. With the dive harness and SUT colocated on one runner, all simulated authors share 127.0.0.1 = one rate-limit bucket = trips at N × edit-rate. Fixed in etherpad-load-test#105.
The real cliff on a 4-vCPU GitHub runner is 350-400 authors, with or without this PR. Both develop and this branch hit the p95 flag at step 400 with p95 ≈ 2000 ms and CPU at 54-65 CPU-seconds per 8-second dwell. Steady-state fan-out CPU saturation, not join-path cost.

So: this PR doesn't move the dive's cliff. It does fix a real thundering-herd condition that only appears when many users join one pad at the same time, which is exactly the case it's worth caching for — just not what limits concurrency at the saturation boundary.

Change

Each CLIENT_READY previously did:

const authors = pad.getAllAuthors();
await Promise.all(authors.map(async (id) => {
  const a = await authorManager.getAuthor(id);
  historicalAuthorData[id] = {name: a.name, colorId: a.colorId};
}));

At 200 existing authors × 50 simultaneous joiners that's 10 000 ueberdb cache lookups + async/await bookkeeping for a result the joiners share. New per-pad cache (5-second TTL) collapses that to one shared computation per burst.

src/node/db/HistoricalAuthorDataCache.ts — pure helper, no DB or Pad dependency. Takes the author-id list and the per-id fetcher as constructor args.
Pad.getHistoricalAuthorData() — lazy-inits a cache per pad.
PadMessageHandler.handleClientReady — replaces the inline Promise.all loop with a single await.

Tests (6/6)

Returns one entry per author with {name, colorId}.
50 simultaneous get() calls produce 1 fetch per author — the win.
Refetches once the TTL expires.
Omits authors the fetcher returns falsy for.
Explicit invalidate() works (hookable later for author-add events).
Failed fetch leaves the cache clearable, next call retries.

When this matters

Production-significant when many users join the same pad in a short window: classroom-style "everyone open the link now", live-event onboarding bursts, integration tests, etc. The cost it removes is CPU-bound on the event loop right at the moment new users are waiting on CLIENT_VARS, so removing it improves perceived join latency under that specific load.

The dive doesn't exercise this scenario at a level where it dominates; the cliff at 350-400 in the dive is true steady-state CPU saturation. Moving that ceiling further is a separate investigation (engine.io transport-level packing per #7767, or bigger hardware).

Test plan

CI green
6/6 unit tests pass
Steady-state numbers unchanged at all step sizes through 350 in run 25949421120 vs develop in run 25949525421
Manual: open a pad with N>50 authors, join from 10 browsers within 2 seconds. Expect no observable change vs develop except the connect-time CPU spike from getAuthor fans should disappear.

🤖 Generated with Claude Code

The connect-handshake cliff at ~250 authors isn't a fan-out problem (see #7768) — it's a join hot path problem. Each CLIENT_READY did Promise.all(pad.getAllAuthors().map(authorManager.getAuthor)) = O(N) ueberdb cache lookups + async/await scheduling per joiner. At 200 existing authors and 50 simultaneous joiners that's 10 000 lookups racing against the existing authors' USER_CHANGES traffic for event-loop time. The new joiners' CLIENT_VARS responses then exceed the harness's 10s connect timeout and the cliff appears. This PR adds a per-pad HistoricalAuthorDataCache (TTL 5s, defaults overridable for tests) that collapses simultaneous joiners onto a single shared computation. 50 joiners now share 1 fetch per author instead of doing 50 independent N-author fans. Implementation: - src/node/db/HistoricalAuthorDataCache.ts — pure helper, no DB or Pad dependency; takes the author-id list and the per-id fetcher as constructor args. Lives in its own module so the test can exercise the real production code without standing up the full pad / DB stack (importing Pad transitively pulls DB). - Pad.getHistoricalAuthorData() — lazy-inits a cache instance per pad, fetches via authorManager.getAuthor. - PadMessageHandler.handleClientReady — replaces the inline Promise.all(authors.map(authorManager.getAuthor)) with a single await pad.getHistoricalAuthorData() call. Tests (6/6) pin: - correct {name, colorId} per author - N simultaneous get() -> 1 fetch per author (the actual win) - TTL refetch - falsy fetch results omitted - explicit invalidate() - failed fetch leaves the cache clearable, next call retries Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-code-review · 2026-05-16T00:20:59Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

qodo-free-for-open-source-projects · 2026-05-16T00:21:13Z

Review Summary by Qodo

Per-pad cache for historical author data to fix join cliff

✨ Enhancement

Walkthroughs

Description

• Adds per-pad cache for historical author data to eliminate O(N²) lookups during simultaneous
  client joins
• Collapses 50 simultaneous joiners' author fetches into single shared computation
• Reduces connect-handshake cliff from ~250 authors by caching with 5-second TTL
• Replaces inline Promise.all loop in handleClientReady with cached getHistoricalAuthorData call

Diagram

flowchart LR
  A["50 Simultaneous<br/>CLIENT_READY"] -->|Before: 50×N<br/>ueberdb lookups| B["Event Loop<br/>Contention"]
  B --> C["10s Timeout<br/>Exceeded"]
  A -->|After: 1 shared<br/>computation| D["HistoricalAuthorDataCache<br/>TTL 5s"]
  D --> E["Shared Author<br/>Data Map"]
  E --> F["Fast CLIENT_VARS<br/>Response"]

File Changes

1. src/node/db/HistoricalAuthorDataCache.ts ✨ Enhancement +62/-0

New cache module for coalesced author data fetches

• New pure helper module implementing per-pad author data cache with configurable TTL (default 5s)
• Coalesces simultaneous get() calls into single fetch operation via shared promise
• Supports explicit invalidate() for future author-add event hooks
• Clears cache on fetch failure to enable retry on next call

src/node/db/HistoricalAuthorDataCache.ts

2. src/node/db/Pad.ts ✨ Enhancement +22/-0

Add lazy-initialized per-pad author data cache

• Adds private historicalAuthorDataCache field lazily initialized on first call
• Implements getHistoricalAuthorData() method that returns cached {authorId -> {name, colorId}} map
• Cache uses pad's getAllAuthors() and authorManager.getAuthor as data sources
• Lazy initialization avoids touching authorManager during pad construction

src/node/db/Pad.ts

3. src/node/handler/PadMessageHandler.ts ✨ Enhancement +8/-16

Replace inline author fetch with cached getHistoricalAuthorData call

• Replaces inline Promise.all(authors.map(authorManager.getAuthor)) with single await
 pad.getHistoricalAuthorData() call
• Removes manual author filtering logic (now handled by cache)
• Simplifies handleClientReady by delegating author data fetching to Pad layer
• Adds explanatory comment about #7756 connect-handshake cliff and cache rationale

src/node/handler/PadMessageHandler.ts

View more (1)

4. src/tests/backend-new/specs/pad-historical-author-data.test.ts 🧪 Tests +86/-0

Comprehensive test suite for author data cache

• New test suite with 6 test cases covering cache behavior and correctness
• Verifies correct {name, colorId} structure per author
• Confirms 50 simultaneous get() calls produce only 1 fetch per author (main performance win)
• Tests TTL expiration, falsy result filtering, explicit invalidation, and failure recovery
• Uses pure functions (no DB/Pad dependency) to exercise production code in isolation

src/tests/backend-new/specs/pad-historical-author-data.test.ts

qodo-free-for-open-source-projects · 2026-05-16T00:21:14Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (1) 📎 Requirement gaps (0)

Context used

✅ Tickets: 🎫 Investigate why dropping socket.io polling fallback degrades p95 under load (#7756 follow-up)

1. Cache recompute race 🐞 Bug ☼ Reliability

Description

HistoricalAuthorDataCache.get() can start a second compute() while a previous compute promise is
still in flight if the first run exceeds ttlMs, causing duplicate lookups and allowing an older
promise to overwrite a newer cached value.

Code

src/node/db/HistoricalAuthorDataCache.ts[R31-42]

+  async get(): Promise<{[id: string]: AuthorRecord}> {
+    const now = this.now();
+    const cached = this.cached;
+    if (cached && now - cached.builtAt < this.ttlMs) {
+      return cached.promise ?? cached.data;
+    }
+    const promise = this.compute();
+    this.cached = {data: {}, promise, builtAt: now};
+    try {
+      const data = await promise;
+      this.cached = {data, builtAt: now};
+      return data;

Evidence

The TTL check is applied to builtAt even when a compute is in progress, and builtAt is recorded
before awaiting the promise. If compute duration exceeds TTL, a later call will skip the cache-hit
return and start a new compute, and both awaiters will unconditionally assign to this.cached,
enabling last-writer-wins overwrites.

src/node/db/HistoricalAuthorDataCache.ts[24-46]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`HistoricalAuthorDataCache.get()` uses `builtAt` (set before `compute()` resolves) for the TTL check, and it does not prioritize an in-flight `cached.promise`. If `compute()` takes longer than `ttlMs`, a subsequent `get()` will start a new `compute()`, and whichever promise resolves last will overwrite `this.cached`, potentially reverting the cache to older data.

### Issue Context
This cache is used on the `CLIENT_READY` join path, where slow author fetches under load are plausible; the race undermines the PR’s primary goal (coalescing work under concurrency).

### Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[31-46]

### Suggested fix
- If `this.cached?.promise` exists, return it **regardless of TTL** (TTL should apply to completed `data`, not to in-flight work).
- Set `builtAt` when the compute finishes (e.g., `builtAt: this.now()` after `await promise`).
- Ensure only the “current” promise can commit results (compare stored promise identity or use a monotonically increasing generation counter):
 - Create `const promise = this.compute(); this.cached = {data: {}, promise, builtAt: now};`
 - After `await promise`, only assign `this.cached = {data, builtAt}` if `this.cached?.promise === promise`.
- Add a unit test that simulates a slow fetch exceeding TTL and asserts only one compute is used and the cache is not overwritten by an older completion.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Unexplained conditional type in cached 📘 Rule violation ⚙ Maintainability

Description

HistoricalAuthorDataCache introduces a non-obvious conditional type (`AuthorRecord extends never ?
never : ...`) without any explanation. This reduces maintainability because future readers will not
understand why the type trick is needed or whether it is safe to remove.

Code

src/node/db/HistoricalAuthorDataCache.ts[R18-22]

+  private cached: AuthorRecord extends never ? never : {
+    data: {[id: string]: AuthorRecord};
+    promise?: Promise<{[id: string]: AuthorRecord}>;
+    builtAt: number;
+  } | null = null;

Evidence

The checklist requires comments for complex or non-obvious code. The added cached field uses a
conditional type that appears to be a type-system workaround, but it is not documented in the
surrounding comments, making the intent unclear.

src/node/db/HistoricalAuthorDataCache.ts[18-22]
Best Practice: Repository guidelines

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A non-obvious TypeScript conditional type is used for `HistoricalAuthorDataCache.cached` (`AuthorRecord extends never ? never : ...`) with no explanation. This makes the code harder to maintain and increases the risk of accidental breakage during refactors.

## Issue Context
The PR adds `src/node/db/HistoricalAuthorDataCache.ts` as a helper module. Most of the logic is well-commented, but the `cached` field type includes an unexplained type-level trick that is not self-evident.

## Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[18-22]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Shared cache object 🐞 Bug ☼ Reliability

Description

Cache hits return the same historicalAuthorData object reference, and handleClientReady embeds it
into clientVars passed to the clientVars hook; any mutation by one join can affect other joiners
within the TTL window.

Code

src/node/handler/PadMessageHandler.ts[R1090-1100]

+  // Per-pad cached author lookup (#7756 connect-handshake cliff). At 200+
+  // authors a fresh burst of 50 simultaneous joiners would otherwise do
+  // 50 * 200 = 10000 ueberdb cache lookups inside the join hot path,
+  // competing for the same event loop as the existing authors' USER_CHANGES
+  // traffic. The Pad-level cache collapses that to a single computation
+  // shared across the simultaneous joins (5-second TTL).
+  const historicalAuthorData: MapArrayType<{
    name: string;
    colorId: string;
-  }> = {};
-  await Promise.all(authors.map(async (authorId: string) => {
-    const author = await authorManager.getAuthor(authorId);
-    if (!author) {
-      messageLogger.error(`There is no author for authorId: ${authorId}. ` +
-          'This is possibly related to https://github.com/ether/etherpad-lite/issues/2802');
-    } else {
-      // Filter author attribs (e.g. don't send author's pads to all clients)
-      historicalAuthorData[authorId] = {name: author.name, colorId: author.colorId};
-    }
-  }));
+  }> = await pad.getHistoricalAuthorData();

Evidence
The cache returns cached.data directly on cache hits, and the handler uses the returned object as
part of clientVars and exposes it to hooks.aCallAll('clientVars', ...), making it reachable by
code that can mutate it.
src/node/db/HistoricalAuthorDataCache.ts[31-36]
src/node/handler/PadMessageHandler.ts[1090-1100]
src/node/handler/PadMessageHandler.ts[1245-1326]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`HistoricalAuthorDataCache.get()` returns the cached object by reference on cache hits. `handleClientReady` then places that object on `clientVars` and passes it to the `clientVars` hook, which can mutate it. Because the same object is shared, mutations can leak across join requests during the TTL.

### Issue Context
Previously, `historicalAuthorData` was rebuilt per `CLIENT_READY`, so per-request mutations were isolated. With caching, the object is shared across multiple joins.

### Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[31-36]
- src/node/handler/PadMessageHandler.ts[1090-1100]
- src/node/handler/PadMessageHandler.ts[1245-1326]

### Suggested fix
Pick one boundary to enforce immutability/isolation:
- In `handleClientReady`, clone the returned map before embedding in `clientVars` (recommended to keep cache fast internally):
 - Shallow clone keys + value objects, e.g. `Object.fromEntries(Object.entries(data).map(([k,v]) => [k, {...v}]))`.
- Alternatively, have `HistoricalAuthorDataCache.get()` return a defensive copy on every call (higher overhead for repeated callers).
- Consider `Object.freeze()` only if you’re confident plugins won’t mutate `clientVars` (freezing could break existing plugins).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

4. Missing author logs 🐞 Bug ◔ Observability

Description

HistoricalAuthorDataCache.compute() silently drops authorIds whose global author record is missing,
removing the explicit error log previously emitted during CLIENT_READY and making this
data-integrity issue harder to detect.

Code

src/node/db/HistoricalAuthorDataCache.ts[R53-59]

+  private async compute(): Promise<{[id: string]: AuthorRecord}> {
+    const ids = this.listAuthorIds();
+    const out: {[id: string]: AuthorRecord} = {};
+    await Promise.all(ids.map(async (id) => {
+      const a = await this.getAuthor(id);
+      if (a) out[id] = {name: a.name, colorId: a.colorId};
+    }));

Evidence
The cache’s compute path only writes entries when a is truthy and has no logging, and the
underlying authorManager.getAuthor is a plain DB read without diagnostics, so missing historical
authors are now silently ignored in this code path.
src/node/db/HistoricalAuthorDataCache.ts[53-59]
src/node/db/AuthorManager.ts[218-218]
src/node/handler/PadMessageHandler.ts[1087-1100]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When `getAuthor(id)` returns null/undefined for an author ID present in the pad’s author pool, the cache currently omits that author without logging. Previously, `handleClientReady` logged an error in this situation, which helped surface author DB/pool inconsistencies.

### Issue Context
The cache helper is intentionally “pure” and doesn’t depend on logging infrastructure, but the call site can still emit diagnostics.

### Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[53-59]
- src/node/handler/PadMessageHandler.ts[1090-1100]
- src/node/db/AuthorManager.ts[218-218]

### Suggested fix
- Add an optional callback to `HistoricalAuthorDataCache` (e.g., `onMissingAuthorId?: (id) => void`) and call it when `a` is falsy.
 - Provide the callback from `Pad.getHistoricalAuthorData()` or `handleClientReady` using the existing `messageLogger`.
- If you want to keep the helper strictly pure, do a lightweight integrity check at the call site (e.g., if `pad.getAllAuthors().length !== Object.keys(data).length`, compute the missing IDs and log them).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-free-for-open-source-projects · 2026-05-16T00:25:37Z

+  async get(): Promise<{[id: string]: AuthorRecord}> {
+    const now = this.now();
+    const cached = this.cached;
+    if (cached && now - cached.builtAt < this.ttlMs) {
+      return cached.promise ?? cached.data;
+    }
+    const promise = this.compute();
+    this.cached = {data: {}, promise, builtAt: now};
+    try {
+      const data = await promise;
+      this.cached = {data, builtAt: now};
+      return data;


1. Cache recompute race 🐞 Bug ☼ Reliability

HistoricalAuthorDataCache.get() can start a second compute() while a previous compute promise is still in flight if the first run exceeds ttlMs, causing duplicate lookups and allowing an older promise to overwrite a newer cached value.

Agent Prompt

### Issue description `HistoricalAuthorDataCache.get()` uses `builtAt` (set before `compute()` resolves) for the TTL check, and it does not prioritize an in-flight `cached.promise`. If `compute()` takes longer than `ttlMs`, a subsequent `get()` will start a new `compute()`, and whichever promise resolves last will overwrite `this.cached`, potentially reverting the cache to older data. ### Issue Context This cache is used on the `CLIENT_READY` join path, where slow author fetches under load are plausible; the race undermines the PR’s primary goal (coalescing work under concurrency). ### Fix Focus Areas - src/node/db/HistoricalAuthorDataCache.ts[31-46] ### Suggested fix - If `this.cached?.promise` exists, return it **regardless of TTL** (TTL should apply to completed `data`, not to in-flight work). - Set `builtAt` when the compute finishes (e.g., `builtAt: this.now()` after `await promise`). - Ensure only the “current” promise can commit results (compare stored promise identity or use a monotonically increasing generation counter): - Create `const promise = this.compute(); this.cached = {data: {}, promise, builtAt: now};` - After `await promise`, only assign `this.cached = {data, builtAt}` if `this.cached?.promise === promise`. - Add a unit test that simulates a slow fetch exceeding TTL and asserts only one compute is used and the cache is not overwritten by an older completion.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Four issues raised on the first push: 1. **Cache recompute race.** If compute() ran past ttlMs, a second get() would start a duplicate compute and the older resolution could clobber the newer cached value. Fixed by piggybacking on the in-flight promise regardless of TTL — never start a second compute while one is running. 2. **Unexplained conditional type.** `AuthorRecord extends never ? never : ...` was a leftover that confused future readers without earning anything. Replaced with a plain `CacheState | null` typed field with documented members. 3. **Shared cache object.** Cache hits returned the same object reference; handleClientReady embeds it in clientVars exposed to the clientVars hook, where a mutation by one join could bleed into the next. Now defensively shallow-clones on every get(). 4. **Missing author logs.** The previous inline Promise.all loop in handleClientReady emitted `messageLogger.error('There is no author for authorId...')` per missing id; the refactor dropped that. Cache now takes an optional onMissingAuthor callback; Pad.ts wires it to the same log4js logger so the error still surfaces. Plus: commit-state-promise-identity guard in refresh() — only commit the resolved data if the in-flight promise is still the one the state references (covers invalidate() racing a compute). 8/8 tests green: original 6 + new fresh-object guarantee + slow- compute-past-TTL guarantee. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures everything learned since the first draft: - The "250-author cliff" was a measurement artefact from per-IP commitRateLimiting + colocated harness. Fixed via the etherpad-load-test#105 workflow patch. Real ceiling is ~350-400 authors on a 4-vCPU GitHub runner. - apply_mean ballooning at the cliff isn't slow code — it's OS preemption (7+ cores of work on 4 vCPU). Application-level JS rearrangement can't reach it. - Two changes hold up under the dive: fan-out serialization + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and historicalAuthorData cache (#7769, neutral on dive but real production thundering-herd fix at join time). - Four directions didn't pan out: WebSocket-only transport, heap bump, message-level batching alone (#7766 closed), and rebase-loop prefetch (#7770 closed). Each has a one-line cause documented for the record. - Engine.io transport-level packing (#7767) is the meatiest untouched lever — sending multiple packets per WebSocket frame the way polling already does via encodePayload. Qodo-flagged corrections incorporated: 1. The new instruments are Histogram + Counter + Gauge, not "three counters" — labelled correctly. 2. The lever-3 line reference now points at updatePadClients (lines 985-999) where NEW_CHANGES actually emits, not the wrong line 627 (handleSaveRevisionMessage). 3. Lever 3's results are written up against measured data, not "deferred". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JohnMcLear · 2026-05-16T05:30:28Z

Addressing Qodo's four items from the initial review (commit 478c663):

Cache recompute race → Fixed. get() now piggybacks on the in-flight promise regardless of TTL (`if (s?.promise) return cloneData(await s.promise)`), and the resolve handler only commits state if it's still the active promise (covers `invalidate()` racing a compute). New test a slow compute that runs past TTL still resolves callers; no duplicate fetch starts in flight pins this.
Unexplained conditional type → Fixed. Replaced `AuthorRecord extends never ? never : ...` with a documented `CacheState | null` type, fields commented inline.
Shared cache object → Fixed. All get() returns go through a cloneData() helper that produces a fresh top-level object and fresh per-author records. Mutations by one joiner can't bleed into the next. New test returns a fresh object on every get() pins this.
Missing author logs → Fixed. Cache now accepts an optional onMissingAuthor(id) callback, invoked once per id the fetcher returns falsy for. Pad.getHistoricalAuthorData wires it to a log4js logger emitting the same 'There is no author for authorId: ... related to #2802' message the inline loop used to. New test calls onMissingAuthor exactly once per id the fetcher returns falsy for pins it.

8/8 tests in pad-historical-author-data.test.ts green. All 31 CI checks pass. CI: https://github.com/ether/etherpad/pull/7769/checks

JohnMcLear · 2026-05-16T06:19:09Z

Closing. Triple-run N=3 scoring against this PR shows it's net-negative above ~300 authors vs develop.

p95 envelope comparison (min/med/max across 3 identical sweeps each):

Step	develop	this PR
200	30 / 37 / 51	29 / 38 / 65
300	38 / 45 / 71	39 / 93 / 240
350	39 / 39 / 122	301 / 488 / 633
400	1758 / 2275 / 2463	3053 / 3203 / 3327

Probable cause: the defensive shallow-clone-on-every-get() added in the Qodo-feedback fix walks O(N) author entries per joining client. With burst-of-50 new joiners × N existing authors × clone allocations at each step ramp, plus GC pressure, the cache becomes more expensive than the inline Promise.all it was meant to collapse.

The hypothesis this PR addressed — the 250-author cliff is a join-thundering-herd problem — was also falsified: the 250 cliff turned out to be a per-IP commitRateLimiting artefact from harness colocation (fixed in load-test#105). The cache PR was built on that wrong diagnosis.

What might still be salvageable for production (where many real users from many IPs DO join simultaneously, not measurable in the dive) is the join coalescing idea itself — but without the defensive clone, and without the TTL refresh that adds another N-object-allocation footgun. If anyone wants to revisit, the HistoricalAuthorDataCache module remains a useful template; just rework the public surface to return the data structure the cache owns (with documented "don't mutate" contract) instead of cloning per call.

The 8 unit tests in pad-historical-author-data.test.ts still pass — the implementation is correct, it just doesn't deliver the perf benefit it was scoped to deliver.

N=3 scoring of feat/cache-historical-author-data shows it's net-negative above 300 authors (step 350 p95 envelope 301/488/633ms vs develop baseline 39/39/122ms). Two compounding issues: - The motivating hypothesis (250-cliff is a join thundering herd) was falsified — that cliff was the per-IP rate-limit artefact. - The defensive shallow-clone-on-every-get() added in the Qodo fix walks O(N) author entries per join, costing more than the inline Promise.all it replaced. Updated recommendations: lever 3 (#7768) is now the only PR worth merging. lever 6 (#7769) added to the do-not-merge list with honest data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-free-for-open-source-projects Bot added the Compliance violation label May 16, 2026

qodo-free-for-open-source-projects Bot reviewed May 16, 2026

View reviewed changes

JohnMcLear changed the title ~~feat(pad): per-pad cache for historicalAuthorData (move #7756 join cliff)~~ feat(pad): cache historicalAuthorData (remove join-time thundering herd) May 16, 2026

JohnMcLear marked this pull request as draft May 16, 2026 04:52

JohnMcLear closed this May 16, 2026

This was referenced May 16, 2026

docs: scaling dive 2026-05 (closes Phase 2 of #7756) #7765

Draft

[#7756 follow-up] Room-broadcast NEW_CHANGES fan-out to drop ~5-7% per-recipient packet construction #7780

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(pad): cache historicalAuthorData (remove join-time thundering herd)#7769

feat(pad): cache historicalAuthorData (remove join-time thundering herd)#7769
JohnMcLear wants to merge 2 commits into
developfrom
feat/cache-historical-author-data

JohnMcLear commented May 16, 2026 •

edited

Loading

Uh oh!

qodo-code-review Bot commented May 16, 2026

Uh oh!

qodo-free-for-open-source-projects Bot commented May 16, 2026

Uh oh!

qodo-free-for-open-source-projects Bot commented May 16, 2026 •

edited

Loading

Uh oh!

qodo-free-for-open-source-projects Bot May 16, 2026

Uh oh!

JohnMcLear commented May 16, 2026

Uh oh!

JohnMcLear commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

JohnMcLear commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Origin: the dive's 250-author cliff turned out to be two stacked artifacts

Change

Tests (6/6)

When this matters

Test plan

Uh oh!

qodo-code-review Bot commented May 16, 2026

Qodo reviews are paused for this user.

Uh oh!

qodo-free-for-open-source-projects Bot commented May 16, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-free-for-open-source-projects Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

qodo-free-for-open-source-projects Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

JohnMcLear commented May 16, 2026

Uh oh!

JohnMcLear commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JohnMcLear commented May 16, 2026 •

edited

Loading

qodo-free-for-open-source-projects Bot commented May 16, 2026 •

edited

Loading