feat(pad): cache historicalAuthorData (remove join-time thundering herd) #7769

Closed
JohnMcLear wants to merge 2 commits into develop from feat/cache-historical-author-data

Conversation

@JohnMcLear
Member

@JohnMcLear JohnMcLear commented May 16, 2026

Summary

Per-pad cache for the {authorId → {name, colorId}} map that handleClientReady builds for every CLIENT_READY message. Eliminates a thundering-herd condition on join when multiple users connect at once to a long-lived pad.

Origin: the dive's 250-author cliff turned out to be two stacked artifacts

I built this PR thinking it would move the dive's 250-author cliff (#7756). It didn't, and the data is interesting:

  1. The 250-author cliff was a measurement artifact. NODE_ENV=production enables Etherpad's per-IP commitRateLimiting. handleMessage runs the rate limiter on every message including CLIENT_READY. With the dive harness and SUT colocated on one runner, all simulated authors share 127.0.0.1 = one rate-limit bucket = trips at N × edit-rate. Fixed in etherpad-load-test#105.

  2. The real cliff on a 4-vCPU GitHub runner is 350-400 authors, with or without this PR. Both develop and this branch hit the p95 flag at step 400 with p95 ≈ 2000 ms and CPU at 54-65 CPU-seconds per 8-second dwell. Steady-state fan-out CPU saturation, not join-path cost.

So: this PR doesn't move the dive's cliff. It does fix a real thundering-herd condition that only appears when many users join one pad at the same time, which is exactly the case it's worth caching for — just not what limits concurrency at the saturation boundary.

Change

Each CLIENT_READY previously did:

const authors = pad.getAllAuthors();
await Promise.all(authors.map(async (id) => {
  const a = await authorManager.getAuthor(id);
  historicalAuthorData[id] = {name: a.name, colorId: a.colorId};
}));

At 200 existing authors × 50 simultaneous joiners that's 10 000 ueberdb cache lookups + async/await bookkeeping for a result the joiners share. New per-pad cache (5-second TTL) collapses that to one shared computation per burst.
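For orientation, a minimal sketch of the shape that cache takes. Everything here is illustrative (class and field names, defaults), not the exact code under review; the quoted diffs in the review below show the real implementation.

// Illustrative sketch only. Coalesces simultaneous lookups of the
// {authorId -> {name, colorId}} map onto one shared computation with a short TTL.
type AuthorRecord = {name: string; colorId: string};
type AuthorMap = {[id: string]: AuthorRecord};

class HistoricalAuthorDataCacheSketch {
  private cached: {data: AuthorMap; builtAt: number} | null = null;
  private inFlight: Promise<AuthorMap> | null = null;

  constructor(
      // The pad supplies the author-id list; in production the fetcher is authorManager.getAuthor.
      private readonly listAuthorIds: () => string[],
      private readonly getAuthor: (id: string) => Promise<AuthorRecord | null>,
      private readonly ttlMs: number = 5000,
      private readonly now: () => number = Date.now,
  ) {}

  async get(): Promise<AuthorMap> {
    // A recent result exists: serve it with no DB traffic at all.
    if (this.cached != null && this.now() - this.cached.builtAt < this.ttlMs) return this.cached.data;
    // Otherwise start one shared compute; concurrent callers join it instead of
    // each doing their own N-author fan. That is the whole win for a join burst.
    if (this.inFlight == null) {
      this.inFlight = this.compute()
          .then((data) => { this.cached = {data, builtAt: this.now()}; return data; })
          .finally(() => { this.inFlight = null; });
    }
    return await this.inFlight;
  }

  invalidate(): void { this.cached = null; }

  private async compute(): Promise<AuthorMap> {
    const out: AuthorMap = {};
    await Promise.all(this.listAuthorIds().map(async (id) => {
      const author = await this.getAuthor(id);
      if (author) out[id] = {name: author.name, colorId: author.colorId};
    }));
    return out;
  }
}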

  • src/node/db/HistoricalAuthorDataCache.ts — pure helper, no DB or Pad dependency. Takes the author-id list and the per-id fetcher as constructor args.
  • Pad.getHistoricalAuthorData() — lazy-inits a cache per pad.
  • PadMessageHandler.handleClientReady — replaces the inline Promise.all loop with a single await.

Tests (6/6)

  • Returns one entry per author with {name, colorId}.
  • 50 simultaneous get() calls produce 1 fetch per author — the win (sketched after this list).
  • Refetches once the TTL expires.
  • Omits authors the fetcher returns falsy for.
  • Explicit invalidate() works (hookable later for author-add events).
  • Failed fetch leaves the cache clearable, next call retries.
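The coalescing bullet is the load-bearing one. A hedged sketch of what such a test could look like, reusing the HistoricalAuthorDataCacheSketch class from the sketch above; the repo's real spec lives in src/tests/backend-new/specs/pad-historical-author-data.test.ts and may be structured differently:

import assert from 'assert';

// Sketch: 50 "simultaneous joiners" ask for the author map at once,
// but each author is fetched exactly once.
const coalescingSketch = async () => {
  const fetchCounts: {[id: string]: number} = {};
  const cache = new HistoricalAuthorDataCacheSketch(
      () => ['a1', 'a2', 'a3'],
      async (id) => {
        fetchCounts[id] = (fetchCounts[id] ?? 0) + 1;
        return {name: `author ${id}`, colorId: '7'};
      });

  const results = await Promise.all(Array.from({length: 50}, () => cache.get()));

  for (const id of ['a1', 'a2', 'a3']) assert.strictEqual(fetchCounts[id], 1);
  assert.strictEqual(Object.keys(results[0]).length, 3);
};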

When this matters

Production-significant when many users join the same pad in a short window: classroom-style "everyone open the link now", live-event onboarding bursts, integration tests, etc. The cost it removes is CPU-bound on the event loop right at the moment new users are waiting on CLIENT_VARS, so removing it improves perceived join latency under that specific load.

The dive doesn't exercise this scenario at a level where it dominates; the cliff at 350-400 in the dive is true steady-state CPU saturation. Moving that ceiling further is a separate investigation (engine.io transport-level packing per #7767, or bigger hardware).

Test plan

  • CI green
  • 6/6 unit tests pass
  • Steady-state numbers unchanged at all step sizes through 350 in run 25949421120 vs develop in run 25949525421
  • Manual: open a pad with N>50 authors, join from 10 browsers within 2 seconds. Expect no observable change vs develop, except that the connect-time CPU spike from the getAuthor fan-out should disappear.

🤖 Generated with Claude Code

The connect-handshake cliff at ~250 authors isn't a fan-out problem
(see #7768) — it's a join hot path problem. Each CLIENT_READY did
Promise.all(pad.getAllAuthors().map(authorManager.getAuthor)) =
O(N) ueberdb cache lookups + async/await scheduling per joiner.
At 200 existing authors and 50 simultaneous joiners that's 10 000
lookups racing against the existing authors' USER_CHANGES traffic
for event-loop time. The new joiners' CLIENT_VARS responses then
exceed the harness's 10s connect timeout and the cliff appears.

This PR adds a per-pad HistoricalAuthorDataCache (TTL 5s, defaults
overridable for tests) that collapses simultaneous joiners onto a
single shared computation. 50 joiners now share 1 fetch per
author instead of doing 50 independent N-author fans.

Implementation:

- src/node/db/HistoricalAuthorDataCache.ts — pure helper, no DB or
  Pad dependency; takes the author-id list and the per-id fetcher
  as constructor args. Lives in its own module so the test can
  exercise the real production code without standing up the full
  pad / DB stack (importing Pad transitively pulls DB).

- Pad.getHistoricalAuthorData() — lazy-inits a cache instance per
  pad, fetches via authorManager.getAuthor.

- PadMessageHandler.handleClientReady — replaces the inline
  Promise.all(authors.map(authorManager.getAuthor)) with a single
  await pad.getHistoricalAuthorData() call.

Tests (6/6) pin:
- correct {name, colorId} per author
- N simultaneous get() -> 1 fetch per author (the actual win)
- TTL refetch
- falsy fetch results omitted
- explicit invalidate()
- failed fetch leaves the cache clearable, next call retries

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
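Roughly how the three pieces in that commit message fit together, again as an illustrative sketch reusing the class from above; the assumed authorManager declaration and method bodies are placeholders, not the merged code:

// Assumed dependency, shaped like the authorManager.getAuthor call described in the PR.
declare const authorManager: {getAuthor: (id: string) => Promise<{name: string; colorId: string} | null>};

class PadSketch {
  private historicalAuthorDataCache: HistoricalAuthorDataCacheSketch | null = null;

  getAllAuthors(): string[] { return []; }  // placeholder for the existing pad API

  // Lazy init: pad construction never touches authorManager; the cache exists only
  // once a CLIENT_READY actually needs the historical author map.
  async getHistoricalAuthorData(): Promise<{[id: string]: {name: string; colorId: string}}> {
    if (this.historicalAuthorDataCache == null) {
      this.historicalAuthorDataCache = new HistoricalAuthorDataCacheSketch(
          () => this.getAllAuthors(),
          (id) => authorManager.getAuthor(id));
    }
    return await this.historicalAuthorDataCache.get();
  }
}

// In handleClientReady, the per-joiner Promise.all loop then becomes a single line:
//   const historicalAuthorData = await pad.getHistoricalAuthorData();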
@qodo-code-review

Qodo reviews are paused for this user.


@qodo-free-for-open-source-projects

Review Summary by Qodo

Per-pad cache for historical author data to fix join cliff

✨ Enhancement


Walkthroughs

Description
• Adds per-pad cache for historical author data to eliminate O(N²) lookups during simultaneous
  client joins
• Collapses 50 simultaneous joiners' author fetches into single shared computation
• Reduces connect-handshake cliff from ~250 authors by caching with 5-second TTL
• Replaces inline Promise.all loop in handleClientReady with cached getHistoricalAuthorData call
Diagram
flowchart LR
  A["50 Simultaneous<br/>CLIENT_READY"] -->|Before: 50×N<br/>ueberdb lookups| B["Event Loop<br/>Contention"]
  B --> C["10s Timeout<br/>Exceeded"]
  A -->|After: 1 shared<br/>computation| D["HistoricalAuthorDataCache<br/>TTL 5s"]
  D --> E["Shared Author<br/>Data Map"]
  E --> F["Fast CLIENT_VARS<br/>Response"]


File Changes

1. src/node/db/HistoricalAuthorDataCache.ts ✨ Enhancement +62/-0

New cache module for coalesced author data fetches

• New pure helper module implementing per-pad author data cache with configurable TTL (default 5s)
• Coalesces simultaneous get() calls into single fetch operation via shared promise
• Supports explicit invalidate() for future author-add event hooks
• Clears cache on fetch failure to enable retry on next call


2. src/node/db/Pad.ts ✨ Enhancement +22/-0

Add lazy-initialized per-pad author data cache

• Adds private historicalAuthorDataCache field lazily initialized on first call
• Implements getHistoricalAuthorData() method that returns cached {authorId -> {name, colorId}} map
• Cache uses pad's getAllAuthors() and authorManager.getAuthor as data sources
• Lazy initialization avoids touching authorManager during pad construction


3. src/node/handler/PadMessageHandler.ts ✨ Enhancement +8/-16

Replace inline author fetch with cached getHistoricalAuthorData call

• Replaces inline Promise.all(authors.map(authorManager.getAuthor)) with single await
 pad.getHistoricalAuthorData() call
• Removes manual author filtering logic (now handled by cache)
• Simplifies handleClientReady by delegating author data fetching to Pad layer
• Adds explanatory comment about #7756 connect-handshake cliff and cache rationale


4. src/tests/backend-new/specs/pad-historical-author-data.test.ts 🧪 Tests +86/-0

Comprehensive test suite for author data cache

• New test suite with 6 test cases covering cache behavior and correctness
• Verifies correct {name, colorId} structure per author
• Confirms 50 simultaneous get() calls produce only 1 fetch per author (main performance win)
• Tests TTL expiration, falsy result filtering, explicit invalidation, and failure recovery
• Uses pure functions (no DB/Pad dependency) to exercise production code in isolation



@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 16, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (1) 📎 Requirement gaps (0)



Action required

1. Cache recompute race 🐞 Bug ☼ Reliability
Description
HistoricalAuthorDataCache.get() can start a second compute() while a previous compute promise is
still in flight if the first run exceeds ttlMs, causing duplicate lookups and allowing an older
promise to overwrite a newer cached value.
Code

src/node/db/HistoricalAuthorDataCache.ts[R31-42]

+  async get(): Promise<{[id: string]: AuthorRecord}> {
+    const now = this.now();
+    const cached = this.cached;
+    if (cached && now - cached.builtAt < this.ttlMs) {
+      return cached.promise ?? cached.data;
+    }
+    const promise = this.compute();
+    this.cached = {data: {}, promise, builtAt: now};
+    try {
+      const data = await promise;
+      this.cached = {data, builtAt: now};
+      return data;
Evidence
The TTL check is applied to builtAt even when a compute is in progress, and builtAt is recorded
before awaiting the promise. If compute duration exceeds TTL, a later call will skip the cache-hit
return and start a new compute, and both awaiters will unconditionally assign to this.cached,
enabling last-writer-wins overwrites.

src/node/db/HistoricalAuthorDataCache.ts[24-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`HistoricalAuthorDataCache.get()` uses `builtAt` (set before `compute()` resolves) for the TTL check, and it does not prioritize an in-flight `cached.promise`. If `compute()` takes longer than `ttlMs`, a subsequent `get()` will start a new `compute()`, and whichever promise resolves last will overwrite `this.cached`, potentially reverting the cache to older data.

### Issue Context
This cache is used on the `CLIENT_READY` join path, where slow author fetches under load are plausible; the race undermines the PR’s primary goal (coalescing work under concurrency).

### Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[31-46]

### Suggested fix
- If `this.cached?.promise` exists, return it **regardless of TTL** (TTL should apply to completed `data`, not to in-flight work).
- Set `builtAt` when the compute finishes (e.g., `builtAt: this.now()` after `await promise`).
- Ensure only the “current” promise can commit results (compare stored promise identity or use a monotonically increasing generation counter):
 - Create `const promise = this.compute(); this.cached = {data: {}, promise, builtAt: now};`
 - After `await promise`, only assign `this.cached = {data, builtAt}` if `this.cached?.promise === promise`.
- Add a unit test that simulates a slow fetch exceeding TTL and asserts only one compute is used and the cache is not overwritten by an older completion.




Remediation recommended

2. Unexplained conditional type in cached 📘 Rule violation ⚙ Maintainability
Description
HistoricalAuthorDataCache introduces a non-obvious conditional type (`AuthorRecord extends never ?
never : ...`) without any explanation. This reduces maintainability because future readers will not
understand why the type trick is needed or whether it is safe to remove.
Code

src/node/db/HistoricalAuthorDataCache.ts[R18-22]

+  private cached: AuthorRecord extends never ? never : {
+    data: {[id: string]: AuthorRecord};
+    promise?: Promise<{[id: string]: AuthorRecord}>;
+    builtAt: number;
+  } | null = null;
Evidence
The checklist requires comments for complex or non-obvious code. The added cached field uses a
conditional type that appears to be a type-system workaround, but it is not documented in the
surrounding comments, making the intent unclear.

src/node/db/HistoricalAuthorDataCache.ts[18-22]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A non-obvious TypeScript conditional type is used for `HistoricalAuthorDataCache.cached` (`AuthorRecord extends never ? never : ...`) with no explanation. This makes the code harder to maintain and increases the risk of accidental breakage during refactors.

## Issue Context
The PR adds `src/node/db/HistoricalAuthorDataCache.ts` as a helper module. Most of the logic is well-commented, but the `cached` field type includes an unexplained type-level trick that is not self-evident.

## Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[18-22]



3. Shared cache object 🐞 Bug ☼ Reliability
Description
Cache hits return the same historicalAuthorData object reference, and handleClientReady embeds it
into clientVars passed to the clientVars hook; any mutation by one join can affect other joiners
within the TTL window.
Code

src/node/handler/PadMessageHandler.ts[R1090-1100]

+  // Per-pad cached author lookup (#7756 connect-handshake cliff). At 200+
+  // authors a fresh burst of 50 simultaneous joiners would otherwise do
+  // 50 * 200 = 10000 ueberdb cache lookups inside the join hot path,
+  // competing for the same event loop as the existing authors' USER_CHANGES
+  // traffic. The Pad-level cache collapses that to a single computation
+  // shared across the simultaneous joins (5-second TTL).
+  const historicalAuthorData: MapArrayType<{
    name: string;
    colorId: string;
-  }> = {};
-  await Promise.all(authors.map(async (authorId: string) => {
-    const author = await authorManager.getAuthor(authorId);
-    if (!author) {
-      messageLogger.error(`There is no author for authorId: ${authorId}. ` +
-          'This is possibly related to https://github.com/ether/etherpad-lite/issues/2802');
-    } else {
-      // Filter author attribs (e.g. don't send author's pads to all clients)
-      historicalAuthorData[authorId] = {name: author.name, colorId: author.colorId};
-    }
-  }));
+  }> = await pad.getHistoricalAuthorData();
Evidence
The cache returns cached.data directly on cache hits, and the handler uses the returned object as
part of clientVars and exposes it to hooks.aCallAll('clientVars', ...), making it reachable by
code that can mutate it.

src/node/db/HistoricalAuthorDataCache.ts[31-36]
src/node/handler/PadMessageHandler.ts[1090-1100]
src/node/handler/PadMessageHandler.ts[1245-1326]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`HistoricalAuthorDataCache.get()` returns the cached object by reference on cache hits. `handleClientReady` then places that object on `clientVars` and passes it to the `clientVars` hook, which can mutate it. Because the same object is shared, mutations can leak across join requests during the TTL.

### Issue Context
Previously, `historicalAuthorData` was rebuilt per `CLIENT_READY`, so per-request mutations were isolated. With caching, the object is shared across multiple joins.

### Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[31-36]
- src/node/handler/PadMessageHandler.ts[1090-1100]
- src/node/handler/PadMessageHandler.ts[1245-1326]

### Suggested fix
Pick one boundary to enforce immutability/isolation:
- In `handleClientReady`, clone the returned map before embedding in `clientVars` (recommended to keep cache fast internally):
 - Shallow clone keys + value objects, e.g. `Object.fromEntries(Object.entries(data).map(([k,v]) => [k, {...v}]))`.
- Alternatively, have `HistoricalAuthorDataCache.get()` return a defensive copy on every call (higher overhead for repeated callers).
- Consider `Object.freeze()` only if you’re confident plugins won’t mutate `clientVars` (freezing could break existing plugins).
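The first option above, as a one-helper sketch (the helper name is made up here, not taken from the PR):

// Shallow clone: fresh top-level map and fresh per-author records, so a mutation by
// one joiner (e.g. via the clientVars hook) cannot leak into another joiner's view.
const cloneAuthorData = (data: {[id: string]: {name: string; colorId: string}}) =>
    Object.fromEntries(Object.entries(data).map(([id, rec]) => [id, {...rec}]));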




Advisory comments

4. Missing author logs 🐞 Bug ◔ Observability
Description
HistoricalAuthorDataCache.compute() silently drops authorIds whose global author record is missing,
removing the explicit error log previously emitted during CLIENT_READY and making this
data-integrity issue harder to detect.
Code

src/node/db/HistoricalAuthorDataCache.ts[R53-59]

+  private async compute(): Promise<{[id: string]: AuthorRecord}> {
+    const ids = this.listAuthorIds();
+    const out: {[id: string]: AuthorRecord} = {};
+    await Promise.all(ids.map(async (id) => {
+      const a = await this.getAuthor(id);
+      if (a) out[id] = {name: a.name, colorId: a.colorId};
+    }));
Evidence
The cache’s compute path only writes entries when a is truthy and has no logging, and the
underlying authorManager.getAuthor is a plain DB read without diagnostics, so missing historical
authors are now silently ignored in this code path.

src/node/db/HistoricalAuthorDataCache.ts[53-59]
src/node/db/AuthorManager.ts[218-218]
src/node/handler/PadMessageHandler.ts[1087-1100]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When `getAuthor(id)` returns null/undefined for an author ID present in the pad’s author pool, the cache currently omits that author without logging. Previously, `handleClientReady` logged an error in this situation, which helped surface author DB/pool inconsistencies.

### Issue Context
The cache helper is intentionally “pure” and doesn’t depend on logging infrastructure, but the call site can still emit diagnostics.

### Fix Focus Areas
- src/node/db/HistoricalAuthorDataCache.ts[53-59]
- src/node/handler/PadMessageHandler.ts[1090-1100]
- src/node/db/AuthorManager.ts[218-218]

### Suggested fix
- Add an optional callback to `HistoricalAuthorDataCache` (e.g., `onMissingAuthorId?: (id) => void`) and call it when `a` is falsy.
 - Provide the callback from `Pad.getHistoricalAuthorData()` or `handleClientReady` using the existing `messageLogger`.
- If you want to keep the helper strictly pure, do a lightweight integrity check at the call site (e.g., if `pad.getAllAuthors().length !== Object.keys(data).length`, compute the missing IDs and log them).
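A sketch of the callback variant, with the callback name and wiring assumed rather than taken from the merged code:

type OnMissingAuthor = (authorId: string) => void;

// The cache stays logging-free; the call site (Pad or handleClientReady) decides what to do
// when an author id has no record, e.g. messageLogger.error('There is no author for authorId: ...').
const computeAuthorMap = async (
    ids: string[],
    getAuthor: (id: string) => Promise<{name: string; colorId: string} | null>,
    onMissingAuthor?: OnMissingAuthor,
): Promise<{[id: string]: {name: string; colorId: string}}> => {
  const out: {[id: string]: {name: string; colorId: string}} = {};
  await Promise.all(ids.map(async (id) => {
    const a = await getAuthor(id);
    if (a) {
      out[id] = {name: a.name, colorId: a.colorId};
    } else {
      onMissingAuthor?.(id);
    }
  }));
  return out;
};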





@JohnMcLear JohnMcLear changed the title from "feat(pad): per-pad cache for historicalAuthorData (move #7756 join cliff)" to "feat(pad): cache historicalAuthorData (remove join-time thundering herd)" May 16, 2026
@JohnMcLear JohnMcLear marked this pull request as draft May 16, 2026 04:52
Four issues raised on the first push:

1. **Cache recompute race.** If compute() ran past ttlMs, a second
   get() would start a duplicate compute and the older resolution
   could clobber the newer cached value. Fixed by piggybacking on
   the in-flight promise regardless of TTL — never start a second
   compute while one is running.

2. **Unexplained conditional type.** `AuthorRecord extends never ?
   never : ...` was a leftover that confused future readers
   without earning anything. Replaced with a plain `CacheState |
   null` typed field with documented members.

3. **Shared cache object.** Cache hits returned the same object
   reference; handleClientReady embeds it in clientVars exposed to
   the clientVars hook, where a mutation by one join could bleed
   into the next. Now defensively shallow-clones on every get().

4. **Missing author logs.** The previous inline Promise.all loop
   in handleClientReady emitted `messageLogger.error('There is no
   author for authorId...')` per missing id; the refactor dropped
   that. Cache now takes an optional onMissingAuthor callback;
   Pad.ts wires it to the same log4js logger so the error still
   surfaces.

Plus: commit-state-promise-identity guard in refresh() — only
commit the resolved data if the in-flight promise is still the
one the state references (covers invalidate() racing a compute).

8/8 tests green: original 6 + new fresh-object guarantee + slow-
compute-past-TTL guarantee.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnMcLear added a commit that referenced this pull request May 16, 2026
Captures everything learned since the first draft:

- The "250-author cliff" was a measurement artefact from per-IP
  commitRateLimiting + colocated harness. Fixed via the
  etherpad-load-test#105 workflow patch. Real ceiling is ~350-400
  authors on a 4-vCPU GitHub runner.

- apply_mean ballooning at the cliff isn't slow code — it's OS
  preemption (7+ cores of work on 4 vCPU). Application-level JS
  rearrangement can't reach it.

- Two changes hold up under the dive: fan-out serialization
  + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and
  historicalAuthorData cache (#7769, neutral on dive but real
  production thundering-herd fix at join time).

- Four directions didn't pan out: WebSocket-only transport, heap
  bump, message-level batching alone (#7766 closed), and
  rebase-loop prefetch (#7770 closed). Each has a one-line cause
  documented for the record.

- Engine.io transport-level packing (#7767) is the meatiest
  untouched lever — sending multiple packets per WebSocket frame
  the way polling already does via encodePayload.

Qodo-flagged corrections incorporated:
1. The new instruments are Histogram + Counter + Gauge, not
   "three counters" — labelled correctly.
2. The lever-3 line reference now points at updatePadClients
   (lines 985-999) where NEW_CHANGES actually emits, not the
   wrong line 627 (handleSaveRevisionMessage).
3. Lever 3's results are written up against measured data, not
   "deferred".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear
Member Author

Addressing Qodo's four items from the initial review (commit 478c663):

  1. Cache recompute race → Fixed. get() now piggybacks on the in-flight promise regardless of TTL (`if (s?.promise) return cloneData(await s.promise)`), and the resolve handler only commits state if it's still the active promise (covers `invalidate()` racing a compute). A new test ("a slow compute that runs past TTL still resolves callers; no duplicate fetch starts in flight") pins this; a combined sketch of the fixed get() follows this comment.
  2. Unexplained conditional type → Fixed. Replaced `AuthorRecord extends never ? never : ...` with a documented `CacheState | null` type, fields commented inline.
  3. Shared cache object → Fixed. All get() returns go through a cloneData() helper that produces a fresh top-level object and fresh per-author records, so mutations by one joiner can't bleed into the next. A new test ("returns a fresh object on every get()") pins this.
  4. Missing author logs → Fixed. The cache now accepts an optional onMissingAuthor(id) callback, invoked once per id the fetcher returns falsy for. Pad.getHistoricalAuthorData wires it to a log4js logger emitting the same 'There is no author for authorId: ... related to #2802' message the inline loop used to. A new test ("calls onMissingAuthor exactly once per id the fetcher returns falsy for") pins it.

8/8 tests in pad-historical-author-data.test.ts green. All 31 CI checks pass. CI: https://github.com/ether/etherpad/pull/7769/checks
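Put together, the fixed get() could look roughly like this. It is a sketch of the behaviour described above (piggyback regardless of TTL, promise-identity guard, defensive clone), not a copy of commit 478c663; failure handling here simply drops the pending state so the next call retries.

type AuthorRecord = {name: string; colorId: string};
type AuthorMap = {[id: string]: AuthorRecord};

// At most one completed result and/or one in-flight compute at a time.
type CacheState = {data: AuthorMap; builtAt: number; promise?: Promise<AuthorMap>};

// Fresh top-level object and fresh per-author records on every return.
const cloneData = (data: AuthorMap): AuthorMap =>
    Object.fromEntries(Object.entries(data).map(([id, rec]) => [id, {...rec}]));

class CoalescingAuthorCacheSketch {
  private state: CacheState | null = null;

  constructor(private readonly compute: () => Promise<AuthorMap>,
              private readonly ttlMs: number = 5000,
              private readonly now: () => number = Date.now) {}

  async get(): Promise<AuthorMap> {
    const s = this.state;
    // 1. Piggyback on any in-flight compute regardless of TTL; never start a second one.
    if (s != null && s.promise != null) return cloneData(await s.promise);
    // 2. Fresh completed data: hand out a defensive copy.
    if (s != null && this.now() - s.builtAt < this.ttlMs) return cloneData(s.data);
    // 3. Start a new compute and remember which promise is the current one.
    const promise = this.compute();
    const pending: CacheState = {data: {}, builtAt: this.now(), promise};
    this.state = pending;
    let data: AuthorMap;
    try {
      data = await promise;
    } catch (err) {
      if (this.state === pending) this.state = null;  // failed fetch: next call retries
      throw err;
    }
    // 4. Identity guard: only commit if invalidate() (or a newer compute) hasn't replaced us.
    if (this.state === pending) this.state = {data, builtAt: this.now()};
    return cloneData(data);
  }

  invalidate(): void { this.state = null; }
}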

@JohnMcLear
Member Author

Closing. Triple-run N=3 scoring against this PR shows it's net-negative above ~300 authors vs develop.

p95 envelope comparison, min/med/max in ms across 3 identical sweeps each:

Step   develop               this PR
200    30 / 37 / 51          29 / 38 / 65
300    38 / 45 / 71          39 / 93 / 240
350    39 / 39 / 122         301 / 488 / 633
400    1758 / 2275 / 2463    3053 / 3203 / 3327

Probable cause: the defensive shallow-clone-on-every-get() added in the Qodo-feedback fix walks O(N) author entries per joining client. With burst-of-50 new joiners × N existing authors × clone allocations at each step ramp, plus GC pressure, the cache becomes more expensive than the inline Promise.all it was meant to collapse.
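Rough arithmetic under the assumptions above (burst of 50 joiners, figures illustrative): at 350 existing authors, every cache hit still clones 350 per-author records plus one top-level map, so

  50 joiners × (350 record clones + 1 map) ≈ 17 550 short-lived allocations per burst,

repeated every time the 5-second TTL lapses while joins keep arriving. The clone cost scales with joiners × authors, the same product the cache was supposed to pay only once per burst.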

The hypothesis this PR addressed — the 250-author cliff is a join-thundering-herd problem — was also falsified: the 250 cliff turned out to be a per-IP commitRateLimiting artefact from harness colocation (fixed in load-test#105). The cache PR was built on that wrong diagnosis.

What might still be salvageable for production (where many real users from many IPs DO join simultaneously, not measurable in the dive) is the join coalescing idea itself — but without the defensive clone, and without the TTL refresh that adds another N-object-allocation footgun. If anyone wants to revisit, the HistoricalAuthorDataCache module remains a useful template; just rework the public surface to return the data structure the cache owns (with documented "don't mutate" contract) instead of cloning per call.

The 8 unit tests in pad-historical-author-data.test.ts still pass — the implementation is correct, it just doesn't deliver the perf benefit it was scoped to deliver.

@JohnMcLear JohnMcLear closed this May 16, 2026
JohnMcLear added a commit that referenced this pull request May 16, 2026
N=3 scoring of feat/cache-historical-author-data shows it's
net-negative above 300 authors (step 350 p95 envelope
301/488/633ms vs develop baseline 39/39/122ms). Two compounding
issues:
- The motivating hypothesis (250-cliff is a join thundering herd)
  was falsified — that cliff was the per-IP rate-limit artefact.
- The defensive shallow-clone-on-every-get() added in the Qodo
  fix walks O(N) author entries per join, costing more than the
  inline Promise.all it replaced.

Updated recommendations: lever 3 (#7768) is now the only PR worth
merging. lever 6 (#7769) added to the do-not-merge list with
honest data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
