feat(scaling): engine.io socket flush deferral — modest tail-latency reduction at mid-load#7774
feat(scaling): engine.io socket flush deferral — modest tail-latency reduction at mid-load#7774JohnMcLear wants to merge 2 commits into
Conversation
Follow-up to the closed engine.io WS packing prototype (#7772). That patch only modified transport.send(packets[]) and never fired because engine.io's Socket.sendPacket calls flush() synchronously after each push to writeBuffer — and flush drains immediately when transport.writable is true (microseconds on WebSocket). The writeBuffer almost never contained more than one packet. This patch flips that. Socket.prototype.sendPacket is re-implemented to push to writeBuffer and then schedule a single coalesced flush via queueMicrotask. Multiple sendPacket calls in the same task all accumulate; the queued microtask drains the whole batch. The transport.send([packets]) call then sees N > 1 packets in steady state, which is where lever 8 / future engine.io transport packing work has the opportunity to coalesce to one WS frame. Microtask deferral adds zero meaningful wall-clock latency: microtasks drain before the next macrotask, so anything waiting on the next I/O callback / timer still sees the flush completed first. Wire bytes are unchanged. Gated by settings.engineFlushDefer. Default false. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Review Summary by QodoEngine.io socket flush deferral for mid-load tail-latency reduction
WalkthroughsDescription• Defer engine.io socket flush to microtask for packet coalescing • Reduces tail latency at mid-load (300-350 concurrent authors) • Batches multiple sendPacket calls into single transport.send • Gated by settings.engineFlushDefer (default false) Diagramflowchart LR
A["Multiple sendPacket calls<br/>in same task"] -->|accumulate| B["writeBuffer<br/>with N packets"]
B -->|microtask scheduled| C["Single coalesced flush"]
C -->|transport.send| D["Fewer WebSocket frames<br/>per batch"]
D -->|result| E["Reduced tail latency<br/>at mid-load"]
File Changes1. src/node/utils/EngineFlushDeferral.ts
|
Code Review by Qodo
1.
|
Two findings from rigorous N=3 scoring: 1. Lever 3 (#7768) is NOT a perf win. When you compare like-for- like matrix entries (develop-baseline vs PR-baseline), the per-socket serialization is slightly net-negative across the curve. My earlier "70% drop" was a single-run outlier; the subsequent "tighter envelope" was a cross-matrix-entry comparison confounded by noise. The serialization is still a real correctness fix (race on concurrent fan-outs + lost revisions on emit error) so the PR stays open, but the recommendation is now correctness-only. 2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to the closed lever 8 that actually patches Socket.sendPacket instead of just transport.send. queueMicrotask-coalesced flush gives the transport multi-packet batches to work with at last. N=3 shows tighter tail at step 300-350 (122 → 110 max at 350, 71 → 58 max at 300). Not a cliff-mover. The only PR in this program with N=3-confirmed perf benefit. Final disposition: - Merge: #7774 (modest perf), #7768 (correctness), #7762 (already merged, instruments). - The cliff at 350-400 authors is hardware-bound on a 4-vCPU runner, not code-bound. Production with more cores per host scales proportionally with no code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two issues from the initial review: 1. Install guard blocked retries. Setting `installed = true` BEFORE requiring + validating engine.io meant a transient require error permanently disabled the patch for the rest of the process. Now the flag is set only after both checks pass; on failure, the warning is logged and a later boot path can retry. 2. engineFlushDefer was undocumented in settings.json.template. Added with the same prose as Settings.ts, including the "wire bytes are unchanged" / "no meaningful wall-clock latency" notes so operators see the safety case. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Addressing Qodo's 2 items (commit
All 33 CI checks pass. The Windows-without-plugins flake from earlier was a post-job cache-save step, unrelated to this PR. |
Add a "Roadmap for future effort" section ahead of Reproducing, ranking the next concrete options by impact-per-time-spent. Tier 1 (mechanical / <1 day each): - merge ready perf PRs (#7775+#7776+#7774) - implement #7780 room-broadcast fan-out - additional post-fix profile pass Tier 2 (medium, real cliff moves): - selective fan-out / viewport-based broadcast (~2 weeks; cliff ~500 → 1000-1500) - per-pad worker isolation PoC (~1-2 weeks PoC, 1-2 months prod) Tier 3 (large bets): - sticky-session cluster mode (~2-4 weeks PoC) - CRDT migration (months; anti-recommended) Tier 4 (operational): - production telemetry hookup (~3-5 days) - nightly dive in CI (~1 day) Records the recommended sequence (Tier 1.2 → Tier 2.4) so the next person picking this up doesn't need to re-derive it. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
Follow-up to the closed engine.io WS packing prototype (#7772). That patch only modified `transport.send(packets[])` and almost never fired because engine.io's `Socket.sendPacket()` calls `flush()` synchronously after each `writeBuffer.push()`, and `flush()` drains immediately when `transport.writable === true` — microseconds on WebSocket. The writeBuffer almost never contained more than one packet.
This patch flips that. `Socket.prototype.sendPacket` is re-implemented to push to writeBuffer and then schedule a single coalesced flush via `queueMicrotask`. Multiple `sendPacket()` calls in the same task all accumulate; the queued microtask drains the whole batch. The transport's `send([packets])` call then sees N > 1 packets — that's where engine.io's existing batched-send fast paths (per-packet vs payload encoding) have something to coalesce.
Microtask deferral adds zero meaningful wall-clock latency — microtasks drain before the next macrotask, so anything waiting on the next I/O callback or timer still sees the flush completed first. Wire bytes are unchanged.
Measured impact (N=3, develop baseline vs setting-on)
Not a cliff-mover. Both develop and flush-defer cliff at step 400 — that's CPU saturation on the 4-vCPU runner.
At step 300-350 the tail latency consistently shrinks. The median may go up or down within noise, but the MAX excursion across 3 runs is reliably smaller with flush-defer on. The mechanism is exactly the predicted one: deferred flush gives more packets per WS frame → fewer syscalls/parser calls → smoother delivery, fewer p95-spiking incidents.
Caveats
Files
No tests yet; the patch is end-to-end measured via the dive workflow. A unit test that constructs an in-memory engine.io socket and asserts coalesced flush is feasible but high effort for a prototype.
Recommendation
Merge as a modest mid-load improvement and a useful primitive for future engine.io-side investigations. Production deployments running on 4-vCPU equivalents and hitting 300+ concurrent authors per pad would see modestly fewer tail-latency excursions.
🤖 Generated with Claude Code