**Authors:** Will Manning
**Status:** Proposal
**Date:** 2026-04-03

## Summary

**SORF approximation caveat.** Theorems 1 and 2 in [1] are proved for true
random orthogonal matrices (QR of Gaussian), not SORF. The 3-round SORF
construction `HD₃·HD₂·HD₁` [5] is a structured approximation whose quality
depends on dimension: each round of the Walsh-Hadamard transform mixes all B
coordinates through log₂(B) butterfly stages, so 3 rounds provides 3 × log₂(B)
total butterfly stages (18 at B=64, 21 at 128, 24 at 256). This is a rough
heuristic for mixing quality, not a formal convergence metric — [5] does not
analyze convergence rate as a function of rounds × dimension. Empirical
validation is needed for each candidate B — see Experimental plan.

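
To make the round and butterfly-stage accounting concrete, here is a minimal
numpy sketch of one SORF rotation, assuming the structure described above
(three rounds, each a ±1 Rademacher diagonal followed by a normalized
Walsh-Hadamard transform); `fwht` and `sorf_rotate` are illustrative names,
not the implementation's API. Three rounds at B=128 execute 21 butterfly
stages, and the rotation is exactly orthogonal regardless of how well it
approximates Haar measure.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform: log2(B) butterfly stages,
    each combining pairs of coordinates at stride h."""
    y = x.copy()
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(len(y))   # 1/√B scaling keeps each transform orthogonal

def sorf_rotate(x, signs):
    """3-round SORF HD₃·HD₂·HD₁: each round flips signs (D), then mixes all
    coordinates through log₂(B) butterfly stages (H)."""
    y = x
    for d in signs:
        y = fwht(d * y)
    return y

rng = np.random.default_rng(0)
B = 128
signs = rng.choice([-1.0, 1.0], size=(3, B))   # Rademacher diagonals D₁..D₃
x = rng.normal(size=B)
x /= np.linalg.norm(x)
print(np.linalg.norm(sorf_rotate(x, signs)))   # ≈ 1.0: SORF is exactly orthogonal
```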
**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a
B × B random orthogonal matrix (QR of Gaussian) instead. Storage at B=128:

**Per-block Gaussian QJL** uses, for each block k, an independent Gaussian
projection Sₖ ∈ ℝ^(B×B) with i.i.d. N(0,1) entries.
Gaussian matrices work at any dimension, so no padding is needed for the QJL
stage. Each block's QJL is provably unbiased by Lemma 4 in [1], and the sum over
blocks is also unbiased: `E[<y, correction>] = <y, r>`. However, the per-block
variance is **d/B times higher** than full-dimension QJL.

Lemma 4 gives the variance for QJL of a unit vector. Since QJL is applied to the
residual rₖ (with norm γₖ = ‖rₖ‖, typically ≪ 1), the actual variance scales
by γₖ²:

```
Per-block (B-dim):  Var[<y, correction>] ≤ (π / (2B)) × ‖r‖² × ‖y‖²
Full-dim (d-dim):   Var[<y, correction>] ≤ (π / (2d)) × ‖r‖² × ‖y‖²
```

The ‖r‖² factor cancels when comparing strategies (same MSE quality → same
residual norms), so the **relative** variance ratio is d/B regardless. At
d=768, B=128: per-block has 6× more variance than full-dim. The absolute
variance is small — at b=4 MSE, ‖r‖² ≈ 0.01, so the per-block variance is
≈ 0.01 × (π/(2×128)) × ‖y‖² ≈ 1.2×10⁻⁴ × ‖y‖².

Storage: B×B×4 bytes per block (384 KB for k=6 at B=128). Encode/decode cost:
O(B²) matmul per block.

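
The unbiasedness claim and the per-block variance bound can be checked
numerically. Below is a hedged Monte Carlo sketch, assuming the estimator takes
the same form as the decode formula used later in this proposal,
`(√(π/2)/B) × γ × Sᵀ × signs`; the test vectors are arbitrary. The empirical
variance should land at or just below the bound, and swapping the fresh
Gaussian `S` for a fixed per-block SORF is exactly the experiment the per-block
SORF strategy below needs.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 128
y = rng.normal(size=B); y /= np.linalg.norm(y)        # unit query vector
r = rng.normal(size=B); r *= 0.1 / np.linalg.norm(r)  # residual with ‖r‖ = 0.1
gamma, r_unit = np.linalg.norm(r), r / np.linalg.norm(r)

estimates = []
for _ in range(5000):
    S = rng.normal(size=(B, B))                 # fresh i.i.d. N(0,1) projection
    signs = np.sign(S @ r_unit)                 # 1 sign bit per row stored at encode
    correction = np.sqrt(np.pi / 2) / B * gamma * (S.T @ signs)
    estimates.append(y @ correction)
estimates = np.array(estimates)

print("true <y, r>       :", y @ r)
print("mean estimate     :", estimates.mean())  # ≈ <y, r>, i.e. unbiased
print("empirical variance:", estimates.var())
print("bound π/(2B)·‖r‖² :", np.pi / (2 * B) * gamma**2)
```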
**Per-block SORF QJL** substitutes a B-dim SORF (`HD₃·HD₂·HD₁` [5]) for the
Gaussian matrix. This is NOT theoretically justified — Lemma 4 in [1] is proved
specifically for Gaussian S. For Haar-distributed random orthogonal S,
unbiasedness follows from rotational invariance (a separate argument), but the
variance constant may differ from π/(2B). SORF is neither Gaussian nor exactly
Haar-distributed; it only approximates a Haar-random rotation. However, the
[current implementation][current-impl] already uses SORF for QJL at d=1024 with
acceptable results (~11% mean relative error for power-of-2 dims),
demonstrating practical viability. The tradeoff vs Gaussian is compelling:
O(B log B) speed (~10× faster than Gaussian at B=128) and O(B) storage (over
1000× less). Quality at B=128 needs validation — with only 21 butterfly stages,
the approximation to Haar measure is weaker than at d=1024 (30 stages).

**Full-dimension padded SORF QJL** applies a single SORF at the padded
dimension (e.g., 1024 for d=768) to the full residual vector `r = x - x̂`,
matching the [current implementation][current-impl]. The higher dimension gives
better SORF-to-Haar convergence (30 butterfly stages at d=1024 vs 21 at B=128)
and full-dimension variance `~(π/(2·padded_d))·‖r‖²·‖y‖²`, but wastes
`(padded_d - d)/padded_d` of the sign bits on zero-padded coordinates (25% at
768→1024). This approach requires computing the full residual from all blocks
before applying QJL, adding a full-dimension decode step to the encode path.

**QJL strategy options** (to be experimentally compared):

| Strategy             | Theoretical       | Variance (×‖r‖²‖y‖²) | Padding waste   | Storage      | Speed            |
| -------------------- | ----------------- | -------------------- | --------------- | ------------ | ---------------- |
| Per-block Gaussian   | Correct (Lemma 4) | π/(2B)               | None            | k×B²×4 bytes | O(B²)/block      |
| Per-block SORF       | Approximate       | ~π/(2B)              | None            | k×3×B bits   | O(B log B)/block |
| Full-dim padded SORF | Approximate       | ~π/(2·pad_d)         | (pad_d-d)/pad_d | 3×pad_d bits | O(d log d) total |
| MSE-only             | N/A               | N/A                  | N/A             | None         | 0                |

Variance entries show the coefficient of `‖r‖²×‖y‖²`, where `‖r‖²` is the
residual MSE (≈ 0.01 at b=4). The ‖r‖² factor is the same across strategies
(same MSE quality), so relative comparisons reduce to the coefficient alone:
per-block is d/B times higher than full-dim (6× at d=768, B=128).

Note: the full-dim padded SORF variance bound formally uses `pad_d` (e.g.,
1024), not `d` (768). The `pad_d - d` sign bits spent on zero-padded
coordinates carry no information about the residual, so the effective variance
reduction may be closer to `π/(2d)`. The experiment should measure actual
variance to resolve this.

#### Norm architecture

The TurboQuant array itself operates only on unit-norm B-dim sub-vectors. Norms
are externalized into a separate child array, following the pattern explored in
the NormVector encoding prototype (PR #7251, closed — the concept will need to
be implemented as part of this work or adapted from a different source).

The per-block norms are stored as a single `FixedSizeListArray<F>` with
`list_size = num_blocks`, where `F` matches or widens the input element type:
The cascading compressor treats norms like any other float column and is free
to re-encode them with ALP, Pco, FastLanes, or other float compression schemes.

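
For illustration, a pyarrow sketch of the intended norms layout (the real
encoder is the Rust implementation; nothing here is an existing Lance API).
Downstream, this child array is just another float column, which is what lets
the cascading compressor re-encode it as described above.

```python
import numpy as np
import pyarrow as pa

d, B = 768, 128
num_blocks = -(-d // B)                      # k = ⌈d/B⌉ = 6
vectors = np.random.rand(1_000, d).astype(np.float32)

# One row of k per-block L2 norms per vector.
block_norms = np.linalg.norm(vectors.reshape(-1, num_blocks, B), axis=2)

# Child array: FixedSizeListArray<f32> with list_size = num_blocks.
norms_child = pa.FixedSizeListArray.from_arrays(
    pa.array(block_norms.ravel(), type=pa.float32()), num_blocks
)
print(norms_child.type)                      # fixed_size_list<item: float>[6]
```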
Note: centroids and quantization always operate in f32 (the
[current implementation][current-impl] converts all input to f32 before
quantization). For f64 input, the decode path produces f32 reconstructions
scaled by f64 norms — a mixed-precision multiply. This preserves the precision
of the norms (which capture the bulk of the vector's magnitude) while accepting
f32 precision for the unit-direction reconstruction.

#### Quantized-domain operations with per-block norms

All quantized-domain operations require reading the block norms for both
vectors:

- **Inner product**: per block, compute the unit-norm quantized dot product
  from the shared codebook (sum of B centroid products,
  `Σⱼ centroids[code_aₖ[j]]·centroids[code_bₖ[j]]`), then weight by both
  vectors' block norms: `<a, b> ≈ Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ`.
- **Cosine similarity**: `cos(a, b) ≈ (Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ) /
  (√(Σ_k ‖aₖ‖²) · √(Σ_k ‖bₖ‖²))`. Requires global norms reconstructed from
  block norms.
- **L2 distance** (squared Euclidean): `‖a-b‖² = ‖a‖² + ‖b‖² - 2<a,b>
  = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2 × Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ`. Reuses the
  per-block dot product and per-block norms; this is the primary ANN metric.

The norms tensor should be read once per scan query and cached.

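
A numpy sketch of the same math, ignoring the optional QJL correction term;
`codes_*`, `norms_*`, and `centroids` are illustrative names for the per-block
MSE codes, the block norms, and the shared scalar codebook. Only the block
norms and the per-block unit dot products are needed per pair, which is why
caching the norms tensor per scan query pays off.

```python
import numpy as np

def quantized_distances(codes_a, norms_a, codes_b, norms_b, centroids):
    """codes_*: (k, B) integer codes, norms_*: (k,) block norms,
    centroids: (num_centroids,) shared scalar codebook."""
    # Per-block dot product of the unit-norm reconstructions
    # (sum of B centroid products per block).
    unit_dot = np.sum(centroids[codes_a] * centroids[codes_b], axis=1)  # (k,)
    dot = np.sum(norms_a * norms_b * unit_dot)                          # <a, b>
    norm_a_sq = np.sum(norms_a ** 2)                                    # ‖a‖²
    norm_b_sq = np.sum(norms_b ** 2)                                    # ‖b‖²
    cosine = dot / np.sqrt(norm_a_sq * norm_b_sq)
    l2_sq = norm_a_sq + norm_b_sq - 2.0 * dot                           # ‖a-b‖²
    return dot, cosine, l2_sq
```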
#### Encoding algorithm

```
Input: x ∈ ℝ^d, total_bits b, block_size B (power of 2)
b_mse = b - 1 for QJL strategies, b_mse = b for MSE-only
num_centroids = 2^b_mse
k = ⌈d/B⌉

# Block split and normalize
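
# MSE quantization: rotate each block with its SORF, then scalar-quantize
# against the shared codebook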
for i in 0..k:
    if nᵢ > 0:
        rᵢ = SORFᵢ(ûᵢ)                       (3-round HD, independent per block)
        cᵢ[j] = nearest_centroid(rᵢ[j])      (shared codebook, num_centroids levels)
    else:
        cᵢ[j] = 0

Store: codes (k × B per vector), block_norms (k per vector),

317354
318355# --- Per-block strategies (Gaussian or SORF) ---
319- # Operate in unit-norm space, per block:
356+ # Operate in unit-norm space, per block. Note: the current implementation
357+ # computes the QJL residual in original scale (r = x - x̂). With externalized
358+ # norms, we instead compute the unit-norm residual (rᵢ = ûᵢ - x̂_unitᵢ) and
359+ # let denormalization handle the scaling. These are mathematically equivalent:
360+ # nᵢ × correctionᵢ gives the same result either way.
for i in 0..k:
    if nᵢ > 0:
        x̂ᵢ = decode_mse_block(cᵢ, centroids, SORFᵢ)

# QJL correction (if present)

# --- Per-block strategies (Gaussian or SORF) ---
# Scale factor uses B (block dimension) because Lemma 4 applies per-block.
for i in 0..k:
    if γᵢ > 0:
        correctionᵢ = (√(π/2) / B) × γᵢ × Projᵢᵀ × sᵢ
528570All configurations use 5 total bits per coordinate. For QJL strategies, this is
5295714-bit MSE + 1-bit QJL. For MSE-only, all 5 bits go to MSE (32 centroids).
530572
531- | Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes |
532- | ------------------------------------- | --- | ------------ | -------------- | ----------------------------- |
533- | Block MSE-only (5-bit MSE) | 128 | 6.1× | 6.1× | No QJL; biased inner products |
534- | Block + per-block SORF QJL | 128 | 5.8× | 5.8× | Approximate; minimal overhead |
535- | Block + per-block Gaussian QJL | 128 | 3.3× | 5.8× | Correct; matrices amortize |
536- | [ Current] [ current-impl ] (padded SORF) | — | 4.7× | 4.7× | 33% padding waste |
573+ | Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes |
574+ | ------------------------------------- | --- | ------------ | -------------- | ---------------------------------- |
575+ | Block MSE-only (5-bit MSE) | 128 | 6.1× | 6.1× | No QJL; biased inner products |
576+ | Block + per-block SORF QJL | 128 | 5.8× | 5.8× | Approximate; minimal overhead |
577+ | Block + full-dim padded SORF QJL | 128 | 5.7× | 5.7× | Lower variance; padded_d signs/vec |
578+ | Block + per-block Gaussian QJL | 128 | 3.3× | 5.8× | Paper-correct; matrices amortize |
579+ | [ Current] [ current-impl ] (padded SORF) | — | 4.7× | 4.7× | 33% padding waste |
537580
Per-block SORF QJL has the best ratio at all column sizes (SORF signs are
negligible overhead). Full-dim padded SORF QJL is close behind (the extra
padded_d − d = 256 sign bits per vector are a small cost). Per-block Gaussian
QJL is competitive only for large columns where the B²×k×4 byte matrices
amortize.

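
A rough accounting that approximately reproduces these ratios follows. It is a
sketch only: it assumes f32 input, 4-bit MSE codes for the QJL strategies, f32
per-block norms, f32 per-block residual norms for per-block QJL, and a single
f32 residual norm for the full-dim strategy; the actual layout may differ in
detail.

```python
def ratio(d=768, B=128, n_vectors=100_000, strategy="per_block_sorf"):
    k = -(-d // B)                              # 6 blocks at d=768, B=128
    raw = d * 4                                 # f32 input: 3072 bytes/vector
    per_vec = d * 4 / 8 + k * 4                 # 4-bit MSE codes + f32 block norms
    if strategy == "mse_only":
        per_vec = d * 5 / 8 + k * 4             # all 5 bits go to MSE
    elif strategy == "per_block_sorf":
        per_vec += d / 8 + k * 4                # 1 QJL sign/coord + residual norms
    elif strategy == "per_block_gaussian":
        per_vec += d / 8 + k * 4
        per_vec += k * B * B * 4 / n_vectors    # amortized Gaussian matrices
    elif strategy == "full_dim_padded_sorf":
        per_vec += 1024 / 8 + 4                 # padded_d signs + one residual norm
    return raw / per_vec

for s in ("mse_only", "per_block_sorf", "per_block_gaussian", "full_dim_padded_sorf"):
    print(s, round(ratio(strategy=s), 1))
# ≈ 6.1, 5.8, 5.8 (3.3 at n_vectors=1_000), 5.7
```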
## Performance analysis

### Encode throughput

With k blocks at B-dim, encoding requires per block:

| Operation                    | FLOPs (B=128)                         |
| ---------------------------- | ------------------------------------- |
| MSE SORF (3-round)           | 3 × 128 × log₂(128) + 3 × 128 ≈ 3,072 |
| Centroid lookup              | 128 binary searches                   |
| QJL Gaussian matmul (S × r)  | 2B² = 32,768 (multiply + add)         |
| QJL SORF (if per-block SORF) | ≈ 3,072 (same as MSE)                 |
| Norm computation             | 128 FMA + sqrt ≈ 129                  |

For d=768, k=6: MSE total ≈ 18K FLOPs. QJL depends on strategy: Gaussian
matmul ≈ 197K FLOPs, about 10× more than SORF QJL at ≈ 18K. The Gaussian QJL
dominates encode cost; SORF QJL adds negligible overhead. Both are acceptable
for offline encoding.

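
These figures follow from simple per-block accounting; a sketch, counting one
multiply or add as one FLOP:

```python
import math

def encode_flops(d=768, B=128, qjl="gaussian"):
    k = -(-d // B)
    sorf = 3 * (B * math.log2(B) + B)        # 3 rounds: Hadamard mix + sign flip
    per_block = sorf                         # MSE rotation
    if qjl == "gaussian":
        per_block += 2 * B * B               # dense S × r: B² multiplies + B² adds
    elif qjl == "sorf":
        per_block += sorf                    # second 3-round SORF for the residual
    return k * per_block

for q in ("none", "sorf", "gaussian"):
    print(q, int(encode_flops(qjl=q)))       # ≈ 18K, 37K, 215K total FLOPs
```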
### Decode throughput

| Operation                        | FLOPs per block (B=128) |
| -------------------------------- | ----------------------- |
| Codebook lookup                  | 128 table reads         |
| Inverse SORF                     | ≈ 3,072                 |
| QJL Gaussian matmul (Sᵀ × signs) | 2B² = 32,768            |
| QJL SORF (if per-block SORF)     | ≈ 3,072                 |
| Denormalize                      | 128 multiplies          |

For d=768, k=6: MSE decode ≈ 18K FLOPs. QJL decode: Gaussian ≈ 197K FLOPs,
SORF ≈ 18K FLOPs. Gaussian QJL decode is ~10× more expensive than SORF QJL.
For scan workloads that only need inner products (not full reconstruction), the
fused distance computation path avoids full decode entirely.

### Scan throughput (PDX, Stage 2)

Compare all four strategies at d=768 with B ∈ {64, 128, 256}:

- **Per-block SORF QJL**: measure at each candidate B. Quantify the quality
  cost of the SORF approximation at small block dimensions. Test at 3, 4, 5
  SORF rounds.
- **Full-dimension padded SORF QJL** (current approach): measure for comparison.
  Higher dimension gives better SORF-to-Haar convergence (30 butterfly stages at
  d=1024) which may compensate for the padding waste. This is the key
  comparison — does the better convergence of full-dim SORF outweigh the 25%
  wasted sign bits?