Commit 02d6c9d

review

Signed-off-by: Will Manning <will@willmanning.io>

1 parent b064e00

1 file changed: proposed/0033-block-turboquant.md (101 additions & 54 deletions)
@@ -2,7 +2,7 @@
**Authors:** Will Manning
**Status:** Proposal
**Date:** 2026-04-03

## Summary
@@ -170,10 +170,12 @@ quantified empirically (see Experimental plan).

**SORF approximation caveat.** Theorems 1 and 2 in [1] are proved for true
random orthogonal matrices (QR of Gaussian), not SORF. The 3-round SORF
construction `HD₃·HD₂·HD₁` [5] is a structured approximation. The approximation
quality depends on dimension: each round of the Walsh-Hadamard transform mixes
all B coordinates through log₂(B) butterfly stages, so 3 rounds provides
3 × log₂(B) total butterfly stages (18 at B=64, 21 at 128, 24 at 256). This is
a rough heuristic for mixing quality, not a formal convergence metric — [5]
does not analyze convergence rate as a function of rounds × dimension.
Empirical validation is needed for each candidate B — see Experimental plan.
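For concreteness, a minimal numpy sketch of the 3-round construction above
(illustrative only; `fwht` and `sorf` are hypothetical helpers, not the
production kernel):

```
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform via log2(B) butterfly stages."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def sorf(x: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """3-round SORF HD₃·HD₂·HD₁: random sign flip (D), then normalized
    Hadamard (H), repeated 3 times. len(x) must be a power of 2."""
    B = len(x)
    for d in signs:                    # signs has shape (3, B), entries ±1
        x = fwht(d * x) / np.sqrt(B)   # normalize so each round is orthogonal
    return x

rng = np.random.default_rng(0)
B = 128
signs = rng.choice([-1.0, 1.0], size=(3, B))  # the only stored state: 3×B bits
u = rng.standard_normal(B)
u /= np.linalg.norm(u)
assert np.isclose(np.linalg.norm(sorf(u, signs)), 1.0)  # orthogonal: norm kept
```

A production kernel would vectorize the butterflies; the sketch only
illustrates the data flow and the 3×B-bit storage footprint.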
**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a
B × B random orthogonal matrix (QR of Gaussian) instead. Storage at B=128:
@@ -202,31 +204,46 @@ independent Gaussian projection Sₖ ∈ ℝ^(B×B) with i.i.d. N(0,1) entries.
Gaussian matrices work at any dimension, no padding is needed for the QJL stage.
Each block's QJL is provably unbiased by Lemma 4 in [1], and the sum over
blocks is also unbiased: `E[<y, correction>] = <y, r>`. However, the per-block
variance is **d/B times higher** than full-dimension QJL.

Lemma 4 gives the variance for QJL of a unit vector. Since QJL is applied to the
residual rₖ (with norm γₖ = ‖rₖ‖, typically ≪ 1), the actual variance scales
by γₖ²:

```
Per-block (B-dim): Var[<y, correction>] ≤ (π / (2B)) × ‖r‖² × ‖y‖²
Full-dim (d-dim):  Var[<y, correction>] ≤ (π / (2d)) × ‖r‖² × ‖y‖²
```

The ‖r‖² factor cancels when comparing strategies (same MSE quality → same
residual norms), so the **relative** variance ratio is d/B regardless. At
d=768, B=128: per-block has 6× more variance than full-dim. The absolute
variance is small — at b=4 MSE, ‖r‖² ≈ 0.01, so the per-block variance is
≈ 0.01 × (π/(2×128)) × ‖y‖² ≈ 1.2×10⁻⁴ × ‖y‖².

Storage: B×B×4 bytes per block (384 KB for k=6 at B=128). Encode/decode cost:
O(B²) matmul per block.
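To ground the formulas, a hedged numpy sketch of the per-block Gaussian QJL
path (`qjl_encode` and `qjl_correction` are illustrative names, not the
proposal's API). It stores one sign bit per row of Sₖ plus γₖ, applies the
√(π/2)/B decode scale used in the decoding algorithm below, and Monte-Carlo
checks that `<y, correction>` is unbiased for `<y, r>`:

```
import numpy as np

B = 128
rng = np.random.default_rng(42)

def qjl_encode(S: np.ndarray, r: np.ndarray):
    """Quantize the residual direction to one sign bit per projection row."""
    gamma = np.linalg.norm(r)
    return np.sign(S @ (r / gamma)), gamma

def qjl_correction(S: np.ndarray, signs: np.ndarray, gamma: float):
    """Unbiased residual estimate: E_S[correction] = r (Lemma 4 scaling)."""
    return (np.sqrt(np.pi / 2) / B) * gamma * (S.T @ signs)

r = rng.standard_normal(B)
r *= 0.1 / np.linalg.norm(r)             # ‖r‖² = 0.01, a typical b=4 residual
y = rng.standard_normal(B)

ests = []
for _ in range(5000):                    # redraw S to average over projections
    S = rng.standard_normal((B, B))
    signs, gamma = qjl_encode(S, r)
    ests.append(y @ qjl_correction(S, signs, gamma))

print(np.mean(ests), y @ r)              # mean estimate ≈ true <y, r>
```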
**Per-block SORF QJL** substitutes a B-dim SORF (`HD₃·HD₂·HD₁` [5]) for the
Gaussian matrix. This is NOT theoretically justified — Lemma 4 in [1] is proved
specifically for Gaussian S. For Haar-distributed random orthogonal S,
unbiasedness follows from rotational invariance (a separate argument), but the
variance constant may differ from π/(2B). SORF is an approximation to neither
Gaussian nor Haar measure.
However, the [current implementation][current-impl] already uses SORF for QJL at
d=1024 with acceptable results (~11% mean relative error for power-of-2 dims),
demonstrating practical viability. The tradeoff vs Gaussian is compelling:
O(B log B) speed (~10× faster than Gaussian at B=128), O(B) storage (over
1000× less). Quality at B=128 needs validation — with only 21 butterfly stages,
the approximation to Haar measure is weaker than at d=1024 (30 stages).

**Full-dimension padded SORF QJL** applies a single SORF at the padded
dimension (e.g., 1024 for d=768) to the full residual vector `r = x - x̂`,
matching the [current implementation][current-impl]. The higher dimension gives
better SORF-to-Haar convergence (30 butterfly stages at d=1024 vs 21 at B=128)
and full-dimension variance `~(π/(2·padded_d))·‖r‖²·‖y‖²`, but wastes
`(padded_d - d)/padded_d` of the sign bits on zero-padded coordinates (25% at
768→1024). This approach requires computing the full residual from all blocks
before applying QJL, adding a full-dimension decode step to the encode path.
@@ -238,24 +255,30 @@ computation.

**QJL strategy options** (to be experimentally compared):

| Strategy             | Theoretical       | Variance (×‖r‖²‖y‖²) | Padding waste   | Storage      | Speed            |
| -------------------- | ----------------- | -------------------- | --------------- | ------------ | ---------------- |
| Per-block Gaussian   | Correct (Lemma 4) | π/(2B)               | None            | k×B²×4 bytes | O(B²)/block      |
| Per-block SORF       | Approximate       | ~π/(2B)              | None            | k×3×B bits   | O(B log B)/block |
| Full-dim padded SORF | Approximate       | ~π/(2·pad_d)         | (pad_d-d)/pad_d | 3×pad_d bits | O(d log d) total |
| MSE-only             | N/A               | N/A                  | N/A             | None         | 0                |

Variance entries show the coefficient of `‖r‖²×‖y‖²` where `‖r‖²` is the
residual MSE (≈ 0.01 at b=4). The ‖r‖² factor is the same across strategies
(same MSE quality), so relative comparisons reduce to the coefficient alone:
per-block is d/B times higher than full-dim (6× at d=768, B=128).

Note: the full-dim padded SORF variance bound formally uses `pad_d` (e.g.,
1024), not `d` (768). The `pad_d - d` sign bits spent on zero-padded
coordinates carry no information about the residual, so the effective variance
reduction may be closer to `π/(2d)`. The experiment should measure actual
variance to resolve this.
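That measurement is straightforward to prototype. A hedged sketch (same
illustrative per-block Gaussian estimator as above; swapping in a padded SORF
projection would address the π/(2·pad_d) vs π/(2d) question) that estimates the
empirical variance of `<y, correction>` against the per-block coefficient:

```
import numpy as np

B = 128
rng = np.random.default_rng(1)
r = rng.standard_normal(B)
r *= 0.1 / np.linalg.norm(r)                  # ‖r‖² ≈ 0.01
y = rng.standard_normal(B)
gamma = np.linalg.norm(r)

vals = []
for _ in range(4000):
    S = rng.standard_normal((B, B))           # redraw projection each trial
    signs = np.sign(S @ (r / gamma))
    corr = (np.sqrt(np.pi / 2) / B) * gamma * (S.T @ signs)
    vals.append(y @ corr)

bound = (np.pi / (2 * B)) * gamma**2 * np.linalg.norm(y)**2
print(np.var(vals), "vs Lemma 4 bound", bound)  # empirical ≲ π/(2B)·‖r‖²·‖y‖²
```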
#### Norm architecture

The TurboQuant array itself operates only on unit-norm B-dim sub-vectors. Norms
are externalized into a separate child array, following the pattern explored in
the NormVector encoding prototype (PR #7251, closed — the concept will need to
be implemented as part of this work or adapted from a different source).

The per-block norms are stored as a single `FixedSizeListArray<F>` with
`list_size = num_blocks`, where `F` matches or widens the input element type:
@@ -271,6 +294,13 @@ encoding to them. The cascading compressor treats norms like any other float
column and is free to re-encode them with ALP, Pco, FastLanes, or other float
compression schemes.

Note: centroids and quantization always operate in f32 (the
[current implementation][current-impl] converts all input to f32 before
quantization). For f64 input, the decode path produces f32 reconstructions
scaled by f64 norms — a mixed-precision multiply. This preserves the precision
of the norms (which capture the bulk of the vector's magnitude) while accepting
f32 precision for the unit-direction reconstruction.

#### Quantized-domain operations with per-block norms

All quantized-domain operations require reading the block norms for both
@@ -285,12 +315,19 @@ centroids[code_bₖ[j]]`. Per-block: compute unit-norm quantized dot product
(sum of B centroid products), then weight by both vectors' block norms.
- **Cosine similarity**: `cos(a, b) ≈ (Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ) /
  (√(Σ_k ‖aₖ‖²) · √(Σ_k ‖bₖ‖²))`. Requires global norms reconstructed from
  block norms.
- **L2 distance** (squared Euclidean): `‖a-b‖² = ‖a‖² + ‖b‖² - 2<a,b>
  = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2 × Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ`. Reuses the
  per-block dot product and per-block norms; this is the primary ANN metric.

The norms tensor should be read once per scan query and cached.
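A minimal sketch of the fused quantized-domain path (`fused_l2` is a
hypothetical helper, not the proposal's API): since both vectors share the same
per-block SORF and orthogonal rotations preserve inner products, the unit-norm
dot product can be computed directly on centroid lookups, and squared L2 then
falls out of the identity above:

```
import numpy as np

def fused_l2(codes_a, norms_a, codes_b, norms_b, centroids):
    """Squared L2 from codes and per-block norms, with no full decode.

    codes_*:   (k, B) integer centroid indices in rotated unit-norm space
    norms_*:   (k,) per-block norms
    centroids: (num_centroids,) shared scalar codebook
    """
    # Per-block unit-norm dot products: the shared orthogonal SORF cancels
    # inside <û_a, û_b>, so summing centroid products suffices.
    unit_dot = np.sum(centroids[codes_a] * centroids[codes_b], axis=1)  # (k,)
    dot = np.sum(norms_a * norms_b * unit_dot)             # <a, b> estimate
    return np.sum(norms_a**2) + np.sum(norms_b**2) - 2.0 * dot
```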
#### Encoding algorithm

```
Input: x ∈ ℝ^d, total_bits b, block_size B (power of 2)
b_mse = b - 1 for QJL strategies, b_mse = b for MSE-only
num_centroids = 2^b_mse
k = ⌈d/B⌉

# Block split and normalize
```
@@ -306,7 +343,7 @@ for i in 0..k:

```
for i in 0..k:
  if nᵢ > 0:
    rᵢ = SORFᵢ(ûᵢ)                   (3-round HD, independent per block)
    cᵢ[j] = nearest_centroid(rᵢ[j])  (shared codebook, num_centroids levels)
  else:
    cᵢ[j] = 0
```
@@ -316,7 +353,11 @@ Store: codes (k × B per vector), block_norms (k per vector),

```
# QJL stage (optional, one of four strategies)

# --- Per-block strategies (Gaussian or SORF) ---
# Operate in unit-norm space, per block. Note: the current implementation
# computes the QJL residual in original scale (r = x - x̂). With externalized
# norms, we instead compute the unit-norm residual (rᵢ = ûᵢ - x̂_unitᵢ) and
# let denormalization handle the scaling. These are mathematically equivalent:
# nᵢ × correctionᵢ gives the same result either way.
for i in 0..k:
  if nᵢ > 0:
    x̂ᵢ = decode_mse_block(cᵢ, centroids, SORFᵢ)
```
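A quick numerical check of the equivalence claimed in the comment above (all
names are hypothetical stand-ins): the QJL sign bits are invariant to positive
scaling and γ absorbs the factor nᵢ, so correcting in unit-norm space and then
denormalizing matches correcting in original scale:

```
import numpy as np

B = 128
rng = np.random.default_rng(7)
S = rng.standard_normal((B, B))              # any fixed projection

def correction(r: np.ndarray) -> np.ndarray:
    gamma = np.linalg.norm(r)
    signs = np.sign(S @ (r / gamma))         # signs ignore positive scaling
    return (np.sqrt(np.pi / 2) / B) * gamma * (S.T @ signs)

x_i = rng.standard_normal(B)
n_i = np.linalg.norm(x_i)
u_i = x_i / n_i
x_hat_unit = u_i + 0.05 * rng.standard_normal(B)  # stand-in MSE reconstruction

r_orig = x_i - n_i * x_hat_unit              # residual in original scale
r_unit = u_i - x_hat_unit                    # residual in unit-norm space
assert np.allclose(correction(r_orig), n_i * correction(r_unit))
```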
@@ -355,6 +396,7 @@ for i in 0..k:

```
# QJL correction (if present)

# --- Per-block strategies (Gaussian or SORF) ---
# Scale factor uses B (block dimension) because Lemma 4 applies per-block.
for i in 0..k:
  if γᵢ > 0:
    correctionᵢ = (√(π/2) / B) × γᵢ × Projᵢᵀ × sᵢ
```
@@ -528,48 +570,53 @@ giving ratio ≈ 5.8×. At N=1M, ratio ≈ 5.8×.
All configurations use 5 total bits per coordinate. For QJL strategies, this is
4-bit MSE + 1-bit QJL. For MSE-only, all 5 bits go to MSE (32 centroids).

| Config                                | B   | Ratio (N=1K) | Ratio (N=100K) | Notes                              |
| ------------------------------------- | --- | ------------ | -------------- | ---------------------------------- |
| Block MSE-only (5-bit MSE)            | 128 | 6.1×         | 6.1×           | No QJL; biased inner products      |
| Block + per-block SORF QJL            | 128 | 5.8×         | 5.8×           | Approximate; minimal overhead      |
| Block + full-dim padded SORF QJL      | 128 | 5.7×         | 5.7×           | Lower variance; padded_d signs/vec |
| Block + per-block Gaussian QJL        | 128 | 3.3×         | 5.8×           | Paper-correct; matrices amortize   |
| [Current][current-impl] (padded SORF) |     | 4.7×         | 4.7×           | 33% padding waste                  |

Per-block SORF QJL has the best ratio at all column sizes (SORF signs are
negligible overhead). Full-dim padded SORF QJL is close behind (the extra
padded_d − d = 256 sign bits per vector are a small cost). Per-block Gaussian
QJL is competitive only for large columns where the B²×k×4 byte matrices
amortize.
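As a sanity check, a back-of-envelope payload accounting (an assumption about
what is stored per vector: codes, QJL sign bits, and f32 block and residual
norms, not the proposal's exact cost model) lands close to the table's 6.1×
and 5.8× entries:

```
d, B = 768, 128
k = d // B                    # 6 blocks
raw_bits = d * 32             # f32 input per vector

# MSE-only: 5-bit codes + f32 block norms
mse_only = 5 * d + k * 32
# Per-block SORF QJL: 4-bit codes + 1 QJL sign per coordinate
#                     + f32 block norms + f32 residual norms γ
sorf_qjl = 4 * d + d + k * 32 + k * 32

print(raw_bits / mse_only)    # ≈ 6.1×
print(raw_bits / sorf_qjl)    # ≈ 5.8×
```

The SORF sign diagonals (k×3×B bits) are shared across the column, which is why
they vanish from the per-vector accounting.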
## Performance analysis

### Encode throughput

With k blocks at B-dim, encoding requires per block:

| Operation                    | FLOPs (B=128)                          |
| ---------------------------- | -------------------------------------- |
| MSE SORF (3-round)           | 3 × 128 × log₂(128) + 3 × 128 ≈ 3,072  |
| Centroid lookup              | 128 binary searches                    |
| QJL Gaussian matmul (S × r)  | 2B² = 32,768 (multiply + add)          |
| QJL SORF (if per-block SORF) | ≈ 3,072 (same as MSE)                  |
| Norm computation             | 128 FMA + sqrt ≈ 129                   |

For d=768, k=6: MSE total ≈ 18K FLOPs. QJL depends on strategy: Gaussian
matmul ≈ 197K FLOPs (~10× more than SORF QJL at ≈ 18K). The Gaussian QJL
dominates encode cost. SORF QJL adds negligible overhead. Acceptable for
offline encoding in both cases.
### Decode throughput

| Operation                        | FLOPs per block (B=128) |
| -------------------------------- | ----------------------- |
| Codebook lookup                  | 128 table reads         |
| Inverse SORF                     | ≈ 3,072                 |
| QJL Gaussian matmul (Sᵀ × signs) | 2B² = 32,768            |
| QJL SORF (if per-block SORF)     | ≈ 3,072                 |
| Denormalize                      | 128 multiplies          |

For d=768, k=6: MSE decode ≈ 18K FLOPs. QJL decode: Gaussian ≈ 197K FLOPs,
SORF ≈ 18K FLOPs. Gaussian QJL decode is ~10× more expensive than SORF QJL.
For scan workloads that only need inner products (not full reconstruction), the
fused distance computation path avoids full decode entirely.

### Scan throughput (PDX, Stage 2)
@@ -614,7 +661,7 @@ Compare all four strategies at d=768 with B ∈ {64, 128, 256}:
B. Quantify the quality cost of the SORF approximation at small block
dimensions. Test at 3, 4, 5 SORF rounds.
- **Full-dimension padded SORF QJL** (current approach): measure for comparison.
  Higher dimension gives better SORF-to-Haar convergence (30 butterfly stages at
  d=1024) which may compensate for the padding waste. This is the key
  comparison — does the better convergence of full-dim SORF outweigh the 25%
  wasted sign bits?
