Asymmetric Quantization: Near-Lossless Retrieval with 97% Storage Reduction


Late interaction models like Wholembed v3 make retrieval much more precise, because they preserve fine-grained information instead of compressing a whole document into one vector. But they also change the storage economics. A single document produces more than one embedding; depending on its complexity, it can produce hundreds or thousands of vectors. Each vector has to be stored and later used for retrieval.

Mixedbread Search runs on silo, our retrieval engine for multimodal late interaction at billion-document scale. Silo stores vectors for more than 2.5 billion documents in object storage and hydrates them into faster tiers as queries need them. At that scale, every extra byte per document is repeated billions of times, and it shows up directly in cost per stored document, shard cold-start time, and the bytes each query has to read. We need to work around that tradeoff: keep the whole system cheap while maintaining the quality that makes late interaction worth running.

This post walks through asymmetric quantization, one of the optimizations that makes running late interaction practical in production. We keep the query vectors at higher precision and store the document vectors as binary signs. In our internal benchmark suite, that cuts raw document-vector storage by 32x, from 393 KiB to 12.28 KiB per document on average, while holding retrieval quality at 89.65 NDCG@10 versus 90.26 for fp32.

TL;DR: Documents are stored for a long time; queries are executed once. So we store document vectors as 1-bit signs and keep the query at int8. That shrinks per-document storage 32x and loses only 0.61 NDCG@10 (90.26 to 89.65) on our internal benchmarks.

Quantization: Making Multi-Vector Storage Practical

Quantization means representing high-precision floating point vectors with lower-precision values such as int8, or even 1-bit signs. The goal is to preserve ranking quality while reducing payload size. This matters especially for silo. Object storage gives us durable, low-cost persistence, but to serve real workloads from it fast enough, the indexes have to be compact. On the document side, payload size dominates the cost.

Naive late interaction is expensive because it stores more vectors. A standard single-vector embedding with 3072 dimensions in fp32 takes 12 KiB per document. A multi-vector representation with 786 vectors of 128 dimensions carries much more information, but uncompressed it is about 33x larger.

Representation	Dimensions	Storage per document	Relative to 3072-d fp32 single vector
Single vector, fp32	3072	12,288 B / 12 KiB	1.0x
Single vector, int8	3072	3,072 B / 3 KiB	0.25x
Multi-vector, fp32	786 × 128	402,432 B / 393 KiB	32.75x
Multi-vector, int8	786 × 128	100,608 B / 98.25 KiB	8.19x
Multi-vector, binary	786 × 128	12,576 B / 12.28 KiB	1.02x

Storage numbers here refer to raw vector payloads only. Production indexes also include document IDs, metadata, and layout overhead.
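The figures in the table follow directly from vectors × dimensions × bits per value. As a minimal sketch (the helper name `payload_bytes` is ours, not part of silo), the arithmetic can be reproduced like this:

```python
def payload_bytes(num_vectors: int, dims: int, bits_per_value: int) -> int:
    """Raw vector payload in bytes, ignoring IDs, metadata, and layout overhead."""
    total_bits = num_vectors * dims * bits_per_value
    return (total_bits + 7) // 8  # round up to whole bytes


# Reference point: one 3072-dimensional fp32 vector.
BASELINE = payload_bytes(1, 3072, 32)

ROWS = [
    ("Single vector, fp32", 1, 3072, 32),
    ("Single vector, int8", 1, 3072, 8),
    ("Multi-vector, fp32", 786, 128, 32),
    ("Multi-vector, int8", 786, 128, 8),
    ("Multi-vector, binary", 786, 128, 1),
]

for name, n, d, bits in ROWS:
    size = payload_bytes(n, d, bits)
    print(f"{name:22s} {size:>8,d} B  {size / 1024:7.2f} KiB  {size / BASELINE:5.2f}x")
```

The binary multi-vector row comes out at 12,576 bytes, which is where the "about 2% larger than a single fp32 vector" comparison comes from.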

With binary document vectors, a 786-token multi-vector document is only about 2% larger than a 3072-dimensional fp32 single vector. That means you pay roughly single-vector storage and get late interaction quality. This changes the tradeoff: late interaction becomes practical to run by default, instead of something reserved for cases that justify the storage.

This is not a new direction for late interaction. ColBERTv2 showed that aggressive compression can reduce the footprint of late interaction models while preserving quality. PLAID showed that late interaction retrieval can be engineered down to practical latency using optimized retrieval and pruning. For production systems, both lessons matter: the model has to be precise, and the representation has to be cheap enough to move through hardware.

Why Asymmetric Quantization

Compressing the document vectors saves storage, IO, cache space, and cold-start time across the entire corpus. Compressing the query vectors saves almost nothing because the query is small, short-lived, and never stored in the index.

This is also why we do not binarize both sides. Fully binary retrieval is the most compact option, but dropping the query to single bits throws away the magnitude information the ranking depends on, and it costs far more quality than binarizing documents alone (as shown later).

So we keep the query in int8 and store only the document vectors as binary signs. The query stays precise enough to preserve ranking, while the document side gets the storage reduction that matters for serving.
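To make the asymmetry concrete, here is a minimal sketch of the two quantizers. The post does not specify the int8 scaling scheme or the bit layout, so the per-vector max-abs scale, the mapping of zero to +1, and the "bit k of byte j holds dimension 8*j + k" packing are all assumptions made for illustration, as are the function names.

```python
def quantize_query_int8(vec):
    """Symmetric per-vector int8 quantization (assumed scheme).

    The largest magnitude maps to 127; returns (int8 values, scale).
    """
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0.0:
        return [0] * len(vec), 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(x / scale))) for x in vec]
    return quantized, scale


def binarize_document(vec):
    """Keep only the sign of each dimension, packed 8 dimensions per byte.

    Assumed layout: bit k of byte j is 1 when dimension 8*j + k is >= 0.
    """
    if len(vec) % 8 != 0:
        raise ValueError("dimension count must be a multiple of 8")
    packed = bytearray(len(vec) // 8)
    for i, x in enumerate(vec):
        if x >= 0.0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


query_q, query_scale = quantize_query_int8([0.5, -1.0, 0.25, 0.0])
doc_bits = binarize_document([0.3, -0.2, 0.9, -0.1, -0.4, -0.7, -0.5, -0.6])
print(query_q, query_scale, doc_bits.hex())
```

A 128-dimensional document token shrinks from 512 bytes in fp32 to 16 bytes here, while the query keeps 8 bits per dimension.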

The Scoring Trick

Binary document vectors are smaller and thus cheaper to store.

For int8 x int8 scoring, modern ARM CPUs give us direct support through NEON dot-product instructions. Our AArch64 kernel uses SDOT to accumulate sixteen int8 multiplications into int32 lanes, then horizontally reduces the result with vaddvq_s32.

For int8 x binary scoring, the useful identity is simpler. If each document dimension is stored as a sign bit, with $b_i \in \{-1, +1\}$, and $q_i$ is the query value in dimension $i$, then:

$$\sum_i q_i b_i = 2 \sum_{i : b_i = +1} q_i - \sum_i q_i$$

So, scoring does not require a full multiply for every dimension. We need the sum of query values selected by the positive document bits, and the total query sum.
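The identity follows from writing each sign as b_i = 2 * m_i - 1, where m_i is 1 for a positive bit and 0 otherwise, so that sum(q_i * b_i) = 2 * sum(q_i * m_i) - sum(q_i). A small check in plain Python, with illustrative function names and random test data of our own choosing:

```python
import random


def signed_dot(query, signs):
    """Reference: one multiply per dimension against +1/-1 signs."""
    return sum(q * b for q, b in zip(query, signs))


def selected_sum_dot(query, signs):
    """Same score using only the selected query sum and the total query sum."""
    selected = sum(q for q, b in zip(query, signs) if b == 1)
    return 2 * selected - sum(query)


rng = random.Random(0)
query = [rng.randint(-127, 127) for _ in range(128)]
signs = [rng.choice((-1, 1)) for _ in range(128)]

# Both formulations agree exactly, since everything is integer arithmetic.
assert signed_dot(query, signs) == selected_sum_dot(query, signs)
```

Because the total query sum does not depend on the document, it can be computed once per query token and reused for every document token.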

In the kernel, document signs are packed into bits. For the 128-dimensional path, we precompute query sums and eight query bit-planes. Each document token is loaded as 16 packed bytes, shifted and masked into eight 0/1 masks with NEON integer operations, then scored with SDOT against the query planes. The final score uses the identity above: 2 * selected_query_sum - query_sum.
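The following is a scalar sketch of how such a packed path could look, not the actual NEON kernel. It assumes that bit k of byte j holds dimension 8*j + k and that each of the eight query planes is the query regrouped so that plane k holds dimensions k, 8+k, 16+k, and so on; the real layout in silo may differ. The inner sum stands in for what SDOT would do over sixteen lanes.

```python
import random


def pack_signs(signs):
    """Pack +1/-1 signs into bytes; bit k of byte j holds dimension 8*j + k (assumed)."""
    packed = bytearray(len(signs) // 8)
    for i, b in enumerate(signs):
        if b == 1:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def query_planes(query):
    """Regroup a 128-d query into eight 16-lane planes: planes[k][j] = query[8*j + k]."""
    return [[query[8 * j + k] for j in range(16)] for k in range(8)]


def score_packed(planes, query_sum, packed_doc):
    """Score one 16-byte document token against precomputed query planes."""
    selected = 0
    for k in range(8):
        # Shift and mask: sixteen 0/1 lanes, one per packed byte.
        mask = [(byte >> k) & 1 for byte in packed_doc]
        # Stand-in for SDOT: dot product of the 0/1 mask with the query plane.
        selected += sum(m * q for m, q in zip(mask, planes[k]))
    return 2 * selected - query_sum


rng = random.Random(1)
query = [rng.randint(-127, 127) for _ in range(128)]
signs = [rng.choice((-1, 1)) for _ in range(128)]

planes = query_planes(query)      # computed once per query token
query_sum = sum(query)            # computed once per query token
packed = pack_signs(signs)        # 16 bytes per document token
reference = sum(q * b for q, b in zip(query, signs))
print(score_packed(planes, query_sum, packed) == reference)
```

The planes and the query sum are per-query work, so the per-document cost is limited to unpacking 16 bytes and eight masked accumulations.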

Binary x binary is even cheaper computationally since it can use Hamming distance, but the quality loss is too large for our main retrieval path.
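For reference, when both sides are sign vectors the dot product reduces to a popcount: matching bits contribute +1 and differing bits contribute -1, so the score is dims - 2 * hamming. A minimal sketch, where the function name and the use of Python integers as bit vectors are ours:

```python
def binary_dot(a: bytes, b: bytes) -> int:
    """Dot product of two packed +1/-1 sign vectors via Hamming distance."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same packed length")
    dims = 8 * len(a)
    differing = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    hamming = bin(differing).count("1")  # popcount of the XOR
    return dims - 2 * hamming


print(binary_dot(b"\xff" * 16, b"\xff" * 16))  # identical 128-d vectors
print(binary_dot(b"\xff" * 16, b"\x00" * 16))  # opposite 128-d vectors
```

This needs no query values at all, which is both why it is fast and why it loses the magnitude information discussed earlier.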

Retrieval Quality

We evaluated several precision pairings across our internal retrieval benchmark suite. Scores are NDCG@10 averaged across the suite, scaled to 0–100. NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) measures how well the top 10 results are ordered against the ideal ranking, rewarding relevant documents more when they appear higher, with 100 being a perfect ranking.

The full-precision baseline averages 90.26. Int8 query against binary documents averages 89.65, a 0.61 point drop, while reducing document-vector storage by 32x. Part of the reason the drop is so small is that Wholembed v3 is trained with silo's tradeoff in mind, so it is robust to the quantization.
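As a reminder of how the metric is computed, here is a minimal sketch of NDCG@10 using the common linear-gain, log2-discount formulation. The post does not say which gain variant the internal suite uses, so treat that choice, and the function names, as assumptions.

```python
import math


def dcg(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """NDCG@k scaled to 0-100.

    ranked_relevances: graded labels of the returned results, in rank order.
    judged_relevances: graded labels of all judged documents for the query,
    used to build the ideal ranking.
    """
    ideal = dcg(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return 100.0 * dcg(ranked_relevances, k) / ideal


# A perfect ordering, and one where the single relevant result lands at rank 2.
print(ndcg_at_k([3, 2, 1, 0], [3, 2, 1, 0]))
print(ndcg_at_k([0, 1], [1, 0]))
```

The suite-level number reported in this post would then be the mean of this per-query score across queries and benchmarks.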

For runtime, we benchmarked the scoring kernel on an ARM machine with a 33 × 128 query over a list of 1000 documents, each 786 × 128. The table reports median latency across 9 measured runs after 2 warmups, plus speedup relative to the fp32 baseline.

Query format	Document format	Avg NDCG@10	Δ vs baseline	Doc storage per doc	Median latency	Speedup vs fp32
Float	Float	90.26	–	393 KiB	14.20 ms	1.00x
Int8	Int8	90.27	+0.01	98.25 KiB	4.44 ms	3.20x
Int8	Binary	89.65	-0.61	12.28 KiB	3.71 ms	3.82x
Binary	Binary	83.06	-7.20	12.28 KiB	3.32 ms	4.27x
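The measurement protocol described above (2 warmups, then the median of 9 timed runs) can be sketched as a small harness. This is an illustration only: `median_latency_ms`, `speedup`, and the placeholder workload are hypothetical and are not the benchmark code behind the table.

```python
import statistics
import time


def median_latency_ms(fn, warmups=2, runs=9):
    """Run fn a few times untimed, then return the median of the timed runs in ms."""
    for _ in range(warmups):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def placeholder_workload():
    # Stand-in for scoring a query against a list of documents.
    return sum(i * i for i in range(10_000))


print(f"{median_latency_ms(placeholder_workload):.3f} ms")
```

Using the median rather than the mean keeps a single slow run, for example one hit by a scheduler hiccup, from skewing the reported latency.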

There are two useful operating points.

If we want maximum quality with lower bandwidth and faster integer scoring, int8 × int8 is essentially lossless in this setup. Its 0.01 point lead over the fp32 baseline is within measurement noise, while it cuts document storage by 4x and runs 3.2x faster in this benchmark.

If we want the best storage economics, int8 × binary is the more interesting point. It keeps most of the ranking quality while shrinking document vectors by 32x and running 3.8x faster than fp32. For an object-storage-backed system, that is a direct cut in corpus-side bytes.

Binary × binary looks appealing on paper. It uses the same 12.28 KiB of document storage as int8 × binary, and at 4.3x it is the fastest option here. But it drops 7.20 points against the baseline, more than ten times the int8 × binary drop, despite reading exactly the same document bytes. The only thing that changed is the query. In practice, it removes too much query signal.

What This Unlocks

Asymmetric quantization works because retrieval systems do not pay for query and document precision in the same way. The document vectors dominate the long-term cost of the system: they are stored, replicated, cached, evicted, rehydrated, and scored repeatedly. The query vectors are short lived, so spending a few more bits on the query and saving bits on every stored document is the right tradeoff.

For silo, this makes late interaction retrieval much easier to serve at large scale: a lower cost per stored document, faster shard cold-start, and a hardware path that spends more time scoring and less time moving bytes around, which allows higher QPS. This lets us get the quality of multi-vector representations without treating every document as a large fp32 object.

If this is the kind of systems problem you want to work on, we are [hiring](https://mixedbread.com/careers).

source & further reading

mixedbread.com — original article
