[ARTICLE · art-43933] src=mixedbread.com ↗ pub= topic=machine-learning verified=true sentiment=↑ positive

Asymmetric Quantization: Near-Lossless Retrieval with 97% Storage Reduction

Mixedbread Search introduced asymmetric quantization for late interaction retrieval, storing document vectors as 1-bit binary signs while keeping query vectors at higher precision, achieving a 32x reduction in storage per document from 393 KiB to 12.28 KiB with only a 0.61 NDCG@10 drop (90.26 to 89.65) on internal benchmarks. This makes late interaction models practical for billion-scale retrieval by reducing storage costs to near single-vector levels.

Published Jun 29, 2026 · 8 min read

Late interaction models like Wholembed v3 make retrieval much more precise, because they preserve fine-grained information instead of compressing a whole document into one vector. But they also change the storage economics. A single document produces more than one embedding: depending on its complexity, it can produce hundreds or thousands of vectors. Each vector has to be stored and later used for retrieval.

Mixedbread Search runs on silo, our retrieval engine for multimodal late interaction at billion-document scale. Silo stores vectors for more than 2.5 billion documents in object storage and hydrates them into faster tiers as queries need them. At that scale, every extra byte per document is repeated billions of times, and it shows up directly in cost per stored document, shard cold-start time, and the bytes each query has to read. We need to work around that tradeoff: make the whole system cheap while maintaining the quality that makes late interaction worth running.

This post walks through asymmetric quantization, one of the optimizations that makes running late interaction practical in production. We keep the query vectors at higher precision and store the document vectors as binary signs. In our internal benchmark suite, that cuts raw document-vector storage by 32x on average, from 393 KiB to 12.28 KiB per document, while holding retrieval quality at 89.65 NDCG@10 versus 90.26 for fp32.

TL;DR: Documents are stored for a long time; queries are executed once. So we store document vectors as 1-bit signs and keep the query at int8. That shrinks per-document storage 32x and loses only 0.61 NDCG@10 (90.26 to 89.65) on our internal benchmarks.

Quantization: Making Multi-Vector Storage Practical

Quantization means representing high-precision floating point vectors with lower-precision values such as int8, or even 1-bit signs. The goal is to preserve ranking quality while reducing payload size. This matters especially for silo. Object storage gives us durable, low-cost persistence, but to make it suitable for real workloads we need indexes compact enough to be served quickly. And on the document side, payload size is what dominates the cost.

Naive late interaction is expensive because it stores more vectors. A standard single-vector embedding with 3072 dimensions in fp32 takes 12 KiB per document. A multi-vector representation with 786 vectors of 128 dimensions carries much more information, but uncompressed it is about 33x larger.

| Representation | Dimensions | Storage per document | Relative to 3072-d fp32 single vector |
| --- | --- | --- | --- |
| Single vector, fp32 | 3072 | 12,288 B / 12 KiB | 1.0x |
| Single vector, int8 | 3072 | 3,072 B / 3 KiB | 0.25x |
| Multi-vector, fp32 | 786 × 128 | 402,432 B / 393 KiB | 32.75x |
| Multi-vector, int8 | 786 × 128 | 100,608 B / 98.25 KiB | 8.19x |
| Multi-vector, binary | 786 × 128 | 12,576 B / 12.28 KiB | 1.02x |

Storage numbers here refer to raw vector payloads only. Production indexes also include document IDs, metadata, and layout overhead.
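
The figures in the table follow from vectors × dimensions × bits per value. As a quick check, here is a small Python sketch that recomputes them; the helper name `payload_bytes` and the row labels are only illustrative, and like the table it counts raw payload bytes without IDs, metadata, or layout overhead.

```python
def payload_bytes(num_vectors, dims, bits_per_value):
    """Raw vector payload in bytes (no IDs, metadata, or layout overhead)."""
    total_bits = num_vectors * dims * bits_per_value
    assert total_bits % 8 == 0, "payload must be a whole number of bytes"
    return total_bits // 8


# Reference point used by the table: one 3072-d fp32 vector.
BASELINE_BYTES = payload_bytes(1, 3072, 32)

# (label, number of vectors, dimensions, bits per value)
ROWS = [
    ("Single vector, fp32", 1, 3072, 32),
    ("Single vector, int8", 1, 3072, 8),
    ("Multi-vector, fp32", 786, 128, 32),
    ("Multi-vector, int8", 786, 128, 8),
    ("Multi-vector, binary", 786, 128, 1),
]


def storage_table():
    """Return (label, bytes, KiB, ratio to the fp32 single vector) per row."""
    table = []
    for label, vectors, dims, bits in ROWS:
        size = payload_bytes(vectors, dims, bits)
        table.append((label, size, size / 1024, size / BASELINE_BYTES))
    return table


if __name__ == "__main__":
    for label, size, kib, ratio in storage_table():
        print(f"{label:22s} {size:>9,d} B  {kib:8.2f} KiB  {ratio:6.2f}x")
```

Going from 32 bits to 1 bit per value is where the 32x reduction comes from; it does not depend on the number of vectors or dimensions.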

With binary document vectors, a 786-token multi-vector document is only about 2% larger than a 3072-dimensional fp32 single vector. That means you pay roughly single-vector storage and get late interaction quality. This changes the tradeoff: late interaction becomes practical to run by default, instead of something reserved for cases that justify the storage.

This is not a new direction for late interaction. ColBERTv2 showed that aggressive compression can reduce the footprint of late interaction models while preserving quality. PLAID showed that late interaction retrieval can be engineered down to practical latency using optimized retrieval and pruning. For production systems, both lessons matter: the model has to be precise, and the representation has to be cheap enough to move through hardware.

Why Asymmetric Quantization

Compressing the document vectors saves storage, IO, cache space, and cold-start time across the entire corpus. Compressing the query vectors saves almost nothing because the query is small, short-lived, and never stored in the index.

This is also why we do not binarize both sides. Fully binary retrieval is the most compact option, but dropping the query to single bits throws away the magnitude information the ranking depends on, and it costs far more quality than binarizing documents alone (as shown later).

So we keep the query in int8 and store only the document vectors as binary signs. The query stays precise enough to preserve ranking, while the document side gets the storage reduction that matters for serving.
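
To make the two sides concrete, here is a minimal Python sketch of how a query vector and a document vector might be quantized. The post does not spell out the details, so several choices are assumptions: the int8 query uses symmetric absmax scaling, a zero value maps to the +1 sign, and bits are packed least-significant-bit first. The function names are illustrative and are not silo's API.

```python
def quantize_query_int8(vec):
    """Quantize floats to int8 with symmetric absmax scaling (assumed scheme).

    Returns (int8_values, scale) such that value ~= int8_value * scale.
    """
    peak = max(abs(x) for x in vec)
    if peak == 0.0:
        return [0] * len(vec), 1.0
    scale = peak / 127.0
    # Clamp to [-127, 127] so the range stays symmetric around zero.
    return [max(-127, min(127, round(x / scale))) for x in vec], scale


def binarize_document(vec):
    """Keep only the sign of each dimension, packed 8 dimensions per byte.

    Bit i % 8 of byte i // 8 is 1 when vec[i] >= 0 (assumed convention).
    """
    packed = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x >= 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


if __name__ == "__main__":
    query_ints, scale = quantize_query_int8([0.5, -1.0, 0.25, 0.0])
    print(query_ints, scale)
    doc_bits = binarize_document([0.1] * 64 + [-0.1] * 64)
    print(len(doc_bits), "bytes for a 128-dimensional document vector")
```

Under this packing a 128-dimensional document vector occupies 16 bytes, while the int8 query keeps one byte per dimension for the lifetime of a single request.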

The Scoring Trick

Binary document vectors are smaller and thus cheaper to store.

For int8 x int8 scoring, modern ARM CPUs give us direct support through NEON dot-product instructions. Our AArch64 kernel uses SDOT to accumulate sixteen int8 multiplications into int32 lanes, then horizontally reduces the result with vaddvq_s32. For int8 x binary scoring, the useful identity is simpler. If each document dimension is stored as a sign bit, with $b_i \in \{-1, +1\}$, and $q_i$ are the query values, then:

$$\sum_i q_i b_i = 2 \sum_{i \,:\, b_i = +1} q_i - \sum_i q_i$$

So, scoring does not require a full multiply for every dimension. We need the sum of query values selected by the positive document bits, and the total query sum.
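
The identity can be checked with a few lines of Python. This is a scalar sketch, not the production kernel: it assumes sign bits are packed least-significant-bit first with 1 meaning +1, and the function names are made up for illustration.

```python
def sign_bit(packed_bits, i):
    """Return 1 if dimension i is stored as +1, else 0 (LSB-first packing assumed)."""
    return (packed_bits[i // 8] >> (i % 8)) & 1


def dot_naive(query, packed_bits):
    """Reference: multiply every query value by +1 or -1 and sum."""
    return sum(q if sign_bit(packed_bits, i) else -q for i, q in enumerate(query))


def dot_via_identity(query, packed_bits, query_sum=None):
    """Same result as 2 * (sum of query values at positive bits) - (sum of all query values)."""
    if query_sum is None:
        # Depends only on the query, so it can be computed once and reused
        # for every document token.
        query_sum = sum(query)
    selected = sum(q for i, q in enumerate(query) if sign_bit(packed_bits, i))
    return 2 * selected - query_sum


if __name__ == "__main__":
    query = [3, -5, 7, 1, -2, 4, 0, -6]
    doc = bytes([0b10100101])  # dimensions 0, 2, 5, 7 are +1
    print(dot_naive(query, doc), dot_via_identity(query, doc))
```

Because the total query sum is fixed per query, the per-document work reduces to a masked sum.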

In the kernel, document signs are packed into bits. For the 128-dimensional path, we precompute query sums and eight query bit-planes. Each document token is loaded as 16 packed bytes, shifted and masked into eight 0/1 masks with NEON integer operations, then scored with SDOT against the query planes. The final score uses the identity above: 2 * selected_query_sum - query_sum.
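
The following Python sketch emulates that flow in scalar code so the data movement is easier to follow. It is one possible reading of the description, not the actual kernel: the plane layout (`planes[k][j] = query[8 * j + k]`), the least-significant-bit-first packing, and all names are assumptions. Each 16-element product inside the loop stands in for what an SDOT plus a horizontal reduction would do on hardware.

```python
DIMS = 128
BYTES_PER_TOKEN = DIMS // 8  # 16 packed bytes per document token


def build_query_planes(query):
    """Rearrange the int8 query into eight planes of 16 values each.

    Assumed layout: planes[k][j] holds the query value for bit k of byte j,
    i.e. dimension 8 * j + k.
    """
    assert len(query) == DIMS
    return [[query[8 * j + k] for j in range(BYTES_PER_TOKEN)] for k in range(8)]


def score_token(planes, query_sum, token_bytes):
    """Score one packed document token against precomputed query planes."""
    assert len(token_bytes) == BYTES_PER_TOKEN
    selected = 0
    for k in range(8):
        # Shift and mask: one 0/1 value per byte, taken from bit k.
        mask = [(b >> k) & 1 for b in token_bytes]
        # 16-element int8 dot product (the part SDOT would handle).
        selected += sum(m * p for m, p in zip(mask, planes[k]))
    return 2 * selected - query_sum


if __name__ == "__main__":
    query = [((i * 37) % 255) - 127 for i in range(DIMS)]
    token = bytes((i * 73 + 5) % 256 for i in range(BYTES_PER_TOKEN))
    planes = build_query_planes(query)
    print(score_token(planes, sum(query), token))
```

The planes and the query sum are built once per query vector, so the per-token cost is eight shift-and-mask steps and eight short dot products.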

Binary x binary is even cheaper computationally since it can use hamming distance, but the quality loss is too large for our main retrieval path.
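
For reference, the Hamming-distance shortcut for two sign vectors looks like this in Python. It is a sketch assuming both sides use the same bit packing with 1 meaning +1; the function name is illustrative. Each differing bit contributes -1 to the dot product and each matching bit contributes +1, so the dot product is the dimension count minus twice the Hamming distance.

```python
def binary_dot(a, b):
    """Dot product of two packed {-1, +1} vectors via XOR and popcount."""
    assert len(a) == len(b)
    dims = 8 * len(a)
    differing = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    hamming = bin(differing).count("1")
    return dims - 2 * hamming


if __name__ == "__main__":
    a = bytes([0b10100101])
    b = bytes([0b11110000])
    print(binary_dot(a, b))
```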

Retrieval Quality

We evaluated several precision pairings across our internal retrieval benchmark suite. Scores are NDCG@10 averaged across the suite, scaled to 0–100. NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) measures how well the top 10 results are ordered against the ideal ranking, rewarding relevant documents more when they appear higher, with 100 being a perfect ranking. The full-precision baseline averages 90.26. Int8 query against binary documents averages 89.65, a 0.61 point drop, while reducing document-vector storage by 32x. Part of the reason the drop is so small is that Wholembed v3 is trained with silo's tradeoff in mind, so it is robust to the quantization.
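
For readers less familiar with the metric, here is a compact Python sketch of NDCG@10 on the 0–100 scale. It uses the linear-gain form with a log2 rank discount; the post does not say which gain variant the internal suite uses, so treat that choice, and the function names, as assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k scaled to 0-100.

    ranked_relevances: relevance labels in the order the system returned them.
    all_relevances: labels of every judged document, if more exist than were
    returned; defaults to the returned list.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0
    return 100.0 * dcg_at_k(ranked_relevances, k) / ideal


if __name__ == "__main__":
    print(ndcg_at_k([3, 2, 1, 0]))  # already in ideal order
    print(ndcg_at_k([0, 1]))        # the relevant result is one rank too low
```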

For runtime, we benchmarked the scoring kernel on an ARM machine with a 33 × 128 query over a list of 1000 documents, each 786 × 128. The table reports median latency across 9 measured runs after 2 warmups, plus speedup relative to the fp32 baseline.

| Query format | Document format | Avg NDCG@10 | Δ vs baseline | Doc storage per doc | Median latency | Speedup vs fp32 |
| --- | --- | --- | --- | --- | --- | --- |
| Float | Float | 90.26 | | 393 KiB | 14.20 ms | 1.00x |
| Int8 | Int8 | 90.27 | +0.01 | 98.25 KiB | 4.44 ms | 3.20x |
| Int8 | Binary | 89.65 | -0.61 | 12.28 KiB | 3.71 ms | 3.82x |
| Binary | Binary | 83.06 | -7.20 | 12.28 KiB | 3.32 ms | 4.27x |

There are two useful operating points.

If we want maximum quality with lower bandwidth and faster integer scoring, int8 × int8 is essentially lossless in this setup. It is slightly ahead of the fp32 baseline within measurement noise, while cutting document storage by 4x and running 3.2x faster in this benchmark.

If we want the best storage economics, int8 × binary is the more interesting point. It keeps most of the ranking quality while shrinking document vectors by 32x and running 3.8x faster than fp32. For an object-storage-backed system, that is a direct cut in corpus-side bytes.

Binary × binary looks appealing on paper. It uses the same 12.28 KiB of document storage as int8 × binary, and at 4.3x it is the fastest option here. But it drops 7.20 points against the baseline, more than ten times the int8 × binary drop, despite reading exactly the same document bytes. The only thing that changed is the query. In practice, it removes too much query signal.

What This Unlocks

Asymmetric quantization works because retrieval systems do not pay for query and document precision in the same way. The document vectors dominate the long-term cost of the system: they are stored, replicated, cached, evicted, rehydrated, and scored repeatedly. The query vectors are short-lived, so spending a few more bits on the query and saving bits on every stored document is the right tradeoff.

For silo, this makes late interaction retrieval much easier to serve at large scale. It means a lower cost per stored document, faster shard cold-start, and a hardware path that spends more time scoring and less time moving bytes around, which allows higher QPS. This lets us get the quality of multi-vector representations without treating every document as a large fp32 object.

If this is the kind of systems problem you want to work on, we are [hiring](https://mixedbread.com/careers).