Show HN: Clark Hash, 32x smaller searchable sketches for embeddings

Clark Labs Inc. released Clark Hash, a Rust package that compresses 384-dimensional neural embeddings into 48-byte searchable sketches, achieving a 32x reduction in vector memory. The stateless codec uses sparse Johnson-Lindenstrauss projection with fixed scalar quantization to encode vectors independently without training or calibration, enabling cheaper storage and online semantic memory for large text streams, retrieval prefilters, and edge deployments.

Clark Hash is a Rust package for compact, searchable sketches of neural embeddings. It packages a stateless sparse Johnson-Lindenstrauss projection with fixed scalar quantization, so each database vector can be encoded independently and searched later with an asymmetric floating-point query sketch.

The core codec was originally developed under the internal name `SQuaJL`. The Rust API keeps the `SQuaJL` and `SQuaJLConfig` names for compatibility, and also exports `ClarkHash` and `ClarkHashConfig` aliases for new code.

- Crate: [crates.io/crates/clark-hash](https://crates.io/crates/clark-hash)
- API docs: [docs.rs/clark-hash](https://docs.rs/clark-hash/latest/clark_hash/)
- Source: [github.com/clark-labs-inc/clark-hash](https://github.com/clark-labs-inc/clark-hash)
- Paper sources: [arxiv_submission/](https://github.com/clark-labs-inc/clark-hash/blob/main/arxiv_submission)

Use cases:

- Cheaper embedding memory: store 384-dimensional f32 sentence embeddings as 48-byte searchable sketches in the default profile.
- Online semantic memory: encode vectors as they arrive, without training a codebook or recalibrating on the whole corpus.
- Large text streams: map documents, chunks, logs, conversations, or agent traces into compact semantic tokens for cheaper storage, movement, and scan.
- Retrieval prefilters: use compressed sketch scores as a low-cost first pass before reranking with dense vectors, text, or a stronger retrieval model.
- Local and edge search: keep more semantic state in RAM, local disk, browser storage, or customer-controlled deployments where bandwidth and sync size matter.

This repository is now focused on the Clark Hash embedding codec:

- Stateless sparse-JL sketching and scalar quantization for dense embeddings.
- Bit-packed database-side vectors and floating-point query sketches.
- A simple flat compressed-scan index for evaluation and small deployments.
- Optional fastembed integration for local text-embedding examples.
- Reproducible sentence-similarity benchmarks and paper sources.
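
The 48-byte figure for the default profile (sketch dimension 96, 4 bits per coordinate) follows from bit-packing: 96 × 4 / 8 = 48 bytes. The sketch below illustrates one way to pack and unpack such codes. The function names `pack_codes` and `unpack_codes` and the least-significant-bit-first layout are assumptions made for illustration, not the crate's actual storage format.

```rust
/// Pack `bits`-wide codes (1 <= bits <= 8) into bytes, least significant bit first.
/// Hypothetical layout for illustration only.
fn pack_codes(codes: &[u8], bits: u32) -> Vec<u8> {
    assert!((1..=8).contains(&bits));
    let total_bits = codes.len() * bits as usize;
    let mut out = vec![0u8; (total_bits + 7) / 8];
    for (i, &code) in codes.iter().enumerate() {
        assert!((code as u32) < (1u32 << bits), "code does not fit in `bits`");
        let bit_pos = i * bits as usize;
        let (byte, shift) = (bit_pos / 8, bit_pos % 8);
        // Widen to 16 bits so a code that straddles a byte boundary is not lost.
        let wide = (code as u16) << shift;
        out[byte] |= wide as u8;
        if shift + bits as usize > 8 {
            out[byte + 1] |= (wide >> 8) as u8;
        }
    }
    out
}

/// Inverse of `pack_codes`: read `count` codes of width `bits`.
fn unpack_codes(packed: &[u8], bits: u32, count: usize) -> Vec<u8> {
    let mask = (1u16 << bits) - 1;
    (0..count)
        .map(|i| {
            let bit_pos = i * bits as usize;
            let (byte, shift) = (bit_pos / 8, bit_pos % 8);
            let lo = packed[byte] as u16;
            let hi = if byte + 1 < packed.len() { packed[byte + 1] as u16 } else { 0 };
            (((lo | (hi << 8)) >> shift) & mask) as u8
        })
        .collect()
}

fn main() {
    let (dim, sketch_dim, bits) = (384usize, 96usize, 4u32);
    // Placeholder codes; a real codec would produce these from an embedding.
    let codes: Vec<u8> = (0..sketch_dim).map(|i| (i % 16) as u8).collect();
    let packed = pack_codes(&codes, bits);
    assert_eq!(unpack_codes(&packed, bits, sketch_dim), codes);

    let dense_bytes = dim * std::mem::size_of::<f32>();
    println!(
        "dense: {} bytes, packed: {} bytes, ratio: {}",
        dense_bytes,
        packed.len(),
        packed.len() as f64 / dense_bytes as f64
    );
}
```

With 4-bit codes, two coordinates share each byte, so no code straddles a byte boundary; the 16-bit widening only matters for widths such as 3 or 6 bits.
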

Model-runtime compression experiments are intentionally outside this package. The library surface here is the embedding sketch codec and its benchmark harnesses.

A common 384-dimensional f32 sentence embedding costs 1,536 bytes per vector. The default Clark Hash profile stores the same vector as a 48-byte cosine sketch:

| Representation | Bytes per vector | Storage ratio |
|---|---|---|
| Dense `f32`, 384 dimensions | 1,536 | 1.0000 |
| Clark Hash, `m = 96`, `b = 4` | 48 | 0.03125 |

That is 32x smaller, or 96.875% less vector memory, for this configuration. The quality tradeoff depends on the embedding model, sketch dimension, bit width, hash count, and retrieval workload; the benchmark section below shows measured results rather than a universal guarantee.

Clark Hash is useful when embeddings arrive continuously and you do not want a training or calibration pass before storing each vector:

- Encode one vector at a time with a deterministic seed.
- Store compact bit-packed sketches for hot memory, local cache, disk, or object storage.
- Keep query vectors in floating point for asymmetric scoring.
- Avoid corpus-specific codebooks, centroids, rotations, or learned quantization tables.
- Use the same codec in simple flat scans, evaluation harnesses, and larger retrieval systems.

From crates.io:

```toml
[dependencies]
clark-hash = "0.1"
```

With local text embedding support through `fastembed`:

```toml
[dependencies]
clark-hash = { version = "0.1", features = ["fastembed"] }
```

With serialization support for quantized codes:

```toml
[dependencies]
clark-hash = { version = "0.1", features = ["serde"] }
```

In Rust code, the crate is imported as `clark_hash`.

```rust
use clark_hash::{ClarkHash, ClarkHashConfig, FlatIndex, SimilarityMetric};

fn main() -> clark_hash::Result<()> {
    let codec = ClarkHash::new(
        ClarkHashConfig::new(384)
            .with_sketch_dim(96)
            .with_bits(4)
            .with_hashes_per_input(4)
            .with_metric(SimilarityMetric::Cosine),
    )?;

    let doc_a = vec![0.1_f32; 384];
    let doc_b = vec![0.2_f32; 384];
    let query = vec![0.15_f32; 384];

    let mut index = FlatIndex::new(codec);
    index.add_vector(&doc_a)?;
    index.add_vector(&doc_b)?;

    let hits = index.search(&query, 2)?;
    println!("{hits:?}");
    Ok(())
}
```

Enable the `fastembed` feature when you want local text embeddings and immediate quantization in one pipeline.

```rust
use clark_hash::{ClarkHash, ClarkHashConfig, FastEmbedQuantizer, FlatIndex};
use fastembed::EmbeddingModel;

fn main() -> clark_hash::Result<()> {
    let codec = ClarkHash::new(
        ClarkHashConfig::new(384)
            .with_sketch_dim(96)
            .with_bits(4)
            .with_hashes_per_input(4),
    )?;

    let mut pipeline = FastEmbedQuantizer::new(EmbeddingModel::AllMiniLML6V2, codec)?;

    let documents = vec![
        "passage: Rust is a systems programming language.",
        "passage: Embeddings can preserve semantic similarity.",
        "passage: Quantization reduces memory usage.",
    ];

    let codes = pipeline.quantize_texts(&documents, Some(32))?;
    let query = pipeline.embed_query("query: semantic vector compression")?;

    let index = FlatIndex::from_encoded(pipeline.codec().clone(), codes)?;
    println!("{:?}", index.search_prepared(&query, 3)?);
    Ok(())
}
```

Run the example:

```sh
cargo run --release --features fastembed --example fastembed_quantize
```

For an input vector `x` in `R^d`, the codec:

- Computes the input norm.
- Projects the normalized vector into a lower-dimensional sparse signed JL sketch.
- Rescales the projected coordinates by `sqrt(sketch_dim)`.
- Clips and uniformly quantizes every sketch coordinate into `1..=8` bits.
- Optionally stores a two-byte norm channel for raw dot-product scoring.

The database side stores a `QuantizedVector`. The query side uses a floating-point `QuerySketch`.
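
The steps above can be sketched in plain Rust without the crate. This is a minimal illustration under stated assumptions, not the crate's implementation: the `Sketcher` type, the SplitMix64-style `mix` hash, the clip value of 3.0, and the way buckets and signs are derived are all hypothetical choices. The optional norm channel and bit-packing are omitted.

```rust
/// SplitMix64-style mixer used as a cheap deterministic hash (an assumption,
/// not necessarily the hash the crate uses).
fn mix(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical stand-in for the codec: m = sketch dim, s = hashes per input.
struct Sketcher {
    m: usize,
    s: usize,
    bits: u32,
    clip: f32,
    seed: u64,
}

impl Sketcher {
    /// Float sketch: normalize, sparse signed projection, rescale by sqrt(m).
    fn sketch(&self, x: &[f32]) -> Vec<f32> {
        let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
        let mut y = vec![0.0f32; self.m];
        if norm == 0.0 {
            return y;
        }
        // 1/norm normalizes, 1/sqrt(s) balances the s hashes, sqrt(m) rescales.
        let scale = (self.m as f32).sqrt() / ((self.s as f32).sqrt() * norm);
        for (i, &v) in x.iter().enumerate() {
            for h in 0..self.s {
                let r = mix(self.seed ^ mix((i * self.s + h) as u64));
                let bucket = (r % self.m as u64) as usize;
                let sign = if (r >> 63) == 0 { 1.0 } else { -1.0 };
                y[bucket] += sign * v * scale;
            }
        }
        y
    }

    fn levels(&self) -> f32 {
        ((1u32 << self.bits) - 1) as f32
    }

    /// Database side: clip to [-clip, clip], then round to one of 2^bits levels.
    fn encode(&self, x: &[f32]) -> Vec<u8> {
        self.sketch(x)
            .iter()
            .map(|&v| {
                let t = (v.clamp(-self.clip, self.clip) + self.clip) / (2.0 * self.clip);
                (t * self.levels()).round() as u8
            })
            .collect()
    }

    /// Asymmetric score: float query sketch against dequantized codes,
    /// divided by m so that it approximates cosine similarity.
    fn score(&self, query_sketch: &[f32], codes: &[u8]) -> f32 {
        let step = 2.0 * self.clip / self.levels();
        let dot: f32 = query_sketch
            .iter()
            .zip(codes)
            .map(|(q, &c)| q * (c as f32 * step - self.clip))
            .sum();
        dot / self.m as f32
    }
}

fn main() {
    let codec = Sketcher { m: 96, s: 4, bits: 4, clip: 3.0, seed: 42 };
    let doc: Vec<f32> = (0..384).map(|i| ((i * 37 % 101) as f32) / 50.0 - 1.0).collect();
    let opposite: Vec<f32> = doc.iter().map(|v| -v).collect();

    let q = codec.sketch(&doc);
    println!("same vector:     {:.3}", codec.score(&q, &codec.encode(&doc)));
    println!("negated vector:  {:.3}", codec.score(&q, &codec.encode(&opposite)));
}
```

Because nothing in `encode` depends on other vectors, each vector can be encoded on arrival with only the configuration and seed, which is what makes the scheme stateless.
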
Scoring happens in sketch space, which is a natural fit for cosine similarity over normalized sentence embeddings.

The repository includes a compact mathematical note and paper. Regenerate the PDF with:

```sh
typst compile docs/CLARK_HASH_PAPER.typ docs/Clark_Hash_Paper.pdf
```

For common 384-dimensional sentence embeddings, start here:

```rust
ClarkHashConfig::new(384)
    .with_sketch_dim(96)
    .with_bits(4)
    .with_hashes_per_input(4)
    .with_metric(SimilarityMetric::Cosine)
```

Useful tuning directions:

- `sketch_dim = 64` with `bits = 2` or `3` gives more aggressive compression.
- `sketch_dim = 128` with `bits = 4` or `6` gives better quality.
- `SimilarityMetric::Cosine` is best for normalized semantic embeddings.
- `SimilarityMetric::Dot` stores a small norm channel and is better when raw inner product matters.
- `seed` controls the deterministic projection, so keep it stable across indexed data.

Run the core encode and scan Criterion benchmark:

```sh
cargo bench --bench throughput
```

Run the local text embedding plus quantization benchmark:

```sh
cargo bench --features fastembed --bench fastembed_pipeline
```

Run the synthetic retrieval sanity check:

```sh
cargo run --release --example quality_report
```

The real-text benchmark downloads multilingual sentence-similarity corpora from Hugging Face, embeds each unique sentence once, quantizes the embeddings, and compares score correlations.

Default `all-MiniLM-L6-v2` run:

```sh
cargo run --release --features fastembed --example hf_sentence_similarity
```

Multilingual model run:

```sh
cargo run --release --features fastembed --example hf_sentence_similarity -- \
  --model ParaphraseMLMiniLML12V2 \
  --report target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json
```

Fast smoke run:

```sh
cargo run --release --features fastembed --example hf_sentence_similarity -- \
  --max-pairs-per-subset 200
```

The benchmark currently uses:

- `mteb/sts17-crosslingual-sts`
- `mteb/sts22-crosslingual-sts`

It reports:

- Dense cosine score vs. human similarity correlation.
- Clark Hash approximate score vs. human similarity correlation.
- Quantized score vs. dense score correlation.
- Macro averages across language-pair subsets.

These results were produced locally on April 23, 2026 with:

- `sketch_dim = 96`
- `bits = 4`
- `hashes_per_input = 4`
- cosine scoring
- 48 bytes per stored vector
- 0.03125 compression ratio vs. dense `f32`

The full benchmark used 9,304 labeled sentence pairs across 29 multilingual subsets and 17,000 unique sentences.

| Model | Dataset | Dense Spearman | Sketch Spearman | Sketch Loss | Sketch vs Dense Pearson |
|---|---|---|---|---|---|
| `all-MiniLM-L6-v2` | `mteb/sts17-crosslingual-sts` | 0.3644 | 0.2719 | -0.0926 | 0.7242 |
| `all-MiniLM-L6-v2` | `mteb/sts22-crosslingual-sts` | 0.4168 | 0.2876 | -0.1292 | 0.8531 |
| `paraphrase-multilingual-MiniLM-L12-v2` | `mteb/sts17-crosslingual-sts` | 0.8144 | 0.7460 | -0.0684 | 0.9099 |
| `paraphrase-multilingual-MiniLM-L12-v2` | `mteb/sts22-crosslingual-sts` | 0.2973 | 0.2472 | -0.0501 | 0.9460 |

The main readout is that model fit matters more than quantization in this test. The English-centric `all-MiniLM-L6-v2` model is weak on many cross-lingual subsets. The multilingual MiniLM backbone is much stronger on STS17, and the sketch preserves a large part of that ranking signal while storing each vector in 48 bytes.

STS22 is a harder and more mixed corpus. The multilingual model is not universally better there, but the quantized sketches still track dense scores more closely than they did with the English MiniLM baseline.

Full JSON reports from the local run:

- `target/hf-sts-report.json`
- `target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json`

Core types:

- `ClarkHash` / `SQuaJL`: stateless codec used to encode vectors, sketch queries, and score codes.
- `ClarkHashConfig` / `SQuaJLConfig`: sketch size, bit width, hash count, clip range, seed, and metric.
- `QuantizedVector`: bit-packed database-side sketch.
- `QuerySketch`: floating-point query-side sketch.
- `FlatIndex`: reference exact scan over compressed vectors.
- `FastEmbedQuantizer`: optional text embedding and quantization pipeline.
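
`FlatIndex` is described above as a reference exact scan over compressed vectors. A minimal sketch of what such a scan involves: score every stored item against the query, sort by score, and keep the top `k`. The `flat_search` helper and its closure-based scoring are assumptions for illustration; the crate's own index may be organised differently.

```rust
/// Score every stored item and return the `k` best `(id, score)` pairs, best first.
/// Hypothetical helper, not part of the clark-hash API.
fn flat_search<T>(items: &[T], k: usize, score: impl Fn(&T) -> f32) -> Vec<(usize, f32)> {
    let mut hits: Vec<(usize, f32)> = items
        .iter()
        .enumerate()
        .map(|(id, item)| (id, score(item)))
        .collect();
    // `total_cmp` gives a total order on f32, so NaN scores cannot panic the sort.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(k);
    hits
}

fn main() {
    // Toy signed codes standing in for dequantized sketches.
    let codes: Vec<Vec<i8>> = vec![vec![1, -1, 1], vec![1, 1, 1], vec![-1, -1, -1]];
    let query = [0.5f32, 0.9, 0.1];

    let hits = flat_search(&codes, 2, |c| {
        c.iter().zip(&query).map(|(&ci, &qi)| ci as f32 * qi).sum::<f32>()
    });
    println!("{hits:?}");
}
```

The scan touches every stored item, so cost grows linearly with collection size; that matches the stated purpose of evaluation and small deployments.
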

Limitations:

- Clark Hash is a quantization library, not a full approximate-nearest-neighbor engine. `FlatIndex` scans compressed vectors exactly and is meant for evaluation and simple deployments.
- Quality depends on the embedding model, sketch dimension, bit width, and workload.
- No fixed sketch dimension can preserve every future pair in an adversarial unbounded stream.
- This package does not claim that Johnson-Lindenstrauss transforms, feature hashing, scalar quantization, or compressed retrieval are new. It documents and implements one practical stateless combination for Clark's embedding and memory workloads.

MLA:

Clark Labs Inc., Autoresearch, and Stanislav Kirdey. "Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings." Clark Labs Inc., 2026, GitHub, https://github.com/clark-labs-inc/clark-hash.

BibTeX:

```bibtex
@misc{clark_hash_2026,
  author = {{Clark Labs Inc.} and {Autoresearch} and {Stanislav Kirdey}},
  title = {Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings},
  year = {2026},
  publisher = {Clark Labs Inc.},
  url = {https://github.com/clark-labs-inc/clark-hash}
}
```

```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo bench --bench throughput --no-run
```

The fastembed benchmark and examples may download models on first use.

Licensed under either of:

- Apache License, Version 2.0
- MIT license

at your option.