{"slug": "show-hn-clark-hash-32x-smaller-searchable-sketches-for-embeddings", "title": "Show HN: Clark Hash, 32x smaller searchable sketches for embeddings", "summary": "Clark Labs Inc. released Clark Hash, a Rust package that compresses 384-dimensional neural embeddings into 48-byte searchable sketches, achieving a 32x reduction in vector memory. The stateless codec uses sparse Johnson-Lindenstrauss projection with fixed scalar quantization to encode vectors independently without training or calibration, enabling cheaper storage and online semantic memory for large text streams, retrieval prefilters, and edge deployments.", "body_md": "Clark Hash is a Rust package for compact, searchable sketches of neural embeddings. It packages a stateless sparse Johnson-Lindenstrauss projection with fixed scalar quantization, so each database vector can be encoded independently and searched later with an asymmetric floating-point query sketch.\n\nThe core codec was originally developed under the internal name `SQuaJL`. 
The Rust API keeps the `SQuaJL` and `SQuaJLConfig` names for compatibility, and also exports `ClarkHash` and `ClarkHashConfig` aliases for new code.\n\n- Crate: [crates.io/crates/clark-hash](https://crates.io/crates/clark-hash)\n- API docs: [docs.rs/clark-hash](https://docs.rs/clark-hash/latest/clark_hash/)\n- Source: [github.com/clark-labs-inc/clark-hash](https://github.com/clark-labs-inc/clark-hash)\n- Paper sources: [arxiv_submission/](https://github.com/clark-labs-inc/clark-hash/blob/main/arxiv_submission)\n\n**Cheaper embedding memory:** store 384-dimensional `f32` sentence embeddings as 48-byte searchable sketches in the default profile.\n\n**Online semantic memory:** encode vectors as they arrive, without training a codebook or recalibrating on the whole corpus.\n\n**Large text streams:** map documents, chunks, logs, conversations, or agent traces into compact semantic tokens for cheaper storage, movement, and scan.\n\n**Retrieval prefilters:** use compressed sketch scores as a low-cost first pass before reranking with dense vectors, text, or a stronger retrieval model.\n\n**Local and edge search:** keep more semantic state in RAM, local disk, browser storage, or customer-controlled deployments where bandwidth and sync size matter.\n\nThis repository is now focused on the Clark Hash embedding codec:\n\n- Stateless sparse-JL sketching and scalar quantization for dense embeddings.\n- Bit-packed database-side vectors and floating-point query sketches.\n- A simple flat compressed-scan index for evaluation and small deployments.\n- Optional `fastembed` integration for local text-embedding examples.\n- Reproducible sentence-similarity benchmarks and paper sources.\n\nModel-runtime compression experiments are intentionally outside this package. The library surface here is the embedding sketch codec and its benchmark harnesses.\n\nA common 384-dimensional `f32` sentence embedding costs 1,536 bytes per vector. 
The default Clark Hash profile stores the same vector as a 48-byte cosine sketch:\n\n| Representation | Bytes per vector | Storage ratio |\n|---|---|---|\n| Dense `f32`, 384 dimensions | 1,536 | 1.0000 |\n| Clark Hash, `m = 96`, `b = 4` | 48 | 0.03125 |\n\nThat is 32x smaller, or 96.875% less vector memory, for this configuration: 96 sketch coordinates at 4 bits each come to 384 bits, or 48 bytes. The quality tradeoff depends on the embedding model, sketch dimension, bit width, hash count, and retrieval workload; the benchmark section below shows measured results rather than a universal guarantee.\n\nClark Hash is useful when embeddings arrive continuously and you do not want a training or calibration pass before storing each vector:\n\n- Encode one vector at a time with a deterministic seed.\n- Store compact bit-packed sketches for hot memory, local cache, disk, or object storage.\n- Keep query vectors in floating point for asymmetric scoring.\n- Avoid corpus-specific codebooks, centroids, rotations, or learned quantization tables.\n- Use the same codec in simple flat scans, evaluation harnesses, and larger retrieval systems.\n\nFrom crates.io:\n\n```\n[dependencies]\nclark-hash = \"0.1\"\n```\n\nWith local text embedding support through `fastembed`:\n\n```\n[dependencies]\nclark-hash = { version = \"0.1\", features = [\"fastembed\"] }\n```\n\nWith serialization support for quantized codes:\n\n```\n[dependencies]\nclark-hash = { version = \"0.1\", features = [\"serde\"] }\n```\n\nIn Rust code, the crate is imported as `clark_hash`.\n\n```\nuse clark_hash::{ClarkHash, ClarkHashConfig, FlatIndex, SimilarityMetric};\n\nfn main() -> clark_hash::Result<()> {\n    let codec = ClarkHash::new(\n        ClarkHashConfig::new(384)\n            .with_sketch_dim(96)\n            .with_bits(4)\n            .with_hashes_per_input(4)\n            .with_metric(SimilarityMetric::Cosine),\n    )?;\n\n    let doc_a = vec![0.1_f32; 384];\n    let doc_b = vec![0.2_f32; 384];\n    let query = vec![0.15_f32; 384];\n\n    let mut index = 
FlatIndex::new(codec);\n    index.add_vector(&doc_a)?;\n    index.add_vector(&doc_b)?;\n\n    let hits = index.search(&query, 2)?;\n    println!(\"{hits:#?}\");\n\n    Ok(())\n}\n```\n\nEnable the `fastembed` feature when you want local text embeddings and immediate quantization in one pipeline.\n\n```\nuse clark_hash::{ClarkHash, ClarkHashConfig, FastEmbedQuantizer, FlatIndex};\nuse fastembed::EmbeddingModel;\n\nfn main() -> clark_hash::Result<()> {\n    let codec = ClarkHash::new(\n        ClarkHashConfig::new(384)\n            .with_sketch_dim(96)\n            .with_bits(4)\n            .with_hashes_per_input(4),\n    )?;\n\n    let mut pipeline = FastEmbedQuantizer::new(EmbeddingModel::AllMiniLML6V2, codec)?;\n\n    let documents = vec![\n        \"passage: Rust is a systems programming language.\",\n        \"passage: Embeddings can preserve semantic similarity.\",\n        \"passage: Quantization reduces memory usage.\",\n    ];\n\n    let codes = pipeline.quantize_texts(&documents, Some(32))?;\n    let query = pipeline.embed_query(\"query: semantic vector compression\")?;\n    let index = FlatIndex::from_encoded(pipeline.codec().clone(), codes)?;\n\n    println!(\"{:#?}\", index.search_prepared(&query, 3)?);\n    Ok(())\n}\n```\n\nRun the example:\n\n```\ncargo run --release --features fastembed --example fastembed_quantize\n```\n\nFor an input vector `x in R^d`, the codec:\n\n- Computes the input norm.\n- Projects the normalized vector into a lower-dimensional sparse signed JL sketch.\n- Rescales the projected coordinates by `sqrt(sketch_dim)`.\n- Clips and uniformly quantizes every sketch coordinate into `1..=8` bits.\n- Optionally stores a two-byte norm channel for raw dot-product scoring.\n\nThe database side stores a `QuantizedVector`. The query side uses a floating-point `QuerySketch`. 
Scoring happens in sketch space, which is a natural fit for cosine similarity over normalized sentence embeddings.\n\nThe compact mathematical note and paper are linked from the source repository. The paper is built from `docs/CLARK_HASH_PAPER.typ`; regenerate the PDF with:\n\n```\ntypst compile docs/CLARK_HASH_PAPER.typ docs/Clark_Hash_Paper.pdf\n```\n\nFor common 384-dimensional sentence embeddings, start here:\n\n```\nClarkHashConfig::new(384)\n    .with_sketch_dim(96)\n    .with_bits(4)\n    .with_hashes_per_input(4)\n    .with_metric(SimilarityMetric::Cosine)\n```\n\nUseful tuning directions:\n\n- `sketch_dim = 64` with `bits = 2` or `3` gives more aggressive compression (16 or 24 bytes of packed sketch per vector).\n- `sketch_dim = 128` with `bits = 4` or `6` gives better quality (64 or 96 bytes of packed sketch per vector).\n- `SimilarityMetric::Cosine` is best for normalized semantic embeddings.\n- `SimilarityMetric::Dot` stores a small norm channel and is better when raw inner product matters.\n- `seed` controls the deterministic projection, so keep it stable across indexed data.\n\nRun the core encode and scan Criterion benchmark:\n\n```\ncargo bench --bench throughput\n```\n\nRun the local text embedding plus quantization benchmark:\n\n```\ncargo bench --features fastembed --bench fastembed_pipeline\n```\n\nRun the synthetic retrieval sanity check:\n\n```\ncargo run --release --example quality_report\n```\n\nThe real-text benchmark downloads multilingual sentence-similarity corpora from Hugging Face, embeds each unique sentence once, quantizes the embeddings, and compares score correlations.\n\nDefault `all-MiniLM-L6-v2` run:\n\n```\ncargo run --release --features fastembed --example hf_sentence_similarity\n```\n\nMultilingual model run:\n\n```\ncargo run --release --features fastembed --example hf_sentence_similarity -- \\\n  --model ParaphraseMLMiniLML12V2 \\\n  --report target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json\n```\n\nFast smoke run:\n\n```\ncargo run --release --features fastembed --example hf_sentence_similarity -- \\\n  --max-pairs-per-subset 200\n```\n\nThe benchmark currently 
uses:\n\n- `mteb/sts17-crosslingual-sts`\n- `mteb/sts22-crosslingual-sts`\n\nIt reports:\n\n- Dense cosine score vs. human similarity correlation.\n- Clark Hash approximate score vs. human similarity correlation.\n- Quantized score vs. dense score correlation.\n- Macro averages across language-pair subsets.\n\nThese results were produced locally on April 23, 2026 with:\n\n- `sketch_dim = 96`\n- `bits = 4`\n- `hashes_per_input = 4`\n- cosine scoring\n- 48 bytes per stored vector\n- 0.03125 compression ratio vs. dense `f32`\n\nThe full benchmark used 9,304 labeled sentence pairs across 29 multilingual subsets and 17,000 unique sentences.\n\n| Model | Dataset | Dense Spearman | Sketch Spearman | Sketch Loss | Sketch vs Dense Pearson |\n|---|---|---|---|---|---|\n| `all-MiniLM-L6-v2` | `mteb/sts17-crosslingual-sts` | 0.3644 | 0.2719 | -0.0926 | 0.7242 |\n| `all-MiniLM-L6-v2` | `mteb/sts22-crosslingual-sts` | 0.4168 | 0.2876 | -0.1292 | 0.8531 |\n| `paraphrase-multilingual-MiniLM-L12-v2` | `mteb/sts17-crosslingual-sts` | 0.8144 | 0.7460 | -0.0684 | 0.9099 |\n| `paraphrase-multilingual-MiniLM-L12-v2` | `mteb/sts22-crosslingual-sts` | 0.2973 | 0.2472 | -0.0501 | 0.9460 |\n\nThe main readout is that model fit matters more than quantization in this test. The English-centric `all-MiniLM-L6-v2` model is weak on many cross-lingual subsets. The multilingual MiniLM backbone is much stronger on STS17, and the sketch preserves most of that ranking signal (Spearman 0.7460 vs. 0.8144 dense) while storing each vector in 48 bytes.\n\nSTS22 is a harder and more mixed corpus. 
The multilingual model is not universally better there (its dense Spearman is 0.2973 vs. 0.4168 for the English model), but the quantized sketches still track dense scores more closely than they did with the English MiniLM baseline (sketch-vs-dense Pearson 0.9460 vs. 0.8531).\n\nFull JSON reports from the local run:\n\n- `target/hf-sts-report.json`\n- `target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json`\n\nCore types:\n\n- `ClarkHash` / `SQuaJL`: stateless codec used to encode vectors, sketch queries, and score codes.\n- `ClarkHashConfig` / `SQuaJLConfig`: sketch size, bit width, hash count, clip range, seed, and metric.\n- `QuantizedVector`: bit-packed database-side sketch.\n- `QuerySketch`: floating-point query-side sketch.\n- `FlatIndex`: reference exact scan over compressed vectors.\n- `FastEmbedQuantizer`: optional text embedding and quantization pipeline.\n\nLimitations:\n\n- Clark Hash is a quantization library, not a full approximate-nearest-neighbor engine.\n- `FlatIndex` scans compressed vectors exactly and is meant for evaluation and simple deployments.\n- Quality depends on the embedding model, sketch dimension, bit width, and workload.\n- No fixed sketch dimension can preserve every future pair in an adversarial unbounded stream.\n- This package does not claim that Johnson-Lindenstrauss transforms, feature hashing, scalar quantization, or compressed retrieval are new. It documents and implements one practical stateless combination for Clark's embedding and memory workloads.\n\nMLA:\n\nClark Labs Inc., Autoresearch, and Stanislav Kirdey. 
\"Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings.\" Clark Labs Inc., 2026, GitHub, https://github.com/clark-labs-inc/clark-hash.\n\nBibTeX:\n\n```\n@misc{clark_hash_2026,\n  author = {{Clark Labs Inc.} and {Autoresearch} and {Stanislav Kirdey}},\n  title = {Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings},\n  year = {2026},\n  publisher = {Clark Labs Inc.},\n  url = {https://github.com/clark-labs-inc/clark-hash}\n}\n```\n\nDevelopment checks:\n\n```\ncargo fmt --all -- --check\ncargo clippy --all-targets --all-features -- -D warnings\ncargo test --all-features\ncargo bench --bench throughput --no-run\n```\n\nThe `fastembed` benchmark and examples may download models on first use.\n\nLicensed under either of:\n\n- Apache License, Version 2.0\n- MIT license\n\nat your option.", "url": "https://wpnews.pro/news/show-hn-clark-hash-32x-smaller-searchable-sketches-for-embeddings", "canonical_source": "https://github.com/clark-labs-inc/clark-hash", "published_at": "2026-05-27 08:44:13+00:00", "updated_at": "2026-05-27 09:17:49.882133+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "ai-tools", "ai-infrastructure", "natural-language-processing"], "entities": ["Clark Hash", "SQuaJL", "Clark Labs Inc", "Johnson-Lindenstrauss", "Rust"], "alternates": {"html": "https://wpnews.pro/news/show-hn-clark-hash-32x-smaller-searchable-sketches-for-embeddings", "markdown": "https://wpnews.pro/news/show-hn-clark-hash-32x-smaller-searchable-sketches-for-embeddings.md", "text": "https://wpnews.pro/news/show-hn-clark-hash-32x-smaller-searchable-sketches-for-embeddings.txt", "jsonld": "https://wpnews.pro/news/show-hn-clark-hash-32x-smaller-searchable-sketches-for-embeddings.jsonld"}}