{"slug": "show-hn-cusbf-faster-gpu-bloom-filter-for-sequence-data", "title": "Show HN: cuSBF – Faster GPU Bloom Filter for Sequence Data", "summary": "A new GPU-accelerated Bloom filter implementation, cuSBF, achieves up to 234 times faster k-mer queries and 92 times faster insertions compared to CPU-based Super Bloom filters for DNA and protein sequence analysis. Developed for NVIDIA GPUs with compute capability 8.0 or higher, the header-only C++ library uses minimizer-based shard selection and findere false-positive reduction to optimize memory bandwidth for streaming sequence data. Benchmarks on an RTX PRO 6000 Blackwell GPU show cuSBF outperforming existing GPU filters including GBBF, Cuckoo-GPU, TCF, and GQF by 7.6 to 3,427 times across various metrics.", "body_md": "cuSBF is a high-performance GPU implementation of the [Super Bloom filter](https://www.biorxiv.org/content/10.64898/2026.03.17.712354v1.article-info), optimized for high-throughput batch k-mer insertion and query on nucleotide (DNA) and protein sequences (or any other sequence type as long as a valid alphabet is provided).\n\nIt exploits the streaming nature of sequence-derived k-mers by using **minimizers** to group consecutive k-mers sharing the same minimiser into super-k-mers, assigning all k-mers of a super-k-mer to the same 256-bit memory shard. This amortizes random memory accesses across consecutive k-mer queries, reducing memory-bandwidth pressure. 
The **findere** scheme further reduces false positives dramatically by inserting overlapping s-mers and requiring a full run of consecutive s-mer matches.\n\n- CUDA-accelerated batch k-mer insert and query from sequences\n- Configurable k-mer length, minimiser width, s-mer width, and hash function count\n- Minimizer-based shard selection for cache-efficient streaming queries\n- Findere false-positive reduction via overlapping s-mer membership\n- Header-only library design\n- FASTA/FASTQ stream and file support\n\nBenchmarks use `Config<31, 28, 16, 4>` on an NVIDIA RTX PRO 6000 Blackwell GPU. CPU Super Bloom runs on an Intel Xeon W9-3595X with 120 threads.\n\nCompared against:\n\n- [CPU Super Bloom](https://github.com/EtienneC-K/SuperBloom)\n- [GPU Blocked Bloom filter (GBBF)](https://github.com/NVIDIA/cuCollections)\n- [GPU Cuckoo-GPU](https://github.com/tdortman/Cuckoo-GPU)\n- [GPU Bulk Two-Choice Filter (TCF)](https://github.com/saltsystemslab/gpu-filters/tree/main/bulk-tcf)\n- [GPU Counting Quotient Filter (GQF)](https://github.com/saltsystemslab/gpu-filters/tree/main/gqf)\n\n| Comparison | Insert | Query |\n|---|---|---|\n| cuSBF vs Super Bloom | 92× faster | 234× faster |\n| cuSBF vs GBBF | 9.1× faster | 7.7× faster |\n| cuSBF vs Cuckoo-GPU | 80× faster | 8.0× faster |\n| cuSBF vs TCF | 12× faster | 52× faster |\n| cuSBF vs GQF | 69× faster | 13× faster |\n\n| Comparison | Insert | Query |\n|---|---|---|\n| cuSBF vs Super Bloom | 59× faster | 165× faster |\n| cuSBF vs GBBF | 8.2× faster | 7.6× faster |\n| cuSBF vs Cuckoo-GPU | 3427× faster | 7.8× faster |\n| cuSBF vs TCF | 12× faster | 67× faster |\n| cuSBF vs GQF | 42× faster | 11× faster |\n\n| Bits/k-mer | cuSBF `s=28` | cuSBF `s=30` | cuSBF `s=31` | GBBF |\n|---|---|---|---|---|\n| 21.4 | 0.848% | 0.951% | 1.593% | 3.069% |\n| 85.7 | 0.091% | 0.107% | 0.210% | 0.126% |\n| 342.6 | 0.0095% | 0.0114% | 0.0264% | 0.0273% |\n\n- Linux (x86_64 or aarch64) with an NVIDIA GPU and driver\n- CUDA Toolkit >= 13.1\n- GCC or Clang 
host compiler (C++20)\n- Meson and Ninja\n- NVIDIA GPU with compute capability 8.0+ (Ampere, Ada Lovelace, Hopper, Blackwell)\n\ncuSBF is developed and tested on **Linux** only.\n\n- **WSL2** on Windows is a reasonable dev environment (see the [NVIDIA docs](https://docs.nvidia.com/cuda/wsl-user-guide/index.html)).\n- **Native Windows and macOS** are not supported or tested. The build uses Linux-specific FASTX paths (for example `mmap`) and host tooling assumptions (GCC/Clang, GNU statement expressions in `CUSBF_TRY` / `CUSBF_UNWRAP`).\n\n```\nmeson setup build\nninja -C build\n```\n\nWhen this repo is the root Meson project, **benchmarks**, **tests**, and **examples** build by default. As a subproject they are skipped unless you force them on.\n\n| Option | Type | Default | Description |\n|---|---|---|---|\n| `benchmarks` | feature | `auto` | Google Benchmark binaries |\n| `tests` | feature | `auto` | GoogleTest suite |\n| `examples` | feature | `auto` | Example CLI |\n| `param_sweep` | feature | `disabled` | Parameter-sweep binaries (large, see below) |\n| `param_sweep_alphabet` | combo | `dna` | `dna` or `protein` when `param_sweep` is enabled |\n| `large_fastx_tests` | feature | `disabled` | Large generated FASTX test (`CUSBF_LARGE_FASTX_*` env vars) |\n\nEach **feature** option accepts `auto`, `enabled`, or `disabled`:\n\n- `auto`: on for a standalone checkout, off when cuSBF is a subproject\n- `enabled` / `disabled`: override regardless of project layout\n\n**Important:** Enabling `param_sweep` builds many binaries (208 for the DNA alphabet). 
Leave it disabled unless you need that sweep.\n\n```\n# Default standalone build\nmeson setup build\n\n# Faster configure: library + examples only\nmeson setup build -Dbenchmarks=disabled -Dtests=disabled\n\n# Subproject consumer forcing tests on\nmeson setup build -Dtests=enabled\n\n# Parameter sweep\nmeson setup build -Dparam_sweep=enabled\nmeson setup build -Dparam_sweep=enabled -Dparam_sweep_alphabet=protein\n```\n\nFallible APIs return `cusbf::Result<T>` (a thin wrapper over `cuda::std::expected<T, Error>`). Use `return Err(error)` (`cuda::std::unexpected<Error>`, deduces `Result<T>`) or `return Ok()` / `return {}` for `Result<void>`. For success with a value, `return value` is enough. Two helpers unwrap results:\n\n| Macro | On failure | Use when |\n|---|---|---|\n| `CUSBF_TRY(expr)` | Copies the error, then `return cuda::std::unexpected<Error>(...)` from the enclosing function | The caller returns `Result` (library glue, `examples/cusbf-main`) |\n| `CUSBF_UNWRAP(expr)` | `throw std::runtime_error(message())` | Tests, `main`, or other code that does not return `Result` |\n\nBoth work as statements or in initializers (`auto x = CUSBF_UNWRAP(...)`). 
For full control (typed errors, exit codes), use `if (!result)` instead.\n\n```\n#include <cusbf/filter.cuh>\n\nusing Config = cusbf::Config<31, 28, 16, 4>;\n\nint main() {\n    cusbf::filter<Config> filter(1 << 24);\n\n    CUSBF_UNWRAP(filter.insert_sequence(\"ACGTACGTACGTACGTACGTACGTACGTACGT\"));\n    const auto hits = CUSBF_UNWRAP(filter.contains_sequence(\"ACGTACGTACGTACGTACGTACGTACGTACGT\"));\n\n    CUSBF_UNWRAP(filter.insert_fastx_file(\"reference.fasta\"));\n    const auto summary = CUSBF_UNWRAP(filter.query_fastx_file(\"queries.fastq\"));\n\n    (void)hits;\n    (void)summary;\n    return 0;\n}\n```\n\nWhen the caller already returns `Result`, use `CUSBF_TRY` so failures propagate without exceptions:\n\n```\n[[nodiscard]] cusbf::Result<void> run(cusbf::filter<Config>& filter) {\n    CUSBF_TRY(filter.insert_fastx_file(\"reference.fasta\"));\n    const auto summary = CUSBF_TRY(filter.query_fastx_file(\"queries.fastq\"));\n    (void)summary;\n    return cusbf::Ok();\n}\n```\n\nAsync device APIs, record batches, and streaming FASTX callbacks follow the same pattern. 
`filter.load_factor()` and `filter.filter_bits()` are synchronous and do not return `Result`.\n\n```\nif (const auto result = filter.query_fastx_file(\"queries.fastq\"); !result) {\n    const cusbf::Error& err = result.error();\n    std::cerr << err.message() << '\\n';\n    if (const cusbf::FastxParseError* parse = err.as_fastx_parse()) {\n        // parse->location.file / .line / .column\n    }\n    return 1;\n}\n```\n\n`CUSBF_CUDA_TRY` wraps CUDA runtime calls into `Result<void>`; `CUSBF_CUDA_CALL` / `CUSBF_CUDA_ABORT` are for throw/abort paths only.\n\nThe `Config` template accepts the following parameters:\n\n| Parameter | Description | Default |\n|---|---|---|\n| `K` | k-mer length (max depends on alphabet) | - |\n| `S` | s-mer width for findere Bloom hash seed (1-K) | - |\n| `M` | Minimiser width for shard selection (1-K) | - |\n| `HashCount` | Number of independent Bloom hash functions (4, 8, 12, or 16) | 4 |\n| `CudaBlockSize` | CUDA threads per block | 256 |\n| `Alphabet` | Symbol encoding (DNA or protein) | `DnaAlphabet` |\n\n```\n#include <cusbf/filter.cuh>\n\nusing ProteinConfig = cusbf::Config<12, 10, 6, 4, 256, cusbf::ProteinAlphabet>;\n\n[[nodiscard]] cusbf::Result<void> run_protein() {\n    cusbf::filter<ProteinConfig> filter(1 << 24);\n    CUSBF_TRY(filter.insert_sequence(\"ACDEFGHIKLMNPQRSTVWY\"));\n    const auto hits = CUSBF_TRY(filter.contains_sequence(\"ACDEFGHIKLMNPQRSTVWY\"));\n    (void)hits;\n    return cusbf::Ok();\n}\n```\n\n- E. Conchon-Kerjan, T. Rouzé, L. Robidou, F. Ingels, and A. Limasset, “Super Bloom: Fast and precise filter for streaming k-mer queries,” bioRxiv, 2026, doi: 10.64898/2026.03.17.712354.\n- D. Jünger, K. Kristensen, Y. Wang, X. Yu, and B. Schmidt, “Optimizing Bloom Filters for Modern GPU Architectures.” 2025. [Online]. 
Available:\n[https://arxiv.org/abs/2512.15595](https://arxiv.org/abs/2512.15595)", "url": "https://wpnews.pro/news/show-hn-cusbf-faster-gpu-bloom-filter-for-sequence-data", "canonical_source": "https://github.com/tdortman/cuSBF", "published_at": "2026-05-27 17:30:53+00:00", "updated_at": "2026-05-27 17:46:05.553903+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-chips", "ai-research", "ai-tools", "ai-products"], "entities": ["NVIDIA", "cuSBF", "Super Bloom", "GBBF", "Cuckoo-GPU", "Intel", "Xeon W9-3595X", "RTX PRO 6000 Blackwell"], "alternates": {"html": "https://wpnews.pro/news/show-hn-cusbf-faster-gpu-bloom-filter-for-sequence-data", "markdown": "https://wpnews.pro/news/show-hn-cusbf-faster-gpu-bloom-filter-for-sequence-data.md", "text": "https://wpnews.pro/news/show-hn-cusbf-faster-gpu-bloom-filter-for-sequence-data.txt", "jsonld": "https://wpnews.pro/news/show-hn-cusbf-faster-gpu-bloom-filter-for-sequence-data.jsonld"}}