Show HN: cuSBF – Faster GPU Bloom Filter for Sequence Data

A new GPU-accelerated Bloom filter implementation, cuSBF, achieves up to 234 times faster k-mer queries and 92 times faster insertions compared to CPU-based Super Bloom filters for DNA and protein sequence analysis. Developed for NVIDIA GPUs with compute capability 8.0 or higher, the header-only C++ library uses minimizer-based shard selection and findere false-positive reduction to optimize memory bandwidth for streaming sequence data. Benchmarks on an RTX PRO 6000 Blackwell GPU show cuSBF outperforming existing GPU filters including GBBF, Cuckoo-GPU, TCF, and GQF by 7.6 to 3,427 times across various metrics.
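To make the two ideas named above concrete, here is a minimal CPU-only sketch in standard C++. It is not cuSBF code: every name in it (`minimizer_pos`, `super_kmers`, `shard_of`, `insert_smers`, `contains_kmer`) is hypothetical, m-mers are ordered lexicographically where real tools would typically rank them by hash, and a `std::unordered_set` stands in for the Bloom filter. It only illustrates how consecutive k-mers sharing a minimizer form one run that maps to a single shard, and how a findere-style query accepts a k-mer only when all of its s-mers are found.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Position of the smallest m-mer inside the k-mer that starts at `start`.
// Assumption: plain lexicographic order, leftmost occurrence wins ties.
std::size_t minimizer_pos(std::string_view seq, std::size_t start,
                          std::size_t k, std::size_t m) {
    assert(m >= 1 && m <= k && start + k <= seq.size());
    std::size_t best = start;
    for (std::size_t i = start + 1; i + m <= start + k; ++i) {
        if (seq.substr(i, m) < seq.substr(best, m)) best = i;
    }
    return best;
}

struct SuperKmer {
    std::size_t first_kmer;    // index of the first k-mer in the run
    std::size_t kmer_count;    // consecutive k-mers sharing the minimizer
    std::size_t minimizer_at;  // position of the shared minimizer in the sequence
    std::string minimizer;
};

// Split a sequence into runs of consecutive k-mers with the same minimizer.
std::vector<SuperKmer> super_kmers(std::string_view seq, std::size_t k,
                                   std::size_t m) {
    std::vector<SuperKmer> runs;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        const std::size_t pos = minimizer_pos(seq, i, k, m);
        if (!runs.empty() && runs.back().minimizer_at == pos) {
            ++runs.back().kmer_count;
        } else {
            runs.push_back({i, 1, pos, std::string(seq.substr(pos, m))});
        }
    }
    return runs;
}

// Every k-mer of a run maps to one shard, chosen from the minimizer alone.
std::size_t shard_of(const SuperKmer& run, std::size_t shard_count) {
    return std::hash<std::string>{}(run.minimizer) % shard_count;
}

// findere-style indexing: store every overlapping s-mer of the sequence.
void insert_smers(std::unordered_set<std::string>& store,
                  std::string_view seq, std::size_t s) {
    for (std::size_t i = 0; i + s <= seq.size(); ++i) {
        store.insert(std::string(seq.substr(i, s)));
    }
}

// A k-mer is reported present only if all k - s + 1 of its s-mers are found.
bool contains_kmer(const std::unordered_set<std::string>& store,
                   std::string_view kmer, std::size_t s) {
    assert(s >= 1 && s <= kmer.size());
    for (std::size_t i = 0; i + s <= kmer.size(); ++i) {
        if (store.count(std::string(kmer.substr(i, s))) == 0) return false;
    }
    return true;
}
```

With `k = 5` and `m = 2`, `super_kmers("GGACTTGG", 5, 2)` yields two runs: the first three k-mers share the minimizer `AC`, and the last k-mer has minimizer `CT`. A single shard lookup therefore serves three consecutive k-mers in the first run.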

cuSBF is a high-performance GPU implementation of the [Super Bloom filter](https://www.biorxiv.org/content/10.64898/2026.03.17.712354v1.article-info), optimized for high-throughput batch k-mer insertion and query on nucleotide (DNA) and protein sequences (or any other sequence type, as long as a valid alphabet is provided). It exploits the streaming nature of sequence-derived k-mers: minimizers group consecutive k-mers that share the same minimizer into super-k-mers, and all k-mers of a super-k-mer are assigned to the same 256-bit memory shard. This amortizes random memory accesses across consecutive k-mer queries, reducing memory-bandwidth pressure. The findere scheme further reduces false positives by inserting overlapping s-mers and requiring a full run of consecutive s-mer matches.

- CUDA-accelerated batch k-mer insert and query from sequences
- Configurable k-mer length, minimizer width, s-mer width, and hash function count
- Minimizer-based shard selection for cache-efficient streaming queries
- Findere false-positive reduction via overlapping s-mer membership
- Header-only library design
- FASTA/FASTQ stream and file support

Benchmarks use `Config<31, 28, 16, 4>` on an NVIDIA RTX PRO 6000 Blackwell GPU. CPU Super Bloom runs on an Intel Xeon W9-3595X with 120 threads.

Compared against:

- CPU: [Super Bloom](https://github.com/EtienneC-K/SuperBloom)
- GPU: [Blocked Bloom filter (GBBF)](https://github.com/NVIDIA/cuCollections)
- GPU: [Cuckoo-GPU](https://github.com/tdortman/Cuckoo-GPU)
- GPU: [Bulk Two-Choice Filter (TCF)](https://github.com/saltsystemslab/gpu-filters/tree/main/bulk-tcf)
- GPU: [Counting Quotient Filter (GQF)](https://github.com/saltsystemslab/gpu-filters/tree/main/gqf)

| Comparison | Insert | Query |
|---|---|---|
| cuSBF vs Super Bloom | 92× faster | 234× faster |
| cuSBF vs GBBF | 9.1× faster | 7.7× faster |
| cuSBF vs Cuckoo-GPU | 80× faster | 8.0× faster |
| cuSBF vs TCF | 12× faster | 52× faster |
| cuSBF vs GQF | 69× faster | 13× faster |

| Comparison | Insert | Query |
|---|---|---|
| cuSBF vs Super Bloom | 59× faster | 165× faster |
| cuSBF vs GBBF | 8.2× faster | 7.6× faster |
| cuSBF vs Cuckoo-GPU | 3427× faster | 7.8× faster |
| cuSBF vs TCF | 12× faster | 67× faster |
| cuSBF vs GQF | 42× faster | 11× faster |

| Bits/k-mer | cuSBF s=28 | cuSBF s=30 | cuSBF s=31 | GBBF |
|---|---|---|---|---|
| 21.4 | 0.848% | 0.951% | 1.593% | 3.069% |
| 85.7 | 0.091% | 0.107% | 0.210% | 0.126% |
| 342.6 | 0.0095% | 0.0114% | 0.0264% | 0.0273% |

- Linux (x86_64 or aarch64) with an NVIDIA GPU and driver
- CUDA Toolkit >= 13.1
- GCC or Clang host compiler (C++20)
- Meson and Ninja
- NVIDIA GPU with compute capability 8.0+ (Ampere, Lovelace, Hopper, Blackwell)

cuSBF is developed and tested on Linux only. WSL2 on Windows is a reasonable dev environment (see the [NVIDIA docs](https://docs.nvidia.com/cuda/wsl-user-guide/index.html)). Native Windows and macOS are not supported or tested. The build uses Linux-specific FASTX paths (for example `mmap`) and host tooling assumptions (GCC/Clang, GNU statement expressions in `CUSBF_TRY` / `CUSBF_UNWRAP`).

```bash
meson setup build
ninja -C build
```

When this repo is the root Meson project, `benchmarks`, `tests`, and `examples` build by default. As a subproject they are skipped unless you force them on.

| Option | Type | Default | Description |
|---|---|---|---|
| `benchmarks` | feature | `auto` | Google Benchmark binaries |
| `tests` | feature | `auto` | GoogleTest suite |
| `examples` | feature | `auto` | Example CLI |
| `param_sweep` | feature | `disabled` | Parameter-sweep binaries (large, see below) |
| `param_sweep_alphabet` | combo | `dna` | `dna` or `protein` when `param_sweep` is enabled |
| `large_fastx_tests` | feature | `disabled` | Large generated FASTX test (`CUSBF_LARGE_FASTX` env vars) |

Each feature option accepts `auto`, `enabled`, or `disabled`:

- `auto`: on for a standalone checkout, off when cuSBF is a subproject
- `enabled` / `disabled`: override regardless of project layout

**Important:** Enabling `param_sweep` builds many binaries (208 for the DNA alphabet). Leave it disabled unless you need that sweep.

```bash
# Default standalone build
meson setup build

# Faster configure: library + examples only
meson setup build -Dbenchmarks=disabled -Dtests=disabled

# Subproject consumer forcing tests on
meson setup build -Dtests=enabled

# Parameter sweep
meson setup build -Dparam_sweep=enabled
meson setup build -Dparam_sweep=enabled -Dparam_sweep_alphabet=protein
```

Fallible APIs return `cusbf::Result<T>`, a thin wrapper over `cuda::std::expected<T, Error>`. Use `return Err(error)` (`cuda::std::unexpected<Error>`, deduces `Result<T>`) or `return Ok()` / `return {}` for `Result<void>`. For success with a value, `return value` is enough.

Two helpers unwrap results:

| Macro | On failure | Use when |
|---|---|---|
| `CUSBF_TRY(expr)` | Copies the error, then `return cuda::std::unexpected<Error>(...)` from the enclosing function | The caller returns `Result` (library glue, `examples/cusbf-main`) |
| `CUSBF_UNWRAP(expr)` | `throw std::runtime_error(message)` | Tests, `main`, or other code that does not return `Result` |

Both work as statements or in initializers (`auto x = CUSBF_UNWRAP(...)`). For full control (typed errors, exit codes), use `if (!result)` instead.

```cpp
#include <cusbf/filter.cuh>

using Config = cusbf::Config<31, 28, 16, 4>;

int main() {
    cusbf::filter<Config> filter(1 << 24);

    // Insert and query a single in-memory sequence.
    CUSBF_UNWRAP(filter.insert_sequence("ACGTACGTACGTACGTACGTACGTACGTACGT"));
    const auto hits = CUSBF_UNWRAP(filter.contains_sequence("ACGTACGTACGTACGTACGTACGTACGTACGT"));

    // Insert a reference file, then query a read file against it.
    CUSBF_UNWRAP(filter.insert_fastx_file("reference.fasta"));
    const auto summary = CUSBF_UNWRAP(filter.query_fastx_file("queries.fastq"));

    (void)hits;
    (void)summary;
    return 0;
}
```

When the caller already returns `Result`, use `CUSBF_TRY` so failures propagate without exceptions:

```cpp
[[nodiscard]] cusbf::Result<void> run(cusbf::filter<Config>& filter) {
    CUSBF_TRY(filter.insert_fastx_file("reference.fasta"));
    const auto summary = CUSBF_TRY(filter.query_fastx_file("queries.fastq"));
    (void)summary;
    return cusbf::Ok();
}
```

Async device APIs, record batches, and streaming FASTX callbacks follow the same pattern. `filter.load_factor()` and `filter.filter_bits()` are synchronous and do not return `Result`.

```cpp
if (const auto result = filter.query_fastx_file("queries.fastq"); !result) {
    const cusbf::Error& err = result.error();
    std::cerr << err.message() << '\n';
    if (const cusbf::FastxParseError* parse = err.as_fastx_parse()) {
        // parse->location.file / .line / .column
    }
    return 1;
}
```

`CUSBF_CUDA_TRY` wraps CUDA runtime calls into `Result<void>`; `CUSBF_CUDA_CALL` / `CUSBF_CUDA_ABORT` are for throw/abort paths only.

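The propagate-on-error behaviour described for `CUSBF_TRY` can be illustrated without CUDA. The sketch below is an analogue under stated assumptions, not cuSBF's macro: `Result`, `Error`, `SKETCH_TRY`, `parse_digit`, and `sum_digits` are all hypothetical names, `Result` is built on `std::variant` where the library wraps `cuda::std::expected`, and the macro relies on GNU statement expressions, so it needs GCC or Clang.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

struct Error {
    std::string message;
};

// Minimal stand-in for an expected-like type: holds either a T or an Error.
template <class T>
class Result {
public:
    Result(T value) : state_(std::move(value)) {}
    Result(Error error) : state_(std::move(error)) {}

    bool ok() const { return std::holds_alternative<T>(state_); }
    T& value() {
        assert(ok());
        return std::get<T>(state_);
    }
    const Error& error() const { return std::get<Error>(state_); }

private:
    std::variant<T, Error> state_;
};

// On failure, return the error from the enclosing function; on success,
// the whole macro evaluates to the contained value (GNU statement expression).
#define SKETCH_TRY(expr)                                          \
    ({                                                            \
        auto sketch_result_ = (expr);                             \
        if (!sketch_result_.ok()) return sketch_result_.error();  \
        std::move(sketch_result_.value());                        \
    })

Result<int> parse_digit(char c) {
    if (c < '0' || c > '9') return Error{std::string("not a digit: ") + c};
    return c - '0';
}

// The first bad character aborts the loop and propagates its error.
Result<int> sum_digits(std::string_view text) {
    int total = 0;
    for (char c : text) {
        const int digit = SKETCH_TRY(parse_digit(c));
        total += digit;
    }
    return total;
}
```

`sum_digits("123")` succeeds with the value 6, while `sum_digits("1x3")` stops at the second character and returns the error produced by `parse_digit`, with no exception thrown along the way.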
The `Config` template accepts the following parameters:

| Parameter | Description | Default |
|---|---|---|
| `K` | k-mer length (max depends on alphabet) | - |
| `S` | s-mer width for findere Bloom hash seed (1 to K) | - |
| `M` | Minimizer width for shard selection (1 to K) | - |
| `HashCount` | Number of independent Bloom hash functions (4, 8, 12, 16) | 4 |
| `CudaBlockSize` | CUDA threads per block | 256 |
| `Alphabet` | Symbol encoding (DNA or protein) | `DnaAlphabet` |

```cpp
#include <cusbf/filter.cuh>

using ProteinConfig = cusbf::Config<12, 10, 6, 4, 256, cusbf::ProteinAlphabet>;

[[nodiscard]] cusbf::Result<void> run_protein() {
    cusbf::filter<ProteinConfig> filter(1 << 24);
    CUSBF_TRY(filter.insert_sequence("ACDEFGHIKLMNPQRSTVWY"));
    const auto hits = CUSBF_TRY(filter.contains_sequence("ACDEFGHIKLMNPQRSTVWY"));
    (void)hits;
    return cusbf::Ok();
}
```

- E. Conchon-Kerjan, T. Rouzé, L. Robidou, F. Ingels, and A. Limasset, “Super Bloom: Fast and precise filter for streaming k-mer queries,” bioRxiv, 2026, doi: 10.64898/2026.03.17.712354.
- D. Jünger, K. Kristensen, Y. Wang, X. Yu, and B. Schmidt, “Optimizing Bloom Filters for Modern GPU Architectures,” 2025. [Online]. Available: https://arxiv.org/abs/2512.15595