FlashLib: Bringing Flash Magic to Classical Machine Learning Operators

FlashLib, a new GPU library for classical machine learning operators, achieves speedups of up to 208× over cuML on Hopper GPUs for algorithms including KMeans, KNN, and PCA. The library is designed to accelerate the "intelligence assembly" bottleneck in agentic AI systems, where classical ML operators like clustering and retrieval are becoming critical online primitives rather than offline utilities. FlashLib also provides a flash-informative API that predicts runtime and memory footprint in microseconds without GPU profiling, using heuristic kernel selection for fast cold starts and multi-GPU support.

FlashLib: Bringing Flash Magic to Classical Machine Learning Operators Introducing FlashLib — a GPU library for classical machine learning operators on modern hardwares, rebuilt for today's ML workloads and emerging agentic AI systems. Here are a few headline results from the first release: - Significant wins over cuML on Hopper GPUs: up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB. - Flash informative API: Predict runtime, memory footprint, and overhead for any workload in ~5 µs on pure CPU , with no GPU profiling required. - Fast cold start, built to scale: FlashLib uses heuristic kernel selection to avoid long autotune loops, and already supports multi-GPU execution for large workloads. - Toward optimal hardware utilization: FlashLib drives kernels much closer to the limits of modern GPUs, with Flash-KMeans reaching up to 61% of peak FLOPs and Flash-KNN reaching up to 85.2% of peak HBM bandwidth on H200. The next frontier of AI efficiency is not just faster LLM inference. It is faster intelligence assembly. For the past few years, MLsys work largely followed a model-centric view of intelligence. As LLMs became stronger through better reasoning, larger-scale test-time compute, and more capable inference, the systems community focused on making the transformer core faster: FlashAttention, FlashDecoding, KV-cache management, and LLM serving systems etc. But the rise of agentic AI is changing the bottleneck. Modern intelligence is increasingly built around the base model through tools, harnesses, retrieval, verification, search, and orchestration. The LLM is no longer merely a standalone reasoner; it becomes a controller over a broader computational system. As a result, the performance bottleneck is no longer confined to transformer inference. It extends to the entire computational substrate surrounding the model. For example, in Agentic AI for Science, LLM agents may generate hypotheses or candidate solutions, but the surrounding loop often depends on search, clustering, nearest-neighbor retrieval, PCA, SVD, and other classical ML operators for verification and feedback. In multimodal generation and physical AI, models must increasingly process, compress, retrieve, and reorganize streaming features on the fly before they enter the model. These examples point to a broader shift: classical ML operators are becoming core primitives around the LLM model. We envision future agentic workflows where clustering, retrieval, dimensionality reduction, verification, and linear algebra are no longer offline utilities, but online primitives in the critical path of intelligence assembly. Figure 1 illustrates this shift. However, the underlying implementations of these classical operators have not kept pace with this shift. Their core design assumptions still come from the pre-FlashAttention, pre-Hopper, pre-agent era, which creates a four-way mismatch. First, many operators carry natural implementations that are unfriendly to GPUs. Second, many libraries ship one static kernel implementation across all workloads and hardware tiers, leaving modern GPU hardware features unexploited. Third, many libraries are unaware of the user's precision needs: they expose no way to declare a precision budget, leaving users unable to ask for the fastest algorithm that meets their tolerance. Fourth, the performance is black box: costly to profile, hard to modify, and impossible to budget without first reading the codebase, which leaves both developers and LLM-based agents in the dark. FlashLib is our attempt to close these gaps and accelerate this emerging substrate, making it fast enough to sit inside the loop of agentic AI. It transforms classical ML operators from slow, offline utilities into fast, online ML primitives. Moreover, FlashLib exposes flash-informative APIs that reveal the cost, tolerance, and execution behavior of these primitives to higher-level agentic pipelines, thus enabling better scheduling and orchestration. We would also like to point out that while FlashLib is motivated by the emerging needs of LLM-centric and agentic AI systems, we also recognize that classical ML algorithms remain widely used across today's machine learning stacks. Beyond generative AI, operators such as KMeans, KNN, PCA, SVD, t-SNE, and HDBSCAN etc are still core building blocks for recommendation systems, retrieval pipelines, scientific computing, anomaly detection, visualization, and preprocessing for downstream ML models. FlashLib provides a fast, easy-to-use, and adaptive software stack that covers these diverse applications with plug-and-play GPU acceleration. We built FlashLib around four design principles. First, we reshape the algorithm to fit the hardware while achieving mathematical equivalence. Second, we build kernel variants per operator to fully exploit different workloads on different hardwares using modern hardware features. Third, we let users declare a precision budget and route to the fastest algorithm that meets it. Fourth, we keep the entire library transparent enough that users and LLM agents can easily read, compose, and modify the kernels. Mathematically Equivalent Reformulation: Rewriting Operators to Be GPU-Friendly Many classical ML operators have natural implementations that are unfriendly to GPUs: they materialize large intermediates in HBM, introduce atomic contention, or run reductions along dimensions that don't tile well. FlashLib's first principle is to rewrite these into mathematically equivalent forms that are friendly to modern accelerators. KMeans assign is the clearest example: the natural implementation forms an N × K distance matrix in HBM and runs an argmin per row, but the streaming-fused version keeps the running local minima in registers and never materializes the matrix. The same pattern recurs across the library: KNN's fused top-K skips the ‖ x ‖2 term in ‖ x − y ‖2 = ‖ x ‖2 + ‖ y ‖2 − 2⟨ x , y ⟩, PCA's dual-Gram routing picks the smaller of X ⊤ X D × D and X X ⊤ N × N , avoiding the wasted O max N , D 3 eigh that cuML's fixed D × D path runs on wide data, MultinomialNB changes atomic scatter for segment-level reduction, and t-SNE's gradient never materializes the N × N Q matrix. Hardware-Aware Implementation: Kernel Variants for Different Hardware and Workloads To map these mathematical formulations directly to silicon, FlashLib builds multiple kernel variants that adapt to both the hardware and the workload. Flash-KNN illustrates this approach. First, at the backend layer, we ship a portable Triton implementation for both Ampere and Hopper. For Hopper, an opt-in CuteDSL FA3 backend additionally unlocks modern features like TMA fetching and warp-specialized pipelines. Second, at the kernel layer, the design adapts to the workload. For large queries, the kernel mirrors standard FlashAttention to maximize TensorCore utilization. For small queries against a huge corpus, it mirrors Flash-Decoding: a split-k layout distributes work across the corpus dimension to prevent SM idling. Third, at the heuristic layer, we choose hyperparameters like tile sizes and warp counts based on hardware characteristics such as cache size and register capacity. As a result, even a Q =1 query against a 100M-vector corpus holds the kernel at 85.2% of the H200's peak HBM bandwidth . Tolerance-Driven Dispatch: Routing to the Fastest Algorithm within a Precision Budget FlashLib also exposes the speed-accuracy tradeoff as a user choice. Classical scientific computing often demands high precision in FP32 or even FP64 — for solving PDEs, certifying numerical methods, or anywhere a small error cascades into a wrong answer. Many AI workloads have no such requirement: a clustering pass over embedding vectors, a top-K retrieval, or a regression on noisy data can absorb a small declared residual for a substantial speedup. FlashLib makes that distinction the user's to draw, through a single per-call argument, tol . At tol=None , reductions stay in exact precision and the call rides the kernel-fusion wins from above. At tol 0 , the dispatcher routes through a Pareto frontier of precision emulation fused variants like 3xbf16 and Ozaki-II INT8 and algorithm substitution Halko subspace iteration , picking whichever has the highest throughput within the declared residual. php GEMM: same call, different tol - different variant. flashlib.gemm A, B exact fp32 flashlib.gemm A, B, tol=1e-3 bf16 flashlib.gemm A, B, tol=1e-5 3xbf16 cute-fused flashlib.gemm A, B, tol=1e-7 ozaki2 cute s=8 : tighter AND faster flashlib.gemm A, B, tol=1e-12 ozaki2 int8 s=14 : FP64-grade PCA: tol unlocks an algorithm substitution, not just precision. flashlib.flash pca X, K=32 exact eigh on Gram / cov matrix flashlib.flash pca X, K=32, tol=1e-4 Halko subspace: ~30x faster Agent-Native API: Transparent Source and Predictable Cost for Users and Agents In an era when LLM-based agents increasingly read, call, and modify performance code, the cost of a library is not just its kernel throughput but how legible its cost model and source are. FlashLib is written in Triton and CuteDSL with no opaque binaries — every kernel from flash kmeans ... down to the tl.dot call is editable. And every primitive ships a GPU-free cost-prediction surface: flashlib.info.estimate ... takes a shape and a tolerance and returns a recursive tree of runtime, FLOPs, HBM bytes, and bound regime in ~5 microseconds on pure CPU, never importing torch, triton, or cutlass. An LLM agent can compose a pipeline of ten primitives, walk that cost tree, and decide whether the budget fits before spending a single FLOP. python import flashlib.info as info pure stdlib -- no torch/triton/cutlass. Predict cost without touching the GPU -- ~5 microseconds on pure CPU. est = info.estimate "pca", shape= 1 000 000, 512 , params={"K": 32}, device="H200" print est.summary line pca 13.18 ms bound=compute 42 TF 83% peak res~1e-7 roofline est.print tree walk the recursive call-stack tree pca 13.18 ms 2.18 GB compute 42 TF res~1e-7 ├── cov gemm 10.49 ms 2.05 GB compute 50 TF ├── eigh 0.12 ms 0.00 GB compute 3 TF └── transform 2.57 ms 2.18 GB compute 13 TF Pareto-optimal GEMM variants for a 4Kx4Kx4K matmul on H200: for v in info.pareto "gemm", shape= 4096, 4096, 4096 , device="H200" : print v Variant 'gemm fp16' : 0.2 ms residual~8e-04 Variant 'gemm tf32' : 0.4 ms residual~3e-04 Variant 'gemm 3xfp16' : 0.5 ms residual~2e-04 Variant 'gemm fp16 x3 kahan' : 0.6 ms residual~5e-07 Variant 'gemm ozaki2 cute' : 0.8 ms residual~2e-15 Benchmarks All results below are measured on a single NVIDIA H200 SM90, 150 GB HBM3e with CUDA 13.0 , driver 580.126 , PyTorch 2.11 , Triton 3.6 , against cuML 25.10 . Every cell is the median over 5 iterations with the first call discarded for JIT amortization; inputs are GPU-resident on both sides; matched-algorithm rows same algorithm , method , svd solver are paired with reduced-precision and algorithmic-shortcut rows so the comparison is fair at every shape. 1. Breadth: speedup over cuML across 13 primitives The first benchmark is a broad sweep: 13 primitives × 194 shape, dtype, hyperparameter cells, all run against cuml 25.10 on the same H200. Every cell here is an apples-to-apples comparison: matched algorithm, matched precision, matched hyperparameters — FlashLib is forbidden from using any reduced-precision GEMM no bf16/fp16/Ozaki or algorithmic shortcut no Halko, no FFT t-SNE, no NN-Descent KNN . Because of this, the bars below are strictly lower than the headline numbers at the top of the post: the hero stats are FlashLib's best ceiling speedups on each primitive which, where applicable, do let the user trade precision or algorithmic exactness for throughput via the tol knob from Principle 03 , whereas the broad sweep deliberately removes those degrees of freedom to isolate the pure kernel-engineering win. The bar chart below collapses each primitive's 8–34 cells into a single bar — Geomean shows the geometric-mean speedup across all of that primitive's cells, and Max shows the per-primitive ceiling on the most favourable cell. FlashLib is at least as fast as cuML on 193 / 194 cells; 126 cells cross 5×, and 11 cross 50×. 2. Depth: precision × runtime trade-off inside one primitive GEMM The second benchmark zooms into a single primitive, GEMM at 4096³ on H200, to show what tolerance routing looks like in practice. FlashLib ships ~10 GEMM variants in flashlib.linalg.gemm — bf16, fp16, tf32, fused multi-pass 3xbf16 , fp16 x3 kahan , and the Ozaki-II INT8 family ozaki2 cute , ozaki2 int8 — all behind the single tol -routed dispatcher from Principle 03. The scatter below plots each variant on RMS relative error vs an FP64 reference against per-call runtime. The dashed curve is the Pareto frontier : variants below it dominate the rest. The interesting point is ozaki2 cute s=8 : it sits below and to the left of fp32 , meaning it is simultaneously tighter and ~2× faster than the native fp32 path on this shape. 3on H200 3. Agent loops: Does FlashLib accelerate the agent loop? Our third benchmark returns to the core motivation of FlashLib: many agentic workflows are bottlenecked not only by LLM rollout, but also by auxiliary operators such as retrieval, clustering, vector search, and numerical routines. We therefore evaluate FlashLib in a meta setting: whether giving a coding agent access to FlashLib helps it ship a faster end-to-end system. Concretely, we task Claude Code Opus 4.7 with building a GPU vector-search backend, based on the GPU port of KCORES/vector-db-bench https://github.com/KCORES/vector-db-bench , under a 1M-token budget. The goal is to maximize QPS subject to a strict recall constraint of at least 0.999 on SIFT-1M. We evaluate the resulting system across three workloads: offline batch search, online single-query search, and streaming search. We run the same task twice, changing only whether the agent has access to FlashLib. On the two settings where flashlib unlocks a structural win offline batch and streaming , the flashlib-prompted agent reaches 5–6× higher QPS by directly adopting flash knn instead of laboriously rediscovering the cuVS path and slowly bolting on TF32. On online single, where per-query launch overhead dominates either way, both variants converge to roughly the same ceiling. Just as importantly, the flashlib-prompted agent finishes the task naturally at 904k tokens with budget to spare; the default agent runs out at 1.07M tokens with no further ideas on hand — it hit a 50k QPS plateau on the dominant offline-batch task and could not escape it within the budget. Example: Flash-KMeans in a few lines FlashLib ships on PyPI and installs with a single command. The smart dispatcher in flashlib.primitives.kmeans takes a GPU tensor of shape N, D or B, N, D and returns the cluster IDs and centroids; it picks between the Triton path, the Hopper-only CuteDSL FA3-style fused-assign path, and a pure-torch fallback automatically based on the shape. bash $ pip install flashlib python import torch from flashlib.primitives.kmeans import flash kmeans 1M points, 128 dims on a single H200 -- runs in a few ms. x = torch.randn 1 000 000, 128, device="cuda", dtype=torch.float32 labels, centroids = flash kmeans x, n clusters=1024, max iters=20 labels: 1, 1 000 000 int64 -- cluster assignment per point centroids: 1, 1024, 128 float32 -- final cluster centers If you want to know what the call will cost before launching it — useful for an agent budgeting a pipeline of ten primitives — the same shape goes through the pure-stdlib cost API from Principle 04: python import flashlib.info as info est = info.estimate "kmeans", shape= 1 000 000, 128 , params={"K": 1024, "max iters": 20} print est.summary line kmeans ~7 ms bound=compute ~140 TF 64% peak calibrated Other primitives follow the same shape: flash knn , flash pca , flash hdbscan , flash tsne , flashlib.gemm , and the rest of the catalog are all one-call drop-ins with the same tol / backend knobs and the same flashlib.info.estimate ... cost surface. Limitations and Future Work FlashLib's current release covers an important but incomplete slice of the classical-ML operator landscape — primarily the clustering, nearest-neighbour, dimensionality-reduction, and dense linear-algebra primitives most central to our own agentic-AI workloads. Extending the catalog is one of our top priorities; upcoming releases will deepen coverage of the existing families and add new ones Gaussian mixtures, kernel methods, graph-based primitives, sparse inputs . A second limitation is hardware coverage: development and benchmarking have so far concentrated on H200. The dispatcher and kernels are written to be portable, but broader measurement and tuning across a wider range of hardware is still needed. Acknowledgements FlashLib was deeply inspired by several open-source efforts that pioneered the “flash” design playbook on which this work is built: FlashAttention https://github.com/Dao-AILab/flash-attention , FlashDecoding https://crfm.stanford.edu/2023/10/12/flashdecoding.html , NVIDIA's cuVS https://github.com/rapidsai/cuvs and cuML https://github.com/rapidsai/cuml , FlashLinearAttention https://github.com/fla-org/flash-linear-attention , and FlashInfer https://github.com/flashinfer-ai/flashinfer . We are also grateful to Dacheng Li UC Berkeley , Shiyi Cao UC Berkeley , Melissa Pan UC Berkeley , and Zihao Ye University of Washington for many helpful discussions, and to the flash-kmeans and Sparse VideoGen communities for valuable feedback on early prototypes. Citation If FlashLib is useful in your research or product, please cite the project as: @misc{yang2026flashlib, title = {FlashLib: Bringing Flash Magic to Classical Machine Learning Operators}, author = {Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Mang, Qiuyang and Wang, Zhe and Sun, Shanlin and Keutzer, Kurt and Gonzalez, Joseph E. and Han, Song and Xu, Chenfeng and Stoica, Ion}, year = {2026}, url = {https://flashml-org.github.io/}, }