# FlashLib: Bringing Flash Magic to Classical Machine Learning Operators

> Source: <https://flashml-org.github.io/>
> Published: 2026-05-27 04:12:45+00:00

# FlashLib: Bringing Flash Magic to Classical Machine Learning Operators

Introducing **FlashLib** — a GPU library for classical machine learning operators on modern hardwares, rebuilt for today's ML workloads and emerging agentic AI systems. Here are a few headline results from the first release:

- Significant wins over cuML on Hopper GPUs: up to
**26×** on KMeans,**19×** on KNN,**40×** on HDBSCAN,**208×** on TruncatedSVD,**47×** on PCA,**147×** on exact t-SNE, and**49×** on MultinomialNB. - Flash informative API: Predict runtime, memory footprint, and overhead for any workload in
**~5 µs on pure CPU**, with no GPU profiling required. - Fast cold start, built to scale: FlashLib uses heuristic kernel selection to avoid long autotune loops, and already supports multi-GPU execution for large workloads.
- Toward optimal hardware utilization: FlashLib drives kernels much closer to the limits of modern GPUs, with Flash-KMeans reaching up to
**61% of peak FLOPs** and Flash-KNN reaching up to**85.2% of peak HBM bandwidth** on H200.

The next frontier of AI efficiency is not just faster LLM inference. It is faster intelligence assembly. For the past few years, MLsys work largely followed a model-centric view of intelligence. As LLMs became stronger through better reasoning, larger-scale test-time compute, and more capable inference, the systems community focused on making the transformer core faster: FlashAttention, FlashDecoding, KV-cache management, and LLM serving systems etc.

But the rise of agentic AI is changing the bottleneck. Modern intelligence is increasingly built around the base model through tools, harnesses, retrieval, verification, search, and orchestration. The LLM is no longer merely a standalone reasoner; it becomes a controller over a broader computational system. As a result, the performance bottleneck is no longer confined to transformer inference. It extends to the entire computational substrate surrounding the model. For example, in Agentic AI for Science, LLM agents may generate hypotheses or candidate solutions, but the surrounding loop often depends on search, clustering, nearest-neighbor retrieval, PCA, SVD, and other classical ML operators for verification and feedback. In multimodal generation and physical AI, models must increasingly process, compress, retrieve, and reorganize streaming features on the fly before they enter the model. These examples point to a broader shift: classical ML operators are becoming core primitives around the LLM model. We envision future agentic workflows where clustering, retrieval, dimensionality reduction, verification, and linear algebra are no longer offline utilities, but online primitives in the critical path of intelligence assembly. Figure 1 illustrates this shift.

However, the underlying implementations of these classical operators have not kept pace with this shift. Their core design assumptions still come from the pre-FlashAttention, pre-Hopper, pre-agent era, which creates a four-way mismatch. First, many operators carry natural implementations that are unfriendly to GPUs. Second, many libraries ship one static kernel implementation across all workloads and hardware tiers, leaving modern GPU hardware features unexploited. Third, many libraries are unaware of the user's precision needs: they expose no way to declare a precision budget, leaving users unable to ask for the fastest algorithm that meets their tolerance. Fourth, the performance is black box: costly to profile, hard to modify, and impossible to budget without first reading the codebase, which leaves both developers and LLM-based agents in the dark.

FlashLib is our attempt to close these gaps and accelerate this emerging substrate, making it fast enough to sit inside the loop of agentic AI. It transforms classical ML operators from slow, offline utilities into fast, online ML primitives. Moreover, FlashLib exposes flash-informative APIs that reveal the cost, tolerance, and execution behavior of these primitives to higher-level agentic pipelines, thus enabling better scheduling and orchestration. We would also like to point out that while FlashLib is motivated by the emerging needs of LLM-centric and agentic AI systems, we also recognize that classical ML algorithms remain widely used across today's machine learning stacks. Beyond generative AI, operators such as KMeans, KNN, PCA, SVD, t-SNE, and HDBSCAN etc are still core building blocks for recommendation systems, retrieval pipelines, scientific computing, anomaly detection, visualization, and preprocessing for downstream ML models. FlashLib provides a fast, easy-to-use, and adaptive software stack that covers these diverse applications with plug-and-play GPU acceleration.

We built FlashLib around four design principles. First, we reshape the algorithm to fit the hardware while achieving mathematical equivalence. Second, we build kernel variants per operator to fully exploit different workloads on different hardwares using modern hardware features. Third, we let users declare a precision budget and route to the fastest algorithm that meets it. Fourth, we keep the entire library transparent enough that users and LLM agents can easily read, compose, and modify the kernels.

## Mathematically Equivalent Reformulation: Rewriting Operators to Be GPU-Friendly

Many classical ML operators have natural implementations that are unfriendly to GPUs: they materialize large intermediates in HBM, introduce atomic contention, or run reductions along dimensions that don't tile well. FlashLib's first principle is to rewrite these into mathematically equivalent forms that are friendly to modern accelerators. KMeans assign is the clearest example: the natural implementation forms an *N*×*K* distance matrix in HBM and runs an `argmin`

per row, but the streaming-fused version keeps the running local minima in registers and never materializes the matrix. The same pattern recurs across the library: KNN's fused top-K skips the ‖*x*‖2 term in ‖*x* − *y*‖2 = ‖*x*‖2 + ‖*y*‖2 − 2⟨*x*, *y*⟩, PCA's dual-Gram routing picks the smaller of *X*⊤*X* (*D*×*D*) and *X X*⊤ (*N*×*N*), avoiding the wasted O(max(*N*,*D*)3) `eigh`

that cuML's fixed *D*×*D* path runs on wide data, MultinomialNB changes atomic scatter for segment-level reduction, and t-SNE's gradient never materializes the *N*×*N* Q matrix.

## Hardware-Aware Implementation: Kernel Variants for Different Hardware and Workloads

To map these mathematical formulations directly to silicon, FlashLib builds multiple kernel variants that adapt to both the hardware and the workload. Flash-KNN illustrates this approach. First, at the backend layer, we ship a portable Triton implementation for both Ampere and Hopper. For Hopper, an opt-in CuteDSL FA3 backend additionally unlocks modern features like TMA fetching and warp-specialized pipelines. Second, at the kernel layer, the design adapts to the workload. For large queries, the kernel mirrors standard FlashAttention to maximize TensorCore utilization. For small queries against a huge corpus, it mirrors Flash-Decoding: a split-k layout distributes work across the corpus dimension to prevent SM idling. Third, at the heuristic layer, we choose hyperparameters like tile sizes and warp counts based on hardware characteristics such as cache size and register capacity. As a result, even a *Q*=1 query against a 100M-vector corpus holds the kernel at **85.2% of the H200's peak HBM bandwidth**.

## Tolerance-Driven Dispatch: Routing to the Fastest Algorithm within a Precision Budget

FlashLib also exposes the speed-accuracy tradeoff as a user choice. Classical scientific computing often demands high precision in FP32 or even FP64 — for solving PDEs, certifying numerical methods, or anywhere a small error cascades into a wrong answer. Many AI workloads have no such requirement: a clustering pass over embedding vectors, a top-K retrieval, or a regression on noisy data can absorb a small declared residual for a substantial speedup. FlashLib makes that distinction the user's to draw, through a single per-call argument, `tol`

. At `tol=None`

, reductions stay in exact precision and the call rides the kernel-fusion wins from above. At `tol > 0`

, the dispatcher routes through a Pareto frontier of precision emulation (fused variants like `3xbf16`

and Ozaki-II INT8) and algorithm substitution (Halko subspace iteration), picking whichever has the highest throughput within the declared residual.

``` php
# GEMM: same call, different tol -> different variant.
flashlib.gemm(A, B)               # exact fp32
flashlib.gemm(A, B, tol=1e-3)     # bf16
flashlib.gemm(A, B, tol=1e-5)     # 3xbf16 (cute-fused)
flashlib.gemm(A, B, tol=1e-7)     # ozaki2_cute(s=8): tighter AND faster
flashlib.gemm(A, B, tol=1e-12)    # ozaki2_int8(s=14): FP64-grade

# PCA: tol unlocks an algorithm substitution, not just precision.
flashlib.flash_pca(X, K=32)            # exact eigh on Gram / cov matrix
flashlib.flash_pca(X, K=32, tol=1e-4)  # Halko subspace: ~30x faster
```

## Agent-Native API: Transparent Source and Predictable Cost for Users and Agents

In an era when LLM-based agents increasingly read, call, and modify performance code, the cost of a library is not just its kernel throughput but how legible its cost model and source are. FlashLib is written in Triton and CuteDSL with no opaque binaries — every kernel from `flash_kmeans(...)`

down to the `tl.dot`

call is editable. And every primitive ships a GPU-free cost-prediction surface: `flashlib.info.estimate(...)`

takes a shape and a tolerance and returns a recursive tree of runtime, FLOPs, HBM bytes, and bound regime in ~5 microseconds on pure CPU, never importing torch, triton, or cutlass. An LLM agent can compose a pipeline of ten primitives, walk that cost tree, and decide whether the budget fits before spending a single FLOP.

``` python
import flashlib.info as info   # pure stdlib -- no torch/triton/cutlass.

# Predict cost without touching the GPU -- ~5 microseconds on pure CPU.
est = info.estimate("pca", shape=(1_000_000, 512), params={"K": 32}, device="H200")

print(est.summary_line())
# pca   13.18 ms  bound=compute   42 TF  (83% peak)  res~1e-7  [roofline]

est.print_tree()              # walk the recursive call-stack tree
# pca           13.18 ms  2.18 GB  compute   42 TF  res~1e-7
# ├── cov_gemm  10.49 ms  2.05 GB  compute   50 TF
# ├── eigh       0.12 ms  0.00 GB  compute    3 TF
# └── transform  2.57 ms  2.18 GB  compute   13 TF

# Pareto-optimal GEMM variants for a 4Kx4Kx4K matmul on H200:
for v in info.pareto("gemm", shape=(4096, 4096, 4096), device="H200"):
    print(v)
# Variant('gemm_fp16'           : 0.2 ms  residual~8e-04)
# Variant('gemm_tf32'           : 0.4 ms  residual~3e-04)
# Variant('gemm_3xfp16'         : 0.5 ms  residual~2e-04)
# Variant('gemm_fp16_x3_kahan'  : 0.6 ms  residual~5e-07)
# Variant('gemm_ozaki2_cute'    : 0.8 ms  residual~2e-15)
```

## Benchmarks

All results below are measured on a single NVIDIA H200 (SM90, 150 GB HBM3e) with `CUDA 13.0`

, driver `580.126`

, `PyTorch 2.11`

, `Triton 3.6`

, against `cuML 25.10`

. Every cell is the median over 5 iterations with the first call discarded for JIT amortization; inputs are GPU-resident on both sides; matched-algorithm rows (same `algorithm`

, `method`

, `svd_solver`

) are paired with reduced-precision and algorithmic-shortcut rows so the comparison is fair at every shape.

### 1. Breadth: speedup over cuML across 13 primitives

The first benchmark is a broad sweep: 13 primitives × 194 (shape, dtype, hyperparameter) cells, all run against `cuml 25.10`

on the same H200. **Every cell here is an apples-to-apples comparison: matched algorithm, matched precision, matched hyperparameters — FlashLib is forbidden from using any reduced-precision GEMM (no bf16/fp16/Ozaki) or algorithmic shortcut (no Halko, no FFT t-SNE, no NN-Descent KNN).** Because of this, the bars below are strictly *lower* than the headline numbers at the top of the post: the hero stats are FlashLib's best ceiling speedups on each primitive (which, where applicable, do let the user trade precision or algorithmic exactness for throughput via the `tol`

knob from Principle 03), whereas the broad sweep deliberately removes those degrees of freedom to isolate the pure kernel-engineering win. The bar chart below collapses each primitive's 8–34 cells into a single bar — **Geomean** shows the geometric-mean speedup across all of that primitive's cells, and **Max** shows the per-primitive ceiling on the most favourable cell. FlashLib is at least as fast as cuML on **193 / 194** cells; **126** cells cross 5×, and **11** cross 50×.

### 2. Depth: precision × runtime trade-off inside one primitive (GEMM)

The second benchmark zooms into a single primitive, GEMM at 4096³ on H200, to show what tolerance routing looks like in practice. FlashLib ships ~10 GEMM variants in `flashlib.linalg.gemm`

— bf16, fp16, tf32, fused multi-pass (`3xbf16`

, `fp16_x3_kahan`

), and the Ozaki-II INT8 family (`ozaki2_cute`

, `ozaki2_int8`

) — all behind the single `tol`

-routed dispatcher from Principle 03. The scatter below plots each variant on RMS relative error (vs an FP64 reference) against per-call runtime. The dashed curve is the *Pareto frontier*: variants below it dominate the rest. The interesting point is `ozaki2_cute(s=8)`

: it sits below *and* to the left of `fp32`

, meaning it is simultaneously tighter and ~2× faster than the native fp32 path on this shape.

3on H200

### 3. Agent loops: Does FlashLib accelerate the agent loop?

Our third benchmark returns to the core motivation of FlashLib: many agentic workflows are bottlenecked not only by LLM rollout, but also by auxiliary operators such as retrieval, clustering, vector search, and numerical routines. We therefore evaluate FlashLib in a meta setting: whether giving a coding agent access to FlashLib helps it ship a faster end-to-end system.

Concretely, we task Claude Code Opus 4.7 with building a GPU vector-search backend, based on the GPU port of [KCORES/vector-db-bench](https://github.com/KCORES/vector-db-bench), under a 1M-token budget. The goal is to maximize QPS subject to a strict recall constraint of at least 0.999 on SIFT-1M. We evaluate the resulting system across three workloads: offline batch search, online single-query search, and streaming search. We run the same task twice, changing only whether the agent has access to FlashLib.

On the two settings where flashlib unlocks a structural win (offline batch and streaming), the flashlib-prompted agent reaches **5–6× higher QPS** by directly adopting `flash_knn`

instead of laboriously rediscovering the cuVS path and slowly bolting on TF32. On online single, where per-query launch overhead dominates either way, both variants converge to roughly the same ceiling. Just as importantly, the flashlib-prompted agent finishes the task *naturally* at 904k tokens with budget to spare; the default agent runs out at 1.07M tokens with no further ideas on hand — it hit a 50k QPS plateau on the dominant offline-batch task and could not escape it within the budget.

## Example: Flash-KMeans in a few lines

FlashLib ships on PyPI and installs with a single command. The smart dispatcher in `flashlib.primitives.kmeans`

takes a GPU tensor of shape `(N, D)`

or `(B, N, D)`

and returns the cluster IDs and centroids; it picks between the Triton path, the Hopper-only CuteDSL FA3-style fused-assign path, and a pure-torch fallback automatically based on the shape.

``` bash
$ pip install flashlib
python
import torch
from flashlib.primitives.kmeans import flash_kmeans

# 1M points, 128 dims on a single H200 -- runs in a few ms.
x = torch.randn(1_000_000, 128, device="cuda", dtype=torch.float32)
labels, centroids = flash_kmeans(x, n_clusters=1024, max_iters=20)
# labels:    (1, 1_000_000) int64   -- cluster assignment per point
# centroids: (1, 1024, 128) float32 -- final cluster centers
```

If you want to know what the call will cost *before* launching it — useful for an agent budgeting a pipeline of ten primitives — the same shape goes through the pure-stdlib cost API from Principle 04:

``` python
import flashlib.info as info

est = info.estimate("kmeans",
                    shape=(1_000_000, 128),
                    params={"K": 1024, "max_iters": 20})
print(est.summary_line())
# kmeans  ~7 ms  bound=compute  ~140 TF  (64% peak)  [calibrated]
```

Other primitives follow the same shape: `flash_knn`

, `flash_pca`

, `flash_hdbscan`

, `flash_tsne`

, `flashlib.gemm`

, and the rest of the catalog are all one-call drop-ins with the same `tol`

/ `backend`

knobs and the same `flashlib.info.estimate(...)`

cost surface.

## Limitations and Future Work

FlashLib's current release covers an important but incomplete slice of the classical-ML operator landscape — primarily the clustering, nearest-neighbour, dimensionality-reduction, and dense linear-algebra primitives most central to our own agentic-AI workloads. Extending the catalog is one of our top priorities; upcoming releases will deepen coverage of the existing families and add new ones (Gaussian mixtures, kernel methods, graph-based primitives, sparse inputs). A second limitation is hardware coverage: development and benchmarking have so far concentrated on H200. The dispatcher and kernels are written to be portable, but broader measurement and tuning across a wider range of hardware is still needed.

## Acknowledgements

FlashLib was deeply inspired by several open-source efforts that pioneered the “flash” design playbook on which this work is built: [FlashAttention](https://github.com/Dao-AILab/flash-attention), [FlashDecoding](https://crfm.stanford.edu/2023/10/12/flashdecoding.html), NVIDIA's [cuVS](https://github.com/rapidsai/cuvs) and [cuML](https://github.com/rapidsai/cuml), [FlashLinearAttention](https://github.com/fla-org/flash-linear-attention), and [FlashInfer](https://github.com/flashinfer-ai/flashinfer). We are also grateful to **Dacheng Li** (UC Berkeley), **Shiyi Cao** (UC Berkeley), **Melissa Pan** (UC Berkeley), and **Zihao Ye** (University of Washington) for many helpful discussions, and to the *flash-kmeans* and *Sparse VideoGen* communities for valuable feedback on early prototypes.

## Citation

If FlashLib is useful in your research or product, please cite the project as:

```
@misc{yang2026flashlib,
  title  = {FlashLib: Bringing Flash Magic to Classical Machine Learning Operators},
  author = {Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Mang, Qiuyang and
            Wang, Zhe and Sun, Shanlin and Keutzer, Kurt and Gonzalez, Joseph E. and
            Han, Song and Xu, Chenfeng and Stoica, Ion},
  year   = {2026},
  url    = {https://flashml-org.github.io/},
}
```


