[ARTICLE · art-22848] src=sakana.ai pub= topic=machine-learning verified=true sentiment=↑ positive

Sparser, Faster, Lighter Transformer Language Models

Researchers at Sakana AI, in collaboration with NVIDIA, have developed TwELL, a new sparse data format, along with custom GPU kernels that reshape unstructured sparsity in transformer language models to fit GPU architecture, achieving speedups of over 20% in inference and training. The method dynamically routes highly sparse tokens through a fast path while using a dense backup matrix for rare heavy tokens, reducing peak memory and energy consumption. The work, to be presented at ICML 2026, addresses the paradox where making models do less math often slows them down because of hardware mismatches.

2 min read · published May 8, 2026

How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️

Excited to share our new #ICML2026 paper in collaboration with NVIDIA: “Sparser, Faster, Lighter Transformer Language Models”. This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models.

The human brain is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (>95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it.

One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.

We teamed up with NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a “Hybrid” format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.
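To make the routing idea concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: it uses a classic fixed-width ELLPACK-style layout for the fast path, and `ELL_WIDTH`, `route_tokens`, and the dictionary field names are hypothetical names chosen for illustration. Tokens whose nonzero count fits the fixed width are packed into padded (index, value) slots, and the rest are kept as dense rows.

```python
# Illustrative only: a classic ELLPACK-style fast path with a dense fallback.
# ELL_WIDTH, route_tokens, and the returned field names are hypothetical.

ELL_WIDTH = 2  # assumed max nonzeros a token may have to take the fast path


def route_tokens(activations, ell_width=ELL_WIDTH):
    """Split token rows into a fixed-width sparse part and a dense backup."""
    fast = {"rows": [], "indices": [], "values": []}
    backup = {"rows": [], "dense": []}
    for row_id, row in enumerate(activations):
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0.0]
        if len(nonzeros) <= ell_width:
            # Fast path: pad to a fixed width so every packed row has the
            # same shape. Padding slots use index 0 with value 0.0, so they
            # contribute nothing to a later multiply.
            pad = ell_width - len(nonzeros)
            fast["rows"].append(row_id)
            fast["indices"].append([j for j, _ in nonzeros] + [0] * pad)
            fast["values"].append([v for _, v in nonzeros] + [0.0] * pad)
        else:
            # Safety valve: heavy tokens are stored as plain dense rows.
            backup["rows"].append(row_id)
            backup["dense"].append(list(row))
    return fast, backup


activations = [
    [0.0, 0.0, 1.5, 0.0],  # one nonzero: fast path
    [0.5, 1.0, 2.0, 0.0],  # three nonzeros: dense backup
    [0.0, 0.0, 0.0, 0.0],  # all zeros: fast path, padding only
]
fast, backup = route_tokens(activations)
```

The fixed width is the point of the design: every fast-path row has the same shape, so the access pattern is regular, and the occasional heavy token no longer forces a larger width on all the others.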

Our contribution is twofold:

  • We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly into the same optimized tiled matmul kernels without disrupting execution.
  • We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput and compress TwELL to a hybrid representation that minimizes activation sizes.
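As a rough illustration of what "tile-wise" ELLPACK packing could look like, the sketch below splits one activation row into fixed-width tiles and stores a fixed number of (index, value) slots per tile, then uses the packed form in a row-times-matrix product. The actual TwELL layout and CUDA kernels are described in the paper. Everything here (`TILE`, `SLOTS`, the function names, and the choice to raise an error on overflow) is an assumption made for the example.

```python
# Illustrative only: per-tile fixed-width packing in the spirit of ELLPACK.
# TILE, SLOTS, and all function names are hypothetical, not the paper's API.

TILE = 4   # assumed tile width along the hidden dimension
SLOTS = 2  # assumed max nonzeros stored per tile


def pack_tilewise_ell(row, tile=TILE, slots=SLOTS):
    """Pack one activation row into per-tile (indices, values) slot lists."""
    packed = []
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]
        nonzeros = [(start + j, v) for j, v in enumerate(chunk) if v != 0.0]
        if len(nonzeros) > slots:
            # In a hybrid scheme this row would go to a dense fallback.
            raise ValueError("tile overflow: row needs a dense fallback")
        pad = slots - len(nonzeros)
        indices = [j for j, _ in nonzeros] + [start] * pad
        values = [v for _, v in nonzeros] + [0.0] * pad
        packed.append((indices, values))
    return packed


def sparse_row_matmul(packed, weight):
    """Compute row @ weight using only the slots stored in the packed tiles."""
    out = [0.0] * len(weight[0])
    for indices, values in packed:
        for j, v in zip(indices, values):
            if v == 0.0:
                continue  # padding slot, nothing to accumulate
            for k, w in enumerate(weight[j]):
                out[k] += v * w
    return out


def dense_row_matmul(row, weight):
    """Reference dense product for comparison."""
    out = [0.0] * len(weight[0])
    for j, v in enumerate(row):
        for k, w in enumerate(weight[j]):
            out[k] += v * w
    return out


row = [0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
weight = [[float(j), float(2 * j)] for j in range(8)]  # 8 x 2 toy matrix
packed = pack_tilewise_ell(row)
sparse_out = sparse_row_matmul(packed, weight)
dense_out = dense_row_matmul(row, weight)
```

In this toy case the sparse path reads a single row of `weight` instead of all eight, yet produces the same output as the dense reference.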

We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating >20% speedups and even higher savings in peak memory and energy.

This work will be presented at ICML 2026. Please check out our blog and technical paper for a deep dive!
