Sparser, Faster, Lighter Transformer Language Models

Researchers working in collaboration with NVIDIA have developed a new sparse data format, TwELL, along with custom GPU kernels that reshape unstructured sparsity in transformer language models to align with GPU architecture, achieving 20% speedups in inference and training. The method dynamically routes highly sparse tokens through a fast path while using a dense backup matrix for rare heavy tokens, reducing peak memory and energy consumption. The work, to be presented at ICML 2026, addresses the paradox where making models do less math often slows them down due to hardware mismatches.

How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU ⚡️

Excited to share our new ICML 2026 paper in collaboration with NVIDIA: “Sparser, Faster, Lighter Transformer Language Models”. This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models.

The human brain is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it.

One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.

We teamed up with NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a “Hybrid” format that reshapes the sparsity to fit the GPU. Our sparsity format TwELL dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.

Our contribution is twofold:
- We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly into the same optimized tiled matmul kernels without disrupting execution.
- We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput and compress TwELL to a hybrid representation that minimizes activation sizes.

We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating 20% speedups and even higher savings in peak memory and energy.

This work will be presented at ICML 2026. Please check out our blog and technical paper for a deep dive.
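
To make the fast-path / dense-backup idea described above a little more concrete, here is a toy sketch in plain Python. It is not the actual TwELL layout or the paper's CUDA kernels: it uses ordinary row-wise ELLPACK (no tiling), and the names `pack_hybrid`, `hybrid_matmul`, and the `width` parameter are hypothetical, chosen only for illustration. Token rows with at most `width` nonzero activations are packed into fixed-width arrays, and the rare heavier rows are kept dense.

```python
def pack_hybrid(acts, width):
    """Split token rows into an ELLPACK-style fast path and a dense backup.

    acts:  list of rows (one per token), each a list of floats.
    width: max nonzeros stored per fast-path row (hypothetical parameter).
    """
    packed = {"n_rows": len(acts), "ell_rows": [], "ell_cols": [],
              "ell_vals": [], "dense_rows": [], "dense_vals": []}
    for r, row in enumerate(acts):
        nonzeros = [(c, v) for c, v in enumerate(row) if v != 0.0]
        if len(nonzeros) <= width:
            # Fast path: pad to a fixed width so every row has the same shape.
            pad = width - len(nonzeros)
            packed["ell_rows"].append(r)
            packed["ell_cols"].append([c for c, _ in nonzeros] + [0] * pad)
            packed["ell_vals"].append([v for _, v in nonzeros] + [0.0] * pad)
        else:
            # Safety valve: a heavy row is stored densely, unchanged.
            packed["dense_rows"].append(r)
            packed["dense_vals"].append(list(row))
    return packed


def hybrid_matmul(packed, weight):
    """Compute acts @ weight from the packed representation."""
    n_out = len(weight[0])
    out = [[0.0] * n_out for _ in range(packed["n_rows"])]
    # Fast path: exactly `width` multiply-adds per output column, per row.
    # Padded slots carry value 0.0, so they contribute nothing.
    for r, cols, vals in zip(packed["ell_rows"], packed["ell_cols"],
                             packed["ell_vals"]):
        for c, v in zip(cols, vals):
            for j in range(n_out):
                out[r][j] += v * weight[c][j]
    # Dense backup: ordinary row-times-matrix product.
    for r, row in zip(packed["dense_rows"], packed["dense_vals"]):
        for c, v in enumerate(row):
            for j in range(n_out):
                out[r][j] += v * weight[c][j]
    return out


def dense_matmul(acts, weight):
    """Reference dense product, used only to check the hybrid result."""
    n_out = len(weight[0])
    return [[sum(row[c] * weight[c][j] for c in range(len(row)))
             for j in range(n_out)] for row in acts]


# Four tokens, hidden size 6; token 2 is "heavy" (4 nonzeros > width 2).
acts = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0, 3.0],
    [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
          [2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]

packed = pack_hybrid(acts, width=2)
out = hybrid_matmul(packed, weight)
```

In this toy setup, three of the four tokens take the fixed-width path and only token 2 falls back to the dense backup, while `out` matches `dense_matmul(acts, weight)`. The fixed width is what keeps the fast path regular: every packed row has the same shape, which is the kind of predictable access pattern GPUs favor.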