Sparser, Faster, Lighter Transformer Language Models

Researchers from Sakana AI, in collaboration with NVIDIA, have developed TwELL, a new sparse data format, along with custom GPU kernels that reshape unstructured sparsity in transformer language models to align with GPU architecture, achieving speedups of more than 20% in inference and training. The method dynamically routes highly sparse tokens through a fast path while using a dense backup matrix for rare heavy tokens, reducing peak memory and energy consumption. The work, to be presented at ICML 2026, addresses the paradox that making models do less math often slows them down because of hardware mismatches.

Published May 8, 2026 · 2 min read

How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️

Excited to share our new #ICML2026 paper in collaboration with NVIDIA: “Sparser, Faster, Lighter Transformer Language Models”. This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models.

The human brain is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it.

One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.

We teamed up with NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a “Hybrid” format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.
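The routing idea can be illustrated with a minimal Python sketch. It is not the paper's kernel or its actual data layout: the function names `route_tokens` and `hybrid_matmul`, the per-token budget `max_nnz`, and the tuple-based packed layout are all assumptions made for illustration. Tokens whose activation rows have at most `max_nnz` nonzeros are packed into fixed-width (value, index) slots, and the rest fall back to a dense copy.

```python
# Illustrative sketch only: names and layout are assumptions,
# not the TwELL implementation.

def route_tokens(activations, max_nnz):
    """Split token rows into a fixed-width sparse path and a dense backup.

    activations: list of rows (one per token), each a list of floats.
    max_nnz: hypothetical per-token nonzero budget for the fast path.
    """
    sparse_rows = {}  # token index -> (value, column) slots padded to max_nnz
    dense_rows = {}   # token index -> original dense row (the "backup")
    for t, row in enumerate(activations):
        nz = [(v, j) for j, v in enumerate(row) if v != 0.0]
        if len(nz) <= max_nnz:
            # Pad with (0.0, 0) so every fast-path row has the same width.
            sparse_rows[t] = nz + [(0.0, 0)] * (max_nnz - len(nz))
        else:
            dense_rows[t] = list(row)
    return sparse_rows, dense_rows


def hybrid_matmul(sparse_rows, dense_rows, weight, n_tokens):
    """Compute activations @ weight, using each token's assigned path."""
    n_out = len(weight[0])
    out = [[0.0] * n_out for _ in range(n_tokens)]
    for t, slots in sparse_rows.items():
        for v, j in slots:           # fixed trip count: max_nnz per token
            for k in range(n_out):
                out[t][k] += v * weight[j][k]
    for t, row in dense_rows.items():
        for j, v in enumerate(row):  # full dense dot products
            for k in range(n_out):
                out[t][k] += v * weight[j][k]
    return out


acts = [
    [0.0, 2.0, 0.0, 0.0],  # 1 nonzero  -> fast path
    [1.0, 1.0, 1.0, 0.0],  # 3 nonzeros -> dense backup
    [0.0, 0.0, 0.0, 0.0],  # 0 nonzeros -> fast path
]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
sparse_rows, dense_rows = route_tokens(acts, max_nnz=2)
result = hybrid_matmul(sparse_rows, dense_rows, W, n_tokens=len(acts))
```

In this toy version the fast path always does exactly `max_nnz` multiply-accumulates per token, which is the kind of predictable work pattern the post argues GPUs prefer. The dense path preserves correctness for tokens that exceed the budget.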

Our contribution is twofold:

We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly into the same optimized tiled matmul kernels without disrupting execution.
We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput and compress TwELL to a hybrid representation that minimizes activation sizes.
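For readers unfamiliar with ELLPACK: the classic format stores a fixed number of (value, column index) slots for every row and pads short rows, so the layout stays rectangular and predictable. The sketch below applies that idea per column tile, which is one plausible reading of "tile-wise". The tile width, slot count, overflow handling, and function names are assumptions for illustration and may differ from the layout described in the paper.

```python
# Illustrative sketch: ELLPACK-style packing applied per column tile.
# tile_width, slots_per_tile and the layout are assumptions,
# not the paper's specification.

def pack_tilewise_ell(row, tile_width, slots_per_tile):
    """Pack one activation row into fixed-size per-tile slots.

    Returns (values, columns), each a list of tiles with exactly
    slots_per_tile entries, or None if any tile has too many nonzeros.
    """
    values, columns = [], []
    for start in range(0, len(row), tile_width):
        tile = row[start:start + tile_width]
        nz = [(v, start + j) for j, v in enumerate(tile) if v != 0.0]
        if len(nz) > slots_per_tile:
            return None  # caller would need a fallback, e.g. a dense path
        pad = slots_per_tile - len(nz)
        values.append([v for v, _ in nz] + [0.0] * pad)
        # Padding points at the tile's first column with value 0.0 (a no-op).
        columns.append([c for _, c in nz] + [start] * pad)
    return values, columns


def unpack_tilewise_ell(values, columns, length):
    """Rebuild the dense row from the packed tiles."""
    row = [0.0] * length
    for tile_vals, tile_cols in zip(values, columns):
        for v, c in zip(tile_vals, tile_cols):
            row[c] += v
    return row


row = [0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0]
values, columns = pack_tilewise_ell(row, tile_width=4, slots_per_tile=2)
restored = unpack_tilewise_ell(values, columns, len(row))
overflow = pack_tilewise_ell([1.0, 1.0, 1.0, 0.0], tile_width=4, slots_per_tile=2)
```

Because every tile carries the same number of slots, loop bounds and memory footprints are known ahead of time. A row that overflows a tile returns `None` here, which stands in for whatever fallback a real implementation would use.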

We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating >20% speedups and even higher savings in peak memory and energy.

This work will be presented at ICML 2026. Please check out our blog and technical paper for a deep dive!

Read original on sakana.ai → sakana.ai/twell/
