Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

wpnews.pro

Source: research.nvidia.com · Published: 2026-06-03T15:50Z · Topic: machine-learning

Researchers introduced Gated DeltaNet-2, a linear attention model that decouples the erase and write operations in recurrent state updates using separate channel-wise gates. The model outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants in language modeling, reasoning, and retrieval tasks at 1.3 billion parameters trained on 100 billion tokens. Its strongest gains appear in long-context needle-in-a-haystack benchmarks, particularly in multi-key retrieval scenarios.

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side.

We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training.

At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings.
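To make the erase/write distinction concrete, here is a minimal NumPy sketch of one recurrent step. It is an illustration only, not the paper's kernel. The abstract does not give the exact update, so several choices are assumptions: the state layout `S` of shape `(d_k, d_v)`, applying the erase gate `b` to key channels and the write gate `w` to value channels, and applying decay before the edit. The function names `scalar_delta_step` and `decoupled_step` are hypothetical. The sketch is built so that it matches the reduction stated above: when `b` and `w` collapse to the same scalar, the decoupled step equals a delta-rule step with channel-wise decay and a single scalar gate.

```python
import numpy as np

def scalar_delta_step(S, k, v, a, beta):
    """Delta-rule step with channel-wise decay `a` and ONE scalar gate `beta`.

    S: (d_k, d_v) state, k: (d_k,), v: (d_v,), a: (d_k,), beta: float.
    """
    S = a[:, None] * S                        # decay each key channel of the state
    read = k @ S                              # what key k currently retrieves, (d_v,)
    return S + beta * np.outer(k, v - read)   # erase and write share the same gate

def decoupled_step(S, k, v, a, b, w):
    """Assumed decoupled variant: separate erase gate `b` and write gate `w`.

    b: (d_k,) erase gate on key channels (assumption).
    w: (d_v,) write gate on value channels (assumption).
    """
    S = a[:, None] * S                        # same channel-wise decay as above
    read = (b * k) @ S                        # erase-gated read of old content
    S = S - np.outer(k, read)                 # erase: remove the gated old association
    return S + np.outer(k, w * v)             # write: commit the gated new value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_k, d_v = 4, 3
    S = rng.normal(size=(d_k, d_v))
    k, v = rng.normal(size=d_k), rng.normal(size=d_v)
    a = rng.uniform(0.5, 1.0, size=d_k)

    # With both gates collapsed to one scalar, the two steps coincide.
    beta = 0.3
    tied = decoupled_step(S, k, v, a, np.full(d_k, beta), np.full(d_v, beta))
    print(np.allclose(tied, scalar_delta_step(S, k, v, a, beta)))
```

Under these assumptions, setting `b` to zeros and `w` to ones writes a new value without erasing anything, a combination a single shared scalar gate cannot express. Note also that the state keeps its `(d_k, d_v)` shape at every step, which is the fixed-size memory the abstract contrasts with a growing softmax cache.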

source & further reading

research.nvidia.com (original article)

Read original on research.nvidia.com → research.nvidia.com/publication/2026-05_gated-de…
