Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

Source: arxiv.org · Published: 2026-06-18T04:00Z · Topic: machine-learning

Researchers have introduced Gaussian Mixture Attention (GMA), a probabilistic attention mechanism that replaces pairwise query-key comparisons with routing through K learned Gaussian mixture components, so that the dominant activation storage scales as O(NK) for fixed K rather than O(N^2). GMA is competitive with attention-style baselines on long-context classification, and its causal variant improves over the tested linear and random-feature attention variants on WikiText-103, though it remains behind optimized causal SDPA and Mamba in the current implementation. The authors position it as an interpretable, fixed-K linear-time alternative for Transformer architectures, not a universal replacement for optimized softmax attention or state-space models.
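
To make the O(NK) versus O(N^2) comparison concrete, the following minimal sketch counts stored activation elements for a dense affinity matrix and for two responsibility matrices. The sequence length 8192 and component count 64 are illustrative assumptions, not settings reported in the paper, and the count ignores every other activation a real model would keep.

```python
def activation_counts(n_tokens, n_components):
    """Element counts for the dominant activations only (illustrative)."""
    dense = n_tokens * n_tokens            # one N x N affinity matrix
    routed = 2 * n_tokens * n_components   # two N x K responsibility matrices
    return dense, routed


# Hypothetical sizes chosen for illustration, not taken from the paper.
dense, routed = activation_counts(8192, 64)
print(dense, routed, dense // routed)  # prints "67108864 1048576 64"
```

With K held fixed, doubling N doubles the routed count but quadruples the dense one, which is the scaling difference described in the summary.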

arXiv:2606.18283v1. Abstract: The dense token-to-token interaction pattern of standard dot-product attention remains a central bottleneck in scaling Transformer architectures to long contexts. We introduce Gaussian Mixture Attention (GMA), a probabilistic attention-style sequence mixer that replaces explicit pairwise query-key comparison with routing through K learned Gaussian mixture components. Queries and keys are mapped to posterior responsibility vectors over a shared latent routing space; their overlap defines an implicit responsibility-space affinity, while values are written into and read from a K-slot latent memory. By exploiting the associativity of matrix multiplication, GMA avoids materializing the induced N×N affinity matrix and instead uses two responsibility matrices whose dominant activation storage scales as O(NK) rather than O(N^2) for fixed K. We formulate bidirectional and causal variants of GMA, provide an end-to-end differentiable parameterization of the Gaussian mixture components, and analyze its responsibility-modulated gradient structure, constrained non-negative low-rank affinity interpretation, and local routing stability. Empirically, GMA exhibits the intended fixed-K linear memory scaling and is competitive with attention-style baselines on long-context classification, while causal GMA improves over tested linear/random-feature attention variants on WikiText-103 but remains behind optimized causal SDPA and Mamba in the current implementation. Analysis of learned responsibilities further shows broad component usage and moderate alignment with surface-form token categories, supporting GMA as a probabilistic, interpretable, fixed-K linear-time attention-style alternative rather than a universal replacement for optimized softmax attention or state-space models.
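
The routing idea in the abstract can be sketched in a few lines of dependency-free Python. This is a rough illustration under stated assumptions, not the authors' implementation: the isotropic shared-variance Gaussians, the uniform mixture weights, the absence of any learned projections or output normalization, and all function names (`responsibilities`, `gma_bidirectional`, `gma_causal`, `naive_reference`, `demo`) are hypothetical choices made here. The sketch shows only that writing values into a K-slot memory and reading them back gives the same result as building the N×N affinity explicitly, in both a bidirectional and a causal form.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def responsibilities(x, means, log_weights, var=1.0):
    """Posterior over K components for each row of x.

    Assumes isotropic Gaussians with one shared variance (illustrative only).
    """
    rows = []
    for vec in x:
        logits = [
            lw - sum((a - m) ** 2 for a, m in zip(vec, mu)) / (2.0 * var)
            for mu, lw in zip(means, log_weights)
        ]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def gma_bidirectional(r_q, r_k, v):
    """Write values into K slots, then read them back: R_q (R_k^T V)."""
    memory = matmul(transpose(r_k), v)  # K x d latent memory
    return matmul(r_q, memory)          # N x d output


def gma_causal(r_q, r_k, v):
    """Causal form as a running K x d memory; token n sees tokens <= n."""
    num_slots, dim = len(r_k[0]), len(v[0])
    memory = [[0.0] * dim for _ in range(num_slots)]
    out = []
    for rq_n, rk_n, v_n in zip(r_q, r_k, v):
        for k in range(num_slots):
            for j in range(dim):
                memory[k][j] += rk_n[k] * v_n[j]
        out.append([sum(rq_n[k] * memory[k][j] for k in range(num_slots))
                    for j in range(dim)])
    return out


def naive_reference(r_q, r_k, v, causal=False):
    """Materializes the N x N affinity R_q R_k^T, for comparison only."""
    affinity = matmul(r_q, transpose(r_k))
    if causal:
        affinity = [[a if j <= i else 0.0 for j, a in enumerate(row)]
                    for i, row in enumerate(affinity)]
    return matmul(affinity, v)


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def demo(n=12, k=4, d=3, seed=0):
    """Random queries, keys, values and component means (all hypothetical)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    q, keys, v, means = rand(n, d), rand(n, d), rand(n, d), rand(k, d)
    log_weights = [math.log(1.0 / k)] * k  # uniform mixture weights
    r_q = responsibilities(q, means, log_weights)
    r_k = responsibilities(keys, means, log_weights)
    return r_q, r_k, v
```

Calling `r_q, r_k, v = demo()` and comparing `gma_bidirectional(r_q, r_k, v)` with `naive_reference(r_q, r_k, v)` through `max_abs_diff` should give a difference at floating-point rounding level, since the two differ only in the order of the matrix products. The routed versions never hold more than the two N×K responsibility matrices and one K×d memory.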

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.18283
