From Lightning to Sparse: How MiniMax M3 Reads a Million Tokens Without Reading Them All

MiniMax M3 uses MiniMax Sparse Attention to process up to a million tokens by attending only to the relevant parts of the input, addressing the production failures of earlier efficient-attention methods.

A concept-first tour of MiniMax Sparse Attention: why “efficient attention” kept failing in production, and the surprisingly simple idea… (Published on Towards AI.)