This article was originally published on GetYourDozAi.
MiniMax M3: the first open-weight model to combine frontier coding, a 1M-token context window, and native multimodal input (text + image + video).
MiniMax Sparse Attention (MSA): a novel mechanism that reduces compute by 28.4x at 1M context by attending only to the 2,048 most relevant tokens per query, documented in a 30-page arXiv paper.
Priced at 5-10% of rivals: $0.30/M input tokens (promo) vs $5.00 for Opus 4.8 and GPT-5.5, making it the best dollar-for-dollar coding model available through an API today.
Caveat emptor: benchmarks are self-reported, licensing restricts commercial self-hosting, and abstract reasoning remains a weakness.
On June 1, 2026, Shanghai-based AI lab MiniMax released M3 — the first open-weight model to deliver three frontier capabilities simultaneously: 59.0% on SWE-Bench Pro (edging GPT-5.5's 58.6%), a 1M-token context window, and native understanding of text, images, and video from the ground up.
The enabler of this trifecta is MiniMax Sparse Attention (MSA) — a novel architecture that makes 1M-token inference computationally practical. Without it, running full attention over a million tokens would be prohibitively expensive on any hardware available today.
Standard softmax attention scales quadratically with context length: doubling the context quadruples the compute. At 1M tokens, a single forward pass with full attention becomes prohibitively expensive. The industry has explored sparse attention patterns, KV-cache compression, and linear attention variants, but each introduces tradeoffs.
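To put numbers on the quadratic claim, here is a minimal sketch that counts query-key pairs for full attention. It is a simplification for illustration: it ignores causal masking (which roughly halves the count), head count, and layer count, and the function names are hypothetical.

```python
def attention_score_count(context_len: int) -> int:
    """Query-key pairs scored by full (dense, non-causal) attention, per head."""
    return context_len * context_len


def growth_factor(short_ctx: int, long_ctx: int) -> float:
    """How much more attention work the longer context needs."""
    return attention_score_count(long_ctx) / attention_score_count(short_ctx)


if __name__ == "__main__":
    for n in (128_000, 256_000, 1_000_000):
        pairs = attention_score_count(n)
        print(f"{n:>9,} tokens -> {pairs:.2e} query-key pairs per head")
    print("doubling the context multiplies the work by", growth_factor(128_000, 256_000))
```

Going from 128K to 1M tokens is roughly an 8x longer context but about 61x more attention work, which is why a fixed attention budget is attractive at this scale.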
MSA's approach is elegantly practical: instead of attending to all tokens, it identifies the few that actually matter for each query and computes attention over those alone. Other open-weight models, such as Switzerland's Apertus 70B, face the same quadratic scaling of attention, so this breakthrough matters far beyond MiniMax.
MSA operates in two stages. First, an Index Branch divides the KV cache into 128-token blocks and selects the 16 most relevant blocks per GQA group; this group-specific sparsity differentiates MSA from uniform approaches. Then the Main Branch runs exact attention over only those 2,048 KV tokens (16 blocks × 128 tokens), a fixed budget regardless of context length. The result is sub-quadratic scaling: the Main Branch's per-query attention cost stays constant as context grows, leaving block selection as the only part that scales with context length.
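The two stages can be sketched in plain Python. This is an illustration under stated assumptions, not MiniMax's implementation: the article does not say how the Index Branch scores blocks, so the sketch assumes a simple rule (dot product of the query with each block's mean key), and it handles a single query rather than a full GQA group. All function names are hypothetical.

```python
import math

BLOCK_SIZE = 128    # tokens per KV block (figure from the article)
TOP_K_BLOCKS = 16   # blocks kept per query (figure from the article)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query, keys, block_size=BLOCK_SIZE, top_k=TOP_K_BLOCKS):
    """Index stage. ASSUMED scoring rule: query dotted with each block's mean key."""
    dim = len(query)
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scored.append((dot(query, mean_key), start // block_size))
    scored.sort(reverse=True)                      # highest-scoring blocks first
    return sorted(idx for _, idx in scored[:top_k])


def sparse_attention(query, keys, values, block_size=BLOCK_SIZE, top_k=TOP_K_BLOCKS):
    """Main stage: exact softmax attention restricted to the selected blocks."""
    token_ids = [
        t
        for b in select_blocks(query, keys, block_size, top_k)
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    peak = max(logits)                             # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return out, len(token_ids)


if __name__ == "__main__":
    for context_len in (4096, 8192):
        keys = [[float(i % 7), 1.0] for i in range(context_len)]
        values = [[float(i % 3)] for i in range(context_len)]
        _, attended = sparse_attention([1.0, 0.5], keys, values)
        print(f"context={context_len}: attended to {attended} KV tokens")
```

In this sketch the number of attended tokens stays at 2,048 when the context doubles, while the loop in `select_blocks` still visits every block, so the selection step is the part that grows with context length.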
To translate sparsity into real speedups, MiniMax built a custom kernel with exp-free top-k selection, KV-outer sparse attention (batching queries that need the same block), and contiguous memory access — each block read once.
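The idea behind KV-outer ordering can be shown with a toy bookkeeping example. This is a sketch of the loop-order idea only, assuming each query has already picked its blocks; it says nothing about how the actual kernel is written, and the function names are hypothetical.

```python
from collections import defaultdict


def group_queries_by_block(selected_blocks):
    """Invert a query -> blocks mapping into block -> queries."""
    block_to_queries = defaultdict(list)
    for query_id, blocks in enumerate(selected_blocks):
        for block_id in blocks:
            block_to_queries[block_id].append(query_id)
    return dict(block_to_queries)


def count_block_reads(selected_blocks):
    """Block loads needed when looping over queries first vs. over KV blocks first."""
    query_outer = sum(len(blocks) for blocks in selected_blocks)  # every query reloads its blocks
    kv_outer = len(group_queries_by_block(selected_blocks))       # each distinct block loaded once
    return query_outer, kv_outer


if __name__ == "__main__":
    # Three queries; queries 0 and 2 picked the same two blocks.
    selection = [[0, 3], [3, 5], [0, 3]]
    print(group_queries_by_block(selection))
    print(count_block_reads(selection))
```

The more the queries overlap in the blocks they select, the larger the saving from reading each block once and serving every query that needs it.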
MSA takes a genuinely different architectural path from DeepSeek's Multi-head Latent Attention (MLA). While DeepSeek compresses KV data into a latent space (better memory efficiency, precision tradeoff), MSA operates on uncompressed KV data, preserving long-context retrieval accuracy at higher memory cost. The MSA paper (arXiv 2606.13392) provides 30 pages of technical detail for the community.
MiniMax also published three impressive real-world demos: an autonomous ICLR 2025 paper reproduction (12 hours, 18 commits), a CUDA FP8 GEMM kernel achieving 9.4x speedup (24 hours, 147 submissions), and fully autonomous model training across 4 untrained base models in 12 hours.
A typical coding task (500K input, 100K output) costs $0.27 at promo pricing — roughly 5% of Opus 4.8. Even at standard rates ($0.54/task), M3 is an order of magnitude cheaper for high-volume workflows.
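The per-task arithmetic can be reproduced in a few lines. Only the $0.30/M promo input price comes from the figures above; the output price of $1.20/M is an assumption back-solved so that the task lands on $0.27, and the standard rates are assumed to be double the promo rates because $0.54 is twice $0.27. Check MiniMax's published price list before relying on either.

```python
PROMO_INPUT_PER_M = 0.30     # USD per million input tokens (from the article)
ASSUMED_OUTPUT_PER_M = 1.20  # USD per million output tokens (ASSUMPTION, back-solved)
ASSUMED_STANDARD_MULTIPLIER = 2.0  # ASSUMPTION: standard pricing is 2x promo


def task_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Cost in USD for one task at the given per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m


promo = task_cost(500_000, 100_000, PROMO_INPUT_PER_M, ASSUMED_OUTPUT_PER_M)
standard = task_cost(
    500_000,
    100_000,
    PROMO_INPUT_PER_M * ASSUMED_STANDARD_MULTIPLIER,
    ASSUMED_OUTPUT_PER_M * ASSUMED_STANDARD_MULTIPLIER,
)

if __name__ == "__main__":
    print(f"promo:    ${promo:.2f} per task")
    print(f"standard: ${standard:.2f} per task")
    print(f"1,000 tasks/day at promo pricing: ${promo * 1000:,.2f}")
```

Swapping in your own token counts is the useful part: a workload with far more output than input shifts the total toward the output rate, which is the number this sketch had to assume.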
M3 uses the MiniMax Community License (CC BY-NC 4.0). Commercial use requires a separate agreement with MiniMax. Do not deploy in production without legal verification.
All scores come from MiniMax's own infrastructure, and comparisons used Opus 4.7 (64.3%), not the current Opus 4.8 (69.2%). Measured against M3's 59.0%, the gap to today's frontier is about 10 points, roughly 5 points wider than the headline comparison suggests. Independent Chatbot Arena results are still pending.
ARC-AGI-2 scores are "low single digits." Independent reviewer Thomas Wiegold reported M3 spent 30-40 minutes on a poker simulation with only mediocre results. This is a competent executor, not a general reasoning replacement.
Cheap per-token pricing doesn't mean cheap per-task pricing if the model overthinks on complex problems. Additionally, MiniMax is Shanghai-based — Chinese data laws apply to all API traffic regardless of user location.
MiniMax M3 is the strongest dollar-for-dollar coding model available through an API today. Its MSA architecture is a genuine breakthrough, and the accompanying arXiv paper provides real depth for the community to build on. For developers who need frontier coding, massive context windows, and multimodal input at a fraction of the price of proprietary alternatives, M3 is a compelling choice.
But it's not a magic bullet. Licensing restricts commercial self-hosting, abstract reasoning lags frontier models, and the benchmarks need independent validation. What M3 represents is proof that sparse attention can deliver frontier capability at practical costs, and a roadmap for the next generation of long-context models.
Featured image: MiniMax Sparse Attention (MSA) architecture diagram. Source: MiniMax Official Blog.