Modular: Day Zero: MiniMax M3 Open Weights on Modular Cloud

June 12, 2026

Modular Team

Company

MiniMax M3 is the newest open-weights model from MiniMax, optimized for coding, agentic work, and native multimodality. A few things that make this a frontier model are:

Behind M3 is a new operation, MiniMax Sparse Attention (MSA). MSA is what enables a 1M-token context to be served, and it is a big part of what makes M3 demanding to run well. When well optimized, however, MSA's design cuts per-token attention compute to roughly 1/20th of that of its full-attention predecessor. This results in roughly a 9.7× speedup on prefill and a 15.6× speedup on decode, while matching full attention across the vast majority of workloads.

MSA splits every attention layer into two questions: which KV to look at, and how to attend to it. The first is answered by an indexing layer. For each query, the indexer scores candidate KV blocks and chooses the top-k blocks. The indexer maintains its own cache of index keys, with a single shared head and a small head dimension, so scoring is cheap. By focusing only on the top-scoring blocks, MSA computes attention over just the selected 128-token KV blocks rather than the full KV cache.

s = (Q_idx @ K_idx.T) * idx_scale      # single shared index head, tiny d_idx -- nearly free
S = block_max_pool(s, B=128)           # token scores -> 128-token block scores
S[:, :init_blocks]  = INF             # force-select the attention-sink blocks
S[:, local_window:] = INF - eps       # force-select the recent window
I = topk_per_kv_group(S, k)            # ONE selection, shared by every head in the GQA group
O = softmax_attention(Q, K[I], V[I])   # ordinary GQA over the REAL K/V of the selected blocks
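
The selection step above can be sketched in plain Python for a single query. This is a minimal illustration, not MiniMax's or Modular's kernel: the function name `select_blocks` and the parameters `init_blocks` and `local_blocks` are hypothetical, the block size is shrunk from 128 to 4 for readability, and both forced groups simply receive an infinite score (the pseudocode's `INF - eps` priority between them is reduced here to a tie-break on block ID).

```python
import math

def select_blocks(token_scores, block_size, k, init_blocks, local_blocks):
    """Return the sorted block IDs selected for one query (illustrative only)."""
    n_blocks = math.ceil(len(token_scores) / block_size)
    # Max-pool per-token index scores into one score per KV block.
    block_scores = [
        max(token_scores[b * block_size:(b + 1) * block_size])
        for b in range(n_blocks)
    ]
    # Force-select the leading attention-sink blocks.
    for b in range(min(init_blocks, n_blocks)):
        block_scores[b] = math.inf
    # Force-select the trailing blocks that cover the recent window.
    for b in range(max(0, n_blocks - local_blocks), n_blocks):
        block_scores[b] = math.inf
    # Top-k by score; Python's sort is stable, so ties go to the lower block ID.
    order = sorted(range(n_blocks), key=lambda b: -block_scores[b])
    return sorted(order[:k])

# 16 tokens in 4 blocks of 4; block 1 holds the highest learned score.
scores = [0.1, 0.2, 0.0, 0.1,   0.9, 0.1, 0.0, 0.2,
          0.3, 0.2, 0.1, 0.0,   0.0, 0.1, 0.0, 0.05]
picked = select_blocks(scores, block_size=4, k=3, init_blocks=1, local_blocks=1)
print(picked)  # [0, 1, 3]
```

Block 0 (sink) and block 3 (recent) are always chosen, and the one remaining slot goes to block 1, the highest-scoring block that was not forced. In the real operation this single selection is shared by every head in the GQA group, which the sketch does not model.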

The model produces its selection in query-major form: for each query, a list of top-k block IDs. The natural kernel follows that shape: loop over queries, gather their selected KV blocks, and then attend. But executing in query-major order means each query gathers its selected blocks independently, so the same KV block may be fetched from HBM many times, which wastes memory bandwidth.

for q_tile in queries:                    # parallel across threadblocks
    for blk in I[q_tile]:                 # this tile's top-k blocks
        K_blk, V_blk = load_block(blk)    # hot blocks re-fetched by EVERY threadblock that picked them
        online_softmax_update(q_tile, K_blk, V_blk)

To avoid the repeated loads, MSA inverts the mapping and groups queries by the KV block they selected. In other words, it executes in key-block-major order, which MiniMax calls “KV outer gather Q”. Each block is loaded once, partial attention is computed for all of the queries that selected it, and the partial results are merged afterwards. This improves arithmetic intensity.

k2q  = invert(I)                  # row = (seq, kv_block); entries = queries that selected it
work = chunk_rows(k2q, q_budget)  # split hot rows for load balance (more below)

blk, q_list = work[work_id]
K_blk, V_blk = bulk_load(blk)              # ONE contiguous load; resident for the threadblock's lifetime
for q_tile in tiles(q_list, BM):           # stream the selecting queries through it
    Q_t = gather_rows(Q, q_tile)           # gather the queries (scattered rows)
    O_p, lse = attend_one_block(Q_t, K_blk, V_blk)   # single-tile softmax -- next section
    O_partial[q_tile, slot(q_tile, blk)]   = O_p     # per-(query, block) partials,
    LSE_partial[q_tile, slot(q_tile, blk)] = lse     # merged by a separate combine pass
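
To make the inversion concrete, here is a small host-side sketch of `invert` and `chunk_rows`, followed by a count of block loads under each schedule. It is a toy model under stated assumptions: each query is treated as its own tile, every work item costs exactly one block load, and the sequence dimension of the `(seq, kv_block)` row key is dropped. The function bodies are guesses at what the pseudocode's helpers do, not the actual implementation.

```python
from collections import defaultdict

def invert(selection):
    """selection[q] = block IDs picked by query q -> {block ID: [queries that picked it]}."""
    k2q = defaultdict(list)
    for q, blocks in enumerate(selection):
        for blk in blocks:
            k2q[blk].append(q)
    return dict(k2q)

def chunk_rows(k2q, q_budget):
    """Split each block's query list into work items of at most q_budget queries."""
    work = []
    for blk in sorted(k2q):
        queries = k2q[blk]
        for i in range(0, len(queries), q_budget):
            work.append((blk, queries[i:i + q_budget]))
    return work

# Four queries, each with k = 3 selected blocks; blocks 0 and 3 are "hot".
selection = [[0, 1, 3], [0, 2, 3], [0, 1, 3], [0, 1, 3]]
k2q = invert(selection)
work = chunk_rows(k2q, q_budget=3)

# Query-major: one block load per (query, block) pair.
query_major_loads = sum(len(blocks) for blocks in selection)
# Block-major: one block load per work item.
block_major_loads = len(work)
print(query_major_loads, block_major_loads)  # 12 6
```

The smaller `q_budget` is, the more a hot block is split and reloaded, so the budget trades load balance against reuse. With an unlimited budget this example would need only four loads, one per distinct block.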

This structure has the added benefit of simplifying the softmax computation. Query-major attention needs an online softmax, because each query streams through several KV blocks and must rescale its running maximum and sum as it goes. In the block-major format, a thread block only ever sees one KV block per query group, so the softmax can be computed on a single tile with no online correction. The per-block partial results are then merged in a separate pass, much like the split-KV reduction step in flash decoding.
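
The single-tile softmax and the later merge can be checked numerically. The sketch below is a plain-Python illustration for one query, with the usual score scaling omitted. `attend_one_block` mirrors the name used in the pseudocode, while `combine` is a hypothetical name for the separate combine pass. It shows that per-block outputs, weighted by their log-sum-exp values, reproduce ordinary softmax attention over all selected blocks.

```python
import math

def attend_one_block(q, K_blk, V_blk):
    """Softmax attention of one query over one KV block.

    Returns (output vector, log-sum-exp of the scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K_blk]
    m = max(scores)  # subtract the tile max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(V_blk[0])
    out = [sum(w * v[d] for w, v in zip(weights, V_blk)) / denom for d in range(dim)]
    return out, m + math.log(denom)

def combine(partials):
    """Merge per-block (output, lse) pairs into the output over all blocks."""
    m = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - m) for _, lse in partials]
    total = sum(coeffs)
    dim = len(partials[0][0])
    return [
        sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / total
        for d in range(dim)
    ]

q = [1.0, 0.5]
K = [[0.2, 0.1], [1.0, -0.3], [0.0, 0.7], [-0.5, 0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 0.5]]

full, _ = attend_one_block(q, K, V)               # reference: all four keys at once
merged = combine([
    attend_one_block(q, K[:2], V[:2]),            # block A, computed independently
    attend_one_block(q, K[2:], V[2:]),            # block B, computed independently
])
max_err = max(abs(a - b) for a, b in zip(full, merged))
```

Because each block's partial result depends only on that block, the per-block work needs no running rescale. All cross-block normalization happens once, in `combine`.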

The MiniMax M3 model brings innovations that require whole-stack optimizations, from kernels to cloud. This is only possible on the Modular platform. MiniMax M3 is available on Modular Cloud today for enterprise customers. Talk to our AI engineers to request access.
