{"slug": "modular-day-zero-minimax-m3-open-weights-on-modular-cloud", "title": "Modular: Day Zero: MiniMax M3 Open Weights on Modular Cloud", "summary": "MiniMax released the open-weights MiniMax M3 model on Modular Cloud, featuring a new Sparse Attention operation that achieves up to 15.6x speedup on decode while maintaining a 1 million token context window. The model is optimized for coding, agentic tasks, and native multimodality, with Modular providing deployment options on its cloud or in customer VPCs.", "body_md": "June 12, 
2026\n\nModular Team\n\nCompany\n\nMiniMax M3 is MiniMax’s newest open-weights model, optimized for coding, agentic work, and native multimodality. A few things that make this a frontier model are:\n\nBehind M3 is a new MiniMax Sparse Attention (MSA) operation. MSA is what makes a 1M-token context servable, and it is a big part of what makes M3 demanding to run well. When optimized well, though, MSA’s design cuts per-token attention compute to roughly 1/20th of that of its full-attention predecessor. This results in around a 9.7× speedup on prefill and a 15.6× speedup on decode, while matching full attention across the vast majority of workloads.\n\nMSA splits every attention layer into two parts: which KV to look at, and how to attend to it. The first is solved by introducing an indexing layer. For each query, the indexer scores candidate KV blocks and chooses the top-k blocks. The indexer also maintains a cache of index keys with a single shared head and a small head dimension. By focusing only on the top-scoring blocks, MSA computes attention over just the selected 128-token KV blocks rather than over the full KV cache.\n\n```\n# One MSA layer, conceptually\ns = (Q_idx @ K_idx.T) * idx_scale      # single shared index head, tiny d_idx -- nearly free\nS = block_max_pool(s, B=128)           # token scores -> 128-token block scores\nS[:, :init_blocks]  = INF             # force-select the attention-sink blocks\nS[:, local_window:] = INF - eps       # force-select the recent window\nI = topk_per_kv_group(S, k)            # ONE selection, shared by every head in the GQA group\nO = softmax_attention(Q, K[I], V[I])   # ordinary GQA over the REAL K/V of the selected blocks\n```\n\nThe model produces the selection in query-major form: for each query, a list of top-k block IDs. The natural kernel follows that shape: loop over queries, gather their selected KV blocks, and then attend. 
Executing in query-major order, however, means each query independently gathers its selected blocks, so the same KV block may be fetched from HBM many times, which wastes memory bandwidth.\n\n```\n# Query-major: the natural schedule\nfor q_tile in queries:                    # parallel across threadblocks\n    for blk in I[q_tile]:                 # this tile's top-k blocks\n        K_blk, V_blk = load_block(blk)    # hot blocks re-fetched by EVERY threadblock that picked them\n        online_softmax_update(q_tile, K_blk, V_blk)\n```\n\nTo avoid the repeated loads, MSA inverts the mapping by grouping the queries by the KV block they selected, i.e. executing in key-block-major form, which MiniMax calls “KV outer gather Q”. This improves arithmetic intensity: each block is loaded once, partial attention is computed for all of the queries that selected it, and the partial results are then merged.\n\n```\n# Once per step: transpose the selection (a sparse-matrix transpose into CSR)\nk2q  = invert(I)                  # row = (seq, kv_block); entries = queries that selected it\nwork = chunk_rows(k2q, q_budget)  # split hot rows for load balance (more below)\n\n# Block-major forward: one threadblock per work item -- each KV byte leaves HBM once\nblk, q_list = work[work_id]\nK_blk, V_blk = bulk_load(blk)              # ONE contiguous load; resident for the threadblock's lifetime\nfor q_tile in tiles(q_list, BM):           # stream the selecting queries through it\n    Q_t = gather_rows(Q, q_tile)           # gather the queries (scattered rows)\n    O_p, lse = attend_one_block(Q_t, K_blk, V_blk)   # single-tile softmax -- next section\n    O_partial[q_tile, slot(q_tile, blk)]   = O_p     # per-(query, block) partials,\n    LSE_partial[q_tile, slot(q_tile, blk)] = lse     # merged by a separate combine pass\n```\n\nThis structure has the added benefit of simplifying the softmax computation. Recall that query-major attention needs an online softmax, since each query sees its selected blocks one at a time. 
But in the block-major format, a thread block only ever sees one KV block per query group, so the softmax can be computed on a single tile with no online correction. The separate combine pass that merges the per-block partials is very similar to the split-KV reduction step in flash decoding.\n\nThe MiniMax M3 model brings innovations that require whole-stack optimization, from kernels to cloud. This is only possible on the Modular platform. MiniMax M3 is available on Modular Cloud today for enterprise customers. Talk to our AI engineers to request access today.", "url": "https://wpnews.pro/news/modular-day-zero-minimax-m3-open-weights-on-modular-cloud", "canonical_source": "https://www.modular.com/blog/day-zero-minimax-m3-open-weights-on-modular-cloud", "published_at": "2026-06-12 00:00:00+00:00", "updated_at": "2026-06-12 14:09:16.051477+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-products", "ai-infrastructure", "ai-startups"], "entities": ["Modular", "MiniMax", "Hippocratic AI", "DeepSeek", "FLUX", "Kimi", "Wan"], "alternates": {"html": "https://wpnews.pro/news/modular-day-zero-minimax-m3-open-weights-on-modular-cloud", "markdown": "https://wpnews.pro/news/modular-day-zero-minimax-m3-open-weights-on-modular-cloud.md", "text": "https://wpnews.pro/news/modular-day-zero-minimax-m3-open-weights-on-modular-cloud.txt", "jsonld": "https://wpnews.pro/news/modular-day-zero-minimax-m3-open-weights-on-modular-cloud.jsonld"}}