Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

NVIDIA and MiniMax released the M3 multimodal AI model on NVIDIA accelerated infrastructure, including Blackwell GPUs, enabling long-context reasoning and agentic workflows. The 428-billion parameter Mixture-of-Experts model supports up to 1 million tokens of context and native video, image, and text input, with production deployment paths through TensorRT LLM, SGLang, and vLLM. The model's MiniMax Sparse Attention architecture delivers 9x faster prefill and 15x faster decoding at 1M-token context compared to its predecessor, eliminating the need for fragmented pipelines across separate text, vision, and code models.

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and code—leading to added complexity, higher costs, and slower iteration. MiniMax M3—available on NVIDIA accelerated infrastructure including NVIDIA Blackwell—changes this by enabling a single multimodal system capable of long-context reasoning, agentic workflows, and creative tasks. The 428B parameter MoE supports up to 1M tokens and native multimodal input. Developers can build applications like long video understanding, extended coding sessions 8+ hours , and high-quality design workflows—all with a unified model and production-ready deployment paths on NVIDIA platforms. Name | MiniMax M3 | | Input modalities | Video, image, text | | Total parameters | 428B | | Visual encoder parameters | 600M | | Active parameters | 22B | | Context length | 1M | | Experts | Total 128, 4 experts activated per token | | Precision format | BF16, MXFP8 | Table 1. MiniMax M3 a VLM MoE model specs MiniMax M3’s core architectural innovation is MiniMax Sparse Attention MSA , which replaces standard quadratic attention with a pre-filtering stage that identifies relevant context blocks and attends only to those. At the operator level, each KV cache block is read once with contiguous memory access—more than 4x faster than existing sparse attention implementations. This yields 1/20th the per-token compute of M2 at 1M-token context, with 9x faster prefill and 15x faster decoding, all without compressing key-values or sacrificing precision. The model also trains text, images, and video natively from step 0 across ~100 trillion interleaved tokens, rather than adding multimodality post-training. Open source inference Developers can use accelerated computing with their open source inference engine of choice, such as NVIDIA TensorRT LLM text-only , SGLang or vLLM. Deploying with NVIDIA TensorRT LLM The optimizations are available on the NVIDIA TensorRT LLM GitHub repository https://github.com/NVIDIA/TensorRT-LLM . Follow the quick start guide https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html to stand up a high-performance server—it covers downloading model checkpoints from Hugging Face, a ready-to-run Docker container, and configuration options for both low-latency and max-throughput serving. NVIDIA also collaborated on the developer experience through the Transformers library. Deploying with SGLang Users deploying models with the SGLang serving framework can use the following instructions. See the SGLang documentation https://docs.sglang.io/cookbook/autoregressive/MiniMax/MiniMax-M3 for more information and configuration options. bash 8 GPUs node case $ python -m sglang.launch server \ --model-path MiniMaxAI/MiniMax-M3 \ --dtype bfloat16 \ --tp-size 8 \ --ep-size 8 \ --trust-remote-code \ --mem-fraction-static 0.8 \ --enable-multimodal \ --quantization mxfp8 \ --attention-backend flashinfer \ --mm-attention-backend flashinfer cudnn \ --moe-runner-backend deep gemm \ --chunked-prefill-size 8192 \ --reasoning-parser minimax-m3 \ --tool-call-parser minimax-m3-nom --tr Deploying with vLLM When deploying models with the vLLM serving framework, use the following instructions. For more information, see the vLLM Recipe https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3 . vllm serve MiniMaxAI/MiniMax-M3 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --block-size 128 \ --mm-encoder-attn-backend FLASHINFER \ --mm-processor-cache-type shm \ --tool-call-parser minimax m3 \ --enable-auto-tool-choice \ --reasoning-parser minimax m3 \ --trust-remote-code Scaling with NVIDIA Dynamo Dynamo is an open source distributed inference serving platform for developers to deploy frontier models like MiniMax M3 for large-scale applications. Deploying MiniMax M3 using Dynamo with TensorRT LLM improves performance for long input sequence lengths without sacrificing throughput or increasing GPU budget. At 32k ISL, Dynamo delivers a 4x improvement in interactivity on NVIDIA Blackwell through disaggregated serving—a technique that separates the prefill and decode phases of inference across distinct GPUs to increase system efficiency. Dynamo integrates with all major inference engines and frameworks, including PyTorch, SGLang, TensorRT LLM, and vLLM, and offers LLM-aware routing, elastic autoscaling, and low-latency data transfer. Developers can follow the deployment guide https://github.com/ai-dynamo/dynamo/tree/release/1.3.0-minimax-m3-dev.1/recipes/minimax-m3 to run MiniMax M3 with Dynamo. Customize with NVIDIA NeMo Framework MiniMax M3 can be customized and fine-tuned with the open source NVIDIA NeMo Framework. https://github.com/NVIDIA-NeMo/ Users can: - Use NVIDIA NeMo AutoModel https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/minimax-m3.mdx for out-of-the-box fine-tuning both SFT and LoRA over Hugging Face checkpoints without any conversion, with high-throughput acceleration from full N-D parallelism. Specifically, context parallel support is available for sequence lengths up to 128k. - Use NVIDIA NeMo RL https://github.com/NVIDIA-NeMo/RL/blob/minimax-m3/docs/guides/minimax-m3.md to conduct reinforcement learning on top of Minimax M3, referencing the following sample accuracy curves https://github.com/NVIDIA-NeMo/RL/blob/minimax-m3/docs/guides/minimax-m3.md . These libraries provide developers with a suite of lightweight tools for rapid experimentation on the latest frontier models. Get started today Developers can prototype and evaluate MiniMax M3 by using the GPU-accelerated API on build.nvidia.com https://build.nvidia.com/minimaxai/minimax-m3 or by downloading the weights from Hugging Face https://huggingface.co/MiniMaxAI .