Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI

StepFun released Step 3.7 Flash, a 198B-parameter Mixture-of-Experts vision-language model with 11B active parameters, for enterprise deployment on NVIDIA-accelerated infrastructure. The multimodal AI model supports native image and video input, three configurable reasoning levels, and a 256k context window for use cases including financial analysis and concurrent coding agents. NVIDIA offers the model through GPU-accelerated endpoints for prototyping and as a NIM containerized microservice for production deployment across on-premises, cloud, or hybrid environments.

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and language in real time, turning fragmented information into actionable insights. Step 3.7 Flash (https://huggingface.co/stepfun-ai/Step-3.7-Flash), the latest model from StepFun, brings these capabilities to enterprise-scale production on NVIDIA-accelerated infrastructure.

It is a 198B-parameter Mixture-of-Experts vision-language model with approximately 11B activated parameters per forward pass, optimized for agentic workflows that combine perception, search, and multi-step reasoning. With native image and video input, three configurable reasoning levels (low, medium, and high), and a 256K context window, it is designed for enterprise use cases such as financial analysis, concurrent coding agents, and other high-throughput multimodal workloads.

Developers can use StepFun’s NVFP4-quantized checkpoint, available through Hugging Face, for faster inference: the quantized weights reduce memory bandwidth and storage requirements.

| Model | Step 3.7 Flash |
| --- | --- |
| Total parameters | 198B |
| Visual encoder parameters | 1.8B |
| Active parameters | 11B |
| Context length | 256K |
| Experts | 288 (8 active) |

Table 1. Overview of the key Step 3.7 Flash specs, including parameter counts, context length, and MoE configuration

Step 3.7 Flash can be deployed with open source frameworks such as SGLang (https://docs.sglang.io/cookbook/autoregressive/StepFun/Step-3.7-Flash), NVIDIA TensorRT-LLM (https://nvidia.github.io/TensorRT-LLM/models/supported-models.html), and vLLM (https://docs.vllm.ai/projects/recipes/en/latest/StepFun/Step-3.7-Flash.html), which provide kernels optimized for NVIDIA hardware.

Build with NVIDIA endpoints

Developers can use GPU-accelerated endpoints available through build.nvidia.com for prototyping and evaluating Step 3.7 Flash.
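As a minimal sketch of what a prototype request might look like, the snippet below builds an OpenAI-style multimodal chat message (text plus a base64-encoded image) and sends it to a hosted endpoint. Several details are assumptions that this post does not confirm: the base URL `https://integrate.api.nvidia.com/v1`, the model ID `stepfun/step-3.7-flash` (the ID used in the NIM example later in this post), the `NVIDIA_API_KEY` environment variable, and that the endpoint accepts `image_url` content parts. The helper names `build_image_message` and `ask_about_image` are hypothetical.

```python
import base64
import os


def build_image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one OpenAI-style user message carrying text plus an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image travels as a data URL inside an image_url content part.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


def ask_about_image(prompt: str, image_path: str) -> str:
    """Send the message to a hosted endpoint (assumed URL and model ID)."""
    from openai import OpenAI  # imported lazily so the builder works without the SDK

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",  # assumption, not from this post
        api_key=os.environ["NVIDIA_API_KEY"],  # assumption: key stored in this variable
    )
    with open(image_path, "rb") as f:
        message = build_image_message(prompt, f.read())
    response = client.chat.completions.create(
        model="stepfun/step-3.7-flash",  # assumption: same ID as the NIM example
        messages=[message],
        max_tokens=512,
    )
    return response.choices[0].message.content
```

Calling `ask_about_image("Summarize this chart.", "chart.png")` would issue the request, while `build_image_message` can be checked offline without credentials.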
Try the model in the demo notebook (https://github.com/NVIDIA/GenerativeAIExamples/tree/main/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence), which uses Step 3.7 Flash and NVIDIA Nemotron Parse (https://build.nvidia.com/nvidia/nemotron-parse). The multi-step document intelligence pipeline extracts structured insights, with bounding boxes, from large, complex documents such as financial reports, slide decks, and scientific papers (including PDFs), and organizes the output.

Production-ready deployment with NVIDIA NIM

NVIDIA NIM (https://www.nvidia.com/en-us/ai/) makes it easy to take Step 3.7 Flash from development into production. Available as optimized, containerized inference microservices, NIM packages the model with the performance tuning, standardized APIs, and deployment flexibility enterprises need. Download and run it on-premises, in the cloud, or across hybrid environments.

NIM exposes a standard OpenAI-compatible API for sending inference requests to the NIM server:

- Download the NIM container from the NVIDIA container registry (https://catalog.ngc.nvidia.com/orgs/nim/teams/stepfun-ai/containers/step-3.7-flash; enterprise license required).
- Start the server, then connect to it with the OpenAI client.
- Send either text or image input to the endpoint.

```python
from openai import OpenAI

# Point the OpenAI client at the local NIM server.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="no-key-required")

# Request a streamed chat completion.
completion = client.chat.completions.create(
    model="stepfun/step-3.7-flash",
    messages=[{"role": "user", "content": "Explain particle physics."}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True,
)

# Print text as it arrives, skipping chunks that carry no content.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

Day 0 fine-tuning with NVIDIA NeMo Framework

Step 3.7 Flash can be customized with domain-specific data using open libraries from the NVIDIA NeMo framework (https://github.com/NVIDIA-NeMo/).
The NVIDIA NeMo Automodel (https://github.com/nvidia-nemo/automodel) library combines native PyTorch n-D parallelisms with optimized performance and supports Day 0 fine-tuning directly from Hugging Face model checkpoints, with no checkpoint conversion. The Automodel fine-tuning recipe (https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/step-3-7.md) for Step 3.7 supports techniques such as supervised fine-tuning (SFT) and memory-efficient LoRA at 600 tokens/sec on Hopper GPUs. For advanced large-scale training, teams can also use the NeMo Megatron-Bridge (https://github.com/nvidia-nemo/megatron-Bridge/) fine-tuning recipe (https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/stepfun/step37), which provides additional performance optimizations.

From data center deployments on NVIDIA Blackwell, to deskside systems such as NVIDIA DGX Station (https://www.nvidia.com/en-us/products/workstations/dgx-station/), to managed NIM microservices and Day 0 fine-tuning workflows, NVIDIA provides a range of options for integrating Step 3.7 Flash across different stages of development and deployment. With 748 GB of coherent memory, DGX Station offers increased headroom for running Step 3.7 Flash at the full 256K context length, along with faster local developer iteration.

NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open source licenses (https://developer.nvidia.com/open-source). NVIDIA is committed to open models such as Step 3.7 Flash that promote AI transparency and enable users to share their AI safety and resilience work.

To get started, check out Step 3.7 Flash (https://huggingface.co/stepfun-ai/Step-3.7-Flash) on Hugging Face, test it with your own data on build.nvidia.com, or run it locally on DGX Station using the vLLM Playbook (https://build.nvidia.com/station/vllm).
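The memory headroom point for DGX Station can be made concrete with back-of-the-envelope arithmetic. The sketch below estimates weight storage for 198B parameters at a few precisions and compares it with the 748 GB figure quoted above. It assumes every parameter is stored at the stated width, treats 1 GB as 10^9 bytes, and approximates NVFP4 as 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-element block). A real checkpoint may keep some layers at higher precision, and the KV cache and activations consume additional memory. The function names are illustrative only.

```python
TOTAL_PARAMS = 198_000_000_000   # total parameters, from Table 1
ACTIVE_PARAMS = 11_000_000_000   # approximate active parameters per forward pass
DGX_STATION_GB = 748             # coherent memory quoted for DGX Station

# Bits per stored weight; the NVFP4 entry is an approximation (assumption).
BITS_PER_WEIGHT = {"bf16": 16.0, "fp8": 8.0, "nvfp4 (approx.)": 4.5}


def weight_gb(num_params: int, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes) if all parameters use the given width."""
    return num_params * bits_per_weight / 8 / 1e9


def headroom_gb(num_params: int, bits_per_weight: float, memory_gb: float) -> float:
    """Memory left for KV cache, activations, and runtime after loading weights."""
    return memory_gb - weight_gb(num_params, bits_per_weight)


if __name__ == "__main__":
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
    for name, bits in BITS_PER_WEIGHT.items():
        used = weight_gb(TOTAL_PARAMS, bits)
        free = headroom_gb(TOTAL_PARAMS, bits, DGX_STATION_GB)
        print(f"{name:>16}: weights ~{used:6.1f} GB, headroom ~{free:6.1f} GB")
```

Under these assumptions, 16-bit weights take roughly 396 GB and the 4-bit checkpoint roughly 111 GB, so the quantized checkpoint leaves considerably more room for long-context KV cache.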