StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows

StepFun released Step 3.7 Flash, a 198-billion-parameter Mixture-of-Experts vision-language model with native image understanding and improved tool-use reliability for agentic coding and search workflows. The model achieves 56.26% on SWE-Bench Pro and 59.55% on Terminal-Bench 2.1. According to StepFun's internal figures, its Advisor Mode reaches 97% of Claude Opus 4.6's coding performance at roughly one-ninth the per-task cost. Step 3.7 Flash is available under an Apache 2.0 license with a 256k-token context window and throughput of up to 400 tokens per second.

StepFun today released Step 3.7 Flash (https://github.com/stepfun-ai/Step-3.7-Flash), a multimodal Mixture-of-Experts model targeting agentic use cases. It adds native vision input and improved tool-use reliability over Step 3.5 Flash.

What is Step 3.7 Flash?

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder (ViT) for native image understanding.

The model activates approximately 11B parameters per token during inference. In MoE architectures, only a subset of “expert” sub-networks fires per forward pass, not the full network. This keeps inference compute closer to that of an 11B dense model while maintaining a 198B total parameter budget.

Key specs:

| Spec | Value |
|---|---|
| Total parameters | 198B (196B language + 1.8B ViT) |
| Active parameters per token | ~11B |
| Context window | 256k tokens |
| Throughput | Up to 400 tokens/sec |
| Reasoning levels | Low, medium, high |
| License | Apache 2.0 |

Architecture Notes

The vision encoder runs as a separate 1.8B ViT module. It injects image representations into the language backbone’s context. Step 3.5 Flash had no multimodal support; this is a new addition in 3.7.

Three selectable reasoning depths (low, medium, and high) let developers trade latency for reasoning depth. Low is faster and cheaper; high applies more computation per response.

Agentic Coding Performance

On SWE-Bench Pro, Step 3.7 Flash scores 56.26%, up from Step 3.5 Flash’s 51.3%, a gain of roughly 5 percentage points. On Terminal-Bench 2.1, it scores 59.55%, up from 53.37%. On SWE-MTLG (a multi-task long-generation coding benchmark), it scores 72.42%.
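The headline numbers above can be sanity-checked with a few lines of arithmetic. The sketch below is only a worked check of the figures quoted in this article; the variable and function names are illustrative and do not come from StepFun.

```python
# Figures quoted in the article; all names here are illustrative.
TOTAL_PARAMS_B = 198.0     # reported total, in billions (rounded)
LANGUAGE_PARAMS_B = 196.0  # language backbone
VISION_PARAMS_B = 1.8      # ViT vision encoder
ACTIVE_PARAMS_B = 11.0     # approximate active parameters per token


def point_gain(new: float, old: float) -> float:
    """Absolute improvement in percentage points."""
    return round(new - old, 2)


def relative_gain(new: float, old: float) -> float:
    """Improvement relative to the old score, in percent."""
    return round((new - old) / old * 100, 1)


# The two components add up to slightly under the rounded 198B total.
component_sum_b = round(LANGUAGE_PARAMS_B + VISION_PARAMS_B, 1)

# Share of the total parameter budget that is active for each token.
active_share_pct = round(ACTIVE_PARAMS_B / TOTAL_PARAMS_B * 100, 1)

# Step 3.7 Flash versus Step 3.5 Flash on the two benchmarks with a baseline.
swe_bench_pro_pp = point_gain(56.26, 51.3)
terminal_bench_pp = point_gain(59.55, 53.37)
swe_bench_pro_rel = relative_gain(56.26, 51.3)
terminal_bench_rel = relative_gain(59.55, 53.37)

print(f"Component sum: {component_sum_b}B of a reported {TOTAL_PARAMS_B:.0f}B")
print(f"Active share per token: {active_share_pct}%")
print(f"SWE-Bench Pro: +{swe_bench_pro_pp} pp ({swe_bench_pro_rel}% relative)")
print(f"Terminal-Bench 2.1: +{terminal_bench_pp} pp ({terminal_bench_rel}% relative)")
```

The check shows that roughly one in eighteen parameters is active per token, and that the "roughly 5 percentage points" on SWE-Bench Pro is 4.96 points before rounding.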
Cross-harness consistency on StepFun’s internal Step-SWE-Bench:

| Scaffold | Step 3.7 Flash | Step 3.5 Flash |
|---|---|---|
| Hermes Agent | 67.5% | 60.0% |
| OpenClaw | 67.0% | 47.0% |
| KiloCode | 67.5% | 59.0% |
| RooCode | 64.5% | 43.0% |
| Claude Code | 71.5% | 73.0% |
| OpenCode | 64.5% | 57.0% |

Step 3.5 Flash ranged from 43% to 73% across harnesses. Step 3.7 Flash ranges from 64.5% to 71.5%. In production, coding agents often run inside heterogeneous scaffolds, each with its own prompting conventions and tool schemas. Narrower per-harness variance means more predictable behavior across different setups.

Advisor Mode

Step 3.7 Flash supports Advisor Mode, StepFun’s implementation of the advisor strategy described by Anthropic. The model runs the agentic loop end-to-end (calling tools, reading results, iterating) and escalates to a larger advisor model only at specific inflection points, such as planning or recovering from repeated failures. Most of the run stays at executor cost.

With Advisor Mode enabled on SWE-Bench Verified, StepFun reports that Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the per-task cost ($0.19 vs. $1.76 per task). These are StepFun’s internal figures.

Multimodal Capabilities

Step 3.7 Flash supports two visual tool pathways:

- Visual Search Tool: For recognition tasks where the model’s parametric knowledge is insufficient (long-tail entities, recently emerged concepts), it invokes a visual search tool to retrieve and verify. On SimpleVQA with Search, it scores 79.16%, comparable to GPT 5.5 (79.11%) and above Kimi K2.6 (78.24%) and GLM 5V Turbo (78.20%).
- Python Tool: For fine-grained visual tasks (high-resolution images, visual probing, bounding-box analysis), it uses a code interface to crop, zoom, and draw pixels or bounding boxes. On V (a self-tested score, with Python), it scores 95.29%. On HR-Bench 4K and HR-Bench 8K, it scores 89.13% and 86.34% respectively.
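To give a feel for the kind of operations the Python Tool pathway covers, here is a minimal sketch of cropping, zooming, and drawing a bounding box. It is an assumption-laden illustration, not StepFun's tool interface: the image is a plain list of pixel rows, and the helpers `pad_box`, `crop`, `zoom`, and `draw_box` are hypothetical names invented for this example.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def pad_box(box: Box, pad: int, width: int, height: int) -> Box:
    """Grow a box by `pad` pixels on each side, clamped to the image bounds."""
    left, top, right, bottom = box
    return (max(0, left - pad), max(0, top - pad),
            min(width, right + pad), min(height, bottom + pad))


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def zoom(image: Image, factor: int) -> Image:
    """Nearest-neighbour upscaling by an integer factor."""
    zoomed: Image = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


def draw_box(image: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the image with the box outline set to `value`."""
    out = [list(row) for row in image]
    left, top, right, bottom = box
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


# Hypothetical 8x8 test image where each pixel encodes its own position.
WIDTH, HEIGHT = 8, 8
image = [[y * WIDTH + x for x in range(WIDTH)] for y in range(HEIGHT)]

target = (2, 2, 5, 5)                      # region of interest
context = pad_box(target, 1, WIDTH, HEIGHT)  # add a 1-pixel margin
region = crop(image, context)              # 5x5 crop around the target
enlarged = zoom(region, 2)                 # 10x10 zoomed view
outlined = draw_box(image, target)         # original with the box drawn on it

print(f"crop box: {context}, crop size: {len(region[0])}x{len(region)}")
print(f"zoomed size: {len(enlarged[0])}x{len(enlarged)}")
```

A real pipeline would more likely use an imaging library; plain lists are used here only so the sketch stays self-contained.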
StepFun notes an observed behavior during testing: the model combined visual tools with non-visual tools without being explicitly trained to do so. For example, after generating frontend code, it used the GUI to render and inspect the result before iterating. StepFun describes this as emergent compositional tool use.

On Android Daily (long-horizon phone UI task completion), Step 3.7 Flash scores 61.87%, ahead of Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash (63.21%) leads this benchmark.

Search and Research Benchmarks

StepFun focused this model’s search design on planning, evidence filtering, and synthesis, integrating search as part of the reasoning loop rather than as a separate add-on.

| Benchmark | Step 3.7 Flash | Notable comparison |
|---|---|---|
| HLE with Tools (acc) | 47.20% | DeepSeek V4 Flash: 45.10% |
| BrowseComp (acc) | 75.82% | Claude Opus 4.7: 79.30% |
| DeepSearchQA (F1) | 92.82% | Kimi K2.6: 92.50% |
| ResearchRubrics (score) | 71.68% | GPT 5.5: 61.50% |

Note: The HLE with Tools score of 47.20% compares to Step 3.5 Flash’s text-only score of 35.68%. Step 3.5 Flash did not support tool-augmented evaluation on HLE.

General Agent Benchmarks

| Benchmark | Step 3.7 Flash | Description |
|---|---|---|
| Toolathlon | 49.51% | Multi-tool coordination |
| ClawEval-1.1 | 67.07% | Daily autonomous task execution in realistic environments |
| GDPval (44 occupations) | 45.8% | General professional task execution |
| Tau2-bench Telecom | 98% | Across different reasoning difficulty tiers |

On ClawEval-1.1, Step 3.7 Flash (67.07%) leads DeepSeek V4 Flash (57.80%) and DeepSeek V4 Pro (59.80%) among the compared models.

Long-Context Performance

On AA-LCR (a long-context retrieval benchmark, avg@16/acc), Step 3.7 Flash scores 63.94%. This is comparable to DeepSeek V4 Flash (63.70%) and DeepSeek V4 Pro (66.30%).
Pricing

| Token Type | Price |
|---|---|
| Input (cache miss) | $0.20 / M tokens |
| Input (cache hit) | $0.04 / M tokens |
| Output | $1.15 / M tokens |

Key Takeaways

- Step 3.7 Flash is a 198B sparse MoE model with 11B active params and a 256k context window.
- Native multimodal support (images, GUIs, documents) is new; Step 3.5 Flash was text-only.
- Advisor Mode reaches 97% of Claude Opus 4.6's SWE-Bench Verified performance at $0.19 per task vs. $1.76.
- Cross-harness coding variance narrowed from a 43–73% range (3.5 Flash) to 64.5–71.5% (3.7 Flash).
- Released under Apache 2.0 with BF16, FP8, NVFP4, and GGUF weights on Hugging Face.

Check out the Model Weights, Repo (https://github.com/stepfun-ai/Step-3.7-Flash), and Technical Details (https://static.stepfun.com/blog/step-3.7-flash/).
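As a rough worked example of the listed prices, the sketch below estimates the cost of a single request and checks the Advisor Mode cost ratio. The prices and per-task costs are the ones quoted in this article; the token counts, the cache-hit ratio, and the helper name `request_cost` are hypothetical and chosen only for illustration.

```python
# USD per million tokens, as listed in the pricing table.
PRICE_INPUT_MISS = 0.20
PRICE_INPUT_HIT = 0.04
PRICE_OUTPUT = 1.15


def request_cost(input_tokens: int, output_tokens: int,
                 cache_hit_ratio: float = 0.0) -> float:
    """Estimate the USD cost of one request.

    `cache_hit_ratio` is the fraction of input tokens billed at the
    cache-hit price; the remainder is billed at the cache-miss price.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (miss_tokens * PRICE_INPUT_MISS
             + hit_tokens * PRICE_INPUT_HIT
             + output_tokens * PRICE_OUTPUT)
    return total / 1_000_000


# Hypothetical agent step: a 200k-token prompt and a 4k-token reply.
cold = request_cost(200_000, 4_000)                        # no cached input
warm = request_cost(200_000, 4_000, cache_hit_ratio=0.75)  # 75% cached input

# Per-task costs reported by StepFun for Advisor Mode vs. Claude Opus 4.6.
advisor_cost_ratio = round(1.76 / 0.19, 1)

print(f"cold cache: ${cold:.4f}, warm cache: ${warm:.4f}")
print(f"Claude Opus 4.6 costs about {advisor_cost_ratio}x more per task")
```

With long prompts, input tokens dominate the bill, so the cache-hit price has a larger effect on total cost than the output price in this example.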