{"slug": "step-3-7-flash-open-source-multimodal-model-for-speed-and-agents", "title": "Step 3.7 Flash – Open-source multimodal model for speed and agents", "summary": "StepFun released Step 3.7 Flash, an open-source multimodal model designed for agentic coding and enterprise task execution. It averages 67.08% across six agent harnesses on the in-house Step-SWE-Bench and improves on its predecessor by about 5 points on SWE-Bench Pro and 6.1 points on Terminal-Bench 2.1. The model supports native multimodal understanding, web and visual search enhancement, and reliable tool orchestration across mainstream agent harnesses, with an Advisor Mode that reaches 97% of Claude Opus 4.6's coding performance at roughly one-ninth the per-task cost. The release signals a shift from question-answering models to action-oriented digital agents capable of independently driving long-horizon assignments in dynamic enterprise environments.", "body_md": "# Step 3.7 Flash\n\nThe new frontier is agent efficiency.\n\nA high-efficiency Flash model for real-world agents.\n\n### Key Features\n\n#### Native Multimodal Understanding & Acting\n\nUnderstands images across the full range — product UIs, documents, charts, and natural scenes — then writes code or calls tools to act on what it sees.\n\n#### Web & Visual Search Enhancement\n\nWeb search reaches further — more sources, deeper follow-up. Visual search recognizes what other systems don't — long-tail entities, freshly emerged concepts.\n\n#### Reliable Tool Use & Orchestration\n\nDrives terminals, browsers, Office tools, search, and beyond — staying coherent however long the run gets. Less drift, fewer broken tool calls, fewer failed runs.\n\n#### Agent Ecosystem Compatibility\n\nWorks with mainstream harnesses (Claude Code, KiloCode, Hermes Agent, OpenClaw) and Skills — lower integration cost, less workflow rewiring.\n\n## Agentic Coding\n\nFoundation models are shifting from answering questions to taking action, and in the digital world that action takes the form of code. 
Coding is the substrate of digital agency, the purest form of the plan–execute–observe–iterate loop, and the leading indicator of where a model's broader agentic capability is heading. We invested heavily in this surface for **Step 3.7 Flash**. Compared to Step 3.5 Flash, it gains **about 5 points on SWE-Bench Pro** and **6.1 points on Terminal-Bench 2.1**.\n\n##### Step-SWE-Bench\n\nAverage across the six harnesses below: **67.08%** for Step 3.7 Flash vs. **56.50%** for Step 3.5 Flash.\n\n| Harness | Step 3.7 Flash | Step 3.5 Flash |\n|---|---|---|\n| Hermes Agent | 67.50% | 60.00% |\n| OpenClaw | 67.00% | 47.00% |\n| Claude Code | 71.50% | 73.00% |\n| KiloCode | 67.50% | 59.00% |\n| OpenCode | 64.50% | 57.00% |\n| RooCode | 64.50% | 43.00% |\n\nIn production, coding agents rarely run on a single scaffold. They live inside a heterogeneous stack of harnesses, each with its own prompting conventions, tool schemas, and orchestration patterns — and a model has to perform reliably across all of them to be genuinely useful. **Step 3.7 Flash** is markedly more balanced across this stack than Step 3.5 Flash: on our in-house Step-SWE-Bench, the spread between the best and worst harness narrows from 30 points (43.00% to 73.00%) to 7 points (64.50% to 71.50%).\n\nTo push quality further without giving up Flash-tier efficiency, **Step 3.7 Flash** supports Advisor Mode. **Step 3.7 Flash** drives the trajectory end-to-end — calling tools, reading results, and iterating — and consults a larger advisor model only at the few inflection points where its own judgment falls short, such as planning or recovering from repeated failures. This is Step's implementation of the [advisor strategy](https://claude.com/blog/the-advisor-strategy) described by Anthropic, where a small executor stays in control and escalates to a frontier advisor only when needed, keeping most of the run at executor cost. 
With Advisor Mode enabled, **Step 3.7 Flash** reaches **97% of Claude Opus 4.6's coding performance at roughly one-ninth the per-task cost** (`$0.19` vs. `$1.76` per task).\n\n## Sharpened for Enterprise Tasks\n\nEnterprise work inherently depends on two critical pillars: autonomous task execution in dynamic environments and deep, domain-specific vertical knowledge. **Step 3.7 Flash** is purpose-built and rigorously optimized across both frontiers to independently drive assignments and ship production-grade deliverables.\n\nThe model combines strong agentic execution with precise intent understanding and rich multimodal perception, allowing it to seamlessly bridge the gap between comprehension and action. Users can hand Step 3.7 Flash a complete piece of knowledge work and trust it to independently map out the plan, search across live sources, extract key information, and fluidly orchestrate tools to deliver a ready-to-ship result without intervention. It reads and directly acts on mixed inputs—such as screenshots, complex documents, and dense spreadsheets—parsing visual context and digital assets simultaneously. This long-horizon task execution is validated across diverse environments, where Step 3.7 Flash achieves **49.5% on Toolathlon** for multi-tool coordination and **67.1% on ClawEval-1.1** for daily autonomous task execution in realistic environments.\n\nThe path from general intelligence to true professional expertise starts with real expert practices. By partnering deeply with domain specialists, we have embedded native industry know-how into the model, validating its capabilities through our own benchmarks in finance, accounting, and data analysis. This expertise extends well beyond specialized domains: Step 3.7 Flash reaches **45.8% on GDPval** across 44 occupations, and achieves a pass rate above **98% across reasoning difficulty tiers on Tau2-bench Telecom**.\n\n## Search Wider and Deeper\n\nFor a model at the scale of Step 3.7 Flash, the goal is not to pack every piece of world knowledge into its weights, but to make the model better at calling upon that knowledge when needed. 
We therefore focus its capabilities on search planning, evidence filtering, and information synthesis, turning search from an external add-on into a native part of the reasoning process.\n\nStep 3.7 Flash delivers strong results across search-heavy benchmarks. It scores **47.20% on HLE with Tools**, up from 35.68% (text-only) for Step 3.5 Flash, and outperforms DeepSeek V4 Flash (45.10%) and Gemini 3.5 Flash (40.20%). It reaches **75.82% on BrowseComp**, approaching larger models such as Claude Opus 4.7 and GLM 5.1 (both 79.30%). **On DeepSearchQA, it achieves 92.82% F1 score**, comparable to Kimi K2.6 (92.50%), a 1T / 32B-active model. **On ResearchRubrics, it scores 71.68%**, ahead of GPT 5.5 at 61.50% and close to Claude Opus 4.7 at 73.92%. These results show that Step 3.7 Flash combines Flash-level efficiency with strong deep-retrieval and research capabilities.\n\nThe trajectories further highlight both the breadth and depth of its search behavior. In the Ontario lawyer conflict-of-interest case, for example, it expanded its search around domain-specific concepts, combined evidence from papers, course materials, official rules, and case analyses, and caught the key traps in the questions.\n\n## Agents That Can SEE\n\nWe establish **Step 3.7 Flash** as an agentic foundation model with vision input support, shifting perception and recognition from parametric capacity to test-time scaling with visual tools. As a first step, we strengthen its ability to invoke the Visual Search tool, which compensates for the gaps in parametric knowledge that come with **Step 3.7 Flash**'s limited model size. 
As shown in the table below, on visual recognition tasks, **Step 3.7 Flash** with Visual Search achieves performance on par with models five times its size.\n\n### Visual Recognition with Visual Search\n\n| Flash Level | Pro Level | |||\n|---|---|---|---|---|\n| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | GPT 5.5 |\n| SimpleVQA | 79.16% | 78.24%* | 78.20% | 79.11%* |\n| WorldVQA | 58.10% | 55.98%* | 47.81%* | 54.58%* |\n| BC-VL | 58.96% | 57.12%* | 51.90%* | 65.68%* |\n\n- * denotes a self-tested score.\n\nFor a broader set of challenging vision tasks that demand fine-grained perception over high-resolution images or visual reasoning capabilities—such as V*, HR-Bench, and VisualProbe—we grant the model an enriched action space to interact with images, including cropping, zooming in and out, and drawing pixels or bounding boxes. These tools are implemented as a unified code interface, commonly referred to in the field as the *Python tool*. With Python, **Step 3.7 Flash** achieves exceptionally strong performance on these benchmarks.\n\n### Visual Perception with Python Tool\n\n| Flash Level | Pro Level | |||\n|---|---|---|---|---|\n| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | Gemini 3 Flash |\n| V* | 95.29% | 96.90% | 89.00% | 96.30% |\n| HR-Bench 4K | 89.13% | 91.25%* | 84.62% | 94.50% |\n| HR-Bench 8K | 86.34% | 90.13%* | 83.12% | 94.80% |\n| VisualProbe | 65.05% | 64.47%* | 53.01% | 69.90% |\n\n- * denotes a self-tested score.\n- The GLM results were aligned with official GLM personnel, using crop + search and other tools.\n\nOne particularly interesting finding is the emergent ability of compositional generalization across visual and other tools. 
During testing, **Step 3.7 Flash** seamlessly combined visual tools with non-visual ones to accomplish complex tasks, despite never having been explicitly guided toward such compositional tool use during training.\n\n**Visual Reasoning with Python Tool**\n\n**Compositional Usage across Visual and Non-visual Tools**\n\nOperating graphical user interfaces (GUIs) is another foundational visual capability for an agentic model — many real-world tasks live beyond the chatbox and the CLI, and require the agent to see, click, and verify. We extend **Step 3.7 Flash** with GUI operation, in particular for the Phone-use stack, so that it can complete long-horizon tasks across multiple apps. On the Android Daily benchmark, **Step 3.7 Flash** achieves a substantial improvement over last year's **Step-GUI** in stability, robustness, and long-horizon completion, and scores ahead of larger models.\n\n### Score of Android Daily Benchmark\n\n- * denotes a self-tested score.\n- Android Daily: [https://arxiv.org/abs/2605.27761](https://arxiv.org/abs/2605.27761)\n\nThe same compositional pattern we observed across visual tools also surfaces here: in the following case, after writing a piece of frontend code, the model autonomously turned to the GUI to test the page it had just produced — inspecting the rendered output, exercising interactive elements, and iterating on its own code based on what it saw. Again, this code-and-GUI compositional behavior was never explicitly demonstrated or rewarded during training, yet emerges robustly in test-time use.\n\n**GUI Operation**\n\n## Benchmarks\n\nIn our benchmark table, we provide a detailed, side-by-side comparison with today's top-performing open-source and proprietary models. Across a wide range of metrics, Step 3.7 Flash stands out with consistently strong results. 
Our evaluation focuses on three core dimensions—Reasoning, Coding and Agentic Capability.\n\n| Flash Level | Pro Level | ||||||||\n|---|---|---|---|---|---|---|---|---|---|\n| Benchmarks | Step 3.7 Flash | Step 3.5 Flash | DeepSeek V4 Flash | Gemini 3.5 Flash | DeepSeek V4 Pro | GPT 5.5 | Claude Opus 4.7 | Kimi K2.6 | GLM 5.1 |\n| Total Params | 196B + 1.8B (ViT) | 196B | 284B | — | 1.6T | — | — | 1T | 754B |\n| Active Params | 11B | 11B | 13B | — | 49B | — | — | 32B | 40B |\n| Multi-modal | |||||||||\n| General Agent | |||||||||\n| HLE w. tool (acc) | 47.20% (text-only 49.70%) | 35.68% | 45.10% | 40.20% | 48.20% | 52.20% | 54.70% | 54.00% | 52.30% |\n| BrowseComp (acc) | 75.82% | 69.00% | 73.20% | — | 83.40% | 90.10% | 79.30% | 83.20% | 79.30% |\n| deepsearchQA (F1) | 92.82% | 85.48%* | 90.61%* | — | — | 93.98%* | 91.74%* | 92.50% | 91.16%* |\n| deepsearchQA (acc) | 81.69% | 73.44% | 79.76%* | — | — | 85.31%* | 82.31%* | 83.00% | 81.31%* |\n| ResearchRubrics (score) | 71.68% | 65.30% | 66.17%* | 63.58%* | 68.31%* | 61.50%* | 73.92%* | 62.96%* | 67.90%* |\n| Toolathlon | 49.51% | 33.33% | 52.78%* | 56.50% | 51.80% (56.61%*) | 60.18%* | 65.43%* | 54.63%* | 48.09%* |\n| Claweval-v1.1 (pass^3) | 67.07% | 43.60% | 57.80% | — | 59.80% | — | — | 62.30% | 62.30% |\n| GDPval-Stirrup (rubric-score) | 1415.8 (ii 45.79%) | 1055.0 (ii 27.75%) | 1414.0 (ii 44.00%) | 1656 (ii 57.80%) | 1554 (ii 53.00%) | 1769 (ii 63.00%) | 1753 (ii 63.00%) | 1481 (ii 49.00%) | 1535 (ii 52.00%) |\n| Coding | |||||||||\n| SWE-MTLG | 72.42% | 67.40% | 73.30% | — | 76.20% | — | 80.50% | 76.70% | — |\n| SWE-Bench Pro | 56.26% | 51.30% | 55.60%* | 55.10% | 55.40% | 58.60% | 64.30% | 58.60% | 58.40% |\n| Terminal-Bench 2.1 | 59.55% | 53.37% | 62.00%* | 76.20% | 72.00% | 78.2% ± 2.4 | 66.1% ± 2.7 | — | 69.00% |\n| Long Context | |||||||||\n| AA-LCR (avg@16/acc) | 63.94% | 45.50% | 63.70% (63.00%*) | — | 66.30% (66.30%*) | — | — | 69.10% (69.70%*) | 64.90% (62.30%*) |\n\n- \"—\" indicates the score is 
not publicly available or not tested. * denotes a self-tested score.\n\n## Availability, Deployment, and Ecosystem\n\n### Availability\n\nStep 3.7 Flash is available through StepFun Open Platform at [platform.stepfun.ai](https://platform.stepfun.ai/) and [platform.stepfun.com](https://platform.stepfun.com/), as well as partner platforms including OpenRouter and NVIDIA NIM.\n\n### Deployment\n\nStep 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / Mac Pro devices with at least 128GB unified memory.\n\n### Ecosystem\n\nStep 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. 
For model development workflows, StepFun model support has landed in the NVIDIA Megatron ecosystem, including Megatron Core and Megatron Bridge.", "url": "https://wpnews.pro/news/step-3-7-flash-open-source-multimodal-model-for-speed-and-agents", "canonical_source": "https://static.stepfun.com/blog/step-3.7-flash/", "published_at": "2026-05-29 04:15:58+00:00", "updated_at": "2026-05-29 04:46:04.612551+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "generative-ai", "ai-research", "ai-products"], "entities": ["Step 3.7 Flash", "Step 3.5 Flash", "Claude Code", "KiloCode", "Hermes Agent", "OpenClaw", "SWE-Bench Pro", "Terminal-Bench 2.1"], "alternates": {"html": "https://wpnews.pro/news/step-3-7-flash-open-source-multimodal-model-for-speed-and-agents", "markdown": "https://wpnews.pro/news/step-3-7-flash-open-source-multimodal-model-for-speed-and-agents.md", "text": "https://wpnews.pro/news/step-3-7-flash-open-source-multimodal-model-for-speed-and-agents.txt", "jsonld": "https://wpnews.pro/news/step-3-7-flash-open-source-multimodal-model-for-speed-and-agents.jsonld"}}