Step 3.7 Flash is a drop-in — except for one endpoint detail

wpnews.pro

Step 3.7 Flash shipped on May 29, 2026 as a structural upgrade to 3.5 Flash: same OpenAI-compatible SDK, new vision encoder, new runtime escalation, and a compute-control flag you can set per request. The migration from 3.5 is two environment variables. One of them has to be exactly right — or every call returns a silent 401.

Step 3.7 Flash adds three net-new capabilities over 3.5 Flash: a native 1.8B-parameter ViT encoder that injects image representations directly into the language backbone without a separate model call, an automatic Advisor Mode that routes failure-prone subtasks to a larger model at runtime, and a `reasoning_effort` parameter (low / medium / high) as a first-class API flag rather than a prompt-engineering convention. The number that matters most for production is variance: 3.5 Flash scores ranged from 43% to 73% across different harnesses; 3.7 narrows that to 64.5–71.5%, which matters more for production scheduling than the raw score improvement.
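To make the variance claim concrete, the spread can be computed directly from the two ranges quoted above. This is plain arithmetic on the article's figures, not a new measurement; the dictionary keys are just labels chosen for this sketch.

```python
# Harness score ranges quoted above, in percent.
ranges = {
    "step-3.5-flash": (43.0, 73.0),
    "step-3.7-flash": (64.5, 71.5),
}


def spread(low: float, high: float) -> float:
    """Width of a score range in percentage points."""
    return high - low


spreads = {name: spread(*bounds) for name, bounds in ranges.items()}
for name, width in spreads.items():
    print(f"{name}: {width:.1f} points between worst and best harness")
```

The gap between the worst and best harness shrinks from 30 points to 7, which is the figure a capacity planner would budget around.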

Quick Answer: Step 3.7 Flash is an OpenAI-SDK-compatible model with model string `step-3.7-flash` and base URL `https://api.stepfun.ai/v1` (global) or `https://api.stepfun.com/v1` (China region). New over 3.5: native vision input, automatic Advisor Mode escalation, and a `reasoning_effort` flag. The only breaking change from 3.5: base URL must match your account region exactly, or you get a 401 with no error body.

The architecture is a 198B sparse MoE model with roughly 11B parameters active per forward pass, which gives dense-10B compute cost at much larger capacity. SWE-Bench Pro improved to 56.3% from 51.3%; Terminal-Bench 2.1 improved to 59.5% from 53.4%, suggesting the planning and shell-operation gains that matter for coding agents are consistent across benchmarks.

Advisor Mode carries the headline cost claim from StepFun's internal harness: 97% of Claude Opus 4.6's coding performance at $0.19 vs. $1.76 per task. That's a vendor figure on a first-party SWE-Bench Verified run; treat it as directional until independent replication appears.
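For a sense of scale, the vendor figures above work out as follows. The sketch only restates StepFun's reported numbers as ratios; it does not validate them.

```python
# Vendor-reported figures from StepFun's internal harness (not independently replicated).
step_cost_per_task = 0.19   # USD, Step 3.7 Flash with Advisor Mode
opus_cost_per_task = 1.76   # USD, Claude Opus 4.6
relative_performance = 0.97

cost_ratio = opus_cost_per_task / step_cost_per_task
savings = 1 - step_cost_per_task / opus_cost_per_task

print(f"cost ratio: {cost_ratio:.2f}x")
print(f"claimed savings: {savings:.1%} at {relative_performance:.0%} of the performance")
```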

Capability	Step 3.5 Flash	Step 3.7 Flash
Vision input	External model call	Native 1.8B ViT encoder
SWE-Bench Pro	51.3%	56.3%
Benchmark spread	43–73%	64.5–71.5%
`reasoning_effort` flag	Not available	low / medium / high
Advisor Mode	No	Automatic (runtime)
Context window	—	256k tokens

Two environment variables and the correct regional URL are all that's required. The URL is the part that fails silently — verify it before writing any code.

Account region and base URL. StepFun runs two separate API domains that share no authentication state:

STEP_BASE_URL=https://api.stepfun.ai/v1    # global account

STEP_BASE_URL=https://api.stepfun.com/v1   # China-region account

Export your API key and the base URL for your region before running any code:

export STEP_API_KEY="sk-..."
export STEP_BASE_URL="https://api.stepfun.ai/v1"   # global account
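Because a wrong regional URL fails with an empty 401, it can help to check the variable before any request goes out. The following is a minimal sketch, assuming you know which region your account belongs to. `check_base_url` and `REGION_BASE_URLS` are hypothetical helper names; only the two URLs come from the text above.

```python
# The two regional endpoints named in this article.
REGION_BASE_URLS = {
    "global": "https://api.stepfun.ai/v1",
    "china": "https://api.stepfun.com/v1",
}


def check_base_url(base_url: str, region: str) -> str:
    """Return the normalized URL, or raise if it does not match the region."""
    expected = REGION_BASE_URLS[region]
    # Tolerate stray whitespace and a trailing slash, nothing else.
    normalized = base_url.strip().rstrip("/")
    if normalized != expected:
        raise ValueError(
            f"STEP_BASE_URL is {base_url!r} but a {region} account needs {expected!r}"
        )
    return normalized


# Example: a global account with a trailing slash in the variable.
print(check_base_url("https://api.stepfun.ai/v1/", "global"))
```

In a real script you would pass `os.environ["STEP_BASE_URL"]` as the first argument and fail fast at startup rather than on the first API call.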

OpenRouter alternative. If you want to skip a StepFun account or consolidate all model routing behind a single proxy, OpenRouter lists Step 3.7 Flash under model ID `stepfun/step-3.7-flash`. Set base URL to `https://openrouter.ai/api/v1` and use your existing OpenRouter key. No StepFun registration required.
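If you expect to switch between the direct endpoint and OpenRouter, one option is to keep the differences in a single lookup. This is a sketch, not an official pattern: `resolve_route` is a hypothetical helper, and `OPENROUTER_API_KEY` is an assumed environment variable name. The model IDs and the OpenRouter base URL are the ones given above.

```python
# Model IDs and the OpenRouter base URL come from the article;
# OPENROUTER_API_KEY is an assumed variable name for this sketch.
STEP_MODEL = "step-3.7-flash"
OPENROUTER_MODEL = "stepfun/step-3.7-flash"
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_route(provider: str, env: dict) -> tuple[dict, str]:
    """Return (OpenAI client kwargs, model ID) for the chosen provider."""
    if provider == "stepfun":
        kwargs = {"api_key": env["STEP_API_KEY"], "base_url": env["STEP_BASE_URL"]}
        return kwargs, STEP_MODEL
    if provider == "openrouter":
        kwargs = {"api_key": env["OPENROUTER_API_KEY"], "base_url": OPENROUTER_BASE_URL}
        return kwargs, OPENROUTER_MODEL
    raise ValueError(f"unknown provider: {provider!r}")


kwargs, model = resolve_route("openrouter", {"OPENROUTER_API_KEY": "sk-or-..."})
print(model, kwargs["base_url"])
```

With the real client this becomes `OpenAI(**kwargs)` followed by `create(model=model, ...)`, with `os.environ` passed as `env`.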

NVIDIA NIM. For enterprise GPU inference, NVIDIA's NIM containerized endpoint runs Step 3.7 Flash on Hopper-class GPUs at up to 600 tokens/second, exposes the same OpenAI-compatible interface at `http://0.0.0.0:8000/v1`, and supports NeMo-based fine-tuning. Requires an NVIDIA enterprise license.

Python dependency: `pip install openai`. No StepFun-specific SDK or plugin needed.

All four steps below use the standard `openai` Python client without modification. The only constructor differences from a standard OpenAI call are `api_key` and `base_url`.

Step 1: Basic call. The SDK call structure is identical for any OpenAI-compatible Flash endpoint. The snippet below is illustrative (not executed in this context) and shows the structural pattern against a Gemini endpoint; the same shape applies to Step 3.7 Flash once you substitute your StepFun credentials:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)

print(response.choices[0].message.content)

For Step 3.7 Flash, substitute your StepFun credentials in the constructor and set the model string to `step-3.7-flash`:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ["STEP_BASE_URL"],   # https://api.stepfun.ai/v1
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{"role": "user", "content": "Explain the actor model of concurrency."}],
)
print(completion.choices[0].message.content)

Step 2: `reasoning_effort`. Pass the parameter directly to `create()`. Use `high` for complex code review or multi-step planning; use `low` for extraction, summarization, or rewriting where latency matters more than depth; omit it entirely to default to `medium` for general-purpose tasks. If you later switch base models, test the parameter explicitly, because it may be accepted without error but silently ignored on models that don't support it:

completion = client.chat.completions.create(
    model="step-3.7-flash",
    reasoning_effort="high",   # low | medium | high
    messages=[{"role": "user", "content": "Review this code for race conditions: ..."}],
)
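The guidance above can be captured in a small lookup so the tier is chosen in one place. This is a sketch under stated assumptions: the task-type labels and the `request_kwargs` helper are hypothetical, and the mapping simply mirrors the recommendations in this step.

```python
# Task labels are illustrative; the tiers follow the guidance in this step.
EFFORT_BY_TASK = {
    "code_review": "high",
    "planning": "high",
    "extraction": "low",
    "summarization": "low",
    "rewriting": "low",
}


def request_kwargs(task_type: str, prompt: str) -> dict:
    """Build create() kwargs, omitting reasoning_effort for general tasks."""
    kwargs = {
        "model": "step-3.7-flash",
        "messages": [{"role": "user", "content": prompt}],
    }
    effort = EFFORT_BY_TASK.get(task_type)
    if effort is not None:
        # Only set the flag when the task has an explicit tier;
        # leaving it out falls back to the default (medium, per the text above).
        kwargs["reasoning_effort"] = effort
    return kwargs


print(request_kwargs("extraction", "Pull the dates out of this email: ..."))
```

The result can be passed straight through as `client.chat.completions.create(**request_kwargs(...))`.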

Step 3: Image input. Replace the string content with a content array. Add a `text` dict and an `image_url` dict, the same shape as GPT-4o vision calls. The native 1.8B ViT encoder handles the image directly in the language backbone without routing to an external vision model:

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List the interactive elements in this UI screenshot."},
            {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}}
        ]
    }]
)
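If you build these messages in more than one place, small helpers keep the content-array shape consistent. This is a sketch with hypothetical helper names. The base64 data-URL variant follows the OpenAI content-part convention, and whether StepFun's endpoint accepts data URLs is an assumption to verify against the API reference.

```python
import base64


def image_part_from_url(url: str) -> dict:
    """Image content part that points at a hosted image."""
    return {"type": "image_url", "image_url": {"url": url}}


def image_part_from_bytes(data: bytes, mime: str = "image/png") -> dict:
    """Image content part with the bytes inlined as a data URL (assumed to be accepted)."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def vision_message(prompt: str, image_part: dict) -> dict:
    """User message with one text part followed by one image part."""
    return {"role": "user", "content": [{"type": "text", "text": prompt}, image_part]}


msg = vision_message(
    "List the interactive elements in this UI screenshot.",
    image_part_from_url("https://example.com/screenshot.png"),
)
print(msg["content"][1]["image_url"]["url"])
```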

Step 4: Advisor Mode. No parameter required. When the model detects high failure probability on a subtask (repeated errors, complex architectural reasoning), it automatically routes that subtask to a larger model at runtime without any caller intervention. To confirm escalation occurred in a given turn, inspect the response's `usage` or `metadata` fields; unexpectedly high per-step token counts relative to your base-model baseline are a reliable indicator. There is no flag in the current public API to force or suppress escalation.
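The token-count check described above can be approximated with a simple threshold against your own baseline. This is a heuristic sketch, not a documented detection method: `likely_escalated` is a hypothetical helper, the factor of 3 is an arbitrary starting point, and it assumes escalated work shows up in the token totals you record.

```python
from statistics import median


def likely_escalated(step_tokens: int, baseline_tokens: list[int], factor: float = 3.0) -> bool:
    """Flag a step whose token count far exceeds the median of earlier steps."""
    if not baseline_tokens:
        # No baseline yet, so there is nothing to compare against.
        return False
    return step_tokens > factor * median(baseline_tokens)


baseline = [1000, 1200, 1100]   # token totals from ordinary steps (example values)
print(likely_escalated(9000, baseline))
print(likely_escalated(1500, baseline))
```

In practice `step_tokens` would come from `completion.usage.total_tokens` on each response, logged per step so the baseline reflects your own workload.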

Most Step 3.7 Flash failures trace to one of four predictable sources. None produce descriptive error bodies, so you have to know what to check.

`STEP_BASE_URL` must match your account's registration domain exactly: global keys work only against `api.stepfun.ai/v1`; China-region keys work only against `api.stepfun.com/v1`. The 401 response body is empty, with no hint about the actual cause. Check the env var before investigating anything else.

`reasoning_effort` silently ignored. Use the model string `step-3.7-flash` exactly; no version aliases are currently documented in the official API reference. Verify the string before debugging effort-parameter behavior.

Once the basic call runs, four experiments will give you grounded data on whether Step 3.7 Flash fits your actual workload rather than StepFun's harness.

Measure `reasoning_effort` tradeoffs on your own task distribution. Run the same tasks at `low`, `medium`, and `high`, and record latency, cost, and quality score for each tier. The optimal setting is workload-specific; the vendor benchmarks don't answer this for your data.

`reasoning_effort=high`. The auto-escalation cost delta is invisible on a single call; it becomes meaningful at the loop level.

Endpoint region mismatch. Keys issued from platform.stepfun.ai (global) only authenticate against `api.stepfun.ai/v1`. Keys from the China-region platform only work with `api.stepfun.com/v1`. The 401 response body is empty, so there is no hint in the error itself about the cause. Fix: confirm `STEP_BASE_URL` exactly matches the domain where you registered your account before investigating anything else in your request chain.

Structurally yes. The same `openai` Python client, the same `messages` array shape, and the same `image_url` content format all carry over without modification. Three things differ from a plain OpenAI call: the model string (`step-3.7-flash` instead of a GPT variant), the base URL (your regional StepFun endpoint), and `reasoning_effort` semantics. OpenAI's o-series uses it as a reasoning-chain depth hint, while Step 3.7 Flash uses it as a direct compute-allocation tier that controls inference cost and speed.

Automatic. No API parameter enables, disables, or triggers it. The model identifies subtasks it predicts will fail (recovering from repeated errors, deep architectural planning steps) and routes them to a larger model at runtime without any caller-side configuration. StepFun's own SWE-Bench Verified harness reports this blended approach reaches 97% of Claude Opus 4.6's coding performance at $0.19 vs. $1.76 per task. Independent replication of that figure has not been published as of the time of writing.

The ViT architecture is described as video-capable in StepFun's materials, but video input via the public API should be verified against current platform API documentation before building on it. Static `image_url` objects in the `messages` content array are confirmed working today via the native encoder. Don't assume video parity from the architecture description alone; check the current API reference first.

Input tokens cost $0.20/M tokens (cache miss) and $0.04/M tokens (cache hit); output is $1.15/M tokens as of May 2026. For agentic workflows, the more meaningful unit is per-task cost: StepFun claims $0.19 per task with Advisor Mode active vs. $1.76 for Claude Opus 4.6 alone. Compare to DeepSeek V4 Flash and similar sparse-MoE models at the task level rather than the token level, because actual token consumption per task varies widely with prompt length, context reuse, and workflow structure.
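To turn the per-token prices into a per-call figure you can sum into a per-task total, a small calculator is enough. This is a sketch assuming cached tokens are a subset of the input tokens and are billed at the cache-hit rate; `call_cost` is a hypothetical helper, and the token counts in the example are made up.

```python
# USD per million tokens, as quoted above for May 2026.
PRICE_INPUT_MISS = 0.20
PRICE_INPUT_HIT = 0.04
PRICE_OUTPUT = 1.15


def call_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, splitting input into cache hits and misses."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * PRICE_INPUT_MISS
        + cached_tokens * PRICE_INPUT_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Example: 100k input tokens with 80k served from cache, 5k output tokens.
print(f"${call_cost(100_000, 80_000, 5_000):.5f}")
```

Summing `call_cost` over every step of an agent loop gives a per-task number you can set against the vendor's $0.19 claim for your own workload.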

The migration from Step 3.5 Flash is mechanical: one model string, one env var, and you get vision input, Advisor Mode, and `reasoning_effort` without any other code changes. The SDK, the message shape, and the image format are identical to what you already use.

The only non-drop-in detail is the regional base URL. It produces a silent 401, it has no descriptive error, and it catches most developers on first integration. Set `STEP_BASE_URL` to match the domain where you registered, confirm the model string is exactly `step-3.7-flash`, and the rest of the call works as written. Track independent benchmark results as they emerge at Benchable and monitor API parameter additions via the official GitHub repository as the API stabilizes.

Last updated: 2026-06-01. Reflects Step 3.7 Flash as released May 29, 2026. Benchmark claims are vendor-reported unless otherwise noted; Advisor Mode cost figures are from StepFun's internal harness and have not been independently replicated as of this date.

Source: dev.to (original article)