[ARTICLE · art-32814] src=byteiota.com ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Gemini 3.5 Flash Is Now GA: Three API Traps to Know

Google has made Gemini 3.5 Flash generally available, offering faster performance and improved agentic and coding benchmarks over its predecessor, but developers migrating from the preview version must navigate three breaking API changes: a silent shift in default thinking effort, removal of sampling parameters, and stricter function response requirements. The model excels in high-throughput agentic tasks but trails in complex reasoning, and Google recommends using the "low" thinking level for coding workflows to optimize speed and cost.

5 min read · published Jun 18, 2026

Gemini 3.5 Flash is now generally available to all developers via the Gemini API and Google AI Studio. The headline writes itself: a Flash model that outperforms last year’s Pro on coding and agentic benchmarks. The numbers mostly hold up. What Google’s blog post buries, though, is that migrating from gemini-3-flash-preview to gemini-3.5-flash introduces three breaking changes, at least one of which will silently degrade your outputs before you notice anything is wrong.

What Actually Changed: The Benchmark Story Is Lopsided on Purpose

Gemini 3.5 Flash doesn’t beat 3.1 Pro across the board. It beats it specifically on the tasks that production AI agents actually run.

On agentic and coding benchmarks, the lead is real: Terminal-Bench 2.1 jumps from 70.3% to 76.2%, MCP Atlas from 78.2% to 83.6%, and Finance Agent v2 from 43.0% to 57.9%. It runs at 289 tokens per second — roughly four times faster than other frontier models. For anything involving tool calls, parallel subagents, or high-throughput document processing, Gemini 3.5 Flash is the clear choice in its price bracket.

Where it trails 3.1 Pro: Humanity’s Last Exam (40.2% vs 44.4%), ARC-AGI-2 (72.1% vs 77.1%), and long-context retrieval at 128k tokens (77.3% vs 84.9%). If your workload is primarily complex reasoning — hard math, multi-step logic chains, research synthesis — don’t migrate yet. The new model sacrifices some of that depth in exchange for speed. That’s a reasonable trade for most production agents, and a bad one for tasks that aren’t agents at all.

The Three API Traps

These are not edge cases. All three will hit most codebases on migration.

Trap 1: The Thinking Default Changed Silently

In gemini-3-flash-preview, the default thinking effort was equivalent to high. In gemini-3.5-flash, it defaults to medium. The new parameter name is thinking_level, a string enum that replaces the old thinking_budget integer. If you copy-paste a migration without setting thinking_level explicitly, you’ll get quieter, less capable outputs with no error message. The model just answers differently.

You cannot pass both thinking_level and thinking_budget in the same request; that returns a 400 error.
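A pre-flight check can make the silent default change visible before a request goes out. The sketch below works on a plain dictionary; the helper name check_thinking_config and the dictionary layout are assumptions for illustration, not part of any SDK.

```python
VALID_LEVELS = {"minimal", "low", "medium", "high"}


def check_thinking_config(thinking_config: dict) -> list[str]:
    """Return the migration problems found in a thinking config dict.

    Hypothetical helper: it only inspects the dict and never calls the API.
    """
    problems = []
    has_level = "thinking_level" in thinking_config
    has_budget = "thinking_budget" in thinking_config

    if has_level and has_budget:
        # Per the article, sending both fields is rejected with a 400 error.
        problems.append("thinking_level and thinking_budget are both set")
    elif has_budget:
        problems.append("thinking_budget must be replaced with thinking_level")
    elif not has_level:
        # Without an explicit level the request falls back to the medium default.
        problems.append("thinking_level is not set; the default is medium")

    level = thinking_config.get("thinking_level")
    if has_level and level not in VALID_LEVELS:
        problems.append(f"unknown thinking_level: {level!r}")
    return problems
```

An empty list means the config sets a valid level explicitly. Running this in a test suite over every request builder is one way to catch a call site that still relies on the default.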

Trap 2: Sampling Parameters Are Gone

temperature, top_p, and top_k are no longer accepted. Remove them from all requests entirely. Any code copied from pre-3.x Gemini examples, which covers most existing codebases, needs a cleanup pass before migration.
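One way to run that cleanup pass is to filter the config dictionary before it reaches the client. This is a minimal sketch: strip_sampling_params is a hypothetical helper, and it assumes the sampling keys sit at the top level of the config.

```python
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def strip_sampling_params(config: dict) -> tuple[dict, list[str]]:
    """Return a copy of config without sampling keys, plus the keys removed."""
    removed = [key for key in SAMPLING_KEYS if key in config]
    cleaned = {
        key: value for key, value in config.items() if key not in SAMPLING_KEYS
    }
    return cleaned, removed


# Example with an old-style config; system_instruction stands in for any
# key that should survive the cleanup.
old_config = {"temperature": 0.2, "top_p": 0.95, "system_instruction": "Be terse."}
cleaned, removed = strip_sampling_params(old_config)
```

Returning the removed keys lets you log which call sites still carried sampling parameters, so the source can be fixed rather than patched at runtime indefinitely.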

Trap 3: Function Responses Now Require id and name

Every FunctionResponse must now include both the id from the corresponding FunctionCall and a name that matches exactly. This one fails loudly, so you’ll see errors immediately, but it’s easy to miss if you have multiple tool-calling paths in your codebase. Check the official migration docs for the full function calling spec.
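Routing every tool-calling path through a single builder is one way to keep id and name from being dropped. The sketch below uses plain dictionaries as stand-ins for FunctionCall and FunctionResponse; the field layout and the helper name build_function_response are assumptions, so check them against the official spec before relying on them.

```python
def build_function_response(function_call: dict, result: dict) -> dict:
    """Build a function response that echoes the id and name of its call.

    Hypothetical helper operating on plain dicts, not SDK objects.
    """
    missing = [key for key in ("id", "name") if not function_call.get(key)]
    if missing:
        # Fail before the request is sent rather than on the API error.
        raise ValueError(f"function call is missing: {', '.join(missing)}")
    return {
        "id": function_call["id"],      # copied verbatim from the call
        "name": function_call["name"],  # must match the call exactly
        "response": result,
    }


# Example with a made-up tool call.
call = {"id": "call-1", "name": "get_weather", "args": {"city": "Oslo"}}
reply = build_function_response(call, {"temp_c": 12})
```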

Which thinking_level Should You Actually Use?

The counter-intuitive answer from Google’s own documentation: for agentic coding workflows, use “low”, not “medium”.

Google specifically retuned the low setting for code and tool-calling workloads. It’s faster, cheaper, and on coding benchmarks comparable to medium. Reaching for “high” as a default is expensive and adds latency without proportional gains for most agent tasks.

| Level   | Best for                                                    |
|---------|-------------------------------------------------------------|
| minimal | High-volume classification, trivial chat queries            |
| low     | Agentic coding loops, tool-calling workflows (recommended)  |
| medium  | Complex coding tasks, general default                       |
| high    | Hard math, complex reasoning, research synthesis            |

One thing worth watching on long-running sessions: internal reasoning tokens are preserved automatically across multi-turn conversations. That improves coherence but inflates costs 30–50% on extended agent loops. Monitor the ThoughtsTokenCount metric; if it exceeds 40% of PromptTokenCount on later turns, restarting the session is cheaper than continuing it.
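The 40% rule of thumb is simple enough to encode as a guard in an agent loop. This is a sketch under stated assumptions: should_restart_session is a hypothetical helper, and reading the two token counts out of the response metadata is left to the caller.

```python
def should_restart_session(
    thoughts_tokens: int, prompt_tokens: int, threshold: float = 0.40
) -> bool:
    """Return True when thought tokens exceed the given share of prompt tokens."""
    if prompt_tokens <= 0:
        # No prompt tokens reported yet, so there is nothing to compare against.
        return False
    return thoughts_tokens / prompt_tokens > threshold
```

The threshold is a parameter so it can be tuned against your own cost data; 0.40 is only the figure quoted above.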

The Pricing Reality

Hacker News noticed the price increase immediately. At $1.50/$9.00 per million input/output tokens, Gemini 3.5 Flash costs three times more than Gemini 3 Flash Preview and six times more than Flash-Lite. That’s worth acknowledging.

The relevant comparison, though, is against what it’s replacing in practice. At $2.00/$12.00 per million tokens, Gemini 3.1 Pro costs about 33% more (put differently, 3.5 Flash is 25% cheaper) for inferior performance on coding tasks. For teams already on Pro, 3.5 Flash is a price cut with better results.

The biggest lever is caching. At $0.15 per million cached input tokens — a 90% discount — agents with large, stable system prompts can recover most of the premium. Google demoed 93 parallel subagents completing 15,000+ requests in 12 hours for under $1,000. That math works with aggressive caching in place.
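To see how much caching moves a single request, the arithmetic can be sketched with the prices quoted above. The function name request_cost and the token counts in the example are hypothetical; the sketch also assumes cached tokens are counted as part of the input and ignores any cache storage fee.

```python
# USD per million tokens, as quoted in this article.
INPUT_PER_M = 1.50
CACHED_INPUT_PER_M = 0.15
OUTPUT_PER_M = 9.00


def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    cached_tokens is the portion of input_tokens served from the cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


# Hypothetical agent turn: 20,000 input tokens, 1,000 output tokens.
uncached_cost = request_cost(20_000, 1_000)
cached_cost = request_cost(20_000, 1_000, cached_tokens=18_000)
```

With these made-up numbers the turn costs about $0.039 without caching and about $0.0147 with 18,000 of the input tokens cached. Output tokens dominate once the prompt is cached, which is why thinking_level and caching are worth tuning together.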

Migration Checklist

  • Update model ID: gemini-3-flash-preview → gemini-3.5-flash
  • Replace thinking_budget with thinking_level (string: “minimal”, “low”, “medium”, “high”)
  • Set thinking_level explicitly; don’t rely on the medium default
  • Remove temperature, top_p, top_k from all requests
  • Add id and name to every FunctionResponse
  • Keep gemini-3-flash-preview running for: Computer Use, image generation, audio generation, and Live API
  • Watch ThoughtsTokenCount growth across multi-turn sessions

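from google import genai

# The client reads the API key from the GEMINI_API_KEY environment variable.
client = genai.Client()
prompt = "Summarize the failing test output."  # placeholder prompt

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=prompt,
    config={
        "thinking_config": {
            "thinking_level": "low"  # set explicitly; low suits agentic coding
        }
    },
)
print(response.text)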

The Verdict

Migrate for agentic and coding workloads. The performance gains are real, the speed advantage is significant, and the pricing is defensible if you’re already on Pro. Hold on gemini-3-flash-preview for Computer Use and Live API workloads; those aren’t supported in 3.5 yet.

The three breaking changes are concrete and fixable in an afternoon. The thinking_level default change is the one most likely to ship silently to production. Set it explicitly on every request, run your evals before flipping the switch, and review Appwrite’s independent benchmark analysis if you need a second opinion beyond Google’s own numbers. The full technical spec is available on the Google DeepMind model card.
