Gemini 3.5 Flash vs Claude Haiku vs GPT-4o mini: Picking a Small Model

The article compares three small, fast LLMs (Gemini 3.5 Flash, Claude Haiku, and GPT-4o mini) for routine tasks like classification and code routing, arguing that cheap, consistent models are preferable to frontier models for such "boring" work. The author shares informal evaluation results and recommends an adapter pattern that keeps business logic provider-agnostic, which makes A/B testing and migration between models easy. Key caveats include skepticism toward marketing benchmarks and the importance of verifying model IDs and features in official documentation before deployment.

The Gemini 3.5 Flash announcement made the rounds on Hacker News this week, and I've been getting pinged by teammates asking whether to migrate our internal tools off Claude Haiku or GPT-4o mini. After spending a weekend running informal evals on classification and code-routing tasks, I have some takes.

Upfront caveat: I'm hedging on specific Gemini 3.5 Flash feature claims because marketing benchmarks and actual API behavior rarely match. Check the official Gemini docs (https://ai.google.dev/) before you bet your architecture on any number, including the ones I quote.

## Why migrate small models at all?

Most teams I work with use small, fast models for the boring stuff:

- Classification (is this ticket billing, bug, or feature-request?)
- Summarization (TL;DR a long Slack thread)
- Code routing in agent setups
- Lightweight extraction (pull dates and amounts out of an invoice)

You don't need a frontier model for any of these. You need something cheap, fast, and consistent. Frontier models are actually worse for tool routing in my experience: they overthink and burn latency on tasks that should be reflexive.

So when a new small model lands, it's worth 30 minutes of evaluation. Not a migration. Just a check.

## The contenders

Three models I'd consider for a new project today:

- Gemini 3.5 Flash: Google's latest fast model. Reportedly faster and cheaper than the 2.5 generation, with a very long context window. I'll caveat the specifics below.
- Claude Haiku 4.5: Anthropic's small model. I've used Haiku in production for two years; it's my default for anything that needs structured outputs or tool calling.
- GPT-4o mini: OpenAI's small model. Still solid, still has the deepest ecosystem support.

I'm leaving out Llama and Mistral on purpose. Self-hosted open models deserve their own post.

## Side-by-side: calling the APIs

Here's what each looks like in practice. All three have official SDKs, and the shape is similar but not identical.
### Gemini 3.5 Flash

```python
# pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_KEY")

# Verify the exact model ID in the official docs before shipping
model = genai.GenerativeModel("gemini-3.5-flash")
response = model.generate_content(
    "Classify this ticket: 'Cannot log in after password reset.'",
    generation_config={"temperature": 0.2},  # low temp for classification
)
print(response.text)
```

### Claude Haiku 4.5

```python
# pip install anthropic
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from env
msg = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=200,
    messages=[
        {
            "role": "user",
            "content": "Classify this ticket: 'Cannot log in after password reset.'",
        }
    ],
)
print(msg.content[0].text)
```

### GPT-4o mini

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "Classify this ticket: 'Cannot log in after password reset.'",
        }
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```

Three different shapes for the same operation. If you've been around long enough you know the drill: wrap it in your own interface and stop caring about the cosmetic differences.

## Migrating: a thin adapter

I migrated one of our pipelines from GPT-4o mini to Gemini Flash last quarter (the 2.5 generation, not the new one).
The pattern that saved me:

```python
# adapter.py: keeps business logic provider-agnostic
import os


class LLMClient:
    def __init__(self, provider: str):
        self.provider = provider
        if provider == "gemini":
            import google.generativeai as genai

            # Configure the key explicitly; the env var name is this sketch's choice
            genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
            # Verify the exact model ID in the official docs before shipping
            self.model = genai.GenerativeModel("gemini-3.5-flash")
        elif provider == "claude":
            from anthropic import Anthropic

            self.client = Anthropic()  # reads ANTHROPIC_API_KEY from env
        elif provider == "openai":
            from openai import OpenAI

            self.client = OpenAI()  # reads OPENAI_API_KEY from env
        else:
            raise ValueError(f"Unknown provider: {provider}")

    def classify(self, prompt: str) -> str:
        # One method, three implementations; callers stay clean
        if self.provider == "gemini":
            return self.model.generate_content(prompt).text
        if self.provider == "claude":
            msg = self.client.messages.create(
                model="claude-haiku-4-5",
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}],
            )
            return msg.content[0].text
        # openai
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```

Boring? Yes. But this adapter saved me a week when we needed to A/B test the new Gemini model. Flip a config flag, replay the same prompts, compare outputs. Zero business logic changed.

## Tradeoffs I'm willing to commit to

Things I feel confident about from actually running these in production:

- Claude Haiku is the most predictable for structured outputs and tool calling. If your downstream code parses JSON, start here.
- GPT-4o mini has the deepest ecosystem: every framework, every tutorial, every Stack Overflow answer. If your team is junior, the docs alone are worth it.
- Gemini Flash has historically had the largest context window of the three. Useful when you need to dump entire codebases or long docs into a prompt.

Things I'm explicitly hedging on for 3.5 Flash:

- The benchmark numbers in Google's announcement look great. Benchmarks always look great. Run your own evals.
- I haven't tested its tool-calling reliability against Haiku yet. That's next month's project.
- Pricing on the official page is the only number I trust. Check it before architecting around cost assumptions.

## Which one should you pick?

Honest answer: it depends on what you're already running.

- Already on Haiku and it works? Don't migrate. The savings won't pay for the engineering time.
- Building something new with massive context needs? Start with Gemini 3.5 Flash. Long-context is its lane.
- Need bulletproof tool calling for agents? Claude Haiku, every time.
- Just shipping a side project and want zero friction? GPT-4o mini.

The dirty secret of LLM migrations is that model choice matters way less than your prompts and your eval suite. Spend the time on evals first. Then pick whichever model wins your benchmark, not whichever one trended on Hacker News last week.
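
If "spend the time on evals first" sounds abstract, here is a minimal sketch of what a first eval suite could look like. Everything in it is an assumption for illustration: the `EXAMPLES` data, the `normalize` and `run_eval` helpers, and the two stub classifiers are hypothetical stand-ins so the script runs offline, not anything from a provider SDK. It measures exact-match accuracy and average wall-clock latency per call, which is roughly what I look at for classification.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled examples; swap in real tickets from your own data.
EXAMPLES: List[Tuple[str, str]] = [
    ("Cannot log in after password reset.", "bug"),
    ("I was charged twice this month.", "billing"),
    ("Please add dark mode.", "feature-request"),
]


def normalize(label: str) -> str:
    """Trim whitespace and a trailing period, then lowercase, so 'Bug.' matches 'bug'."""
    return label.strip().strip(".").lower()


def run_eval(
    classifiers: Dict[str, Callable[[str], str]],
    examples: List[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Replay the same examples through each classifier; report accuracy and latency."""
    results: Dict[str, Dict[str, float]] = {}
    for name, classify in classifiers.items():
        correct = 0
        start = time.perf_counter()
        for text, expected in examples:
            if normalize(classify(text)) == normalize(expected):
                correct += 1
        elapsed = time.perf_counter() - start
        results[name] = {
            "accuracy": correct / len(examples),
            "avg_latency_s": elapsed / len(examples),
        }
    return results


# Offline stand-ins for real model calls (hypothetical, for illustration only).
def keyword_stub(text: str) -> str:
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "add" in lowered:
        return "feature-request"
    return "bug"


def always_bug_stub(text: str) -> str:
    return "bug"


report = run_eval(
    {"keyword_stub": keyword_stub, "always_bug_stub": always_bug_stub},
    EXAMPLES,
)
for name, stats in report.items():
    print(f"{name}: accuracy={stats['accuracy']:.2f}")
```

To point this at real models, each dictionary entry could be a bound method such as `LLMClient("claude").classify` from the adapter, with a prompt that asks for exactly one label. Three examples are only enough to check the plumbing; a real suite needs enough labeled data that a one-ticket difference doesn't decide the winner.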