5 LLM APIs Tested for Latency: Real Data [2026]


Originally published at kunalganglani.com.

597 milliseconds. That's how long Claude Haiku 4.5 takes to deliver its first token on a medium-length prompt. GPT-4.1 Mini? Roughly 2,400ms. Four times slower. That gap is the difference between an app that feels alive and one that feels broken. I tested five LLM APIs for latency — Claude Haiku 4.5, Claude Sonnet 4, GPT-4.1, GPT-4.1 Mini, and Gemini 2.5 Flash — across time-to-first-token (TTFT), output throughput, and end-to-end response time. Here's the real data, updated with mid-2026 context.

Vendor benchmarks are marketing. They're run on optimal infrastructure, with cherry-picked prompts, under conditions no production app will ever see. After shipping production AI features that serve real users, I've learned the hard way that the only numbers worth trusting are the ones you measure yourself. So I did.

The benchmark I ran in March 2026 was deliberately simple. Five models, three prompt sizes (short ~50 tokens, medium ~200 tokens, long ~500 tokens), three runs each, all using streaming APIs to capture accurate TTFT. I disabled Gemini's thinking mode for a fair comparison. Every test ran from the same Toronto-based server, same network, same time window.
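A minimal sketch of how TTFT can be captured from a streaming response, not the harness actually used for these numbers. It times any iterable of text chunks, so a provider SDK's streaming call can be wrapped in `start_stream`. The names `time_stream`, `StreamTiming`, and `fake_stream` are hypothetical, and `fake_stream` only simulates a model with sleeps.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float   # seconds from request start to first non-empty chunk
    total_s: float  # seconds from request start to end of stream
    chunks: int     # number of non-empty chunks received


def time_stream(start_stream: Callable[[], Iterable[str]],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Time one streaming call; start_stream issues the request and returns its chunks."""
    t0 = clock()
    ttft = None
    chunks = 0
    for chunk in start_stream():
        if not chunk:
            continue  # ignore empty chunks so they do not count as the first token
        if ttft is None:
            ttft = clock() - t0
        chunks += 1
    total = clock() - t0
    if ttft is None:
        raise ValueError("stream produced no content")
    return StreamTiming(ttft, total, chunks)


def fake_stream(first_delay_s: float = 0.05, gap_s: float = 0.01, n: int = 5) -> Iterator[str]:
    """Hypothetical stand-in for a provider SDK's streaming iterator."""
    time.sleep(first_delay_s)
    for i in range(n):
        yield f"tok{i} "
        time.sleep(gap_s)


if __name__ == "__main__":
    # Three runs per configuration, mirroring the benchmark design described above.
    for run in (time_stream(fake_stream) for _ in range(3)):
        print(f"TTFT {run.ttft_s * 1000:.0f} ms, "
              f"total {run.total_s * 1000:.0f} ms, {run.chunks} chunks")
```

Starting the clock before the request is issued matters: if the timer starts only once the iterator exists, connection setup is excluded and TTFT looks better than what a user experiences.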

Claude Haiku 4.5 delivered its first token in 597 milliseconds on a medium-length prompt. GPT-4.1 Mini took roughly 4x longer on the identical test. That's not a rounding error. That's the difference between a user who stays engaged and one who's already reaching for the back button.

For developers building AI agents or chatbot interfaces, TTFT matters more than raw throughput. Full stop. Users judge responsiveness by that first chunk of text appearing on screen. A model that starts streaming in under 600ms feels like a conversation. One that takes 2.4 seconds feels like you're waiting for a web page to load in 2005. I've shipped features where cutting TTFT by 300ms measurably improved user retention. The psychology is well-established: Jakob Nielsen's foundational research on response times showed that anything under 1 second keeps a user's flow of thought uninterrupted. Over 1 second, they start noticing the delay. Over 10 seconds, they leave.

Every model I tested, side by side. I'm combining my first-party benchmark data with independent numbers from Artificial Analysis.

| Model | TTFT (medium prompt) | Output Speed (tokens/sec) | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Verbosity (output tokens on AA Intelligence Index) | AA Intelligence Score |
|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | ~597ms | n/a | $0.80 | $4.00 | Moderate | n/a |
| Claude Sonnet 4 | ~900ms | ~53 tok/s | $3.00 | $15.00 | Moderate | n/a |
| GPT-4.1 | ~1,100ms | 125.3 tok/s | $2.00 | $8.00 | Concise (4.5M) | 26 |
| GPT-4.1 Mini | ~2,400ms | 94.5 tok/s | $0.40 | $1.60 | Very concise (4.6M) | 23 |
| Gemini 2.5 Flash | ~450ms | 204.5 tok/s | $0.30 | $2.50 | Very verbose (17M) | 21 |

Some things stand out.

Gemini 2.5 Flash is the raw speed champion. 204.5 tokens per second, ranked #4 out of 85 non-reasoning models on Artificial Analysis. Its TTFT is also the lowest I tested. If your priority is getting text on screen fast, nothing here touches it.

GPT-4.1 Mini is the biggest letdown. OpenAI markets it as their "lightweight" variant, but it's rated "slower than average" at 94.5 tokens/sec (#32/85). Not the cheapest. Not the fastest. Not great at anything in particular. OpenAI has already positioned GPT-5 Mini as the successor, which tells you everything.

GPT-4.1's throughput is legitimately strong at 125 tokens/sec (#13/85), putting it in the top quartile. But you're paying $2.00/$8.00 per million tokens for the privilege. That's roughly 6.7x more than Gemini 2.5 Flash on input and 3.2x on output.

I get this question constantly, and the honest answer is: it depends on what "fastest" means to you.

If you mean **fastest first token** — the thing that matters for chat UIs, [AI agent](https://dev.to/blog/types-of-ai-agents-developers-guide) responses, and anything user-facing with streaming: Gemini 2.5 Flash and Claude Haiku 4.5 are in a completely different class. Both deliver sub-600ms TTFT consistently. Claude Sonnet 4 is acceptable at ~900ms. GPT-4.1 and GPT-4.1 Mini are noticeably behind.

If you mean **fastest total throughput** — what matters for batch processing, code generation, long-form content: Gemini 2.5 Flash wins by a mile at 204.5 tok/s. GPT-4.1 is a solid second at 125 tok/s.

But here's the thing nobody's saying about Gemini Flash: it's extremely verbose. It generated 17 million output tokens during Artificial Analysis's Intelligence Index evaluation. The average across models? 9.2 million. That's nearly 2x the output length for the same tasks. So while it's blazing fast per-token, it generates far more tokens to get the job done. Your end-to-end latency inflates. Your costs inflate.

GPT-4.1 Mini generated only 4.6 million tokens on the same evaluation, making it one of the two most concise models in the group (GPT-4.1 generated 4.5 million). Its slower per-token speed is partially offset by the fact that it just produces fewer tokens. In real-world short-response scenarios (API calls returning structured JSON, classification tasks, routing decisions), GPT-4.1 Mini's total response time can actually be competitive despite the lower throughput number.

I've seen this pattern over and over in production AI systems: the model with the best benchmark throughput isn't always the model that finishes the job fastest. Verbosity is a hidden variable most benchmarks ignore.
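To make the verbosity effect concrete, here is a rough model of end-to-end time as TTFT plus output tokens divided by throughput. The TTFT and throughput figures come from the comparison table. The answer lengths are hypothetical, and applying the evaluation-wide 17M vs 4.6M ratio to a single task is an assumption, not something I measured.

```python
def end_to_end_s(ttft_s: float, output_tokens: float, tokens_per_s: float) -> float:
    """First-token wait plus the time to stream the full output."""
    return ttft_s + output_tokens / tokens_per_s


# TTFT and throughput taken from the comparison table.
GEMINI_FLASH = {"ttft_s": 0.450, "tokens_per_s": 204.5}
GPT_41_MINI = {"ttft_s": 2.400, "tokens_per_s": 94.5}

# Assumption: the 17M vs 4.6M evaluation totals carry over to an individual task.
VERBOSITY_RATIO = 17 / 4.6

if __name__ == "__main__":
    for mini_tokens in (50, 200, 500):  # hypothetical GPT-4.1 Mini answer lengths
        gemini_tokens = mini_tokens * VERBOSITY_RATIO
        mini = end_to_end_s(output_tokens=mini_tokens, **GPT_41_MINI)
        gemini = end_to_end_s(output_tokens=gemini_tokens, **GEMINI_FLASH)
        print(f"{mini_tokens:>4} vs {gemini_tokens:>5.0f} tokens: "
              f"GPT-4.1 Mini {mini:.2f}s, Gemini 2.5 Flash {gemini:.2f}s")
```

Which effect wins depends on the answer length and on whether the evaluation-wide verbosity ratio holds for your prompts, which is why measuring real output lengths matters.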

Most LLM latency comparisons pretend cost doesn't exist. That's wrong. In production, you don't optimize for speed in isolation. You optimize for speed at a cost you can sustain.

Say you're processing 10 million input tokens and 2 million output tokens per day. That's a moderate-traffic AI feature. Nothing crazy.

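Daily API costs at list pricing, computed from the per-token prices in the comparison table:

| Model | Input cost (10M tokens) | Output cost (2M tokens) | Daily total |
|---|---|---|---|
| GPT-4.1 Mini | $4.00 | $3.20 | $7.20 |
| Gemini 2.5 Flash | $3.00 | $5.00 | $8.00 |
| Claude Haiku 4.5 | $8.00 | $8.00 | $16.00 |
| GPT-4.1 | $20.00 | $16.00 | $36.00 |
| Claude Sonnet 4 | $30.00 | $30.00 | $60.00 |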

Gemini 2.5 Flash gives you the best speed and the second-lowest cost. GPT-4.1 Mini is cheapest but slowest. Claude Sonnet 4 costs 7.5x more than Gemini Flash for significantly less speed.

But factor in verbosity. If Gemini 2.5 Flash produces 1.8x more output tokens than GPT-4.1 Mini for the same task, your actual output token costs jump to $9.00 instead of $5.00. The "cheap" model isn't as cheap as the pricing page suggests.
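The arithmetic behind these figures can be sketched as follows. Prices are the list prices from the comparison table and the 10M/2M daily volume is the scenario above. The function name `daily_cost_usd` and the `output_multiplier` parameter are illustrative choices of mine.

```python
# (input $ per 1M tokens, output $ per 1M tokens), from the comparison table.
PRICES_PER_MTOK = {
    "Claude Haiku 4.5": (0.80, 4.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4.1": (2.00, 8.00),
    "GPT-4.1 Mini": (0.40, 1.60),
    "Gemini 2.5 Flash": (0.30, 2.50),
}


def daily_cost_usd(input_mtok: float, output_mtok: float,
                   price_in: float, price_out: float,
                   output_multiplier: float = 1.0) -> float:
    """Daily spend; output_multiplier scales output volume to model verbosity."""
    return input_mtok * price_in + output_mtok * output_multiplier * price_out


if __name__ == "__main__":
    # Scenario from the text: 10M input tokens and 2M output tokens per day.
    by_cost = sorted(PRICES_PER_MTOK.items(),
                     key=lambda item: daily_cost_usd(10, 2, *item[1]))
    for model, (p_in, p_out) in by_cost:
        print(f"{model:<18} ${daily_cost_usd(10, 2, p_in, p_out):.2f}/day")

    # Verbosity adjustment: 1.8x the output tokens for Gemini 2.5 Flash.
    p_in, p_out = PRICES_PER_MTOK["Gemini 2.5 Flash"]
    adjusted = daily_cost_usd(10, 2, p_in, p_out, output_multiplier=1.8)
    print(f"Gemini 2.5 Flash at 1.8x output: ${adjusted:.2f}/day")
```

Only the output side scales with verbosity, so the adjustment hits hardest for models whose output price is high relative to their input price.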

This is why I always tell teams: run your actual prompts through multiple models and measure real output lengths before committing to one. The pricing page is a starting point, not the answer. If you're trying to manage these costs at scale, Netflix's Headroom approach to cutting AI agent costs is the best production case study I've come across.

The model with the best per-token price is not always the cheapest model to run. Verbosity is a hidden cost multiplier that most benchmarks ignore entirely.

The five models I tested in March 2026 aren't the full picture anymore. Several major releases have landed since.

GPT-5.5 is now OpenAI's flagship. According to Artificial Analysis, GPT-5.5 (high) achieves 62 tok/s with a 27.9s TTFT and scores 59 on the Intelligence Index. Smarter, yes. But considerably slower than GPT-4.1 for anything latency-sensitive. OpenAI also launched GPT-5 Mini as the replacement for GPT-4.1 Mini.

Gemini 3.5 Flash from Google hits 167-180 tok/s with a TTFT of 15-18 seconds and scores 55 on intelligence. That's a massive jump over Gemini 2.5 Flash (21 → 55) but at the cost of much higher latency. Google also released Gemini 2.5 Flash (Sep) as a direct successor to the model I tested.

Claude Fable 5 landed in June as the highest-intelligence model on the AA index at 65. But its TTFT is 147 seconds. Total response time: 155 seconds. This is a batch processing model. Do not put it behind a chat UI unless you want support tickets.

Qwen3.7 Max from Alibaba is the dark horse worth paying attention to: 193 tok/s output, only 2.59s TTFT, 57 intelligence score, $1.43 per million tokens blended. That's an exceptional speed-intelligence-cost profile. I haven't seen anything else come close to that combination.

The lowest-latency models as of mid-2026 are North Mini Code and Gemini 2.5 Flash-Lite. Neither existed when I ran the March benchmark.

For developers building with these newer models, understanding function calling patterns and agent orchestration becomes critical. Reasoning-heavy models have fundamentally different latency characteristics than the non-reasoning models I originally tested.

Claude Haiku 4.5 versus Gemini 2.5 Flash is the matchup I see developers agonizing over right now for latency-sensitive production apps. So let's just go through it properly.

TTFT: Gemini 2.5 Flash, by a hair. In my testing, Gemini's first token arrived slightly faster than Claude Haiku's 597ms. But both are sub-600ms. To a user, they feel identical.

Throughput: Gemini 2.5 Flash, by a lot. 204.5 tok/s vs roughly 80-100 tok/s for Haiku. For long-form generation, this gap is enormous.

Conciseness: Claude Haiku 4.5. Haiku produces moderate-length outputs. Gemini Flash is rated "very verbose" — 17M tokens vs the 9.2M average on the AA Intelligence Index. In practice, Haiku often finishes the same task with fewer total tokens. That partially closes the throughput gap on actual end-to-end response time.

Cost: Gemini 2.5 Flash on input ($0.30 vs $0.80), roughly a wash on output once you account for verbosity. Gemini's cache hit pricing at $0.03 per million tokens (90% discount) is particularly aggressive if your use case involves repeated context.

Intelligence: Claude Haiku 4.5. Both are "above average" for their class, but Haiku handles complex instructions and edge cases better in my testing. If your application needs reliable instruction-following — structured output, classification, routing — I'd pick Haiku. If you need raw speed for simpler tasks, go Gemini Flash.

I wrote a deeper comparison in Claude Haiku 4.5 vs GPT-4o Mini. Many of the architectural tradeoffs apply here too.

Here's something I almost never see in latency comparison posts: uptime and reliability. A model that delivers 200ms TTFT 95% of the time but throws 503 errors the other 5% can have worse effective latency than one delivering 600ms TTFT at 99.9% uptime, once you count the time lost to failed requests and retries.
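One way to put a number on "effective latency" is the expected time to a successful first token when failed requests are retried. This is a simplified sketch under assumptions that are mine, not measured: failures are independent, the client retries until success, and every failure costs a fixed amount of time (error detection plus backoff), here called `failure_cost_ms`.

```python
def expected_latency_ms(success_ms: float, success_rate: float,
                        failure_cost_ms: float) -> float:
    """Mean time to first token when retrying until a request succeeds.

    The number of failures before the first success is geometric with mean
    (1 - p) / p, and each failure is assumed to cost failure_cost_ms.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_failures = (1 - success_rate) / success_rate
    return success_ms + expected_failures * failure_cost_ms


if __name__ == "__main__":
    # Hypothetical per-failure costs: a fast 503 versus a slow timeout.
    for failure_cost_ms in (1_000, 10_000):
        flaky = expected_latency_ms(200, 0.95, failure_cost_ms)
        steady = expected_latency_ms(600, 0.999, failure_cost_ms)
        print(f"failure cost {failure_cost_ms} ms: "
              f"flaky {flaky:.0f} ms, steady {steady:.0f} ms")
```

Under these assumptions the comparison hinges on how expensive a failure is: quick errors with fast retries are cheap, while timeouts and backoff are not. The mean also hides that one request in twenty on the flaky model pays the full failure cost.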

I've been running production AI workloads across all three major providers for over a year. Here's my honest take:

For any serious production deployment, you need to be measuring P99 latency, not just median. If your P50 is 500ms but your P99 is 4 seconds, 1 in 100 users is having a terrible experience. I've watched teams obsess over median latency while completely ignoring tail latency. It always catches up to them.

Geographic routing matters too. More than most developers realize. If you're building for users in Asia-Pacific, Gemini's infrastructure tends to deliver lower latency than OpenAI's. For North America and Europe, all three are competitive. I've measured 200-400ms differences just from cloud region selection. That's bigger than the TTFT gap between some of these models.
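The P50 versus P99 point can be sketched with a nearest-rank percentile over recorded latencies. The sample data below is synthetic, built to match the 500ms median and 4 second tail from the example, and `percentile` is a hypothetical helper rather than a library function.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic TTFT samples in ms: 98 fast requests and 2 slow ones.
    latencies_ms = [500] * 98 + [4000] * 2
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    print(f"mean {mean_ms:.0f} ms")
    print(f"P50  {percentile(latencies_ms, 50):.0f} ms")
    print(f"P99  {percentile(latencies_ms, 99):.0f} ms")
```

The mean and median both look healthy here while the P99 is eight times worse, which is the failure mode that median-only dashboards miss.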

After running these benchmarks and building production features on all five models, here's how I'd think about it:

User-facing chat with streaming, cost-sensitive: Gemini 2.5 Flash. Best TTFT, best throughput, near-lowest cost. Accept the verbosity and use `max_tokens` aggressively to rein it in.

User-facing chat with streaming, quality-sensitive: Claude Haiku 4.5. Costs a bit more, but you get more concise outputs and better instruction-following. Sub-600ms TTFT keeps users happy.

Backend processing, structured output, routing: GPT-4.1 Mini. The slow TTFT doesn't matter when nobody's watching a cursor blink. Very concise outputs keep costs low. Decent enough intelligence for classification.

Complex multi-step tasks, code generation: GPT-4.1 or Claude Sonnet 4. Higher intelligence, reasonable throughput. The cost premium pays for itself when accuracy saves you from expensive retries.

Maximum intelligence, latency is irrelevant: Claude Fable 5 or GPT-5.5. Batch processing, complex analysis, research tasks. These are not real-time models.

Best combined value across speed, intelligence, and cost: Qwen3.7 Max. 193 tok/s, 2.59s TTFT, 57 intelligence score, $1.43/M tokens blended. The catch: less ecosystem support than the big three, and you need to evaluate whether Alibaba's API reliability meets your bar.

One thing I want to stress for teams building agentic AI systems: latency compounds. When agents make multiple LLM calls in sequence, every millisecond multiplies. A four-step agent pipeline with 600ms TTFT per step adds 2.4 seconds just in first-token latency. At 2,400ms per step (GPT-4.1 Mini), you're at 9.6 seconds before the agent even starts producing useful output. TTFT should be your primary selection criterion for agent workloads. Not throughput. Not cost. TTFT.
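The compounding can be written out directly. This sketch assumes strictly sequential steps and, by default, counts only first-token wait. `gen_ms_per_step` is a hypothetical knob for the generation time each step also needs before the next one can start.

```python
def pipeline_wait_ms(ttft_ms: float, steps: int, gen_ms_per_step: float = 0.0) -> float:
    """Total wait for a sequential agent pipeline where each step blocks the next."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return steps * (ttft_ms + gen_ms_per_step)


if __name__ == "__main__":
    # Four-step pipeline, TTFT only, using the figures from the text.
    print(pipeline_wait_ms(600, 4))    # sub-600ms class model
    print(pipeline_wait_ms(2400, 4))   # GPT-4.1 Mini
```

In practice each step also has to finish generating before the next can begin, so the TTFT-only figure is a lower bound on what the user waits for.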

For latency-sensitive production apps, pick the model with the best TTFT that meets your quality bar. Everything else (cost, throughput, intelligence) is negotiable. TTFT is what the user feels.

I want to be upfront about the limitations of this data. My benchmark uses three prompt sizes, three runs each, from a single geographic location. That's enough to establish relative ordering and ballpark numbers. It's not enough to make P99 guarantees.

Artificial Analysis runs far more comprehensive tests across multiple providers and API endpoints. Their data broadly confirms what I found: Gemini 2.5 Flash is the speed leader, Claude Haiku is the TTFT leader among Anthropic models, and GPT-4.1 Mini underperforms expectations.

But no benchmark — mine, theirs, or the vendors' — can tell you how these models will perform on your prompts, with your input lengths, at your traffic patterns. I've seen models that benchmark beautifully fall apart under real load, and models that look mediocre on paper perform exceptionally well on specific task types. The variance between synthetic benchmarks and production reality is always larger than you expect.

The best approach I've found: pick 2-3 candidates from the comparison table, run them against your actual production prompts for a week, measure TTFT, throughput, total latency, and output quality. Then decide. If you're evaluating local LLM alternatives to avoid API latency entirely, that's a different conversation — but for most production apps serving real users, the API route still wins on reliability and ease of deployment.

The LLM API world in mid-2026 looks nothing like it did when I ran this benchmark in March. GPT-5.5, Gemini 3.5 Flash, Claude Fable 5, and Qwen3.7 Max have all pushed the performance ceiling higher. But the fundamental tradeoff hasn't moved an inch: speed costs money, intelligence costs latency, and verbosity costs both. The developers who ship great products are the ones who know which of those three they can compromise on. And which one their users will never forgive them for getting wrong.

As of mid-2026, Gemini 2.5 Flash and Claude Haiku 4.5 deliver the lowest time-to-first-token among mainstream LLM APIs, both consistently under 600ms on medium-length prompts. For absolute lowest latency regardless of model size, North Mini Code and Gemini 2.5 Flash-Lite lead the Artificial Analysis leaderboard.

GPT-4.1 Mini is not fast. Despite being OpenAI's lightweight model, it is rated "slower than average" at 94.5 tokens per second, ranking #32 out of 85 non-reasoning models. Its TTFT of approximately 2,400ms on medium prompts is roughly 4x slower than Claude Haiku 4.5. OpenAI has since positioned GPT-5 Mini as its successor.

Gemini 2.5 Flash generates nearly 2x the average output tokens for the same task (17M tokens vs 9.2M average on the AA Intelligence Index). While its per-token speed is exceptional at 204.5 tok/s, the extra tokens inflate both end-to-end response time and output costs. Setting aggressive `max_tokens` limits can mitigate this.

For user-facing streaming applications, aim for under 1 second TTFT. Research on response time psychology shows users maintain their flow of thought below 1 second, start noticing delays between 1-10 seconds, and abandon after 10 seconds. Sub-600ms TTFT is the gold standard for chat interfaces.

Whether self-hosting is cheaper than an API depends on your volume. At low to moderate usage (under a few million tokens per day), API costs are lower than the hardware amortization of running models locally. At high volume or for privacy-sensitive workloads, self-hosted inference on dedicated GPUs can be significantly cheaper per token. The break-even point varies by model size and hardware.

GPT-5.5 and Gemini 3.5 Flash are both significantly smarter than their predecessors but come with much higher latency. GPT-5.5 (high) has a 27.9s TTFT, and Gemini 3.5 Flash has 15-18s TTFT. These are reasoning-optimized models unsuitable for real-time streaming. For latency-sensitive work, the previous generation (GPT-4.1, Gemini 2.5 Flash) or their direct successors remain better choices.
