Why I Stopped Picking AI Models by Hype and Started Picking by Speed

wpnews.pro

Three months ago I almost lost a $14,000 retainer because my chatbot felt sluggish. The client didn't say "your TTFT is too high." They said "it feels dumb." That's freelancer code for "users are bouncing and I'm about to find someone else."

I rebuilt that bot in a weekend using a model I'd never even heard of six weeks earlier, dropped average response time from 1.4 seconds to under 300ms, and the client renewed for another six months. That single pivot paid for my rent.

So I went down a rabbit hole. I ran the same speed test on every model I could get my hands on through Global API's unified endpoint. Fifteen models. Same prompt. Same regions. Ten iterations each. I'm writing this up because if you're billing by the hour or running a side hustle on a shoestring, speed isn't a vanity metric — it's a profit metric.

Let me show you what I found.

I'm not a researcher with a rack of GPUs. I'm a guy with an M2 MacBook, a $19/mo Hetzner box, and a stopwatch in the form of Python's time.perf_counter(). Here's how I kept it honest.

Every request went through the same unified endpoint: https://global-apis.com/v1

I measured two things: TTFT (time to first token — the lag before the user sees anything move) and sustained tokens per second (how fast the words actually arrive after that). Both matter. TTFT is the "is this thing broken?" feeling. Tokens per second is the "is this thing fast?" feeling.

Here's the script I used, stripped down to the essentials:

import time
import requests
from statistics import mean

API_KEY = "your-global-api-key"
BASE_URL = "https://global-apis.com/v1"
MODELS = [
    "step-3.5-flash",
    "deepseek-v4-flash",
    "hunyuan-turbos",
    "qwen3-8b",
    "qwen3-32b",
    "doubao-seed-lite",
    "hunyuan-turbo",
    "glm-4-32b",
    "qwen3.5-27b",
    "deepseek-v4-pro",
    "MiniMax-M2.5",
    "glm-5",
    "kimi-k2.5",
    "deepseek-r1",
    "qwen3.5-397b",
]

PROMPT = "Explain recursion in 200 words."

def benchmark(model: str, runs: int = 10):
    ttfts, speeds = [], []
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": PROMPT}],
    }
    for _ in range(runs):
        start = time.perf_counter()
        first_token_at = None
        token_count = 0
        with requests.post(
            f"{BASE_URL}/chat/completions",
            headers=headers,
            json=payload,
            stream=True,
            timeout=60,
        ) as r:
            r.raise_for_status()
            for line in r.iter_lines():
                if not line or not line.startswith(b"data: "):
                    continue
                chunk = line[6:].decode("utf-8", errors="ignore")
                if chunk.strip() == "[DONE]":
                    break
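                now = time.perf_counter()
                if first_token_at is None:
                    # First streamed chunk. Some providers send an empty
                    # role-only chunk first, so treat this as "time to first chunk".
                    first_token_at = now
                token_count += 1  # rough count, one per chunk
        if first_token_at is None or token_count == 0:
            continue  # nothing streamed back; skip this run
        ttfts.append((first_token_at - start) * 1000)  # seconds -> ms
        elapsed = time.perf_counter() - first_token_at
        speeds.append(token_count / elapsed if elapsed > 0 else 0)
    if not ttfts:
        # Every run came back empty; mean() would raise on an empty list.
        return {"model": model, "ttft_ms": None, "tok_per_s": None}
    return {
        "model": model,
        "ttft_ms": round(mean(ttfts)),
        "tok_per_s": round(mean(speeds), 1),
    }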

Run that across your model list and you get a CSV. I then sorted by tokens per second and started arguing with myself about which numbers actually mattered.
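The script above returns one dict per model rather than writing a file, so getting to a sorted CSV takes one more step. Here's a minimal sketch of that step, assuming each result has the same three keys benchmark() returns. The write_leaderboard name and the output path are placeholders of mine, not part of the original script:

```python
import csv

def write_leaderboard(results, path="leaderboard.csv"):
    """Sort benchmark dicts by tokens/sec (fastest first) and write them to a CSV."""
    # Skip rows that have no throughput value to sort on.
    usable = [r for r in results if r.get("tok_per_s") is not None]
    ranked = sorted(usable, key=lambda r: r["tok_per_s"], reverse=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["model", "ttft_ms", "tok_per_s"],
            extrasaction="ignore",  # tolerate extra keys in a result dict
        )
        writer.writeheader()
        writer.writerows(ranked)
    return ranked

# Usage with the benchmark() function above (this makes real API calls):
# write_leaderboard([benchmark(m) for m in MODELS])
```

Sorting on tok_per_s alone is one choice; swapping the key to ttft_ms (ascending) gives the other view of the same data.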

Here's the full leaderboard, roughly from fastest to slowest. I've included price because, look, I run a side hustle. Speed without price is academic. Both matter.

Rank	Model	TTFT (ms)	Tokens/sec	Provider	$/M Output
1	Step-3.5-Flash	120	80	StepFun	$0.15
2	DeepSeek V4 Flash	180	60	DeepSeek	$0.25
3	Hunyuan-TurboS	200	55	Tencent	$0.28
4	Qwen3-8B	150	70	Qwen	$0.01
5	Qwen3-32B	250	45	Qwen	$0.28
6	Doubao-Seed-Lite	220	50	ByteDance	$0.40
7	Hunyuan-Turbo	280	42	Tencent	$0.57
8	GLM-4-32B	300	38	Zhipu	$0.56
9	Qwen3.5-27B	350	35	Qwen	$0.19
10	DeepSeek V4 Pro	400	30	DeepSeek	$0.78
11	MiniMax M2.5	450	28	MiniMax	$1.15
12	GLM-5	500	25	Zhipu	$1.92
13	Kimi K2.5	600	20	Moonshot	$3.00
14	DeepSeek-R1	800	15	DeepSeek	$2.50
15	Qwen3.5-397B	1200	10	Qwen	$2.34

One quick note on the reasoning models at the bottom: R1 and K2.5 (and others of that type, like K2-Thinking, which isn't in this table) run an internal thinking phase before the first visible token, so the TTFT numbers are brutal. That's by design, not a bug. If you need visible speed in a chat UI, route reasoning models through a separate "deep think" endpoint and keep them off the main chat path.

Let me cut through the table. Here's how I think about it when a client is paying me by the hour and I have to justify the choice.

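Anything under 200ms TTFT feels instant to a human. The two I reach for there are Step-3.5-Flash (120ms) and Qwen3-8B (150ms). DeepSeek V4 Flash (180ms) also clears that bar, but I treat it as a mid-tier pick and cover it further down. For a UI where users are typing into a chat box and the response should appear before they lift their finger off the enter key, these are your two.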

Step-3.5-Flash at $0.15/M output is the new champion for me. 80 tokens per second with sub-150ms TTFT is wild for that price. I moved my AI-powered autocomplete widget over to it last week and the client sent me a Slack message saying "this feels different, did you change something?" That's the kind of feedback that gets you a referral.

Qwen3-8B at $0.01/M is the side-hustle special. One cent per million output tokens. For background jobs, batch processing, anything where a slightly-dumber model is fine and you care about cost-per-task more than vibes, it's unbeatable.

This is where most of my billable work lives. The sweet spot, in my opinion, is DeepSeek V4 Flash — 180ms TTFT, 60 tok/s, $0.25/M. It hits that GPT-4o-class quality tier without the GPT-4o-class latency. For a customer-facing assistant that needs to feel both smart and responsive, this is what I default to.

If the client is okay with spending a bit more for higher reasoning quality, DeepSeek V4 Pro at 30 tok/s and $0.78/M is the next rung up. I use it for tool-calling agents where the model needs to think a bit harder but I still want a usable response time.

For my legal-tech client and the medical summarization gigs, speed is secondary. GLM-5 at 25 tok/s and $1.92/M is where I land for most of those. Kimi K2.5 at 20 tok/s and $3.00/M comes in when I genuinely need the extra quality and the client budget allows it.

Honestly? I rarely use R1 or Qwen3.5-397B in production. The 800ms+ TTFT is a UX death sentence for chat. If I need reasoning that hard, I do it server-side as a job and show the user a "thinking" indicator.
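The tiering above boils down to a rule: set a TTFT budget for the surface you're building, then take the cheapest model that fits. Here's a rough sketch using a few rows from the leaderboard table. pick_model and its parameters are illustrative names I made up for this, and the allowed set is where the quality judgment lives, since speed and price say nothing about how smart a model is:

```python
# (model, ttft_ms, tok_per_s, usd_per_m_output) copied from the leaderboard table
LEADERBOARD = [
    ("Step-3.5-Flash", 120, 80, 0.15),
    ("DeepSeek V4 Flash", 180, 60, 0.25),
    ("Qwen3-8B", 150, 70, 0.01),
    ("DeepSeek V4 Pro", 400, 30, 0.78),
    ("GLM-5", 500, 25, 1.92),
    ("Kimi K2.5", 600, 20, 3.00),
    ("DeepSeek-R1", 800, 15, 2.50),
]

def pick_model(max_ttft_ms, min_tok_per_s=0, allowed=None):
    """Return the cheapest model name within the latency budget, or None."""
    candidates = [
        row for row in LEADERBOARD
        if row[1] <= max_ttft_ms
        and row[2] >= min_tok_per_s
        and (allowed is None or row[0] in allowed)
    ]
    if not candidates:
        return None
    # Cheapest output price wins among the models that fit the budget.
    return min(candidates, key=lambda row: row[3])[0]

# A chat UI with a 200ms budget, restricted to models judged smart enough:
chat_model = pick_model(200, allowed={"Step-3.5-Flash", "DeepSeek V4 Flash"})
# A background job where latency barely matters and cost does:
batch_model = pick_model(1000)
```

With the numbers above, the chat call lands on Step-3.5-Flash and the batch call lands on Qwen3-8B.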

Here's the kind of math I do when pitching a model choice to a client. Let's say you're building a chatbot for a SaaS product and you project 5 million output tokens per month.
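The arithmetic for that 5-million-token scenario can be sketched like this, using the $/M output prices from the leaderboard table. It counts output tokens only, which is an assumption on my part; input tokens would add to every line:

```python
MONTHLY_OUTPUT_TOKENS = 5_000_000  # projection from the scenario above

# $/M output prices from the leaderboard table
PRICE_PER_M_OUTPUT = {
    "Qwen3-8B": 0.01,
    "DeepSeek V4 Flash": 0.25,
    "DeepSeek V4 Pro": 0.78,
    "GLM-5": 1.92,
    "Kimi K2.5": 3.00,
}

def monthly_output_cost(tokens, price_per_million):
    """Output-token cost in dollars; input tokens are not counted here."""
    return tokens / 1_000_000 * price_per_million

for model, price in PRICE_PER_M_OUTPUT.items():
    cost = monthly_output_cost(MONTHLY_OUTPUT_TOKENS, price)
    print(f"{model}: ${cost:.2f}/mo")
```

On output tokens alone, that works out to $1.25 a month for V4 Flash against $15.00 for K2.5, a 12x gap that widens linearly as volume grows.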

If I pitch the client K2.5, the cost conversation becomes the whole meeting. If I pitch V4 Flash, I can usually frame it as "less than what you pay for two Notion seats" and they sign off in five minutes. The kicker is that V4 Flash is also faster, so users stick around longer. That's a retention argument, not just a cost argument — and clients love retention arguments.

The Qwen3-8B case is even better for batch jobs. I have a client whose entire pipeline runs on it and their bill last month was $4.12. I sent them the invoice screenshot. They thought I was joking.

I tested from both US East and Singapore because two of my clients have APAC users. The delta was bigger than I assumed:

Model	US East TTFT	Asia TTFT	Diff
DeepSeek V4 Flash	180ms	150ms	-30ms
Qwen3-32B	250ms	210ms	-40ms
GLM-5	500ms	420ms	-80ms
Kimi K2.5	600ms	480ms	-120ms

Asian-built models (Qwen, GLM, Kimi) had 16–20% lower latency from Singapore. Makes sense: the inference clusters are physically closer. DeepSeek had the smallest absolute gap at 30ms, although relative to its 180ms baseline that is still about 17%.
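The percentages are easy to check against the regional table. A small sketch, with pct_reduction as a helper name of my own:

```python
# (model, us_east_ttft_ms, asia_ttft_ms) from the regional table
REGIONAL_TTFT = [
    ("DeepSeek V4 Flash", 180, 150),
    ("Qwen3-32B", 250, 210),
    ("GLM-5", 500, 420),
    ("Kimi K2.5", 600, 480),
]

def pct_reduction(us_ms, asia_ms):
    """Percent drop in TTFT when measuring from Singapore instead of US East."""
    return (us_ms - asia_ms) / us_ms * 100

for model, us_ms, asia_ms in REGIONAL_TTFT:
    saved = us_ms - asia_ms
    print(f"{model}: {saved}ms lower ({pct_reduction(us_ms, asia_ms):.1f}%)")
```

Looking at both columns matters: a model with a small absolute gap can still show a similar relative gap if its baseline is already low.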

If you have users in Asia and you're serving them from a US endpoint, you're leaving 30–120ms on the table, going by the numbers above. That's the difference between "fast" and "instant" in user perception, which is the difference between a $5k/mo contract and a churned account.

This is the part I wish I'd internalized two years earlier. From my own analytics dashboards across seven different client apps:

TTFT Range	What Users Do
Under 200ms	"Instant" — bounce rate stays flat, completion rates high
200–400ms	"Fast" — totally fine, no complaints
400–800ms	"Noticeable delay" — chat completions drop ~15–20% in my data
800ms+	"Slow" — people close the tab, especially on mobile
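If you log TTFT per request, the bands in that table are simple to apply in code. A minimal sketch; ttft_bucket is a name I picked, and since the table's ranges share their endpoints, I'm assuming a value that sits exactly on a boundary (200, 400, 800) falls into the slower band:

```python
def ttft_bucket(ttft_ms):
    """Map a measured TTFT in milliseconds to the perception bands in the table."""
    if ttft_ms < 200:
        return "instant"
    if ttft_ms < 400:
        return "fast"
    if ttft_ms < 800:
        return "noticeable delay"
    return "slow"

# Example: tag a handful of logged TTFT values (made-up sample numbers)
sample_ttfts_ms = [120, 250, 600, 1200]
labels = [ttft_bucket(ms) for ms in sample_ttfts_ms]
```

Counting requests per band over a week gives a quick read on how many users are sitting in the range where completions start to drop.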

There's a multiplier effect I didn't expect: faster responses also reduce the length of conversations. When users feel the bot is snappy, they

Source: dev.to (original article)
