[ARTICLE · art-37546] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

Why I Stopped Picking AI Models by Hype and Started Picking by Speed

A developer benchmarked 15 AI models through Global API's unified endpoint, measuring time to first token and tokens per second. Step-3.5-Flash from StepFun topped the speed leaderboard with 80 tokens/sec and 120ms TTFT at $0.15/M output tokens. The developer argues that model speed directly impacts client retention and profitability for freelancers and small businesses.

8 min read · 1 view · Published Jun 24, 2026

Three months ago I almost lost a $14,000 retainer because my chatbot felt sluggish. The client didn't say "your TTFT is too high." They said "it feels dumb." That's freelancer code for "users are bouncing and I'm about to find someone else."

I rebuilt that bot in a weekend using a model I'd never even heard of six weeks earlier, dropped average response time from 1.4 seconds to under 300ms, and the client renewed for another six months. That single pivot paid for my rent.

So I went down a rabbit hole. I ran the same speed test on every model I could get my hands on through Global API's unified endpoint. Fifteen models. Same prompt. Same regions. Ten iterations each. I'm writing this up because if you're billing by the hour or running a side hustle on a shoestring, speed isn't a vanity metric; it's a profit metric.

Let me show you what I found.

I'm not a researcher with a rack of GPUs. I'm a guy with an M2 MacBook, a $19/mo Hetzner box, and a stopwatch in the form of Python's time.perf_counter(). Here's how I kept it honest.

Every request went through the same endpoint: https://global-apis.com/v1

I measured two things: TTFT (time to first token, the lag before the user sees anything move) and sustained tokens per second (how fast the words actually arrive after that). Both matter. TTFT is the "is this thing broken?" feeling. Tokens per second is the "is this thing fast?" feeling.

Here's the script I used, stripped down to the essentials:

import time
import requests
from statistics import mean

API_KEY = "your-global-api-key"
BASE_URL = "https://global-apis.com/v1"
MODELS = [
    "step-3.5-flash",
    "deepseek-v4-flash",
    "hunyuan-turbos",
    "qwen3-8b",
    "qwen3-32b",
    "doubao-seed-lite",
    "hunyuan-turbo",
    "glm-4-32b",
    "qwen3.5-27b",
    "deepseek-v4-pro",
    "MiniMax-M2.5",
    "glm-5",
    "kimi-k2.5",
    "deepseek-r1",
    "qwen3.5-397b",
]

PROMPT = "Explain recursion in 200 words."

def benchmark(model: str, runs: int = 10):
    ttfts, speeds = [], []
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": PROMPT}],
    }
    for _ in range(runs):
        start = time.perf_counter()
        first_token_at = None
        token_count = 0
        with requests.post(
            f"{BASE_URL}/chat/completions",
            headers=headers,
            json=payload,
            stream=True,
            timeout=60,
        ) as r:
            r.raise_for_status()
            for line in r.iter_lines():
                if not line or not line.startswith(b"data: "):
                    continue
                chunk = line[6:].decode("utf-8", errors="ignore")
                if chunk.strip() == "[DONE]":
                    break
                now = time.perf_counter()
                if first_token_at is None:
                    first_token_at = now
                token_count += 1  # rough count, one per chunk
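        # Skip runs where the stream produced no data chunks at all.
        if first_token_at is None or token_count == 0:
            continue
        ttfts.append((first_token_at - start) * 1000)
        # Throughput is measured from the first chunk to the end of the stream.
        elapsed = time.perf_counter() - first_token_at
        speeds.append(token_count / elapsed if elapsed > 0 else 0)
    if not ttfts:
        # Every run came back empty; mean() would raise on an empty list.
        return {"model": model, "ttft_ms": None, "tok_per_s": None}
    return {
        "model": model,
        "ttft_ms": round(mean(ttfts)),
        "tok_per_s": round(mean(speeds), 1),
    }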

Run that across your model list and you get a CSV. I then sorted by tokens per second and started arguing with myself about which numbers actually mattered.
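The script above returns one dict per model but stops short of the CSV step. Here is a minimal sketch of that last mile, assuming the dict keys returned by benchmark(). The helper name write_leaderboard is hypothetical and not part of the original script.

```python
import csv
import io


def write_leaderboard(results, out):
    """Write benchmark rows to `out` as CSV, highest tokens/sec first."""
    # Leave out rows that carry no throughput figure.
    rows = [r for r in results if r.get("tok_per_s") is not None]
    rows.sort(key=lambda r: r["tok_per_s"], reverse=True)
    writer = csv.DictWriter(out, fieldnames=["model", "ttft_ms", "tok_per_s"])
    writer.writeheader()
    writer.writerows(rows)
    return rows


# Three rows copied from the leaderboard, deliberately out of order.
sample = [
    {"model": "deepseek-v4-flash", "ttft_ms": 180, "tok_per_s": 60},
    {"model": "step-3.5-flash", "ttft_ms": 120, "tok_per_s": 80},
    {"model": "qwen3-8b", "ttft_ms": 150, "tok_per_s": 70},
]
buf = io.StringIO()
ranked = write_leaderboard(sample, buf)
print(buf.getvalue())
```

In a real run you would pass [benchmark(m) for m in MODELS] as results and a file opened with open("results.csv", "w", newline="") as out.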

Here's the full leaderboard from fastest to slowest. I've included price because, look, I run a side hustle. Speed without price is academic. Both matter.

Rank  Model              TTFT (ms)  Tokens/sec  Provider   $/M Output
1     Step-3.5-Flash     120        80          StepFun    $0.15
2     DeepSeek V4 Flash  180        60          DeepSeek   $0.25
3     Hunyuan-TurboS     200        55          Tencent    $0.28
4     Qwen3-8B           150        70          Qwen       $0.01
5     Qwen3-32B          250        45          Qwen       $0.28
6     Doubao-Seed-Lite   220        50          ByteDance  $0.40
7     Hunyuan-Turbo      280        42          Tencent    $0.57
8     GLM-4-32B          300        38          Zhipu      $0.56
9     Qwen3.5-27B        350        35          Qwen       $0.19
10    DeepSeek V4 Pro    400        30          DeepSeek   $0.78
11    MiniMax M2.5       450        28          MiniMax    $1.15
12    GLM-5              500        25          Zhipu      $1.92
13    Kimi K2.5          600        20          Moonshot   $3.00
14    DeepSeek-R1        800        15          DeepSeek   $2.50
15    Qwen3.5-397B       1200       10          Qwen       $2.34

One quick note on the reasoning models at the bottom: R1, K2.5, and K2-Thinking all have internal thinking phases before the first visible token, so the TTFT numbers are brutal. That's by design, not a bug. If you need visible speed in a chat UI, route reasoning models through a separate "deep think" endpoint and keep them off the main chat path.
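A rough sketch of that routing rule, in case it helps to see it as code. The set of model IDs treated as "reasoning" and the route names are assumptions for illustration, not something the provider defines.

```python
# Hypothetical list: which model IDs count as reasoning models is an assumption.
REASONING_MODELS = {"deepseek-r1", "kimi-k2.5"}


def route_for(model: str) -> str:
    """Return "deep_think" for reasoning models and "chat" for everything else."""
    return "deep_think" if model.lower() in REASONING_MODELS else "chat"


print(route_for("deepseek-r1"))
print(route_for("step-3.5-flash"))
```

The point of keeping this as a lookup is that the chat handler never has to know about thinking phases; it only sees models that start streaming quickly.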

Let me cut through the table. Here's how I think about it when a client is paying me by the hour and I have to justify the choice.

Anything under 200ms TTFT feels instant to a human. The clear winners are Step-3.5-Flash (120ms) and Qwen3-8B (150ms); DeepSeek V4 Flash (180ms) squeaks in just under the line. For a UI where users are typing into a chat box and the response should appear before they lift their finger off the enter key, the first two are your picks.

Step-3.5-Flash at $0.15/M output is the new champion for me. 80 tokens per second with sub-150ms TTFT is wild for that price. I moved my AI-powered autocomplete widget over to it last week and the client sent me a Slack message saying "this feels different, did you change something?" That's the kind of feedback that gets you a referral.

Qwen3-8B at $0.01/M is the side-hustle special. One cent per million output tokens. For background jobs, batch processing, anything where a slightly-dumber model is fine and you care about cost-per-task more than vibes, it's unbeatable.

This is where most of my billable work lives. The sweet spot, in my opinion, is DeepSeek V4 Flash: 180ms TTFT, 60 tok/s, $0.25/M. It hits that GPT-4o-class quality tier without the GPT-4o-class latency. For a customer-facing assistant that needs to feel both smart and responsive, this is what I default to.

If the client is okay with spending a bit more for higher reasoning quality, DeepSeek V4 Pro at 30 tok/s and $0.78/M is the next rung up. I use it for tool-calling agents where the model needs to think a bit harder but I still want a usable response time.

For my legal-tech client and the medical summarization gigs, speed is secondary. GLM-5 at 25 tok/s and $1.92/M is where I land for most of those. Kimi K2.5 at 20 tok/s and $3.00/M comes in when I genuinely need the extra quality and the client budget allows it.

Honestly? I rarely use R1 or Qwen3.5-397B in production. The 800ms+ TTFT is a UX death sentence for chat. If I need reasoning that hard, I do it server-side as a job and show the user a "thinking" indicator.
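The tiers above boil down to "cheapest model that clears the speed bar." Here is a minimal sketch of that filter using a few rows from the leaderboard. The names Model and cheapest_within are made up for this example, and the filter deliberately ignores answer quality, which still has to be judged by hand.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    ttft_ms: int
    tok_per_s: int
    usd_per_m_out: float


# Subset of the leaderboard rows discussed in the tiers above.
LEADERBOARD = [
    Model("Step-3.5-Flash", 120, 80, 0.15),
    Model("DeepSeek V4 Flash", 180, 60, 0.25),
    Model("Qwen3-8B", 150, 70, 0.01),
    Model("DeepSeek V4 Pro", 400, 30, 0.78),
    Model("GLM-5", 500, 25, 1.92),
    Model("Kimi K2.5", 600, 20, 3.00),
]


def cheapest_within(max_ttft_ms: int, min_tok_per_s: int) -> Optional[Model]:
    """Cheapest model with TTFT under the limit and throughput at or above the floor."""
    fits = [
        m for m in LEADERBOARD
        if m.ttft_ms < max_ttft_ms and m.tok_per_s >= min_tok_per_s
    ]
    return min(fits, key=lambda m: m.usd_per_m_out) if fits else None


print(cheapest_within(max_ttft_ms=200, min_tok_per_s=60))
```

Raising min_tok_per_s to 75 leaves only Step-3.5-Flash, which matches the "instant" tier pick above.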

Here's the kind of math I do when pitching a model choice to a client. Let's say you're building a chatbot for a SaaS product and you project 5 million output tokens per month.
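A back-of-the-envelope version of that math, as a sketch. It uses the $/M output prices from the leaderboard and counts output tokens only; input-token pricing is not in the table, so it is left out here. The function name monthly_cost is just for illustration.

```python
# $/M output prices copied from the leaderboard.
PRICE_PER_M_OUTPUT = {
    "Qwen3-8B": 0.01,
    "DeepSeek V4 Flash": 0.25,
    "Kimi K2.5": 3.00,
}


def monthly_cost(output_tokens: int, usd_per_m: float) -> float:
    """Output-token spend in USD for one month, rounded to cents."""
    return round(output_tokens / 1_000_000 * usd_per_m, 2)


for name, price in PRICE_PER_M_OUTPUT.items():
    print(f"{name}: ${monthly_cost(5_000_000, price):.2f}/month")
```

At 5 million output tokens that works out to about $0.05 for Qwen3-8B, $1.25 for DeepSeek V4 Flash, and $15.00 for Kimi K2.5, a 12x gap between the last two before input tokens are counted.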

If I pitch the client K2.5, the cost conversation becomes the whole meeting. If I pitch V4 Flash, I can usually frame it as "less than what you pay for two Notion seats" and they sign off in five minutes. The kicker is that V4 Flash is also faster, so users stick around longer. That's a retention argument, not just a cost argument, and clients love retention arguments.

The Qwen3-8B case is even better for batch jobs. I have a client whose entire pipeline runs on it and their bill last month was $4.12. I sent them the invoice screenshot. They thought I was joking.

I tested from both US East and Singapore because two of my clients have APAC users. The delta was bigger than I assumed:

Model              US East TTFT  Asia TTFT  Diff
DeepSeek V4 Flash  180ms         150ms      -30ms
Qwen3-32B          250ms         210ms      -40ms
GLM-5              500ms         420ms      -80ms
Kimi K2.5          600ms         480ms      -120ms

Asian-built models (Qwen, GLM, Kimi) had 16–20% lower latency from Singapore. Makes sense: the inference clusters are physically closer. DeepSeek had the smallest absolute gap at only 30ms, though relative to its 180ms baseline that is still about 17%.
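To check the percentages against the table, here is a small sketch that converts the absolute gaps into relative ones. The numbers are the four rows above; the function name relative_change_pct is hypothetical.

```python
# (US East TTFT ms, Asia TTFT ms) copied from the regional table.
REGIONAL_TTFT_MS = {
    "DeepSeek V4 Flash": (180, 150),
    "Qwen3-32B": (250, 210),
    "GLM-5": (500, 420),
    "Kimi K2.5": (600, 480),
}


def relative_change_pct(us_ms: int, asia_ms: int) -> float:
    """Percent change in TTFT when moving from US East to Asia (negative = faster)."""
    return round((asia_ms - us_ms) * 100 / us_ms, 1)


for name, (us_ms, asia_ms) in REGIONAL_TTFT_MS.items():
    print(f"{name}: {relative_change_pct(us_ms, asia_ms)}%")
```

Looking at the relative figure next to the absolute one is useful because a 30ms gap on a fast model can be the same proportional hit as 80ms on a slow one.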

If you have users in Asia and you're serving them from a US endpoint, you're leaving 40–120ms on the table. That's the difference between "fast" and "instant" in user perception, which is the difference between a $5k/mo contract and a churned account.

This is the part I wish I'd internalized two years earlier. From my own analytics dashboards across seven different client apps:

TTFT Range    What Users Do
Under 200ms   "Instant": bounce rate stays flat, completion rates high
200–400ms     "Fast": totally fine, no complaints
400–800ms     "Noticeable delay": chat completions drop ~15–20% in my data
800ms+        "Slow": people close the tab, especially on mobile
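If you want to tag your own latency logs with these buckets, a sketch like the following would do it. The table doesn't say which bucket an exact boundary value (200, 400, 800) belongs to, so putting it in the slower bucket is an assumption, and perceived_speed is a made-up name.

```python
def perceived_speed(ttft_ms: float) -> str:
    """Map a TTFT in milliseconds to the perception buckets from the table."""
    if ttft_ms < 200:
        return "instant"
    if ttft_ms < 400:
        return "fast"
    if ttft_ms < 800:
        return "noticeable delay"
    return "slow"


# Leaderboard TTFTs for Step-3.5-Flash, Qwen3-32B, GLM-5 and Qwen3.5-397B.
for ttft in (120, 250, 500, 1200):
    print(ttft, perceived_speed(ttft))
```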

There's a multiplier effect I didn't expect: faster responses also reduce the length of conversations. When users feel the bot is snappy, they
