I measure how fast 42 LLMs actually answer. Here's the honest method.


[ARTICLE · art-29009] src=dev.to ↗ pub=2026-06-16T05:06Z topic=large-language-models verified=true sentiment=· neutral

An independent speed tracker, ollamatps.com, benchmarks 42 Ollama Cloud models by measuring time to first token (TTFT) and tokens per second (TPS). The tracker reveals that smaller models can be faster, with a 30B model achieving over 200 TPS while the 'ultra' model lags at under 8 TPS, and TTFT varies from 0.3 to 23 seconds. The project, built by AI QA Architect Anton Gulin, uses a rigorous method with capped max_tokens and retries to ensure reliable comparisons.

Published Jun 16, 2026 · 2 min read

I test software for a living. So when a vendor calls an AI model "fast," I don't trust the word. I measure it.

Most leaderboards rank how smart a model is. Almost none rank how fast it answers. You pick a model because it scored well, ship it, and then your users sit and wait.

Speed is two different numbers. People mix them up constantly.

Time to first token (TTFT). The wait before the first word appears. You feel this every time a chatbot "thinks" before replying.

Tokens per second (TPS). How fast the model writes once it starts. A token is a chunk of a word.

A model can be great at one and terrible at the other. You need both.
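To make the two numbers concrete, here is a minimal sketch of pulling both out of one streamed response. It is an illustration, not the tracker's code: `measure_stream` and its injectable `clock` argument are hypothetical names, and the stream is assumed to be any iterable that yields one token at a time.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed response.

    Hypothetical helper: `tokens` is any iterable that yields tokens as
    they arrive; `clock` is injectable so the logic can be checked
    without a real network call.
    """
    start = clock()
    first_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now  # the moment the first token showed up
        count += 1
    end = clock()

    if first_at is None:
        raise ValueError("stream produced no tokens")

    ttft = first_at - start          # the wait before the first word
    writing_time = end - first_at    # time spent writing after that
    tps = count / writing_time if writing_time > 0 else float("inf")
    return ttft, tps
```

Passing the clock in keeps the timing logic separate from the network call, so the arithmetic can be checked with fake timestamps before it is trusted against a live model.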

I run an independent tracker called ollamatps.com. It benchmarks 42 Ollama Cloud models. Here is the exact method, because a benchmark you cannot inspect is just a claim.

max_tokens is capped at 300. It never changes between runs.

TPS is eval_count / (total_duration - time_to_first_token). The startup wait is removed, so TPS measures pure writing speed.

Same prompt, same cap, same schedule. That is what makes two models comparable.
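The formula is small enough to sketch directly. The version below is an illustration under stated assumptions, not the tracker's source: the function name and the `_s` suffixes are mine, and both durations are assumed to be in seconds already. Ollama's own response fields report durations in nanoseconds, so a conversion helper is included.

```python
NANOS_PER_SECOND = 1_000_000_000


def ns_to_s(nanoseconds: int) -> float:
    """Convert a nanosecond duration to seconds."""
    return nanoseconds / NANOS_PER_SECOND


def tokens_per_second(
    eval_count: int,
    total_duration_s: float,
    time_to_first_token_s: float,
) -> float:
    """eval_count / (total_duration - time_to_first_token), in tokens/s."""
    writing_time = total_duration_s - time_to_first_token_s
    if eval_count <= 0 or writing_time <= 0:
        raise ValueError("need at least one token and a positive writing time")
    return eval_count / writing_time


# Made-up run: 300 tokens, 2.0 s total, 0.5 s before the first token.
print(tokens_per_second(300, 2.0, 0.5))  # 200.0
print(300 / 2.0)                         # 150.0 if the startup wait is left in
```

With the same made-up run, leaving the startup wait in would report 150 tokens per second instead of 200. That gap is why the wait is subtracted before dividing.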

Building this was a testing job, not a coding job. Retries on failure. A reliability score per model. A circuit breaker for models that keep failing. If you cannot trust the measurement, the number is noise. That part is the same work I do on any test system.
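Those three pieces fit together in a few lines. This is a hedged sketch of the general pattern, not the tracker's implementation: `ModelHealth`, `run_with_retries`, the threshold of 5 consecutive failures, and the 3 attempts per run are all assumptions chosen for illustration. Backoff between attempts is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelHealth:
    """Per-model bookkeeping (hypothetical names and thresholds)."""
    failure_threshold: int = 5      # assumed: open the breaker after 5 straight failures
    successes: int = 0
    failures: int = 0
    consecutive_failures: int = 0

    @property
    def is_open(self) -> bool:
        """True when the breaker is open and the model should be skipped."""
        return self.consecutive_failures >= self.failure_threshold

    @property
    def reliability(self) -> Optional[float]:
        """Share of attempts that succeeded, or None before any attempt."""
        total = self.successes + self.failures
        return self.successes / total if total else None

    def record(self, ok: bool) -> None:
        if ok:
            self.successes += 1
            self.consecutive_failures = 0
        else:
            self.failures += 1
            self.consecutive_failures += 1


def run_with_retries(
    probe: Callable[[], float],
    health: ModelHealth,
    max_attempts: int = 3,
) -> Optional[float]:
    """Run one benchmark probe, retrying on failure.

    Returns the probe's result, or None if the breaker is open or every
    attempt failed. The breaker is only checked at the start of a run.
    """
    if health.is_open:
        return None  # model keeps failing: skip it instead of hammering it
    for _ in range(max_attempts):
        try:
            result = probe()
        except Exception:
            health.record(False)
            continue
        health.record(True)
        return result
    return None
```

A model that fails every attempt is tried for two runs (six attempts) and skipped on the third, while its reliability score records exactly how often it failed.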

Bigger is not faster.

The fastest model on the board is one of the smallest: a 30B model at over 200 tokens per second. The model literally named "ultra" is dead last, under 8 tokens per second.

And the wait varies wildly. TTFT ranges from about 0.3 seconds to 23 seconds across the 42 models. Same cloud. Roughly 80x difference in how long you wait for the first word.
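To see what those figures mean for a user, here is a rough back-of-the-envelope sketch. It assumes total wait is TTFT plus tokens divided by TPS, uses the 300-token cap, and rounds the board's "over 200" and "under 8" to 200 and 8. The article does not say which model has which TTFT, so the two pairings at the end are hypothetical best and worst cases, not measured results.

```python
def response_time(ttft_s: float, tps: float, tokens: int = 300) -> float:
    """Estimated seconds until the full answer is on screen."""
    return ttft_s + tokens / tps


# Writing time alone for a 300-token answer.
print(300 / 200)  # 1.5 seconds at 200 tokens per second
print(300 / 8)    # 37.5 seconds at 8 tokens per second

# Spread in the wait for the first word.
print(round(23 / 0.3, 1))  # 76.7, i.e. roughly 80x

# Hypothetical pairings of the extremes (not reported in the article).
best = response_time(ttft_s=0.3, tps=200)
worst = response_time(ttft_s=23, tps=8)
print(f"{best:.1f} s vs {worst:.1f} s")  # 1.8 s vs 60.5 s
```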

If you picked your model on a benchmark score alone, you have no idea which of these you are getting.

The first version tracked fewer models and was less robust. I rebuilt the engine this month (v2) to be multi-provider and to test continuously. The live board updates every 10 minutes.

Watch it run: ollamatps.com

Anton Gulin is the AI QA Architect, the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.


Read original on dev.to → dev.to/aiwithanton/i-measure-how-fast-42-llms-ac…
