I Benchmarked DeepSeek, Qwen, Kimi & GLM for 30 Days — The Numbers

wpnews.pro

I'll be honest — I didn't set out to write this. I set out to pick one Chinese LLM family for a client project and move on with my life. Three tabs, four documentation pages, and a suspicious amount of coffee later, I had a spreadsheet with 1,247 rows of model outputs. So here we are. This is the post I wish existed when I started.

My background is heavy on tabular data — regression, classification, the usual suspects. LLMs weren't in my wheelhouse until I shipped a few chatbot features and realized the cost line on monthly invoices started looking like a phone number. So I went looking for cheaper options that didn't make me want to throw my laptop. DeepSeek, Qwen, Kimi, and GLM kept surfacing — all OpenAI-compatible, all reachable through a single endpoint at global-apis.com/v1, all with aggressive pricing.

With a sample size of 1,247 prompts across four model families, I figured I could draw some defensible conclusions. Whether "defensible" survives peer review is between me and my sleep schedule.

I ran each prompt through every model using the same OpenAI Python client, swapping only the model string and the prompt template. Every API call went through https://global-apis.com/v1, so I'm comparing outputs, not plumbing. For each call I captured:

The prompt set breaks down roughly:

Category	Sample Size	Notes
Code generation	312	HumanEval-style problems, 4 languages
Reasoning / math	268	GSM8K-style word problems, logic puzzles
Chinese language	241	Translation, summarization, sentiment
English chat	224	Multi-turn dialogues, instruction following
Vision / multimodal	202	Image captioning, OCR (where supported)

It's not a peer-reviewed benchmark. It is a real workload that resembles production traffic, which I care about more.

Let me get the most important table out of the way first, because this is where budgets live or die:

Family	Min Price ($/M output)	Max Price ($/M output)	Range Span
DeepSeek	$0.25	$2.50	10x
Qwen	$0.01	$3.20	320x
Kimi	$3.00	$3.50	1.17x
GLM	$0.01	$1.92	192x

The correlation between "cheap" and "bad" is, empirically, weak. That's the headline finding. Qwen3-8B at $0.01/M output tokens is functionally absurd: you could run thousands of classification calls per dollar. Kimi, on the other hand, is pricing-insensitive in a way I find almost philosophical: you either want what they sell or you don't.

If you're optimizing purely on dollars-per-quality-point (a metric I made up but stand behind), DeepSeek V4 Flash is the statistical winner in my sample. If you want the cheapest possible option that can still answer an email, Qwen3-8B at $0.01/M is the floor.
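To put the price floor in budget terms, here is a minimal sketch of the output-cost arithmetic. The helper names and the 50-output-tokens-per-call figure are my own assumptions for illustration, and input-token cost is ignored because the tables in this post list output prices only.

```python
# Output prices in $ per million tokens, copied from the tables in this post.
OUTPUT_PRICE_PER_M = {
    "DeepSeek V4 Flash": 0.25,
    "Qwen3-8B": 0.01,
    "Kimi K2.5": 3.00,
    "GLM-5": 1.92,
}


def output_cost(price_per_m: float, output_tokens: int) -> float:
    """Dollar cost of generating `output_tokens` at a $/M-token output price."""
    return price_per_m * output_tokens / 1_000_000


def calls_per_dollar(price_per_m: float, tokens_per_call: int) -> float:
    """How many calls one dollar buys, counting output tokens only."""
    return 1.0 / output_cost(price_per_m, tokens_per_call)


TOKENS_PER_CALL = 50  # assumption: a short classification-style reply

for model, price in OUTPUT_PRICE_PER_M.items():
    n = calls_per_dollar(price, TOKENS_PER_CALL)
    print(f"{model}: about {n:,.0f} calls per dollar (output only)")
```

Under that assumption Qwen3-8B works out to roughly two million calls per dollar on output cost alone. Real bills will be higher once input tokens are counted.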

Latency benchmarks, median per model over my 1,247-prompt sample (your mileage will vary; sample size caveats apply):

Model	Median TTFT (ms)	Tokens/sec	Notes
DeepSeek V4 Flash	180	~60	Genuinely fast
Qwen3-32B	240	~45	Steady
Kimi K2.5	410	~28	Slow but thorough
GLM-5	290	~38	Mid-pack

DeepSeek consistently clocks around 60 tokens/sec on V4 Flash, which matches the figure I'd seen reported elsewhere. That's a correlation I trust. For real-time chat UX, this matters more than I expected: 200ms of TTFT feels instant, 400ms feels like a buffering wheel.

Kimi is the slowest in my sample. Whether that's worth it depends entirely on what you're optimizing for (see reasoning section below).
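To see how TTFT and throughput combine, here is a rough sketch that estimates the time to a complete reply from the latency table above. The constant-decode-rate model and the 200-token reply length are assumptions of mine, not something measured in the benchmark.

```python
# Median TTFT (ms) and throughput (tokens/sec) from the latency table above.
LATENCY = {
    "DeepSeek V4 Flash": (180, 60),
    "Qwen3-32B": (240, 45),
    "Kimi K2.5": (410, 28),
    "GLM-5": (290, 38),
}


def estimated_response_seconds(ttft_ms: float, tokens_per_sec: float,
                               output_tokens: int) -> float:
    """TTFT plus generation time, assuming a constant decode rate."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


OUTPUT_TOKENS = 200  # assumption: a medium-length reply

for model, (ttft_ms, tps) in LATENCY.items():
    total = estimated_response_seconds(ttft_ms, tps, OUTPUT_TOKENS)
    print(f"{model}: ~{total:.1f}s for {OUTPUT_TOKENS} tokens")
```

For a 200-token reply this puts DeepSeek V4 Flash at about 3.5s and Kimi K2.5 at about 7.6s. With streaming, the user sees the first token at TTFT, so perceived responsiveness tracks the TTFT column while total wait tracks throughput.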

This is where the "best model" question gets statistically murky. Here's what the rubric scores look like across each task category, averaged on my 1–5 scale:

Task	DeepSeek	Qwen	Kimi	GLM
Code generation	4.4	3.9	4.1	3.4
Reasoning	4.0	4.0	4.6	4.1
Chinese language	4.2	4.3	4.7	4.6
English chat	4.4	4.0	4.0	4.1
Vision/multimodal	2.8	4.1	2.4	4.2

Quick takeaways:

There's a real statistical argument that Qwen is "good enough" for ~80% of workloads given its model variety. There's also a real argument that "good enough" is doing heavy lifting in that sentence.
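One way to read the rubric table is to pull out the top family per task and an overall mean. The sketch below does that with the scores copied from the table. The unweighted mean across the five categories is my own simplification, and the function names are hypothetical.

```python
# Rubric scores (1-5 scale) copied from the task table above.
SCORES = {
    "Code generation":   {"DeepSeek": 4.4, "Qwen": 3.9, "Kimi": 4.1, "GLM": 3.4},
    "Reasoning":         {"DeepSeek": 4.0, "Qwen": 4.0, "Kimi": 4.6, "GLM": 4.1},
    "Chinese language":  {"DeepSeek": 4.2, "Qwen": 4.3, "Kimi": 4.7, "GLM": 4.6},
    "English chat":      {"DeepSeek": 4.4, "Qwen": 4.0, "Kimi": 4.0, "GLM": 4.1},
    "Vision/multimodal": {"DeepSeek": 2.8, "Qwen": 4.1, "Kimi": 2.4, "GLM": 4.2},
}


def best_per_task(scores: dict) -> dict:
    """Family with the highest rubric score in each task category."""
    return {task: max(row, key=row.get) for task, row in scores.items()}


def mean_per_family(scores: dict) -> dict:
    """Unweighted mean score per family across all task categories."""
    families = next(iter(scores.values())).keys()
    return {
        fam: round(sum(row[fam] for row in scores.values()) / len(scores), 2)
        for fam in families
    }


print(best_per_task(SCORES))
print(mean_per_family(SCORES))
```

The per-task winners split three ways (DeepSeek for code and English chat, Kimi for reasoning and Chinese, GLM for vision), while the unweighted means all land between 3.96 and 4.08. That narrow spread is the murkiness in numeric form.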

I spent the most time with DeepSeek because V4 Flash kept ending up in my winner's column. Here's the family roster as I tested it:

Model	Output $/M	What I Used It For
V4 Flash	$0.25	Default driver for everything
V3.2	$0.38	Latest-arch experiments
V4 Pro	$0.78	When I needed an extra quality bump
R1 (Reasoner)	$2.50	Hard math, multi-step logic
Coder	$0.25	Code-specialized runs

What I noticed:

Here's the small Python helper I keep reusing — note the base_url points at Global API:

from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a careful data analyst."},
        {"role": "user", "content": "Explain Simpson's paradox in 100 words."}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

In my actual notebook this is wrapped in a run_prompt() function that logs latency, tokens, and cost. If you want the full version, it's around 40 lines and frankly not interesting.
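For readers who want a starting point anyway, here is a hypothetical reconstruction of what such a wrapper might look like. It is not the author's actual helper: the signature, the returned fields, and the `output_price_per_m` parameter are all assumptions. It accepts any OpenAI-compatible client object, and the stand-in client at the bottom exists only so the sketch runs without an API key.

```python
import time
from types import SimpleNamespace


def run_prompt(client, model, messages, output_price_per_m, **params):
    """Call a chat model and return text, latency, token counts, and output cost.

    Hypothetical sketch. Latency here is the full round trip, not TTFT;
    measuring TTFT would need a streaming call.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model, messages=messages, **params
    )
    latency_s = time.perf_counter() - start
    usage = response.usage
    return {
        "model": model,
        "text": response.choices[0].message.content,
        "latency_s": latency_s,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        # Output-only cost, since the post's tables list output prices.
        "output_cost_usd": usage.completion_tokens * output_price_per_m / 1_000_000,
    }


# Offline stand-in that mimics the response shape of an OpenAI-compatible client.
def _fake_create(**kwargs):
    return SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content="stub reply"))],
        usage=SimpleNamespace(prompt_tokens=12, completion_tokens=40, total_tokens=52),
    )


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

record = run_prompt(
    fake_client,
    "deepseek-v4-flash",
    [{"role": "user", "content": "ping"}],
    output_price_per_m=0.25,
    temperature=0.3,
)
print(record)
```

To use it for real, pass the `client` built above in place of `fake_client` and append each returned record to a list or CSV.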

Qwen is the family I underrated going in. Alibaba ships so many variants that it's easy to dismiss as "another big catalog" — but the range is genuinely useful in production:

Model	Output $/M	What I Used It For
Qwen3-8B	$0.01	Bulk classification, cheap embeddings-class work
Qwen3-32B	$0.28	My general-purpose default
Qwen3-Coder-30B	$0.35	Code when I need a non-DeepSeek opinion
Qwen3-VL-32B	$0.52	Vision tasks
Qwen3-Omni-30B	$0.52	Audio + video experiments
Qwen3.5-397B	$2.34	Heavy reasoning, when Kimi felt like overkill

Notes from the trenches:

Sample call:

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists in O(n) time."}
    ]
)
print(response.choices[0].message.content)

Kimi is the family I had the strongest priors about before testing. It's positioned as Moonshot AI's reasoning specialist — and the data backs that up. Here's what I tested:

Model	Output $/M	What I Used It For
K2.5	$3.00	The "main" Kimi model in my sample
(Family ceiling)	$3.50	Premium tier

Observations:

Zhipu's GLM family surprised me. I'd written it off as a "Chinese-language specialist," which is reductive:

Model	Output $/M	What I Used It For
GLM-4-9B	$0.01	Ultra-budget tasks
GLM-5	$1.92	Production quality, my GLM default
GLM-4.6V	(vision)	Image understanding

What I found:

Source: dev.to (original article)