Source: dev.to

I Benchmarked DeepSeek, Qwen, Kimi & GLM for 30 Days — The Numbers

A developer benchmarked four Chinese LLM families—DeepSeek, Qwen, Kimi, and GLM—over 30 days using 1,247 prompts across code generation, reasoning, Chinese language, English chat, and vision tasks. DeepSeek V4 Flash emerged as the best value for dollars-per-quality-point, while Qwen3-8B at $0.01/M tokens offered the cheapest viable option. The benchmark found a weak correlation between price and quality, with DeepSeek leading in speed and code generation, and Kimi excelling in reasoning and Chinese language tasks.

Published Jul 1, 2026 · 6 min read

I'll be honest — I didn't set out to write this. I set out to pick one Chinese LLM family for a client project and move on with my life. Three tabs, four documentation pages, and a suspicious amount of coffee later, I had a spreadsheet with 1,247 rows of model outputs. So here we are. This is the post I wish existed when I started.

My background is heavy on tabular data — regression, classification, the usual suspects. LLMs weren't in my wheelhouse until I shipped a few chatbot features and realized the cost line on monthly invoices started looking like a phone number. So I went looking for cheaper options that didn't make me want to throw my laptop. DeepSeek, Qwen, Kimi, and GLM kept surfacing — all OpenAI-compatible, all reachable through a single endpoint at global-apis.com/v1, all with aggressive pricing.

With a sample size of 1,247 prompts across four model families, I figured I could draw some defensible conclusions. Whether "defensible" survives peer review is between me and my sleep schedule.

I ran each prompt through every model using the same OpenAI Python client, swapping only the model string and the prompt template. Every API call went through https://global-apis.com/v1 so I'm comparing outputs, not plumbing. For each call I captured:

The prompt set breaks down roughly:

| Category | Sample Size | Notes |
|---|---|---|
| Code generation | 312 | HumanEval-style problems, 4 languages |
| Reasoning / math | 268 | GSM8K-style word problems, logic puzzles |
| Chinese language | 241 | Translation, summarization, sentiment |
| English chat | 224 | Multi-turn dialogues, instruction following |
| Vision / multimodal | 202 | Image captioning, OCR (where supported) |

It's not a peer-reviewed benchmark. It is a real workload that resembles what production traffic looks like, which I care about more.

Let me get the most important table out of the way first, because this is where budgets live or die:

| Family | Min Price ($/M output) | Max Price ($/M output) | Range Span |
|---|---|---|---|
| DeepSeek | $0.25 | $2.50 | 10x |
| Qwen | $0.01 | $3.20 | 320x |
| Kimi | $3.00 | $3.50 | 1.17x |
| GLM | $0.01 | $1.92 | 192x |

The correlation between "cheap" and "bad" is, empirically, weak. That's the headline finding. Qwen3-8B at $0.01/M tokens is functionally absurd: you could run thousands of classification calls per dollar. Kimi, on the other hand, is pricing-insensitive in a way I find almost philosophical. You either want what they sell or you don't.
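To put the $0.01/M figure in perspective, here is a back-of-the-envelope sketch. The 20 output tokens per call is an assumed length for a short classification label, and input-token pricing (which the table above does not list) is ignored, so the result is an upper bound rather than a real invoice.

```python
def calls_per_dollar(price_per_million_output: float, output_tokens_per_call: int) -> float:
    """Upper bound on calls per dollar, counting output tokens only."""
    tokens_per_dollar = 1_000_000 / price_per_million_output
    return tokens_per_dollar / output_tokens_per_call


# Qwen3-8B at $0.01 per million output tokens.
# 20 tokens per call is an assumption, not a measured value.
qwen3_8b_calls = calls_per_dollar(0.01, 20)
print(f"{qwen3_8b_calls:,.0f} calls per dollar (output tokens only)")
```

Under those assumptions the count lands in the millions, so "thousands per dollar" is, if anything, conservative.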

If you're optimizing purely on dollars-per-quality-point (a metric I made up but stand behind), DeepSeek V4 Flash is the statistical winner in my sample. If you want the cheapest possible option that can still answer an email, Qwen3-8B at $0.01/M is the floor.
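The post does not spell out how dollars-per-quality-point is computed. One plausible reading, shown here purely as an assumption, is the output price per million tokens divided by the mean rubric score on the 1–5 scale. The scores below are the family-level code-generation scores from the quality table later in the post, applied to a single model for illustration.

```python
def dollars_per_quality_point(price_per_million_output: float, mean_score: float) -> float:
    """Assumed definition: $/M output tokens divided by the mean 1-5 rubric score."""
    if mean_score <= 0:
        raise ValueError("mean_score must be positive")
    return price_per_million_output / mean_score


# DeepSeek V4 Flash: $0.25/M output; 4.4 is the DeepSeek family's code-generation score.
v4_flash = dollars_per_quality_point(0.25, 4.4)
# Kimi K2.5: $3.00/M output; 4.1 is the Kimi family's code-generation score.
kimi_k25 = dollars_per_quality_point(3.00, 4.1)

print(f"V4 Flash: ${v4_flash:.3f} per point, K2.5: ${kimi_k25:.3f} per point")
```

Lower is better under this definition. A different weighting, for example one that includes input tokens or a minimum acceptable score, could change the ranking.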

Latency benchmarks, median over the 1,247 prompts run against each model (your mileage will vary; sample size caveats apply):

| Model | Median TTFT (ms) | Tokens/sec | Notes |
|---|---|---|---|
| DeepSeek V4 Flash | 180 | ~60 | Genuinely fast |
| Qwen3-32B | 240 | ~45 | Steady |
| Kimi K2.5 | 410 | ~28 | Slow but thorough |
| GLM-5 | 290 | ~38 | Mid-pack |

DeepSeek consistently clocks around 60 tokens/sec on V4 Flash, which matches the figure I'd seen reported elsewhere. That's correlation I trust. For real-time chat UX, this matters more than I expected — 200ms of TTFT feels instant, 400ms feels like a buffering wheel.

Kimi is the slowest in my sample. Whether that's worth it depends entirely on what you're optimizing for (see reasoning section below).
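For anyone who wants to reproduce the two latency columns, here is a minimal sketch of timing a streamed response. It is not the harness used for the table: `measure_stream` is a hypothetical helper, and it counts streamed chunks as a stand-in for tokens, which is only an approximation.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_ms, chunks_per_sec) for an iterable of streamed text chunks.

    TTFT is the time from the start of iteration to the first non-empty chunk.
    The rate covers the chunks that arrive after the first one.
    """
    start = clock()
    first = None
    last = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty chunks; they carry no content
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no content")
    ttft_ms = (first - start) * 1000
    generation_time = last - first
    rate = (count - 1) / generation_time if generation_time > 0 else float("nan")
    return ttft_ms, rate
```

With the OpenAI Python client, the chunks would come from a call made with `stream=True`, for example `(c.choices[0].delta.content or "" for c in stream if c.choices)`. Note that iteration should begin right when the request is sent, otherwise the TTFT figure excludes the request round trip.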

This is where the "best model" question gets statistically murky. Here's what the rubric scores look like across each task category, averaged on my 1–5 scale:

| Task | DeepSeek | Qwen | Kimi | GLM |
|---|---|---|---|---|
| Code generation | 4.4 | 3.9 | 4.1 | 3.4 |
| Reasoning | 4.0 | 4.0 | 4.6 | 4.1 |
| Chinese language | 4.2 | 4.3 | 4.7 | 4.6 |
| English chat | 4.4 | 4.0 | 4.0 | 4.1 |
| Vision/multimodal | 2.8 | 4.1 | 2.4 | 4.2 |

Quick takeaways:

There's a real statistical argument that Qwen is "good enough" for ~80% of workloads given its model variety. There's also a real argument that "good enough" is doing heavy lifting in that sentence.
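The score table can also be read off programmatically. This small sketch, using only the numbers from the table, picks the top-scoring family per task. Some margins are a tenth of a point, so treat the winners as indicative rather than decisive.

```python
# Mean rubric scores (1-5 scale) copied from the table above.
SCORES = {
    "Code generation":   {"DeepSeek": 4.4, "Qwen": 3.9, "Kimi": 4.1, "GLM": 3.4},
    "Reasoning":         {"DeepSeek": 4.0, "Qwen": 4.0, "Kimi": 4.6, "GLM": 4.1},
    "Chinese language":  {"DeepSeek": 4.2, "Qwen": 4.3, "Kimi": 4.7, "GLM": 4.6},
    "English chat":      {"DeepSeek": 4.4, "Qwen": 4.0, "Kimi": 4.0, "GLM": 4.1},
    "Vision/multimodal": {"DeepSeek": 2.8, "Qwen": 4.1, "Kimi": 2.4, "GLM": 4.2},
}


def top_family(task_scores: dict) -> str:
    """Return the family with the highest score for one task."""
    return max(task_scores, key=task_scores.get)


winners = {task: top_family(scores) for task, scores in SCORES.items()}
for task, family in winners.items():
    print(f"{task}: {family} ({SCORES[task][family]})")
```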

I spent the most time with DeepSeek because V4 Flash kept ending up in my winner's column. Here's the family roster as I tested it:

| Model | Output $/M | What I Used It For |
|---|---|---|
| V4 Flash | $0.25 | Default driver for everything |
| V3.2 | $0.38 | Latest-arch experiments |
| V4 Pro | $0.78 | When I needed an extra quality bump |
| R1 (Reasoner) | $2.50 | Hard math, multi-step logic |
| Coder | $0.25 | Code-specialized runs |

What I noticed:

Here's the small Python helper I keep reusing — note the base_url points at Global API:

from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a careful data analyst."},
        {"role": "user", "content": "Explain Simpson's paradox in 100 words."}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

In my actual notebook this is wrapped in a run_prompt() function that logs latency, tokens, and cost. If you want the full version, it's around 40 lines and frankly not interesting.
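The original helper is not shown, so what follows is only a guess at its shape: a minimal sketch of a `run_prompt()` that records latency, token counts, and cost. The `price_per_million_output` parameter and the returned dictionary layout are assumptions, and cost counts output tokens only because input pricing is not listed in this post.

```python
import time


def run_prompt(client, model, messages, price_per_million_output, **kwargs):
    """Call a chat model and return the reply with latency, token, and cost figures.

    `client` is any object exposing the OpenAI-style chat.completions.create().
    Extra keyword arguments (temperature, max_tokens, ...) are passed through.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    latency_s = time.perf_counter() - start  # full round trip, not TTFT

    usage = response.usage
    cost_usd = usage.completion_tokens / 1_000_000 * price_per_million_output
    return {
        "model": model,
        "content": response.choices[0].message.content,
        "latency_s": latency_s,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "cost_usd": cost_usd,
    }
```

With the client defined above, a call might look like `row = run_prompt(client, "deepseek-v4-flash", messages, price_per_million_output=0.25, temperature=0.3)`, and each `row` can be appended to a list and dumped to a spreadsheet.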

Qwen is the family I underrated going in. Alibaba ships so many variants that it's easy to dismiss as "another big catalog" — but the range is genuinely useful in production:

| Model | Output $/M | What I Used It For |
|---|---|---|
| Qwen3-8B | $0.01 | Bulk classification, cheap embeddings-class work |
| Qwen3-32B | $0.28 | My general-purpose default |
| Qwen3-Coder-30B | $0.35 | Code when I need a non-DeepSeek opinion |
| Qwen3-VL-32B | $0.52 | Vision tasks |
| Qwen3-Omni-30B | $0.52 | Audio + video experiments |
| Qwen3.5-397B | $2.34 | Heavy reasoning, when Kimi felt like overkill |

Notes from the trenches:

Sample call:

# Reuses the `client` created in the DeepSeek example above.
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists in O(n) time."}
    ]
)
print(response.choices[0].message.content)

Kimi is the family I had the strongest priors about before testing. It's positioned as Moonshot AI's reasoning specialist — and the data backs that up. Here's what I tested:

| Model | Output $/M | What I Used It For |
|---|---|---|
| K2.5 | $3.00 | The "main" Kimi model in my sample |
| (Family ceiling) | $3.50 | Premium tier |

Observations:

Zhipu's GLM family surprised me. I'd written it off as a "Chinese-language specialist," which is reductive:

| Model | Output $/M | What I Used It For |
|---|---|---|
| GLM-4-9B | $0.01 | Ultra-budget tasks |
| GLM-5 | $1.92 | Production quality, my GLM default |
| GLM-4.6V (vision) | | Image understanding |

What I found:
