# I Benchmarked DeepSeek, Qwen, Kimi & GLM for 30 Days — The Numbers

> Source: <https://dev.to/rarenode/i-benchmarked-deepseek-qwen-kimi-glm-for-30-days-the-numbers-5je>
> Published: 2026-07-01 11:02:43+00:00

I'll be honest — I didn't set out to write this. I set out to pick one Chinese LLM family for a client project and move on with my life. Three tabs, four documentation pages, and a suspicious amount of coffee later, I had a spreadsheet with 1,247 rows of model outputs. So here we are. This is the post I wish existed when I started.

My background is heavy on tabular data — regression, classification, the usual suspects. LLMs weren't in my wheelhouse until I shipped a few chatbot features and realized the cost line on monthly invoices started looking like a phone number. So I went looking for cheaper options that didn't make me want to throw my laptop. DeepSeek, Qwen, Kimi, and GLM kept surfacing — all OpenAI-compatible, all reachable through a single endpoint at global-apis.com/v1, all with aggressive pricing.

With a sample size of 1,247 prompts across four model families, I figured I could draw some defensible conclusions. Whether "defensible" survives peer review is between me and my sleep schedule.

I ran each prompt through every model using the same OpenAI Python client, swapping only the `model` string and the prompt template. Every API call went through `https://global-apis.com/v1`, so I'm comparing outputs, not plumbing. For each call I captured:

The prompt set breaks down roughly:

| Category | Sample Size | Notes |
|---|---|---|
| Code generation | 312 | HumanEval-style problems, 4 languages |
| Reasoning / math | 268 | GSM8K-style word problems, logic puzzles |
| Chinese language | 241 | Translation, summarization, sentiment |
| English chat | 224 | Multi-turn dialogues, instruction following |
| Vision / multimodal | 202 | Image captioning, OCR (where supported) |

It's not a peer-reviewed benchmark. It *is* a real workload that resembles what production traffic looks like, which I care about more.

Let me get the most important table out of the way first, because this is where budgets live or die:

| Family | Min Price ($/M output) | Max Price ($/M output) | Range Span |
|---|---|---|---|
| DeepSeek | $0.25 | $2.50 | 10x |
| Qwen | $0.01 | $3.20 | 320x |
| Kimi | $3.00 | $3.50 | 1.17x |
| GLM | $0.01 | $1.92 | 192x |

The correlation between "cheap" and "bad" is, empirically, weak. That's the headline finding. Qwen3-8B at $0.01/M output tokens is functionally absurd: counting output tokens only, a dollar buys 100 million of them, which is more short classification calls than most projects will ever make. Kimi, on the other hand, is pricing-insensitive in a way I find almost philosophical: you either want what they sell or you don't.
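To make the arithmetic behind that table explicit, here is a minimal sketch that recomputes the range span and the output tokens a dollar buys at each family's floor price. The prices are copied from the table; the function names are mine and purely illustrative, and input-token pricing is ignored.

```python
# Output prices in $ per million tokens, copied from the pricing table above.
PRICE_RANGES = {
    "DeepSeek": (0.25, 2.50),
    "Qwen": (0.01, 3.20),
    "Kimi": (3.00, 3.50),
    "GLM": (0.01, 1.92),
}


def range_span(min_price, max_price):
    """How many times pricier the most expensive model is than the cheapest."""
    return max_price / min_price


def output_tokens_per_dollar(price_per_million):
    """Output tokens one dollar buys at a $/M price (input tokens ignored)."""
    return 1_000_000 / price_per_million


for family, (low, high) in PRICE_RANGES.items():
    print(
        f"{family}: {range_span(low, high):.2f}x span, "
        f"{output_tokens_per_dollar(low):,.0f} output tokens per dollar at the floor"
    )
```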

If you're optimizing purely on dollars-per-quality-point (a metric I made up but stand behind), DeepSeek V4 Flash is the statistical winner in my sample. If you want the cheapest possible option that can still answer an email, Qwen3-8B at $0.01/M is the floor.

Latency benchmarks, median over my sample of 1,247 calls (your mileage will vary — sample size caveats apply):

| Model | Median TTFT (ms) | Tokens/sec | Notes |
|---|---|---|---|
| DeepSeek V4 Flash | 180 | ~60 | Genuinely fast |
| Qwen3-32B | 240 | ~45 | Steady |
| Kimi K2.5 | 410 | ~28 | Slow but thorough |
| GLM-5 | 290 | ~38 | Mid-pack |

DeepSeek consistently clocks around 60 tokens/sec on V4 Flash, which matches the figure I'd seen reported elsewhere. That's correlation I trust. For real-time chat UX, this matters more than I expected — 200ms of TTFT feels instant, 400ms feels like a buffering wheel.

Kimi is the slowest in my sample. Whether that's worth it depends entirely on what you're optimizing for (see reasoning section below).
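For a feel of what those medians mean end to end, here is a rough sketch that combines TTFT and throughput into an estimated wall-clock time per reply. It assumes throughput stays constant after the first token, and the 200-token reply length is a hypothetical value chosen only for illustration.

```python
# Median figures copied from the latency table above.
LATENCY = {
    "DeepSeek V4 Flash": {"ttft_ms": 180, "tokens_per_sec": 60},
    "Qwen3-32B": {"ttft_ms": 240, "tokens_per_sec": 45},
    "Kimi K2.5": {"ttft_ms": 410, "tokens_per_sec": 28},
    "GLM-5": {"ttft_ms": 290, "tokens_per_sec": 38},
}

REPLY_TOKENS = 200  # hypothetical reply length, not a measured value


def estimated_seconds(ttft_ms, tokens_per_sec, output_tokens):
    """Rough wall-clock time: wait for the first token, then stream the rest."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


for model, stats in LATENCY.items():
    total = estimated_seconds(stats["ttft_ms"], stats["tokens_per_sec"], REPLY_TOKENS)
    print(f"{model}: ~{total:.1f}s for a {REPLY_TOKENS}-token reply")
```

On these numbers a 200-token reply lands at roughly 3.5 seconds on DeepSeek V4 Flash and about 7.6 seconds on Kimi K2.5, so the throughput gap dominates the TTFT gap once replies get long.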

This is where the "best model" question gets statistically murky. Here's what the rubric scores look like across each task category, averaged on my 1–5 scale:

| Task | DeepSeek | Qwen | Kimi | GLM |
|---|---|---|---|---|
| Code generation | 4.4 | 3.9 | 4.1 | 3.4 |
| Reasoning | 4.0 | 4.0 | 4.6 | 4.1 |
| Chinese language | 4.2 | 4.3 | 4.7 | 4.6 |
| English chat | 4.4 | 4.0 | 4.0 | 4.1 |
| Vision/multimodal | 2.8 | 4.1 | 2.4 | 4.2 |

Quick takeaways:

There's a real statistical argument that Qwen is "good enough" for ~80% of workloads given its model variety. There's also a real argument that "good enough" is doing heavy lifting in that sentence.
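The dollars-per-quality-point metric mentioned earlier can be sketched in a few lines. This is one possible definition, not a standard one: output price divided by the mean rubric score across the five task categories. Pairing each family's scores with a single representative model price (V4 Flash, Qwen3-32B, K2.5, GLM-5, as listed in this post) is an assumption made only to keep the example small.

```python
# Rubric scores (1-5) per family, copied from the task table above, in order:
# code generation, reasoning, Chinese language, English chat, vision/multimodal.
SCORES = {
    "DeepSeek": [4.4, 4.0, 4.2, 4.4, 2.8],
    "Qwen": [3.9, 4.0, 4.3, 4.0, 4.1],
    "Kimi": [4.1, 4.6, 4.7, 4.0, 2.4],
    "GLM": [3.4, 4.1, 4.6, 4.1, 4.2],
}

# ASSUMPTION: one representative output price per family, in $/M tokens
# (V4 Flash, Qwen3-32B, K2.5, GLM-5 as quoted in this post).
PRICES = {"DeepSeek": 0.25, "Qwen": 0.28, "Kimi": 3.00, "GLM": 1.92}


def dollars_per_quality_point(price, scores):
    """Output price divided by the mean rubric score; lower is better."""
    return price / (sum(scores) / len(scores))


ranking = sorted(
    SCORES, key=lambda fam: dollars_per_quality_point(PRICES[fam], SCORES[fam])
)

for fam in ranking:
    value = dollars_per_quality_point(PRICES[fam], SCORES[fam])
    print(f"{fam}: ${value:.3f} per quality point")
```

Under this definition the cheap defaults from DeepSeek and Qwen come out far ahead of GLM-5 and Kimi K2.5, mostly because the mean scores sit within about 0.1 of each other while the prices differ by roughly 10x. Weighting the categories by your own traffic mix would change the picture.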

I spent the most time with DeepSeek because V4 Flash kept ending up in my winner's column. Here's the family roster as I tested it:

| Model | Output $/M | What I Used It For |
|---|---|---|
| V4 Flash | $0.25 | Default driver for everything |
| V3.2 | $0.38 | Latest-arch experiments |
| V4 Pro | $0.78 | When I needed an extra quality bump |
| R1 (Reasoner) | $2.50 | Hard math, multi-step logic |
| Coder | $0.25 | Code-specialized runs |

What I noticed:

Here's the small Python helper I keep reusing — note the base_url points at Global API:

``` python
from openai import OpenAI

client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",
    base_url="https://global-apis.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a careful data analyst."},
        {"role": "user", "content": "Explain Simpson's paradox in 100 words."}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```

In my actual notebook this is wrapped in a `run_prompt()` function that logs latency, tokens, and cost. If you want the full version, it's around 40 lines and frankly not interesting.
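For the curious, a stripped-down sketch of what such a wrapper could look like. This is not the notebook version: the signature, the returned field names, and the output-only cost formula are assumptions for illustration, and it expects the provider to return a populated `usage` object.

```python
import time


def run_prompt(client, model, messages, price_per_m_output, **kwargs):
    """Call a chat model and return the reply plus latency, token, and cost stats.

    `client` is an OpenAI-compatible client like the one created above.
    `price_per_m_output` is the $/M output-token price, supplied by the caller.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    latency_s = time.perf_counter() - start  # full round trip, not TTFT

    usage = response.usage
    # ASSUMPTION: cost counts output tokens only; input pricing is ignored.
    cost_usd = usage.completion_tokens / 1_000_000 * price_per_m_output

    return {
        "model": model,
        "text": response.choices[0].message.content,
        "latency_s": latency_s,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "cost_usd": cost_usd,
    }
```

Calling it looks like `run_prompt(client, "deepseek-v4-flash", messages, price_per_m_output=0.25, temperature=0.3)`, and each returned dict can be appended straight to a list that becomes a spreadsheet row.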

Qwen is the family I underrated going in. Alibaba ships so many variants that it's easy to dismiss as "another big catalog" — but the range is genuinely useful in production:

| Model | Output $/M | What I Used It For |
|---|---|---|
| Qwen3-8B | $0.01 | Bulk classification, cheap embeddings-class work |
| Qwen3-32B | $0.28 | My general-purpose default |
| Qwen3-Coder-30B | $0.35 | Code when I need a non-DeepSeek opinion |
| Qwen3-VL-32B | $0.52 | Vision tasks |
| Qwen3-Omni-30B | $0.52 | Audio + video experiments |
| Qwen3.5-397B | $2.34 | Heavy reasoning, when Kimi felt like overkill |

Notes from the trenches:

Sample call:

``` python
# Switching the same client to Qwen3-32B for general tasks
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists in O(n) time."}
    ]
)
print(response.choices[0].message.content)
```

Kimi is the family I had the strongest priors about before testing. It's positioned as Moonshot AI's reasoning specialist — and the data backs that up. Here's what I tested:

| Model | Output $/M | What I Used It For |
|---|---|---|
| K2.5 | $3.00 | The "main" Kimi model in my sample |
| (Family ceiling) | $3.50 | Premium tier |

Observations:

Zhipu's GLM family surprised me. I'd written it off as a "Chinese-language specialist," which is reductive:

| Model | Output $/M | What I Used It For |
|---|---|---|
| GLM-4-9B | $0.01 | Ultra-budget tasks |
| GLM-5 | $1.92 | Production quality, my GLM default |
| GLM-4.6V | (vision) | Image understanding |

What I found:
