A real-world comparison of two LLMs on a genuine race condition bug from GitHub
| Metric | DeepSeek V4 Pro | MiMo V2.5 Pro |
|---|---|---|
| Time | ~8 min (2 rounds) | ~15 min (2 rounds) |
| Tokens | 2.43M | 3.36M |
| Cache hit rate | 92.1% | 95.2% |
| Cost | $0.14 (6% top-up fee) | $0.13 (0% fee) |
| Bugs found | 1 race condition | 3 race conditions |
| Fix approach | Prevention (lock-based) | Prevention (three-phase separation) |
Verdict: MiMo is better at debugging (finds more bugs, deeper analysis) AND cheaper. DeepSeek is faster and better for writing code.
Most LLM benchmarks test coding ability: write a function, solve a puzzle, implement an algorithm. But in real-world development, debugging is harder than writing code. You need to:
We wanted to test this specific skill. So we took a real race condition bug from a popular open-source library and gave it to both models.
Repository: encode/httpcore
Issue: #961 - Race Condition After Async Cancellations Breaks Connection Pool
Fix PR: #880 - Safe async cancellations
httpcore is a low-level HTTP client library used by httpx (the popular Python HTTP client). It handles connection pooling, HTTP/2, proxies, and more.
When async tasks are cancelled during connection operations, the pool's internal state becomes inconsistent. The pool thinks connections are still in use when they're actually cancelled, leading to pool exhaustion: new requests can never acquire a connection.
connection_pool.py, connection.py, and http2.py
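To make the failure mode concrete, here is a minimal sketch of how a cancellation between a state update and the I/O that follows can leave a pool permanently exhausted. `ToyPool` and its attributes are hypothetical names invented for illustration; this is not httpcore's code, only a toy model of the behaviour described above.

```python
import asyncio


class ToyPool:
    """Hypothetical one-connection pool; not httpcore's real implementation."""

    def __init__(self):
        self.max_connections = 1
        self.in_use = 0  # how many connections the pool believes are busy

    async def request(self):
        if self.in_use >= self.max_connections:
            raise RuntimeError("pool exhausted")
        self.in_use += 1          # state update
        await asyncio.sleep(0.1)  # stand-in for network I/O; cancellation lands here
        self.in_use -= 1          # skipped when the task is cancelled
        return "ok"


async def demo():
    pool = ToyPool()
    task = asyncio.create_task(pool.request())
    await asyncio.sleep(0)  # let the task mark the connection as in use
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    # The cancelled task never released its slot, so this request fails.
    try:
        outcome = await pool.request()
    except RuntimeError as exc:
        outcome = str(exc)
    return pool.in_use, outcome


result = asyncio.run(demo())
print(result)
```

The cancelled task is gone, but the counter still says one connection is busy, so every later request is refused.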
We gave each model the entire httpcore project at the commit BEFORE the fix (commit 79fa6bf). The project included:
- README.md with bug description (no hints about the fix)
- PROMPT.md with instructions
- SOLUTION.md and SOLUTION.diff (hidden from models)

Prompt (identical for both models):
You are given a Python project with a bug. Your task is to find the bug
and write a detailed explanation of how to fix it.
1. Read README.md to understand the project and the bug description.
2. Analyze the source code in httpcore/_async/ and httpcore/_sync/
to find the root cause of the race condition.
3. Run the tests to see which ones fail:
pip install -e ".[asyncio]"
pytest tests/ -v
4. Write your findings to SOLUTION.md with:
- Root cause analysis (what exactly goes wrong)
- Why it happens (the mechanism)
- How to fix it (the approach, not necessarily the exact code)
- Which files need to be changed
Do NOT modify the source code. Only write SOLUTION.md.
After Round 1, both models proposed patches (cleanup handlers) rather than prevention (atomic state management). We gave them a hint:
Prompt (identical for both models):
Your previous fix handled orphaned connections in the cancellation
handler. This works, but it treats the symptom: connections still
get orphaned, you just clean them up after.
A better approach would be to prevent the race condition from
happening in the first place. The root cause is that state
management (tracking idle vs in-use connections) is interleaved
with I/O operations (queue.get(), queue.put()). When a task is
cancelled between state update and I/O, the pool loses track.
Can you find a way to make the state management atomic, so that
cancellation cannot happen midway through the acquire/release
sequence?
Write your refined solution to SOLUTION_V2.md.
Root Cause: Found 1 race condition: orphaned connections when a task is cancelled after assignment but before resume.
Key Insight: "The connection remains in the pool marked as 'in use' but the task that was supposed to use it is gone."
Proposed Fix: Handle orphaned connections in the cancellation handler β check if a connection was assigned and release it.
Quality: Excellent root cause analysis, step-by-step mechanism explanation. However, proposed fix was a patch (cleanup handler), not prevention.
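A cleanup-style patch of this kind might look roughly like the sketch below. It is an illustration under assumptions, not DeepSeek's actual output: `PatchedToyPool` is a hypothetical toy class, and the `asyncio.sleep` call stands in for real network I/O.

```python
import asyncio


class PatchedToyPool:
    """Hypothetical pool showing the cleanup-handler style of fix."""

    def __init__(self):
        self.in_use = 0

    async def request(self):
        self.in_use += 1
        try:
            await asyncio.sleep(0.1)  # stand-in for network I/O
        except asyncio.CancelledError:
            # The connection has already been orphaned at this point;
            # the handler releases it and lets the cancellation propagate.
            self.in_use -= 1
            raise
        self.in_use -= 1
        return "ok"


async def demo():
    pool = PatchedToyPool()
    task = asyncio.create_task(pool.request())
    await asyncio.sleep(0)  # let the task claim the connection
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    outcome = await pool.request()  # succeeds because the handler cleaned up
    return pool.in_use, outcome


result = asyncio.run(demo())
print(result)
```

This works in the toy case, but every new await point between claim and release needs its own handler, which is why the approach counts as treating the symptom.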
Root Cause: Found 3 distinct race conditions:
Key Insight: Explained why existing tests don't catch it β they use single-request scenarios.
Proposed Fix: Add cleanup handlers + defensive connection sweep.
Quality: Excellent analysis, deeper than DeepSeek (3 bugs vs 1). However, proposed fix was also a patch (cleanup handlers), not prevention.
| Aspect | DeepSeek | MiMo |
|---|---|---|
| Time | ~3 min | ~9 min |
| Bugs found | 1 | 3 |
| Fix approach | Patch (cleanup) | Patch (cleanup) |
| Fix quality | 🟡 Treats symptoms | 🟡 Treats symptoms |
| Explanation quality | Excellent | Excellent |
Approach: Move connection claiming to the waiting task, make it atomic inside a lock.
Key Changes:
- _wait_and_acquire() method
- response_closed()
- _pool_state_changed event

Quality: 🟢 Architecturally clean, similar to the actual fix.
Approach: Three-phase separation: CLEANUP (I/O), STATE (sync), I/O (network).
Key Changes:
- _attempt_to_acquire_connection is now synchronous (no await inside lock)
- AsyncShieldCancellation for critical sections

Quality: 🟢 Systematic approach, analyzed 5 cancellation scenarios.
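The idea of keeping state changes free of await points can be sketched as follows. This is a simplified illustration, not MiMo's proposal or httpcore's code: `PhasedToyPool` and its methods are hypothetical, there is no waiter queue, and shielding of the cleanup phase is left out. It relies on the fact that asyncio only delivers a cancellation at an await, so a purely synchronous section cannot be interrupted.

```python
import asyncio
import threading


class PhasedToyPool:
    """Hypothetical pool that keeps state changes free of await points."""

    def __init__(self, max_connections=1):
        self._lock = threading.Lock()
        self._max = max_connections
        self._active = set()  # ids of requests that own a connection
        self._expired = []    # connections waiting to be closed

    def _try_acquire(self, request_id):
        # STATE phase: synchronous, so no cancellation can interrupt it.
        with self._lock:
            if len(self._active) >= self._max:
                return False
            self._active.add(request_id)
            return True

    def _release(self, request_id):
        # Also synchronous: safe to call from a finally block during cancellation.
        with self._lock:
            self._active.discard(request_id)

    async def _close_expired(self):
        # CLEANUP phase: take the list synchronously, then do the I/O.
        with self._lock:
            closing, self._expired = self._expired, []
        for _ in closing:
            await asyncio.sleep(0)  # stand-in for closing a socket

    async def request(self, request_id):
        await self._close_expired()
        if not self._try_acquire(request_id):
            raise RuntimeError("pool exhausted")
        try:
            await asyncio.sleep(0.1)  # I/O phase: stand-in for the network call
            return "ok"
        finally:
            self._release(request_id)


async def demo():
    pool = PhasedToyPool()
    task = asyncio.create_task(pool.request(1))
    await asyncio.sleep(0.01)  # let the task reach the I/O phase
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    outcome = await pool.request(2)
    return len(pool._active), outcome


result = asyncio.run(demo())
print(result)
```

Because there is no await between `_try_acquire` returning and the `try` block starting, there is no window in which a claimed connection can be left without an owner.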
| Aspect | DeepSeek | MiMo |
|---|---|---|
| Time | ~5 min | ~6 min |
| Approach | Lock-based atomic | Three-phase separation |
| Complexity | Medium | High |
| Edge cases | Good | Excellent (5 scenarios) |
| Metric | DeepSeek V4 Pro | MiMo V2.5 Pro |
|---|---|---|
| Total tokens | 2,431,121 | 3,356,951 |
| Cache hit | 2,198,400 | 3,146,304 |
| Cache miss | 189,058 | 157,502 |
| Output | 43,663 | 53,145 |
| Cache hit rate | 92.1% | 95.2% |
| API requests | 30 | 34 |
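
The derived figures in the table can be checked with a few lines of arithmetic. The sketch below assumes the cache hit rate is computed over input tokens only (hits divided by hits plus misses), which is the definition that reproduces the reported percentages; the variable and function names are illustrative.

```python
# Token counts copied from the table above.
usage = {
    "DeepSeek V4 Pro": {"hit": 2_198_400, "miss": 189_058, "output": 43_663},
    "MiMo V2.5 Pro": {"hit": 3_146_304, "miss": 157_502, "output": 53_145},
}


def summarize(counts):
    """Return (total tokens, cache hit rate in percent, rounded to 0.1)."""
    total = counts["hit"] + counts["miss"] + counts["output"]
    # Assumption: hit rate is measured against input tokens, not the total.
    hit_rate = counts["hit"] / (counts["hit"] + counts["miss"])
    return total, round(hit_rate * 100, 1)


summary = {model: summarize(counts) for model, counts in usage.items()}
extra_tokens_pct = round(
    (summary["MiMo V2.5 Pro"][0] / summary["DeepSeek V4 Pro"][0] - 1) * 100
)

for model, (total, rate) in summary.items():
    print(f"{model}: {total:,} tokens, {rate}% cache hit rate")
print(f"MiMo used {extra_tokens_pct}% more tokens")
```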
Both models use the same pricing on OpenCode Go:
DeepSeek V4 Pro:
MiMo V2.5 Pro:
Even though MiMo used 38% more tokens, it was still cheaper because:
The actual fix by Tom Christie (httpcore author) was elegantly simple:
Approach: Move ALL state management into non-cancellable sections using locks.
Key Insight: "The async case cannot have cancellations or context-switches midway through the state management because we hold the lock."
Files Changed: 9 files, +512/-379 lines
Both models converged on this approach in Round 2, though with different implementations:
| Task | Better Model | Why |
|---|---|---|
| Writing code | DeepSeek V4 Pro | Faster, fewer tokens, cleaner architecture |
| Debugging | MiMo V2.5 Pro | Finds more bugs, deeper analysis, cheaper |
For this specific debugging task:
MiMo is both better at debugging AND cheaper. The higher token usage is offset by the lack of top-up commission.
Synthetic bugs are too easy: models solve them in seconds. Real bugs from production codebases require:
In real-world debugging, you often get a quick fix first, then refine it. Testing both rounds shows:
This benchmark reveals that debugging and code writing are different skills. DeepSeek excels at writing clean, efficient code quickly. MiMo excels at deep analysis and finding subtle bugs.
For teams building AI-assisted development tools:
The surprise finding: MiMo is cheaper for debugging despite using more tokens, thanks to zero commission on top-up. For high-volume debugging workloads, this cost difference adds up.
Benchmark conducted on June 30, 2026 using DeepSeek API and Xiaomi MiMo API platforms. Full benchmark data available in the author's GitHub repository.