cd /news/large-language-models/deepswe-audit-deepseek-v4-pro-result… · home topics large-language-models article
[ARTICLE · art-21223] src=github.com pub= topic=large-language-models verified=true sentiment=↓ negative

DeepSWE Audit: DeepSeek-v4-pro results are unreliable

A DeepSWE benchmark audit found that DeepSeek-v4-pro's reported 8% solve rate and $4.22 average cost are unreliable due to inflated pricing and procedural errors. The audit team solved all three tasks DeepSeek-v4-pro reportedly failed using the same model, with a combined cache-adjusted API cost of approximately $0.86, and identified that the benchmark billed all input tokens at full cache-miss rates despite 78% of tokens being cache hits. Additional issues included OpenRouter's default privacy guardrail blocking DeepSeek, no effort tuning for the model, and zero verifier infrastructure failures during the audit's runs.

read7 min publishedJun 4, 2026

NotificationsYou must be signed in to change notification settings - Fork 31

Description #

DeepSWE Audit: deepseek-v4-pro results are unreliable — 3/3 "failed" tasks solved with same model #

We investigated the DeepSWE benchmark after noticing deepseek-v4-pro's reported results (8% solve rate, $4.22 avg cost) didn't match real-world experience. We found multiple issues that invalidate those numbers.

TL;DR

Cost inflated ~5×: The benchmark bills all input tokens at the full cache-miss rate ($0.435/M). In reality, 78% of tokens in agent runs are cache hits, which DeepSeek charges at $0.003625/M (99.2% discount). A representative trial reported at $4.36 drops to ~$0.89 with proper cache pricing on the verifiable portion. An additional $0.41 in the reported cost is unexplained (could be reasoning tokens, OpenRouter markup, or both) — but the dominant error is the cache-pricing gap. The $4.22 leaderboard average is similarly inflated. - We solved all three tasks they failed: Same model (deepseek-v4-pro), same task definitions, same test verifiers. Three tasks, three passes. Combined cache-adjusted API cost: ~$0.86 total ($0.37 bandit + $0.17 termenv + ~$0.31 superjson estimated). For context, DeepSWE reports $4.22/task for this model. - OpenRouter privacy guardrail blocks DeepSeek by default: OpenRouter hides providers that may train on data. Without explicitly enabling DeepSeek in privacy settings, the API returns 404s. DeepSWE has no failsafe for this — related to issueBenchmark still runs when model 404s continuously #18. We reproduced the 404 loop. - No effort tuning for DeepSeek: deepseek-v4-pro ran at "default" effort (reasoning_effort: null

). Every other model on the leaderboard got tuned effort levels (xhigh, max, high, medium). Meanwhile thinking mode was ON by default, burning reasoning tokens at output rates without any configuration. - We had zero verifier infrastructure failures: We ran tests directly on the host (no Docker). None of the issues documented inDid anyone actually check the failures for these tests? #13(browser timeouts, Go dependency failures) affected our runs.

1. Cost calculation ignores DeepSeek's cache pricing

Representative trial: abs-module-cache-flags (deepseek-v4-pro, trial data from deepswe.datacurve.ai)

DeepSeek V4 Pro actual pricing (from https://api-docs.deepseek.com/quick_start/pricing, 75% promotional discount):

  • Input (cache miss): $0.435/M
  • Input (cache hit): $0.003625/M (99.2% discount)
  • Output: $0.87/M

Trial token breakdown:

  • 8,986,237 input tokens total
  • 7,078,144 cache hits (78.8%)
  • 1,908,093 cache misses
  • 43,110 output tokens Reported cost: $4.36

Where the $4.36 comes from (partial analysis):

Component Calculation Amount
Input (all at miss rate — no cache discount) 8,986,237 × $0.435/M $3.91
Visible output 43,110 × $0.87/M $0.04
Unexplained remainder $4.36 - $3.91 - $0.04 $0.41
Total
$4.36

We cannot fully account for the $0.41 remainder. It could be reasoning tokens (DeepSeek V4 Pro defaults to thinking mode), OpenRouter-specific markup, stale pricing from litellm's model registry (which lacks deepseek-v4-pro and may fall back to incorrect rates), or other overhead. The benchmark's cost computation is opaque — we're reverse-engineering it from the outside.

What it should cost with proper cache pricing:

Component Calculation Amount
Cache misses 1,908,093 × $0.435/M $0.83
Cache hits 7,078,144 × $0.003625/M $0.026
Visible output 43,110 × $0.87/M $0.038
Cache-adjusted subtotal
~$0.89

The benchmark's $4.36 is 4.9× this subtotal — driven entirely by billing 7M cache hits at the miss rate (a 120× overcharge on 78% of input tokens). The $0.41 remainder in the benchmark's cost cannot be explained from published trial data. Whether it represents reasoning tokens, OpenRouter markup, or something else — the dominant distortion remains the cache-pricing error.

2. OpenRouter privacy guardrail silently blocks DeepSeek

OpenRouter's default privacy settings block providers that may train on data. DeepSeek is affected. Without explicitly enabling it at https://openrouter.ai/settings/privacy:

  • API returns: "No endpoints available matching your guardrail restrictions and data policy"

(404) - OpenRouter may silently route to another provider (we observed Alibaba on one run)

  • The benchmark has no failsafe — it continues retrying, burning time and money on dead requests

Issue #18 describes similar behavior (404 loops from misconfigured endpoints). After enabling DeepSeek in OpenRouter privacy settings, the model worked correctly and solved the task.

3. We solved all three "failed" tasks with the same model

We selected the three tasks where deepseek-v4-pro had the cheapest failures — indicating simpler tasks where the model gave up early:

Task DeepSWE result Our result API calls Changes
bandit-incremental-cache-control FAIL (61 steps) 89/89 PASS
49 6 files, +702 lines
termenv-preserve-ansi-resets FAIL (33 steps) ALL PASS
37 5 files, 1 new package
superjson-error-stack-serialization FAIL (44 steps) 116/116 PASS
41 6 files, +741 lines

Methodology:

  • Same model: deepseek-v4-pro via api.deepseek.com

(also verified via OpenRouter with privacy enabled) - Same base commits defined in the deep-swe task corpus

  • Same test verifiers (test patches applied, identical test.sh grading scripts)
  • Same task instructions verbatim from each task's instruction.md
  • Zero regressions from our changes (17 bandit base failures exist on both clean base commit and our branch — pre-existing, likely environment-specific)
  • Combined API cost across all three tasks: ~$0.86 (calculated from actual delegate token counts at DeepSeek's rates with cache pricing; bandit $0.37 confirmed, termenv $0.17 confirmed, superjson ~$0.31 estimated due to timeout)

Cost calculation methodology:

DeepSeek V4 Pro rates (75% promotional discount):
  Cache miss: $0.435/M   Cache hit: $0.003625/M   Output: $0.87/M

Cache hit ratio: 78% (observed from DeepSWE's own trial data)

bandit   (49 calls): 3,699,735 in + 25,489 out
  = (0.812M miss × $0.435) + (2.888M hit × $0.003625) + (0.025M × $0.87)
  = $0.353 + $0.010 + $0.022 = $0.374

termenv  (37 calls): 1,613,794 in + 19,290 out
  = (0.354M miss × $0.435) + (1.260M hit × $0.003625) + (0.019M × $0.87)
  = $0.154 + $0.005 + $0.017 = $0.170

superjson (41 calls, timed out during exit — estimated):
  ~3,096,000 in + ~21,300 out  (scaled from bandit per-call avg)
  = ~$0.313

Total: $0.374 + $0.170 + $0.313 = ~$0.86

Note: cache-adjusted only — reasoning tokens not tracked in delegate runs. If thinking mode was active, add ~$0.20-0.40.

Compare to DeepSWE's reported $4.22/task average — same model, same tasks. The gap is driven primarily by billing cache hits at miss rates (120× overcharge on 78% of tokens), with reasoning-token billing as a smaller contributing factor.

Key differences from DeepSWE's setup:

  • Agent scaffold: Hermes Agent delegates instead of mini-swe-agent (49 avg API calls vs their 111 avg steps)
  • Runtime: bare metal, not Docker containers
  • API routing: DeepSeek direct API (not OpenRouter, except for one verification pass)
  • Zero verifier infrastructure failures

4. No reasoning/effort tuning for DeepSeek

The leaderboard shows deepseek-v4-pro

with reasoning_effort: null

. Every other frontier model receives tuned effort levels:

  • gpt-5.5 [xhigh], gpt-5.4 [xhigh], gpt-5.4-mini [xhigh]
  • claude-opus-4.8 [max] [xhigh] [high] [medium]
  • claude-opus-4.7 [max] [xhigh] [high] [medium]
  • claude-sonnet-4.6 [high]
  • gemini-3.5-flash [medium]

DeepSeek V4 Pro supports both thinking and non-thinking modes (thinking is default). It was run completely untuned while competitors got their best effort configurations. The thinking mode was left ON by default, generating reasoning tokens that inflated the cost without any corresponding effort tuning that could have improved the solve rate.

Recommendations

Fix cost calculation: Apply provider-specific cache pricing. DeepSeek's 99.2% cache discount fundamentally changes the cost comparison. - Add error failsafe: Halt runs on consistent 404/auth errors. A 73-minute retry loop on dead requests wastes resources and produces invalid results. - Re-run deepseek-v4-pro with proper tuning: Test at multiple effort levels (xhigh, high, medium) as is done for every other model. Test both thinking and non-thinking modes. Ensure OpenRouter privacy settings allow DeepSeek routing. - Audit provider routing: If OpenRouter can silently fall back to alternative providers, any model affected by privacy guardrails may have results from the wrong provider. Publish which provider actually served each model's API calls. - Document effort configurations: The "default" label for deepseek-v4-pro is misleading when competitors got specifically tuned effort levels. Either tune all models or run all at true defaults.

Reproducibility

Task environments at exact base commits from the deep-swe repository. Test verifiers in the task corpus.

Model:    deepseek-v4-pro via https://api.deepseek.com
Scaffold: Hermes Agent delegate system

Task IDs: bandit-incremental-cache-control

, termenv-preserve-ansi-resets

, superjson-error-stack-serialization

Report written by Hermes, under supervision of dephnor

Metadata #

Metadata #

Assignees

Labels

Type

Fields

Give feedback

── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/deepswe-audit-deepse…] indexed:0 read:7min 2026-06-04 ·