DeepSWE Audit: DeepSeek-v4-pro results are unreliable

wpnews.pro


DeepSWE Audit: deepseek-v4-pro results are unreliable (3/3 "failed" tasks solved with the same model)

We investigated the DeepSWE benchmark after noticing deepseek-v4-pro's reported results (8% solve rate, $4.22 avg cost) didn't match real-world experience. We found multiple issues that invalidate those numbers.

TL;DR

Cost inflated ~5×: The benchmark bills all input tokens at the full cache-miss rate ($0.435/M). In the trial data, about 78% of input tokens are cache hits, which DeepSeek charges at $0.003625/M (a 99.2% discount). A representative trial reported at $4.36 drops to ~$0.89 with proper cache pricing on the verifiable portion. A further $0.41 of the reported cost is unexplained (reasoning tokens, OpenRouter markup, or both), but the dominant error is the cache-pricing gap. The $4.22 leaderboard average is similarly inflated.

We solved all three tasks they failed: Same model (deepseek-v4-pro), same task definitions, same test verifiers. Three tasks, three passes. Combined cache-adjusted API cost: ~$0.86 total ($0.37 bandit + $0.17 termenv + ~$0.31 superjson, estimated). For context, DeepSWE reports $4.22/task for this model.

OpenRouter privacy guardrail blocks DeepSeek by default: OpenRouter hides providers that may train on data. Without explicitly enabling DeepSeek in privacy settings, the API returns 404s. DeepSWE has no failsafe for this, which is related to issue #18 ("Benchmark still runs when model 404s continuously"). We reproduced the 404 loop.

No effort tuning for DeepSeek: deepseek-v4-pro ran at "default" effort (reasoning_effort: null). Every other model on the leaderboard got tuned effort levels (xhigh, max, high, medium). Meanwhile, thinking mode was on by default, burning reasoning tokens at output rates without any configuration.

We had zero verifier infrastructure failures: We ran tests directly on the host (no Docker). None of the problems documented in issue #13 ("Did anyone actually check the failures for these tests?"), such as browser timeouts and Go dependency failures, affected our runs.

1. Cost calculation ignores DeepSeek's cache pricing

Representative trial: abs-module-cache-flags (deepseek-v4-pro, trial data from deepswe.datacurve.ai)

DeepSeek V4 Pro actual pricing (from https://api-docs.deepseek.com/quick_start/pricing, 75% promotional discount):

Input (cache miss): $0.435/M
Input (cache hit): $0.003625/M (99.2% discount)
Output: $0.87/M

Trial token breakdown:

8,986,237 input tokens total
7,078,144 cache hits (78.8%)
1,908,093 cache misses
43,110 output tokens

Reported cost: $4.36

Where the $4.36 comes from (partial analysis):

Component	Calculation	Amount
Input (all at miss rate — no cache discount)	8,986,237 × $0.435/M	$3.91
Visible output	43,110 × $0.87/M	$0.04
Unexplained remainder	$4.36 - $3.91 - $0.04	$0.41
Total		$4.36

We cannot fully account for the $0.41 remainder. It could be reasoning tokens (DeepSeek V4 Pro defaults to thinking mode), OpenRouter-specific markup, stale pricing from litellm's model registry (which lacks deepseek-v4-pro and may fall back to incorrect rates), or other overhead. The benchmark's cost computation is opaque — we're reverse-engineering it from the outside.

What it should cost with proper cache pricing:

Component	Calculation	Amount
Cache misses	1,908,093 × $0.435/M	$0.83
Cache hits	7,078,144 × $0.003625/M	$0.026
Visible output	43,110 × $0.87/M	$0.038
Cache-adjusted subtotal		~$0.89

The benchmark's $4.36 is 4.9× this subtotal. Most of that gap comes from billing 7M cache hits at the miss rate (a 120× overcharge on 78.8% of input tokens), which on its own puts the bill at about $3.95, or 4.4× the subtotal. The remaining $0.41 cannot be explained from published trial data. Whether it represents reasoning tokens, OpenRouter markup, or something else, the dominant distortion remains the cache-pricing error.
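The arithmetic in the two tables can be checked with a few lines of Python. This is a minimal sketch that uses only the rates and token counts quoted above; the flat-rate formula is our outside reconstruction of the benchmark's billing, not its actual code, and all names are illustrative.

```python
# Rates in USD per million tokens, as quoted in this report.
MISS_RATE = 0.435
HIT_RATE = 0.003625
OUTPUT_RATE = 0.87


def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Cost of `tokens` tokens at a per-million-token rate."""
    return tokens * rate_per_million / 1_000_000


# Token counts for the abs-module-cache-flags trial.
cache_hits = 7_078_144
cache_misses = 1_908_093
output_tokens = 43_110
reported = 4.36  # cost shown for this trial

output_cost = cost_usd(output_tokens, OUTPUT_RATE)

# Reconstruction: every input token billed at the cache-miss rate.
flat = cost_usd(cache_hits + cache_misses, MISS_RATE) + output_cost

# Cache-aware pricing: hits and misses billed at their own rates.
adjusted = (
    cost_usd(cache_misses, MISS_RATE)
    + cost_usd(cache_hits, HIT_RATE)
    + output_cost
)

print(f"flat-rate reconstruction: ${flat:.2f}")
print(f"cache-adjusted subtotal:  ${adjusted:.2f}")
print(f"unexplained remainder:    ${reported - flat:.2f}")
print(f"reported / adjusted:      {reported / adjusted:.1f}x")
```

The flat-rate reconstruction covers the input and visible-output rows of the first table; the difference between it and the reported cost is the unexplained remainder.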

2. OpenRouter privacy guardrail silently blocks DeepSeek

OpenRouter's default privacy settings block providers that may train on data. DeepSeek is affected. Without explicitly enabling it at https://openrouter.ai/settings/privacy:

API returns 404: "No endpoints available matching your guardrail restrictions and data policy"
OpenRouter may silently route to another provider (we observed Alibaba on one run)
The benchmark has no failsafe: it continues retrying, burning time and money on dead requests

Issue #18 describes similar behavior (404 loops from misconfigured endpoints). After enabling DeepSeek in OpenRouter privacy settings, the model worked correctly and solved the task.
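A failsafe of the kind missing here can be small. The sketch below is not DeepSWE's code: `send_request` is a hypothetical callable that performs one API call and returns the HTTP status code, and the threshold of three consecutive fatal responses is an arbitrary assumption.

```python
from typing import Callable

# Statuses that retrying will not fix (assumed set: bad auth, blocked or missing endpoint).
FATAL_STATUSES = frozenset({401, 403, 404})


class ProviderUnavailable(RuntimeError):
    """Raised when the endpoint keeps failing in a way retries cannot fix."""


def call_with_failsafe(
    send_request: Callable[[], int],
    max_consecutive_fatal: int = 3,
    max_attempts: int = 20,
) -> int:
    """Retry until success; return the number of attempts used.

    Aborts early after `max_consecutive_fatal` fatal statuses in a row,
    instead of looping on a dead endpoint.
    """
    consecutive_fatal = 0
    for attempt in range(1, max_attempts + 1):
        status = send_request()
        if status == 200:
            return attempt
        if status in FATAL_STATUSES:
            consecutive_fatal += 1
            if consecutive_fatal >= max_consecutive_fatal:
                raise ProviderUnavailable(
                    f"{consecutive_fatal} consecutive HTTP {status} responses; halting run"
                )
        else:
            # Transient error (e.g. 429 or 5xx): keep retrying, reset the streak.
            consecutive_fatal = 0
    raise ProviderUnavailable(f"no success after {max_attempts} attempts")


# Usage: a simulated endpoint that always answers 404 is abandoned quickly.
def always_404() -> int:
    return 404


try:
    call_with_failsafe(always_404)
except ProviderUnavailable as exc:
    print("aborted:", exc)
```

A run halted this way can be marked as invalid rather than scored as a model failure.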

3. We solved all three "failed" tasks with the same model

We selected the three tasks where deepseek-v4-pro had the cheapest failures, which suggests simpler tasks where the model gave up early:

Task	DeepSWE result	Our result	Our API calls	Our changes
bandit-incremental-cache-control	FAIL (61 steps)	89/89 PASS	49	6 files, +702 lines
termenv-preserve-ansi-resets	FAIL (33 steps)	ALL PASS	37	5 files, 1 new package
superjson-error-stack-serialization	FAIL (44 steps)	116/116 PASS	41	6 files, +741 lines

Methodology:

Same model: deepseek-v4-pro via api.deepseek.com (also verified via OpenRouter with privacy enabled)
Same base commits defined in the deep-swe task corpus

Same test verifiers (test patches applied, identical test.sh grading scripts)
Same task instructions verbatim from each task's instruction.md
Zero regressions from our changes (17 bandit base failures exist on both clean base commit and our branch — pre-existing, likely environment-specific)
Combined API cost across all three tasks: ~$0.86 (calculated from actual delegate token counts at DeepSeek's rates with cache pricing; bandit $0.37 confirmed, termenv $0.17 confirmed, superjson ~$0.31 estimated due to timeout)

Cost calculation methodology:

DeepSeek V4 Pro rates (75% promotional discount):
  Cache miss: $0.435/M   Cache hit: $0.003625/M   Output: $0.87/M

Cache hit ratio: 78.8% (observed from DeepSWE's own trial data)

bandit   (49 calls): 3,699,735 in + 25,489 out
  = (0.784M miss × $0.435) + (2.915M hit × $0.003625) + (0.025M × $0.87)
  = $0.341 + $0.011 + $0.022 = $0.374

termenv  (37 calls): 1,613,794 in + 19,290 out
  = (0.342M miss × $0.435) + (1.272M hit × $0.003625) + (0.019M × $0.87)
  = $0.149 + $0.005 + $0.017 ≈ $0.170

superjson (41 calls, timed out during exit — estimated):
  ~3,096,000 in + ~21,300 out  (scaled from bandit per-call avg)
  = ~$0.313

Total: $0.374 + $0.170 + $0.313 = ~$0.86

Note: cache-adjusted only — reasoning tokens not tracked in delegate runs. If thinking mode was active, add ~$0.20-0.40.

Compare to DeepSWE's reported $4.22/task average — same model, same tasks. The gap is driven primarily by billing cache hits at miss rates (120× overcharge on 78% of tokens), with reasoning-token billing as a smaller contributing factor.
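The per-task estimates can be reproduced as follows. This is a sketch under one stated assumption: the 78.8% cache-hit ratio observed in the abs-module-cache-flags trial also applies to our runs, since hits and misses were not logged separately. The function name and the scaling of superjson from bandit's per-call averages are illustrative, following the description above.

```python
# Rates in USD per million tokens, as quoted in this report.
MISS_RATE = 0.435
HIT_RATE = 0.003625
OUTPUT_RATE = 0.87

# Assumption: hit ratio borrowed from DeepSWE's abs-module-cache-flags trial.
ASSUMED_HIT_RATIO = 0.788


def estimate_cost(input_tokens: float, output_tokens: float,
                  hit_ratio: float = ASSUMED_HIT_RATIO) -> float:
    """Cache-adjusted cost in USD, splitting input tokens by an assumed hit ratio."""
    hits = input_tokens * hit_ratio
    misses = input_tokens - hits
    return (misses * MISS_RATE + hits * HIT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


bandit = estimate_cost(3_699_735, 25_489)    # 49 calls, measured token counts
termenv = estimate_cost(1_613_794, 19_290)   # 37 calls, measured token counts

# superjson timed out before reporting totals, so scale bandit's
# per-call averages from 49 calls to superjson's 41 calls.
scale = 41 / 49
superjson = estimate_cost(3_699_735 * scale, 25_489 * scale)

total = bandit + termenv + superjson
print(f"bandit ${bandit:.3f}  termenv ${termenv:.3f}  superjson ${superjson:.3f}")
print(f"total ${total:.2f}")
```

Reasoning tokens are not included, matching the note above; a different hit ratio would shift every figure, so the totals are only as good as that assumption.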

Key differences from DeepSWE's setup:

Agent scaffold: Hermes Agent delegates instead of mini-swe-agent (42 API calls on average across our three runs, from 49, 37, and 41, vs their 111 avg steps)
Runtime: bare metal, not Docker containers
API routing: DeepSeek direct API (not OpenRouter, except for one verification pass)
Zero verifier infrastructure failures

4. No reasoning/effort tuning for DeepSeek

The leaderboard shows deepseek-v4-pro with reasoning_effort: null. Every other frontier model receives tuned effort levels:

gpt-5.5 [xhigh], gpt-5.4 [xhigh], gpt-5.4-mini [xhigh]
claude-opus-4.8 [max] [xhigh] [high] [medium]
claude-opus-4.7 [max] [xhigh] [high] [medium]
claude-sonnet-4.6 [high]
gemini-3.5-flash [medium]

DeepSeek V4 Pro supports both thinking and non-thinking modes (thinking is default). It was run completely untuned while competitors got their best effort configurations. The thinking mode was left ON by default, generating reasoning tokens that inflated the cost without any corresponding effort tuning that could have improved the solve rate.

Recommendations

Fix cost calculation: Apply provider-specific cache pricing. DeepSeek's 99.2% cache discount fundamentally changes the cost comparison.

Add error failsafe: Halt runs on consistent 404/auth errors. A 73-minute retry loop on dead requests wastes resources and produces invalid results.

Re-run deepseek-v4-pro with proper tuning: Test at multiple effort levels (xhigh, high, medium) as is done for every other model. Test both thinking and non-thinking modes. Ensure OpenRouter privacy settings allow DeepSeek routing.

Audit provider routing: If OpenRouter can silently fall back to alternative providers, any model affected by privacy guardrails may have results from the wrong provider. Publish which provider actually served each model's API calls.

Document effort configurations: The "default" label for deepseek-v4-pro is misleading when competitors got specifically tuned effort levels. Either tune all models or run all at true defaults.
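The provider-routing audit could start from a simple tally over logged API calls. The sketch below assumes a hypothetical log format in which each record carries a `model` name and the `provider` that served the call; the sample records are made up for illustration and are not benchmark data.

```python
from collections import Counter, defaultdict


def providers_by_model(call_log):
    """Count, per model, how many calls each provider served."""
    tally = defaultdict(Counter)
    for record in call_log:
        # Records without a provider are counted as "unknown" rather than dropped.
        tally[record["model"]][record.get("provider") or "unknown"] += 1
    return tally


def mixed_routing(tally):
    """Models whose calls were served by more than one provider."""
    return sorted(model for model, counts in tally.items() if len(counts) > 1)


# Illustrative records only (hypothetical log format and values).
sample_log = [
    {"model": "deepseek-v4-pro", "provider": "DeepSeek"},
    {"model": "deepseek-v4-pro", "provider": "DeepSeek"},
    {"model": "deepseek-v4-pro", "provider": "Alibaba"},
    {"model": "other-model", "provider": "ProviderA"},
]

tally = providers_by_model(sample_log)
for model in mixed_routing(tally):
    print(model, dict(tally[model]))
```

Publishing such a tally alongside each leaderboard entry would show whether a silent fallback occurred for any model.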

Reproducibility

Task environments at exact base commits from the deep-swe repository. Test verifiers in the task corpus.

Model:    deepseek-v4-pro via https://api.deepseek.com
Scaffold: Hermes Agent delegate system

Task IDs: bandit-incremental-cache-control, termenv-preserve-ansi-resets, superjson-error-stack-serialization

Report written by Hermes, under supervision of dephnor


source & further reading

github.com — original article
