{"slug": "deepswe-audit-deepseek-v4-pro-results-are-unreliable", "title": "DeepSWE Audit: DeepSeek-v4-pro results are unreliable", "summary": "A DeepSWE benchmark audit found that DeepSeek-v4-pro's reported 8% solve rate and $4.22 average cost are unreliable due to inflated pricing and procedural errors. The audit team solved three tasks that DeepSeek-v4-pro reportedly failed using the same model, with a combined cache-adjusted API cost of approximately $0.86, and identified that the benchmark billed all input tokens at full cache-miss rates despite 78% of tokens being cache hits. Additional issues included OpenRouter's default privacy guardrail blocking DeepSeek and no effort tuning for the model; the audit's own runs had zero verifier infrastructure failures.", "body_md": "# DeepSWE Audit: deepseek-v4-pro results are unreliable, 3/3 \"failed\" tasks solved with same model #21\n\nWe investigated the DeepSWE benchmark after noticing that deepseek-v4-pro's reported results (8% solve rate, $4.22 avg cost) didn't match real-world experience. We found multiple issues that invalidate those numbers.\n\n### TL;DR\n\n- **Cost inflated ~5×**: The benchmark bills all input tokens at the full cache-miss rate ($0.435/M). In reality, roughly 78% of input tokens in agent runs are cache hits, which DeepSeek charges at $0.003625/M (99.2% discount). A representative trial reported at $4.36 drops to ~$0.89 with proper cache pricing on the verifiable portion. An additional $0.41 in the reported cost is unexplained (could be reasoning tokens, OpenRouter markup, or both), but the dominant error is the cache-pricing gap. The $4.22 leaderboard average is likely inflated in the same way. 
\n- **We solved all three failed tasks we retried**: Same model (deepseek-v4-pro), same task definitions, same test verifiers. Three tasks, three passes. Combined cache-adjusted API cost: ~$0.86 total ($0.37 bandit + $0.17 termenv + ~$0.31 superjson estimated). For context, DeepSWE reports $4.22/task for this model.\n- **OpenRouter privacy guardrail blocks DeepSeek by default**: OpenRouter hides providers that may train on data. Without explicitly enabling DeepSeek in privacy settings, the API returns 404s. DeepSWE has no failsafe for this, which is related to issue [Benchmark still runs when model 404s continuously #18](https://github.com/datacurve-ai/deep-swe/issues/18). We reproduced the 404 loop.\n- **No effort tuning for DeepSeek**: deepseek-v4-pro ran at \"default\" effort (`reasoning_effort: null`). Every other model on the leaderboard got tuned effort levels (xhigh, max, high, medium). Meanwhile thinking mode was ON by default, burning reasoning tokens at output rates without any configuration.\n- **We had zero verifier infrastructure failures**: We ran tests directly on the host (no Docker). None of the issues documented in [Did anyone actually check the failures for these tests? #13](https://github.com/datacurve-ai/deep-swe/issues/13) (browser timeouts, Go dependency failures) affected our runs.\n\n### 1. 
Cost calculation ignores DeepSeek's cache pricing\n\n**Representative trial: abs-module-cache-flags (deepseek-v4-pro, trial data from deepswe.datacurve.ai)**\n\nDeepSeek V4 Pro actual pricing (from [https://api-docs.deepseek.com/quick_start/pricing](https://api-docs.deepseek.com/quick_start/pricing), 75% promotional discount):\n\n- Input (cache miss): $0.435/M\n- Input (cache hit): $0.003625/M (99.2% discount)\n- Output: $0.87/M\n\nTrial token breakdown:\n\n- 8,986,237 input tokens total\n- 7,078,144 cache hits (78.8%)\n- 1,908,093 cache misses\n- 43,110 output tokens\n\n**Reported cost: $4.36**\n\n**Where the $4.36 comes from (partial analysis):**\n\n| Component | Calculation | Amount |\n|---|---|---|\n| Input (all at miss rate, no cache discount) | 8,986,237 × $0.435/M | $3.91 |\n| Visible output | 43,110 × $0.87/M | $0.04 |\n| Unexplained remainder | $4.36 - $3.91 - $0.04 | $0.41 |\n| Total | | $4.36 |\n\nWe cannot fully account for the $0.41 remainder. It could be reasoning tokens (DeepSeek V4 Pro defaults to thinking mode), OpenRouter-specific markup, stale pricing from litellm's model registry (which lacks deepseek-v4-pro and may fall back to incorrect rates), or other overhead. The benchmark's cost computation is opaque, so we are reverse-engineering it from the outside.\n\n**What it should cost with proper cache pricing:**\n\n| Component | Calculation | Amount |\n|---|---|---|\n| Cache misses | 1,908,093 × $0.435/M | $0.83 |\n| Cache hits | 7,078,144 × $0.003625/M | $0.026 |\n| Visible output | 43,110 × $0.87/M | $0.038 |\n| Cache-adjusted subtotal | | ~$0.89 |\n\nThe benchmark's $4.36 is 4.9× this subtotal. Of the roughly $3.47 gap, about $3.05 comes from billing the 7.08M cache hits at the miss rate ($3.08 billed versus $0.026 owed, a 120× overcharge on 78.8% of input tokens), and the rest is the $0.41 remainder, which cannot be explained from published trial data. Whether that remainder represents reasoning tokens, OpenRouter markup, or something else, the dominant distortion remains the cache-pricing error.\n\n### 2. 
OpenRouter privacy guardrail silently blocks DeepSeek\n\nOpenRouter's default privacy settings block providers that may train on data. DeepSeek is affected. Without explicitly enabling it at [https://openrouter.ai/settings/privacy](https://openrouter.ai/settings/privacy):\n\n- API returns: `\"No endpoints available matching your guardrail restrictions and data policy\"` (404)\n- OpenRouter may silently route to another provider (we observed Alibaba on one run)\n- The benchmark has no failsafe: it continues retrying, burning time and money on dead requests\n\nIssue [#18](https://github.com/datacurve-ai/deep-swe/issues/18) describes similar behavior (404 loops from misconfigured endpoints). After enabling DeepSeek in OpenRouter privacy settings, the model worked correctly and solved the task.\n\n### 3. We solved all three \"failed\" tasks with the same model\n\nWe selected the three tasks where deepseek-v4-pro had the cheapest failures, which suggests simpler tasks where the model gave up early:\n\n| Task | DeepSWE result | Our result | API calls | Changes |\n|---|---|---|---|---|\n| bandit-incremental-cache-control | FAIL (61 steps) | 89/89 PASS | 49 | 6 files, +702 lines |\n| termenv-preserve-ansi-resets | FAIL (33 steps) | ALL PASS | 37 | 5 files, 1 new package |\n| superjson-error-stack-serialization | FAIL (44 steps) | 116/116 PASS | 41 | 6 files, +741 lines |\n\n**Methodology:**\n\n- Same model: deepseek-v4-pro via `api.deepseek.com` (also verified via OpenRouter with privacy enabled)\n- Same base commits defined in the deep-swe task corpus\n- Same test verifiers (test patches applied, identical test.sh grading scripts)\n- Same task instructions verbatim from each task's instruction.md\n- Zero regressions from our changes (17 bandit base failures exist on both the clean base commit and our branch; they are pre-existing and likely environment-specific)\n- Combined API cost across all three tasks: ~$0.86 (calculated from actual delegate token counts at DeepSeek's rates 
with cache pricing; bandit $0.37 confirmed, termenv $0.17 confirmed, superjson ~$0.31 estimated due to timeout)\n\n**Cost calculation methodology:**\n\n```\nDeepSeek V4 Pro rates (75% promotional discount):\n  Cache miss: $0.435/M   Cache hit: $0.003625/M   Output: $0.87/M\n\nCache hit ratio: 78% (observed from DeepSWE's own trial data)\n\nbandit   (49 calls): 3,699,735 in + 25,489 out\n  = (0.812M miss × $0.435) + (2.888M hit × $0.003625) + (0.025M × $0.87)\n  = $0.353 + $0.010 + $0.022 = $0.374\n\ntermenv  (37 calls): 1,613,794 in + 19,290 out\n  = (0.354M miss × $0.435) + (1.260M hit × $0.003625) + (0.019M × $0.87)\n  = $0.154 + $0.005 + $0.017 = $0.170\n\nsuperjson (41 calls, timed out during exit, so estimated):\n  ~3,096,000 in + ~21,300 out  (scaled from bandit per-call avg)\n  = ~$0.313\n\nTotal: $0.374 + $0.170 + $0.313 = ~$0.86\n\nNote: cache-adjusted only; reasoning tokens not tracked in delegate runs. If thinking mode was active, add ~$0.20-0.40.\n```\n\nCompare this to DeepSWE's reported $4.22/task average (same model, same tasks). The gap is driven primarily by billing cache hits at miss rates (120× overcharge on 78% of tokens), with reasoning-token billing as a smaller contributing factor.\n\n**Key differences from DeepSWE's setup:**\n\n- Agent scaffold: Hermes Agent delegates instead of mini-swe-agent (about 42 API calls per task on average, ranging from 37 to 49, vs their 111 avg steps)\n- Runtime: bare metal, not Docker containers\n- API routing: DeepSeek direct API (not OpenRouter, except for one verification pass)\n- Zero verifier infrastructure failures\n\n### 4. No reasoning/effort tuning for DeepSeek\n\nThe leaderboard shows `deepseek-v4-pro` with `reasoning_effort: null`. 
Every other frontier model receives tuned effort levels:\n\n- gpt-5.5 [xhigh], gpt-5.4 [xhigh], gpt-5.4-mini [xhigh]\n- claude-opus-4.8 [max] [xhigh] [high] [medium]\n- claude-opus-4.7 [max] [xhigh] [high] [medium]\n- claude-sonnet-4.6 [high]\n- gemini-3.5-flash [medium]\n\nDeepSeek V4 Pro supports both thinking and non-thinking modes (thinking is the default). It was run completely untuned while competitors got their best effort configurations. Thinking mode was left ON by default, generating reasoning tokens that inflated the cost without any corresponding effort tuning that could have improved the solve rate.\n\n### Recommendations\n\n- **Fix cost calculation**: Apply provider-specific cache pricing. DeepSeek's 99.2% cache discount fundamentally changes the cost comparison.\n- **Add error failsafe**: Halt runs on repeated 404/auth errors. A 73-minute retry loop on dead requests wastes resources and produces invalid results.\n- **Re-run deepseek-v4-pro with proper tuning**: Test at multiple effort levels (xhigh, high, medium) as is done for every other model. Test both thinking and non-thinking modes. Ensure OpenRouter privacy settings allow DeepSeek routing.\n- **Audit provider routing**: If OpenRouter can silently fall back to alternative providers, any model affected by privacy guardrails may have results from the wrong provider. Publish which provider actually served each model's API calls.\n- **Document effort configurations**: The \"default\" label for deepseek-v4-pro is misleading when competitors got specifically tuned effort levels. Either tune all models or run all at true defaults.\n\n### Reproducibility\n\nTask environments use the exact base commits from the deep-swe repository. 
Test verifiers are the ones in the task corpus.\n\n```\nModel:    deepseek-v4-pro via https://api.deepseek.com\nScaffold: Hermes Agent delegate system\n```\n\nTask IDs: `bandit-incremental-cache-control`, `termenv-preserve-ansi-resets`, `superjson-error-stack-serialization`\n\n*Report written by Hermes, under supervision of dephnor*", "url": "https://wpnews.pro/news/deepswe-audit-deepseek-v4-pro-results-are-unreliable", "canonical_source": "https://github.com/datacurve-ai/deep-swe/issues/21", "published_at": "2026-06-04 05:54:32+00:00", "updated_at": "2026-06-04 06:17:33.892838+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-research", "ai-products", "ai-tools"], "entities": ["DeepSWE", "DeepSeek", "OpenRouter", "deepseek-v4-pro"], "alternates": {"html": "https://wpnews.pro/news/deepswe-audit-deepseek-v4-pro-results-are-unreliable", "markdown": "https://wpnews.pro/news/deepswe-audit-deepseek-v4-pro-results-are-unreliable.md", "text": "https://wpnews.pro/news/deepswe-audit-deepseek-v4-pro-results-are-unreliable.txt", "jsonld": "https://wpnews.pro/news/deepswe-audit-deepseek-v4-pro-results-are-unreliable.jsonld"}}