{"slug": "benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier", "title": "Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance", "summary": "A developer benchmarked the Claude Agent SDK on a local LLM, finding that a 35B-parameter model running on an RTX 3090 Ti can replace Anthropic's Haiku tier in production. The local model achieved parity on the dominant workload (verify-rag-fact) with a 5.8x speedup, and outperformed the Anthropic ceiling on importance scoring. However, the extract-chunk workload scored 66-83% of the ceiling, requiring human spot-checks until the gap is closed.", "body_md": "The Claude Agent SDK exposes three budget tiers (`haiku`\n\n, `sonnet`\n\n, `opus`\n\n) and reads its routing target from environment variables on every call. That means a single env-var swap can point a tier at any Anthropic-compatible endpoint — including a local `llama-server`\n\n. The question is not whether you *can* do it. The question is whether the local model is good enough, per tier, to ship.\n\nThis article is the benchmark we ran to answer that question for Anatoly's document fact-check pipeline. **5 providers × 4 workloads × N=5 trials**, Opus LLM-as-judge with an Anthropic-vs-Anthropic ceiling for calibration. Host: one RTX 3090 Ti, 24 GB VRAM.\n\n| Tier | What it does in our pipeline | Calls per run | Profile |\n|---|---|---|---|\nHaiku |\nextract atomic facts, verify them against retrieval, score importance | ~1,700 | JSON-only, low-latency, strict schema |\nSonnet |\nrewrite document sections to integrate omissions and correct hallucinations | ~8 | markdown out, ~3 k tokens, citation-tag sensitive |\nOpus |\nreserved for final review and high-stakes work | varies | stays on Anthropic regardless |\n\nThe two questions:\n\nFive providers cycled sequentially through four workloads, N=5 trials, 100 calls total, 0 errors:\n\n`parallel=1`\n\n`parallel=2`\n\n(the 1.5 GB freed by 4-bit KV makes the extra slot viable)Three quality signals per output, all measured against the Anthropic reference:\n\nThe Opus judge is the load-bearing one. Crucially, we also rate **Anthropic trial 2 against Anthropic trial 1** as the empirical ceiling: the score \"indistinguishable\" gets for two runs of the same backend on the same input. Any local provider at the ceiling is at parity, *empirically*, not approximately.\n\nThree workloads: extract-chunk (~88 calls/run), verify-rag-fact (~1,300 calls/run), importance-score (~300 calls/run). All JSON-only with strict schemas.\n\n| Workload | Anthropic | local-turbo-parallel | Speedup |\n|---|---|---|---|\n| extract-chunk | 53.80 s | 5.79 s | ×9.3 |\n| verify-rag-fact | 10.90 s | 1.87 s | ×5.8 |\n| importance-score | 10.39 s | 2.42 s | ×4.3 |\n\n| Workload | Anthropic ceiling | Local (production: 35 B no-think) | Reading |\n|---|---|---|---|\n| verify-rag-fact | 9/10 | 9/10 |\nat ceiling (1,300 calls/run, dominant cost) |\n| importance-score | 8/10 | 9/10 |\none point above ceiling (35B picks borderline buckets cleaner than Anthropic Haiku) |\n| extract-chunk | 6/10 | 4–5/10 | 66–83 % of ceiling (Anthropic is strict on its own output here) |\n\n**Verdict on the Haiku tier**: the local 35B-A3B in no-think mode is at parity on the dominant workload (verify, ×5.8 faster) and one point above ceiling on importance. Extract is one to two points below ceiling; the gap is real but small, and the upper bound matches the ceiling. **The local 35B can replace Anthropic Haiku in production**, with the caveat that extract benefits from human spot-checks until we close that gap.\n\nA surprise from a follow-up mini-bench: a 35B-A3B MoE in no-think mode **outperforms a dedicated dense 4B** at essentially the same latency, because MoE only activates ~3 B parameters per token.\n\n| Config (importance workload) | Wall mean | Stdev | Opus rating |\n|---|---|---|---|\n| 4 B no-think | 2.13 s | 0.02 s | 8/10 |\n35 B no-think |\n2.42 s |\n0.15 s |\n9/10 |\n\nWe initially shipped two separate models (4B for haiku, 35B for sonnet). After the mini-bench we collapsed to one GGUF in two containers, distinguished only by a thinking flag.\n\nOne workload: correct-section-rewrite (~8 calls/run). Markdown output, ~1.5 to 3 k tokens, requires `[#filename]`\n\ncitation tags on every inserted sentence.\n\n| Provider | Wall mean | Note |\n|---|---|---|\n| Anthropic Sonnet | 26.01 s | reference |\n| Anthropic Opus | 12.88 s | the actual Anthropic option for this workload |\n| local Qwen 35B no-think | 9.61 s | fastest |\n| local Qwen 35B thinking ON | 23.50 s | thinking helps quality, hurts latency |\n\n| Provider | Opus rating | Failure mode |\n|---|---|---|\n| Anthropic Opus | 9/10 | reference |\n| local Qwen 35B thinking ON | 6/10 | misses `[#filename]` citation tags |\n| local Qwen 35B no-think | 5/10 | misses citation tags, slight formatting drift |\n\n**Verdict on the Sonnet tier**: **local does not replace Anthropic on this workload**. The local 35B applies every requested fix (Opus judge: \"all four hallucinations softened and all three omissions integrated\"), but consistently omits the `[#filename]`\n\ncitation tags Opus produces by default. No combination of thinking flags, prompt tweaks, or larger context closed the gap. The shortfall is 3 to 4 Opus-judge points on a 0-10 scale.\n\nSo we ship hybrid: local for the Haiku tier middle, **Anthropic Opus for the 8 Sonnet-tier rewrites**. Those 8 calls cost ~$4 per run, but they buy the full ceiling on the content-touching phase, which matters more for the user than the cost.\n\n| Full Anthropic | Production hybrid (local middle + Opus rewrite) | Delta | |\n|---|---|---|---|\n| Anthropic API calls per run | 1,696 | 8 | −99.5% |\n| Wall-clock end-to-end | ~4 h | ~59 min | −75% |\n| Cost per run | ~$5 | ~$4 | −20% |\n| Verify quality | 9/10 | 9/10 (parity) | 0 |\n| Rewrite quality | 9/10 | 9/10 (still Opus) | 0 |\n\nThe cost saving is modest because the 8 Anthropic calls we keep are on Opus (expensive per call but mandatory for quality). The volume reduction is the headline: 99.5% fewer calls means our Anthropic quota is no longer the bottleneck for the rest of the product.\n\n```\nSDK env vars per call\n       │\n       ├─ haiku  ──► local llama-server, Qwen3.6-35B-A3B GGUF, thinking OFF\n       ├─ sonnet ──► (defined but not loaded in production; falls back to Opus)\n       └─ opus   ──► Anthropic native API (for correct-phase rewrite)\n```\n\nSame GGUF in both LLM containers, different thinking flags. ~10 s restart to switch tiers. TurboQuant 4-bit KV cache + `--parallel 2`\n\nfor throughput.\n\nFor Anatoly, the practical impact:\n\nThe benchmark above is contingent on four SDK integration fixes. They are not exotic, but none are obvious from the docs:\n\n`--alias`\n\non `llama-server`\n\n`/v1/models`\n\nvalidation accepts a stable name (`local-haiku`\n\n, `local-sonnet`\n\n) instead of the GGUF filename.`parallel=1, ctx_per_slot=32768`\n\n`--ctx-size`\n\nby `--parallel`\n\nfor per-slot context; defaults give only 4 k per slot).`AskUserQuestion`\n\n, `Skill`\n\n, `CronCreate`\n\n, `ScheduleWakeup`\n\n, and ~20 others by default. Sonnet ignores them politely; Qwen happily calls `AskUserQuestion`\n\nto \"think out loud\" and burns the `max_turns`\n\nbudget.`/no_think`\n\ndirective is honoured 8% of the time on Qwen3.5+. Fix at `llama-server`\n\nstartup:\n\n```\nllama-server \\\n  --jinja \\\n  --reasoning off \\\n  --chat-template-kwargs '{\"enable_thinking\": false}' \\\n  ...\n```\n\nThe last one is the biggest single perf lever: **×12 speedup** on the importance workload (21.7 s → 1.83 s wall mean) because the model stops emitting 358 tokens of reasoning before the 9-token JSON answer.\n\nFull write-up with the TurboQuant Dockerfile, the build gotchas (`libcuda.so.1`\n\n, `libgomp.so.1`\n\n), the fork comparison, the complete per-workload Opus-judge tables, and the N=5 variance discussion: [https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant](https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant)", "url": "https://wpnews.pro/news/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier", "canonical_source": "https://dev.to/r-via/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier-performance-504b", "published_at": "2026-05-28 08:31:52+00:00", "updated_at": "2026-05-28 08:53:03.633814+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-tools", "ai-infrastructure"], "entities": ["Claude Agent SDK", "Anthropic", "RTX 3090 Ti", "llama-server", "Anatoly"], "alternates": {"html": "https://wpnews.pro/news/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier", "markdown": "https://wpnews.pro/news/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier.md", "text": "https://wpnews.pro/news/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier.txt", "jsonld": "https://wpnews.pro/news/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier.jsonld"}}