cd /news/artificial-intelligence/benchmarking-the-claude-agent-sdk-on… Β· home β€Ί topics β€Ί artificial-intelligence β€Ί article
[ARTICLE Β· art-16202] src=dev.to pub= topic=artificial-intelligence verified=true sentiment=Β· neutral

Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance

A developer benchmarked the Claude Agent SDK on a local LLM, finding that a 35B-parameter model running on an RTX 3090 Ti can replace Anthropic's Haiku tier in production. The local model achieved parity on the dominant workload (verify-rag-fact) with a 5.8x speedup, and outperformed the Anthropic ceiling on importance scoring. However, the extract-chunk workload scored 66-83% of the ceiling, requiring human spot-checks until the gap is closed.

read6 min publishedMay 28, 2026

The Claude Agent SDK exposes three budget tiers (haiku

, sonnet

, opus

) and reads its routing target from environment variables on every call. That means a single env-var swap can point a tier at any Anthropic-compatible endpoint β€” including a local llama-server

. The question is not whether you can do it. The question is whether the local model is good enough, per tier, to ship.

This article is the benchmark we ran to answer that question for Anatoly's document fact-check pipeline. 5 providers Γ— 4 workloads Γ— N=5 trials, Opus LLM-as-judge with an Anthropic-vs-Anthropic ceiling for calibration. Host: one RTX 3090 Ti, 24 GB VRAM.

Tier What it does in our pipeline Calls per run Profile
Haiku
extract atomic facts, verify them against retrieval, score importance ~1,700 JSON-only, low-latency, strict schema
Sonnet
rewrite document sections to integrate omissions and correct hallucinations ~8 markdown out, ~3 k tokens, citation-tag sensitive
Opus
reserved for final review and high-stakes work varies stays on Anthropic regardless

The two questions:

Five providers cycled sequentially through four workloads, N=5 trials, 100 calls total, 0 errors:

parallel=1

parallel=2

(the 1.5 GB freed by 4-bit KV makes the extra slot viable)Three quality signals per output, all measured against the Anthropic reference:

The Opus judge is the load-bearing one. Crucially, we also rate Anthropic trial 2 against Anthropic trial 1 as the empirical ceiling: the score "indistinguishable" gets for two runs of the same backend on the same input. Any local provider at the ceiling is at parity, empirically, not approximately.

Three workloads: extract-chunk (~88 calls/run), verify-rag-fact (~1,300 calls/run), importance-score (~300 calls/run). All JSON-only with strict schemas.

Workload Anthropic local-turbo-parallel Speedup
extract-chunk 53.80 s 5.79 s Γ—9.3
verify-rag-fact 10.90 s 1.87 s Γ—5.8
importance-score 10.39 s 2.42 s Γ—4.3
Workload Anthropic ceiling Local (production: 35 B no-think) Reading
verify-rag-fact 9/10 9/10
at ceiling (1,300 calls/run, dominant cost)
importance-score 8/10 9/10
one point above ceiling (35B picks borderline buckets cleaner than Anthropic Haiku)
extract-chunk 6/10 4–5/10 66–83 % of ceiling (Anthropic is strict on its own output here)

Verdict on the Haiku tier: the local 35B-A3B in no-think mode is at parity on the dominant workload (verify, Γ—5.8 faster) and one point above ceiling on importance. Extract is one to two points below ceiling; the gap is real but small, and the upper bound matches the ceiling. The local 35B can replace Anthropic Haiku in production, with the caveat that extract benefits from human spot-checks until we close that gap.

A surprise from a follow-up mini-bench: a 35B-A3B MoE in no-think mode outperforms a dedicated dense 4B at essentially the same latency, because MoE only activates ~3 B parameters per token.

Config (importance workload) Wall mean Stdev Opus rating
4 B no-think 2.13 s 0.02 s 8/10
35 B no-think
2.42 s
0.15 s
9/10

We initially shipped two separate models (4B for haiku, 35B for sonnet). After the mini-bench we collapsed to one GGUF in two containers, distinguished only by a thinking flag.

One workload: correct-section-rewrite (~8 calls/run). Markdown output, ~1.5 to 3 k tokens, requires [#filename]

citation tags on every inserted sentence.

Provider Wall mean Note
Anthropic Sonnet 26.01 s reference
Anthropic Opus 12.88 s the actual Anthropic option for this workload
local Qwen 35B no-think 9.61 s fastest
local Qwen 35B thinking ON 23.50 s thinking helps quality, hurts latency
Provider Opus rating Failure mode
Anthropic Opus 9/10 reference
local Qwen 35B thinking ON 6/10 misses [#filename] citation tags
local Qwen 35B no-think 5/10 misses citation tags, slight formatting drift

Verdict on the Sonnet tier: local does not replace Anthropic on this workload. The local 35B applies every requested fix (Opus judge: "all four hallucinations softened and all three omissions integrated"), but consistently omits the [#filename]

citation tags Opus produces by default. No combination of thinking flags, prompt tweaks, or larger context closed the gap. The shortfall is 3 to 4 Opus-judge points on a 0-10 scale.

So we ship hybrid: local for the Haiku tier middle, Anthropic Opus for the 8 Sonnet-tier rewrites. Those 8 calls cost ~$4 per run, but they buy the full ceiling on the content-touching phase, which matters more for the user than the cost.

Full Anthropic Production hybrid (local middle + Opus rewrite) Delta
Anthropic API calls per run 1,696 8 βˆ’99.5%
Wall-clock end-to-end ~4 h ~59 min βˆ’75%
Cost per run ~$5 ~$4 βˆ’20%
Verify quality 9/10 9/10 (parity) 0
Rewrite quality 9/10 9/10 (still Opus) 0

The cost saving is modest because the 8 Anthropic calls we keep are on Opus (expensive per call but mandatory for quality). The volume reduction is the headline: 99.5% fewer calls means our Anthropic quota is no longer the bottleneck for the rest of the product.

SDK env vars per call
       β”‚
       β”œβ”€ haiku  ──► local llama-server, Qwen3.6-35B-A3B GGUF, thinking OFF
       β”œβ”€ sonnet ──► (defined but not loaded in production; falls back to Opus)
       └─ opus   ──► Anthropic native API (for correct-phase rewrite)

Same GGUF in both LLM containers, different thinking flags. ~10 s restart to switch tiers. TurboQuant 4-bit KV cache + --parallel 2

for throughput.

For Anatoly, the practical impact:

The benchmark above is contingent on four SDK integration fixes. They are not exotic, but none are obvious from the docs:

--alias

on llama-server

/v1/models

validation accepts a stable name (local-haiku

, local-sonnet

) instead of the GGUF filename.parallel=1, ctx_per_slot=32768

--ctx-size

by --parallel

for per-slot context; defaults give only 4 k per slot).AskUserQuestion

, Skill

, CronCreate

, ScheduleWakeup

, and ~20 others by default. Sonnet ignores them politely; Qwen happily calls AskUserQuestion

to "think out loud" and burns the max_turns

budget./no_think

directive is honoured 8% of the time on Qwen3.5+. Fix at llama-server

startup:

llama-server \
  --jinja \
  --reasoning off \
  --chat-template-kwargs '{"enable_thinking": false}' \
  ...

The last one is the biggest single perf lever: Γ—12 speedup on the importance workload (21.7 s β†’ 1.83 s wall mean) because the model stops emitting 358 tokens of reasoning before the 9-token JSON answer.

Full write-up with the TurboQuant Dockerfile, the build gotchas (libcuda.so.1

, libgomp.so.1

), the fork comparison, the complete per-workload Opus-judge tables, and the N=5 variance discussion: https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant

── more in #artificial-intelligence 4 stories Β· sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain β€” perfect for shipping the agent you just read about.

$git push zahid main
β†’ Live at https://your-agent.zahid.host βœ“
Get free account β†’ Pricing
from €0/mo Β· no card required
LIVE [news/benchmarking-the-cla…] indexed:0 read:6min 2026-05-28 Β· β€”