# Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance

> Source: <https://dev.to/r-via/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier-performance-504b>
> Published: 2026-05-28 08:31:52+00:00

The Claude Agent SDK exposes three budget tiers (`haiku`

, `sonnet`

, `opus`

) and reads its routing target from environment variables on every call. That means a single env-var swap can point a tier at any Anthropic-compatible endpoint — including a local `llama-server`

. The question is not whether you *can* do it. The question is whether the local model is good enough, per tier, to ship.

This article is the benchmark we ran to answer that question for Anatoly's document fact-check pipeline. **5 providers × 4 workloads × N=5 trials**, Opus LLM-as-judge with an Anthropic-vs-Anthropic ceiling for calibration. Host: one RTX 3090 Ti, 24 GB VRAM.

| Tier | What it does in our pipeline | Calls per run | Profile |
|---|---|---|---|
Haiku |
extract atomic facts, verify them against retrieval, score importance | ~1,700 | JSON-only, low-latency, strict schema |
Sonnet |
rewrite document sections to integrate omissions and correct hallucinations | ~8 | markdown out, ~3 k tokens, citation-tag sensitive |
Opus |
reserved for final review and high-stakes work | varies | stays on Anthropic regardless |

The two questions:

Five providers cycled sequentially through four workloads, N=5 trials, 100 calls total, 0 errors:

`parallel=1`

`parallel=2`

(the 1.5 GB freed by 4-bit KV makes the extra slot viable)Three quality signals per output, all measured against the Anthropic reference:

The Opus judge is the load-bearing one. Crucially, we also rate **Anthropic trial 2 against Anthropic trial 1** as the empirical ceiling: the score "indistinguishable" gets for two runs of the same backend on the same input. Any local provider at the ceiling is at parity, *empirically*, not approximately.

Three workloads: extract-chunk (~88 calls/run), verify-rag-fact (~1,300 calls/run), importance-score (~300 calls/run). All JSON-only with strict schemas.

| Workload | Anthropic | local-turbo-parallel | Speedup |
|---|---|---|---|
| extract-chunk | 53.80 s | 5.79 s | ×9.3 |
| verify-rag-fact | 10.90 s | 1.87 s | ×5.8 |
| importance-score | 10.39 s | 2.42 s | ×4.3 |

| Workload | Anthropic ceiling | Local (production: 35 B no-think) | Reading |
|---|---|---|---|
| verify-rag-fact | 9/10 | 9/10 |
at ceiling (1,300 calls/run, dominant cost) |
| importance-score | 8/10 | 9/10 |
one point above ceiling (35B picks borderline buckets cleaner than Anthropic Haiku) |
| extract-chunk | 6/10 | 4–5/10 | 66–83 % of ceiling (Anthropic is strict on its own output here) |

**Verdict on the Haiku tier**: the local 35B-A3B in no-think mode is at parity on the dominant workload (verify, ×5.8 faster) and one point above ceiling on importance. Extract is one to two points below ceiling; the gap is real but small, and the upper bound matches the ceiling. **The local 35B can replace Anthropic Haiku in production**, with the caveat that extract benefits from human spot-checks until we close that gap.

A surprise from a follow-up mini-bench: a 35B-A3B MoE in no-think mode **outperforms a dedicated dense 4B** at essentially the same latency, because MoE only activates ~3 B parameters per token.

| Config (importance workload) | Wall mean | Stdev | Opus rating |
|---|---|---|---|
| 4 B no-think | 2.13 s | 0.02 s | 8/10 |
35 B no-think |
2.42 s |
0.15 s |
9/10 |

We initially shipped two separate models (4B for haiku, 35B for sonnet). After the mini-bench we collapsed to one GGUF in two containers, distinguished only by a thinking flag.

One workload: correct-section-rewrite (~8 calls/run). Markdown output, ~1.5 to 3 k tokens, requires `[#filename]`

citation tags on every inserted sentence.

| Provider | Wall mean | Note |
|---|---|---|
| Anthropic Sonnet | 26.01 s | reference |
| Anthropic Opus | 12.88 s | the actual Anthropic option for this workload |
| local Qwen 35B no-think | 9.61 s | fastest |
| local Qwen 35B thinking ON | 23.50 s | thinking helps quality, hurts latency |

| Provider | Opus rating | Failure mode |
|---|---|---|
| Anthropic Opus | 9/10 | reference |
| local Qwen 35B thinking ON | 6/10 | misses `[#filename]` citation tags |
| local Qwen 35B no-think | 5/10 | misses citation tags, slight formatting drift |

**Verdict on the Sonnet tier**: **local does not replace Anthropic on this workload**. The local 35B applies every requested fix (Opus judge: "all four hallucinations softened and all three omissions integrated"), but consistently omits the `[#filename]`

citation tags Opus produces by default. No combination of thinking flags, prompt tweaks, or larger context closed the gap. The shortfall is 3 to 4 Opus-judge points on a 0-10 scale.

So we ship hybrid: local for the Haiku tier middle, **Anthropic Opus for the 8 Sonnet-tier rewrites**. Those 8 calls cost ~$4 per run, but they buy the full ceiling on the content-touching phase, which matters more for the user than the cost.

| Full Anthropic | Production hybrid (local middle + Opus rewrite) | Delta | |
|---|---|---|---|
| Anthropic API calls per run | 1,696 | 8 | −99.5% |
| Wall-clock end-to-end | ~4 h | ~59 min | −75% |
| Cost per run | ~$5 | ~$4 | −20% |
| Verify quality | 9/10 | 9/10 (parity) | 0 |
| Rewrite quality | 9/10 | 9/10 (still Opus) | 0 |

The cost saving is modest because the 8 Anthropic calls we keep are on Opus (expensive per call but mandatory for quality). The volume reduction is the headline: 99.5% fewer calls means our Anthropic quota is no longer the bottleneck for the rest of the product.

```
SDK env vars per call
       │
       ├─ haiku  ──► local llama-server, Qwen3.6-35B-A3B GGUF, thinking OFF
       ├─ sonnet ──► (defined but not loaded in production; falls back to Opus)
       └─ opus   ──► Anthropic native API (for correct-phase rewrite)
```

Same GGUF in both LLM containers, different thinking flags. ~10 s restart to switch tiers. TurboQuant 4-bit KV cache + `--parallel 2`

for throughput.

For Anatoly, the practical impact:

The benchmark above is contingent on four SDK integration fixes. They are not exotic, but none are obvious from the docs:

`--alias`

on `llama-server`

`/v1/models`

validation accepts a stable name (`local-haiku`

, `local-sonnet`

) instead of the GGUF filename.`parallel=1, ctx_per_slot=32768`

`--ctx-size`

by `--parallel`

for per-slot context; defaults give only 4 k per slot).`AskUserQuestion`

, `Skill`

, `CronCreate`

, `ScheduleWakeup`

, and ~20 others by default. Sonnet ignores them politely; Qwen happily calls `AskUserQuestion`

to "think out loud" and burns the `max_turns`

budget.`/no_think`

directive is honoured 8% of the time on Qwen3.5+. Fix at `llama-server`

startup:

```
llama-server \
  --jinja \
  --reasoning off \
  --chat-template-kwargs '{"enable_thinking": false}' \
  ...
```

The last one is the biggest single perf lever: **×12 speedup** on the importance workload (21.7 s → 1.83 s wall mean) because the model stops emitting 358 tokens of reasoning before the 9-token JSON answer.

Full write-up with the TurboQuant Dockerfile, the build gotchas (`libcuda.so.1`

, `libgomp.so.1`

), the fork comparison, the complete per-workload Opus-judge tables, and the N=5 variance discussion: [https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant](https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant)