{"slug": "the-ai-cost-modeling-handbook-i-let-claude-do-the-modeling-but-never-the", "title": "The AI Cost-Modeling Handbook: I let Claude do the modeling, but never the arithmetic", "summary": "A developer built an auditable cost-modeling pipeline using exact-rational math to compare LLM provider prices for agent workloads. The analysis found DeepSeek V3.2 on OpenRouter is the best value at $0.1145 per million tokens blended cost per unit of quality. The pipeline uses live pricing data and a rational math kernel to avoid floating-point errors and LLM arithmetic hallucinations.", "body_md": "Every \"what's the cheapest model?\" thread online is people trading vibes. I got tired of it, so I built a pipeline that pulls\n\nlive, citedprices and runs the numbers through anexact-rational math kernel— no floating-point drift, no LLM hallucinating a multiplication. Then I pointed it at eight of the cost questions every agent builder actually faces. Here's everything it found, and the repo so you can re-run all of it.\n\nThere are two kinds of LLM cost advice. The first is a benchmark leaderboard with a price column, which tells you nothing about *your* workload. The second is a confident tweet that's quietly wrong because someone multiplied a per-million price by the wrong token count in their head.\n\nI wanted a third kind: a model you can **audit**. Re-run it and you get bit-identical numbers, with the source of every input price one file away. This article is the result — a tour through eight cost decisions, each answered with real money math, and each one teaching something that intuition gets wrong.\n\nLet's start with the question that kicked it off.\n\nThe setup: I run [Hermes Agent](https://github.com/NousResearch/hermes-agent), a model-agnostic agent framework, and I wanted the cheapest token provider to drive it. Not cheapest *per token* — cheapest **per unit of quality**, because a model that flubs tool calls and retries can cost more than a pricier reliable one.\n\nSo the metric is **blended cost ÷ agentic-quality-score**. Blended cost uses a production-agentic token mix (more on that below); quality is a normalized agentic score (mean of BFCL / τ²-bench / SWE-bench Verified). Filters: open-weights, prompt caching, no-train.\n\n| Model @ Provider | Blended $/1M | Quality | $/quality (×1000) |\n|---|---|---|---|\nDeepSeek V3.2 @ OpenRouter |\n0.1145 |\n77 | 1.49 |\n| DeepSeek V3.2 @ DeepInfra | 0.1951 | 77 | 2.53 |\n| MiniMax M2 @ MiniMax | 0.2629 | 73 | 3.60 |\n| GLM-4.6 @ z.ai | 0.5276 | 71 | 7.43 |\n| Kimi K2 @ DeepInfra | 0.6613 | 66 | 10.02 |\n| DeepSeek R1 @ DeepInfra | 0.6519 | 54 | 12.07 |\n\n**DeepSeek V3.2 wins decisively** — it's both the highest-quality open model in the set *and* near the cheapest, because its output token price is absurdly low for its tier. The emerging best deal (pending a quality test) is **DeepSeek V4 Flash on Fireworks** at $0.0896/1M blended — cheapest in the field, with ZDR-by-default and a 50% batch discount.\n\nThat's a useful answer. But the *interesting* part is everything that question opens up. If you can split traffic across providers, what's the optimal mix? Should you self-host instead? Is the \"smart expensive model\" actually cheaper once you count its thinking tokens? Each of those is a chapter.\n\nFirst, how this is built.\n\nThe pipeline has two halves, and the whole point is that **the LLM never touches the arithmetic.**\n\n**1. Research agents gather live, cited inputs.** Token prices in 2026 move monthly — half the models in my training data are already legacy. So every price and benchmark in this guide came from a research agent doing live web search/fetch, returning a structured table with source URLs and observation dates. Those land in `data/`\n\n(`pricing.md`\n\n, `quality.md`\n\n, `gpu.md`\n\n, `frontier.md`\n\n, `decline.md`\n\n), each one auditable.\n\n**2. An exact-rational kernel does the math.** I used [ agent-calc](https://github.com/copyleftdev/agent-calc), an AI-native computation kernel that works in exact rationals —\n\n`3/5 × 1,000,000`\n\nreturns `600000`\n\n, not `599999.9999998`\n\n. It exposes typed domains: `rational`\n\n, `linear`\n\n(LP), `finance`\n\n(NPV/amortization), `solve`\n\n(algebra), `stats`\n\n(distributions), `polynomial`\n\n, `interval`\n\n(uncertainty arithmetic). Every chapter below drives a different one.Why bother? Because cost models are exactly where floating-point lies and LLMs fumble: tiny per-token prices, huge token counts, compounding rates. When an answer is \"self-hosting breaks even at 87.2% utilization,\" you want that to be *computed*, not vibed. And the secret structure of this guide is that each chapter answers a money question **and** exercises one kernel domain — it's a cost-modeling handbook that doubles as a tour of exact computation.\n\nEvery chapter uses one production-agentic token profile, because agent tool-loops are wildly input-heavy:\n\n| Component | Share of tokens |\n|---|---|\n| Fresh input | 21.25% |\n| Cached input | 63.75% |\n| Output | 15% |\n\nThat's an 85/15 input/output split with 75% of input being a cache hit (the system prompt, tools, and skills get re-sent every tool-loop step). Blended price = `0.2125·in + 0.6375·cached + 0.15·out`\n\n. Chapter 7 explains *why* this profile is forced on you; Chapter 6 explains why the cache hit rate is so high. For now, take it as the workload.\n\nYou don't have to pick one provider. Run a cheap one for the bulk, escalate elsewhere when constraints demand. The optimal split — minimize spend subject to a quality floor, per-endpoint capacity caps, and a **US-jurisdiction floor θ** — is a literal linear program. `agent-calc`\n\nsolves it (`solve_lp`\n\n), then I re-verify the winning allocation's cost in exact rationals (the LP solver is f64; the headline number shouldn't be).\n\nSweeping the privacy knob θ traces a **cost-of-privacy frontier**:\n\n| US floor θ | min $/1M | $/mo @ 1B | allocation |\n|---|---|---|---|\n| 0.00–0.40 | 0.1468 | $146.77 | A=0.60, B=0.40 |\n| 0.60 | 0.1629 | $162.89 | A=0.40, B=0.60 |\n| 0.80 | 0.2716 | $271.61 | A=0.20, B=0.60, F=0.13, H=0.07 |\n0.872 |\n0.3116 | $311.59 | A=0.13, B=0.60, F=0.27 |\n| ≥ 0.873 | — | INFEASIBLE |\nthe privacy wall |\n\nThree things intuition misses:\n\nEveryone's instinct: \"at some volume, renting GPUs beats per-token APIs.\" The reframe that changes everything: **a self-hosted node costs the same at 5% load or 95% load**, while the API scales linearly with tokens. So it's not a *dollar* break-even — it's a **break-even utilization**: how busy must the box stay to win?\n\n```\nu* = node_$/mo × 1e6 / (capacity_tokens/mo × API_$/1M)\n```\n\nIf `u* > 100%`\n\n, self-hosting loses *even at full tilt*. Modeling DeepSeek V3.2 (8-GPU node) against DeepInfra's $0.195/1M, with `agent-calc`\n\nhandling the owned-hardware amortization (`finance`\n\n) and the exact `u*`\n\n(`eval`\n\n):\n\n| Hardware / procurement | break-even u* | verdict |\n|---|---|---|\n| 8×H200, rent on-demand | 531% | never |\n| 8×H200, own (amortized) | 220% | never |\n| 8×B200, rent reserved | 174% | never |\n8×B200, own (amortized) |\n72% |\nthe only winner |\n\nFor a commodity model this cheap on the API, **renting GPUs loses in every configuration** — the serverless provider batches across thousands of tenants to hit a utilization you can't. The *only* config that wins is **owned Blackwell silicon kept above ~72% utilization 24/7** (~56B tokens/month of steady load). And against the falling API floor ($0.0896/1M), even that roughly doubles its break-even. The takeaway: **self-host for privacy, control, or latency — not to cut the token bill.**\n\nReasoning models emit hidden \"thinking\" tokens, billed at the output rate. A 600-token answer can bill 7,800. Does the higher success rate justify the burn? The honest unit isn't $/token — it's **cost per solved task** = attempt cost ÷ success rate.\n\nWith `agent-calc`\n\ncomputing exact costs (`eval`\n\n) and the break-even success rate (`solve`\n\n):\n\n| Model | k (think mult.) | $/solved | ×champion |\n|---|---|---|---|\nDeepSeek V3.2 |\n4 | 3.17m |\n1.00× |\n| MiniMax M2 | 2 | 4.03m | 1.27× |\n| Kimi K2 | 1 | 8.79m | 2.77× |\n| GLM-4.6 | 5 | 13.77m | 4.35× |\n| DeepSeek R1 | 12 | 36.80m | 11.61× |\n\n(`m`\n\n= milli-dollars = $0.001/task.)\n\nThe killer result: **DeepSeek R1 would need a 627% success rate** to match V3.2's cost-per-solved — mathematically impossible. Even a *perfect* R1 stays far more expensive, because its per-attempt token burn alone exceeds V3.2's entire cost-per-solved. The sweep confirms it: even at k=0 (zero thinking), R1 is still 2.6× the champion.\n\nThe rule: **the reasoning tax only pays when output tokens are cheap.** Thinking is a multiplier on the output price. Pay $2.15/M and ×12 and you've built the most expensive possible way to solve a task; pay $0.38/M and the same multiplier barely registers.\n\nYou don't need one model for every task. Run a **cheap** model on everything, **escalate** to a reliable premium model only on the residual failures. `agent-calc`\n\ncomputes the escalation probability (`stats/binomial_pmf`\n\n) and the exact expected cost (`eval`\n\n).\n\n| Strategy | cost/solved | coverage | % → premium |\n|---|---|---|---|\n| Cheap-only (DeepSeek V3.2) | 3.17m | 77.0% | 0% |\n| Premium-only (Claude Opus 4.8) | 82.96m | 88.0% | 100% |\nCascade V3.2 → Opus |\n19.78m |\n97.2% |\n23% |\nCascade V3.2 → Gemini → Opus |\n13.52m |\n99.5% |\n4.4% |\n\nThe cascade beats premium-only on **both** axes: higher coverage (97% vs 88%) at a quarter the cost, because only 23% of tasks reach the expensive model. Adding a mid-tier pushes coverage to 99.5% while *cutting* cost further — the cheap+mid tiers absorb 95.6% of traffic, so the $73m Opus call fires on 4.4% of tasks.\n\nAnd the trap to avoid: **retrying the same model never changes cost-per-solved** (it stays C/p), and for deterministic failures it doesn't even raise coverage. Only escalating to a\n\nA provider offers a discounted *reserved* rate if you commit for a year. Locking it in feels prudent — but token prices fall fast, and you might be locking yourself *above* where pay-go will be in six months. This is a present-value problem: NPV the committed cash-flow stream against the declining pay-go stream (`agent-calc finance/net_present_value`\n\n), and find the **break-even decline rate**.\n\nThe decisive input is that price decline depends entirely on what you hold fixed:\n\nAt a 30%-off, 12-month commit:\n\n| Workload | central decline | winner |\n|---|---|---|\nFixed-quality |\n80%/yr | pay-go (−23%) |\nFrontier |\n30%/yr | commit (+22%) |\n\nAnd the break-even decline grid shows committing **never** wins for commodity models (80%/yr exceeds every reservable discount's break-even), while longer terms always make it worse: a 30% discount breaks even at 57.8%/yr over 12 months but only **24.5%/yr over 36 months**. The rule: **reserve where prices are sticky (frontier); ride the curve where they fall fast (commodity). Never sign a 3-year inference commit in a market falling >25%/yr.**\n\nCaching looks free, but a cache *write* often costs more than a normal input token (Anthropic charges up to 2×), while reads are cheap. So it's a bet: pay the write premium to save on reads, and it only pays if you reuse enough before the cache expires. The break-even reuse count is `N* = (write − read)/(in − read)`\n\n.\n\n| Provider | break-even N* | savings ceiling |\n|---|---|---|\n| DeepInfra / DeepSeek / OpenAI / Fireworks (no write fee) | 1.00 | 50–90% |\n| Anthropic Opus 5-min TTL (1.25× write) | 1.28 | 90% |\n| Anthropic Opus 1-hr TTL (2× write) | 2.11 | 90% |\n\n**Caching pays almost immediately** — most providers from the second call, even the harshest write premium by the third. The *read* multiplier sets the long-run ceiling (0.1×-read → 90% savings; 0.5×-read → 50%), so for cache-heavy loops that's the number that compounds. Using `agent-calc`\n\n's `interval`\n\ndomain, an 8k-token agent prefix reused an uncertain 3–20 times costs $0.088–0.156 cached vs $0.12–0.80 uncached — **80% off at the high end**, even with a 2× write premium. This is why the guide's 75% cache-hit assumption holds: agent loops reuse a large prefix 10–20× per task.\n\nWhy is that reused prefix so large? Because a multi-turn agent re-sends its growing context every step — at step *k*, the input is the base prompt plus *every prior tool call and result*. Per-step input grows linearly, so **cumulative input over a K-step task grows quadratically**:\n\n```\ncumulative over K = (δ/2)·K² + (B − δ/2)·K\n```\n\n`agent-calc`\n\nevaluates that polynomial exactly (`polynomial/evaluate`\n\n):\n\n| K steps | cum tokens | naive $ | cached $ | compacted $ |\n|---|---|---|---|---|\n| 10 | 147,500 | 0.0384 | 0.0220 | 0.0384 |\n| 50 | 2,237,500 | 0.5817 | 0.3015 | 0.3267 |\n| 100 | 8,225,000 | 2.1385 | 1.0896 | 0.6907 |\n\nA 100-step task isn't 10× a 10-step task — it's **56×**. Your 50th tool call's input costs 10× your first. **Caching bends the curve (lowers the constant); only compaction breaks it** — capping the context window turns the quadratic linear, and the gap widens without bound as loops get longer. They compose: cached *and* compacted is read-rate pricing on a linear token count. For any loop past ~20 steps, do both.\n\nThis closes the loop on the whole guide: the quadratic re-send is exactly why agent workloads are so input-heavy and so cache-dependent — the entire shared workload profile falls out of this one chapter.\n\nEverything above, compressed to decisions:\n\n| Decision | Verdict |\n|---|---|\nWhich model? |\nCheap efficient hybrid (DeepSeek V3.2-class). Cheap output + high success beats everything on cost-per-solved. |\nOne provider or many? |\nBlend via LP. First ~40% of any jurisdiction/diversity constraint is usually free. |\nSelf-host? |\nNo — for cost. Renting always loses for commodity models; only owned next-gen hardware at >72% utilization wins. Self-host for privacy/control. |\nReasoning model? |\nOnly if its output tokens are cheap. A pricey heavy reasoner can be mathematically uncatchable on cost-per-solved. |\nRetries? |\nCascade to a different, stronger model — never retry the same one. 3 tiers ≈ 99.5% coverage at ~6× lower cost than going straight to frontier. |\nCommit to reserved capacity? |\nOnly for sticky frontier prices, short terms. Never for commodity models in a market falling 50–90%/yr. |\nCaching? |\nAlways, for any reused prefix. Pays from call 2–3. Prefer 0.1×-read providers. |\nLong agent loops? |\nCap the context window (compaction) past ~20 steps, or pay quadratically. |\n\nThe whole thing is [a repo](https://github.com/copyleftdev/rational). Each chapter is a self-contained script that shells out to [ agent-calc](https://github.com/copyleftdev/agent-calc); each input price lives in\n\n`data/`\n\nwith a source URL and observation date.\n\n```\ngit clone https://github.com/copyleftdev/rational\nmake all       # re-run every chapter\nmake ch04      # or just one\n```\n\nSame inputs → bit-identical outputs, every time. The LLM did the modeling and pulled the data. The math kernel did the math. That division of labor is the only reason I trust any number in this article — and it's the part I'd encourage you to steal.", "url": "https://wpnews.pro/news/the-ai-cost-modeling-handbook-i-let-claude-do-the-modeling-but-never-the", "canonical_source": "https://dev.to/copyleftdev/the-ai-cost-modeling-handbook-i-let-claude-do-the-modeling-but-never-the-arithmetic-3h95", "published_at": "2026-07-01 02:51:48+00:00", "updated_at": "2026-07-01 03:18:41.114612+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-infrastructure", "developer-tools"], "entities": ["DeepSeek V3.2", "OpenRouter", "DeepInfra", "MiniMax M2", "GLM-4.6", "Kimi K2", "Fireworks", "Hermes Agent"], "alternates": {"html": "https://wpnews.pro/news/the-ai-cost-modeling-handbook-i-let-claude-do-the-modeling-but-never-the", "markdown": "https://wpnews.pro/news/the-ai-cost-modeling-handbook-i-let-claude-do-the-modeling-but-never-the.md", "text": "https://wpnews.pro/news/the-ai-cost-modeling-handbook-i-let-claude-do-the-modeling-but-never-the.txt", "jsonld": "https://wpnews.pro/news/the-ai-cost-modeling-handbook-i-let-claude-do-the-modeling-but-never-the.jsonld"}}