{"slug": "token-budgeting-the-engineering-skill-nobody-talks-about", "title": "Token Budgeting: The Engineering Skill Nobody Talks About", "summary": "A developer argues that token optimization for LLM costs is a context engineering problem, not just prompt shortening. By instrumenting token usage and analyzing cost distributions, teams can reduce bills by 60-80% through caching, model selection, and controlling output length. The user message typically accounts for only 4% of total tokens, while conversation history and system prompts dominate.", "body_md": "Ask a developer how to reduce their LLM bill and they'll say: \"write shorter prompts.\" Remove adjectives. Trim examples. Cut the system prompt.\n\nThis isn't wrong — it's just the lowest-leverage version of the right idea. It optimizes the 4% of your context that is the actual user message while ignoring the 96% that is conversation history, system prompt, idle tool schemas, and over-retrieved documents.\n\nToken optimization is a **context engineering problem.** The real questions are:\n\nWhat is in your context that doesn't need to be there?\n\nIs your context structured so the cache can work?\n\nIs the model you're paying for the right one for this specific task?\n\nAnswer those and you'll reduce your bill by 60–80%. Shorten prompts and you'll reduce it by 5%.\n\nBefore touching anything, instrument what you have. Every provider returns token usage in the API response — read it.\n\n```\n// Wrap your API calls to log token usage from day one\nimport Anthropic from '@anthropic-ai/sdk';\n\nconst client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment\n\n// Non-streaming params, so the response always carries a usage object\nasync function loggedCompletion(params: Anthropic.MessageCreateParamsNonStreaming) {\n  const response = await client.messages.create(params);\n  const { input_tokens, output_tokens,\n          cache_read_input_tokens, cache_creation_input_tokens } = response.usage;\n\n  console.log({\n    inputTokens:    input_tokens,\n    outputTokens:   output_tokens,\n    cacheHits:      cache_read_input_tokens   ?? 
0,   // paid 10% of normal\n    cacheWrites:    cache_creation_input_tokens ?? 0,  // paid 125% (first call)\n    // Rates are the Sonnet 4.6 prices: $3 input, $15 output per 1M tokens\n    estimatedCost: (\n      (input_tokens  * 0.000003) +\n      (output_tokens * 0.000015) +\n      ((cache_read_input_tokens ?? 0) * 0.0000003) +\n      ((cache_creation_input_tokens ?? 0) * 0.00000375)  // cache writes at 125%\n    ).toFixed(6),\n  });\n\n  return response;\n}\n```\n\n**Count tokens before you send** to understand what a request costs before paying for it:\n\n``` js\nconst count = await client.messages.countTokens({\n  model:    'claude-sonnet-4-6',\n  system:   SYSTEM_PROMPT,\n  tools:    TOOLS,\n  messages: conversationHistory,\n});\nconsole.log(`This call: ${count.input_tokens} input tokens`);\n```\n\nRun this for 48 hours before optimizing anything. The distribution will tell you exactly which lever to pull first.\n\nBefore optimizing, know what you're paying. Two facts dominate the table:\n\n| Model | Input /1M | Output /1M | Cached Input | Context |\n|---|---|---|---|---|\n| Claude Opus 4.8 | $5.00 | $25.00 | $0.50 | 1M |\n| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 1M |\n| Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 | 200K |\n| GPT-5.5 | $5.00 | $30.00 | $0.50 | 1M |\n| GPT-4.1 | $2.00 | $8.00 | $1.00 | 1M |\n| GPT-4.1 Nano | $0.10 | $0.40 | $0.05 | 1M |\n| DeepSeek V4 Flash | $0.14 | $0.28 | $0.003 | 1M |\n\n**Fact 1: Output costs 4–8× more than input.** A verbose response is far more expensive than a verbose prompt. Controlling output length matters more than controlling input length.\n\n**Fact 2: The model spread is 89×.** That is DeepSeek V4 Flash at $0.28/1M output vs Claude Opus 4.8 at $25.00/1M output; on input, GPT-4.1 Nano at $0.10/1M vs Opus 4.8 at $5.00/1M is still 50×. 
Every routine task sent to a frontier model burns 10–50× its necessary cost.\n\n```\nxychart-beta\n    title \"Output Token Price per 1M (USD) — June 2026\"\n    x-axis [\"DeepSeek V4 Flash\", \"GPT-4.1 Nano\", \"Haiku 4.5\", \"GPT-4.1\", \"Sonnet 4.6\", \"Opus 4.8\", \"GPT-5.5\"]\n    y-axis \"$/1M output tokens\" 0 --> 32\n    bar [0.28, 0.40, 4.00, 8.00, 15.00, 25.00, 30.00]\n```\n\nCache savings that stack with any of the above: Anthropic 90% off cached input, OpenAI 50% off automatically, Batch API 50% off async requests at both providers.\n\nMost teams optimize the wrong things because they haven't diagnosed where their tokens land.\n\n```\npie title \"Typical Token Distribution — Multi-Turn Agent Session\"\n    \"Conversation history (replayed each turn)\" : 42\n    \"System prompt (replayed each turn)\" : 18\n    \"Tool schemas (loaded but often unused)\" : 15\n    \"RAG retrieved context\" : 14\n    \"Model output\" : 7\n    \"Actual user message\" : 4\n```\n\nThe user message — the thing most developers try to shorten — is 4% of the total. The big cost drivers are conversation history and system prompt, both of which replay in full on every single turn.\n\n**The quadratic growth problem:** Every new message replays the entire prior conversation from scratch. A session with *n* equal-length turns doesn't cost *n* turns — it costs *n(n+1)/2* turn-equivalents.\n\n```\nxychart-beta\n    title \"Cumulative Input Tokens — 200-Token Average Turn\"\n    x-axis [\"Turn 5\", \"Turn 10\", \"Turn 20\", \"Turn 30\", \"Turn 50\"]\n    y-axis \"Cumulative tokens (thousands)\" 0 --> 260\n    bar [3, 11, 42, 93, 255]\n```\n\nA 50-turn conversation pays for 50 × 51 / 2 × 200 = 255,000 input tokens from something that generated roughly 10,000 tokens of actual content. 
This compounds silently and is the dominant cost driver for any application with extended conversations.\n\nPrompt caching is the highest single-impact optimization available if your app has a consistent system prompt, tool definitions, or reference documents. At Anthropic, cached tokens cost **10% of the normal rate.** At OpenAI, the discount is 50% and applies automatically.\n\nThe mechanism: Claude computes a KV cache of your prompt prefix. On the next request, if the prefix is identical, it reuses the cached state instead of reprocessing it.\n\n``` js\nconst response = await client.messages.create({\n  model:      'claude-sonnet-4-6',\n  max_tokens: 1024,\n  system: [\n    {\n      type:          'text',\n      text:          SYSTEM_PROMPT,          // 3,000+ tokens of static context\n      cache_control: { type: 'ephemeral' },  // ← mark for caching\n    },\n  ],\n  messages: conversationHistory,\n});\n\n// Inspect the result — is caching actually working?\nconsole.log({\n  cacheHits:   response.usage.cache_read_input_tokens,    // paid 10%\n  cacheWrites: response.usage.cache_creation_input_tokens, // paid 125% (once)\n});\n```\n\n**The rules you must know before caching anything:**\n\nMinimum 1,024 tokens per block to qualify. Sub-1,024 blocks are silently ignored.\n\nDefault TTL is 5 minutes. Requests spaced further apart miss the cache — use the 1-hour extension for slower workflows.\n\nUp to 4 cache breakpoints per request. Use them on your four largest static blocks.\n\nCache writes cost 25% more on the first call. Breakeven arrives on the second request, provided it lands inside the TTL: one write at 1.25× plus one read at 0.1× totals 1.35×, against 2× without caching.\n\n**Cache structure: the ordering that makes or breaks hits**\n\nCaching is prefix-based. A single dynamic element placed before your static content breaks every cache hit. 
The correct order — every time:\n\n```\n// ✅ CORRECT — static content first, dynamic last\nawait client.messages.create({\n  model:      'claude-sonnet-4-6',\n  max_tokens: 1024,\n  system: [\n    { type: 'text', text: SYSTEM_PROMPT,   cache_control: { type: 'ephemeral' } },\n    { type: 'text', text: TOOL_DOCS,       cache_control: { type: 'ephemeral' } },\n    { type: 'text', text: REFERENCE_DOCS,  cache_control: { type: 'ephemeral' } },\n  ],\n  messages: [\n    ...conversationHistory,                // dynamic — grows per turn\n    { role: 'user', content: userMessage }, // always last\n  ],\n});\n\n// ❌ WRONG (fragment) — timestamp in the system block breaks every cache hit\nsystem: [{\n  type: 'text',\n  text: `Current time: ${new Date().toISOString()}\\n${SYSTEM_PROMPT}`,\n  //     ↑ changes on every call — cache never activates\n}]\n\n// ✅ (fragment) Put dynamic values in the user message instead\nmessages: [{ role: 'user', content: `[${new Date().toISOString()}] ${userMessage}` }]\n```\n\nFor OpenAI, caching is automatic — keep the prompt prefix identical across calls and the 50% discount applies with no code changes.\n\nCaching addresses static content. Pruning addresses the quadratic growth of conversation history.\n\n**Summarize old turns into a cached block:**\n\n```\nasync function prepareContext(messages: Message[], keepRecentTurns = 6) {\n  if (messages.length <= keepRecentTurns * 2) {\n    return { summary: null, messages };\n  }\n\n  const older  = messages.slice(0, -(keepRecentTurns * 2));\n  const recent = messages.slice(-(keepRecentTurns * 2));\n\n  // Use a cheap model for summarization — this is simple work\n  const res = await client.messages.create({\n    model:      'claude-haiku-4-5',\n    max_tokens: 300,\n    messages: [{\n      role:    'user',\n      content: `Summarize this conversation in under 200 words.\nInclude: user's goal, decisions made, current status. 
Exclude: detailed reasoning.\n${older.map(m => `${m.role}: ${m.content}`).join('\\n')}`,\n    }],\n  });\n\n  // content is a union of block types; only text blocks carry .text\n  const first = res.content[0];\n  const summary = first?.type === 'text' ? first.text : null;\n\n  return { summary, messages: recent };\n}\n\n// Use the summary as a second cached block\nconst { summary, messages } = await prepareContext(history);\n\nawait client.messages.create({\n  model: 'claude-sonnet-4-6',\n  system: [\n    { type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },\n    ...(summary ? [{ type: 'text' as const, text: `## Prior context\\n${summary}`,\n                     cache_control: { type: 'ephemeral' as const } }] : []),\n  ],\n  messages,\n  max_tokens: 1024,\n});\n```\n\nThe Haiku summarization call costs a fraction of a cent. The savings from not replaying 30 turns of history on every subsequent call are substantial.\n\n**Warning:** Over-pruning context degrades answer quality and triggers retries that cost more than the context you saved. The goal is the *smallest sufficient* context, not the smallest possible one.\n\nModel routing is the highest-leverage structural change most teams never make. The 89× price spread between cheap and frontier models means every routine task sent to Opus burns unnecessary budget.\n\n```\n// A cheap classifier routes tasks to the right model tier\nasync function classifyTask(message: string): Promise<'simple' | 'medium' | 'complex'> {\n  const res = await client.messages.create({\n    model:      'claude-haiku-4-5',  // $0.80/1M — the classifier itself is cheap\n    max_tokens: 5,\n    messages: [{\n      role:    'user',\n      content: `Reply with ONE word: simple, medium, or complex.\nsimple = extraction, classification, yes/no, short lookups\nmedium = summarization, structured analysis, multi-step predictable tasks\ncomplex = deep reasoning, architecture decisions, research synthesis\nTask: \"${message}\"`,\n    }],\n  });\n\n  // Fall back to 'medium' below if the reply is not a text block or not one of the labels\n  const first = res.content[0];\n  const result = first?.type === 'text' ? first.text.trim().toLowerCase() : '';\n  return (['simple', 'medium', 'complex'].includes(result) ? 
result : 'medium') as 'simple' | 'medium' | 'complex';\n}\n\nconst MODEL_MAP = {\n  simple:  'claude-haiku-4-5',   // $4.00/1M output\n  medium:  'claude-sonnet-4-6',  // $15.00/1M output\n  complex: 'claude-opus-4-8',    // $25.00/1M output\n};\n\nconst MAX_TOKENS_MAP = { simple: 256, medium: 1024, complex: 4096 };\n\nexport async function routedCompletion(message: string, context: ConversationContext) {\n  const complexity = await classifyTask(message);\n  return client.messages.create({\n    model:      MODEL_MAP[complexity],\n    max_tokens: MAX_TOKENS_MAP[complexity],\n    system:     context.system,\n    messages:   [...context.history, { role: 'user', content: message }],\n  });\n}\n```\n\nReal-world distribution for a typical support application: ~60% simple, ~30% medium, ~10% complex. Routing that 60% of requests to Haiku instead of Sonnet cuts their cost by about 73% at the listed prices ($0.80 vs $3.00 input, $4.00 vs $15.00 output), which is roughly 40% off the total bill if requests are similar in size, with no quality degradation on routine tasks.\n\nNot every LLM request needs to be real-time. Both Anthropic and OpenAI offer batch APIs at 50% off with identical quality. 
If your users aren't waiting for the response, use the Batch API.\n\nGood candidates: content classification, bulk summarization, document extraction, data enrichment, A/B prompt testing, nightly report generation.\n\n``` js\n// Anthropic Message Batches — 50% off, processed within 24 hours\nconst batch = await client.beta.messages.batches.create({\n  requests: items.map(item => ({\n    custom_id: item.id,\n    params: {\n      model:      'claude-sonnet-4-6',\n      max_tokens: 512,\n      system: [{ type: 'text', text: SYSTEM_PROMPT,\n                 cache_control: { type: 'ephemeral' } }],\n      messages: [{ role: 'user', content: item.content }],\n    },\n  })),\n});\n\n// Results are only available once the batch has ended, so poll until then\nlet status = await client.beta.messages.batches.retrieve(batch.id);\nwhile (status.processing_status !== 'ended') {\n  await new Promise(resolve => setTimeout(resolve, 60000)); // check once a minute\n  status = await client.beta.messages.batches.retrieve(batch.id);\n}\n\nfor await (const result of await client.beta.messages.batches.results(batch.id)) {\n  if (result.result.type === 'succeeded') {\n    await saveResult(result.custom_id, result.result.message.content[0]);\n  }\n}\n```\n\nStack caching with batching: a cached batch request on Sonnet 4.6 drops from $3.00/1M to $0.30/1M (cache) × 0.50 (batch) = **$0.15/1M** — a 95% reduction on that input.\n\nRAG pipelines are the most common source of unnecessary context bloat. Teams retrieve 10,000+ tokens by default because the context window can hold it — not because it improves results.\n\nThe research is clear: context window size does not drive quality. Placement and precision do. 
More context, placed poorly, actively hurts performance via the \"lost in the middle\" effect — models perform worse when critical information is buried mid-context.\n\n```\n// vectorDB stands in for whatever vector store client you use\nasync function surgicalRag(query: string, tokenBudget = 2000) {\n  // Retrieve more candidates than you'll use\n  const candidates = await vectorDB.search(query, { limit: 20 });\n\n  // Filter by relevance score — weak matches add noise\n  const relevant = candidates.filter(c => c.score >= 0.75);\n\n  // Fill up to budget, best matches first\n  // (assumes search results arrive sorted by descending score)\n  let used = 0;\n  const selected = [];\n  for (const chunk of relevant) {\n    if (used + chunk.tokenCount > tokenBudget) break;\n    selected.push(chunk);\n    used += chunk.tokenCount;\n  }\n\n  return selected\n    .map((c, i) => `[Source ${i + 1}: ${c.source}]\\n${c.content}`)\n    .join('\\n\\n---\\n\\n');\n}\n// Result: ~1,500–2,000 tokens vs 10,000+ from naïve retrieval\n// Same or better quality. Significantly lower cost.\n```\n\nThe savings from surgical RAG are substantial — limiting retrieval to 2–3 focused chunks instead of 8–10 full documents can cut input tokens by more than 50%. Cache stable reference content (product docs, FAQs, policies) using prompt caching instead of re-retrieving it on every turn.\n\n**1. Optimizing prompt length before diagnosing cost drivers** Spending hours trimming 200 tokens from a system prompt when 20,000 tokens of conversation history are growing quadratically. Instrument first, optimize second.\n\n**2. `max_tokens: 4096` as a default everywhere** Output tokens cost 4–8× more than input. Set `max_tokens` deliberately:\n\n| Task | Appropriate max_tokens |\n|---|---|\n| Classification, yes/no | 10 |\n| Factual lookup | 256 |\n| Summary or analysis | 512–1024 |\n| Document generation | 2048–4096 |\n\n**3. Dynamic content in static blocks** A timestamp, session ID, or A/B flag placed inside the system prompt block resets the cache prefix on every call. Every cached token becomes uncached. 
Move dynamic values to the user message.\n\n**4. Routing everything to the frontier model \"to be safe\"** Safe is not the same as correct. Haiku handles 60% of real-world support tasks correctly. Routing those to Opus or GPT-5.5 burns roughly 6–8× the necessary budget at the listed prices, with no quality gain on routine work.\n\nA team running a customer support AI agent across 1,200 daily conversations hit a $2,847 monthly bill after three months of growth. After applying the fixes in this guide:\n\n```\npie title \"Source of 70% Cost Reduction\"\n    \"Model routing (Haiku for simple tasks)\" : 41\n    \"Prompt caching (fixing the cache miss)\" : 33\n    \"Context pruning (conversation history)\" : 18\n    \"Output token budgeting (max_tokens)\" : 8\n```\n\nMonth 3: $2,847 → Month 6: $849. A 70% reduction while handling 18% more conversations per day. The largest single saving was model routing, not prompt caching, because the team had been sending every request, including simple intent classification, to Sonnet 4.6.\n\nThe full story with month-by-month numbers is in [How I Cut My AI API Bill by 70%](https://dev.to/how-i-cut-my-ai-api-bill).\n\n**Q: Should I optimize input tokens or output tokens first?** Output tokens. They cost 4–8× more. Set `max_tokens` deliberately per task type, use structured output (JSON schema) to prevent verbose prose when you need data, and ask the model for concise responses where appropriate. This is the fastest single change to make.\n\n**Q: How do I know if prompt caching is actually working?** Check `cache_read_input_tokens` in the API response. If it's 0 on every call, caching is not activating. Common reasons: content below the 1,024-token minimum, `cache_control` markers missing or misplaced, or dynamic content in the static block breaking the prefix.\n\n**Q: Is model routing worth the engineering effort for a small app?** Yes, if you're running more than ~1,000 requests per day. 
The classifier call costs a fraction of a cent using Haiku and pays for itself immediately. Breakeven is typically within the first day of deployment.\n\n**Q: How does the Batch API affect latency — and when is it acceptable?** Batch requests complete within 24 hours, typically much faster for small batches. That delay rules out any real-time user-facing flow. It's ideal for anything that runs in the background: classification pipelines, nightly report generation, bulk data extraction, A/B prompt testing. If your users aren't waiting for the response, use the Batch API.\n\n**Deep Dives**\n\n📰 [Claude API Cost Optimization: 60% Reduction in Production — DEV Community](https://dev.to/whoffagents/claude-api-cost-optimization-caching-batching-and-60-token-reduction-in-production-3n49)\n\n📰 [Lost in the Middle: How Language Models Use Long Contexts — Stanford NLP](https://arxiv.org/abs/2307.03172)\n\n*Measure before you optimize. Run* `client.messages.countTokens()` *on your most frequent request type before changing anything. 
The distribution will tell you exactly which lever to pull first.*\n\n*Originally published on ZyVOP*", "url": "https://wpnews.pro/news/token-budgeting-the-engineering-skill-nobody-talks-about", "canonical_source": "https://dev.to/sanjay_singh_1/token-budgeting-the-engineering-skill-nobody-talks-about-3ifp", "published_at": "2026-06-20 17:38:41+00:00", "updated_at": "2026-06-20 18:06:43.260819+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-infrastructure", "mlops"], "entities": ["Anthropic", "OpenAI", "DeepSeek", "Claude Opus 4.8", "Claude Sonnet 4.6", "Claude Haiku 4.5", "GPT-5.5", "GPT-4.1"], "alternates": {"html": "https://wpnews.pro/news/token-budgeting-the-engineering-skill-nobody-talks-about", "markdown": "https://wpnews.pro/news/token-budgeting-the-engineering-skill-nobody-talks-about.md", "text": "https://wpnews.pro/news/token-budgeting-the-engineering-skill-nobody-talks-about.txt", "jsonld": "https://wpnews.pro/news/token-budgeting-the-engineering-skill-nobody-talks-about.jsonld"}}