How to Use Prompt Caching to Cut Claude Code Token Costs in Dynamic Workflows

Anthropic's prompt caching feature reduces token costs in dynamic Claude Code workflows by storing repeated context like system prompts and tool definitions, cutting cached input token prices to approximately 10% of standard rates. The technique works best when identical content appears across multiple API calls within the 5-minute cache lifetime, with cache breakpoints set using the `cache_control` parameter in the messages API. Combined with scope bounding and routing lighter tasks to Claude Haiku, this approach can significantly lower expenses without sacrificing reasoning quality in multi-step agentic workflows.

How to Use Prompt Caching to Cut Claude Code Token Costs in Dynamic Workflows Dynamic workflows burn tokens fast. Learn how to use prompt caching, scope bounding, and Haiku sub-agents to control costs in Claude Code. Why Token Costs Spiral in Dynamic Workflows If you’ve run Claude Code on anything more complex than a single-shot task, you’ve probably watched your token count balloon fast. A workflow that starts with a modest system prompt and a few tool definitions can triple in cost by step five, simply because every new action adds more context to the pile. This isn’t a flaw in how Claude works — it’s a natural consequence of how agentic workflows are structured. Claude needs context to reason well. But that same context, repeated across every API call, is where the bill quietly grows. Prompt caching in Claude is the most direct way to address this. Combined with scope bounding and routing lighter tasks to Claude Haiku, you can cut costs significantly without sacrificing reasoning quality. This guide explains each technique concretely, with enough implementation detail to apply them to your own workflows. Understanding Why Dynamic Workflows Are Expensive A static prompt — one system prompt, one user message, one response — is cheap and predictable. Dynamic workflows are different. In an agentic context, you’re often: - Passing the same system prompt and tool definitions on every API call - Accumulating conversation history across multiple steps - Including large documents, codebases, or retrieved chunks as context - Running many sub-tasks in sequence, each inheriting the full prior context Each of these adds tokens. And because Claude’s pricing is per-token on both input and output, a 10-step workflow where each step re-sends 4,000 tokens of shared context costs 40,000 tokens just for that repeated content — before any new information gets processed. The problem compounds when workflows branch or retry. A tool call that fails and re-runs sends all that shared context again. A workflow that spawns sub-agents multiplies it further. Prompt caching targets exactly this pattern: repeated, identical content that doesn’t change between calls. How Prompt Caching Works in Claude Anthropic’s prompt caching feature lets you mark portions of your prompt as cacheable. When Claude processes a request with a cache breakpoint, it stores the computed KV key-value state for that portion of the prompt. Subsequent requests that hit the same cached content skip re-processing it — and pay a fraction of the normal input token price. The Cost Math Cached input tokens cost approximately 10% of the standard input token price. Cache writes cost slightly more than standard input around 25% more , but that’s a one-time cost per cache population. The breakeven is quick. If any piece of content appears in two or more API calls, caching it saves money. For system prompts that repeat across dozens of workflow steps, the savings are substantial. Cache Lifetime and Invalidation Claude’s cache has a 5-minute TTL by default. The clock resets each time the cached content is accessed. For long-running workflows, this means: - If your workflow completes within 5 minutes, the cache stays warm throughout - For workflows that pause or run slowly, you may need to re-warm the cache - Any change to the cached content — even a single token — creates a new cache entry This last point matters for dynamic workflows: the cached portion must be byte-for-byte identical across calls. Setting Cache Breakpoints Cache breakpoints are set using the cache control parameter in the messages API. You place them at the end of content blocks you want cached. { "system": { "type": "text", "text": "You are a code review assistant... long system prompt ", "cache control": {"type": "ephemeral"} } } You can place up to four cache breakpoints per request. Each one marks a boundary — everything up to that point in the prompt gets cached as a unit. Good candidates for caching: System prompts — Usually static across an entire workflow run Tool definitions — Rarely change mid-workflow Reference documents — Codebases, style guides, API specs passed as context Few-shot examples — If you include examples to shape Claude’s output format Poor candidates for caching: - The current user message changes every call - Dynamic tool outputs change every step - Timestamps or session-specific data embedded in the prompt Structuring Prompts for Maximum Cache Hits The order of content in your prompt determines how caching performs. Claude processes prompts from top to bottom, and the cache captures everything from the start up to the breakpoint. Put Static Content First The golden rule: static content before dynamic content. If your system prompt comes after dynamic context, the cache can’t capture it in isolation — any change above the breakpoint invalidates everything below it. Structure your prompts like this: System prompt → cache breakpoint Tool definitions → cache breakpoint Reference documents → cache breakpoint Conversation history / dynamic context → no cache breakpoint Current user message → no cache breakpoint With this structure, the first three sections get cached and reused. Only the last two sections — which legitimately change each call — get processed fresh. Separate Tool Definitions from Tool Outputs Tool definitions the schema describing what tools exist are static. Tool outputs the results of calling those tools are dynamic. Keep them in separate blocks. Mixing them forces the cache to treat both as dynamic, which defeats the purpose. In practice, this means building your tools array once and not embedding previous tool call results into the tool schema itself. Handling Conversation History Conversation history is the trickiest part. It grows with every step, which means the entire history block is always changing — bad for caching. One approach: cache the history up to a certain point. After step N, stop updating the history block and only add new content at the end. This lets earlier turns stay cached while new context gets processed fresh. Another approach: summarize older turns into a static summary block, cache that, and only keep the last few turns dynamic. You trade some fidelity for significantly lower token costs. Scope Bounding: Controlling What Context Gets Passed Prompt caching helps with repeated content. Scope bounding helps with unnecessary content — context that gets passed along not because it’s needed, but because it was easier to include everything. What Scope Bounding Means In a multi-step workflow, each step tends to receive the full context from all previous steps. This is often overkill. Step 5 might only need the output of step 3, not the entire history of decisions made in steps 1 through 4. Scope bounding means deliberately deciding — at the workflow design level — what context each step actually needs. Then only passing that. This requires more up-front thought, but the token savings are immediate: smaller inputs, smaller outputs, faster responses, lower costs. Practical Techniques Output distillation: After each step, have Claude produce both a detailed output and a condensed summary. Pass only the summary to the next step unless detail is specifically required. Explicit context gates: Define which previous outputs each step is allowed to access. Don’t accumulate context automatically — make each step declare its dependencies. State objects: Instead of passing raw conversation history, maintain a structured state object that gets updated at each step. The state object holds only the current relevant state, not the full history of how you got there. Checkpointing: For long workflows, periodically compress the accumulated context into a checkpoint. The checkpoint captures the current state without the history of every intermediate decision. When Full Context Is Necessary Some tasks genuinely require full context — complex debugging, multi-document synthesis, tasks where earlier decisions directly constrain later ones. Don’t scope-bound those aggressively. The goal isn’t to minimize context at all costs. It’s to avoid passing context that doesn’t affect the current step’s output. Using Claude Haiku Sub-Agents for Cheaper Execution Prompt caching and scope bounding address the token side of costs. Model selection addresses the per-token price side. Other agents ship a demo. Remy ships an app. Real backend. Real database. Real auth. Real plumbing. Remy has it all. Not every step in a workflow requires Claude’s most capable model. Many sub-tasks are straightforward: formatting output, extracting specific fields, classifying inputs, routing between branches, generating short summaries. Claude Haiku handles these well and costs significantly less than Claude Sonnet or Opus — roughly 25x cheaper per token than Opus. For high-volume workflows with many small steps, routing the right tasks to Haiku can cut overall costs dramatically. Identifying Haiku-Appropriate Tasks Good candidates for Haiku sub-agents: Classification and routing — “Is this request about billing or technical support?” Structured extraction — “Pull the date, amount, and vendor from this invoice text” Format conversion — “Convert this JSON to a markdown table” Validation checks — “Does this code follow the style guide? Yes or no, and list violations” Short summaries — Condensing a paragraph or two Tasks that still warrant Sonnet or Opus: - Multi-step reasoning with ambiguous inputs - Code generation beyond simple snippets - Tasks where errors have high downstream costs - Synthesis across many documents or sources Implementing Model Routing The simplest approach: add a routing step at the start of your workflow that classifies the incoming task and assigns a model. This router can itself run on Haiku — meta, but appropriate. For programmatic workflows, you can hard-code the model assignments per step type. Step type “extract fields” always runs on Haiku. Step type “synthesize findings” always runs on Sonnet. A hybrid approach works well: use Haiku for the first pass, then conditionally escalate to Sonnet if Haiku’s output fails a quality check or returns a low-confidence result. The escalation cost is worth it when it happens rarely. Combining All Three Techniques Prompt caching, scope bounding, and Haiku routing aren’t independent — they compound when used together. Consider a workflow that processes incoming support tickets: Step 1 Haiku : Classify the ticket type and severity Step 2 Haiku : Extract key entities product name, error code, user account Step 3 Sonnet, cached system prompt + tool definitions : Retrieve relevant documentation and draft a response Step 4 Haiku : Format the draft into the standard reply template With prompt caching applied to step 3’s system prompt and tool definitions, every ticket processed after the first one in a session hits the cache. With scope bounding, step 3 only receives the output of step 2 structured entities rather than the full conversation history. With Haiku on steps 1, 2, and 4, three of the four steps run at low cost. The result: you’re paying Sonnet rates only for the reasoning-heavy step, and within that step, the static portions are cached. Measuring the Impact Track these metrics to understand your actual savings: Cache hit rate: The percentage of input tokens that come from cache. Aim for 60–80%+ on workflows with large static context. Input token distribution: What fraction of total tokens are from cached vs. fresh content. Cost per workflow run: Track this over time as you tune caching and routing. Anthropic’s API response includes cache read input tokens and cache creation input tokens fields, which make it straightforward to see whether caching is working as expected. How MindStudio Handles This for You One coffee. One working app. You bring the idea. Remy manages the project. If you’re building these workflows yourself, implementing prompt caching, scope bounding, and model routing requires writing and maintaining the logic yourself — managing cache breakpoints, structuring context, routing between models, handling retries. MindStudio https://mindstudio.ai handles much of this at the platform level. When you build an AI agent or workflow in MindStudio, you’re working with a visual builder that manages context flow between steps. You can explicitly configure what context passes between steps scope bounding by design , and you choose the model per step — including mixing Claude Haiku for cheaper steps with Sonnet for heavier reasoning. Because MindStudio abstracts the API layer, prompt caching behavior is handled without you manually managing cache control blocks. Static content like system prompts and tool definitions gets structured appropriately, and the platform’s infrastructure handles rate limiting, retries, and session management. For teams that want the cost benefits of these optimization techniques without building the plumbing from scratch, MindStudio is worth looking at. It gives you 200+ models including the full Claude family, lets you wire up multi-step workflows visually, and connects to 1,000+ integrations https://mindstudio.ai/integrations without needing separate API accounts for each. You can try it free at mindstudio.ai https://mindstudio.ai . Common Mistakes That Undermine Caching Even with the right intent, a few common mistakes prevent caching from working: Dynamic content embedded in the system prompt. If your system prompt includes a timestamp, user ID, or session variable, the cache invalidates on every call. Move any dynamic data out of the cached block and into the user message or a separate non-cached block. Inconsistent whitespace or formatting. A single extra space or newline makes the cached content a different string. If you’re generating prompts programmatically, normalize whitespace before sending. Caching too early in the messages array. If you put a cache breakpoint after the user message, you’re caching dynamic content. The breakpoint only helps when placed after genuinely static sections. Not verifying cache hits. It’s easy to assume caching is working when it isn’t. Check the cache read input tokens field in API responses to confirm hits are occurring. Ignoring cache warmup cost. The first call after a cache miss pays a slight premium for cache creation. For very short workflows with few repeated calls, the overhead can exceed the savings. Caching makes most sense when the same content repeats many times. FAQ What types of content should I cache in Claude workflows? Cache content that is identical across multiple API calls within a workflow. The best candidates are: system prompts, tool definitions and schemas, large reference documents codebases, style guides, API specs , and few-shot examples. Avoid caching dynamic content like user messages, tool call results, or any content that includes session-specific variables like timestamps or user IDs. How much can prompt caching actually save? Cached input tokens cost approximately 10% of the standard input token price. For workflows where 60–80% of input tokens come from repeated static content, total input costs can drop by 50–70%. The exact savings depend on your workflow structure, the size of your static context, and how many calls reuse it. Does prompt caching work across different workflow runs? Other agents start typing. Remy starts asking. Scoping, trade-offs, edge cases — the real work. Before a line of code. Claude’s cache has a 5-minute TTL that resets on each access. Within a single workflow run where API calls happen frequently, the cache stays warm. Across completely separate runs with gaps longer than 5 minutes, the cache will have expired. For scheduled workflows with infrequent runs, factor in cache warmup costs. When should I use Claude Haiku instead of Sonnet? Use Haiku for tasks that are well-defined, have structured inputs, and don’t require complex multi-step reasoning. Classification, entity extraction, format conversion, simple validation, and short summarization all work well on Haiku. Reserve Sonnet or Opus for tasks involving ambiguous inputs, complex reasoning, code generation, or high-stakes decisions where errors are costly. What is scope bounding and how does it reduce token costs? Scope bounding means explicitly limiting what context each step in a workflow receives, rather than passing the full accumulated history by default. It reduces costs by shrinking input size at each step — if step 5 only needs the output of step 3, passing the full history of steps 1 through 4 wastes tokens. Common techniques include output distillation condensing step outputs , structured state objects, and explicit dependency declarations between steps. Can I combine prompt caching with streaming? Yes. Prompt caching works with both streaming and non-streaming API requests. The cache control parameter operates at the request level before any streaming begins. Cache hits and misses behave the same way regardless of whether you’re streaming the response. Key Takeaways - Prompt caching reduces input token costs for repeated static content to roughly 10% of normal price — place breakpoints after your system prompt, tool definitions, and reference documents. - Cache invalidation happens on any change to the cached content, including whitespace differences. Generate prompts consistently and keep dynamic variables outside cached blocks. - Scope bounding cuts costs at the workflow design level by passing only the context each step genuinely needs, rather than accumulating everything by default. - Claude Haiku handles classification, extraction, formatting, and validation tasks effectively at a fraction of Sonnet’s cost. Route tasks to the cheapest model that can handle them reliably. - These three techniques compound: using all three on a multi-step workflow can reduce total token costs by 60–80% compared to a naive implementation. - Verify caching is working by checking cache read input tokens in API responses — don’t assume the cache is warm. For teams building these workflows on top of Claude, MindStudio https://mindstudio.ai provides a no-code environment that handles context management, model routing, and workflow orchestration out of the box — so you can focus on building the logic rather than the infrastructure. Start free at mindstudio.ai https://mindstudio.ai .