cd /news/ai-agents/how-to-manage-token-costs-in-claude-… · home topics ai-agents article
[ARTICLE · art-19546] src=mindstudio.ai pub= topic=ai-agents verified=true sentiment=· neutral

How to Manage Token Costs in Claude Code Dynamic Workflows: Haiku Sub-Agents and Scope Bounding

Anthropic's Claude Haiku sub-agents and scope bounding techniques can reduce token costs in dynamic workflows by routing simpler tasks to the cheaper model and preventing context bloat. Token costs escalate quickly in agentic loops due to compounding input tokens from conversation history, tool schemas, and sub-agent context inheritance. Engineers should assign Haiku to lightweight tasks like data extraction and classification while reserving Claude Opus or Sonnet for complex reasoning and planning.

read14 min publishedJun 2, 2026

Dynamic workflows can burn millions of tokens fast. Learn how to use Haiku sub-agents, scope bounding, and named deliverables to control costs.

Why Dynamic Workflows Burn Through Tokens Faster Than You Think #

Token costs in Claude workflows can escalate quickly. A single well-designed agentic loop can silently consume millions of tokens across a weekend run — and if you’re not paying attention, that bill lands without warning.

Dynamic workflows are particularly expensive because they’re designed to reason before acting. Every planning step, every self-correction, every tool call and response goes through context windows that compound on each other. Claude dynamic workflows, especially ones that spawn sub-agents or use multi-turn reasoning, can stack context at every iteration.

This guide covers two of the most effective techniques for controlling these costs: using Claude Haiku sub-agents for lightweight tasks, and scope bounding to prevent context bloat. Both approaches work independently, but they’re most effective together.

Understanding Where Token Costs Actually Come From #

Before you can reduce costs, you need to understand where tokens go in a dynamic workflow.

Input vs. Output Tokens

Most engineers focus on output tokens, but input tokens are often the bigger culprit in agentic systems. Every time an agent calls Claude, it sends the entire conversation history, the system prompt, tool definitions, and any retrieved context. That input grows with every turn.

In a 20-step agentic loop, you might send the same system prompt and partial history 20 times. If your system prompt is 2,000 tokens and your history grows by 500 tokens per turn, you’re looking at roughly 120,000 tokens of input just for history accumulation — before a single useful output token is generated.

Tool Schemas and Context Overhead

Tool definitions are included in every API call. If you’ve given your agent 15 tools, those schema definitions might add 1,500–3,000 tokens per call. Across 50 calls in a dynamic workflow run, that’s up to 150,000 tokens of pure overhead.

Sub-Agent Spawning Without Boundaries

Dynamic workflows often spawn sub-agents to handle subtasks. Without clear scope limits, each sub-agent can inherit a full copy of the parent context — including all prior reasoning — even when that context is irrelevant to the subtask.

This is where costs multiply fastest.

The Claude Model Tier Strategy #

Anthropic’s Claude model lineup provides a natural cost optimization lever. Claude Opus handles the most complex reasoning but costs significantly more per token. Claude Sonnet sits in the middle. Claude Haiku is the lightweight, fast model designed for high-volume, simpler tasks — and it costs a fraction of Opus.

The mistake most teams make is using one model for everything. They pick Sonnet or Opus and run every step of their workflow through it, regardless of what that step actually requires.

A smarter approach: route tasks by complexity, not by default.

When to Use Haiku

Haiku is well-suited for:

  • Extracting structured data from a formatted input (e.g., pulling fields from a JSON object)
  • Classifying content into a known set of categories
  • Summarizing short documents that don’t require nuanced interpretation
  • Validating outputs against a checklist
  • Formatting and templating tasks where the logic is deterministic
  • Simple yes/no or routing decisions in a workflow

These tasks don’t need Opus-level reasoning. They need speed and reliability, which is exactly what Haiku delivers.

When to Keep Sonnet or Opus

Sonnet or Opus should handle:

  • Open-ended planning where the agent must determine its own next steps
  • Multi-document synthesis where subtle differences matter
  • Code generation requiring reasoning about dependencies and edge cases
  • Tasks where output quality directly affects downstream decisions

The goal is to reserve your expensive model capacity for tasks where the quality difference actually matters.

Building Haiku Sub-Agents in Practice #

The pattern that works best is a coordinator-worker architecture. A primary Claude Sonnet (or Opus) agent handles orchestration and planning. It spawns Haiku sub-agents for execution tasks that fit Haiku’s capability profile.

Step 1: Identify Task Categories at Design Time

Before you build, go through your workflow and categorize every task. Ask: “Does this task require creative reasoning, or is it pattern-matching and extraction?”

Tasks that are structural, deterministic, or template-driven are Haiku candidates. Tasks requiring judgment calls go to Sonnet/Opus.

Step 2: Build Haiku Sub-Agents With Minimal Context

When you spawn a Haiku sub-agent, pass only what it needs for that specific task. Do not pass the full conversation history. Do not include tool definitions for tools the sub-agent won’t use.

A good rule of thumb: if your sub-agent prompt includes more than 500 tokens of context, audit what’s in there. Trim aggressively.

Step 3: Return Typed, Structured Outputs

Built like a system. Not vibe-coded.

Remy manages the project — every layer architected, not stitched together at the last second.

Have your Haiku sub-agents return structured outputs — JSON objects, enumerations, or defined schemas. This makes it easy for the parent agent to consume results without needing to re-interpret free-text responses.

It also prevents the parent agent from spending tokens re-reading and parsing verbose sub-agent outputs.

Step 4: Use Haiku for Validation Loops

One of the highest-leverage uses of Haiku is validation. After your Sonnet agent produces an output, route it through a Haiku validator that checks it against a defined rubric. If it fails, send it back. If it passes, proceed.

This keeps validation cheap without sacrificing quality control.

Scope Bounding: Preventing Context Bloat #

Scope bounding is the practice of explicitly defining what information an agent step is allowed to access and carry forward. It’s the single most effective architectural technique for keeping token costs manageable in long-running workflows.

Why Context Bloat Happens

Agents accumulate context naturally. Every tool call adds a result. Every reasoning step adds output. Without explicit boundaries, a 10-step workflow can arrive at step 8 carrying context from steps 1 through 7, most of which is no longer relevant.

This isn’t just a cost problem. Bloated context degrades model performance. Claude has to attend to more irrelevant information, which can dilute the quality of outputs at later stages.

Define a Context Window Budget Per Step

One approach is to set a token budget for each step in your workflow. Before invoking Claude, count the tokens in your prompt. If it exceeds your budget, trim older context.

You can implement this with a sliding window (keep the last N turns) or a summarization step (compress prior history into a condensed summary before passing to the next step).

Summarization is more expensive upfront but often produces better downstream results because the condensed summary preserves meaning without raw verbatim history.

Use Explicit Handoff Payloads

Instead of passing full conversation history between steps, define structured handoff objects. At the end of each step, your agent writes a handoff payload — a structured summary of decisions made, data collected, and what the next step needs to know.

The next step receives only that payload, not the entire prior context.

This is the single highest-leverage change you can make to a dynamic workflow architecture.

A handoff payload might look like:

{
  "task_id": "research_competitors",
  "status": "complete",
  "findings": {
    "competitors_identified": ["Company A", "Company B"],
    "key_differentiators": "...",
    "data_gaps": ["pricing for Company B"]
  },
  "next_step_context": "Proceed with pricing research for Company B only"
}

This replaces 3,000 tokens of conversation history with 150 tokens of structured context.

Limit Tool Access Per Step

Don’t give every agent step access to every tool. When Claude sees a list of 20 tools, it considers all of them as candidates for its next action — and those tool schemas consume tokens on every call.

Scope the tool list to what’s actually needed for that step. A research step gets search and fetch tools. A formatting step gets no tools. A writing step gets document creation tools.

This reduces per-call token overhead significantly.

Named Deliverables as a Cost Control Pattern #

#

Plans first. Then code.

Remy writes the spec, manages the build, and ships the app.

Named deliverables are one of the more underused patterns in Claude workflow design. The idea is simple: instead of letting an agent reason continuously until it “feels done,” you define explicit deliverables it must produce before moving forward.

What a Named Deliverable Is

A named deliverable is a specific artifact with a defined format and scope. Examples:

COMPETITOR_LIST

— a JSON array of competitor names with URLs, limited to 10 entriesRESEARCH_SUMMARY

— a plain-text summary of findings, 300 words maximumACTION_PLAN

— a numbered list of 5 tasks with estimated completion times

By naming the deliverable and defining its constraints (format, length, scope), you prevent the agent from over-generating. It knows exactly what “done” looks like.

How Named Deliverables Reduce Token Cost

Without named deliverables, agents tend to:

  • Reason extensively before producing output
  • Generate longer outputs than necessary because there’s no length constraint
  • Revisit earlier reasoning unnecessarily
  • Produce prose when structured data would serve better

Named deliverables create natural stopping points. When the deliverable is complete and valid, the step is done. There’s no ambiguity about when to stop generating.

Implementing Named Deliverables in Your System Prompt

Your system prompt can define deliverables explicitly:

“Your task is to produce the RESEARCH_SUMMARY deliverable. This is a plain-text summary of the research findings, no more than 300 words. Focus only on findings relevant to pricing strategy. Output the deliverable in the following format: [format spec].”

This single constraint change can reduce per-step output tokens by 40–60% compared to open-ended prompts.

Chaining Named Deliverables

In a multi-step workflow, each step produces one or more named deliverables, which become the input for the next step. This is essentially a typed pipeline.

The coordinator agent tracks which deliverables exist, validates their schemas, and passes only the relevant ones to each subsequent step. This is scope bounding and named deliverables working together.

Practical Cost Estimation Before You Build #

One of the biggest mistakes teams make is discovering token costs after deployment. A few minutes of upfront estimation can save significant spend.

Estimate Per-Run Token Usage

For each step in your workflow:

  • Estimate system prompt length in tokens
  • Estimate average input context (history + tool schemas + retrieved data)
  • Estimate average output length
  • Multiply by expected number of calls per run

Sum these across all steps. Multiply by expected runs per day/week. Use Anthropic’s published pricing to translate to dollars.

Anthropic’s model pricing page gives per-token costs for each model tier. As of mid-2025, Haiku is roughly 15-25x cheaper per token than Opus on input, making the model routing decision financially significant at scale.

Set Hard Token Limits

Use max_tokens parameters to cap output length per call. For Haiku sub-agents doing extraction or validation tasks, a limit of 200-500 tokens is often appropriate. This prevents runaway outputs and provides a forcing function for concise responses.

Add Cost Logging Early

Instrument your workflow to log token usage per step from day one. The usage data returned in Claude API responses includes input_tokens and output_tokens per call. Aggregate these by step, by run, and by agent type.

This visibility makes it obvious which steps are burning disproportionate tokens and where optimization will have the most impact.

Where MindStudio Fits Into This Architecture #

If you’re building Claude workflows and don’t want to hand-wire the coordinator-worker architecture from scratch, MindStudio provides a practical alternative.

MindStudio’s visual workflow builder supports multi-step AI pipelines with model selection at the step level. You can configure one step to run on Claude Sonnet for planning and route sub-tasks to Haiku for classification or extraction — without writing orchestration code. The platform handles context passing between steps through structured data objects, which maps directly to the handoff payload pattern described above.

The Agent Skills Plugin (@mindstudio-ai/agent

) is particularly relevant if you’re working in Claude Code. It exposes MindStudio’s 120+ typed capabilities as method calls your Claude agent can invoke directly — things like agent.searchGoogle()

or agent.runWorkflow()

. This lets your Claude Code agent delegate to purpose-built MindStudio workflows for specific subtasks, rather than handling everything in one monolithic context window. You get the reasoning quality of Claude Code with the cost efficiency of delegated, scoped execution.

For teams building at volume, this architecture — Claude Code as the coordinator, MindStudio workflows as Haiku-level executors for defined subtasks — can reduce per-run token costs substantially while keeping build time low.

You can try MindStudio free at mindstudio.ai.

Common Mistakes and How to Fix Them #

Mistake 1: Letting History Accumulate Unbounded

Fix: Implement a summarization step every N turns, or use structured handoff payloads instead of raw history.

Mistake 2: Routing Everything Through One Model

Fix: Categorize tasks at design time. Route structural and extraction tasks to Haiku. Reserve Sonnet/Opus for reasoning and judgment tasks.

Mistake 3: Including All Tools in Every Step

Fix: Scope tool availability to what each step actually needs. Remove unused tool definitions from the context.

Mistake 4: Open-Ended Prompts Without Stopping Conditions

Fix: Use named deliverables with explicit format and length constraints. Define “done” before the agent starts.

Mistake 5: No Cost Visibility Until the Bill Arrives

Fix: Log input_tokens and output_tokens per call from day one. Set alerts for per-run token budgets.

Frequently Asked Questions #

How much cheaper is Claude Haiku compared to Claude Opus?

As of mid-2025, Claude Haiku costs roughly 15-25x less per token than Claude Opus on input, and significantly less on output as well. The exact ratio shifts with Anthropic’s pricing updates, but the order-of-magnitude difference holds. For high-volume workflows where many steps are doing structured extraction or classification, routing those steps to Haiku can reduce overall API spend dramatically without meaningful quality loss.

What types of tasks are actually safe to run on Claude Haiku?

Haiku is reliable for tasks with clear, bounded inputs and outputs: JSON extraction, category classification, checklist validation, text formatting, short summarization of structured content, and simple routing decisions. It underperforms Sonnet or Opus on tasks requiring multi-step reasoning, nuanced interpretation, or novel synthesis. When in doubt, test both models on representative examples from your actual workload and compare outputs.

How do I implement scope bounding without losing important context?

The key is defining structured handoff payloads rather than passing raw conversation history. At the end of each workflow step, have the agent produce a typed summary of what was decided and what the next step needs. This preserves semantic continuity without accumulating verbose history. For very long workflows, add an explicit summarization step every 5-10 turns that compresses prior context into a condensed state object.

Can named deliverables hurt output quality by constraining the model too much?

They can if the constraints are too tight for the task. A 50-word limit on a research synthesis step will produce low-quality output. The fix is to calibrate constraints based on actual output requirements — run the task unconstrained first, observe typical output length and format, then set limits that are realistic. Named deliverables should prevent over-generation, not under-generation.

How do I know which steps in my workflow are causing the most token waste?

Instrument each step to log input_tokens and output_tokens from the API response. Sort by total tokens consumed per step across a sample of runs. Steps with high input token counts relative to output complexity are usually accumulating unnecessary context. Steps with high output token counts may lack appropriate constraints. Both are optimization targets.

Does context caching help with these costs?

Yes, significantly. Anthropic’s prompt caching feature allows you to cache static portions of the context — system prompts, tool definitions, document content that doesn’t change between calls — and pay a reduced rate for cache hits. For workflows with large system prompts or stable tool schemas, caching can cut input token costs by 60-90% on those cached portions. It’s complementary to the techniques in this guide, not a replacement.

Key Takeaways #

  • Token costs in dynamic workflows accumulate primarily through input token growth: history, tool schemas, and inherited context compound on every call.
  • Using Claude Haiku for extraction, classification, validation, and formatting tasks can reduce per-run costs by an order of magnitude compared to routing everything through Opus or Sonnet.
  • Structured handoff payloads between steps prevent context bloat more effectively than any other single architectural change.
  • Named deliverables with explicit format and length constraints reduce output token waste and create natural stopping points for agents.
  • Scoping tool availability per step reduces per-call token overhead for tool schemas.
  • Cost visibility requires instrumentation from day one — log token usage per step and per run before costs become a problem.

If you want to put these patterns into practice without building the orchestration layer from scratch, MindStudio’s visual workflow builder and Agent Skills Plugin give you model routing, structured data passing, and Claude Code integration in one place. Start free at mindstudio.ai.

── more in #ai-agents 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/how-to-manage-token-…] indexed:0 read:14min 2026-06-02 ·