GitHub's shift from premium requests to usage-based billing has triggered a wave of anxiety across engineering teams. The question echoing through Slack channels and leadership meetings is some variation of: "How do we reduce our token spend?"
It's the wrong question.
Focusing purely on cost diminishes the value you get from agents. A better framing is: "How do we get the most out of the tokens we spend?" That subtle reframing changes everything — from how you write prompts, to which model you reach for, to how you architect your codebase, to how you organize your team's workflows.
This article walks through the full case for quality-first token optimization, the foundational mental models you need to reason about it, and the concrete controls and techniques that move the needle.
When tokens were effectively free, agent accuracy didn't really matter. The dominant pattern became what's best described as "agent gambling": throw together a lazy prompt with minimal context, fire off an agent, and if it fails, fire off another one. Think of it as the NASA Artemis problem in reverse — if rockets were cheap, you'd send 20 in the general direction of the moon and hope one lands.
That worked when each developer ran a handful of agents per day. It stops working the moment developers — and especially AI engineers orchestrating fleets — are running dozens or hundreds of agents per day. The economics invert. The cost of misfires dwarfs the cost of doing the work properly.
The fix isn't to send fewer rockets blindly. It's to make sure each rocket actually lands. Higher per-agent quality means fewer retries, fewer wasted tokens, and better ROI on every dollar of usage.
The guiding equation for thinking about agent economics:
Agent ROI = (Value of Agent Output − Token Cost) / Token Cost × 100%
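As a rough illustration of the equation, here is a minimal sketch with invented dollar figures. The function name and the numbers are assumptions for illustration, not measurements.

```python
def agent_roi_percent(output_value: float, token_cost: float) -> float:
    """Return Agent ROI in percent: (value - cost) / cost * 100."""
    if token_cost <= 0:
        raise ValueError("token_cost must be positive")
    return (output_value - token_cost) / token_cost * 100


# Hypothetical task: the output is worth $50 and one attempt costs $2 in tokens.
first_try = agent_roi_percent(output_value=50.0, token_cost=2.0)

# Same task, but three failed attempts came first, so four attempts were paid for.
after_retries = agent_roi_percent(output_value=50.0, token_cost=8.0)

print(f"first try: {first_try:.0f}%  after retries: {after_retries:.0f}%")
```

The value of the output is the same in both cases; only the number of paid attempts changes, and the ROI drops with every retry.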
You can't calculate this precisely, but it's a directionally useful lens. Two things follow immediately:
Here's the math that should haunt anyone running multi-step agent workflows: errors compound multiplicatively.
LLMs are non-deterministic. Every step in an inner agent loop, every hop in an orchestrated workflow, every tool call — they all multiply against each other. This means every percentage point of per-step quality buys you a disproportionate improvement in overall reliability. And every miss isn't just a wasted token call — it triggers fix cycles, review overhead, reruns, debugging sessions, and burned human attention.
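A small sketch makes the compounding concrete. It assumes every step succeeds independently with the same probability, which is a simplification; real workflows have correlated failures and recovery steps.

```python
def workflow_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return per_step_success ** steps


for p in (0.90, 0.95, 0.99):
    overall = workflow_success_rate(p, steps=10)
    print(f"{p:.0%} per step over 10 steps: {overall:.1%} overall")
```

Under that assumption, ten steps at 90% per step finish cleanly about 35% of the time, at 95% about 60%, and at 99% about 90%.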
The takeaway: apply the same "shift-left" mindset to agents that you apply to quality, testing, and security in traditional engineering.
The whole philosophy collapses into one line worth pinning to your monitor:
Instead of counting tokens, make every token count.
Reduce token usage as a consequence of pursuing quality — not as a goal in itself. Send fewer, better-targeted rockets. The fuel savings follow automatically.
Before you can optimize anything, you need to internalize a few mechanical truths about how this technology actually works.
Strip away the marketing and what you have is a text-in, text-out system that predicts the next word given an input plus the patterns from its training data. When you type "GitHub Copilot is the world's most widely…" the model assigns probabilities to candidate next words — used, adopted, deployed, and so on — and picks one. In a coding context, it's predicting the next instruction.
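The next-word step can be pictured with a toy example. The scores below are invented, and a real model works over tens of thousands of tokens and usually samples rather than always taking the top candidate, so treat this as a sketch of the idea only.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - peak) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


# Hypothetical scores for the word after
# "GitHub Copilot is the world's most widely ..."
candidate_scores = {"used": 2.0, "adopted": 1.5, "deployed": 0.5, "invented": -1.0}

probabilities = softmax(candidate_scores)
most_likely = max(probabilities, key=probabilities.get)

for word, prob in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f"{word:>9}: {prob:.2f}")
```

Every candidate gets a nonzero probability, including the implausible one; nothing in the arithmetic marks a continuation as true or false.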
Models have gotten dramatically better, but the underlying mechanism hasn't changed. This matters because the math doesn't distinguish hallucination from fact. A made-up function name and a real one occupy the same probability space. The model isn't "lying" when it hallucinates — it's just doing what it always does with insufficient signal.
Which leads to the single most important principle in this entire discipline:
Provide as little context as possible, but as much as required.
Two failure modes flank this principle:
Context engineering — the discipline of finding that sweet spot — is the fundamental skill of working with agents.
An agent is not magic. It's an app — code that sits between you and the LLM. The architecture is simple:
You and your project ↔ The agent (harness) ↔ The LLM
Harnesses are things like VS Code Chat, Copilot CLI, Copilot Cloud Agent, Claude Code, and OpenAI Codex. Models are things like GPT-5.5, Claude Opus 4.7, and Gemini Pro. The harness is the orchestrator; the model is the inference engine.
Two things are crucial to understand here:
A token is roughly ¾ of an English word. Smaller models offer 50K–200K token windows; larger ones like Opus and GPT-5.5 push toward 1M tokens. For scale: 1M tokens is roughly the entire Lord of the Rings trilogy plus The Hobbit.
Don't obsess over token counting at the character level. Think at the level of prompts, files, and responses — those are the units that compound on each loop.
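To see why those units compound, here is a back-of-the-envelope sketch. It assumes the harness resends the whole accumulated conversation on every loop and ignores prompt caching; both function names are made up for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate from the rule of thumb that a token is about 3/4 of a word."""
    words = len(text.split())
    return round(words / 0.75)


def cumulative_input_tokens(turn_sizes: list[int]) -> int:
    """Total input tokens when every loop resends all earlier turns plus the new one."""
    total = 0
    history = 0
    for size in turn_sizes:
        history += size   # the conversation grows by this turn
        total += history  # and the whole conversation is sent again
    return total


# Four loops that each add about 1,000 tokens of prompt, file content, or response.
print(cumulative_input_tokens([1000, 1000, 1000, 1000]))
```

Four turns of 1,000 tokens add up to 10,000 input tokens under this assumption, not 4,000. A large file pulled in early is paid for again on every later loop.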
Even with a huge window, models don't treat all positions equally. Two well-documented effects govern how attention is distributed:
The practical implications are significant:
The fix isn't compaction (which trades tokens for potential information loss). It's a new context window per task: /clear liberally, divide work into discrete sessions, and don't let conversations sprawl.
Now to the controls themselves, ordered roughly by leverage.
Two archetypes exist on the agent maturity spectrum:
Calibrate effort accordingly.
Two controls vastly outweigh everything else: model choice and relevant context.
Model choice is the single highest-leverage decision. The cost gap between top-tier reasoning models (Claude Opus 4.7) and small models (GPT-5.4 mini) is roughly 24x. Match the model to the task:
A reasoning model on a trivial task isn't just expensive — it can actively make things worse, second-guessing tight specifications and "going rogue." Conversely, a small model on a planning task will produce shallow, brittle output.
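One way to reason about the match is expected token cost per successful result rather than cost per attempt. The sketch below uses relative price units (small model = 1, top-tier model = 24, the gap cited above); the success rates are invented purely for illustration, and it assumes failed attempts are simply retried.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success when failed attempts are retried.

    With independent attempts, the expected number of tries is 1 / success_rate.
    """
    if cost_per_attempt < 0:
        raise ValueError("cost_per_attempt must not be negative")
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical success rates, not benchmarks.
small_on_trivial = expected_cost_per_success(1.0, 0.95)
large_on_trivial = expected_cost_per_success(24.0, 0.95)
small_on_planning = expected_cost_per_success(1.0, 0.02)
large_on_planning = expected_cost_per_success(24.0, 0.80)

print(f"trivial task:  small {small_on_trivial:.1f}  large {large_on_trivial:.1f}")
print(f"planning task: small {small_on_planning:.1f}  large {large_on_planning:.1f}")
```

With these made-up rates the small model wins clearly on the trivial task, and the large model only wins on planning because the small one almost never succeeds. The sketch counts tokens only; review and rework time for each failed attempt is not included.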
Auto Mode (rolling out from June) detects task intent and selects the model for you. It's the lazy default for anyone who doesn't want to think about it — and it's usually right.
Relevant context is the other half of the equation. Don't stuff prompts with "might need" information. Let the agent discover what it needs. Compacting sessions trades tokens for potential info loss, so use it cautiously. And use /clear often: tokens don't carry across sessions, so a clean slate is free.
The prompt is always-on. It sits at the beginning of the context window and has outsized influence due to lost-in-the-middle effects.
A few rules:
A single context window doing research, planning, and implementation drags irrelevant files and stale reasoning through every phase. Quality degrades.
The pattern that works:
Each phase gets a fresh context window. The spec is the artifact that carries information across the boundary — clean, distilled, free of noise. This saves both time and tokens, and produces far higher-quality output than one monolithic session.
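The hand-off can be sketched in a few lines. Session and start_session are hypothetical stand-ins for whatever your harness calls a session, and the file names and spec text are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical stand-in for one agent session with its own context window."""

    phase: str
    context: list[str] = field(default_factory=list)


def start_session(phase: str, inputs: list[str]) -> Session:
    """Open a fresh session that sees only the inputs handed to it."""
    return Session(phase=phase, context=list(inputs))


# Research reads widely; its context fills up with files and dead ends.
research = start_session(
    "research", ["file_a.py", "file_b.py", "abandoned idea", "old logs"]
)

# Assumption: a human or the agent distills the findings into a short spec.
spec = "Add retry with backoff to the upload client; keep the public API unchanged."

# Implementation starts clean: the spec is the only thing that crosses over.
implementation = start_session("implementation", [spec])

print(implementation.context)
```

Nothing from the research session leaks into implementation unless it was written into the spec, which is the point of the boundary.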
Tests, linters, security scanners, type checkers — anything code-enforced and deterministic — are essential context engineering tools. A test either fails or passes. There's no probability. Every passing test resets the compounding error rate to zero for the property it covers.
The contrast is stark:
The Copilot CLI team ships roughly 500 PRs per week. Roughly 53% of their codebase is tests. That's not overhead — that's the moat that lets them move that fast without burning down the production system.
Cheap in the short term means expensive in the medium term. Guardrails pay back many times over.
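As a minimal example of a deterministic guardrail, the sketch below rejects agent output that does not even parse as Python. It uses the standard library's ast.parse; the function name is an assumption, and a real gate would chain tests, linters, and type checks behind it.

```python
import ast


def passes_syntax_gate(source: str) -> bool:
    """Deterministic check: True only if the text parses as Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers inputs such as
        # null bytes on some Python versions.
        return False
    return True


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b) return a + b"

print(passes_syntax_gate(good), passes_syntax_gate(bad))
```

The same input always gives the same answer, so a failure here can be fed straight back to the agent without a human having to judge it.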
Modern agent harnesses pick up a stack of markdown files automatically. These are the surface you work with as a context engineer:
copilot-instructions.md. Always loaded.
./.github/agents/*.agent.md. Role-based, manually invoked.
./.github/skills/*/skill.md. Conditionally loaded.
./.github/instructions/*.instructions.md. Path-pattern based.
./.github/prompts/*.prompt.md. Manual starting points.

Each has a place. Let's go through the high-leverage ones.
These are your always-on guidance, the proactive human-in-the-loop signal. Three things belong in them:
Critical rules: keep them small, don't use AI to generate them, and recreate them often. Research shows that "be concise" performs nearly as well as a 50-line "caveman" skill. AI-generated instructions bloat. Write them yourself, iterate, throw them away. The Copilot CLI team rewrites their entire instructions file every three months as a living document.
A custom agent forces the model into a specific role or workflow, for example a /tdd-red agent that only writes failing tests. The harness retrieves the agent file, injects the definition, restricts the available tools, and appends your prompt.
The token savings are modest (input is cached). The real win is preventing wrong paths. Restricting an agent to read-only access on GitHub issues, for instance, eliminates an entire class of mistakes.
Skills are conditionally loaded markdown. The harness puts the description of every skill into context; the LLM tells the harness when it needs the full skill loaded.
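The loading behavior described here can be sketched as follows. Skill and build_context are hypothetical names, not a harness API; the sketch only shows that descriptions are always present while bodies appear on request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # always placed in context
    body: str         # loaded only when the model asks for it


def build_context(skills: list[Skill], requested: set[str]) -> str:
    """Assemble the skill portion of the context for one turn."""
    parts = []
    for skill in skills:
        parts.append(f"{skill.name}: {skill.description}")
        if skill.name in requested:
            parts.append(skill.body)
    return "\n".join(parts)


skills = [
    Skill("release-notes", "Drafts release notes from merged PRs.", "STEP 1 ... " * 50),
    Skill("db-migration", "Writes and reviews schema migrations.", "RULE A ... " * 50),
]

idle = build_context(skills, requested=set())
active = build_context(skills, requested={"db-migration"})

print(len(idle.split()), len(active.split()))
```

Every description costs tokens on every turn whether or not the skill is used, so the descriptions are the part worth keeping short and precise.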
Two pitfalls:
MCPs add external tools and API calls. The harness offers tool descriptions to the LLM, which invokes them when needed.
Be rigorous. MCPs bloat tool descriptions and can lead to undesired tool calls. Deactivate MCPs you don't always need, or wrap them inside custom agents that scope when they're active.
The Playwright MCP is the canonical example: powerful for frontend work, but expensive (screenshots, page reads, full DOM parsing). If always-on, it triggers unnecessary work for trivial CSS changes. Pair it with a custom agent that only activates it when you're doing real UI work.
A subagent opens a second context window for a specific task — research, document summarization, etc. — and returns a compact summary to the main session. This keeps the main context clean.
The trade-off: more tokens are spent inside the subagent. It's a conditional optimization. Use it when the alternative is polluting your main session with hundreds of irrelevant files.
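A toy model of that trade-off, counting words as a proxy for tokens. run_subagent is a hypothetical stand-in that "summarizes" by keeping the first sentence of each document; a real subagent would call a model.

```python
def run_subagent(documents: list[str], max_summary_words: int = 40) -> str:
    """Hypothetical subagent: reads every document in its own context window
    and hands back only a short summary."""
    # Stand-in for an LLM summary: keep the first sentence of each document.
    first_sentences = [doc.split(".")[0].strip() for doc in documents if doc.strip()]
    summary_words = " ".join(first_sentences).split()
    return " ".join(summary_words[:max_summary_words])


def word_count(texts: list[str]) -> int:
    return sum(len(text.split()) for text in texts)


documents = [
    "The upload client retries three times. " + "Unrelated detail. " * 200,
    "Timeouts are configured per request. " + "More unrelated detail. " * 200,
]
summary = run_subagent(documents)

spent_in_subagent = word_count(documents)  # paid once, inside the subagent
added_to_main = word_count([summary])      # all the main session has to carry

print(spent_in_subagent, added_to_main)
```

The reading cost does not disappear; it moves into a context that is thrown away, so the main session carries only the summary through its remaining loops.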
For orchestrators running hundreds or thousands of agents, additional levers exist — though they trade quality for token savings and require careful testing:
Prefer a CLI such as gh where one exists. A CLI invocation can be leaner than the equivalent MCP, because the model doesn't need static tool descriptions injected.
Run /chronicle tip regularly in Copilot CLI.

Zooming out from the tactical playbook, three durable traits separate developers who'll thrive in the agent era from those who won't.
Coding itself was never the true source of developer value. Analytical thinking and deep domain proficiency were. Agents can write code; they can't decide what should be built, in what domain language, with what trade-offs. The ability to tell an agent precisely what to do, in the language of the domain, is the most valuable skill. Invest there.
Domain-Driven Design, Hexagonal Architecture, CQRS, Event-Driven Design — these matter more now, not less. Good architecture:
The old debates — five-line functions versus ten, semicolons, comment style — are noise. Architecture is signal.
Treat this with an engineering mindset. Keep configs fresh. Treat every agent miss like an incident: log it, fix the underlying instruction or skill, prevent recurrence. Use /chronicle regularly in the CLI to surface patterns. This is continuous engineering work, not a one-time setup.
You are now a context engineer. That's the job.
If you take nothing else from this, take these five:
copilot-instructions.md.

The whole discipline reduces to one principle:
Provide as little context as possible, but as much as required.
Token cost optimization isn't really about tokens. It's about quality, precision, and engineering rigor applied to a new substrate. The teams that internalize this — that stop counting tokens and start making every token count — will out-ship, out-quality, and out-economize everyone still gambling with cheap agents.
I'm happy to answer your questions and to help your team or organization with agent quality and token optimization techniques. Send me a message on LinkedIn.