In short: the context tax is what you pay when every agent step re-sends the whole session transcript as input again, so step N re-bills turns 1..N and total cost grows with n(n+1)/2. Cheaper tokens lower the unit, not the shape. context_tax.py
meters the re-bill multiplier offline; one debugging session measured 42.8x.
AI disclosure:I drafted this with an AI writing assistant. The tool, the fixtures, and every number below come from a real local run of the script in this post on tiktoken o200k_base. I reviewed and edited it before publishing.
Token prices have been sliding all year. Your agent bill probably hasn't.
I kept running into the same confusion in my own FinOps notes: per-token rates drop, and the monthly number goes the other way. The usual answers ("you're using a bigger model," "you have more users") didn't explain a single session getting more expensive as it ran. So I wrote a 40-line meter to look at the one thing nobody charts: the session transcript itself. On a synthetic-but-realistic debugging session, the last step billed 42.8x the input of the first step. Same model. Same task. No new users.
That gap has a boring cause and an annoying consequence. Here's both, plus the script.
TL;DR. Every step of an agent loop re-sends the whole conversation so far (history plus tool outputs) as input. So step N pays for turns 1..N again, and total input grows roughly with n(n+1)/2. Cheaper tokens don't fix the shape; they just lower the unit on a number that's still climbing. context_tax.py
(below, keyless, offline) meters three things from a session JSON: the re-bill curve, the re-bill multiplier, and a dead-weight estimate. On my bloated fixture it reported a 42.8x multiplier and 19.3% dead weight, and exited 1 as a CI gate.
Here's the part that trips people up. An LLM call is stateless. The model doesn't "remember" turn 3 when you make turn 12. Your framework re-sends turns 1 through 11 as input so the model can see them. Every. Single. Step.
So the cost of one step isn't the cost of that step's new text. It's the cost of the entire history up to that point. Step 1 bills a short user message. Step 12 bills the user message plus a file dump plus a wide grep plus a stack trace plus every assistant reply in between. The new tokens at step 12 might be tiny. The billed input is not.
Logan (Waxell) put the shape plainly in * The Compounding Math Your Architecture Is Hiding*: "total cost grows roughly with n(n+1)/2," and a turn-10 context can sit at 80,000–200,000 tokens. That post nails the problem and then points you at a proprietary runtime. I wanted the opposite: a tiny script I can run on my own transcript and check into CI. So that's what this is.
And it's why "tokens got cheaper" is the wrong consolation. Edwin Lisowski's * Token Prices Are Falling. So Why Is Your AI Bill Going Up?* lists the drivers: full context re-sent each step, tool schemas eating 30–60% of the window before any user content, retries and sub-agents running around the clock. That schema overhead is a sibling tax worth metering on its own — I did exactly that for MCP servers in
There's a second reason to meter instead of guess. Agents are bad at predicting their own spend.
The arXiv paper * How Do AI Agents Spend Your Money?* (Bai, Huang, Wang, Sun, Mihalcea, Brynjolfsson, Pentland, Pei) measured agentic coding tasks and found three things worth pinning to the wall: agentic runs burn roughly
So the takeaway writes itself: meter the transcript, don't trust the estimate. If the model can't call its own number, your gut can't either.
context_tax.py
reads one JSON file: a session transcript as a list of turns (role
content
, tool results included). It tokenizes with tiktoken
's o200k_base
and reports four things.
The exit code is the point. 0
if the multiplier is under threshold (a disciplined session), 1
if it's over (the architecture is compounding, so fail the build), 2
for usage. Drop it in CI and a session that balloons becomes a red check, not a surprise line item.
#!/usr/bin/env python3
"""context_tax.py - meter the re-bill tax on a single agent session's transcript."""
import json, re, sys
THRESHOLD = 12.0 # re-bill multiplier above this = compounding architecture
DEAD_OVERLAP = 0.15 # a turn is dead weight if <15% of its terms resurface later
STOP = set("the a an of to in is it on for and or but with as at by from this that be are was you your i we they it's".split())
try:
import tiktoken
_enc = tiktoken.get_encoding("o200k_base")
def count(t): return len(_enc.encode(t))
TOKENIZER = "tiktoken o200k_base (exact)"
except Exception: # honest fallback, ~+-15% vs real BPE
def count(t): return max(1, round(len(t) / 4))
TOKENIZER = "len/4 heuristic (tiktoken not installed; ~+-15%)"
def words(t): return {w for w in re.findall(r"[a-z0-9_]{4,}", t.lower()) if w not in STOP}
def main(argv):
if len(argv) < 2:
print("usage: context_tax.py <session_transcript.json>"); return 2
s = json.load(open(argv[1], encoding="utf-8"))
rate = float(s.get("input_usd_per_mtok", 3.0)) # $/1M input tok; configurable, NOT a vendor quote
turns = [t["content"] for t in s["turns"]]
tok = [count(c) for c in turns]
later = [words(" ".join(turns[i + 1:])) for i in range(len(turns))]
billed, dead = [], 0
for n in range(len(turns)): # step n re-bills the running history 1..n
billed.append(sum(tok[: n + 1]))
if n < len(turns) - 1:
w = words(turns[n])
overlap = len(w & later[n]) / len(w) if w else 1.0
if overlap < DEAD_OVERLAP:
dead += tok[n]
mult = billed[-1] / billed[0] if billed[0] else 0
total_billed = sum(billed)
print(f"context_tax | {argv[1]} | tokenizer: {TOKENIZER} | rate=${rate}/Mtok | threshold x{THRESHOLD}")
print("-" * 78)
for n, b in enumerate(billed):
bar = "#" * round(b / billed[-1] * 40)
print(f" step {n + 1:>2} billed_input={b:>6}t {bar}")
print("-" * 78)
print(f" re-bill multiplier (step {len(billed)} / step 1) : x{mult:.1f}")
print(f" dead-weight (never referenced later) : {dead}t = {dead / billed[-1] * 100:.1f}% of the final payload")
print(f" total billed input across session : {total_billed}t (${total_billed / 1_000_000 * rate:.4f} at ${rate}/Mtok)")
print(f" exit : {1 if mult > THRESHOLD else 0}")
return 1 if mult > THRESHOLD else 0
if __name__ == "__main__":
sys.exit(main(sys.argv))
No key, no network, read-only. pip install tiktoken
, point it at a transcript JSON, done. If tiktoken
isn't installed it falls back to a len/4 heuristic and says so out loud (~±15% off real BPE). I'd rather print the caveat than pretend the number is exact.
Two fixtures ship with the script. Both are synthetic coding sessions (no private data) but shaped like the real thing.
** session_lean.json** is a disciplined session: small tool outputs, and a deliberate scope reset before the second task. Here's the actual output:
context_tax | session_lean.json | tokenizer: tiktoken o200k_base (exact) | rate=$3.0/Mtok | threshold x12.0
------------------------------------------------------------------------------
step 1 billed_input= 25t ####
step 2 billed_input= 46t ########
step 3 billed_input= 75t #############
...
step 10 billed_input= 235t ########################################
------------------------------------------------------------------------------
re-bill multiplier (step 10 / step 1) : x9.4
dead-weight (never referenced later) : 56t = 23.8% of the final payload
total billed input across session : 1335t ($0.0040 at $3.0/Mtok)
exit : 0
Multiplier 9.4x, under the 12x threshold, exit 0. Green. Note the dead weight is still 23.8%: that's the first task's context the model no longer needs in the second task. Even a clean session carries dead weight until you actually trim. The scope reset kept the multiplier down; it didn't zero the waste.
** session_bloated.json** is the one that hurts. A 12-step debugging session that never trims: a full module dump, a wide repo grep, a long stack trace, and the kicker, a verbose
pip check
dependency log that gets re-sent on every step after it. Real output:
context_tax | session_bloated.json | tokenizer: tiktoken o200k_base (exact) | rate=$3.0/Mtok | threshold x12.0
------------------------------------------------------------------------------
step 1 billed_input= 40t #
step 2 billed_input= 72t ##
step 3 billed_input= 421t ##########
step 4 billed_input= 480t ###########
step 5 billed_input= 857t ####################
...
step 12 billed_input= 1713t ########################################
------------------------------------------------------------------------------
re-bill multiplier (step 12 / step 1) : x42.8
dead-weight (never referenced later) : 331t = 19.3% of the final payload
total billed input across session : 11774t ($0.0353 at $3.0/Mtok)
exit : 1
42.8x. Over threshold, exit 1: a failed build. Watch step 3 in the curve. The full file dump jumps billed input from 72 to 421 tokens, and you pay that bump again on every one of the nine steps that follow. The 331 dead-weight tokens are mostly that pip check
log (boto3 versions, urllib3 pins) that never came up again but kept riding along in the payload.
Both numbers are reproducible. I hashed two consecutive bloated runs with shasum -a 256
and got identical digests, so the output is deterministic, not a fluke of one run.
One honest correction. I'd guessed the multiplier would land near 16x when I started (that's the figure floating around the n(n+1)/2 discussions). The real run said 42.8x. The bloated fixture front-loads a big file dump on a small first turn, which stretches the ratio. The lesson isn't "16x vs 42x." It's that the number depends entirely on your transcript shape, which is exactly why you measure your own instead of borrowing mine.
The fixes aren't exotic. The point of the meter is to tell you which one you need, and to prove it worked.
pip check
dump was 19.3% dead weight. Replace a 300-token log with a one-line "deps OK, no conflicts" and you stop re-billing it nine times.Then re-run the meter. If the multiplier drops back under threshold, the exit code flips to 0 and your CI gate goes green. That's the whole loop: measure, cut, prove. Not "trust me, I optimized it," but a number that moved. This meter slots in alongside the other checks in my pre-execution gate for AI agents — same philosophy, fail fast before the spend, not after the invoice.
$/session
figure uses a rate What's the worst re-bill multiplier you've measured on one of your own long sessions? Run the script on a real transcript and tell me in the comments. I'm collecting shapes, and I read every reply. Follow for the next number from the next run.