The Context Tax: Why Step 12 Costs 42x Step 1 (Measure It in 40 Lines)

A developer created a 40-line Python script called context_tax.py to measure the 'context tax' in AI agent sessions, where each step re-sends the entire conversation history as input, causing total token cost to grow quadratically. The script found that in a synthetic debugging session, the last step billed 42.8 times the input of the first step, highlighting that falling token prices do not fix the compounding cost structure.

In short: the context tax is what you pay when every agent step re-sends the whole session transcript as input again, so step N re-bills turns 1..N and total cost grows with n n+1 /2. Cheaper tokens lower the unit, not the shape. context tax.py meters the re-bill multiplier offline; one debugging session measured 42.8x . AI disclosure:I drafted this with an AI writing assistant. The tool, the fixtures, and every number below come from a real local run of the script in this post on tiktoken o200k base. I reviewed and edited it before publishing. Token prices have been sliding all year. Your agent bill probably hasn't. I kept running into the same confusion in my own FinOps notes: per-token rates drop, and the monthly number goes the other way. The usual answers "you're using a bigger model," "you have more users" didn't explain a single session getting more expensive as it ran . So I wrote a 40-line meter to look at the one thing nobody charts: the session transcript itself. On a synthetic-but-realistic debugging session, the last step billed 42.8x the input of the first step. Same model. Same task. No new users. That gap has a boring cause and an annoying consequence. Here's both, plus the script. TL;DR. Every step of an agent loop re-sends the whole conversation so far history plus tool outputs as input . So step N pays for turns 1..N again, and total input grows roughly with n n+1 /2. Cheaper tokens don't fix the shape; they just lower the unit on a number that's still climbing. context tax.py below, keyless, offline meters three things from a session JSON: the re-bill curve, the re-bill multiplier, and a dead-weight estimate. On my bloated fixture it reported a 42.8x multiplier and 19.3% dead weight, and exited 1 as a CI gate. Here's the part that trips people up. An LLM call is stateless. The model doesn't "remember" turn 3 when you make turn 12. Your framework re-sends turns 1 through 11 as input so the model can see them. Every. Single. Step. So the cost of one step isn't the cost of that step's new text. It's the cost of the entire history up to that point. Step 1 bills a short user message. Step 12 bills the user message plus a file dump plus a wide grep plus a stack trace plus every assistant reply in between. The new tokens at step 12 might be tiny. The billed input is not. Logan Waxell put the shape plainly in The Compounding Math Your Architecture Is Hiding : "total cost grows roughly with n n+1 /2," and a turn-10 context can sit at 80,000–200,000 tokens. That post nails the problem and then points you at a proprietary runtime. I wanted the opposite: a tiny script I can run on my own transcript and check into CI. So that's what this is. And it's why "tokens got cheaper" is the wrong consolation. Edwin Lisowski's Token Prices Are Falling. So Why Is Your AI Bill Going Up? lists the drivers: full context re-sent each step, tool schemas eating 30–60% of the window before any user content, retries and sub-agents running around the clock. That schema overhead is a sibling tax worth metering on its own — I did exactly that for MCP servers in There's a second reason to meter instead of guess. Agents are bad at predicting their own spend. The arXiv paper How Do AI Agents Spend Your Money? Bai, Huang, Wang, Sun, Mihalcea, Brynjolfsson, Pentland, Pei measured agentic coding tasks and found three things worth pinning to the wall: agentic runs burn roughly So the takeaway writes itself: meter the transcript, don't trust the estimate. If the model can't call its own number, your gut can't either. context tax.py reads one JSON file: a session transcript as a list of turns role + content , tool results included . It tokenizes with tiktoken 's o200k base and reports four things. The exit code is the point. 0 if the multiplier is under threshold a disciplined session , 1 if it's over the architecture is compounding, so fail the build , 2 for usage. Drop it in CI and a session that balloons becomes a red check, not a surprise line item. bash /usr/bin/env python3 """context tax.py - meter the re-bill tax on a single agent session's transcript.""" import json, re, sys THRESHOLD = 12.0 re-bill multiplier above this = compounding architecture DEAD OVERLAP = 0.15 a turn is dead weight if <15% of its terms resurface later STOP = set "the a an of to in is it on for and or but with as at by from this that be are was you your i we they it's".split try: import tiktoken enc = tiktoken.get encoding "o200k base" def count t : return len enc.encode t TOKENIZER = "tiktoken o200k base exact " except Exception: honest fallback, ~+-15% vs real BPE def count t : return max 1, round len t / 4 TOKENIZER = "len/4 heuristic tiktoken not installed; ~+-15% " def words t : return {w for w in re.findall r" a-z0-9 {4,}", t.lower if w not in STOP} def main argv : if len argv < 2: print "usage: context tax.py <session transcript.json " ; return 2 s = json.load open argv 1 , encoding="utf-8" rate = float s.get "input usd per mtok", 3.0 $/1M input tok; configurable, NOT a vendor quote turns = t "content" for t in s "turns" tok = count c for c in turns later = words " ".join turns i + 1: for i in range len turns billed, dead = , 0 for n in range len turns : step n re-bills the running history 1..n billed.append sum tok : n + 1 if n < len turns - 1: w = words turns n overlap = len w & later n / len w if w else 1.0 if overlap < DEAD OVERLAP: dead += tok n mult = billed -1 / billed 0 if billed 0 else 0 total billed = sum billed print f"context tax | {argv 1 } | tokenizer: {TOKENIZER} | rate=${rate}/Mtok | threshold x{THRESHOLD}" print "-" 78 for n, b in enumerate billed : bar = " " round b / billed -1 40 print f" step {n + 1: 2} billed input={b: 6}t {bar}" print "-" 78 print f" re-bill multiplier step {len billed } / step 1 : x{mult:.1f}" print f" dead-weight never referenced later : {dead}t = {dead / billed -1 100:.1f}% of the final payload" print f" total billed input across session : {total billed}t ${total billed / 1 000 000 rate:.4f} at ${rate}/Mtok " print f" exit : {1 if mult THRESHOLD else 0}" return 1 if mult THRESHOLD else 0 if name == " main ": sys.exit main sys.argv No key, no network, read-only. pip install tiktoken , point it at a transcript JSON, done. If tiktoken isn't installed it falls back to a len/4 heuristic and says so out loud ~±15% off real BPE . I'd rather print the caveat than pretend the number is exact. Two fixtures ship with the script. Both are synthetic coding sessions no private data but shaped like the real thing. session lean.json is a disciplined session: small tool outputs, and a deliberate scope reset before the second task. Here's the actual output: context tax | session lean.json | tokenizer: tiktoken o200k base exact | rate=$3.0/Mtok | threshold x12.0 ------------------------------------------------------------------------------ step 1 billed input= 25t step 2 billed input= 46t step 3 billed input= 75t ... step 10 billed input= 235t ------------------------------------------------------------------------------ re-bill multiplier step 10 / step 1 : x9.4 dead-weight never referenced later : 56t = 23.8% of the final payload total billed input across session : 1335t $0.0040 at $3.0/Mtok exit : 0 Multiplier 9.4x, under the 12x threshold, exit 0. Green. Note the dead weight is still 23.8%: that's the first task's context the model no longer needs in the second task. Even a clean session carries dead weight until you actually trim. The scope reset kept the multiplier down; it didn't zero the waste. session bloated.json is the one that hurts. A 12-step debugging session that never trims: a full module dump, a wide repo grep, a long stack trace, and the kicker, a verbose pip check dependency log that gets re-sent on every step after it. Real output: context tax | session bloated.json | tokenizer: tiktoken o200k base exact | rate=$3.0/Mtok | threshold x12.0 ------------------------------------------------------------------------------ step 1 billed input= 40t step 2 billed input= 72t step 3 billed input= 421t step 4 billed input= 480t step 5 billed input= 857t ... step 12 billed input= 1713t ------------------------------------------------------------------------------ re-bill multiplier step 12 / step 1 : x42.8 dead-weight never referenced later : 331t = 19.3% of the final payload total billed input across session : 11774t $0.0353 at $3.0/Mtok exit : 1 42.8x. Over threshold, exit 1: a failed build. Watch step 3 in the curve. The full file dump jumps billed input from 72 to 421 tokens, and you pay that bump again on every one of the nine steps that follow. The 331 dead-weight tokens are mostly that pip check log boto3 versions, urllib3 pins that never came up again but kept riding along in the payload. Both numbers are reproducible. I hashed two consecutive bloated runs with shasum -a 256 and got identical digests, so the output is deterministic, not a fluke of one run. One honest correction. I'd guessed the multiplier would land near 16x when I started that's the figure floating around the n n+1 /2 discussions . The real run said 42.8x. The bloated fixture front-loads a big file dump on a small first turn, which stretches the ratio. The lesson isn't "16x vs 42x." It's that the number depends entirely on your transcript shape, which is exactly why you measure your own instead of borrowing mine. The fixes aren't exotic. The point of the meter is to tell you which one you need, and to prove it worked. pip check dump was 19.3% dead weight. Replace a 300-token log with a one-line "deps OK, no conflicts" and you stop re-billing it nine times.Then re-run the meter. If the multiplier drops back under threshold, the exit code flips to 0 and your CI gate goes green. That's the whole loop: measure, cut, prove. Not "trust me, I optimized it," but a number that moved. This meter slots in alongside the other checks in my pre-execution gate for AI agents https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/ — same philosophy, fail fast before the spend, not after the invoice. $/session figure uses a rate What's the worst re-bill multiplier you've measured on one of your own long sessions? Run the script on a real transcript and tell me in the comments. I'm collecting shapes, and I read every reply. Follow for the next number from the next run.