How much does context cost an AI coding agent? grep vs graph vs LSP, measured across 936 runs

A developer measured the token cost of context provisioning for AI coding agents across 936 runs on the Apache Superset repository, comparing grep, graph-based, and LSP-based approaches. The results show that the optimal method depends on task difficulty and model choice, with no single approach dominating across all scenarios.

In my last post (https://dev.to/neko1313_4/graphlens-a-polyglot-code-analysis-framework-that-turns-your-repo-into-a-typed-graph-4mhi) I described graphlens (what it does, how it works), and along the way I casually claimed that an agent "burns tokens grepping around a repo." I gave exactly zero numbers to back that up. This post fixes that. Here are the measurements, the data, and a reproducible harness. Spoiler: the conclusion is not the one I expected going in, and that's the interesting part.

I took one agent (Claude Code), changed exactly one thing (which MCP server feeds it code context), and ran it over 26 tasks on apache/superset. Four "arms": filesystem (grep + read), graphlens (structural graph), serena (LSP), and codegraph. Three models (haiku / sonnet / opus), three seeds: 936 runs.

The headline: the answer flips depending on the kind of task. If I'd only measured the easy tasks, I'd have written "you don't need a graph, grep is fine." If only the hard ones, "you don't need grep, get a graph." The truth sits in the middle, and it's about what work you hand the agent.

Picture a familiar situation. You have a large project: hundreds of thousands of lines, a Python backend, a TypeScript front end, legacy code you're scared to touch. You wire an AI agent into it for review, refactoring, and answering questions like "what breaks if I change this method's signature?"

The agent can't see the whole repo at once. Something has to feed it context: which functions live where, who calls whom, what inherits from what. And here's an architectural decision with a price tag: what exactly do you feed it? There are basically four classes of answer:

Each option costs money (tokens), time (latency), and risk (the agent gives up and hits a turn cap).

apache/superset is an almost perfect stand-in for this case: ~400k LOC, Python + TypeScript, an /api/v1/... boundary between front and back. A big polyglot project is exactly when this question is worth asking.
So how much does each option cost? Let's measure.

The whole methodology rests on one principle: fix everything except one thing. Model, system prompt, settings, and task set are constants. Only the context-providing MCP server changes. Then any difference in the numbers is the contribution of that tool, not a config accident.

No tool is designated "the baseline to beat." All four are measured on equal footing, and the numbers rank them.

| Arm | Context provider (MCP server) | Indexing step |
|---|---|---|
| filesystem | @modelcontextprotocol/server-filesystem (read file + grep) | none |
| graphlens | graphlens (graph over MCP) | graphlens analyze |
| serena | Serena (LSP) | LSP workspace warm-up |
| codegraph | a graph-based competitor | codegraph init |

One detail that matters for fairness: Claude Code's built-in tools (Read / Grep / Bash, etc.) are disabled. If you don't take them away, the agent ignores the MCP server and falls back to its usual path, and you'd be measuring the wrong thing. So the harness runs claude -p in a clean room: a fresh CLAUDE_CONFIG_DIR with only subscription credentials (no hooks, plugins, skills, memory), --strict-mcp-config (only this arm's server is visible), --disallowedTools on every built-in (an explicit deny, because in headless mode an allow-list alone forbids nothing), and --allowedTools mcp__<server> to auto-approve the one server.

In parallel I varied the model answering the question:

| Key | model id |
|---|---|
| haiku | claude-haiku-4-5 |
| sonnet | claude-sonnet-4-6 |
| opus | claude-opus-4-8 |

Why a second axis? That becomes clear near the end: the optimal tool depends on which model you picked. That's probably the least obvious finding in the whole thing.

Total: 4 arms × 3 models × 26 tasks × 3 seeds = 936 runs on Claude Code 2.1.187.

Benchmarks are easy to bend toward the conclusion you want. So the rules are fixed up front; without them the numbers aren't trustworthy:

6.0.0 every task carries a file:line reference. Crucially, ast .
filesystem is grep + read, not "an agent with no tools." Naive ≠ toolless.

temperature=0 does not make these models deterministic. So 3 seeds, and the report shows

cost_usd is an API-equivalent, not your bill. cost_usd emitted by the CLI is what the same tokens

NO TOOLS . Answering "from memory" about a well-known repo wouldn't measure the context provider.

And separately: failure counts as accuracy 0. If grep hits the 50-turn cap and never produces an answer, that's not "no data"; it's "the tool didn't get there within budget." That's how it's scored.

26 tasks split into two classes.

SIMPLE: 20 pinpoint lookups ("where is X defined / what does X inherit from"). One-point answers, checked by substring:

| Kind | n | What it probes |
|---|---|---|
| where_defined | 7 | Python class → defining file |
| inherits_from | 5 | Python class → base class |
| abstract_methods | 1 | ABC → its abstract methods |
| ts_where_defined | 1 | TS hook → defining file |
| ts_route_call | 4 | /api/v1/... route → the TS hook that calls it |
| xlang_link | 2 | TS consumer → Python handler across the API boundary |

HARD: 6 blast-radius and disambiguation tasks. This is the regime where structure and semantics should beat text search, and which pinpoint lookups simply can't measure:

| Kind | n | What it probes | Scoring |
|---|---|---|---|
| disambiguate | 2 | an ambiguous bare method name (e.g. cache_key, defined on many classes) → the right class | substring |
| overrides_count | 2 | the full set of subclasses overriding a base method | set F1 |
| impact_set | 2 | every file calling a given method (the blast radius) | set F1 |

Set tasks are scored by F1: a reward for recall (find them all) and a penalty for low precision (text search loves to dump every occurrence of .get_indexes). Gold sets are kept small (3–5 elements, one ≈17) so they can be exhaustively checked by hand.

The set is deliberately unbalanced: 20 simple vs 6 hard.
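
The set-F1 scoring described above can be sketched in a few lines. This is a minimal illustration, not the harness's actual scorer: the function name `set_f1`, the file names, and both example answers are hypothetical. It assumes an empty answer (for example, a run that hit the turn cap) scores 0, in line with the "failure counts as accuracy 0" rule.

```python
def set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between a predicted set and a hand-checked gold set."""
    # An empty answer (e.g. the run never finished) or an empty gold set scores 0.
    if not predicted or not gold:
        return 0.0
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)  # how much of the answer is correct
    recall = hits / len(gold)          # how much of the gold set was found
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold set of four calling files.
gold = {"a.py", "b.py", "c.py", "d.py"}

# A precise answer that misses one file.
exact = {"a.py", "b.py", "c.py"}

# A "dump every occurrence" answer: full recall plus 12 false hits.
dump = gold | {f"noise_{i}.py" for i in range(12)}

print(round(set_f1(exact, gold), 2))  # 0.86
print(round(set_f1(dump, gold), 2))   # 0.4
```

The second answer finds every gold file and still scores well below the first, which is the behaviour the hard tasks rely on: recall alone is not enough.
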
A single blended average would be entirely dictated by the easy tasks and would hide exactly the difference the hard ones expose. So I report each regime separately, and never mix them.

And no, I deliberately don't "balance to 50/50" by dropping simple tasks. That would throw away data and statistical power, and open the door to cherry-picking. Stratification neutralizes the skew without discarding data. General principle: if regimes give different answers, it's more honest to show both than to bury the conflict under an average.

SIMPLE regime (20 pinpoint lookups):

| Tool | accuracy | complete | tokens | calls | $/task | sec |
|---|---|---|---|---|---|---|
| filesystem | 0.97 | 100% | 1780 | 10 | $0.063 | 43 |
| graphlens | 0.98 | 100% | 690 | 3 | $0.038 | 13 |
| serena | 0.99 | 100% | 402 | 3 | $0.031 | 20 |
| codegraph | 0.99 | 100% | 372 | 1 | $0.022 | 10 |

Accuracy is a tie (formally: Friedman χ²=0.40, not significant). The tools differ only on cost, a ~3× spread, and the terse ones win. graphlens is unremarkable here: a solid mid-pack.

This is exactly the story a benchmark that only measured pinpoint lookups would tell: "structural tools are nice, but grep nearly keeps up, and codegraph gives the cheapest answer." And it would be an incomplete truth.

HARD regime (6 blast-radius and disambiguation tasks):

| Tool | accuracy | complete | tokens | calls | $/task | sec |
|---|---|---|---|---|---|---|
| filesystem | 0.71 | 83% | 12596 | 27 | $0.424 | 165 |
| graphlens | 0.84 | 100% | 748 | 1 | $0.018 | 9 |
| serena | 0.85 | 98% | 1368 | 5 | $0.065 | 29 |
| codegraph | 0.93 | 100% | 1114 | 2 | $0.036 | 16 |

Now the tools separate.

grep collapses. It has the lowest accuracy (0.71), only 83% of runs finish (the rest hit the 50-turn cap), and the ones that finish cost 6–24× more ($0.42 vs $0.018–0.065) and take 6–18× longer (~165s vs 9–29s). Text search drowns in noise when the question is "every call to this" or "which of a dozen identically-named methods."

And the key bit: graphlens, the mid-pack tool on easy tasks, is here the cheapest ($0.018) and fastest (9s).
Its semantic graph finally pays off: one call instead of twenty-seven. The most accurate tool is codegraph (0.93). serena is competitive (0.85).

So the same graphlens that looked unremarkable on pinpoint lookups becomes the most economical the moment the work is real: blast radius, refactoring. The ranking inverts between regimes.

Fairness note. MCP resources are disabled for all arms. graphlens was the only server exposing resources, and in an early run the agent wandered into enumerating them and inflated cost ~24% until I denied them. All numbers above are from the clean re-run.

The cost difference is mostly how many times the agent calls the tool, which follows from how a server slices its primitives. On a simple "symbol → file" lookup (where_defined), one call is enough for everyone. The gap opens on relationship queries: inheritance, route → handler, cross-language links. There graphlens chains fine-grained primitives (find → neighbors → references), while codegraph packs "source + call paths in one shot" (explore / node).

This isn't a difference in what the graph knows; graphs know roughly the same things. It's a difference in API granularity: fewer round-trips → cheaper and faster. That's why codegraph has the efficiency edge on simple tasks, and why grep bankrupts itself on hard ones: it makes 27 round-trips where the graph needs one or two.

This is the least obvious part. Take median $/task across both regimes, broken down by model:

| Tool | haiku | sonnet | opus |
|---|---|---|---|
| filesystem | $0.053 | $0.080 | $0.087 |
| graphlens | $0.020 | $0.041 | $0.046 |
| serena | $0.026 | $0.033 | $0.042 |
| codegraph | $0.023 | $0.041 | $0.031 |

Cheapest-first ranking within each model, read off the table: on haiku, graphlens < codegraph < serena < filesystem; on sonnet, serena < graphlens = codegraph < filesystem; on opus, codegraph < serena < graphlens < filesystem.

Watch what happens to graphlens. On haiku it's the cheapest of all. On opus it becomes the most expensive of the structural tools (still cheaper than grep, though). The mechanism: graphlens results are token-heavy (graph neighborhoods, reference lists).
On a cheap model that verbose context is nearly free; on an expensive one, opus prices the same tokens far higher, and verbosity hits the wallet. serena and codegraph stay cheap on any model because they return pinpoint results. They're robust to model choice; graphlens isn't.

Which gives the most valuable takeaway of the lot: a cheap model on a structural tool beats an expensive model on grep. codegraph + haiku (~$0.023, accuracy ~0.99) beats filesystem + opus (~$0.087, accuracy 0.93) on every axis at once.

I planted the two xlang_link tasks as a stress test: a TS call resolves to a Python handler across the /api/v1/... boundary, and I was sure single-language tools would trip on it.

They didn't. Every arm, grep included, solved both cross-language tasks. The agent steps across the boundary itself, regardless of the context provider. On this set the hypothesis failed, and I report that as loudly as the findings that held. A benchmark that only reports what it hoped to see isn't a benchmark.

Friedman test across the four tools, over task blocks, within each regime (df=3; critical values: 0.05 → 7.82, 0.01 → 11.34):

| Regime | Metric | n | χ² | Significance | Reading |
|---|---|---|---|---|---|
| SIMPLE | accuracy | 20 | 0.40 | n.s. | tie |
| SIMPLE | cost | 20 | 18.42 | p<.01 | serena < codegraph < graphlens < filesystem |
| HARD | accuracy | 6 | 3.50 | n.s. | underpowered |
| HARD | cost | 6 | 11.80 | p<.01 | graphlens < codegraph < serena < filesystem |

What's honest to claim from this:

I'm leaving this in the article on purpose. The temptation to write "graphlens/codegraph are more accurate than grep, proven" is real, but n=6 doesn't carry it, and pretending otherwise would be dishonest.

The structural tools build an index once (pure static work, zero LLM tokens, wall-clock only):

| Tool | one-time index |
|---|---|
| filesystem | 0s |
| codegraph | 48s |
| graphlens | 84s |
| serena | 94s |

grep pays nothing up front but pays more per query. These are different currencies (seconds vs $/tokens), so I draw no single "break-even point"; that'd be a stretch.
The picture is simple: the index is a one-time time cost with not a single token spent, while the $/task savings drip on every task. Over a long session the structural tools amortize; on a couple of one-off queries, grep's zero setup can win on time-to-first-answer.

Back to the original question: what do you feed the agent on a large project?

There is no "this tool is always best" answer. There's a "depends on what work you hand it" answer:

And the honest caveats, without which you can't transfer the conclusions to your project: One repository (apache/superset @ 6.0.0), one harness, 26 tasks (20 simple / 6 hard). Regimes are reported separately and never blended. cost_usd is an API-equivalent, not a subscription bill. Failure = accuracy 0. This is not a universal ranking; it's a reproducible measurement on one concrete case.

Since this is a follow-up to the graphlens post (https://dev.to/neko1313_4/graphlens-a-polyglot-code-analysis-framework-that-turns-your-repo-into-a-typed-graph-4mhi), let me say it straight.

This benchmark does not prove graphlens is "the best." It shows the specific regime where its structural graph pays off (impact analysis, cheap and fast on cheaper models), and just as plainly shows where it lags (on opus its verbose output costs more than codegraph and serena; codegraph is more accurate on hard tasks). For me that's more useful than any victory lap.

graphlens was built as an engine (and a precise polyglot graph model), not a turnkey app, and the benchmark confirms exactly that: on structural questions the graph beats text search by a wide margin, and there's clear room to grow in MCP tool granularity (fewer round-trips, like codegraph) and output compactness (so it doesn't bankrupt itself on expensive models). That's my next work item, now backed by numbers instead of intuition.

The whole harness and the raw data are open. A run reassembles deterministically from data/. See metrics.ipynb (all charts and per-section stats) and README.md (methodology).
uv run main.py runs the full pipeline (clone superset → build indices → 936 runs, resumable within subscription limits), then open metrics.ipynb.

If you've got a large project of your own and the itch to run the harness on it, issues and results are welcome. The more independent runs across different codebases, the closer we get to an answer that transfers, rather than "works on superset."
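
For anyone adapting the harness to another repository, the 936-run matrix and the resumable loop mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in main.py: the task ids, the seed values, and the `run_matrix` / `pending` helpers are hypothetical names introduced here.

```python
from itertools import product

ARMS = ["filesystem", "graphlens", "serena", "codegraph"]
MODELS = ["haiku", "sonnet", "opus"]
TASKS = [f"task_{i:02d}" for i in range(1, 27)]  # placeholder ids for the 26 tasks
SEEDS = [0, 1, 2]                                # hypothetical seed values


def run_matrix() -> list[tuple[str, str, str, int]]:
    """Every (arm, model, task, seed) combination, in a stable order."""
    return list(product(ARMS, MODELS, TASKS, SEEDS))


def pending(matrix, done):
    """Runs still to execute, so an interrupted session can pick up where it stopped."""
    return [run for run in matrix if run not in done]


matrix = run_matrix()
print(len(matrix))  # 936

# Pretend the first 100 runs finished before a subscription limit was hit.
done = set(matrix[:100])
print(len(pending(matrix, done)))  # 836
```

Keying each run by its full (arm, model, task, seed) tuple means a restart never repeats paid work, and it makes gaps in the raw data easy to spot.
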