How much does context cost an AI coding agent? grep vs graph vs LSP, measured across 936 runs


In my last post I described graphlens — what it does, how it works — and along the way I casually claimed that an agent "burns tokens grepping around a repo." I gave exactly zero numbers to back that up.

This post fixes that. Here are the measurements, the data, and a reproducible harness. Spoiler: the conclusion is not the one I expected going in, and that's the interesting part.

I took one agent (Claude Code), changed exactly one thing (which MCP server feeds it code context), and ran it over 26 tasks on apache/superset. Four "arms": `filesystem` (grep + read), `graphlens` (structural graph), `serena` (LSP), and `codegraph`. Three models (haiku / sonnet / opus), three seeds: 936 runs.

The headline: the answer flips depending on the kind of task.

If I'd only measured the easy tasks, I'd have written "you don't need a graph, grep is fine." If only the hard ones, "you don't need grep, get a graph." The truth sits in the middle, and it's about what work you hand the agent.

Picture a familiar situation. You have a large project: hundreds of thousands of lines, a Python backend, a TypeScript front end, legacy code you're scared to touch. You wire an AI agent into it — for review, refactoring, answering questions like "what breaks if I change this method's signature?"

The agent can't see the whole repo at once. Something has to feed it context: which functions live where, who calls whom, what inherits from what. And here's an architectural decision with a price tag: what exactly do you feed it?

There are basically four classes of answer:

Each option costs money (tokens), time (latency), and risk (the agent gives up and hits a turn cap). apache/superset is an almost perfect stand-in for this case: ~400k LOC, Python + TypeScript, an `/api/v1/...` boundary between front and back. A big polyglot project, exactly when this question is worth asking.

So how much does each option cost? Let's measure.

The whole methodology rests on one principle: fix everything except one thing. Model, system prompt, settings, task set — constants. Only the context-providing MCP server changes. Then any difference in the numbers is the contribution of that tool, not a config accident.

No tool is designated "the baseline to beat." All four are measured on equal footing, and the numbers rank them.

Arm	Context provider (MCP server)	Indexing step
`filesystem`	`@modelcontextprotocol/server-filesystem` (read_file + grep)	none
`graphlens`	graphlens graph over MCP	`graphlens analyze`
`serena`	Serena (LSP)	LSP workspace warm-up
`codegraph`	a graph-based competitor	`codegraph init`

One detail that matters for fairness: Claude Code's built-in tools (Read / Grep / Bash, etc.) are disabled. If you don't take them away, the agent ignores the MCP server and falls back to its usual path, and you'd be measuring the wrong thing. So the harness runs `claude -p` in a clean room: a fresh `CLAUDE_CONFIG_DIR` with only subscription credentials (no hooks, plugins, skills, memory), `--strict-mcp-config` (only this arm's server is visible), `--disallowedTools` on every built-in (an explicit deny, because in headless mode an allow-list alone forbids nothing), and `--allowedTools mcp__<server>` to auto-approve the one server.
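As a rough illustration of that clean-room setup, here is a minimal sketch of how one run's command line could be assembled. The flags `-p`, `--strict-mcp-config`, `--disallowedTools`, `--allowedTools` and the `CLAUDE_CONFIG_DIR` variable are the ones named above; `--model` and `--mcp-config`, the function name, the paths, and the short built-in tool list are my assumptions, not the harness's actual code.

```python
import os

# Assumption: the text names Read / Grep / Bash "etc."; the full deny list
# used by the real harness is not given, so this list is only partial.
BUILTIN_TOOLS = ["Read", "Grep", "Bash"]


def build_invocation(server, model, prompt, mcp_config_path, config_dir):
    """Return (argv, env) for one clean-room run of a single arm.

    Nothing is executed here; the caller decides how to launch the process.
    """
    argv = [
        "claude", "-p", prompt,
        "--model", model,                    # assumed way of picking the model
        "--mcp-config", mcp_config_path,     # assumed: file declaring this arm's server
        "--strict-mcp-config",               # only this arm's server is visible
        "--disallowedTools", ",".join(BUILTIN_TOOLS),  # explicit deny of built-ins
        "--allowedTools", f"mcp__{server}",  # auto-approve the one server
    ]
    # Fresh config directory so no hooks, plugins, skills or memory leak in.
    env = dict(os.environ, CLAUDE_CONFIG_DIR=config_dir)
    return argv, env


if __name__ == "__main__":
    argv, env = build_invocation(
        "graphlens", "claude-haiku-4-5", "Where is X defined?",
        "/tmp/arm-graphlens.mcp.json", "/tmp/clean-config",  # hypothetical paths
    )
    print(" ".join(argv))
```

Returning the argument list instead of launching the process keeps the sketch easy to inspect and to diff between arms: only the server name and its config file should change.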

In parallel I varied the model answering the question:

Key	model id
`haiku`	`claude-haiku-4-5`
`sonnet`	`claude-sonnet-4-6`
`opus`	`claude-opus-4-8`

Why a second axis? It becomes clear near the end: the optimal tool depends on which model you picked. That's probably the least obvious finding in the whole thing.

Total: 4 arms × 3 models × 26 tasks × 3 seeds = 936 runs (on Claude Code 2.1.187).
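The run count is just the full cross product of the four axes. A small sketch that enumerates it; the task ids and seed values are placeholders of mine, only the axis sizes come from the text.

```python
from itertools import product

ARMS = ["filesystem", "graphlens", "serena", "codegraph"]
MODELS = ["haiku", "sonnet", "opus"]
TASKS = [f"task_{i:02d}" for i in range(1, 27)]  # hypothetical ids, 26 tasks
SEEDS = [0, 1, 2]                                # hypothetical seed values

# One tuple per run: (arm, model, task, seed).
runs = list(product(ARMS, MODELS, TASKS, SEEDS))

print(len(runs))  # 936
```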

Benchmarks are easy to bend toward the conclusion you want. So the rules are fixed up front; without them the numbers aren't trustworthy:

- […] `6.0.0` (every task carries a `file:line` reference). Crucially, `ast` […]
- `filesystem` is grep + read, not "an agent with no tools." Naive ≠ toolless.
- `temperature=0` does not make these models deterministic. So 3 seeds, and the report shows […]
- `cost_usd` is an API-equivalent, not your bill. `cost_usd` (emitted by the CLI) is what the same tokens […]
- […] (`__NO_TOOLS__`). Answering "from memory" about a well-known repo wouldn't measure the context provider.

And separately: failure counts as accuracy 0. If grep hits the 50-turn cap and never produces an answer, that's not "no data"; it's "the tool didn't get there within budget." That's how it's scored.

26 tasks split into two classes.

SIMPLE — 20 pinpoint lookups ("where is X defined / what does X inherit from"). One-point answers, checked by substring:

Kind	#	What it probes
`where_defined`	7	Python class → defining file
`inherits_from`	5	Python class → base class
`abstract_methods`	1	ABC → its abstract methods
`ts_where_defined`	1	TS hook → defining file
`ts_route_call`	4	`/api/v1/...` route → the TS hook that calls it
`xlang_link`	2	TS consumer → Python handler across the API boundary

HARD — 6 blast-radius and disambiguation tasks. This is the regime where structure and semantics should beat text search — and which pinpoint lookups simply can't measure:

Kind	#	What it probes	Scoring
`disambiguate`	2	an ambiguous bare method name (e.g. `cache_key`, defined on many classes) → the right class	substring
`overrides_count`	2	the full set of subclasses overriding a base method	set F1
`impact_set`	2	every file calling a given method (the blast radius)	set F1

Set tasks are scored by F1, which rewards recall (find them all) and penalizes low precision (text search loves to dump every occurrence of `.get_indexes(`). Gold sets are kept small (3–5 elements, one ≈17) so they can be exhaustively checked by hand.
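The two scoring modes can be sketched in a few lines. This is a minimal illustration, not the harness's scorer: the function names are mine, the case-insensitive match is an assumption, and a failed run (no answer, for example after hitting the turn cap) is represented as `None` and scored 0.

```python
def score_substring(answer, gold):
    """1.0 if the gold string occurs in the answer, else 0.0.

    Assumption: matching is case-insensitive. A failed run (None) scores 0.
    """
    if answer is None:
        return 0.0
    return 1.0 if gold.lower() in answer.lower() else 0.0


def score_set_f1(predicted, gold):
    """Set F1: harmonic mean of precision and recall over two sets."""
    if predicted is None:          # failed run counts as accuracy 0
        return 0.0
    pred, gold = set(predicted), set(gold)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical blast-radius task: 4 real callers, and an answer that finds
# all of them but also dumps 6 unrelated text matches.
gold = {"a.py", "b.py", "c.py", "d.py"}
noisy = gold | {f"noise_{i}.py" for i in range(6)}
print(round(score_set_f1(noisy, gold), 3))  # 0.571
```

In the hypothetical example recall is perfect, yet the six extra matches drag precision to 0.4 and F1 to about 0.57, which is the penalty for dumping every textual occurrence.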

The set is deliberately unbalanced — 20 simple vs 6 hard. A single blended average would be entirely dictated by the easy tasks and would hide exactly the difference the hard ones expose. So I report each regime separately, and never mix them.

And no, I deliberately don't "balance to 50/50" by dropping simple tasks. That would throw away data and statistical power, and open the door to cherry-picking. Stratification neutralizes the skew without discarding data. (General principle: if regimes give different answers, it's more honest to show both than to bury the conflict under an average.)
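To see how a blended average would hide the hard regime, here is a small sketch using the per-regime accuracies reported in the results tables for two of the tools. It assumes each per-regime figure is a plain per-task mean, so that weighting by task count reproduces the blended number.

```python
N_SIMPLE, N_HARD = 20, 6

# (simple accuracy, hard accuracy) as reported in the results tables.
accuracy = {
    "filesystem": (0.97, 0.71),
    "codegraph": (0.99, 0.93),
}


def blended(simple, hard):
    """Task-count-weighted mean over both regimes (assumed per-task means)."""
    return (N_SIMPLE * simple + N_HARD * hard) / (N_SIMPLE + N_HARD)


gap_hard = accuracy["codegraph"][1] - accuracy["filesystem"][1]
gap_blended = blended(*accuracy["codegraph"]) - blended(*accuracy["filesystem"])

print(f"gap on hard tasks:   {gap_hard:.2f}")     # 0.22
print(f"gap after blending:  {gap_blended:.3f}")  # 0.066
```

Under that assumption a 0.22 accuracy gap on the hard tasks shrinks to roughly 0.07 once 20 easy tasks are averaged in.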

SIMPLE regime, 20 tasks:

Tool	accuracy	complete	tokens	calls	$/task	sec
filesystem	0.97	100%	1780	10	$0.063	43
graphlens	0.98	100%	690	3	$0.038	13
serena	0.99	100%	402	3	$0.031	20
codegraph	0.99	100%	372	1	$0.022	10

Accuracy is a tie (formally: Friedman χ²=0.40, not significant). The tools differ only on cost — a ~3× spread — and the terse ones win. graphlens is unremarkable here — a solid mid-pack.

This is exactly the story a benchmark that only measured pinpoint lookups would tell: "structural tools are nice, but grep nearly keeps up, and codegraph gives the cheapest answer." And it would be an incomplete truth.

HARD regime, 6 tasks:

Tool	accuracy	complete	tokens	calls	$/task	sec
filesystem	0.71	83%	12596	27	$0.424	165
graphlens	0.84	100%	748	1	$0.018	9
serena	0.85	98%	1368	5	$0.065	29
codegraph	0.93	100%	1114	2	$0.036	16

Now the tools separate.

grep collapses. Lowest accuracy (0.71), only 83% of runs finish (the rest hit the 50-turn cap), and the ones that finish cost 6–24× more ($0.42 vs $0.018–0.065) and take 6–18× longer (~165s vs 9–29s). Text search drowns in noise when the question is "every call to this" or "which of a dozen identically-named methods."

And the key bit: graphlens — the mid-pack tool on easy tasks — is here the cheapest ($0.018) and fastest (9s). Its semantic graph finally pays off: one call instead of twenty-seven. The most accurate tool is codegraph (0.93). serena is competitive (0.85).

So the same graphlens that looked unremarkable on pinpoint lookups becomes the most economical the moment the work is real — blast radius, refactoring. The ranking inverts between regimes.
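The inversion can be read straight off the two results tables. A short sketch that sorts the tools by `$/task` in each regime and recomputes the grep multiples; the numbers are copied from the tables, the variable names are mine.

```python
# $/task per tool, copied from the SIMPLE and HARD results tables.
cost = {
    "simple": {"filesystem": 0.063, "graphlens": 0.038,
               "serena": 0.031, "codegraph": 0.022},
    "hard": {"filesystem": 0.424, "graphlens": 0.018,
             "serena": 0.065, "codegraph": 0.036},
}


def cheapest_first(regime):
    """Tool names ordered from lowest to highest $/task."""
    return sorted(cost[regime], key=cost[regime].get)


for regime in ("simple", "hard"):
    print(regime, " < ".join(cheapest_first(regime)))

# How many times more grep costs on hard tasks than each structural tool.
hard = cost["hard"]
multiples = {tool: hard["filesystem"] / price
             for tool, price in hard.items() if tool != "filesystem"}
print({tool: round(m, 1) for tool, m in multiples.items()})
```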

Fairness note. MCP `resources` are disabled for all arms. graphlens was the only server exposing resources, and in an early run the agent wandered into enumerating them and inflated cost ~24% until I denied them. All numbers above are from the clean re-run.

The cost difference is mostly how many times the agent calls the tool, which follows from how a server slices its primitives.

On a simple "symbol → file" (`where_defined`), one call is enough for everyone. The gap opens on relationship queries: inheritance, route → handler, cross-language links. There `graphlens` chains fine-grained primitives (`find` → `neighbors` → `references`), while `codegraph` packs "source + call paths in one shot" (`explore` / `node`).

This isn't a difference in what the graph knows — graphs know roughly the same things. It's a difference in API granularity: fewer round-trips → cheaper and faster. That's why codegraph has the efficiency edge on simple tasks, and why grep bankrupts itself on hard ones — it makes 27 round-trips where the graph needs one or two.
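One way to see why round-trips dominate is a toy model of a tool-calling loop, in which every model turn re-reads the conversation so far. This is purely illustrative: the base prompt size and per-result size are made-up numbers, and the model ignores prompt caching and output tokens, so it does not reproduce the measured costs.

```python
def total_input_tokens(calls, base=2_000, per_result=500):
    """Input tokens summed over all model turns of a tool-calling loop.

    Toy assumptions: the loop has `calls` tool calls followed by one final
    answering turn, and turn k (0-based) re-reads the base prompt plus the
    k tool results returned so far. No caching, no output tokens.
    """
    return sum(base + k * per_result for k in range(calls + 1))


one_call = total_input_tokens(1)      # 2000 + 2500
many_calls = total_input_tokens(27)   # 28 turns with growing history

print(one_call, many_calls, round(many_calls / one_call, 1))  # 4500 245000 54.4
```

In this model the total grows roughly with the square of the number of calls, which is why collapsing a chain of fine-grained primitives into one richer call can pay off even when the single response is larger.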

This is the least obvious part. Take median $/task (across both regimes) broken down by model:

Tool	haiku	sonnet	opus
filesystem	$0.053	$0.080	$0.087
graphlens	$0.020	$0.041	$0.046
serena	$0.026	$0.033	$0.042
codegraph	$0.023	$0.041	$0.031

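Cheapest-first ranking within each model:

- haiku: graphlens < codegraph < serena < filesystem
- sonnet: serena < graphlens = codegraph < filesystem
- opus: codegraph < serena < graphlens < filesystem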

Watch what happens to graphlens. On haiku it's the cheapest of all. On opus it becomes the most expensive of the structural tools (still cheaper than grep, though).

The mechanism: graphlens results are token-heavy — graph neighborhoods, reference lists. On a cheap model that verbose context is nearly free; on an expensive one, opus prices the same tokens far higher, and verbosity hits the wallet. serena and codegraph stay cheap on any model because they return pinpoint results — they're robust to model choice; graphlens isn't.
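A quick way to quantify "robust to model choice" is the ratio of each tool's opus cost to its haiku cost. The sketch below uses the median `$/task` values from the per-model table; the ratio itself is my framing, not a metric from the report.

```python
# Median $/task by model, copied from the per-model table.
median_cost = {
    "filesystem": {"haiku": 0.053, "sonnet": 0.080, "opus": 0.087},
    "graphlens": {"haiku": 0.020, "sonnet": 0.041, "opus": 0.046},
    "serena": {"haiku": 0.026, "sonnet": 0.033, "opus": 0.042},
    "codegraph": {"haiku": 0.023, "sonnet": 0.041, "opus": 0.031},
}

# How much more a tool costs per task when moving from haiku to opus.
opus_over_haiku = {tool: prices["opus"] / prices["haiku"]
                   for tool, prices in median_cost.items()}

for tool in sorted(opus_over_haiku, key=opus_over_haiku.get, reverse=True):
    print(f"{tool:<11} {opus_over_haiku[tool]:.2f}x")
```

graphlens shows the steepest increase (about 2.3×) and codegraph the shallowest (about 1.35×), which matches the reading that verbose results are penalized most on the expensive model.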

Which gives the most valuable takeaway of the lot: a cheap model on a structural tool beats an expensive model on grep. codegraph + haiku (~$0.023, accuracy ~0.99) beats filesystem + opus (~$0.087, accuracy 0.93) on every axis at once.

I planted the two `xlang_link` tasks as a stress test: a TS call resolves to a Python handler across the `/api/v1/...` boundary, and I was sure single-language tools would trip on it.

They didn't. Every arm, grep included, solved both cross-language tasks. The agent steps across the boundary itself, regardless of the context provider. On this set the hypothesis failed, and I report that as loudly as the findings that held. A benchmark that only reports what it hoped to see isn't a benchmark.

Friedman test across the four tools, over task blocks, within each regime (df=3; critical values: 0.05 → 7.82, 0.01 → 11.34):

SIMPLE:
  accuracy  n=20  χ²= 0.40  (n.s.)    — tie
  cost      n=20  χ²=18.42  (p<.01)   — serena < codegraph < graphlens < filesystem

HARD:
  accuracy  n= 6  χ²= 3.50  (n.s.)    — underpowered
  cost      n= 6  χ²=11.80  (p<.01)   — graphlens < codegraph < serena < filesystem
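For readers who want to check such a statistic by hand, here is a minimal pure-Python sketch of the Friedman χ²: rank the tools within each task block, sum the ranks per tool, and plug them into the standard formula. It omits the tie correction, and the cost rows in the example are hypothetical, not the benchmark's data.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_chi2(blocks):
    """Friedman statistic for n blocks (tasks) x k treatments (tools).

    Uses 12 / (n k (k+1)) * sum(R_j^2) - 3 n (k+1), without tie correction.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)


# Hypothetical $/task for 3 tasks x 4 tools, same ordering in every task.
costs = [
    [0.063, 0.038, 0.031, 0.022],
    [0.070, 0.040, 0.030, 0.020],
    [0.050, 0.045, 0.035, 0.025],
]
print(friedman_chi2(costs))  # 9.0, above the 7.82 cutoff for df=3 at 0.05
```

Because the test only looks at ranks within each task, a tool can rank first here without having the lowest mean cost in a summary table.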

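What's honest to claim from this: cost differs significantly between the tools in both regimes (p<.01). Accuracy does not: on the simple tasks it is a tie, and on the hard tasks n=6 is too small to tell the tools apart.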

I'm leaving this in the article on purpose. The temptation to write "graphlens/codegraph are more accurate than grep, proven" is real, but n=6 doesn't carry it, and pretending otherwise would be dishonest.

The structural tools build an index once — pure static work, zero LLM tokens, wall-clock only:

Tool	one-time index
filesystem	0s
codegraph	48s
graphlens	84s
serena	94s

grep pays nothing up front but pays more per query. These are different currencies (seconds vs $/tokens), so I draw no single "break-even point" — that'd be a stretch. The picture is simple: the index is a one-time time cost with not a single token spent, while the $/task savings drip on every task. Over a long session the structural tools amortize; on a couple of one-off queries, grep's zero setup can win on time-to-first-answer.

Back to the original question: what do you feed the agent on a large project?

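There is no "this tool is always best" answer. There's a "depends on what work you hand it" answer: for pinpoint lookups every tool is accurate and the terse ones are cheapest; for blast-radius and disambiguation work a structural tool is cheaper, faster, and more likely to finish than grep; and a cheap model on a structural tool beats an expensive model on grep.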

And the honest caveats, without which you can't transfer the conclusions to your project:

One repository (apache/superset @ 6.0.0), one harness, 26 tasks (20 simple / 6 hard). Regimes are reported separately and never blended. `cost_usd` is an API-equivalent, not a subscription bill. Failure = accuracy 0. This is not a universal ranking: it's a reproducible measurement on one concrete case.

Since this is a follow-up to the graphlens post, let me say it straight. This benchmark does not prove graphlens is "the best." It shows the specific regime where its structural graph pays off (impact analysis, cheap and fast on cheaper models), and just as plainly shows where it lags (on opus its verbose output costs more than codegraph and serena; codegraph is more accurate on hard tasks).

For me that's more useful than any victory lap. graphlens was built as an engine and a precise polyglot graph model, not a turnkey app — and the benchmark confirms exactly that: on structural questions the graph beats text search by a wide margin, and there's clear room to grow — MCP tool granularity (fewer round-trips, like codegraph) and output compactness (so it doesn't bankrupt itself on expensive models). That's my next work item, now backed by numbers instead of intuition.

The whole harness and the raw data are open. A run reassembles deterministically from `data/`.

- `metrics.ipynb` (all charts and per-section stats) and `README.md` (methodology).
- `uv run main.py` runs the full pipeline (clone superset → build indices → 936 runs, resumable within subscription limits), then open `metrics.ipynb`.

If you've got a large project of your own and the itch to run the harness on it, issues and results are welcome. The more independent runs across different codebases, the closer we get to an answer that transfers, rather than "works on superset."

Source: dev.to (original article)