{"slug": "how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-936", "title": "How much does context cost an AI coding agent? grep vs graph vs LSP, measured across 936 runs", "summary": "A developer measured the token cost of context provisioning for AI coding agents across 936 runs on the Apache Superset repository, comparing grep, graph-based, and LSP-based approaches. The results show that the optimal method depends on task difficulty and model choice, with no single approach dominating across all scenarios.", "body_md": "In my [last post](https://dev.to/neko1313_4/graphlens-a-polyglot-code-analysis-framework-that-turns-your-repo-into-a-typed-graph-4mhi) I described **graphlens** — what it does, how it works — and along the way I casually claimed that an agent \"burns tokens grepping around a repo.\" I gave exactly **zero** numbers to back that up.\n\nThis post fixes that. Here are the measurements, the data, and a reproducible harness. Spoiler: the conclusion is not the one I expected going in, and that's the interesting part.\n\nI took **one** agent (Claude Code), changed **exactly one thing** — which MCP server feeds it code context — and ran it over 26 tasks on `apache/superset`. Four \"arms\": `filesystem` (grep + read), `graphlens` (structural graph), `serena` (LSP), and `codegraph`. Three models (haiku / sonnet / opus), three seeds — **936 runs**.\n\nThe headline: **the answer flips depending on the kind of task.**\n\nIf I'd only measured the easy tasks, I'd have written \"you don't need a graph, grep is fine.\" If only the hard ones, \"you don't need grep, get a graph.\" The truth sits in the middle, and it's about **what work you hand the agent.**\n\nPicture a familiar situation. You have a large project: hundreds of thousands of lines, a Python backend, a TypeScript front end, legacy code you're scared to touch. 
You wire an AI agent into it — for review, refactoring, answering questions like \"what breaks if I change this method's signature?\"\n\nThe agent can't see the whole repo at once. Something has to feed it context: which functions live where, who calls whom, what inherits from what. And here's an **architectural decision with a price tag**: what exactly do you feed it?\n\nThere are basically four classes of answer:\n\nEach option costs money (tokens), time (latency), and risk (the agent gives up and hits a turn cap). `apache/superset` is an almost perfect stand-in for this case: ~400k LOC, Python + TypeScript, an `/api/v1/...` boundary between front and back. A big polyglot project — exactly when this question is worth asking.\n\nSo how much does each option cost? Let's measure.\n\nThe whole methodology rests on one principle: **fix everything except one thing.** Model, system prompt, settings, task set — constants. Only the context-providing MCP server changes. Then any difference in the numbers is the contribution of that tool, not a config accident.\n\nNo tool is designated \"the baseline to beat.\" All four are measured on equal footing, and the numbers rank them.\n\n| Arm | Context provider (MCP server) | Indexing step |\n|---|---|---|\n| `filesystem` | `@modelcontextprotocol/server-filesystem` (read_file + grep) | none |\n| `graphlens` | graphlens graph over MCP | `graphlens analyze` |\n| `serena` | Serena (LSP) | LSP workspace warm-up |\n| `codegraph` | a graph-based competitor | `codegraph init` |\n\nOne detail that matters for fairness: **Claude Code's built-in tools (Read / Grep / Bash, etc.) are disabled.** If you don't take them away, the agent ignores the MCP server and falls back to its usual path — and you'd be measuring the wrong thing. 
So the harness runs `claude -p` in a clean room: a fresh `CLAUDE_CONFIG_DIR` with only subscription credentials (no hooks, plugins, skills, memory), `--strict-mcp-config` (only this arm's server is visible), `--disallowedTools` on every built-in (an explicit *deny*, because in headless mode an allow-list alone forbids nothing), and `--allowedTools mcp__<server>` to auto-approve the one server.\n\nIn parallel I varied the model answering the question:\n\n| Key | model id |\n|---|---|\n| `haiku` | `claude-haiku-4-5` |\n| `sonnet` | `claude-sonnet-4-6` |\n| `opus` | `claude-opus-4-8` |\n\nWhy add a second axis? That becomes clear near the end: **the optimal tool depends on which model you picked.** That's probably the least obvious finding in the whole thing.\n\nTotal: 4 arms × 3 models × 26 tasks × 3 seeds = **936 runs** (on Claude Code 2.1.187).\n\nBenchmarks are easy to bend toward the conclusion you want. So the rules are fixed up front — without them the numbers aren't trustworthy:\n\n`6.0.0` (every task carries a `file:line` reference). Crucially, `ast`.\n\n`filesystem` is grep + read, not \"an agent with no tools.\" Naive ≠ toolless.\n\n`temperature=0` does not make these models deterministic. So 3 seeds, and the report shows `cost_usd` is an API-equivalent, not your bill. `cost_usd` (emitted by the CLI) is what the same tokens `__NO_TOOLS__`). Answering \"from memory\" about a well-known repo wouldn't measure the context provider.\n\nAnd separately: **failure counts as accuracy 0.** If grep hits the 50-turn cap and never produces an answer, that's not \"no data\" — it's \"the tool didn't get there within budget.\" That's how it's scored.\n\n26 tasks split into two classes.\n\n**SIMPLE — 20 pinpoint lookups** (\"where is X defined / what does X inherit from\"). 
One-point answers, checked by substring:\n\n| Kind | # | What it probes |\n|---|---|---|\n| `where_defined` | 7 | Python class → defining file |\n| `inherits_from` | 5 | Python class → base class |\n| `abstract_methods` | 1 | ABC → its abstract methods |\n| `ts_where_defined` | 1 | TS hook → defining file |\n| `ts_route_call` | 4 | `/api/v1/...` route → the TS hook that calls it |\n| `xlang_link` | 2 | TS consumer → Python handler across the API boundary |\n\n**HARD — 6 blast-radius and disambiguation tasks.** This is the regime where structure and semantics *should* beat text search — and which pinpoint lookups simply can't measure:\n\n| Kind | # | What it probes | Scoring |\n|---|---|---|---|\n| `disambiguate` | 2 | an ambiguous bare method name (e.g. `cache_key`, defined on many classes) → the right class | substring |\n| `overrides_count` | 2 | the full set of subclasses overriding a base method | set F1 |\n| `impact_set` | 2 | every file calling a given method (the blast radius) | set F1 |\n\nSet tasks are scored by F1: it rewards recall (find them all) and penalizes low precision (text search loves to dump every occurrence of `.get_indexes(`). Gold sets are kept small (3–5 elements, one ≈17) so they can be exhaustively checked by hand.\n\nThe set is **deliberately unbalanced** — 20 simple vs 6 hard. A single blended average would be entirely dictated by the easy tasks and would **hide** exactly the difference the hard ones expose. So I report each regime **separately, and never mix them.**\n\nAnd no, I deliberately don't \"balance to 50/50\" by dropping simple tasks. That would throw away data and statistical power, and open the door to cherry-picking. Stratification neutralizes the skew **without discarding data**. 
(General principle: if regimes give different answers, it's more honest to show both than to bury the conflict under an average.)\n\n| Tool | accuracy | complete | tokens | calls | $/task | sec |\n|---|---|---|---|---|---|---|\n| filesystem | 0.97 | 100% | 1780 | 10 | $0.063 | 43 |\n| graphlens | 0.98 | 100% | 690 | 3 | $0.038 | 13 |\n| serena | 0.99 | 100% | 402 | 3 | $0.031 | 20 |\n| codegraph | 0.99 | 100% | 372 | 1 | $0.022 | 10 |\n\nAccuracy is a **tie** (formally: Friedman χ²=0.40, not significant). The tools differ only on cost — a ~3× spread — and the terse ones win. **graphlens is unremarkable here** — a solid mid-pack.\n\nThis is exactly the story a benchmark that *only* measured pinpoint lookups would tell: \"structural tools are nice, but grep nearly keeps up, and codegraph gives the cheapest answer.\" And it would be an **incomplete** truth.\n\n| Tool | accuracy | complete | tokens | calls | $/task | sec |\n|---|---|---|---|---|---|---|\n| filesystem | 0.71 | 83% | 12596 | 27 | $0.424 | 165 |\n| graphlens | 0.84 | 100% | 748 | 1 | $0.018 | 9 |\n| serena | 0.85 | 98% | 1368 | 5 | $0.065 | 29 |\n| codegraph | 0.93 | 100% | 1114 | 2 | $0.036 | 16 |\n\nNow the tools **separate.**\n\n**grep collapses.** Lowest accuracy (0.71), only 83% of runs finish (the rest hit the 50-turn cap), and the ones that finish cost **6–24× more** ($0.42 vs $0.018–0.065) and take **6–18× longer** (~165s vs 9–29s). Text search drowns in noise when the question is \"every call to this\" or \"which of a dozen identically-named methods.\"\n\nAnd the key bit: **graphlens — the mid-pack tool on easy tasks — is here the cheapest ($0.018) and fastest (9s).** Its semantic graph finally pays off: one call instead of twenty-seven. The most *accurate* tool is codegraph (0.93). serena is competitive (0.85).\n\nSo the same graphlens that looked unremarkable on pinpoint lookups becomes the most economical the moment the work is real — blast radius, refactoring. 
The ranking **inverts** between regimes.\n\nFairness note. MCP resources are disabled for all arms. graphlens was the only server exposing resources, and in an early run the agent wandered into enumerating them and inflated cost ~24% until I denied them. All numbers above are from the clean re-run.\n\nThe cost difference is mostly **how many times the agent calls the tool**, which follows from how a server slices its primitives.\n\nOn a simple \"symbol → file\" (`where_defined`), one call is enough for everyone. The gap opens on **relationship queries** — inheritance, route → handler, cross-language links. There `graphlens` chains fine-grained primitives (`find` → `neighbors` → `references`), while `codegraph` packs \"source + call paths in one shot\" (`explore` / `node`).\n\nThis isn't a difference in *what the graph knows* — graphs know roughly the same things. It's a difference in API granularity: fewer round-trips → cheaper and faster. That's why codegraph has the efficiency edge on simple tasks, and why grep bankrupts itself on hard ones — it makes 27 round-trips where the graph needs one or two.\n\nThis is the least obvious part. Take median $/task (across both regimes) broken down by model:\n\n| Tool | haiku | sonnet | opus |\n|---|---|---|---|\n| filesystem | $0.053 | $0.080 | $0.087 |\n| graphlens | $0.020 | $0.041 | $0.046 |\n| serena | $0.026 | $0.033 | $0.042 |\n| codegraph | $0.023 | $0.041 | $0.031 |\n\nCheapest-first ranking **within each model**, read off the table above:\n\n- haiku: graphlens < codegraph < serena < filesystem\n- sonnet: serena < graphlens ≈ codegraph < filesystem\n- opus: codegraph < serena < graphlens < filesystem\n\nWatch what happens to graphlens. On **haiku it's the cheapest of all.** On **opus it becomes the most expensive of the structural tools** (still cheaper than grep, though).\n\nThe mechanism: graphlens results are **token-heavy** — graph neighborhoods, reference lists. On a cheap model that verbose context is nearly free; on an expensive one, opus prices the same tokens far higher, and verbosity hits the wallet. 
**serena and codegraph stay cheap on any model** because they return pinpoint results — they're robust to model choice; graphlens isn't.\n\nWhich gives the most valuable takeaway of the lot: **a cheap model on a structural tool beats an expensive model on grep.** codegraph + haiku (~$0.023, accuracy ~0.99) beats filesystem + opus (~$0.087, accuracy 0.93) on every axis at once.\n\nI planted the two `xlang_link` tasks as a stress test: a TS call resolves to a Python handler across the `/api/v1/...` boundary, and I was sure single-language tools would trip on it.\n\n**They didn't.** Every arm, grep included, solved both cross-language tasks. The agent steps across the boundary itself, regardless of the context provider. On this set the hypothesis failed, and I report that as loudly as the findings that held. A benchmark that only reports what it hoped to see isn't a benchmark.\n\nFriedman test across the four tools, over task blocks, within each regime (df=3; critical values: 0.05 → 7.82, 0.01 → 11.34):\n\n```\nSIMPLE:\n  accuracy  n=20  χ²= 0.40  (n.s.)    — tie\n  cost      n=20  χ²=18.42  (p<.01)   — serena < codegraph < graphlens < filesystem\n\nHARD:\n  accuracy  n= 6  χ²= 3.50  (n.s.)    — underpowered\n  cost      n= 6  χ²=11.80  (p<.01)   — graphlens < codegraph < serena < filesystem\n```\n\nWhat's honest to claim from this:\n\nI'm leaving this in the article on purpose. The temptation to write \"graphlens/codegraph are more accurate than grep, proven\" is real, but n=6 doesn't carry it, and pretending otherwise would be dishonest.\n\nThe structural tools build an index once — **pure static work, zero LLM tokens**, wall-clock only:\n\n| Tool | one-time index |\n|---|---|\n| filesystem | 0s |\n| codegraph | 48s |\n| graphlens | 84s |\n| serena | 94s |\n\ngrep pays nothing up front but pays more per query. These are **different currencies** (seconds vs $/tokens), so I draw no single \"break-even point\" — that'd be a stretch. 
The picture is simple: the index is a one-time time cost with not a single token spent, while the $/task savings drip on every task. Over a long session the structural tools amortize; on a couple of one-off queries, grep's zero setup can win on time-to-first-answer.\n\nBack to the original question: what do you feed the agent on a large project?\n\n**There is no \"this tool is always best\" answer.** There's a \"depends on what work you hand it\" answer:\n\nAnd the honest caveats, without which you can't transfer the conclusions to your project:\n\nOne repository (`apache/superset` @ 6.0.0), one harness, 26 tasks (20 simple / 6 hard). Regimes are reported separately and never blended. `cost_usd` is an API-equivalent, not a subscription bill. Failure = accuracy 0. This is not a universal ranking — it's a reproducible measurement on one concrete case.\n\nSince this is a follow-up to [the graphlens post](https://dev.to/neko1313_4/graphlens-a-polyglot-code-analysis-framework-that-turns-your-repo-into-a-typed-graph-4mhi), let me say it straight. This benchmark does **not** prove graphlens is \"the best.\" It shows the **specific regime where its structural graph pays off** (impact analysis, cheap and fast on cheaper models), and just as plainly shows **where it lags** (on opus its verbose output costs more than codegraph and serena; codegraph is more accurate on hard tasks).\n\nFor me that's more useful than any victory lap. graphlens was built as an **engine and a precise polyglot graph model**, not a turnkey app — and the benchmark confirms exactly that: on structural questions the graph beats text search by a wide margin, and there's clear room to grow — MCP tool granularity (fewer round-trips, like codegraph) and output compactness (so it doesn't bankrupt itself on expensive models). That's my next work item, now backed by numbers instead of intuition.\n\nThe whole harness and the raw data are open. 
A run reassembles deterministically from `data/`.\n\n`metrics.ipynb` (all charts and per-section stats) and `README.md` (methodology).\n\n`uv run main.py` runs the full pipeline (clone superset → build indices → 936 runs, resumable within subscription limits), then open `metrics.ipynb`.\n\nIf you've got a large project of your own and the itch to run the harness on it — issues and results welcome. The more independent runs across different codebases, the closer we get to an answer that transfers, rather than \"works on superset.\"", "url": "https://wpnews.pro/news/how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-936", "canonical_source": "https://dev.to/neko1313_4/how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-across-936-runs-33m8", "published_at": "2026-06-24 23:10:19+00:00", "updated_at": "2026-06-24 23:42:52.879859+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "developer-tools"], "entities": ["Claude Code", "Apache Superset", "graphlens", "Serena", "codegraph", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-936", "markdown": "https://wpnews.pro/news/how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-936.md", "text": "https://wpnews.pro/news/how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-936.txt", "jsonld": "https://wpnews.pro/news/how-much-does-context-cost-an-ai-coding-agent-grep-vs-graph-vs-lsp-measured-936.jsonld"}}