How much does context cost an AI coding agent? grep vs graph vs LSP, measured across 936 runs

A developer measured the token cost of context provisioning for AI coding agents across 936 runs on the Apache Superset repository, comparing grep, graph-based, and LSP-based approaches. The results show that the optimal method depends on task difficulty and model choice, with no single approach dominating across all scenarios.

In my last post (https://dev.to/neko1313 4/graphlens-a-polyglot-code-analysis-framework-that-turns-your-repo-into-a-typed-graph-4mhi) I described graphlens, what it does and how it works, and along the way I casually claimed that an agent "burns tokens grepping around a repo." I gave exactly zero numbers to back that up. This post fixes that. Here are the measurements, the data, and a reproducible harness.

Spoiler: the conclusion is not the one I expected going in, and that's the interesting part.

I took one agent (Claude Code), changed exactly one thing, namely which MCP server feeds it code context, and ran it over 26 tasks on apache/superset. Four "arms": filesystem (grep + read), graphlens (structural graph), serena (LSP), and codegraph. Three models (haiku / sonnet / opus), three seeds: 936 runs.

The headline: the answer flips depending on the kind of task. If I'd only measured the easy tasks, I'd have written "you don't need a graph, grep is fine." If only the hard ones, "you don't need grep, get a graph." The truth sits in the middle, and it depends on what work you hand the agent.

Picture a familiar situation. You have a large project: hundreds of thousands of lines, a Python backend, a TypeScript front end, legacy code you're scared to touch. You wire an AI agent into it for review, refactoring, and answering questions like "what breaks if I change this method's signature?" The agent can't see the whole repo at once. Something has to feed it context: which functions live where, who calls whom, what inherits from what.
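As a quick sanity check on the experiment size quoted in the setup, the run count is just the product of the four dimensions: 26 tasks × 4 arms × 3 models × 3 seeds. The sketch below enumerates that matrix. The arm and model names come from the post; the task IDs and seed values are hypothetical placeholders, and this is not the actual harness code.

```python
from itertools import product

# Arm and model names are taken from the post.
ARMS = ["filesystem", "graphlens", "serena", "codegraph"]
MODELS = ["haiku", "sonnet", "opus"]

# Assumptions for illustration: the post gives only the counts
# (26 tasks, 3 seeds), not the actual task IDs or seed values.
TASKS = [f"task-{i:02d}" for i in range(1, 27)]
SEEDS = [0, 1, 2]


def run_matrix():
    """Return one record per (task, arm, model, seed) combination."""
    return [
        {"task": task, "arm": arm, "model": model, "seed": seed}
        for task, arm, model, seed in product(TASKS, ARMS, MODELS, SEEDS)
    ]


runs = run_matrix()
print(len(runs))  # 936

# Every arm sees the same tasks, models, and seeds: 26 * 3 * 3 = 234 runs each.
for arm in ARMS:
    print(arm, sum(1 for r in runs if r["arm"] == arm))
```

Because the matrix is a full cross product, each arm gets an identical set of (task, model, seed) cells, which is what lets a difference between arms be attributed to the context provider alone.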
Here's an architectural decision with a price tag: what exactly do you feed the agent? There are basically four classes of answer:

Each option costs money (tokens), time (latency), and risk (the agent gives up and hits a turn cap).

apache/superset is an almost perfect stand-in for this case: ~400k LOC, Python + TypeScript, an /api/v1/... boundary between front and back. It is a big polyglot project, which is exactly when this question is worth asking.

So how much does each option cost? Let's measure.

The whole methodology rests on one principle: fix everything except one thing. Model, system prompt, settings, and task set are constants. Only the context-providing MCP server changes. Then any difference in the numbers is the contribution of that tool, not a config accident.

No tool is designated "the baseline to beat." All four are measured on equal footing, and the numbers rank them.

| Arm | Context provider (MCP server) | Indexing step |
|---|---|---|
| filesystem | @modelcontextprotocol/server-filesystem (read file + grep) | none |
| graphlens | graphlens (graph over MCP) | graphlens analyze |
| serena | Serena (LSP) | LSP workspace warm-up |
| codegraph | a graph-based competitor | codegraph init |

One detail that matters for fairness: Claude Code's built-in tools (Read / Grep / Bash, etc.) are disabled. If you don't take them away, the agent ignores the MCP server and falls back to its usual path, and you'd be measuring the wrong thing. So the harness runs claude -p in a clean room: a fresh CLAUDE_CONFIG_DIR with only subscription credentials (no hooks, plugins, skills, memory), --strict-mcp-config (only this arm's server is visible), --disallowedTools on every built-in (an explicit deny, because in headless mode an allow-list alone forbids nothing), and --allowedTools mcp