{"slug": "show-hn-is-grep-enough-a-transparent-benchmark-for-agentic-code-navigation", "title": "Show HN: Is grep enough? A transparent benchmark for agentic code navigation", "summary": "A new benchmark compares grep, structural (tree-sitter), and semantic (LSP) tools for agentic code navigation across 50 tasks. Results show grep is sufficient for correctness but structural tools reduce token usage significantly, especially on complex tasks.", "body_md": "# is grep enough?\n\nWhen a coding agent explores a large, unfamiliar codebase, are basic text tools (grep/read) enough — or does it need something **fast and light** (structural, tree-sitter) or something more **authoritative and accurate** (semantic, LSP)? This measures the three side by side and lets you check the answer yourself.\n\n## TL;DR — is grep enough?\n\n**For correctness, usually. For tokens through the model, increasingly not.** The same 50 navigation tasks (10 repos × 5 complexity rungs), run three ways and blind-judged, land the right answer almost everywhere — mean grounding **—** and completeness **—** are a near-tie across all three arms. Grep *is* enough to be *correct*.\n\nWhere they separate is **how much context they push through the model to get there**, and that gap compounds with complexity. Structural navigation (grove) reaches the same answer on far fewer tokens — about **—** on average, against **—** for lsp and **—** for baseline text search (grep + read, which fans out and re-reads). At trivial tasks everything is light and grep wins on simplicity; by the hardest rung (L5) baseline pushes **—** of context against grove's **—** — **—** as many tokens — while staying tightest on quality.\n\nToken throughput is not the billed bill: most of baseline's volume is cheap cache reads, so in *dollars* the arms are much closer (see the cost panel). The lean-context win matters most for context-window pressure, latency, and any setting without aggressive prompt caching. 
The thesis is a curve, not a winner — grep suffices for shallow lookups; structural tools pay off as the code gets harder to navigate. n=1 per cell; descriptive, not significance-tested.\n\n## What this shows\n\nThe same exploration task, given to an agent with three rungs of navigation power — plain text search (baseline), fast-light structural (grove), and authoritative semantic (lsp) — across a task-complexity ladder. The question is where on that ladder the extra power stops paying for itself. Everything below links to the raw run that produced it.\n\n## Metrics\n\nOne metric at a time — select a tab to compare arms across rungs. Each bar is one repo; bars are grouped by arm within each rung panel. The *token economy* tab stacks context, cache, and cost together since they are causally linked.\n\n## Cell detail\n\n## Methodology & provenance\n\nHow the numbers above were produced, and what protects their fairness. Everything here is a standing claim you can check against the evidence linked throughout the page.\n\n### The genesis wall\n\nPrompts and their reference answer keys are generated **offline, before any arm runs**. A running arm sees **only the bare prompt** — never the reference key, the rationale, or the pinned source under `experiment/repos/`. Judging reads the keys; running never does. The keys are **judge-only** and appear on this page only as post-hoc *key revisions* in a cell's detail, never as the answer itself.\n\n### Blind judging\n\nEach cell's three answers are scrubbed to **A / B / C** with the arm→letter mapping withheld, graded against the reference key on **grounding** (do the cited `file:line` anchors resolve in pinned source?) and **completeness** (does it cover the key's required spine?), and only un-blinded to record the score. 
Where the key itself was wrong, it is corrected in place and the correction is shown with its cite — proof the grader corrected *itself*, not the arms.\n\n### Cite-link verification\n\nEvery `file:line` in a transcript links to the GitHub blob at that repo's **pinned SHA**. The build doesn't just link them — it **re-resolves them against the pinned source**: a cite is confirmed when its file is located (exact path, or a unique basename match in the tree) and the line is within the file. The result across all harvested cells:\n\n—\n\n### Pricing\n\nEvery dollar figure on this page is the **billed total_cost_usd** reported by the run itself — not a recomputed list-price estimate. The table below is the public list price for the model used (—), shown so a reader can sanity-check the billed figures against the token split. n=1 per cell; cost is a direction, not a benchmark.\n\n### Data sources\n\nThe feed is a pure function of committed evidence, synthesized by `site/build.mjs`. The cell ledger is written **only** through the validated `statectl` CLI — never hand-edited. These are the exact files behind the current view:\n\n### Reproduce this page\n\nThe feed is deterministic — re-running `build.mjs` against the same evidence reproduces `site/data/` byte-for-byte (only the stamped SHA/timestamp differ). Rebuild and serve locally, then diff against what's published:\n\n## How to trust this\n\n- Every number is recomputed from a run's own stream-json; click any transcript to read the full reasoning trail and the raw evidence path.\n- The **engagement** line per arm proves it used its capability (`bash/grove/lsp > 0`) — a fairness gate, not an assertion. 
- Answer quality was **blind-judged** (A/B/C, mapping withheld); where the reference key was wrong it is corrected in place (*key revisions*), shown in the cell detail.", "url": "https://wpnews.pro/news/show-hn-is-grep-enough-a-transparent-benchmark-for-agentic-code-navigation", "canonical_source": "https://entelligentsia.github.io/is-grep-enough/", "published_at": "2026-06-30 12:06:46+00:00", "updated_at": "2026-06-30 12:20:45.268445+00:00", "lang": "en", "topics": ["developer-tools", "ai-agents", "large-language-models"], "entities": ["grep", "tree-sitter", "LSP", "grove"], "alternates": {"html": "https://wpnews.pro/news/show-hn-is-grep-enough-a-transparent-benchmark-for-agentic-code-navigation", "markdown": "https://wpnews.pro/news/show-hn-is-grep-enough-a-transparent-benchmark-for-agentic-code-navigation.md", "text": "https://wpnews.pro/news/show-hn-is-grep-enough-a-transparent-benchmark-for-agentic-code-navigation.txt", "jsonld": "https://wpnews.pro/news/show-hn-is-grep-enough-a-transparent-benchmark-for-agentic-code-navigation.jsonld"}}