Show HN: Is grep enough? A transparent benchmark for agentic code navigation

A new benchmark compares grep, structural (tree-sitter), and semantic (LSP) tools for agentic code navigation across 50 tasks. Results show grep is sufficient for correctness but structural tools reduce token usage significantly, especially on complex tasks.

is grep enough? When a coding agent explores a large, unfamiliar codebase, are basic text tools grep/read enough — or does it need something fast and light structural, tree-sitter or something more authoritative and accurate semantic, LSP ? This measures the three side by side and lets you check the answer yourself. TL;DR — is grep enough? For correctness, usually. For tokens through the model, increasingly not. The same 50 navigation tasks 10 repos × 5 complexity rungs , run three ways and blind-judged, land the right answer almost everywhere — mean grounding — and completeness — are a near-tie across all three arms. Grep is enough to be correct . Where they separate is how much context they push through the model to get there , and that gap compounds with complexity. Structural navigation grove reaches the same answer on far fewer tokens — about — on average, against — for lsp and — for baseline text search grep + read, which fans out and re-reads . At trivial tasks everything is light and grep wins on simplicity; by the hardest rung L5 baseline pushes — of context against grove's — — — as many tokens — while staying tightest on quality. Token throughput is not the billed bill: most of baseline's volume is cheap cache reads, so in dollars the arms are much closer see the cost panel . The lean-context win matters most for context-window pressure, latency, and any setting without aggressive prompt caching. The thesis is a curve, not a winner — grep suffices for shallow lookups; structural tools pay off as the code gets harder to navigate. n=1 per cell; descriptive, not significance-tested. What this shows The same exploration task, given to an agent with three rungs of navigation power — plain text search baseline , fast-light structural grove , and authoritative semantic lsp — across a task-complexity ladder. The question is where on that ladder the extra power stops paying for itself. Everything below links to the raw run that produced it. Metrics One metric at a time — select a tab to compare arms across rungs. Each bar is one repo; bars are grouped by arm within each rung panel. The token economy tab stacks context, cache, and cost together since they are causally linked. Cell detail Methodology & provenance How the numbers above were produced, and what protects their fairness. Everything here is a standing claim you can check against the evidence linked throughout the page. The genesis wall Prompts and their reference answer keys are generated offline, before any arm runs . A running arm sees only the bare prompt — never the reference key, the rationale, or the pinned source under experiment/repos/ . Judging reads the keys; running never does. The keys are judge-only and appear on this page only as post-hoc key revisions in a cell's detail, never as the answer itself. Blind judging Each cell's three answers are scrubbed to A / B / C with the arm→letter mapping withheld, graded against the reference key on grounding do the cited file:line anchors resolve in pinned source? and completeness does it cover the key's required spine? , and only un-blinded to record the score. Where the key itself was wrong, it is corrected in place and the correction is shown with its cite — proof the grader corrected itself , not the arms. Cite-link verification Every file:line in a transcript links to the GitHub blob at that repo's pinned SHA . The build doesn't just link them — it re-resolves them against the pinned source : a cite is confirmed when its file is located exact path, or a unique basename match in the tree and the line is within the file. The result across all harvested cells: — Pricing Every dollar figure on this page is the billed total cost usd reported by the run itself — not a recomputed list-price estimate. The table below is the public list price for the model used — , shown so a reader can sanity-check the billed figures against the token split. n=1 per cell; cost is a direction, not a benchmark. Data sources The feed is a pure function of committed evidence, synthesized by site/build.mjs . The cell ledger is written only through the validated statectl CLI — never hand-edited. These are the exact files behind the current view: Reproduce this page The feed is deterministic — re-running build.mjs against the same evidence reproduces site/data/ byte-for-byte only the stamped SHA/timestamp differ . Rebuild and serve locally, then diff against what's published: How to trust this - Every number is recomputed from a run's own stream-json; click any transcript to read the full reasoning trail and the raw evidence path. - The engagement line per arm proves it used its capability bash/grove/lsp 0 — a fairness gate, not an assertion. - Answer quality was blind-judged A/B/C, mapping withheld ; where the reference key was wrong it is corrected in-place key revisions , shown in the cell detail.