# Show HN: Is grep enough? A transparent benchmark for agentic code navigation

> Source: <https://entelligentsia.github.io/is-grep-enough/>
> Published: 2026-06-30 12:06:46+00:00

# is grep enough?

When a coding agent explores a large, unfamiliar codebase, are basic text tools (grep/read) enough — or does it need something **fast and light** (structural, tree-sitter) or something more **authoritative and accurate** (semantic, LSP)? This measures the three side by side and lets you check the answer yourself.

## TL;DR — is grep enough?

**For correctness, usually. For tokens through the model, increasingly not.** The same 50 navigation tasks (10 repos × 5 complexity rungs), run three ways and blind-judged, land the right answer almost everywhere — mean grounding **—** and completeness **—** are a near-tie across all three arms. Grep *is* enough to be *correct*.

Where they separate is **how much context they push through the model to get there**, and that gap compounds with complexity. Structural navigation (grove) reaches the same answer on far fewer tokens — about **—** on average, against **—** for lsp and **—** for baseline text search (grep + read, which fans out and re-reads). At trivial tasks everything is light and grep wins on simplicity; by the hardest rung (L5) baseline pushes **—** of context against grove's **—** — **—** as many tokens — while staying tightest on quality.

Token throughput is not the billed bill: most of baseline's volume is cheap cache reads, so in *dollars* the arms are much closer (see the cost panel). The lean-context win matters most for context-window pressure, latency, and any setting without aggressive prompt caching. The thesis is a curve, not a winner — grep suffices for shallow lookups; structural tools pay off as the code gets harder to navigate. n=1 per cell; descriptive, not significance-tested.

## What this shows

The same exploration task, given to an agent with three rungs of navigation power — plain text search (baseline), fast-light structural (grove), and authoritative semantic (lsp) — across a task-complexity ladder. The question is where on that ladder the extra power stops paying for itself. Everything below links to the raw run that produced it.

## Metrics

One metric at a time — select a tab to compare arms across rungs. Each bar is one repo; bars are grouped by arm within each rung panel. The *token economy* tab stacks context, cache, and cost together since they are causally linked.

## Cell detail

## Methodology & provenance

How the numbers above were produced, and what protects their fairness. Everything here is a standing claim you can check against the evidence linked throughout the page.

### The genesis wall

Prompts and their reference answer keys are generated **offline, before any arm runs**. A running arm sees **only the bare prompt** — never the reference key, the rationale, or the pinned source under `experiment/repos/`

. Judging reads the keys; running never does. The keys are **judge-only** and appear on this page only as post-hoc *key revisions* in a cell's detail, never as the answer itself.

### Blind judging

Each cell's three answers are scrubbed to **A / B / C** with the arm→letter mapping withheld, graded against the reference key on **grounding** (do the cited `file:line`

anchors resolve in pinned source?) and **completeness** (does it cover the key's required spine?), and only un-blinded to record the score. Where the key itself was wrong, it is corrected in place and the correction is shown with its cite — proof the grader corrected *itself*, not the arms.

### Cite-link verification

Every `file:line`

in a transcript links to the GitHub blob at that repo's **pinned SHA**. The build doesn't just link them — it **re-resolves them against the pinned source**: a cite is confirmed when its file is located (exact path, or a unique basename match in the tree) and the line is within the file. The result across all harvested cells:

—

### Pricing

Every dollar figure on this page is the **billed total_cost_usd** reported by the run itself — not a recomputed list-price estimate. The table below is the public list price for the model used (—), shown so a reader can sanity-check the billed figures against the token split. n=1 per cell; cost is a direction, not a benchmark.

### Data sources

The feed is a pure function of committed evidence, synthesized by `site/build.mjs`

. The cell ledger is written **only** through the validated `statectl`

CLI — never hand-edited. These are the exact files behind the current view:

### Reproduce this page

The feed is deterministic — re-running `build.mjs`

against the same evidence reproduces `site/data/`

byte-for-byte (only the stamped SHA/timestamp differ). Rebuild and serve locally, then diff against what's published:

## How to trust this

- Every number is recomputed from a run's own stream-json; click any transcript to read the full reasoning trail and the raw evidence path.
- The
**engagement** line per arm proves it used its capability (`bash/grove/lsp > 0`

) — a fairness gate, not an assertion. - Answer quality was
**blind-judged**(A/B/C, mapping withheld); where the reference key was wrong it is corrected in-place (*key revisions*), shown in the cell detail.
