# AI Coding Agents Search Like It's 2009. Provenant Cuts Tokens by 65×.

> Source: <https://dev.to/corpsekiller/ai-coding-agents-search-like-its-2009-provenant-cuts-tokens-by-65x-3jg9>
> Published: 2026-05-28 14:40:02+00:00

Here's what happens every time you ask an AI coding agent a question:

This is BM25 keyword search on raw source code. It's the same algorithm that powered web search in 2009. And it's still the shape of most coding-agent retrieval systems: keyword search, grep, file search, context stuffing.

I spent the last few months building something better. Here's what I found.

When you ask *"how does Flask handle URL routing?"*, you're writing in English. The answer lives in `scaffold.py`, `app.py`, and `wrappers.py`: files full of Python syntax, decorator patterns, and Werkzeug internals.

BM25 tries to match your words against those files. It mostly fails.

The word "routing" appears 4 times in Flask's source. "URL" appears 31 times — mostly in docstrings and variable names scattered across 70+ files. BM25 retrieves 15 of them and hopes for the best.

The agent doesn't just have a retrieval problem. It has a vocabulary problem.

Natural language queries describe *behavior*. Source code implements *syntax*. These are different vocabularies, and no amount of BM25 tuning bridges that gap.

**Generate a human-readable wiki page for every file and module, then search the wiki.**

A wiki page for `flask/sansio/scaffold.py` reads like this:

> Scaffold is the shared base class for Flask and Blueprint. `@route()` calls `add_url_rule()`, which creates a Werkzeug Rule and inserts it into `url_map`. View callables are stored in `view_functions` keyed by endpoint name.

Search that for *"how does Flask handle URL routing?"* — the query and the document speak the same language. No vocabulary gap.
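To see how much BM25 depends on literal term overlap, here is a minimal sketch of the scoring function in plain Python. It is not Provenant's implementation: the tokenizer is a naive regex, `k1` and `b` are common textbook defaults, and the `source` string is a simplified, made-up stand-in for real Flask code. The `wiki` string reuses the page text quoted above.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase tokenizer; underscores split identifiers into words.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # no literal overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[name] = score
    return scores


docs = {
    # Hypothetical, simplified code snippet (not Flask's real source).
    "source": "def add_url_rule(self, rule, endpoint=None, view_func=None, **options):\n"
              "    self.url_map.add(Rule(rule, endpoint=endpoint))",
    # Wiki-style text quoted earlier in the article.
    "wiki": "Scaffold is the shared base class for Flask and Blueprint. "
            "@route() calls add_url_rule(), which creates a Werkzeug Rule and "
            "inserts it into url_map. View callables are stored in "
            "view_functions keyed by endpoint name.",
}
query = "how does Flask handle URL routing?"
scores = bm25_scores(query, docs)
```

In this toy pair the only query terms that score at all are the ones that literally appear in a document, which is the vocabulary problem in miniature.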

That's **Provenant**. Index once, search a wiki forever.

I ran this against **SWE-bench Verified** — 500 real GitHub issues across 12 major Python repos. The metric is **Coverage@5**: does the correct file appear in the top 5 retrieved results?
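As a rough sketch, Coverage@k can be computed like this. The function name and the rule for tasks with several gold files (any one of them in the top k counts as a hit) are my assumptions for illustration, and the example file lists are made up.

```python
def coverage_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one gold file in the top-k results."""
    hits = sum(
        1 for ranked, targets in zip(retrieved, gold)
        if targets & set(ranked[:k])
    )
    return hits / len(retrieved)


# Two made-up queries: the first finds its gold file at rank 2, the second misses.
retrieved = [
    ["app.py", "scaffold.py", "cli.py"],
    ["views.py", "urls.py", "forms.py"],
]
gold = [{"scaffold.py"}, {"models.py"}]
coverage = coverage_at_k(retrieved, gold, k=5)
```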

| Method | Coverage@5 | Tokens/query | Delta |
|---|---|---|---|
| Raw BM25 (baseline) | ~40% | ~65,000 | — |
| Provenant (wiki + BM25) | 63.8% | ~1,030 | +24pp |
| Provenant + HyDE | 66.2% | ~1,030 | +26pp |

**+24 percentage points.** From 40% to 63.8%. On 500 tasks. Across 12 repos.

And the token numbers aren't rounding errors:

| Repo | Naive tokens | Provenant tokens | Reduction |
|---|---|---|---|
| Flask (30 queries) | 69,044 | 1,070 | 64.5× |
| Django (20 queries) | 59,634 | 994 | 60.0× |
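The reduction column is simply the ratio of the two token counts. A quick check, using the figures from the table (the function name is mine):

```python
def reduction_factor(naive_tokens: int, provenant_tokens: int) -> float:
    # How many times fewer tokens the wiki-based retrieval sends to the model.
    return naive_tokens / provenant_tokens


rows = {"Flask": (69_044, 1_070), "Django": (59_634, 994)}
for repo, (naive, slim) in rows.items():
    print(f"{repo}: {reduction_factor(naive, slim):.1f}x")
```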

Answer quality delta: **−0.15 on a 5-point blind-judge scale.** In this sample, that was not a meaningful drop. The model answers just as well with 1k tokens as it does with 69k — it just wasn't using the other 68k anyway.

**Step 1: Index your repo once.**

```
provenant init /path/to/your/repo
```

Provenant parses every file with tree-sitter, generates a wiki page per module via an LLM, and stores everything in SQLite/FTS5 + LanceDB. For the benchmark, that came to 6,122 pages across 12 repos, indexed in minutes.
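For a feel of the FTS5 half of that storage layer, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and helper functions are assumptions for illustration, not Provenant's actual schema, and it requires an SQLite build with FTS5 enabled. The second page is made up.

```python
import re
import sqlite3


def build_index(pages: dict[str, str]) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # `path` is stored but not searchable; `body` holds the wiki text.
    conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(path UNINDEXED, body)")
    conn.executemany("INSERT INTO wiki (path, body) VALUES (?, ?)", pages.items())
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, question: str, k: int = 5) -> list[str]:
    terms = re.findall(r"\w+", question.lower())
    if not terms:
        return []
    # Quote each word so punctuation such as "?" cannot break FTS5 query
    # syntax, then OR the terms so partial matches are still ranked.
    match_expr = " OR ".join(f'"{t}"' for t in terms)
    rows = conn.execute(
        # bm25() returns lower values for better matches, so sort ascending.
        "SELECT path FROM wiki WHERE wiki MATCH ? ORDER BY bm25(wiki) LIMIT ?",
        (match_expr, k),
    )
    return [path for (path,) in rows]


pages = {
    "flask/sansio/scaffold.py": (
        "Scaffold is the shared base class for Flask and Blueprint. "
        "@route() calls add_url_rule(), which creates a Werkzeug Rule "
        "and inserts it into url_map."
    ),
    # Hypothetical page, for contrast only.
    "example/serializer.py": "Serializes response payloads into JSON text.",
}
conn = build_index(pages)
results = search(conn, "how does Flask handle URL routing?")
```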

**Step 2: Start the MCP server.**

```
provenant serve --repo /path/to/your/repo
```

That's it. Provenant is now a local MCP server exposing tools your agent can call natively.

**Step 3: Just use Claude. No special commands.**

Add it to your `claude_desktop_config.json`:

```
{
  "mcpServers": {
    "provenant": {
      "command": "provenant",
      "args": ["serve", "--repo", "/path/to/your/repo"]
    }
  }
}
```

Now when you ask Claude *"how does authentication work?"*, it doesn't grep your codebase. It calls `provenant_ask`, gets 3 wiki pages (~1k tokens), and answers. You never change how you work. The retrieval layer is just better.

```
You ask Claude a question
         ↓
Claude calls provenant_ask (MCP tool)
         ↓
Provenant: BM25 over wiki pages → top-k results
         ↓
Claude synthesizes answer from ~1,030 tokens
         ↓
Attribution confidence logged → weak pages auto-repaired
```

I pointed it at a fresh repo, a Java Android music player it had never seen, and asked *"How does this app play music?"* Here's the actual response after calling `provenant_ask`:

*Screenshot: Claude's unedited response after Provenant retrieved 3 wiki pages (~1k tokens). Discovery phase: ~30 seconds.*

"Provenant compressed the discovery phase from ~5–10 minutes of grepping/reading to ~30 seconds. It's like having an experienced teammate say 'here's the 3 files you need and what they do' before you dive in."

— Claude, unprompted

That's on a Java codebase. Provenant indexes Python — but the wiki pages are plain English, and Claude reads English just fine.

Nobody measures when a retrieval index is wrong. BM25 returns 5 results and acts confident. The model uses 2. The other 3 were noise. The index degrades silently as your codebase changes.

I built a metric for this:

```
attribution confidence = pages actually cited / pages retrieved
```

Zero extra LLM calls. Derived from the citation structure already in the answer. It correlates with answer quality (r = 0.415 against a blind LLM judge) — high-confidence retrievals score 5.0/5 on average; low-confidence score 4.5.
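A minimal sketch of how that ratio could be derived from an answer. The `[[page-id]]` citation marker is an assumed format chosen for illustration, as are the function name and the page ids; the article does not specify how citations are encoded.

```python
import re


def attribution_confidence(answer: str, retrieved: list[str]) -> tuple[float, list[str]]:
    """Return (cited / retrieved, uncited pages) for one answer."""
    if not retrieved:
        return 0.0, []
    # Assumed citation format: [[page-id]] markers embedded in the answer.
    cited = set(re.findall(r"\[\[(.+?)\]\]", answer))
    uncited = [page for page in retrieved if page not in cited]
    return (len(retrieved) - len(uncited)) / len(retrieved), uncited


answer = "Routes are registered via add_url_rule [[scaffold]] and dispatched in [[app]]."
retrieved = ["scaffold", "app", "wrappers", "cli", "globals"]
confidence, uncited = attribution_confidence(answer, retrieved)
```

With five pages retrieved and two cited, this gives 0.4, and the three uncited pages are the candidates for inspection.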

When a page's confidence drops below 0.35, Provenant queues a background repair:

```
# Fires silently after low-confidence answers
asyncio.create_task(_background_repair(uncited_pages))
```
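Here is one way that fire-and-forget call might sit inside a surrounding function. Everything beyond the `asyncio.create_task(_background_repair(...))` line and the 0.35 threshold is an assumption: `maybe_schedule_repair` and `demo` are hypothetical names, the threshold is applied to a single confidence value, and the repair body is a placeholder where a real system would regenerate the page with an LLM.

```python
import asyncio
from typing import Optional

REPAIR_THRESHOLD = 0.35  # threshold quoted in the article


async def _background_repair(uncited_pages: list[str]) -> list[str]:
    repaired = []
    for page in uncited_pages:
        await asyncio.sleep(0)  # placeholder for an LLM regeneration call
        repaired.append(page)
    return repaired


def maybe_schedule_repair(confidence: float,
                          uncited_pages: list[str]) -> Optional[asyncio.Task]:
    # Must be called while an event loop is running.
    if confidence < REPAIR_THRESHOLD and uncited_pages:
        # Return the task so the caller keeps a reference; tasks without
        # one can be garbage-collected before they finish.
        return asyncio.create_task(_background_repair(uncited_pages))
    return None


async def demo() -> tuple[list[str], Optional[asyncio.Task]]:
    task = maybe_schedule_repair(0.2, ["wrappers", "cli"])  # below threshold
    skipped = maybe_schedule_repair(0.8, ["globals"])       # above threshold
    repaired = await task if task is not None else []
    return repaired, skipped
```

Holding on to the returned task matters in practice: the event loop only keeps weak references to tasks, so a bare `create_task(...)` whose result is discarded can disappear mid-flight.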

**75% of low-confidence queries improved after one repair cycle.** Cost: ~$0.02. Touches only 0.7% of pages.

**The index improves the more you use it.** Without you doing anything.

Some repos benefit more than others. The pattern: **small, well-documented repos see the biggest gains.** Large monoliths still improve, just from a harder baseline.

| Repo | Coverage@5 | Improvement | Wiki pages |
|---|---|---|---|
| requests | 78% | +38pp | 58 |
| pytest | 72% | +32pp | 186 |
| seaborn | 71% | +31pp | 94 |
| flask | 69% | +29pp | 74 |
| xarray | 66% | +26pp | 218 |
| sphinx | 63% | +23pp | 412 |
| django | 61% | +21pp | 1,393 |
| scikit-learn | 57% | +17pp | 1,124 |
| matplotlib | 55% | +15pp | 634 |

requests at 78% makes sense — it's a small, well-structured library with clean module boundaries. Each file does one thing. The wiki pages are precise. The retrieval is nearly perfect.

Django at 61% is still a +21pp improvement on a 1,393-page codebase. That's not nothing.

For the ~3% of queries where even wiki vocabulary doesn't match, Provenant generates a hypothetical wiki snippet that *would* answer the question, then searches against that (this is the HyDE row in the results table). The two rankings are merged with BM25 via Reciprocal Rank Fusion.

+2.4pp Coverage@5. One extra LLM call. Not the headline, but it's there when it helps. The fact that it only fires 3% of the time is the point: the wiki handles the rest.
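Reciprocal Rank Fusion itself is only a few lines. This sketch uses the standard formula, where each document's score is the sum of `1 / (k + rank)` over the rankings it appears in, with the commonly used constant `k = 60`. Whether Provenant uses that constant is an assumption, and the page ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists into one, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


bm25_ranking = ["scaffold", "app", "wrappers"]
hyde_ranking = ["wrappers", "scaffold", "blueprints"]
fused = reciprocal_rank_fusion([bm25_ranking, hyde_ranking])
```

Pages that both rankings agree on rise to the top, while a page found by only one ranking still makes the list.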

**Speculative prefetching** — I built a hook that pre-fetches wiki context whenever your agent greps a file, warming the cache. Median speedup: 1.0×. The DB reads were already fast enough. Keeping the code, not claiming a win.

**Compression/pruning** — removing low-attribution pages before synthesis. Firing rate on test set: 0%. The threshold was too conservative. Needs tuning before it's useful.

**Self-healing at scale** — the repair loop is only evaluated on Django (20 questions). I can't claim it generalises yet. It's early evidence, not a proven result.

```
pip install provenant

# Index
provenant init /path/to/your/repo

# Serve (MCP)
provenant serve --repo /path/to/your/repo
```

Works with Claude Code, Cursor, or anything MCP-compatible. Your agent gets `provenant_ask`, `provenant_search`, `provenant_context`, and `provenant_risk` as native tools. It stops grepping. It starts reading the wiki.

⭐ **GitHub: github.com/shreyashsharma/provenant**

The retrieval problem in AI coding tools is real and under-measured. BM25 on raw source code is the floor, not the ceiling.

If you try Provenant on your repo, I'm especially interested in two numbers:

Those two data points are more honest than any eval I can run on my own repos. Happy to compare notes.

*Benchmarked with DeepSeek-V3.2 · nomic-embed-text-v1.5 · SWE-bench Verified (500 tasks) · 12 Python OSS repos*
