Here's what happens every time you ask an AI coding agent a question:
This is BM25 keyword search on raw source code. It's the same algorithm that powered web search in 2009. And it's still the shape of most coding-agent retrieval systems: keyword search, grep, file search, context stuffing.
I spent the last few months building something better. Here's what I found.
When you ask "how does Flask handle URL routing?", you're writing in English. The answer lives in `scaffold.py`, `app.py`, and `wrappers.py`: files full of Python syntax, decorator patterns, and Werkzeug internals.
BM25 tries to match your words against those files. It mostly fails.
The word "routing" appears 4 times in Flask's source. "URL" appears 31 times, mostly in docstrings and variable names scattered across 70+ files. BM25 retrieves 15 of them and hopes for the best.
The agent doesn't just have a retrieval problem. It has a vocabulary problem.
Natural language queries describe behavior. Source code implements syntax. These are different vocabularies, and no amount of BM25 tuning bridges that gap.
Generate a human-readable wiki page for every file and module, then search the wiki.
A wiki page for `flask/sansio/scaffold.py` reads like this:
Scaffold is the shared base class for Flask and Blueprint. `@route()` calls `add_url_rule()`, which creates a Werkzeug Rule and inserts it into `url_map`. View callables are stored in `view_functions` keyed by endpoint name.
Search that for "how does Flask handle URL routing?" and the query and the document speak the same language. No vocabulary gap.
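To make the vocabulary gap concrete, here is a minimal sketch of Okapi BM25 scoring the same question against a code-like snippet and a wiki-like description. The three documents are hypothetical stand-ins, not real Flask source or real Provenant output, and the tokenizer and parameters (`k1`, `b`) are common defaults assumed for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical documents, written only for this illustration.
code_doc = ("def add_url_rule(self, rule, endpoint=None, view_func=None): "
            "self.url_map.add(Rule(rule))")
wiki_doc = ("Scaffold handles URL routing: route() calls add_url_rule, "
            "which inserts a Werkzeug Rule into url_map.")
other_doc = "Session cookies are signed and stored by the session interface."

query = "how does Flask handle URL routing?"
code_score, wiki_score, other_score = bm25_scores(
    query, [code_doc, wiki_doc, other_doc]
)
print(f"code={code_score:.2f} wiki={wiki_score:.2f} other={other_score:.2f}")
```

The code snippet only matches on "url" fragments inside identifiers, while the wiki text also matches the behavioral word "routing", so it ranks higher for the same query.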
That's Provenant. Index once, search a wiki forever.
I ran this against SWE-bench Verified: 500 real GitHub issues across 12 major Python repos. The metric is Coverage@5: does the correct file appear in the top 5 retrieved results?
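Coverage@5 is simple to compute. A minimal sketch, assuming each task comes with a set of gold files and that a single gold file in the top 5 counts as covered (the function names and sample paths are hypothetical, not taken from the benchmark harness):

```python
def coverage_at_k(retrieved, gold, k=5):
    """True if any gold file appears among the top-k retrieved files."""
    return any(path in gold for path in retrieved[:k])

def mean_coverage_at_k(runs, k=5):
    """Fraction of tasks whose top-k results contain a gold file."""
    return sum(coverage_at_k(r, g, k) for r, g in runs) / len(runs)

# Hypothetical tasks: (ranked retrieval results, gold files).
runs = [
    (["a.py", "b.py"], {"b.py"}),                                   # hit at rank 2
    (["c.py", "d.py", "e.py", "f.py", "g.py", "h.py"], {"h.py"}),   # gold at rank 6
]
score = mean_coverage_at_k(runs)
```

Here the second task misses because its gold file sits at rank 6, just outside the cutoff, so `score` is 0.5.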
| Method | Coverage@5 | Tokens/query | Delta |
|---|---|---|---|
| Raw BM25 (baseline) | ~40% | ~65,000 | - |
| Provenant (wiki + BM25) | 63.8% | ~1,030 | +24pp |
| Provenant + HyDE | 66.2% | ~1,030 | +26pp |
+24 percentage points. From 40% to 63.8%. On 500 tasks. Across 12 repos.
And the token numbers aren't rounding errors:
| Repo | Naive tokens | Provenant tokens | Reduction |
|---|---|---|---|
| Flask (30 queries) | 69,044 | 1,070 | 64.5× |
| Django (20 queries) | 59,634 | 994 | 60.0× |
Answer quality delta: -0.15 on a 5-point blind-judge scale. In this sample, that was not a meaningful drop. The model answers just as well with 1k tokens as it does with 69k; it just wasn't using the other 68k anyway.
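The reduction factors and percentage-point deltas follow directly from the numbers in the two tables. A quick check, using only those reported figures (the ~40% baseline is approximate, so the deltas are rounded):

```python
# Token counts per query, copied from the table above.
naive = {"flask": 69_044, "django": 59_634}
provenant = {"flask": 1_070, "django": 994}

# Reduction factor = naive tokens / Provenant tokens.
reduction = {repo: round(naive[repo] / provenant[repo], 1) for repo in naive}

# Coverage@5 deltas in percentage points against the ~40% baseline.
baseline, wiki, wiki_hyde = 40.0, 63.8, 66.2
delta_wiki = round(wiki - baseline)        # wiki + BM25 vs raw BM25
delta_hyde = round(wiki_hyde - baseline)   # wiki + HyDE vs raw BM25
hyde_gain = round(wiki_hyde - wiki, 1)     # HyDE on top of the wiki

print(reduction, delta_wiki, delta_hyde, hyde_gain)
```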
Step 1: Index your repo once.
provenant init /path/to/your/repo
Provenant parses every file with tree-sitter, generates a wiki page per module via LLM, and stores everything in SQLite/FTS5 + LanceDB. 6,122 pages across 12 repos. Done in minutes.
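For a feel of what BM25 over wiki pages in SQLite/FTS5 can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, columns, and sample pages are hypothetical and are not Provenant's actual schema; it also assumes your SQLite build ships with the FTS5 extension, which most Python distributions do.

```python
import re
import sqlite3

def build_index(pages):
    """Create an in-memory FTS5 table of (path, body) wiki pages."""
    conn = sqlite3.connect(":memory:")
    # path is stored but not searched; body is the full-text column.
    conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(path UNINDEXED, body)")
    conn.executemany("INSERT INTO wiki (path, body) VALUES (?, ?)", pages)
    return conn

def search(conn, question, k=3):
    """Return the paths of the top-k pages for a natural-language question."""
    terms = re.findall(r"\w+", question.lower())
    if not terms:
        return []
    # Quote each word so punctuation in the question can't break FTS5 syntax.
    match = " OR ".join(f'"{t}"' for t in terms)
    rows = conn.execute(
        # bm25() returns lower values for better matches, so sort ascending.
        "SELECT path FROM wiki WHERE wiki MATCH ? ORDER BY bm25(wiki) LIMIT ?",
        (match, k),
    )
    return [path for (path,) in rows]

# Hypothetical wiki pages for illustration.
pages = [
    ("flask/sansio/scaffold.py",
     "Scaffold registers URL routing rules through add_url_rule and url_map."),
    ("flask/sessions.py",
     "Session cookies are signed and stored by the session interface."),
]
conn = build_index(pages)
results = search(conn, "how does Flask handle URL routing?")
```

Joining the terms with `OR` keeps recall high for conversational questions; a stricter `AND` query would return nothing whenever a filler word like "how" is missing from a page.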
Step 2: Start the MCP server.
provenant serve --repo /path/to/your/repo
That's it. Provenant is now a local MCP server exposing tools your agent can call natively.
Step 3: Just use Claude. No special commands.
Add it to your `claude_desktop_config.json`:
{
  "mcpServers": {
    "provenant": {
      "command": "provenant",
      "args": ["serve", "--repo", "/path/to/your/repo"]
    }
  }
}
Now when you ask Claude "how does authentication work?", it doesn't grep your codebase. It calls `provenant_ask`, gets 3 wiki pages (~1k tokens), and answers. You never change how you work. The retrieval layer is just better.
You ask Claude a question
↓
Claude calls provenant_ask (MCP tool)
↓
Provenant: BM25 over wiki pages → top-k results
↓
Claude synthesizes answer from ~1,030 tokens
↓
Attribution confidence logged → weak pages auto-repaired
I tried it on a fresh repo, a Java Android music player it had never seen, and asked: "How does this app play music?" Here's the actual response after calling `provenant_ask`:
Screenshot: Claude's unedited response after Provenant retrieved 3 wiki pages (~1k tokens). Discovery phase: ~30 seconds.
"Provenant compressed the discovery phase from ~5-10 minutes of grepping/reading to ~30 seconds. It's like having an experienced teammate say 'here's the 3 files you need and what they do' before you dive in."
- Claude, unprompted
That's on a Java codebase. Provenant indexes Python, but the wiki pages are plain English, and Claude reads English just fine.
Nobody measures when a retrieval index is wrong. BM25 returns 5 results and acts confident. The model uses 2. The other 3 were noise. The index degrades silently as your codebase changes.
I built a metric for this:
attribution confidence = pages actually cited / pages retrieved
Zero extra LLM calls. Derived from the citation structure already in the answer. It correlates with answer quality (r = 0.415 against a blind LLM judge): high-confidence retrievals score 5.0/5 on average; low-confidence ones score 4.5.
When a page's confidence drops below 0.35, Provenant queues a background repair:
asyncio.create_task(_background_repair(uncited_pages))
75% of low-confidence queries improved after one repair cycle. Cost: ~$0.02. Touches only 0.7% of pages.
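The metric itself fits in a few lines. A minimal sketch, assuming confidence is computed per query from the list of retrieved pages and the pages the answer cited, and that uncited pages become repair candidates once the score falls below 0.35; the function names and page names are hypothetical, and the real repair logic may work differently:

```python
REPAIR_THRESHOLD = 0.35  # threshold quoted in the post

def attribution_confidence(retrieved, cited):
    """Fraction of retrieved pages that the answer actually cited."""
    if not retrieved:
        return 0.0
    cited_set = set(cited)
    return sum(1 for page in retrieved if page in cited_set) / len(retrieved)

def pages_to_repair(retrieved, cited, threshold=REPAIR_THRESHOLD):
    """Uncited pages to queue for repair, or [] if confidence is acceptable."""
    if attribution_confidence(retrieved, cited) >= threshold:
        return []
    cited_set = set(cited)
    return [page for page in retrieved if page not in cited_set]

retrieved = ["a.md", "b.md", "c.md", "d.md", "e.md"]

# 2 of 5 cited: 0.4, above the threshold, nothing queued.
ok_confidence = attribution_confidence(retrieved, ["a.md", "b.md"])
ok_queue = pages_to_repair(retrieved, ["a.md", "b.md"])

# 1 of 5 cited: 0.2, below the threshold, the 4 uncited pages are queued.
low_queue = pages_to_repair(retrieved, ["a.md"])
```

The "5 results, model uses 2" case from earlier lands at 0.4, just above the repair threshold, so only clearly noisy retrievals trigger any work.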
The index improves the more you use it. Without you doing anything.
Some repos benefit more than others. The pattern: small, well-documented repos see the biggest gains. Large monoliths still improve, just from a harder baseline.
| Repo | Coverage@5 | Improvement | Wiki pages |
|---|---|---|---|
| requests | 78% | +38pp | 58 |
| pytest | 72% | +32pp | 186 |
| seaborn | 71% | +31pp | 94 |
| flask | 69% | +29pp | 74 |
| xarray | 66% | +26pp | 218 |
| sphinx | 63% | +23pp | 412 |
| django | 61% | +21pp | 1,393 |
| scikit-learn | 57% | +17pp | 1,124 |
| matplotlib | 55% | +15pp | 634 |
requests at 78% makes sense: it's a small, well-structured library with clean module boundaries. Each file does one thing. The wiki pages are precise. The retrieval is nearly perfect.
Django at 61% is still a +21pp improvement on a 1,393-page codebase. That's not nothing.
For the ~3% of queries where even wiki vocabulary doesn't match, Provenant generates a hypothetical wiki snippet that would answer the question, then searches against that. Merged with BM25 via Reciprocal Rank Fusion.
+2.4pp Coverage@5. One extra LLM call. Not the headline β but it's there when it helps. The fact that it only fires 3% of the time is the point: the wiki handles the rest.
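Reciprocal Rank Fusion is the piece that merges the two result lists. A minimal sketch of the standard formula, where each document scores the sum of `1 / (k + rank)` over every list it appears in; the constant `k=60` is the commonly used default and an assumption here, since the post doesn't state the value, and the page names are hypothetical:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: score(doc) = sum of 1 / (k + rank) across lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Hypothetical result lists: plain BM25 and the HyDE-based search.
bm25_hits = ["a.md", "b.md", "c.md"]
hyde_hits = ["b.md", "d.md", "a.md"]
fused = reciprocal_rank_fusion([bm25_hits, hyde_hits])
```

Pages that both searches agree on float to the top: `b.md` (ranks 2 and 1) edges out `a.md` (ranks 1 and 3), and pages found by only one search trail behind.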
Speculative prefetching: I built a hook that pre-fetches wiki context whenever your agent greps a file, warming the cache. Median speedup: 1.0×. The DB reads were already fast enough. Keeping the code, not claiming a win.
Compression/pruning: removing low-attribution pages before synthesis. Firing rate on test set: 0%. The threshold was too conservative. Needs tuning before it's useful.
Self-healing at scale: the repair loop is only evaluated on Django (20 questions). I can't claim it generalises yet. It's early evidence, not a proven result.
pip install provenant
provenant init /path/to/your/repo
provenant serve --repo /path/to/your/repo
Works with Claude Code, Cursor, or anything MCP-compatible. Your agent gets `provenant_ask`, `provenant_search`, `provenant_context`, and `provenant_risk` as native tools. It stops grepping. It starts reading the wiki.
GitHub: github.com/shreyashsharma/provenant
The retrieval problem in AI coding tools is real and under-measured. BM25 on raw source code is the floor, not the ceiling.
If you try Provenant on your repo, I'm especially interested in two numbers:
Those two data points are more honest than any eval I can run on my own repos. Happy to compare notes.
Benchmarked with DeepSeek-V3.2 · nomic-embed-text-v1.5 · SWE-bench Verified (500 tasks) · 12 Python OSS repos