# What It Actually Takes to Index Open Source

> Source: <https://githits.com/blog/what-it-actually-takes-to-index-open-source/>
> Published: 2026-06-16 00:00:00+00:00

June 16, 2026 · 9 min read

A look under the hood of GitHits, what a real open-source index requires, and why search-and-rerank depends on its source.

Developers keep asking us a version of the same question. If my coding agent can already clone a repo, grep it, and run `gh search`, what is GitHits actually adding? Sometimes it comes out sharper: why is this a product at all, when an agent can do all of that on its own?

It is a fair question, and I want to answer it properly. Searching and reading GitHub directly solves a surprising number of problems, and for a simple lookup your agent doing it itself is fine. But the honest answer is really about what sits underneath the product. So this post is about that: what indexing open source actually means, and why most of the work is the part nobody sees.

## What your agent gets on its own

Let me take the question literally. Your agent can clone a dependency, grep it, read the files, and hit GitHub’s code search. For a quick lookup, that works. But look at what it is working from. A clone hands it the latest commit on the default branch, and [GitHub code search only covers the default branch too](https://docs.github.com/en/search-github/github-code-search/about-github-code-search#limitations). So the moment your problem is version-specific, and most are, the agent is reading the wrong code: the current `main`, not the `2.3.1` you depend on. It is also working one repository at a time, burning context on whole files it mostly discards, and treating code as text with no sense of what to trust. Cloning and searching give you a snapshot, and a snapshot is not the same as knowing.

Doing this properly, for any package at the version you are on, is what an index is for. The word gets used loosely, though. A common approach is search-and-rerank: call a search API, reorder the results with a model, and return the top few. That is useful, and honestly how GitHits started. Our example feature still works that way. But it is limited by the source it starts from, because reordering a snapshot does not give you source that snapshot never contained.

A real index is different. You fetch the actual source, parse it, turn it into structure, and keep that structure version by version, so it can be queried later. The code is understood as code, once, and stored.

When your agent asks something, it reads from that stored structure instead of cloning and grepping in the moment. Building that structure is the hard part, and it is most of what follows.

## What indexing actually involves

Indexing a single repository at a single version runs through a chain of steps. None of them are exotic on their own. There are just a lot of them, they all have to be correct, and they all have to keep running.

**Resolve the version.** Before anything else, you have to turn a version like `1.2.3` into the exact commit it refers to. That sounds like a lookup and it is not, because every ecosystem, and a lot of individual projects, tag their releases differently. It is worth a closer look on its own, below.

**Fetch and pin the source.** Pull that commit and treat it as immutable. A pinned version always maps to the same code, so a result you get today is the same one you would get in six months.

**Parse it, across many languages.** Real-world code is messy. Files are large, sometimes half-broken, full of language-specific constructs, and every ecosystem has its own grammar. Parsing all of it into something structured, reliably, is most of the unglamorous work, and a file that does not parse cleanly still has to yield something useful rather than nothing.

**Turn text into structure.** Once a file is parsed we pull out the things that make code navigable: the files, the symbols defined in them and what kind each one is, the imports and exports, and how they connect. That structure is what lets an agent search for a symbol and read its actual definition, or follow an import to the right file, instead of grepping for a name and guessing which of the matches is the one that matters.
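As a rough illustration of what "structure" means here, the sketch below pulls symbols and imports out of a single Python file using the standard-library `ast` module. It is not GitHits's implementation: a real indexer has to cover many languages, and the `Symbol` type and `extract_structure` function are hypothetical names chosen for this example.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str   # identifier as written in the source
    kind: str   # "function" or "class" in this simplified sketch
    line: int   # 1-based line where the definition starts


def extract_structure(source: str) -> tuple[list[Symbol], list[str]]:
    """Return (symbols, imported module names) for one Python source file."""
    tree = ast.parse(source)
    symbols: list[Symbol] = []
    imports: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(Symbol(node.name, "function", node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append(Symbol(node.name, "class", node.lineno))
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have no module name.
            imports.append(node.module or ".")
    # ast.walk is breadth-first, so order symbols by where they appear.
    symbols.sort(key=lambda s: s.line)
    return symbols, imports


if __name__ == "__main__":
    sample = "import os\n\nclass Cache:\n    def get(self, key):\n        return key\n"
    for symbol in extract_structure(sample)[0]:
        print(symbol)
```

With a table like this stored per file, "find the definition of `Cache.get`" becomes a lookup by name and kind rather than a grep over raw text.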

**Make it searchable in a way that understands code.** We build a text index from that parsed structure, with filters for things like file type and symbol, so a query can ask for an exact identifier or scope itself to the right kind of file. We tried embeddings first, and for code they fell short: similarity is good at finding code that reads alike, but a lot of the time what you need is the thing that is exactly named, in the right place.

**Tell near-duplicates apart.** A huge amount of open source is copied, forked, and vendored, so the same file shows up in a hundred repositories. We compute structural fingerprints of code so we can recognize when two things are really the same implementation and surface the canonical one instead of ten copies of it.
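One simple way to picture a structural fingerprint, offered as a sketch rather than a description of what GitHits actually computes: hash the parsed syntax tree instead of the raw text, so copies that differ only in comments or formatting collapse to the same value. The function names below are hypothetical, and the example is limited to Python source via the standard-library `ast` module.

```python
import ast
import hashlib
from collections import defaultdict


def structural_fingerprint(source: str) -> str:
    """Hash the syntax tree, ignoring comments, whitespace, and line numbers."""
    tree = ast.parse(source)
    # include_attributes=False drops line/column positions from the dump.
    # Note: dump output can differ between Python versions, so fingerprints
    # are only comparable when produced by the same interpreter version.
    canonical = ast.dump(tree, annotate_fields=False, include_attributes=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_duplicates(files: dict[str, str]) -> dict[str, list[str]]:
    """Map each fingerprint to the paths that share that implementation."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path, source in files.items():
        groups[structural_fingerprint(source)].append(path)
    return dict(groups)
```

This version treats a renamed variable as a different implementation. Catching true near-duplicates would need further normalization, which is beyond this sketch.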

**Pack it and serve it fast.** The built index for a repository at a given version becomes a compact artifact we store in object storage, so we can serve search, file listings, grep, and exact line-range reads against it without cloning anything at request time. The slow work happens once, when we build the artifact, not while your agent waits.

**Keep all of it current and correct, continuously.** Repositories change, packages release, and our own parsing improves, which means re-indexing source we have already seen. Something has to notice when the index has drifted from reality and reconcile it. Across a large and growing number of packages, this never really stops.

One of these is worth looking at up close, because it shows how an easy-sounding step turns into the real work.

## Resolving a version to a commit

This is the one I reach for first, because it sounds trivial and it is so much worse than anyone expects. To fetch the source for a version, you have to know which commit that version is, and there is no standard answer.

One project tags it `1.2.3`, the next `v1.2.3`, the next `mylib-1.2.3` or `mylib@1.2.3`. Maven projects might tag `rel/1.2.3`, `REL1.2.3`, `r1.2.3`, or `1.2.3-RELEASE`. Go modules carry `+incompatible` suffixes. Some libraries publish `33.4.0-jre` to the registry while the git tag is a plain `v33.4.0`. Date-based versions appear as `2024.1.5` in one place and `2024.01.05` in another.

And then there are monorepos, where a single repository ships dozens of packages and every tag is scoped to one of them, like `react-dom@18.2.0`, so the `1.2.3` you actually want is buried among thousands of tags for its siblings.
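To make the tag schemes above concrete, here is a minimal sketch of a candidate generator that covers only the conventions named in this post. It is an illustration, not the GitHits resolver: `version_variants`, `candidate_tags`, `resolve_tag`, and the `REGISTRY_QUALIFIERS` tuple are hypothetical names, and the ordering of candidates is an assumption.

```python
from typing import Optional

# Registry-only qualifiers that may be missing from the git tag.
# Illustrative, not exhaustive.
REGISTRY_QUALIFIERS = ("jre",)


def version_variants(version: str) -> list[str]:
    """Spellings of one version that a project might have used in its tags."""
    base = version.split("+", 1)[0]                 # drop "+incompatible"
    if base.startswith("v") and base[1:2].isdigit():
        base = base[1:]                             # prefixes are re-added later
    variants = [base]
    for qualifier in REGISTRY_QUALIFIERS:
        if base.endswith("-" + qualifier):
            variants.append(base[: -len(qualifier) - 1])
    for v in list(variants):
        parts = v.split(".")
        # Date-based versions: try zero-padded and unpadded month/day.
        if all(p.isdigit() for p in parts) and len(parts[0]) == 4:
            variants.append(".".join([parts[0]] + [p.zfill(2) for p in parts[1:]]))
            variants.append(".".join([parts[0]] + [str(int(p)) for p in parts[1:]]))
    return list(dict.fromkeys(variants))            # de-duplicate, keep order


def candidate_tags(version: str, package: Optional[str] = None) -> list[str]:
    """Plausible tag names, package-scoped forms first for monorepos."""
    candidates: list[str] = []
    for v in version_variants(version):
        if package:
            candidates += [f"{package}@{v}", f"{package}-{v}"]
        candidates += [v, f"v{v}", f"rel/{v}", f"REL{v}", f"r{v}", f"{v}-RELEASE"]
    return list(dict.fromkeys(candidates))


def resolve_tag(version: str, existing_tags: set[str],
                package: Optional[str] = None) -> Optional[str]:
    """Return the first candidate that exists in the repository, else None."""
    for tag in candidate_tags(version, package):
        if tag in existing_tags:
            return tag
    return None
```

For example, `resolve_tag("33.4.0-jre", {"v33.4.0"})` finds `v33.4.0`, and passing `package="react-dom"` prefers `react-dom@18.2.0` over a sibling package's tag. Returning `None` is where a real resolver would need its fallbacks.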

So the part of our indexer whose only job is to map a version to a commit generates candidate tags for all of these schemes, tries a repository’s releases first and then its tags, and falls back gracefully when a project did something we have not seen before. It is more than seven hundred lines for that one job. We have rewritten the candidate rules six times as we kept finding new conventions in the wild, and that is before we have parsed a single line of the actual code. Even comparing two version numbers is ecosystem-specific: npm follows semver, Python follows PEP 440, and they disagree about what counts as a prerelease and which version is newer, so each registry has to be resolved by its own rules.
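The prerelease disagreement is easy to see with a small example. The sketch below uses deliberately simplified regular expressions, not the full grammars from the semver 2.0.0 specification or PEP 440, and the function names are hypothetical. The interesting input is `1.0.0-post1`: semver reads the hyphenated suffix as a prerelease that sorts before `1.0.0`, while PEP 440 normalizes it to the post-release `1.0.0.post1`, which sorts after `1.0.0`.

```python
import re

# Simplified: MAJOR.MINOR.PATCH followed by a hyphenated prerelease tag.
SEMVER_PRERELEASE = re.compile(r"^\d+\.\d+\.\d+-[0-9A-Za-z.-]+(\+.*)?$")

# Simplified: a release segment followed by an a/b/rc/dev style marker,
# with the optional separators PEP 440 accepts before normalization.
PEP440_PRERELEASE = re.compile(
    r"^\d+(\.\d+)*[._-]?(alpha|beta|preview|pre|rc|dev|a|b|c)[._-]?\d*",
    re.IGNORECASE,
)


def is_prerelease_semver(version: str) -> bool:
    """True if the string carries a semver prerelease tag (simplified check)."""
    return SEMVER_PRERELEASE.match(version) is not None


def is_prerelease_pep440(version: str) -> bool:
    """True if the string carries a PEP 440 pre or dev marker (simplified check)."""
    return PEP440_PRERELEASE.match(version) is not None


if __name__ == "__main__":
    for v in ["1.0.0-rc.1", "1.0.0rc1", "1.0.0-post1", "2.0.0"]:
        print(v, is_prerelease_semver(v), is_prerelease_pep440(v))
```

Running one registry's rules over another registry's version strings gives wrong answers quietly, which is why each ecosystem needs its own comparison logic.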

Every step in the list above is like this. Each one sounds simple, and each one carries edge cases and caveats that have to be handled, once and centrally, so we can hand your agent a clean, token-efficient result. The alternative is letting the agent rediscover each of these problems itself, session by session, which burns through tokens fast and still lands on the wrong answer more often than not.

## The part you cannot fake

Of everything in that list, version-awareness is the part that cannot be faked. A clone, a search, and a reranker all work off whatever single snapshot they were handed, almost always the latest one. You can reorder that snapshot, but you cannot reorder your way to a version it never contained.

This is the failure mode every agent user has already seen. The agent writes code that looks right, that even matches the documentation, and it does not work, because it is written against a version of the library you are not using. A version-aware index pins every version to its own commit, so the agent reads the source it is actually depending on, instead of an average of every version that ever existed. Ranking still matters, but ranking over a versioned index is a different thing from ranking over a single snapshot of `main`.

## And that is only the code

Everything above is what it takes to index the code of one package at one version. We do all of it again for the documentation, crawled and pinned version by version. Code shows your agent what a library actually does; the docs show how it is meant to be used, the configuration, the conventions, and the intent the source alone never makes obvious. When both are relevant, grounding the agent in the two together gives it a fuller and more reliable picture than either one alone. And bringing them together so the agent sees one picture instead of two piles is a harder problem than any single step that came before it.

What all of that adds up to is plain: your agent gets the real code and the real docs for the version you are actually on, put in front of it together so it has something solid to reason over instead of a confident guess. Finding code was never the bottleneck. Giving the agent a token-efficient, grounded view of what your context actually depends on is. If you want to see it, run `npx githits@latest init`.
