# RAG for Codebases Is Harder Than It Looks

> Source: <https://dev.to/mahima_thacker/rag-for-codebases-is-harder-than-it-looks-1nhg>
> Published: 2026-05-27 13:20:36+00:00

Building RepoChat, an AI tool that explains GitHub repos

I built a small AI tool called RepoChat.

The idea is simple:

Paste a GitHub repo. Ask questions. Get answers from the codebase.

Something like:

What does this project do?

How is the backend structured?

Where is authentication handled?

How do I run this locally?

Which files should I read first?

I did not want to build another chatbot. I wanted to build something useful for developers.

Because every developer has faced this problem:

You open a new repo.

There are 50 files.

The README is either missing, outdated, or too high-level.

You don’t know where to start.

So I wanted to see if RAG could help make codebase onboarding faster.

RepoChat takes a GitHub repository and turns it into something you can ask questions about.

The first version is intentionally small.

It has two main flows.

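Indexing happens once for a repo: fetch the files, filter them, chunk them, create embeddings, and store the chunks.

Querying happens every time someone asks a question: embed the question, retrieve the relevant chunks, and call the LLM to produce an answer with sources.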

This split helped me think about the system more clearly.

The indexing pipeline prepares the repo.

The query pipeline uses that indexed repo to answer questions.

The first version has a frontend and a backend.

The frontend has:

- repo URL input
- question input
- answer section
- sources section

The backend handles:

- fetch repo files
- filter useful files
- chunk text/code
- create embeddings
- store chunks
- retrieve relevant chunks
- call the LLM

I used this kind of structure:

```
repochat-ai/
  apps/
    web/
    api/
  data/
    chroma/
  README.md
```

Frontend: Next.js, React, Tailwind

Backend: FastAPI

Chunking: LangChain text splitters

Embeddings: OpenAI text-embedding-3-small

Vector DB: Chroma

LLM: Anthropic Claude Sonnet 4.6

Repo data: GitHub API

**I used two providers for different jobs.**

OpenAI handles embeddings because its embedding models are simple and reliable for this use case.

Claude handles generation because I wanted stronger reasoning and clearer explanations for codebase questions.

This felt better than forcing one provider to do everything.

**Step 1: Fetching the repo**

The first problem was getting the right files.

At first, I thought:

Just fetch the repo and send everything to the AI.

That sounds simple, but it breaks quickly.

Repos contain a lot of files that are not useful for understanding the project:

```
node_modules/
.git/
dist/
build/
lock files
generated files
images
large JSON files
```

So I had to filter files.

For the first version, I focused on files like:

- README / readme files
- `.md`
- `.py`
- `.ts`
- `.tsx`
- `.js`
- `.jsx`
- `.json`

This already made the answers better.

One thing I learned here:

Good RAG starts before embeddings. It starts with choosing what data should enter the pipeline.

If you put garbage into the vector database, retrieval will return garbage too.
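To make the filtering step concrete, here is a minimal sketch of a path filter. The function name `should_index`, the exact lock file names, and the `MAX_BYTES` cutoff are assumptions for illustration; only the skipped folders and allowed extensions come from the lists above.

```python
from pathlib import PurePosixPath

# Folders and extensions taken from the lists in this post.
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}
ALLOWED_EXTENSIONS = {".md", ".py", ".ts", ".tsx", ".js", ".jsx", ".json"}

# Assumptions: typical lock file names and an arbitrary size cutoff.
SKIP_FILES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml"}
MAX_BYTES = 200_000


def should_index(path: str, size_bytes: int) -> bool:
    """Return True if a repo file looks worth sending into the pipeline."""
    p = PurePosixPath(path)
    # Reject anything that lives under a skipped directory.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return False
    if p.name in SKIP_FILES:
        return False
    # Very large files (for example big JSON dumps) add noise, not context.
    if size_bytes > MAX_BYTES:
        return False
    # README files are kept even when they have no extension.
    if p.name.lower().startswith("readme"):
        return True
    return p.suffix.lower() in ALLOWED_EXTENSIONS
```

With these rules, `apps/api/auth.py` passes, while `node_modules/react/index.js`, `package-lock.json`, and an image like `assets/logo.png` are rejected before any embedding cost is paid.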

**Step 2: Chunking code is not the same as chunking docs**

This was the first part that felt harder than expected.

Most RAG tutorials use normal text:

paragraph

paragraph

paragraph

But code is different.

A code file has:

```
imports
functions
classes
comments
config
repeated names
small pieces that only make sense together
```

If chunks are too small, the model loses context.

If chunks are too large, retrieval becomes noisy.

Example:

```
function getUser() {
  ...
}
```

This function alone may not be enough.

The useful context may include:

``` js
import { db } from "./db"
import { users } from "./schema"

function getUser() {
  ...
}
```

So I had to think more carefully about chunk size and metadata.
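One way to keep that context is to repeat a file's import lines at the top of every chunk. The sketch below is a plain-Python illustration of that idea, not the LangChain splitter RepoChat uses; `chunk_with_imports` and its `max_lines` parameter are hypothetical names, and the import detection is deliberately naive.

```python
def chunk_with_imports(source: str, max_lines: int = 40) -> list[str]:
    """Split source into line-based chunks, repeating the import header in each."""
    lines = source.splitlines()
    # Naive detection: top-level lines that start with "import " or "from ".
    header = [ln for ln in lines if ln.startswith(("import ", "from "))]
    body = [ln for ln in lines if ln not in header]

    chunks = []
    for start in range(0, len(body), max_lines):
        window = body[start:start + max_lines]
        if not any(ln.strip() for ln in window):
            continue  # skip windows that contain only blank lines
        # Every chunk carries the imports, so names like `db` stay explainable.
        chunks.append("\n".join(header + window))
    return chunks
```

The trade-off is some duplicated text in the index in exchange for chunks that still make sense when retrieved on their own.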

For every chunk, I kept metadata alongside the text, including the file path.

The file path is very important.

A chunk from:

`apps/api/auth.py`

means something different from:

`apps/web/components/Login.tsx`

Even if both mention “user” or “auth”.

The file extension is not stored separately, but it is still available through the path.
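As a rough illustration of "available through the path", a chunk record can derive the extension on demand instead of storing it. The `Chunk` class and its field names are hypothetical, and only the two fields mentioned in this post are shown.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Chunk:
    """One indexed piece of a repo file. Field names are assumptions."""
    text: str
    path: str  # for example "apps/api/auth.py"

    @property
    def extension(self) -> str:
        # Derived from the path rather than stored as separate metadata.
        return PurePosixPath(self.path).suffix
```

Here `Chunk("...", "apps/api/auth.py").extension` gives `.py`, so filtering by language stays possible without an extra metadata field.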

**Step 3: Embeddings and retrieval**

Once the files were chunked, I created embeddings using OpenAI’s text-embedding-3-small model and stored them in Chroma.

When a user asks a question, I embed the question too.

Then the system searches Chroma for chunks that are close to that question.

So if someone asks:

**Where is authentication handled?**

the system may find related chunks even if the exact word “authentication” is not used everywhere.

It can still find related words like:

auth

login

session

token

middleware

jwt

user

This is where RAG becomes useful.
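To show what "close to the question" means, here is a toy version of the search using cosine similarity over hand-written vectors. This is not Chroma's implementation (Chroma does the real search, and its distance metric is configurable); the three-number vectors are invented stand-ins for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (path, score) pairs most similar to the query, best first."""
    scored = [(path, cosine(query_vec, vec)) for path, vec in indexed]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up vectors: the first dimension loosely stands for "auth-ness".
indexed = [
    ("apps/api/middleware/auth.ts", [0.9, 0.1, 0.0]),
    ("README.md", [0.3, 0.3, 0.9]),
    ("apps/web/components/Login.tsx", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stand-in for the embedded question

for path, score in top_k(query, indexed, k=2):
    print(f"{score:.3f}  {path}")
```

With these toy numbers the middleware file ranks first and the login component second, even though no keyword matching is involved.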

But retrieval is not magic.

Sometimes it retrieves:

- the README instead of the actual code
- a frontend file when the backend file is more useful
- a config file because it has matching words
- a chunk that mentions the right term but does not answer the question

That was a good reminder:

**RAG is not just “put data in vector DB and ask questions.” Retrieval quality matters a lot.**

**Step 4: Asking questions**

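The Q&A flow looks like this: the user asks a question, the backend embeds it, retrieves the closest chunks from Chroma, and sends those chunks to Claude along with the question.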

I wanted answers to include sources because without sources, it is hard to trust the output.

For developer tools, source references are not optional.

If the AI says:

Auth is handled in middleware.

I want to know:

Which file?

Which function?

Where should I look?

So the answer should include something like:

Sources:

`- apps/api/middleware/auth.ts`

`- apps/api/routes/users.ts`

This makes the tool much more useful.
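A source list like this can be built directly from the metadata of the retrieved chunks. The sketch below assumes retrieval returns file paths in rank order; `format_sources` is a hypothetical helper name.

```python
def format_sources(retrieved_paths: list[str]) -> str:
    """Build a 'Sources:' block from chunk paths, deduplicated in rank order."""
    # dict.fromkeys keeps the first occurrence of each path and preserves order.
    unique = list(dict.fromkeys(retrieved_paths))
    return "Sources:\n" + "\n".join(f"- {path}" for path in unique)


print(format_sources([
    "apps/api/middleware/auth.ts",
    "apps/api/routes/users.ts",
    "apps/api/middleware/auth.ts",  # second chunk from the same file
]))
```

Deduplicating matters because several chunks often come from the same file, and a repeated path adds noise without adding information.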

**What broke or felt messy**

This was the most useful part of the build.

**1. Large repos are noisy**

Small repos are easy.

Large repos need better filtering.

A real repo may contain:

docs

examples

tests

scripts

generated files

frontend

backend

infra

If everything is indexed equally, answers become messy.

A better version should rank files based on importance.

For example:

```
README.md
package.json
main entry files
routes
config files
src/
docs/
```

should probably matter more than random test snapshots.
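One possible way to express that ranking is a per-file weight that scales the similarity score. Everything in this sketch is an assumption: the weights are invented, the entry-file names are guesses, and RepoChat does not do this yet.

```python
from pathlib import PurePosixPath


def importance(path: str) -> float:
    """Heuristic weight for a file; the numbers are arbitrary illustrations."""
    p = PurePosixPath(path)
    name = p.name.lower()
    dirs = [part.lower() for part in p.parts[:-1]]
    if name.endswith(".snap") or "__snapshots__" in dirs:
        return 0.2  # test snapshots rarely explain anything
    if name.startswith("readme") or name == "package.json":
        return 2.0
    if "routes" in dirs or name.startswith(("main.", "index.", "app.")):
        return 1.5  # assumed names for entry files
    if dirs and dirs[0] in {"src", "docs"}:
        return 1.2
    return 1.0


def rerank(hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Reorder (path, similarity) hits by similarity times file importance."""
    return sorted(hits, key=lambda hit: hit[1] * importance(hit[0]), reverse=True)
```

For example, a snapshot file with similarity 0.80 would drop below a route file with similarity 0.70, because the weights turn those into 0.16 and 1.05.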

**2. README is useful but not enough**

README files are helpful for high-level questions.

But if you ask:

How does auth work?

the README is usually not enough.

You need code.

This is where code-aware retrieval becomes important.

**3. File paths matter a lot**

At first, I treated chunks mostly as text.

But for codebases, metadata is part of the answer.

A chunk from:

`backend/routes/payment.ts`

is not just text.

It tells you:

This is backend code

This is route-level logic

This likely handles payment APIs

So the file path helps both retrieval and explanation.

**4. The model needs strict instructions**

If the model does not know something, it should say so.

For example:

I could not find authentication logic in the indexed files.

is much better than:

The app probably uses JWT authentication.

For a developer tool, guessing is dangerous.

So the prompt rule was intentionally strict:

Answer using only the provided repo context.

If the answer is not present in the context, say that clearly.

Always mention the source files used.

Do not guess implementation details.
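These rules can be assembled into the final prompt together with the retrieved chunks. The sketch below uses the four rules verbatim; the `build_prompt` name and the layout of the context section are assumptions, not RepoChat's actual prompt.

```python
RULES = """Answer using only the provided repo context.
If the answer is not present in the context, say that clearly.
Always mention the source files used.
Do not guess implementation details."""


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Combine rules, retrieved (path, text) chunks, and the question."""
    # Each chunk is labelled with its file path so the model can cite it.
    context = "\n\n".join(f"File: {path}\n{text}" for path, text in chunks)
    if not context:
        # Make an empty retrieval explicit instead of leaving a blank section.
        context = "(no matching chunks were retrieved)"
    return f"{RULES}\n\nRepo context:\n{context}\n\nQuestion: {question}"
```

Labelling every chunk with its path is what lets the model follow the "mention the source files" rule, and the explicit empty-context marker gives it something concrete to report when retrieval finds nothing.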

Building RepoChat made RAG feel much more real to me.

Before building it, RAG sounded simple:

embed docs

retrieve docs

ask LLM

After building it, I see it more like this:

choose the right data

clean the data

chunk it properly

store useful metadata

retrieve the right chunks

control the prompt

show sources

test bad answers

The retrieval part is only one piece.

The developer experience around it matters just as much.

A developer should not care about embeddings or vector DBs.

They should only feel:

“I understand this repo faster now.”

The first version is useful, but there are many things I would improve.

**1. Better code parsing**

Instead of splitting files only by text size, I want to split code by structure:

functions

classes

exports

API routes

components

Tools like Tree-sitter could help with this.
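As a small taste of structure-based splitting, the sketch below uses Python's built-in `ast` module as a stand-in for Tree-sitter. It only handles Python files and only top-level functions and classes; `split_python_by_structure` is a hypothetical name.

```python
import ast


def split_python_by_structure(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive (Python 3.8+).
            start, end = node.lineno, node.end_lineno
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks
```

Each chunk is now a complete unit with a name and a line range, instead of an arbitrary slice that may cut a function in half. Tree-sitter would be needed to get the same result for TypeScript and JavaScript files.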

**2. Repo map**

Before answering questions, the app could build a repo map:

frontend

backend

API routes

database

auth

config

tests

This would help the model understand the project layout better.
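A very rough repo map could be built from file paths alone. The keyword rules below are hypothetical and would need tuning per repo; the first matching area wins, and anything unmatched falls into "other".

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical keyword-to-area rules, checked in this order.
AREA_KEYWORDS = {
    "tests": ("test", "tests", "__tests__"),
    "auth": ("auth",),
    "API routes": ("routes", "api"),
    "database": ("db", "models", "migrations", "schema"),
    "frontend": ("web", "components", "pages"),
    "config": ("config",),
}


def build_repo_map(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths into coarse areas based on folder and file names."""
    repo_map = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        # Tokens are the folder names plus the file name without its extension.
        tokens = {part.lower() for part in p.parts[:-1]} | {p.stem.lower()}
        area = next(
            (name for name, keywords in AREA_KEYWORDS.items() if tokens & set(keywords)),
            "other",
        )
        repo_map[area].append(path)
    return dict(repo_map)
```

A map like this could be added to the prompt as a short summary, so the model sees the overall layout before it reads individual chunks.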

**Better source citations**

Right now, file-level sources are useful.

But line-level sources would be better.

Example:

`apps/api/auth.ts:45-72`

That would make answers easier to verify.
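Line-level citations mostly require remembering where each chunk started and ended at indexing time. Here is a minimal sketch, assuming simple line-based chunks; `chunk_lines`, `cite`, and the dictionary keys are hypothetical names.

```python
def chunk_lines(path: str, source: str, max_lines: int = 30) -> list[dict]:
    """Split a file into line-based chunks that remember their line range."""
    lines = source.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        window = lines[start:start + max_lines]
        chunks.append({
            "path": path,
            "start_line": start + 1,          # 1-based, inclusive
            "end_line": start + len(window),  # 1-based, inclusive
            "text": "\n".join(window),
        })
    return chunks


def cite(chunk: dict) -> str:
    """Format a chunk as path:start-end."""
    return f'{chunk["path"]}:{chunk["start_line"]}-{chunk["end_line"]}'
```

Because the range is stored as metadata with the chunk, the citation costs nothing extra at query time.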

**3. Evaluation questions**

Every repo could have test questions like:

How do I run this project?

Where is auth handled?

Where are API routes defined?

What database does it use?

Then I can test whether RepoChat answers correctly.

This is where evals become useful.
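A first eval does not need an LLM at all: it can simply check whether retrieval returns the file a human expects. In the sketch below, `run_evals` and the case format are assumptions, `retrieve` is whatever function returns ranked file paths for a question, and the expected paths are placeholders.

```python
# Placeholder cases: each question is paired with a file a human expects to see.
EVAL_CASES = [
    {"question": "Where is auth handled?", "expected_path": "apps/api/middleware/auth.ts"},
    {"question": "Where are API routes defined?", "expected_path": "apps/api/routes/users.ts"},
]


def run_evals(cases: list[dict], retrieve, k: int = 3) -> tuple[float, list[str]]:
    """Return (hit rate, failed questions) for a retrieve(question, k) function."""
    failures = [
        case["question"]
        for case in cases
        if case["expected_path"] not in retrieve(case["question"], k)
    ]
    hit_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return hit_rate, failures
```

This only measures retrieval, not whether the final answer is correct, but it catches the common failure where the right file never reaches the model.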

**4. MCP integration**

Later, RepoChat could expose repo search as an MCP tool.

Then an agent could ask:

`search_codebase("where is auth handled?")`

and use RepoChat as a codebase understanding tool.

RepoChat started as a small demo, but it taught me a lot.

The biggest lesson:

RAG is only useful when the developer can trust the answer.

For codebases, trust comes from:

good retrieval

useful chunks

file metadata

clear sources

honest “I don’t know” answers

I still want to improve RepoChat, but even this first version made one thing clear:

AI tools for developers should not try to replace understanding.

They should help developers reach understanding faster.

That is the part I find exciting.

GitHub: [https://github.com/mahimathacker/repochat-ai](https://github.com/mahimathacker/repochat-ai)

Live demo: [https://youtu.be/kSgZSqH6iXk](https://youtu.be/kSgZSqH6iXk)

I’m still improving it, especially around better code parsing, line-level citations, repo maps, evals, and MCP support.