RAG for Codebases Is Harder Than It Looks

A developer built RepoChat, an AI tool that uses retrieval-augmented generation (RAG) to answer questions about GitHub repositories. The tool indexes codebases by filtering relevant files, chunking code with metadata, and storing embeddings in Chroma, then retrieves context to generate answers via Anthropic’s Claude LLM. The developer found that effective RAG for codebases requires careful file selection and chunking strategies, as standard document-based approaches fail with code.

Building RepoChat, an AI tool that explains GitHub repos

I built a small AI tool called RepoChat. The idea is simple: paste a GitHub repo, ask questions, and get answers from the codebase. Something like:

- What does this project do?
- How is the backend structured?
- Where is authentication handled?
- How do I run this locally?
- Which files should I read first?

I did not want to build another chatbot. I wanted to build something useful for developers. Because every developer has faced this problem: you open a new repo. There are 50 files. The README is either missing, outdated, or too high-level. You don’t know where to start. So I wanted to see if RAG could help make codebase onboarding faster.

RepoChat takes a GitHub repository and turns it into something you can ask questions about. The first version is intentionally small. It has two main flows:

- Indexing happens once for a repo.
- Querying happens every time someone asks a question.

This split helped me think about the system more clearly. The indexing pipeline prepares the repo. The query pipeline uses that indexed repo to answer questions.

The first version has a frontend and a backend.

The frontend has:

- repo URL input
- question input
- answer section
- sources section

The backend handles:

- fetch repo files
- filter useful files
- chunk text/code
- create embeddings
- store chunks
- retrieve relevant chunks
- call the LLM

I used this kind of structure:

    repochat-ai/
      apps/
        web/
        api/
      data/
        chroma/
      README.md

The stack:

- Frontend: Next.js, React, Tailwind
- Backend: FastAPI
- Chunking: LangChain text splitters
- Embeddings: OpenAI text-embedding-3-small
- Vector DB: Chroma
- LLM: Anthropic Claude Sonnet 4.6
- Repo data: GitHub API

I used two providers for different jobs. OpenAI handles embeddings because its embedding models are simple and reliable for this use case. Claude handles generation because I wanted stronger reasoning and clearer explanations for codebase questions. This felt better than forcing one provider to do everything.
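The split between the two flows, and between an embedding provider and a generation provider, can be sketched in a few lines. This is a minimal illustration and not RepoChat's actual code: `toy_embed` is a bag-of-words stand-in for a real embedding model, `echo_generate` is a stand-in for the LLM call, and the in-memory list stands in for Chroma. All function and class names here are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedder: word counts, NOT a real embedding model."""
    return dict(Counter(re.findall(r"[a-z_]+", text.lower())))


def echo_generate(question: str, context: str) -> str:
    """Stand-in for the LLM call; a real version would send a prompt to a model."""
    return f"(answer to {question!r} from {len(context)} characters of context)"


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class StoredChunk:
    path: str
    text: str
    vector: Vector


def index_repo(files: dict[str, str], embed: Callable[[str], Vector]) -> list[StoredChunk]:
    """Indexing flow: runs once per repo and prepares everything for later queries."""
    return [StoredChunk(path, text, embed(text)) for path, text in files.items()]


def answer_question(
    question: str,
    store: list[StoredChunk],
    embed: Callable[[str], Vector],
    generate: Callable[[str, str], str],
    k: int = 2,
) -> tuple[str, list[str]]:
    """Query flow: runs per question and only reads from the prepared store."""
    query_vector = embed(question)
    top = sorted(store, key=lambda c: cosine(query_vector, c.vector), reverse=True)[:k]
    context = "\n\n".join(f"# {c.path}\n{c.text}" for c in top)
    return generate(question, context), [c.path for c in top]


files = {
    "apps/api/auth.py": "def login(user, password): verify password and create a session token",
    "apps/web/page.tsx": "export default function Page() { return <main>Hello</main> }",
    "README.md": "RepoChat answers questions about a GitHub repo",
}
store = index_repo(files, toy_embed)  # once per repo
answer, sources = answer_question(
    "Where is the login session token created?", store, toy_embed, echo_generate, k=1
)
print(sources)
```

Because `embed` and `generate` are passed in as plain callables, the embedding side and the generation side can come from different providers without the pipeline code caring which is which.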
Step 1: Fetching the repo

The first problem was getting the right files. At first, I thought: just fetch the repo and send everything to the AI. That sounds simple, but it breaks quickly. Repos contain a lot of files that are not useful for understanding the project:

- node_modules/
- .git/
- dist/
- build/
- lock files
- generated files
- images
- large JSON files

So I had to filter files. For the first version, I focused on files like:

- README / readme files
- .md
- .py
- .ts
- .tsx
- .js
- .jsx
- .json

This already made the answers better. One thing I learned here: good RAG starts before embeddings. It starts with choosing what data should enter the pipeline. If you put garbage into the vector database, retrieval will return garbage too.

Step 2: Chunking code is not the same as chunking docs

This was the first part that felt harder than expected. Most RAG tutorials use normal text: paragraph, paragraph, paragraph. But code is different. A code file has:

- imports
- functions
- classes
- comments
- config
- repeated names
- small pieces that only make sense together

If chunks are too small, the model loses context. If chunks are too large, retrieval becomes noisy. Example:

    function getUser() { ... }

This function alone may not be enough. The useful context may include:

    import { db } from "./db"
    import { users } from "./schema"

    function getUser() { ... }

So I had to think more carefully about chunk size and metadata. For every chunk, I kept metadata like the file path. The file path is very important. A chunk from apps/api/auth.py means something different from a chunk from apps/web/components/Login.tsx, even if both mention “user” or “auth”. The file extension is not stored separately, but it is still available through the path.

Step 3: Embeddings and retrieval

Once the files were chunked, I created embeddings using OpenAI’s text-embedding-3-small model and stored them in Chroma. When a user asks a question, I embed the question too. Then the system searches Chroma for chunks that are close to that question.
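Before anything is embedded, the filtering and chunking from Steps 1 and 2 might look roughly like the sketch below. It is an illustration under stated assumptions, not the author's code: the post uses LangChain text splitters, while this version uses a plain overlapping line window to stay dependency-free. The size cutoff, the lock-file names, the chunk sizes, and all function names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical rules modeled on the lists in Step 1.
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml"}
KEEP_SUFFIXES = {".md", ".py", ".ts", ".tsx", ".js", ".jsx", ".json"}
MAX_BYTES = 100_000  # assumed cutoff to drop very large JSON and generated files


def should_index(path: str, size: int) -> bool:
    """Decide whether a repo file should enter the indexing pipeline."""
    p = PurePosixPath(path)
    if any(part in SKIP_DIRS for part in p.parts):
        return False
    if p.name in SKIP_NAMES or size > MAX_BYTES:
        return False
    return p.name.lower().startswith("readme") or p.suffix.lower() in KEEP_SUFFIXES


@dataclass
class Chunk:
    path: str    # kept as metadata so retrieval results can be traced to a file
    index: int   # position of the chunk within its file
    text: str


def chunk_lines(path: str, source: str, size: int = 40, overlap: int = 10) -> list[Chunk]:
    """Split a file into overlapping windows of lines, tagging each with its path."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    lines = source.splitlines()
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        chunks.append(Chunk(path, len(chunks), "\n".join(window)))
        if start + size >= len(lines):
            break  # the last window already reached the end of the file
    return chunks


sample = "\n".join(f"line {i}" for i in range(100))
chunks = chunk_lines("apps/api/auth.py", sample)
print(len(chunks), chunks[1].text.splitlines()[0])
```

The overlap is one simple way to reduce the "function without its imports" problem: neighbouring chunks share some lines, so a definition near a boundary is less likely to be cut off from its surroundings.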
So if someone asks “Where is authentication handled?”, the system may find related chunks even if the exact word “authentication” is not used everywhere. It can still find related words like:

- auth
- login
- session
- token
- middleware
- jwt
- user

This is where RAG becomes useful. But retrieval is not magic. Sometimes it retrieves:

- the README instead of the actual code
- a frontend file when the backend file is more useful
- a config file because it has matching words
- a chunk that mentions the right term but does not answer the question

That was a good reminder: RAG is not just “put data in vector DB and ask questions.” Retrieval quality matters a lot.

Step 4: Asking questions

The Q&A flow starts when the user asks a question. I wanted answers to include sources because without sources, it is hard to trust the output. For developer tools, source references are not optional. If the AI says “Auth is handled in middleware”, I want to know: Which file? Which function? Where should I look? So the answer should include something like:

    Sources:
    - apps/api/middleware/auth.ts
    - apps/api/routes/users.ts

This makes the tool much more useful.

What broke or felt messy

This was the most useful part of the build.

1. Large repos are noisy

Small repos are easy. Large repos need better filtering. A real repo may contain:

- docs
- examples
- tests
- scripts
- generated files
- frontend
- backend
- infra

If everything is indexed equally, answers become messy. A better version should rank files based on importance. For example, these should probably matter more than random test snapshots:

- README.md
- package.json
- main entry files
- routes
- config files
- src/
- docs/

2. README is useful but not enough

README files are helpful for high-level questions. But if you ask “How does auth work?”, the README is usually not enough. You need code. This is where code-aware retrieval becomes important.

3. File paths matter a lot

At first, I treated chunks mostly as text. But for codebases, metadata is part of the answer.
A chunk from backend/routes/payment.ts is not just text. It tells you:

- This is backend code
- This is route-level logic
- This likely handles payment APIs

So the file path helps both retrieval and explanation.

4. The model needs strict instructions

If the model does not know something, it should say so. For example, “I could not find authentication logic in the indexed files.” is much better than “The app probably uses JWT authentication.” For a developer tool, guessing is dangerous. So the prompt rule was intentionally strict:

    Answer using only the provided repo context.
    If the answer is not present in the context, say that clearly.
    Always mention the source files used.
    Do not guess implementation details.

Building RepoChat made RAG feel much more real to me. Before building it, RAG sounded simple:

- embed docs
- retrieve docs
- ask LLM

After building it, I see it more like this:

- choose the right data
- clean the data
- chunk it properly
- store useful metadata
- retrieve the right chunks
- control the prompt
- show sources
- test bad answers

The retrieval part is only one piece. The developer experience around it matters just as much. A developer should not care about embeddings or vector DBs. They should only feel: “I understand this repo faster now.”

The first version is useful, but there are many things I would improve.

1. Better code parsing

Instead of splitting files only by text size, I want to split code by structure:

- functions
- classes
- exports
- API routes
- components

Tools like Tree-sitter could help with this.

2. Repo map

Before answering questions, the app could build a repo map:

- frontend
- backend
- API routes
- database
- auth
- config
- tests

This would help the model understand the project layout better.

3. Better source citations

Right now, file-level sources are useful. But line-level sources would be better. Example:

    apps/api/auth.ts:45-72

That would make answers easier to verify.

4. Evaluation questions

Every repo could have test questions like:

- How do I run this project?
- Where is auth handled?
- Where are API routes defined?
- What database does it use?

Then I can test whether RepoChat answers correctly. This is where evals become useful.

5. MCP integration

Later, RepoChat could expose repo search as an MCP tool. Then an agent could ask:

    search codebase "where is auth handled?"

and use RepoChat as a codebase understanding tool.

RepoChat started as a small demo, but it taught me a lot. The biggest lesson: RAG is only useful when the developer can trust the answer. For codebases, trust comes from:

- good retrieval
- useful chunks
- file metadata
- clear sources
- honest “I don’t know” answers

I still want to improve RepoChat, but even this first version made one thing clear: AI tools for developers should not try to replace understanding. They should help developers reach understanding faster. That is the part I find exciting.

GitHub: https://github.com/mahimathacker/repochat-ai

Live demo: https://youtu.be/kSgZSqH6iXk

I’m still improving it, especially around better code parsing, line-level citations, repo maps, evals, and MCP support.
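As a rough illustration of the "better code parsing" and "line-level citations" ideas, here is a minimal sketch that splits a Python file by top-level functions and classes and records a `path:start-end` citation for each. It deliberately uses Python's built-in `ast` module rather than Tree-sitter, so it only handles Python source; the function name `structural_chunks` and the sample file are hypothetical and not part of RepoChat.

```python
import ast


def structural_chunks(path: str, source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with a line-level citation."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                # lineno and end_lineno are 1-based and inclusive (Python 3.8+)
                "citation": f"{path}:{node.lineno}-{node.end_lineno}",
                "text": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''import os

def get_user(user_id):
    return {"id": user_id}

class AuthService:
    def login(self, name):
        return name
'''

for chunk in structural_chunks("apps/api/auth.py", SAMPLE):
    print(chunk["citation"], chunk["name"])
```

Splitting on syntax nodes keeps each function or class intact, and the recorded line range gives the answer something a reader can open and verify directly. A Tree-sitter version would follow the same shape while covering TypeScript and other languages.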