{"slug": "rag-for-codebases-is-harder-than-it-looks", "title": "RAG for Codebases Is Harder Than It Looks", "summary": "A developer built RepoChat, an AI tool that uses retrieval-augmented generation (RAG) to answer questions about GitHub repositories. The tool indexes codebases by filtering relevant files, chunking code with metadata, and storing embeddings in Chroma, then retrieves context to generate answers via Anthropic’s Claude LLM. The developer found that effective RAG for codebases requires careful file selection and chunking strategies, as standard document-based approaches fail with code.", "body_md": "Building RepoChat, an AI tool that explains GitHub repos\n\nI built a small AI tool called RepoChat.\n\nThe idea is simple:\n\nPaste a GitHub repo. Ask questions. Get answers from the codebase.\n\nSomething like:\n\nWhat does this project do?\n\nHow is the backend structured?\n\nWhere is authentication handled?\n\nHow do I run this locally?\n\nWhich files should I read first?\n\nI did not want to build another chatbot. 
I wanted to build something useful for developers.\n\nBecause every developer has faced this problem:\n\nYou open a new repo.\n\nThere are 50 files.\n\nThe README is either missing, outdated, or too high-level.\n\nYou don’t know where to start.\n\nSo I wanted to see if RAG could help make codebase onboarding faster.\n\nRepoChat takes a GitHub repository and turns it into something you can ask questions about.\n\nThe first version is intentionally small.\n\nIt has two main flows.\n\nIndexing happens once for a repo:\n\nrepo URL → fetch files → filter files → chunk → create embeddings → store chunks\n\nQuerying happens every time someone asks a question:\n\nquestion → embed question → retrieve relevant chunks → call the LLM → answer with sources\n\nThis split helped me think about the system more clearly.\n\nThe indexing pipeline prepares the repo.\n\nThe query pipeline uses that indexed repo to answer questions.\n\nThe first version has a frontend and a backend.\n\nThe frontend has:\n\nrepo URL input\n\nquestion input\n\nanswer section\n\nsources section\n\nThe backend handles:\n\nfetch repo files\n\nfilter useful files\n\nchunk text/code\n\ncreate embeddings\n\nstore chunks\n\nretrieve relevant chunks\n\ncall the LLM\n\nI used this kind of structure:\n\n```\nrepochat-ai/\n  apps/\n    web/\n    api/\n  data/\n    chroma/\n  README.md\n```\n\nFrontend: Next.js, React, Tailwind\n\nBackend: FastAPI\n\nChunking: LangChain text splitters\n\nEmbeddings: OpenAI text-embedding-3-small\n\nVector DB: Chroma\n\nLLM: Anthropic Claude Sonnet 4.6\n\nRepo data: GitHub API\n\n**I used two providers for different jobs.**\n\nOpenAI handles embeddings because its embedding models are simple and reliable for this use case.\n\nClaude handles generation because I wanted stronger reasoning and clearer explanations for codebase questions.\n\nThis felt better than forcing one provider to do everything.\n\n**Step 1: Fetching the repo**\n\nThe first problem was getting the right files.\n\nAt first, I thought:\n\nJust fetch the repo and send everything to the AI.\n\nThat sounds simple, but it breaks quickly.\n\nRepos contain a lot of files that are not useful for understanding the 
project:\n\n```\nnode_modules/\n.git/\ndist/\nbuild/\nlock files\ngenerated files\nimages\nlarge JSON files\n```\n\nSo I had to filter files.\n\nFor the first version, I focused on files like:\n\n`README / readme files`\n\n.md\n\n.py\n\n.ts\n\n.tsx\n\n.js\n\n.jsx\n\n.json\n\nThis already made the answers better.\n\nOne thing I learned here:\n\nGood RAG starts before embeddings. It starts with choosing what data should enter the pipeline.\n\nIf you put garbage into the vector database, retrieval will return garbage too.\n\n**Step 2: Chunking code is not the same as chunking docs**\n\nThis was the first part that felt harder than expected.\n\nMost RAG tutorials use normal text:\n\nparagraph\n\nparagraph\n\nparagraph\n\nBut code is different.\n\nA code file has:\n\n```\nimports\nfunctions\nclasses\ncomments\nconfig\nrepeated names\nsmall pieces that only make sense together\n```\n\nIf chunks are too small, the model loses context.\n\nIf chunks are too large, retrieval becomes noisy.\n\nExample:\n\n```\nfunction getUser() {\n  ...\n}\n```\n\nThis function alone may not be enough.\n\nThe useful context may include:\n\n```js\nimport { db } from \"./db\"\nimport { users } from \"./schema\"\n\nfunction getUser() {\n  ...\n}\n```\n\nSo I had to think more carefully about chunk size and metadata.\n\nFor every chunk, I kept metadata such as the file path.\n\nThe file path is very important.\n\nA chunk from:\n\n`apps/api/auth.py`\n\nmeans something different from:\n\n`apps/web/components/Login.tsx`\n\neven if both mention “user” or “auth”.\n\nThe file extension is not stored separately, but it is still available through the path.\n\n**Step 3: Embeddings and retrieval**\n\nOnce the files were chunked, I created embeddings using OpenAI’s text-embedding-3-small model and stored them in Chroma.\n\nWhen a user asks a question, I embed the question too.\n\nThen the system searches Chroma for chunks that are close to that question.\n\nSo if someone asks:\n\n**Where is authentication 
handled?**\n\nthe system may find related chunks even if the exact word “authentication” does not appear in them.\n\nIt can still match chunks that use related words like:\n\nauth\n\nlogin\n\nsession\n\ntoken\n\nmiddleware\n\njwt\n\nuser\n\nThis is where RAG becomes useful.\n\nBut retrieval is not magic.\n\nSometimes it retrieves:\n\nthe README instead of the actual code\n\na frontend file when the backend file is more useful\n\na config file because it has matching words\n\na chunk that mentions the right term but does not answer the question\n\nThat was a good reminder:\n\n**RAG is not just “put data in a vector DB and ask questions.” Retrieval quality matters a lot.**\n\n**Step 4: Asking questions**\n\nThe Q&A flow looks like this:\n\nThe user asks a question, the system retrieves the closest chunks from Chroma, and Claude answers using only those chunks.\n\nI wanted answers to include sources because without sources, it is hard to trust the output.\n\nFor developer tools, source references are not optional.\n\nIf the AI says:\n\nAuth is handled in middleware.\n\nI want to know:\n\nWhich file?\n\nWhich function?\n\nWhere should I look?\n\nSo the answer should include something like:\n\nSources:\n\n`- apps/api/middleware/auth.ts`\n\n`- apps/api/routes/users.ts`\n\nThis makes the tool much more useful.\n\n**What broke or felt messy**\n\nThis was the most useful part of the build.\n\n**1. Large repos are noisy**\n\nSmall repos are easy.\n\nLarge repos need better filtering.\n\nA real repo may contain:\n\ndocs\n\nexamples\n\ntests\n\nscripts\n\ngenerated files\n\nfrontend\n\nbackend\n\ninfra\n\nIf everything is indexed equally, answers become messy.\n\nA better version should rank files based on importance.\n\nFor example:\n\n```\nREADME.md\npackage.json\nmain entry files\nroutes\nconfig files\nsrc/\ndocs/\n```\n\nshould probably matter more than random test snapshots.\n\n**2. 
README is useful but not enough**\n\nREADME files are helpful for high-level questions.\n\nBut if you ask:\n\nHow does auth work?\n\nthe README is usually not enough.\n\nYou need code.\n\nThis is where code-aware retrieval becomes important.\n\n**3. File paths matter a lot**\n\nAt first, I treated chunks mostly as text.\n\nBut for codebases, metadata is part of the answer.\n\nA chunk from:\n\n`backend/routes/payment.ts`\n\nis not just text.\n\nIt tells you:\n\nThis is backend code\n\nThis is route-level logic\n\nThis likely handles payment APIs\n\nSo the file path helps both retrieval and explanation.\n\n**4. The model needs strict instructions**\n\nIf the model does not know something, it should say so.\n\nFor example:\n\nI could not find authentication logic in the indexed files.\n\nis much better than:\n\nThe app probably uses JWT authentication.\n\nFor a developer tool, guessing is dangerous.\n\nSo the prompt rules were intentionally strict:\n\nAnswer using only the provided repo context.\n\nIf the answer is not present in the context, say that clearly.\n\nAlways mention the source files used.\n\nDo not guess implementation details.\n\nBuilding RepoChat made RAG feel much more real to me.\n\nBefore building it, RAG sounded simple:\n\nembed docs\n\nretrieve docs\n\nask LLM\n\nAfter building it, I see it more like this:\n\nchoose the right data\n\nclean the data\n\nchunk it properly\n\nstore useful metadata\n\nretrieve the right chunks\n\ncontrol the prompt\n\nshow sources\n\ntest bad answers\n\nThe retrieval part is only one piece.\n\nThe developer experience around it matters just as much.\n\nA developer should not have to care about embeddings or vector DBs.\n\nThey should only feel:\n\n“I understand this repo faster now.”\n\nThe first version is useful, but there are many things I would improve.\n\n**1. Better code parsing**\n\nInstead of splitting files only by text size, I want to split code by structure:\n\nfunctions\n\nclasses\n\nexports\n\nAPI 
routes\n\ncomponents\n\nTools like Tree-sitter could help with this.\n\n**2. Repo map**\n\nBefore answering questions, the app could build a repo map:\n\nfrontend\n\nbackend\n\nAPI routes\n\ndatabase\n\nauth\n\nconfig\n\ntests\n\nThis would help the model understand the project layout better.\n\n**3. Better source citations**\n\nRight now, sources are file-level, which is useful.\n\nBut line-level sources would be better.\n\nExample:\n\n`apps/api/auth.ts:45-72`\n\nThat would make answers easier to verify.\n\n**4. Evaluation questions**\n\nEvery repo could have test questions like:\n\nHow do I run this project?\n\nWhere is auth handled?\n\nWhere are API routes defined?\n\nWhat database does it use?\n\nThen I could test whether RepoChat answers correctly.\n\nThis is where evals become useful.\n\n**5. MCP integration**\n\nLater, RepoChat could expose repo search as an MCP tool.\n\nThen an agent could ask:\n\nsearch_codebase(\"where is auth handled?\")\n\nand use RepoChat as a codebase understanding tool.\n\nRepoChat started as a small demo, but it taught me a lot.\n\nThe biggest lesson:\n\nRAG is only useful when the developer can trust the answer.\n\nFor codebases, trust comes from:\n\ngood retrieval\n\nuseful chunks\n\nfile metadata\n\nclear sources\n\nhonest “I don’t know” answers\n\nI still want to improve RepoChat, but even this first version made one thing clear:\n\nAI tools for developers should not try to replace understanding.\n\nThey should help developers reach understanding faster.\n\nThat is the part I find exciting.\n\nGitHub: [https://github.com/mahimathacker/repochat-ai](https://github.com/mahimathacker/repochat-ai)\n\nLive demo: [https://youtu.be/kSgZSqH6iXk](https://youtu.be/kSgZSqH6iXk)\n\nI’m still improving it, especially around better code parsing, line-level citations, repo maps, evals, and MCP support.", "url": "https://wpnews.pro/news/rag-for-codebases-is-harder-than-it-looks", "canonical_source": 
"https://dev.to/mahima_thacker/rag-for-codebases-is-harder-than-it-looks-1nhg", "published_at": "2026-05-27 13:20:36+00:00", "updated_at": "2026-05-27 13:40:42.347423+00:00", "lang": "en", "topics": ["ai-tools", "large-language-models", "generative-ai", "ai-products", "natural-language-processing"], "entities": ["RepoChat", "GitHub", "RAG", "LLM", "Chroma"], "alternates": {"html": "https://wpnews.pro/news/rag-for-codebases-is-harder-than-it-looks", "markdown": "https://wpnews.pro/news/rag-for-codebases-is-harder-than-it-looks.md", "text": "https://wpnews.pro/news/rag-for-codebases-is-harder-than-it-looks.txt", "jsonld": "https://wpnews.pro/news/rag-for-codebases-is-harder-than-it-looks.jsonld"}}