# How to Chat With Your Codebase Locally and Privately, No Code Leaves Your Machine

> Source: <https://pub.towardsai.net/how-to-chat-with-your-codebase-locally-and-privately-no-code-leaves-your-machine-05da7b3dd965?source=rss----98111c9905da---4>
> Published: 2026-06-26 17:01:06+00:00

*The AI coding tools everyone uses are reading your proprietary code on someone else’s servers, and on a large codebase they still hallucinate functions and conventions they have no way of knowing. You can build a local assistant that actually understands your repository, answers questions about it in plain language, and never sends a single line off your machine. Here is why it is worth doing, the one detail that makes a code assistant good instead of useless, and exactly how to set it up.*

If you work on a real codebase, you have probably had this experience with an AI coding tool. You ask it about a function, and it confidently describes behavior that does not exist, invents an API your project never used, or ignores a convention your team has followed for years. It’s not being stupid. It simply has no reliable knowledge of your specific code, so when its context runs out, it fills the gap with a plausible guess. On a large repository this happens constantly, because the whole codebase doesn’t fit in the model’s context window, and the tool is working from fragments.

There is a second problem sitting underneath the first, and for a lot of developers it’s the bigger one. When you use a hosted AI coding assistant, your code is being sent to a company’s servers to be processed. For a personal project that might not bother you. For proprietary code, a client’s repository, anything under an NDA, or work in a regulated field like finance, defense, or healthcare, it’s a real problem, sometimes a disqualifying one, and it’s why a lot of teams simply can’t use these tools on their most sensitive code.

Both problems have the same fix. You can build a local assistant that indexes your entire codebase, answers questions grounded in your actual code rather than guesses, and runs entirely on your own machine, so not a single line is ever transmitted anywhere. It’s free after the hardware you already own, it works offline, and once it’s set up you can ask it things like where a function is used, what a module does, or how a pattern flows through the project, and get answers drawn from your real code. Here is why it works, the detail that makes or breaks it, and how to build it.

The reason a plain language model is unreliable on your codebase is simple. It was trained on a vast amount of public code, so it knows general patterns well, but it knows nothing about your specific repository, your internal conventions, your architecture, the function you wrote last week. Ask it about those and it is guessing from the general patterns it learned, which is exactly why it invents things.

The fix is the same technique that powers a private document assistant, retrieval-augmented generation (RAG), applied to code. Instead of relying on the model’s memory, you give it the relevant pieces of your actual codebase to read before it answers. Your code gets broken into pieces, each piece is converted into a numerical fingerprint (an embedding) that captures its meaning and is stored in a local index, and when you ask a question, the system finds the pieces most relevant to it, hands them to the model, and asks it to answer from those. The model stops guessing about your code and starts answering from it. That retrieval step, feeding it your real code instead of trusting its memory, is what turns a confident hallucinator into an assistant that actually knows your project.
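To make the "find the most relevant pieces" step concrete, here is a minimal sketch of ranking stored vectors against a question vector by cosine similarity. The three-dimensional vectors and the chunk names are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and a vector database would do this search for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (name, score) pairs whose vectors are most similar to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings", one per code chunk.
toy_index = {
    "auth/login.py::verify_password": [0.9, 0.1, 0.0],
    "billing/invoice.py::render_pdf": [0.0, 0.2, 0.9],
    "auth/session.py::refresh_token": [0.7, 0.6, 0.1],
}

# Stands in for the embedding of a question such as "how do we check passwords?"
query = [1.0, 0.2, 0.0]
results = top_k(query, toy_index, k=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

The two authentication chunks rank above the billing chunk because their vectors point in nearly the same direction as the query, and those are the pieces that would be handed to the model.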

Here is the part most tutorials skip, and it’s the single thing that separates a code assistant that works from one that is useless. How you break your code into pieces matters enormously, and the naive approach fails badly on code specifically.

Most generic retrieval setups chop text into fixed-size blocks, say every 500 tokens, regardless of what the text is. That’s fine for prose. It’s a disaster for code, because it will cut a function in half, split a class from its methods, and separate a piece of logic from the context that explains it. When the system later retrieves one of those mangled fragments, it hands the model half a function with no beginning, and the answer is garbage.
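A tiny demonstration of that failure, as a sketch: the splitter below cuts by character count (a stand-in for a token count, to keep the example dependency-free), and the `apply_discount` function is a made-up example.

```python
def fixed_size_chunks(text, size):
    """Split text into consecutive blocks of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


source = (
    "def apply_discount(price, rate):\n"
    "    if rate < 0 or rate > 1:\n"
    "        raise ValueError('rate must be between 0 and 1')\n"
    "    return price * (1 - rate)\n"
)

chunks = fixed_size_chunks(source, 60)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

The first chunk holds the signature but stops partway through the validation check, and the `return` statement lands in a later chunk with no signature attached. Whichever fragment is retrieved, the model sees an incomplete function.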

The fix is to chunk code along its natural boundaries. Instead of slicing by a fixed size, you split by function, by class, by module, so each piece you store is a complete, meaningful unit: a whole function with its signature, a full class, a coherent block. This is often called function-aware or structure-aware chunking, and it’s the highest-impact decision in the whole build. A code assistant with good chunking and a modest model will run circles around one with a powerful model and code sliced into arbitrary fragments. If you take one thing from this guide, it’s that the quality of your code assistant is decided more by how intelligently you split the code than by which model you run.
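For Python source, one way to sketch structure-aware chunking is with the standard library's `ast` module, emitting one chunk per top-level function or class. This is an illustration under assumptions, not a complete chunker: the chunk dictionary layout and the `path::name` identifier are choices made for this example, module-level code outside functions and classes is skipped, and other languages would need their own parser.

```python
import ast


def chunk_python_source(source, path="<memory>"):
    """Return one chunk per top-level function or class, each with its full source text."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "id": f"{path}::{node.name}",
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


sample = '''\
import math

def area(radius):
    """Area of a circle."""
    return math.pi * radius ** 2

class Shape:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"shape {self.name}"
'''

chunks = chunk_python_source(sample, path="geometry.py")
for chunk in chunks:
    print(chunk["id"], chunk["kind"], chunk["start_line"], chunk["end_line"])
```

Each chunk is a whole unit: `area` keeps its signature and body together, and `Shape` stays with both of its methods. For very large classes you might go one level deeper and chunk per method while keeping the class name in the identifier.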

There are two honest paths, and both keep your code entirely local. One needs almost no code, the other gives you full control.

The first way is to use an existing tool that already does codebase retrieval, and for most developers this is the right answer. The standout is Continue, the open-source assistant that has become the de facto replacement for cloud coding tools and plugs straight into VS Code, JetBrains, and other editors. It has a codebase feature built in, you point it at your project and ask questions referencing your whole repository, and you configure it to run on local models so nothing leaves your machine. You define which local model handles chat, which handles autocomplete, and which handles the embeddings that power the code search, all in one config file. Other tools like Open WebUI and AnythingLLM offer similar repository indexing behind a chat interface. If you want the capability without building it, install one of these, point it at your local models, and you’re working.

The second way is to build the pipeline yourself, which is the right choice when you want to control exactly how your code is chunked, indexed, and retrieved, or wire the assistant into your own tooling. The whole thing is a short pipeline you can write in a script: parse and chunk the codebase along function boundaries, embed each chunk with a local embedding model, store the vectors in a local database, and at query time retrieve the relevant chunks and hand them to a local model for the answer. The reason to do it yourself is the chunking, since building the pipeline lets you make that structure-aware splitting as smart as your codebase needs, which a packaged tool does not always expose.
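The embed, store, and retrieve stages of that pipeline could be sketched as follows. Several things here are assumptions rather than part of the original guide: the server address is Ollama's default local one, the embedding call uses Ollama's `/api/embeddings` HTTP endpoint, chunks are dictionaries with `id` and `text` keys, and a plain Python list stands in for the local vector database to keep the example dependency-free.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: Ollama's default local address


def ollama_embed(text, model="nomic-embed-text"):
    """Request an embedding vector for `text` from a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embedding"]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def build_index(chunks, embed=ollama_embed):
    """Embed every chunk once; the returned list stands in for a vector database."""
    return [{**chunk, "vector": embed(chunk["text"])} for chunk in chunks]


def retrieve(question, index, embed=ollama_embed, k=3):
    """Embed the question and return the k most similar indexed chunks."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(query_vector, entry["vector"]), reverse=True)
    return ranked[:k]
```

With an Ollama server running and `nomic-embed-text` pulled, `build_index(chunks)` followed by `retrieve("where is the password checked?", index)` would use the real embedding model. The `embed` parameter is injectable so the ranking logic can be exercised without a running server, and a real build would swap the list for a persistent store such as ChromaDB.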

Either way, you need one tool underneath and the right pair of models.

Here is the whole build as a sequence you can follow.

```
ollama pull nomic-embed-text          # turns your code into searchable vectors
ollama pull qwen2.5-coder:14b         # the code model that answers (pick by hardware)
```

3. Choose your code model by your hardware. On a light machine, a small coder model works for basic help. On a typical developer machine with 16 gigabytes of memory, a 14-billion-parameter code model is a solid balance. On a strong desktop with a 24-gigabyte graphics card, a 30- to 33-billion-parameter coder model gives the best quality and gets close to the cloud tools on pure coding tasks. The current generation of code-focused models is genuinely strong at this size.

```
ollama pull qwen2.5-coder:1.5b        # light, also good for fast autocomplete
ollama pull qwen3-coder:30b           # strong desktop, best local code quality
```

4. Pick your path. For the tool route, install Continue in your editor, open its config, set your chat model, your autocomplete model, and nomic-embed-text as the embedding model, all pointed at Ollama, then use its codebase feature to ask questions across your project. For the build-it route, install a vector database like ChromaDB and write the index-then-query pipeline, making sure to chunk along function and class boundaries rather than by fixed size.

5. Index your codebase once, then ask. Point the tool or script at your repository and let it build the index, which takes anywhere from seconds to a few minutes depending on size. Then ask in plain language: where is this function called, what does this service do, how does data flow through this module. The answers come back grounded in your actual code, with the relevant files in view.

That’s the whole process, and your code never leaves your machine at any step.
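For the build-it route, the final "ask" step comes down to putting the retrieved code in front of the model and telling it to answer from that code only. The sketch below is one way to do it, under assumptions: retrieved chunks are dictionaries with `id` and `text` keys, the prompt wording is illustrative, and the answer comes from Ollama's `/api/generate` HTTP endpoint on its default local address.

```python
import json
import urllib.request


def build_prompt(question, chunks):
    """Assemble a prompt that asks the model to answer only from the retrieved code."""
    sections = [f"### {chunk['id']}\n{chunk['text']}" for chunk in chunks]
    context = "\n\n".join(sections)
    return (
        "Answer the question using only the code excerpts below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def ask_local_model(prompt, model="qwen2.5-coder:14b", url="http://localhost:11434"):
    """Send the prompt to a local Ollama server and return the generated answer text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        f"{url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Hypothetical retrieved chunk, shown only to illustrate the prompt layout.
example_chunks = [{"id": "cart.py::checkout", "text": "def checkout(cart):\n    return apply_discount(cart.total, 0.1)"}]
example_prompt = build_prompt("Where is apply_discount called?", example_chunks)
print(example_prompt)
```

Labeling each excerpt with its file and function name lets the model cite where an answer came from, and the instruction to admit when the excerpts are insufficient is a cheap guard against the guessing described earlier. Calling `ask_local_model(example_prompt)` requires a running Ollama server with the named model pulled.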

This is genuinely useful, but it’s not a wholesale replacement for the best cloud tools yet, and you should know where the limits are.

Smaller local models, under roughly 13 billion parameters, hallucinate noticeably more on obscure APIs and edge cases, so for a private code assistant you want to run the largest code model your hardware comfortably holds. Local inference is also slower per token than a top cloud model, often several times slower, though there is no network round trip to wait on, so it can feel more responsive than the raw speed suggests. And for the hardest agentic work, such as large multi-file refactors and complex multi-step planning, the best cloud models still lead, which is why many developers run a hybrid: local for sensitive or proprietary repositories where privacy is non-negotiable, and cloud for everything else. One practical security note: keep the local services bound to your own machine and do not expose their ports to the network, since the code index has no authentication of its own.
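As a small guard for that security note, a script could check that a service address points at the loopback interface before using it. This is a sketch: `is_loopback_only` is a hypothetical helper written for this example, and reading the bind address from the `OLLAMA_HOST` environment variable with a default of `127.0.0.1:11434` reflects my understanding of how Ollama is configured, so verify it against your installation.

```python
import ipaddress
import os


def is_loopback_only(host_setting):
    """Return True if a host, host:port, or URL setting refers to a loopback address."""
    host = host_setting.strip()
    if "://" in host:                 # drop a scheme such as http://
        host = host.split("://", 1)[1]
    host = host.split("/", 1)[0]      # drop any trailing path
    if host.startswith("["):          # bracketed IPv6, e.g. [::1]:11434
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:        # hostname or IPv4 followed by a port
        host = host.split(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False                  # unknown hostnames are treated as not local


configured = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
if not is_loopback_only(configured):
    print(f"warning: {configured} may be reachable from other machines")
```

An address of `0.0.0.0` means "listen on every interface", which is exactly the exposure the note warns about, so the helper reports it as not loopback.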

None of that undercuts the core value. For understanding a codebase, locating functionality, onboarding to an unfamiliar project, and answering questions about your own code, a local assistant is genuinely capable, and it does it without sending your proprietary code to anyone. For the code you are not allowed to or do not want to upload, that’s not a compromise, it’s the only version that works at all.

Set this up and you get the thing the cloud tools promise, an assistant that understands your codebase, without the two problems that come with them, the hallucinations from a model that doesn’t actually know your code, and the exposure of sending your code to someone else’s servers. The local version fixes both at once, it answers from your real repository because you fed it your real repository, and it keeps everything on hardware you control.

The larger point is that this capability is no longer something only the big tools can offer. The pieces, a local model runner, a good embedding model, a vector store, and crucially the knowledge to chunk your code intelligently, are all free and available to you. Build it once, get the chunking right, and you have a private assistant that knows your codebase and tells no one about it. For anyone working on code that cannot leave the building, that’s not just convenient, it’s the difference between using AI assistance and being locked out of it entirely.

*If you build a local codebase assistant, drop a comment with the tool or stack you used, the models you settled on, and how you handled chunking, because the chunking strategy is where these setups live or die and the next builder will want to know what worked for you.*

[How to Chat With Your Codebase Locally and Privately, No Code Leaves Your Machine](https://pub.towardsai.net/how-to-chat-with-your-codebase-locally-and-privately-no-code-leaves-your-machine-05da7b3dd965) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.