Your LLM Is Wrong. Your Codebase Is Why.


It happened on a Tuesday. I asked my AI coding assistant to explain a function I'd written three months earlier. It described a function that doesn't exist.

Not a total hallucination. The function did exist. Just not by that name, not with those parameters, not doing what the model confidently told me it was doing. The model had assembled a plausible story from vague signals and filled the gaps with fiction.

My first instinct was to blame the model. My second instinct, the one that actually helped, was to look at the code itself.

The model wasn't broken. My codebase was.

Technical debt is code that's hard to change. Comprehension debt is code that's hard to understand. Not just by future developers. By anything that has to read it cold: a new hire, a rubber duck, and increasingly, an AI assistant.

You've probably heard "write code as if the next maintainer is a serial killer who knows where you live." The LLM version is more forgiving. But not by much.

Comprehension debt shows up when the intent of your code isn't captured in your code. The logic works. The tests pass. But nothing in the source tells a reader why a function does what it does, what its constraints are, or what it absolutely should not do. That knowledge lives in someone's head, in a Slack thread from two months ago, or nowhere at all.

LLMs don't have access to the Slack thread. They only have your source.

When your AI assistant gets your own codebase wrong, it's not random. The errors cluster around specific failure modes, and each one points to a real gap.

1. It invents function names.

The model calls functions that don't exist, or calls existing functions by the wrong name. This usually means your naming is inconsistent or your barrel exports are incomplete. The model is pattern-matching across conventions that don't agree with each other.

2. It gets parameter types wrong.

It passes a string where you want a typed enum, or a plain object where you've defined a specific interface. This almost always means missing or implicit type annotations in your function signatures. The model is guessing.

3. It imports packages you don't use.

It reaches for lodash or axios when you already have utility wrappers around them. Your actual internal abstractions aren't legible to the model because they aren't documented anywhere they can be found. The model falls back to what it knows from training.

4. It uses patterns you've deprecated.

It calls the old version of your API, the one you stopped using eight months ago. Your codebase still contains those old patterns (maybe for backward compatibility, maybe just because cleanup hasn't happened yet) and the model doesn't know which version is current. Deprecation comments cost thirty seconds to write. Their absence costs you five minutes of confusion per assistant interaction.

5. It doesn't know the business rule.

It gives you the technically valid version of a function, not the version that accounts for the actual constraint. "This user lookup should always check the soft delete flag first" isn't written in a comment in any file. It was decided in a call. The model can't know what was never written down.
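Signals 2 and 5 can often be closed in the same few lines. Here is a minimal sketch, assuming a hypothetical in-memory user list. `UserRole`, `UserRecord`, and `findActiveUser` are names I made up for illustration; the point is only that the parameter type and the soft delete rule are both written where a cold reader will see them.

```typescript
// Hypothetical names throughout; an illustration, not a real API.
enum UserRole {
  Admin = "admin",
  Member = "member",
}

interface UserRecord {
  id: string;
  role: UserRole;
  deletedAt: Date | null; // soft delete flag: non-null means deleted
}

/**
 * Looks up a user by id and role.
 *
 * Business rule: always check the soft delete flag first.
 * A soft-deleted user must be treated as if it does not exist.
 */
function findActiveUser(
  users: ReadonlyArray<UserRecord>,
  id: string,
  role: UserRole, // typed enum, so a bare string is a compile error
): UserRecord | undefined {
  return users.find(
    (u) => u.deletedAt === null && u.id === id && u.role === role,
  );
}

const users: UserRecord[] = [
  { id: "u1", role: UserRole.Admin, deletedAt: null },
  { id: "u2", role: UserRole.Member, deletedAt: new Date("2024-01-01") },
];

const active = findActiveUser(users, "u1", UserRole.Admin);
const deleted = findActiveUser(users, "u2", UserRole.Member);
```

With the enum in the signature, the model has nothing to guess about the second argument, and the rule from the call now lives in the file that enforces it.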

Each of these errors is a free audit item. You didn't have to run a tool to find it. The model found it for you.

You don't need a formal process for this. You just need to treat your LLM's confusion as a signal instead of noise.

Pick a module. Any module that's been around for more than a few months. Feed it to your AI assistant and ask these questions:

Don't correct the model when it gets something wrong. Write down what it got wrong. That list is your comprehension debt register.

For a healthy module, the model will get most of this right. For a module with comprehension debt, you'll see the five signals show up fast. I ran this on an internal TypeScript service last quarter. Twelve exported functions. The model hallucinated the names of three of them, got the return type wrong on two others, and had no idea what the rate limit parameter was for. Counting only the names and return types, that's 5 of 12 functions, a wrong answer rate of roughly 42% on a module I thought was well maintained. It wasn't. It just worked.
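If you want the register to be more than a scratch file, a small typed structure is enough. This is a sketch under my own assumptions: the `DebtEntry` shape, the signal labels, and the placeholder function names are all hypothetical, and the tally counts three invented names plus two wrong return types out of twelve exports.

```typescript
// Hypothetical register shape; the labels mirror the five signals above.
type Signal =
  | "invented-name"
  | "wrong-type"
  | "unused-import"
  | "deprecated-pattern"
  | "missing-business-rule";

interface DebtEntry {
  exportName: string; // the exported function the model got wrong
  signal: Signal;
  note: string;
}

// Share of exported functions with at least one recorded error.
function wrongAnswerRate(entries: DebtEntry[], totalExports: number): number {
  if (totalExports <= 0) {
    throw new RangeError("totalExports must be positive");
  }
  // A function with several errors still counts once.
  const affected = new Set(entries.map((e) => e.exportName));
  return affected.size / totalExports;
}

// Placeholder names; the real register would use your own exports.
const register: DebtEntry[] = [
  { exportName: "fnA", signal: "invented-name", note: "model used a different name" },
  { exportName: "fnB", signal: "invented-name", note: "model used a different name" },
  { exportName: "fnC", signal: "invented-name", note: "model used a different name" },
  { exportName: "fnD", signal: "wrong-type", note: "wrong return type" },
  { exportName: "fnE", signal: "wrong-type", note: "wrong return type" },
];

const rate = wrongAnswerRate(register, 12);
const percent = (rate * 100).toFixed(1);
```

Counting per function rather than per error keeps the number comparable across modules of different sizes.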

Working and legible are not the same thing.

The instinct is to reach for RAG (chunk your codebase, embed it, retrieve relevant context before each LLM call). That helps. I cover the full approach in my production RAG guide if you want the implementation details.

But RAG retrieves your documentation. If your documentation is the code itself and the code is opaque, RAG gives the model better access to opaque code. The underlying problem doesn't change.

The actual fix is cheaper than you think:

Write the intent, not the implementation. A JSDoc comment that says "Validates and normalizes a user object. Always call this before persisting to the database. Does NOT check permissions." gives the model something to retrieve. A comment that says "validates user" does not.
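Here is roughly what that comment looks like attached to a function. The JSDoc text is the one quoted above; the `RawUser` and `NormalizedUser` shapes and the validation logic are assumptions I added so the sketch runs on its own.

```typescript
// Hypothetical shapes for illustration only.
interface RawUser {
  email: string;
  name: string;
}

interface NormalizedUser {
  email: string;
  name: string;
}

/**
 * Validates and normalizes a user object.
 * Always call this before persisting to the database.
 * Does NOT check permissions.
 *
 * @throws {Error} If the email does not contain "@".
 */
function normalizeUser(raw: RawUser): NormalizedUser {
  const email = raw.email.trim().toLowerCase();
  if (!email.includes("@")) {
    throw new Error(`Invalid email: ${raw.email}`);
  }
  return { email, name: raw.name.trim() };
}

const saved = normalizeUser({ email: "  Ada@Example.COM ", name: " Ada " });
```

The three sentences in the comment answer the questions a cold reader asks first: what it does, when to call it, and what it deliberately leaves out.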

Mark your deprecations inline. Writing "@deprecated Use getUserV2 instead" takes five seconds. It means the model stops confidently recommending the old API.
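A minimal sketch of that tag in place. Only the name `getUserV2` comes from the example above; `getUser`, the `User` shape, and the in-memory store are hypothetical stand-ins.

```typescript
// Hypothetical types and store, used only to make the sketch runnable.
interface User {
  id: string;
  name: string;
}

interface UserResult {
  user: User | null;
}

const store = new Map<string, User>([["u1", { id: "u1", name: "Ada" }]]);

/** Current lookup. Returns a result object instead of throwing on a miss. */
function getUserV2(id: string): UserResult {
  return { user: store.get(id) ?? null };
}

/**
 * @deprecated Use getUserV2 instead. Kept only for backward compatibility.
 */
function getUser(id: string): User {
  const { user } = getUserV2(id);
  if (user === null) {
    throw new Error(`User not found: ${id}`);
  }
  return user;
}

const found = getUserV2("u1");
const missing = getUserV2("nope");
```

Editors that understand the tag typically flag call sites of the old function, and the model gets an unambiguous pointer to the current one.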

Put your business rules in the file that enforces them. Not in the ticket. Not in Confluence. In the file. A comment above the rate limit parameter that says "this is hardcoded per the billing agreement with enterprise customers, do not make it configurable" is documentation that actually travels with the code.
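In practice that can be as small as a commented constant. The constant name, the value of 600, and the helper below are illustrative assumptions; only the wording of the rule comes from the example above.

```typescript
/**
 * Requests per minute allowed for an enterprise customer.
 *
 * This is hardcoded per the billing agreement with enterprise customers.
 * Do not make it configurable.
 */
const ENTERPRISE_RATE_LIMIT_PER_MINUTE = 600; // illustrative value only

// Hypothetical helper: true while the caller is still under the limit.
function isWithinRateLimit(requestsThisMinute: number): boolean {
  return requestsThisMinute < ENTERPRISE_RATE_LIMIT_PER_MINUTE;
}

const underLimit = isWithinRateLimit(599);
const atLimit = isWithinRateLimit(600);
```

Anyone, human or model, who is tempted to "clean this up" into a config option now hits the reason it should stay as it is.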

The goal isn't to write documentation for humans. It's to write documentation that your LLM assistant can parse so it can help you correctly. The secondary effect is that it also helps the next human on your team. That's free.

For teams working on larger AI agent systems, the memory and context patterns that help here are the same ones I break down in my post on AI agent memory management. Comprehension debt in your codebase and context gaps in your agents come from the same root cause: undocumented intent. You can also get a quick read on your current exposure with this LLM hallucination risk estimator. It won't diagnose specific debt, but it gives you a calibrated starting point for where to focus.

Your LLM assistant is, right now, the most honest reader your codebase has. It doesn't know the context you carry in your head. It doesn't remember the decision you made in 2024. It reads what's there and tries to make sense of it.

When it gets something wrong, that's signal. The model isn't failing. It's showing you exactly what a reader without your context has to work with.

That's a gift. Most code never gets that kind of external read until the next engineer joins and asks the same confused questions.

Use it. If you want this kind of thinking applied to your actual codebase or AI systems architecture, that is exactly the kind of work I take on.

If you want a deeper look at production AI systems, I cover it on mudassirkhan.me.

How wrong does your LLM get your own codebase? Drop a number in the comments. Curious what percentage of wrong answers people are seeing in production.

source & further reading

Original article: dev.to