Your AI Agent Isn’t Broken. Your Retrieval Is.

Unreliable AI agents are one of the most frustrating problems to debug — the agent ignores its tools, drifts off topic, or returns confidently wrong answers, and the root cause isn’t always obvious.

As an MLOps Data Engineer working on LLM systems, I’ve traced many of these failures back to the same source: poor knowledge retrieval. This series covers how to build reliable AI agents in Databricks, starting with the foundation that most production issues come back to: Retrieval-Augmented Generation (RAG).

Everyone already knows that the prompt is essential for guiding LLMs toward the answer or output we want. Common techniques for writing a good prompt include few-shot prompting (giving examples), persona adoption (telling the LLM its role and goal), and negative constraints (stating what the LLM shouldn’t do).

But prompting has boundaries that it can’t break through.

Industry leaders have increasingly shifted from talking about ‘prompt engineering’ to talking about ‘context engineering’, because beyond the prompt, we also need to supply more information (**context**) for LLMs to work well: for example, system instructions, conversation history, user constraints, and retrieved knowledge.

Of all the context components, retrieved knowledge requires the most careful engineering. Its quality directly determines how well your agent reasons, plans, and uses its tools. That is why RAG is the right place to start: get this wrong, and no amount of prompt tuning will compensate for it.

RAG is the architectural pattern, and Retrieval AI Agents are the implementations of such a pattern

Many modern AI agents implement some form of Retrieval-Augmented Generation (RAG), especially when they need access to external or proprietary knowledge (refer to the Appendix if you want to know why RAG is needed). For RAG to work well, we first need to make sure the knowledge we put into it is well processed.

A typical RAG implementation consists of 3 things.

Why do we need a vector database to store knowledge when we can just upload the entire PDF file that has all the knowledge we need the LLMs to understand? Then, we can query the LLMs with whatever questions we have.

Uploading the full document works well enough for one-off queries on small files, but it breaks down quickly for production use cases. Although modern LLMs can process large documents, repeatedly sending entire documents is expensive, slow, and often less effective than retrieving only the relevant sections. Not only will the LLM respond slowly and poorly, but this approach will also burn through your company’s budget for AI tokens.

Doing so also carries a risk of context rot, where answer quality degrades as the context grows longer.

This is where chunking becomes essential.

Chunking is the process of splitting a document (for example, a 500-page PDF) into smaller, manageable, yet semantically meaningful pieces of text (chunks). The purposes of this step are

What is the context window limit? It is the maximum number of tokens that an LLM can process in a single query. It includes both input tokens (system prompt + retrieved knowledge + chat history + latest user query) and output tokens. That is another reason why we can’t feed an entire 500-page PDF to an LLM: it might simply exceed the context window limit.

Context Window Limit = input tokens (system prompt + retrieved knowledge + chat history + latest user query) + LLM output tokens

For example, if the context window limit of an LLM is 100k tokens, we can’t just split the PDF into chunks of 100k tokens each, because we still need to reserve part of the context window for the system prompt, chat history, user query, and the LLM’s output. If the context window limit is exceeded, the application or model may truncate part of the context or reject the request entirely, depending on the implementation.

In most models, the combined input and output tokens must fit within the model’s context limits, though some providers also impose separate output-token limits.

This is the first point that requires careful engineering. You need to account for the token size of your system prompt, and estimate the average token size of the user query and of the LLM output (do your LLMs tend to reply with long messages or short answers?). From there, you can determine the ideal chunk size. The chunk size should fit within your available context budget while also balancing retrieval precision and contextual completeness.
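To make the budgeting concrete, here is a minimal sketch of the arithmetic. The function names, the 10% safety margin, and every token count below are illustrative assumptions, not measurements from a real system. The result is only an upper bound on chunk size; the best size for retrieval quality is usually smaller.

```python
def retrieval_budget(context_window, system_prompt, chat_history,
                     user_query, reserved_output, safety_margin_pct=10):
    """Tokens left for retrieved chunks once everything else is reserved."""
    # Hold back a margin because token estimates are never exact (assumed 10%).
    usable = context_window * (100 - safety_margin_pct) // 100
    fixed = system_prompt + chat_history + user_query + reserved_output
    return max(usable - fixed, 0)


def max_chunk_tokens(budget, chunks_per_query):
    """Upper bound on chunk size if we want to send this many chunks."""
    return budget // chunks_per_query


# Hypothetical numbers for a model with a 100k-token context window.
budget = retrieval_budget(
    context_window=100_000,
    system_prompt=1_500,
    chat_history=6_000,
    user_query=500,
    reserved_output=2_000,
)
chunk_cap = max_chunk_tokens(budget, chunks_per_query=5)
print(budget, chunk_cap)
```

With these assumed numbers, 80,000 tokens remain for retrieved knowledge, so five chunks could be at most 16,000 tokens each.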

During early prototyping, many teams prefer stronger models with larger context windows because they reduce engineering constraints and make experimentation easier. Only later, once things are stable, do teams try to optimise for cost and latency by evaluating smaller models.

There are generally 2 methods for chunking. The second method is increasingly popular because it preserves meaning better than the first, though many production systems still use recursive or hybrid approaches.

On top of these 2 methods, we can make the chunks preserve even more semantic integrity by applying one or more of these 3 techniques.

There are a lot of useful libraries out there that can help you achieve this easily. One of them is LangChain’s RecursiveCharacterTextSplitter. It handles chunking and overlap behind the scenes, making it easy to prepare text for embedding and search. (We will cover the technical implementation in the next article.)

What LangChain’s RecursiveCharacterTextSplitter actually does is hierarchical text splitting based on document structure and configurable separators. While it is not true semantic chunking, it generally preserves context better than simple fixed-size splitting.
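The idea of hierarchical splitting can be sketched in plain Python. This is not LangChain’s implementation, only an illustration of the principle under simplifying assumptions: sizes are measured in characters rather than tokens, there is no overlap between chunks, and the function name `recursive_split` and the sample text are made up for this example.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur, so try the next, finer one.
        return recursive_split(text, max_chars, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # A single part is still too big: split it with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = ("Intro paragraph.\n\n"
        "Second paragraph is a bit longer. It has two sentences.\n\n"
        "End.")
chunks = recursive_split(text, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Paragraph boundaries are respected where a paragraph fits in one chunk, and only the oversized paragraph is broken down further, here at word boundaries.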

After you have chunks of text, they are still not ready for LLMs to consume. An LLM is a mathematical model running on a computer, so it works with numbers, not raw text. The text must first be converted into tokens and numerical vectors, and embedding is the way to do so.

"The cat sits on the mat" → [0.12, -0.88, 0.34, 1.05, ...]

An embedding is a numerical representation of content, typically generated by a deep learning model. Because the model maps similar concepts close together in vector space, we can use similarity calculations to find related content. This allows semantically similar content (content on the same topic) to be located even when the wording is completely different.

More powerful multimodal embedding models can represent images, audio, and text within the same vector space, meaning similarity search works across content types. For RAG systems that need to retrieve from mixed-media knowledge bases (documents, diagrams, recorded audio), this is what makes cross-modal retrieval possible.

So, with an embedding model, we convert every chunk into an embedding and store these embeddings in a vector database.

The key technical decision when working with embeddings is selecting the right model for your use case. For example, if we are building a chatbot for a pharmaceutical use case, a pharmaceutical-focused embedding model is a natural candidate, since it may perform better on biomedical content than on, say, legal documents.

Selecting the right embedding model comes down to a few practical criteria:

One last point is that the embedding model must represent both source documents and user queries in the same vector space. This means you must use the same embedding model for both indexing your knowledge base and processing incoming queries at runtime. Using different models, even ones that are similarly capable, might produce vectors in incompatible spaces, where the distance between a query vector and a document vector becomes mathematically meaningless. The retriever would essentially be comparing measurements in different units, and your similarity scores would no longer reflect actual relevance.
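One way to guard against accidentally mixing models is to store the embedding model’s name alongside the index and check it at query time. The sketch below is a toy in-memory illustration; the `VectorIndex` class, its method names, and the model names `"model-a"` and `"model-b"` are hypothetical and do not correspond to any particular vector database’s API.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which embedding model built it."""
    model_name: str  # hypothetical identifier of the embedding model
    dimension: int
    vectors: list = field(default_factory=list)

    def _check(self, vector, model_name):
        # Matching dimensions alone do not prove compatibility,
        # so the model name is checked first.
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, "
                f"but this vector came from {model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(vector)}"
            )

    def add(self, vector, model_name):
        self._check(vector, model_name)
        self.vectors.append(vector)

    def validate_query(self, query_vector, model_name):
        self._check(query_vector, model_name)
        return query_vector


index = VectorIndex(model_name="model-a", dimension=3)
index.add([0.1, 0.2, 0.3], model_name="model-a")

try:
    # Same dimension, different model: the scores would be meaningless.
    index.validate_query([0.3, 0.1, 0.9], model_name="model-b")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)
```

The query vector here has the right number of dimensions, which is exactly why a dimension check alone would not catch the mistake.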

Our embedding model isn’t just used for processing the source documents. It also converts user queries into embeddings so they can be used for our main topic today: retrieval!

The retrieval that we talk about so much is actually vector search, which happens in a vector database. (A vector is a general mathematical object; an embedding is a specific type of vector used to represent meaning.) Most vector databases support these types of search.

For now, let us just focus on similarity search.

Similarity search works by calculating **cosine similarity between the user query embedding and the embeddings in the knowledge base.** It calculates the cosine of the angle between two vectors. A higher score means greater similarity.
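For reference, this is the standard definition of cosine similarity, written for a query embedding q and a chunk embedding d with n dimensions each. The symbols q, d, and n are notation introduced here for illustration.

```latex
\[
\cos(\theta)
  = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^{2}} \; \sqrt{\sum_{i=1}^{n} d_i^{2}}}
\]

% Worked example with q = (1, 0) and d = (1, 1):
\[
\cos(\theta) = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1} \cdot \sqrt{2}}
             = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

The two example vectors are 45 degrees apart, so they are fairly similar but not identical in direction. The score depends only on direction, not on vector length.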

Now imagine your vector database holds 1,000 chunks (already converted to embeddings). By calculating the cosine similarity between the user query and every chunk, you can pick out the chunks most relevant to the query: those with the highest cosine similarity. This process is what we call exact k-Nearest Neighbours (kNN) search. In an exact kNN search, the system identifies the k most similar vectors, often by comparing the query against all stored vectors.
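A brute-force version of this search fits in a few lines of plain Python. This is a minimal sketch for illustration; the two-dimensional vectors are made-up stand-ins for real embeddings, which typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_knn(query_vec, chunk_vecs, k):
    """Score the query against every stored vector and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), i)
              for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (illustrative values only).
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = exact_knn([1.0, 0.1], chunk_vecs, k=2)
print([index for _, index in top])
```

Every stored vector is scored on every query, which is why the cost grows linearly with the size of the knowledge base.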

But exact kNN does not scale. What if the vector database grows to millions of chunks? An exact kNN search then becomes time-consuming, which slows down the LLM response, and expensive. To improve performance, vector databases use Approximate Nearest Neighbour (ANN), which attempts to find vectors that are very close to the true nearest neighbours without exhaustively checking every vector.

ANN organizes vectors/embeddings into specialised index structures and navigates the vector space efficiently. Each search examines only a subset of the most promising vectors (in other words, it only calculates cosine similarity against a subset of chunks). This sacrifices some accuracy but yields a massive increase in performance, which makes the search practical at large scale.
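One family of ANN indexes groups vectors into buckets around a set of centroids and, at query time, scans only the buckets whose centroids are closest to the query (an inverted-file, or IVF, style index). The sketch below illustrates that idea only. It is a simplification under stated assumptions: centroids are sampled from the data rather than trained with k-means, the vectors are random, and the function names are invented for this example. Production vector databases use far more refined structures.

```python
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(vectors, n_lists, seed=0):
    """Assign every vector to the bucket of its most similar centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)  # simplification: no k-means
    buckets = [[] for _ in range(n_lists)]
    for i, vec in enumerate(vectors):
        nearest = max(range(n_lists),
                      key=lambda c: cosine_similarity(vec, centroids[c]))
        buckets[nearest].append(i)
    return centroids, buckets


def ann_search(query, vectors, centroids, buckets, k, n_probe):
    """Scan only the n_probe buckets whose centroids best match the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: cosine_similarity(query, centroids[c]),
                   reverse=True)
    candidates = [i for c in order[:n_probe] for i in buckets[c]]
    scored = sorted(((cosine_similarity(query, vectors[i]), i)
                     for i in candidates), reverse=True)
    return scored[:k], len(candidates)


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [rng.gauss(0, 1) for _ in range(8)]

centroids, buckets = build_index(vectors, n_lists=8)
approx, scanned = ann_search(query, vectors, centroids, buckets, k=3, n_probe=2)
print(f"scanned {scanned} of {len(vectors)} vectors")
```

The `n_probe` parameter is the accuracy and speed dial: probing more buckets scans more vectors and gets closer to the exact result, and probing all of them reduces to exact kNN.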

However, here’s a crucial insight: similarity does not always equal relevance. A chunk that sits close to the query in vector space may still not be the most relevant piece of information for answering it, and might be factually irrelevant or contextually inappropriate.

Example:

Query: “How do I reset my password?”

Vector search may return:

“Enable two-factor authentication for account security.”

“Use strong passwords with special characters.”

These are related to passwords and accounts, so they get high similarity scores, but they do not explain how to reset a password.

A reranker then re-evaluates the results based on actual relevance to the query and pushes the real answer (e.g., “Click ‘Forgot Password’ on the login page…”) to the top.

Vector search asks: "Which chunks look similar?"

Reranking asks: "Which chunks are most useful for answering this specific query?"

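To improve the chance that retrieved chunks are both semantically similar and relevant, we can add a reranking step after vector search. The process flow becomes: embed the user query, run vector search to get candidate chunks, rerank those candidates, then pass the top-ranked chunks to the LLM.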

Another common scenario for reranking is when you need to limit the number of chunks due to token constraints or processing costs. Reranking lets you filter out weakly relevant and noisy chunks.
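The retrieve-then-rerank step can be sketched as a function that takes the candidates from vector search, rescores them against the query, and keeps only the top few. In the sketch below, the scoring function is a deliberately crude word-overlap stand-in so the example runs without any model. A real reranker would typically be a dedicated relevance model or an LLM. The candidate texts reuse the password example, and the wording of the third chunk is a hypothetical completion written for this illustration.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_relevance(query, chunk):
    """Stand-in for a real reranker: share of query words found in the chunk."""
    query_words = tokenize(query)
    if not query_words:
        return 0.0
    return len(query_words & tokenize(chunk)) / len(query_words)


def rerank(query, candidates, score_fn, top_n):
    """Rescore the vector-search candidates and keep only the top_n."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]


query = "How do I reset my password?"

# Candidates in the order a vector search might return them (hypothetical).
candidates = [
    "Enable two-factor authentication for account security.",
    "Use strong passwords with special characters.",
    "Click 'Forgot Password' on the login page to reset your password.",
]

best = rerank(query, candidates, toy_relevance, top_n=1)
print(best[0])
```

Because `top_n` caps how many chunks survive, the same step doubles as the token-budget control described above.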

The benefit of reranking is that it can significantly improve the accuracy and quality of the retrieved chunks provided to the language model. This often reduces **LLM hallucinations** and improves **overall response quality**.

However, it also comes with trade-offs: it increases both latency and cost in your retrieval pipeline. The reranker must process the user query and the candidate chunks in real time, adding computational overhead.

In some modern systems, we also see separate LLMs used as rerankers to judge, filter and sort the relevance of retrieved chunks.

In the beginning, I mentioned that there are 3 components in a typical RAG implementation. After going through this article, I should update it by adding 5 more items.

Each stage in the pipeline feeds the next. Chunking determines the granularity and context preserved in each piece of knowledge. Embeddings then represent those chunks in a vector space where similarity can be measured. The retriever uses those embeddings to surface relevant chunks for a given query, and the reranker filters that initial set down to what is genuinely most useful. A weakness at any earlier stage compounds. Poorly chunked text produces misleading embeddings, which in turn causes the retriever to surface the wrong chunks, leaving the reranker with nothing useful to promote.

We will cover document cleaning & ingestion in the next article, together with the Databricks implementation walkthrough. Since document ingestion is more about preparing and processing data, it makes sense to discuss it alongside the hands-on implementation rather than in this article, which focuses on RAG concepts.

Hope that my article has helped you gain a better understanding of RAG. Stay tuned for the next one!

Your AI Agent Isn’t Broken. Your Retrieval Is. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: pub.towardsai.net (original article)