The 5 RAG Architectures and Exactly When to Use Each One in Production

Part 6 of the LangGraph Mental Model series — an expansion of the RAG chapter, going broader and deeper across the retrieval landscape that production systems live in today.

Part 4 of this series introduced you to one specific RAG pattern: load documents, build a LlamaIndex VectorStoreIndex, wrap the QueryEngine as a @tool, and hand it to a LangGraph agent. That pattern works, and it works well for the problems it was designed to solve.

But the word “RAG” today covers a family of meaningfully different architectures, each built to solve a different class of problem. Using the wrong one is not just a performance issue. It is the difference between a system that works and one that quietly gives your users confident, wrong answers at scale.

This article maps the entire family. By the end of it, you will be able to look at any retrieval problem and know, without guessing, which architecture it calls for — and how to build it using LangGraph and LlamaIndex.

Here is what we will cover, in order from simplest to most complex: Naive RAG, Hybrid RAG, Graph RAG, Advanced RAG, and Agentic RAG.

One more thing to carry with you through this entire article: these five architectures are not competitors. They are layers you add progressively as your problem demands more. Most production systems combine at least two of them.

Every RAG system exists to answer one fundamental question: how do you give a language model access to knowledge it was never trained on, at the moment it needs it, in the right form for it to reason over?

Training data has a cutoff. The model knows nothing about your company’s internal documents, your product specifications, or anything written after it was frozen. Fine-tuning on that data is expensive and slow, and it produces a model that still cannot update when the documents change.

RAG sidesteps the entire problem. Rather than teaching the model new knowledge, you retrieve the relevant knowledge at query time and include it in the prompt as context. The model never needed to memorize it — it just needs to read it when it matters.

The five architectures in this article are five different answers to the question of how to retrieve well. Each answer is better suited to a different retrieval problem.

Naive RAG is the baseline. It is the architecture that Part 4 of this series taught, and it is the right architecture for a large class of real problems — internal policy bots, FAQ assistants, documentation search tools, onboarding helpers. Do not let the word “naive” mislead you. This is a well-understood, well-tested production pattern used at scale today.

The pipeline has five sequential steps, and they map directly to what LlamaIndex gives you out of the box:

Stages 1–4: Indexing time (run once)
Stage 5: Query time (run on every user question)

The single most important distinction in all of RAG is between indexing time and query time. You pay the cost of embedding your documents once, up front. At query time, you are only paying for one query embedding, one similarity search, and one LLM call.

When a user asks a question, their question is embedded into the same vector space as your document chunks. The chunks whose vectors sit geometrically closest to the question vector are the ones returned. The assumption is that semantic similarity in vector space corresponds to relevance in meaning. For clean, factual document corpora, this assumption holds surprisingly well.
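To make the indexing-time and query-time split concrete, here is a minimal sketch in plain Python. It is not the LlamaIndex implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the three policy sentences are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexing time (run once): embed every chunk and keep the vectors.
chunks = [
    "Employees accrue fifteen days of paid vacation per year.",
    "Expense reports must be submitted within thirty days.",
    "Remote work requires written approval from a manager.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question: str, top_k: int = 1) -> list:
    """Query time: embed the question and return the top_k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

top = retrieve("How many days of paid vacation do employees get?")
```

The retrieved chunks would then be placed in the prompt for the single LLM call. A real embedding model captures meaning rather than word overlap, but the shape of the computation is the same.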

Naive RAG makes one assumption that fails in two common situations.

The first is terminology mismatch. A user asks: “What’s the SLA for tier-1 clients?” The document says: “Gold-tier customers are guaranteed a 99.9% uptime commitment.” The words SLA, tier-1, and Gold-tier are semantically close but not identical. Vector similarity may not rank this chunk highly enough, and the answer gets missed.

The second is relational questions. A user asks: “Which of our products are affected if Supplier X goes offline?” Answering this requires traversing a chain of relationships across multiple documents. No single chunk answers it. Naive RAG returns chunks from different documents with no way to connect them.

These two failure modes are exactly what the next two architectures solve.

Hybrid RAG acknowledges a truth that practitioners discovered in production: semantic similarity and exact keyword match are complementary, not competing, signals of relevance. Neither one alone is sufficient.

Dense retrieval (vector search) is excellent at finding semantic equivalents — it will find the “Gold-tier uptime commitment” document when you ask about “SLA.” But it struggles when the query contains proper nouns, product codes, medical terms, legal citations, or any highly specific terminology that carries precise meaning in its exact form.

Sparse retrieval (BM25/keyword search) is the opposite. It is brilliant at exact term matching — it will always find “SKU-4829” if you search for “SKU-4829.” But it has no concept of semantic equivalence. It will not find “uptime guarantee” when you search for “SLA.”
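The exact-match behavior is easy to see in a small sketch. The function below implements one common BM25 variant in plain Python; the tokenizer, the default `k1` and `b` values, and the three sample documents are assumptions made for illustration, not what any particular library ships with.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep hyphens so identifiers like "sku-4829" stay as one token.
    return re.findall(r"[a-z0-9-]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a BM25-style formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Replacement part SKU-4829 ships within two days.",
    "Gold-tier customers are guaranteed a 99.9% uptime commitment.",
    "Part SKU-4830 is discontinued.",
]
sku_scores = bm25_scores("SKU-4829", docs)
sla_scores = bm25_scores("SLA", docs)
```

Only the first document scores above zero for "SKU-4829", and no document scores at all for "SLA", because the literal term never appears. That second result is the gap dense retrieval fills.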

Hybrid RAG runs both searches in parallel, then uses a reranker to produce a single, unified ranked list from the merged results.

The reranker is the critical piece here. Embedding models encode the query and each chunk independently and only compare the resulting vectors. A cross-encoder reranker instead reads the query and each candidate chunk together and produces a relevance score that reflects their relationship directly. It is slower than similarity search, but it is applied only to the already-reduced candidate set, which keeps latency manageable.
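Before a reranker sees anything, the two result lists have to be merged. One widely used way to do that is reciprocal rank fusion (RRF), the algorithm the reference card in this article associates with QueryFusionRetriever. The sketch below is a plain-Python illustration of the idea, not that class's code; the document ids are hypothetical and `k=60` is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of the dense and sparse retrievers for one query.
dense_ranking = ["doc_sla", "doc_pricing", "doc_sku"]
sparse_ranking = ["doc_sku", "doc_sla", "doc_returns"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that both retrievers rank well rise to the top, and RRF needs only ranks, so the incompatible score scales of BM25 and cosine similarity never have to be reconciled. The fused list is what you would hand to the cross-encoder.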

With hybrid retrieval, you will often want to index with smaller chunks than in Naive RAG. Smaller chunks make BM25 matching more precise because a high-frequency term in a small chunk is a stronger signal of relevance than the same term in a large chunk. A setting of 256 tokens with 30-token overlap is a reasonable starting point when BM25 is in the mix.
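As a rough illustration of what a 256-token chunk size with a 30-token overlap means, here is a sliding-window sketch. It treats a plain list of strings as the tokens, which is an assumption: real splitters count tokens with a model tokenizer and usually try to respect sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=30):
    """Split a token list into windows that share `overlap` tokens with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# A hypothetical 600-token document.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
sizes = [len(c) for c in chunks]
```

Each window starts 226 tokens after the previous one, so the last 30 tokens of one chunk reappear as the first 30 of the next. That repetition is what stops a sentence on a boundary from being lost to both chunks.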

Use Hybrid RAG any time your documents contain a mix of free-form prose and structured terminology. This covers nearly every serious enterprise use case: legal document review (where citation forms must match exactly), medical records (where drug names, dosage codes, and ICD codes are precise), financial analysis (ticker symbols, contract clause identifiers), and technical documentation (error codes, API method names, version numbers).

Graph RAG is a fundamentally different way of thinking about what a “document” is. In Naive and Hybrid RAG, a document is a blob of text, and retrieval finds the blobs whose text is most relevant to your query. In Graph RAG, a document is a set of entities and relationships, and retrieval follows a path through a network.

Consider this question: “Which of our enterprise clients would be affected if we deprecated the legacy authentication module?”

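A naive retrieval system would search for chunks that mention “enterprise clients” and “authentication module” together. It might find a few. But what you actually need is to traverse a chain of relationships: from the module, to whatever depends on it, and from there to the clients involved.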

No single document chunk contains that answer. The answer emerges from the structure of the knowledge graph. This is what Graph RAG is built for: multi-hop reasoning, relationship tracing, and questions whose answers require connecting facts that live in different parts of your corpus.

Rather than stopping at chunking and embedding, Graph RAG first runs an entity extraction pass over every chunk of every document. It identifies named entities (products, people, organizations, concepts) and the relationships between them (“depends on,” “is a client of,” “is authored by,” “was superseded by”). These become nodes and edges in a knowledge graph. The graph is then organized into communities of closely related entities using graph clustering algorithms, and each community gets a summary written by an LLM. At query time, the system searches community summaries and then traverses the graph to find relevant entities.
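The traversal half of that process can be sketched without any library. The triples below are hypothetical examples of what an extraction pass might produce for the deprecation question, and the sketch leaves out community detection and summaries entirely.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples from an extraction pass.
triples = [
    ("BillingService", "depends_on", "LegacyAuth"),
    ("ReportingService", "depends_on", "BillingService"),
    ("AcmeCorp", "uses", "ReportingService"),
    ("Globex", "uses", "BillingService"),
    ("Initech", "uses", "SearchService"),
]

# Reverse adjacency: for each node, the nodes that point at it.
dependents = {}
for subject, _relation, obj in triples:
    dependents.setdefault(obj, set()).add(subject)

def affected_by(node):
    """Breadth-first walk over reverse edges: everything that transitively relies on node."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for upstream in dependents.get(current, ()):
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return seen

clients = {subject for subject, relation, _ in triples if relation == "uses"}
affected_clients = sorted(affected_by("LegacyAuth") & clients)
```

AcmeCorp is only reachable through two hops (ReportingService, then BillingService), and no single triple mentions it alongside LegacyAuth. That is the kind of answer chunk-level similarity search cannot assemble.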

Graph RAG’s entity extraction phase runs an LLM call over every chunk in your corpus. For a large document set, this means thousands of LLM calls at indexing time. This is intentionally expensive and slow — you are paying a one-time indexing cost for a much richer data structure. Do not build a Graph RAG index on every startup. Always persist the graph and reload it, exactly as shown for Naive RAG in Part 4.

Graph RAG is the right architecture when your questions require following chains of relationships: compliance and risk analysis (“which processes are affected by regulation X”), supply chain intelligence (“what products depend on this supplier”), organizational knowledge (“who owns what, and how do those ownership chains connect”), and software dependency mapping (“what breaks if we remove module Y”).

Advanced RAG is not a single new technique. It is a structured set of improvements that sit on top of whatever base retrieval mechanism you are already using. Where Naive RAG trusts its first retrieval pass, Advanced RAG questions it, refines it, and validates it.

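There are three categories of improvement, and they slot into different parts of the pipeline: improving the query before retrieval (query rewriting and HyDE), decomposing multi-step questions into several retrievals, and checking the results after retrieval (reranking and CRAG).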

The query a user types is rarely the optimal search query. “What’s the deal with our returns policy for enterprise?” is a perfectly natural human question that will retrieve worse results than “enterprise customer return and refund policy procedures.” Query rewriting uses the LLM to transform the user’s natural language question into a better search query before hitting the index.

HyDE (Hypothetical Document Embeddings) takes a different approach: instead of searching with the question, it asks the LLM to generate a hypothetical document that would answer the question, then embeds that document and searches with it. The insight is that an answer-shaped text will sit closer in vector space to other answer-shaped texts than a question-shaped text will.
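The HyDE flow is short enough to sketch end to end. Everything here is a stand-in: `embed` is a toy bag-of-words vector rather than a real embedding model, `fake_llm` returns a hand-written draft in place of an actual LLM call, and the two chunks were chosen so the effect is visible.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_text, chunks, top_k=1):
    q = embed(query_text)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def hyde_search(question, chunks, generate, top_k=1):
    hypothetical = generate(question)          # 1. draft an answer-shaped text
    return search(hypothetical, chunks, top_k)  # 2. search with the draft, not the question

def fake_llm(question):
    # Hypothetical LLM output; a real draft may be factually wrong and still work.
    return "Gold-tier customers are guaranteed an uptime commitment under the service agreement."

chunks = [
    "Clients may open support tickets for tier-1 incidents.",
    "Gold-tier customers are guaranteed a 99.9% uptime commitment.",
]
question = "What's the SLA for tier-1 clients?"

direct_hit = search(question, chunks)
hyde_hit = hyde_search(question, chunks, fake_llm)
```

In this hand-built case the raw question lands on the support-ticket chunk because it shares surface words with it, while the drafted answer lands on the uptime chunk. The draft is only used for searching; the final answer is still generated from the retrieved text.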

Multi-step questions fail Naive RAG because they require multiple retrievals. Advanced RAG decomposes them first.

“Compare our refund policy for enterprise and retail customers, and summarize the key differences” is not one question. It is three: retrieve enterprise policy, retrieve retail policy, compare them. Query decomposition breaks this into sub-queries, retrieves against each, and merges the results before synthesis.
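The decompose, retrieve, and merge steps can be sketched as follows. The `decompose` function is a hard-coded stand-in for an LLM call, the retriever is a toy word-overlap search, and the refund periods in the documents are invented for the example.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical corpus; the refund periods are made up.
docs = [
    "Enterprise customers may request refunds within 60 days.",
    "Retail customers may request refunds within 14 days.",
    "Shipping takes three to five business days.",
]

def retrieve(sub_query):
    """Toy retriever: return the document sharing the most words with the sub-query."""
    q = tokens(sub_query)
    return max(docs, key=lambda d: len(q & tokens(d)))

def decompose(question):
    """Stand-in for an LLM that splits a question into independent sub-queries."""
    return ["enterprise refund policy", "retail refund policy"]

def gather_context(question):
    sub_queries = decompose(question)
    retrieved = [retrieve(sq) for sq in sub_queries]
    # Drop duplicates but keep order, then merge into one context block.
    return "\n".join(dict.fromkeys(retrieved))

context = gather_context(
    "Compare our refund policy for enterprise and retail customers."
)
```

The merged context now contains both policies, so the final comparison becomes a single synthesis call over evidence that one retrieval pass would probably not have returned together.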

Reranking was covered in Hybrid RAG above. The same cross-encoder reranker applies here as a post-retrieval step over whatever chunks were retrieved, even if you only used dense retrieval.

CRAG (Corrective RAG) is the most sophisticated post-retrieval technique. After retrieval, it runs a lightweight evaluation: are the retrieved chunks actually relevant to the question? If the evaluator judges them insufficient, CRAG falls back to an alternative source (web search, a broader index) rather than forcing the LLM to answer from poor context.
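The control flow behind that fallback can be sketched in a few lines. This is a simplified illustration rather than the published CRAG method: the grader here is a toy word-overlap score instead of a trained evaluator, and `primary_index`, `web_search`, and the 0.5 threshold are all hypothetical.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grade(question, chunk):
    """Toy relevance grader: share of question terms that appear in the chunk."""
    q = tokens(question)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def corrective_retrieve(question, primary, fallback, threshold=0.5):
    """Keep primary chunks that pass the grader; otherwise switch to the fallback source."""
    relevant = [c for c in primary(question) if grade(question, c) >= threshold]
    if relevant:
        return relevant, "primary"
    return fallback(question), "fallback"

def primary_index(question):
    # Stand-in for the main index; it only knows about expenses.
    return ["Expense reports must be submitted within thirty days."]

def web_search(question):
    # Stand-in for a broader source such as web search.
    return ["The parental leave policy grants sixteen weeks of paid leave."]

weak = corrective_retrieve("parental leave policy weeks", primary_index, web_search)
strong = corrective_retrieve("expense reports days", primary_index, web_search)
```

The first question finds nothing usable in the primary index and is routed to the fallback; the second is answered from the primary index without touching it. The LLM only ever sees context that passed the check.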

Do not implement all of these techniques at once. Start with the single intervention most likely to help your specific failure mode. Reranking is almost always the highest-value first addition. Query decomposition is second. HyDE is a good third step for conceptual or abstract corpora. Add complexity incrementally, and measure recall after each addition.

Agentic RAG is the architecture that this entire series has been building toward. It does not just improve on how you retrieve — it changes who makes the retrieval decisions.

In all four previous architectures, retrieval is a pipeline: a fixed, predetermined sequence of operations that runs the same way every time. In Agentic RAG, retrieval is a loop: an LLM agent that decides what to search for, evaluates what it found, decides whether to search again, and keeps going until it has enough to answer — or until it determines the answer is unanswerable.

This is exactly what LangGraph was designed to do. The agent node, the tool node, the conditional edge — the entire seven-module structure from Part 1 of this series is an Agentic RAG scaffold. What changes is the richness of the tool suite you give it and the sophistication of the routing logic you build around it.

The real power of Agentic RAG in LangGraph is that you can give the agent access to every retrieval strategy discussed in this article simultaneously. The agent’s LLM decides which tool to use for each sub-question.

The critical difference between Agentic RAG and every other architecture in this article is self-correction. A pipeline cannot realize it retrieved the wrong thing. An agent can.

If the first retrieval returns weak results, the agent recognizes this in its next reasoning step and issues a different query with different search terms. If a question has an unexpected dependency, the agent discovers this mid-answer and makes an additional retrieval call to resolve it. If the question was ambiguous, the agent can ask for clarification before searching at all.
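Stripped of the framework, the self-correcting loop looks roughly like this. It is a plain-Python sketch, not LangGraph code: `plan` and `is_sufficient` stand in for the agent LLM's reasoning steps, the two tools are stubs, and `max_steps` is an assumed safety cap.

```python
def run_agent(question, tools, plan, is_sufficient, max_steps=4):
    """Retrieve in a loop until the evidence is judged sufficient or the step budget runs out."""
    evidence, trace = [], []
    for _ in range(max_steps):
        tool_name, query = plan(question, evidence, trace)  # decide what to search next
        evidence.extend(tools[tool_name](query))
        trace.append((tool_name, query))
        if is_sufficient(question, evidence):
            return {"status": "answered", "evidence": evidence, "trace": trace}
    return {"status": "unanswerable", "evidence": evidence, "trace": trace}

# Stub tools: the dense search comes back empty, the keyword search matches the code.
tools = {
    "dense": lambda q: [],
    "keyword": lambda q: ["SKU-4829 ships within two days."] if "SKU-4829" in q else [],
}

def plan(question, evidence, trace):
    """Stand-in for the LLM: try dense first, then retry with an exact keyword."""
    if not trace:
        return "dense", question
    return "keyword", "SKU-4829"

def is_sufficient(question, evidence):
    """Stand-in for the LLM's judgment that it has enough to answer."""
    return len(evidence) > 0

result = run_agent("When does SKU-4829 ship?", tools, plan, is_sufficient)
```

The trace records a failed dense search followed by a reformulated keyword search, which is the self-correction a fixed pipeline cannot perform. The step cap is what turns "keeps going" into a bounded cost and gives the agent a way to report that a question is unanswerable.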

This is the architecture to reach for when the cost of a wrong answer is high — compliance, legal, financial, medical — because you can add verification steps, confidence thresholds, and human-in-the-loop checkpoints from Part 2 of this series directly into the agent graph.

Agentic RAG is genuinely slower. As a rough guide, a single-tool pipeline might return in 200 to 500 milliseconds, while an agent that makes three retrieval calls before answering may take 8 to 12 seconds; the actual figures depend on the model, the index, and the infrastructure. For real-time user-facing interfaces, this is often too slow for the primary interaction path. Two production patterns address this: streaming intermediate steps to the user so they see progress rather than silence, and running agentic retrieval asynchronously to pre-fetch answers for anticipated follow-up questions.

Every retrieval problem has a right answer among these five. Here is how to find it.

One of the most important things to understand about this family is that the architectures are composable. You do not pick one and discard the others. The most common production pattern is a stack.

In LangGraph, this stacking pattern translates directly to the tool list. An Agentic RAG agent with access to a Naive tool, a Hybrid tool, a Graph tool, and a Decomposed tool is exactly the five-architecture stack — the agent (Layer 5) selects from and orchestrates the others (Layers 1 to 4) on every turn.

An extension of the reference cards from Parts 1 through 5.

NAIVE RAG
SimpleDirectoryReader            Load files into Document objects
VectorStoreIndex.from_documents  Build the embed-and-store index
index.as_query_engine()          Full retrieve-and-answer pipeline
index.as_retriever()             Retrieve only (no answer generation)
Settings.chunk_size              Token size per Node
Settings.chunk_overlap           Token overlap between adjacent Nodes

HYBRID RAG
BM25Retriever                    Keyword-based sparse retriever
VectorIndexRetriever             Dense embedding retriever
QueryFusionRetriever             Merges multiple retrievers (RRF algorithm)
SentenceTransformerRerank        Cross-encoder reranker for post-retrieval

GRAPH RAG
PropertyGraphIndex               Builds a knowledge graph from documents
SimpleLLMPathExtractor           LLM-based entity and relation extraction
ImplicitPathExtractor            Heuristic-based entity extraction (fast)

ADVANCED RAG
SubQuestionQueryEngine           Decomposes complex queries into sub-queries
HyDEQueryTransform               Hypothetical Document Embedding transform
TransformQueryEngine             Wraps any engine with a query transform
node_postprocessors              Where rerankers and filters attach

AGENTIC RAG (LANGGRAPH LAYER)
@tool                            The bridge - every LlamaIndex engine becomes
                                 a LangGraph tool through this decorator
ToolNode                         Executes whatever tool the agent selects
bind_tools()                     Gives the agent LLM its tool registry
MemorySaver / SqliteSaver        Thread-level memory across turns (Part 2)
interrupt()                      Human approval checkpoint before retrieval (Part 3)

The five architectures in this article are not five ways to do the same thing. They are five answers to five different retrieval problems, and they sit in a clean progression from simple to sophisticated.

Naive RAG is fast, cheap, and right for most document Q&A problems. Hybrid RAG is the production default for anything with specialized terminology. Graph RAG is the answer when relationships matter more than individual documents. Advanced RAG is the pattern for when accuracy needs to go up and the problem is retrieval quality. Agentic RAG is the architecture for open-ended, high-stakes, autonomous reasoning tasks.

Combined, with LlamaIndex handling the data layer and LangGraph handling the orchestration layer, these five patterns cover the overwhelming majority of what a production AI application built on retrieval actually needs.

The seam between the two frameworks remains exactly what Part 4 taught: one @tool-decorated function. Everything else is a choice about what goes

Bessie Delight Kekeli — AI engineer. Writing about what actually works in production. Connect on LinkedIn: linkedin.com/in/delight-bessie

The 5 RAG Architectures and Exactly When to Use Each One in Production was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
