The 5 RAG Architectures and Exactly When to Use Each One in Production

A new guide maps five distinct RAG architectures for production systems, from naive RAG to advanced layered designs, explaining when to use each to avoid confident wrong answers at scale. The article provides implementation templates using LangGraph and LlamaIndex, emphasizing that most production systems combine multiple architectures.

Part 6 of the LangGraph Mental Model series: an expansion of the RAG chapter, going broader and deeper across the retrieval landscape that production systems live in today.

Part 4 of this series introduced you to one specific RAG pattern: load documents, build a LlamaIndex VectorStoreIndex, wrap the QueryEngine as a @tool, and hand it to a LangGraph agent. That pattern works, and it works well for the problems it was designed to solve. But the word “RAG” today covers a family of meaningfully different architectures, each built to solve a different class of problem. Using the wrong one is not just a performance issue. It is the difference between a system that works and one that quietly gives your users confident, wrong answers at scale.

This article maps the entire family. By the end of it, you will be able to look at any retrieval problem and know, without guessing, which architecture it calls for and how to build it using LangGraph and LlamaIndex. Here is what we will cover, in order from simplest to most complex: Naive RAG, Hybrid RAG, Graph RAG, Advanced RAG, and Agentic RAG.

One more thing to carry with you through this entire article: these five architectures are not competitors. They are layers you add progressively as your problem demands more. Most production systems combine at least two of them.

Every RAG system exists to answer one fundamental question: how do you give a language model access to knowledge it was never trained on, at the moment it needs it, in the right form for it to reason over? Training data has a cutoff. The model has no knowledge of your company’s internal documents, your product specifications, or anything written after it was frozen. Fine-tuning on that data is expensive, slow, and produces a model that still cannot update when the documents change.

RAG sidesteps the entire problem. Rather than teaching the model new knowledge, you retrieve the relevant knowledge at query time and include it in the prompt as context. The model never needs to memorize it; it just needs to read it when it matters. The five architectures in this article are five different answers to the question of how to retrieve well. Each answer is better suited to a different retrieval problem.

Naive RAG is the baseline. It is the architecture that Part 4 of this series taught, and it is the right architecture for a large class of real problems: internal policy bots, FAQ assistants, documentation search tools, onboarding helpers. Do not let the word “naive” mislead you. This is a well-understood, well-tested production pattern used at scale today.

The pipeline has five sequential steps, and they map directly to what LlamaIndex gives you out of the box. Stages 1 to 4 run once, at indexing time. Stage 5 runs at query time, on every user question.

The single most important distinction in all of RAG is between indexing time and query time. You pay the embedding cost for your documents once, up front. At query time, you are only paying for embedding the question, one similarity search, and one LLM call.

When a user asks a question, their question is embedded into the same vector space as your document chunks. The chunks whose vectors sit geometrically closest to the question vector are the ones returned. The assumption is that semantic similarity in vector space corresponds to relevance in meaning. For clean, factual document corpora, this assumption holds surprisingly well.
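To make the "geometrically closest" idea concrete, here is a minimal sketch of top-k retrieval by cosine similarity in plain Python. The three-dimensional vectors and chunk ids are made up for illustration; a real embedding model produces vectors with hundreds or thousands of dimensions, and a vector store would replace the linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical toy embeddings, invented for this example.
chunk_vecs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "uptime_commitment": [0.1, 0.9, 0.1],
    "office_hours": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.0]  # stand-in for an embedded user question

best = top_k_chunks(query_vec, chunk_vecs, k=2)
print(best)
```

The query vector points mostly along the second axis, so the chunk whose vector does the same ranks first. This is the whole retrieval step of Naive RAG; everything else in the pipeline is loading, chunking, and prompting.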
    # ============================================================
    # NAIVE RAG: COMPLETE TEMPLATE
    # ============================================================

    # ── MODULE 1: IMPORTS & CONFIGURATION ──
    import os
    from typing import Literal

    from langchain_openai import ChatOpenAI
    from langchain_core.messages import HumanMessage, SystemMessage
    from langchain_core.tools import tool
    from langgraph.graph import StateGraph, MessagesState, START, END
    from langgraph.prebuilt import ToolNode
    from langgraph.checkpoint.memory import MemorySaver
    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        VectorStoreIndex,
        StorageContext,
        load_index_from_storage,
    )
    from llama_index.llms.openai import OpenAI as LlamaOpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    # LangGraph's reasoning model
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # LlamaIndex's internal model, separate from the above.
    # These two model configs do NOT share state or conversation history.
    Settings.llm = LlamaOpenAI(model="gpt-4o-mini", temperature=0.1)
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50

    # ── MODULE 2: STATE ──
    class State(MessagesState):
        pass  # messages list inherited; extend if you need more

    # ── MODULE 3: KNOWLEDGE BASE (build once, persist, reload) ──
    PERSIST_DIR = "./storage/naive"
    if os.path.exists(PERSIST_DIR):
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        index = load_index_from_storage(storage_context)
    else:
        documents = SimpleDirectoryReader("./data").load_data()
        index = VectorStoreIndex.from_documents(documents, show_progress=True)
        index.storage_context.persist(persist_dir=PERSIST_DIR)

    query_engine = index.as_query_engine(similarity_top_k=3)

    # ── MODULE 3 (cont.): TOOL ──
    @tool
    def search_knowledge_base(query: str) -> str:
        """Search internal company documents for policies, product specs, and procedures.
        Use this for any question requiring domain-specific knowledge rather than
        general world knowledge."""
        response = query_engine.query(query)
        return str(response)

    tools = [search_knowledge_base]
    llm_with_tools = llm.bind_tools(tools)
    tool_node = ToolNode(tools)

    # ── MODULE 4: NODES ──
    def agent_node(state: State) -> dict:
        system_prompt = SystemMessage(content=(
            "You are a helpful assistant with access to an internal knowledge base. "
            "Use search_knowledge_base for company-specific questions. "
            "Answer general questions directly."
        ))
        response = llm_with_tools.invoke([system_prompt] + state["messages"])
        return {"messages": [response]}

    # ── MODULE 5: ROUTING ──
    def should_continue(state: State) -> Literal["tools", "__end__"]:
        last = state["messages"][-1]
        if hasattr(last, "tool_calls") and last.tool_calls:
            return "tools"
        return "__end__"

    # ── MODULE 6: GRAPH ASSEMBLY ──
    builder = StateGraph(State)
    builder.add_node("agent", agent_node)
    builder.add_node("tools", tool_node)
    builder.add_edge(START, "agent")
    builder.add_conditional_edges(
        "agent", should_continue, {"tools": "tools", "__end__": END}
    )
    builder.add_edge("tools", "agent")
    graph = builder.compile(checkpointer=MemorySaver())

Naive RAG makes one assumption that fails in two common situations.

The first is terminology mismatch. A user asks: “What’s the SLA for tier-1 clients?” The document says: “Gold-tier customers are guaranteed a 99.9% uptime commitment.” The words SLA, tier-1, and Gold-tier are semantically close but not identical. Vector similarity may not rank this chunk highly enough, and the answer gets missed.

The second is relational questions. A user asks: “Which of our products are affected if Supplier X goes offline?” Answering this requires traversing a chain of relationships across multiple documents. No single chunk answers it. Naive RAG returns chunks from different documents with no way to connect them.
These two failure modes are exactly what the next two architectures solve.

Hybrid RAG acknowledges a truth that practitioners discovered in production: semantic similarity and exact keyword match are complementary, not competing, signals of relevance. Neither one alone is sufficient.

Dense retrieval (vector search) is excellent at finding semantic equivalents: it can find the “Gold-tier uptime commitment” document when you ask about “SLA.” But it struggles when the query contains proper nouns, product codes, medical terms, legal citations, or any highly specific terminology that carries precise meaning in its exact form.

Sparse retrieval (BM25/keyword search) is the opposite. It is brilliant at exact term matching: it will always find “SKU-4829” if you search for “SKU-4829.” But it has no concept of semantic equivalence. It will not find “uptime guarantee” when you search for “SLA.”

Hybrid RAG runs both searches in parallel, then merges and reranks the results into a single, unified ranked list. Merging and reranking are distinct steps: rank fusion combines the two lists using rank positions alone, while a cross-encoder reranker rescores each candidate against the query. The reranker is the critical piece for precision. Unlike embedding models, which encode the query and each chunk separately, a cross-encoder reranker reads the query and each candidate chunk together and produces a relevance score that reflects their relationship directly. It is slower than similarity search, but it is applied only to the already-reduced candidate set, keeping latency manageable.
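The Hybrid RAG template in this section merges its two result lists with reciprocal rank fusion (RRF). As a rough illustration of what that merge does, here is a standalone sketch of the RRF formula in plain Python. It is not LlamaIndex's internal code; the document ids are hypothetical, and `k=60` is the smoothing constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first lists of ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so only rank positions matter and raw scores never need calibrating.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a dense and a sparse retriever for the same query.
dense_hits = ["gold_tier_uptime", "support_hours", "sku_4829_spec"]
sparse_hits = ["sku_4829_spec", "gold_tier_uptime", "pricing_table"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, while a document found by only one retriever still survives further down the list. That is why fusion works without comparing BM25 scores to cosine similarities, which live on different scales.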
    # ============================================================
    # HYBRID RAG: LLAMAINDEX IMPLEMENTATION
    # pip install llama-index-retrievers-bm25 rank-bm25
    # ============================================================
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
    from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
    from llama_index.retrievers.bm25 import BM25Retriever
    from llama_index.core.query_engine import RetrieverQueryEngine
    from langchain_core.tools import tool

    # Build document store and index as before
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # ── DENSE RETRIEVER (vector similarity) ──
    vector_retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=5,  # Fetch more candidates pre-rerank
    )

    # ── SPARSE RETRIEVER (BM25 keyword match) ──
    # BM25Retriever builds directly from the index's nodes
    bm25_retriever = BM25Retriever.from_defaults(
        index=index,
        similarity_top_k=5,
    )

    # ── FUSION: merge both retrievers with Reciprocal Rank Fusion ──
    # QueryFusionRetriever is LlamaIndex's built-in hybrid combiner.
    # mode="reciprocal_rerank" implements the RRF algorithm: it combines
    # ranked lists without needing score calibration.
    hybrid_retriever = QueryFusionRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        similarity_top_k=3,        # Final top-k after fusion
        num_queries=1,             # No query expansion in this mode
        mode="reciprocal_rerank",  # The fusion algorithm
        use_async=True,            # Run both retrievers in parallel
    )

    # Build a QueryEngine on top of the hybrid retriever
    hybrid_query_engine = RetrieverQueryEngine.from_args(
        retriever=hybrid_retriever,
    )

    # ── LANGGRAPH TOOL ──
    @tool
    def search_hybrid(query: str) -> str:
        """Search internal knowledge base using both semantic similarity and
        keyword matching. More precise than pure vector search, especially for
        technical terms, product codes, and exact names."""
        response = hybrid_query_engine.query(query)
        return str(response)

The LangGraph graph structure is identical to Naive RAG. You are only swapping the tool. Everything from Module 4 onward stays exactly the same.

With hybrid retrieval, you will often want to index with smaller chunks than in Naive RAG. Smaller chunks make BM25 matching more precise because a high-frequency term in a small chunk is a stronger signal of relevance than the same term in a large chunk. A setting of 256 tokens with 30-token overlap is a reasonable starting point when BM25 is in the mix.

Use Hybrid RAG any time your documents contain a mix of free-form prose and structured terminology. This covers nearly every serious enterprise use case: legal document review (where citation forms must match exactly), medical records (where drug names, dosage codes, and ICD codes are precise), financial analysis (ticker symbols, contract clause identifiers), and technical documentation (error codes, API method names, version numbers).

Graph RAG is a fundamentally different way of thinking about what a “document” is. In Naive and Hybrid RAG, a document is a blob of text, and retrieval finds the blobs whose text is most relevant to your query. In Graph RAG, a document is a set of entities and relationships, and retrieval follows a path through a network.

Consider this question: “Which of our enterprise clients would be affected if we deprecated the legacy authentication module?” A naive retrieval system would search for chunks that mention “enterprise clients” and “authentication module” together. It might find a few. But what you actually need is to traverse a chain of relationships. No single document chunk contains that answer. The answer emerges from the structure of the knowledge graph.
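To illustrate why this kind of question needs traversal rather than text matching, here is a small sketch of multi-hop reasoning over (subject, relation, object) triples using a breadth-first search. Every entity and relation name below is hypothetical and invented for the example; a real Graph RAG index would extract such triples from your documents.

```python
from collections import deque

# Hypothetical triples, as an extraction pass might produce them.
TRIPLES = [
    ("legacy_auth", "used_by", "billing_service"),
    ("legacy_auth", "used_by", "admin_portal"),
    ("billing_service", "used_by", "client_acme"),
    ("admin_portal", "used_by", "client_globex"),
    ("new_auth", "used_by", "mobile_app"),
]


def build_adjacency(triples):
    """Index triples by subject so outgoing edges can be followed quickly."""
    adjacency = {}
    for subject, relation, obj in triples:
        adjacency.setdefault(subject, []).append((relation, obj))
    return adjacency


def reachable_paths(adjacency, start, max_hops=3):
    """Breadth-first search returning every path of 1..max_hops edges from start."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # this path already uses the hop budget
        for _relation, neighbor in adjacency.get(path[-1], []):
            if neighbor in path:
                continue  # skip cycles
            new_path = path + [neighbor]
            paths.append(new_path)
            queue.append(new_path)
    return paths


adjacency = build_adjacency(TRIPLES)
paths = reachable_paths(adjacency, "legacy_auth")
affected_clients = sorted({p[-1] for p in paths if p[-1].startswith("client_")})
print(affected_clients)
```

No single triple mentions both the module and a client, just as no single chunk would. The answer only appears after following two hops, which is exactly the work a graph-aware retriever does on your behalf.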
This is what Graph RAG is built for: multi-hop reasoning, relationship tracing, and questions whose answers require connecting facts that live in different parts of your corpus.

Instead of chunking documents and embedding the chunks, Graph RAG runs an entity extraction pass over all documents first. It identifies named entities (products, people, organizations, concepts) and the relationships between them (“depends on,” “is a client of,” “is authored by,” “was superseded by”). These become nodes and edges in a knowledge graph. The graph is then organized into communities of closely related entities using graph clustering algorithms, and each community gets a summary written by an LLM. At query time, the system searches community summaries and then traverses the graph to find relevant entities.

    # ============================================================
    # GRAPH RAG: LLAMAINDEX PROPERTY GRAPH INDEX
    # pip install llama-index-core
    # ============================================================
    from llama_index.core import PropertyGraphIndex, Settings, SimpleDirectoryReader
    from llama_index.core.indices.property_graph import (
        ImplicitPathExtractor,
        SimpleLLMPathExtractor,
    )
    from llama_index.core.query_engine import RetrieverQueryEngine
    from langchain_core.tools import tool

    # Load documents
    documents = SimpleDirectoryReader("./data").load_data()

    # ── BUILD THE PROPERTY GRAPH INDEX ──
    # LlamaIndex's PropertyGraphIndex handles the full extraction pipeline.
    # SimpleLLMPathExtractor uses an LLM to extract (subject, relation, object)
    # triples from each chunk; these become graph edges.
    # ImplicitPathExtractor uses fast heuristics (cheaper, less precise).
    index = PropertyGraphIndex.from_documents(
        documents,
        kg_extractors=[
            # LLM-based extraction: higher quality, higher cost
            SimpleLLMPathExtractor(llm=Settings.llm, max_paths_per_chunk=10),
            # Heuristic extraction: fast fallback
            ImplicitPathExtractor(),
        ],
        show_progress=True,
    )

    # ── CREATE A GRAPH-AWARE RETRIEVER ──
    # This retriever traverses the graph rather than searching flat vectors.
    # Note: retriever_mode is also the name of an option on the older
    # KnowledgeGraphIndex; check that your installed version accepts it here.
    kg_retriever = index.as_retriever(
        include_text=True,        # Include surrounding text with each entity
        retriever_mode="hybrid",  # Combines keyword + embedding on the graph
        similarity_top_k=3,
    )
    graph_query_engine = RetrieverQueryEngine.from_args(retriever=kg_retriever)

    # ── LANGGRAPH TOOL ──
    @tool
    def search_knowledge_graph(query: str) -> str:
        """Search the knowledge graph for questions involving relationships between
        entities: dependencies, organizational hierarchies, impact analysis, and
        multi-hop reasoning across connected information. Use this when the answer
        requires tracing a chain of relationships rather than finding a single
        relevant document."""
        response = graph_query_engine.query(query)
        return str(response)

    # ── COMBINING WITH NAIVE RAG: DUAL-TOOL AGENT ──
    # In practice, Graph RAG and Naive RAG are often combined.
    # The agent's LLM decides which tool fits the query.
    # query_engine, llm, and ToolNode come from the Naive RAG template.
    @tool
    def search_knowledge_base(query: str) -> str:
        """Search for factual information in company documents.
        Best for direct questions with answers in a single document."""
        response = query_engine.query(query)  # The naive query engine
        return str(response)

    # Give the LangGraph agent BOTH tools.
    # It will route to the right one based on the question type.
    tools = [search_knowledge_base, search_knowledge_graph]
    llm_with_tools = llm.bind_tools(tools)
    tool_node = ToolNode(tools)
    # Graph assembly remains identical; only the tool list changes.
Graph RAG’s entity extraction phase runs an LLM call over every chunk in your corpus. For a large document set, this means thousands of LLM calls at indexing time. This is intentionally expensive and slow: you are paying a one-time indexing cost for a much richer data structure. Do not build a Graph RAG index on every startup. Always persist the graph and reload it, exactly as shown for Naive RAG in Part 4.

Graph RAG is the right architecture when your questions require following chains of relationships: compliance and risk analysis (“which processes are affected by regulation X”), supply chain intelligence (“what products depend on this supplier”), organizational knowledge (“who owns what, and how do those ownership chains connect”), and software dependency mapping (“what breaks if we remove module Y”).

Advanced RAG is not a single new technique. It is a structured set of improvements that sit on top of whatever base retrieval mechanism you are already using. Where Naive RAG trusts its first retrieval pass, Advanced RAG questions it, refines it, and validates it. There are three categories of improvement, and they slot into different parts of the pipeline.

The query a user types is rarely the optimal search query. “What’s the deal with our returns policy for enterprise?” is a perfectly natural human question that will retrieve worse results than “enterprise customer return and refund policy procedures.” Query rewriting uses the LLM to transform the user’s natural language question into a better search query before hitting the index.

HyDE (Hypothetical Document Embedding) takes a different approach: instead of searching with the question, it asks the LLM to generate a hypothetical document that would answer the question, then embeds that document to search. The insight is that an answer-shaped text will sit closer in vector space to other answer-shaped texts than a question-shaped text will.

Multi-step questions fail Naive RAG because they require multiple retrievals.
Advanced RAG decomposes them first. “Compare our refund policy for enterprise and retail customers, and summarize the key differences” is not one question. It is three: retrieve enterprise policy, retrieve retail policy, compare them. Query decomposition breaks this into sub-queries, retrieves against each, and merges the results before synthesis.

Reranking was covered in Hybrid RAG above. The same cross-encoder reranker applies here as a post-retrieval step over whatever chunks were retrieved, even if you only used dense retrieval.

CRAG (Corrective RAG) is the most sophisticated post-retrieval technique. After retrieval, it runs a lightweight evaluation: are the retrieved chunks actually relevant to the question? If the evaluator judges them insufficient, CRAG falls back to an alternative source (web search, a broader index) rather than forcing the LLM to answer from poor context.

    # ============================================================
    # ADVANCED RAG: MULTI-TECHNIQUE IMPLEMENTATION
    # ============================================================
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
    from llama_index.core.indices.query.query_transform.base import HyDEQueryTransform
    from llama_index.core.postprocessor import SentenceTransformerRerank
    from llama_index.core.query_engine import (
        RetrieverQueryEngine,
        SubQuestionQueryEngine,  # Handles query decomposition
        TransformQueryEngine,
    )
    from llama_index.core.tools import QueryEngineTool, ToolMetadata
    from langchain_core.tools import tool

    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # ── RERANKER: post-retrieval cross-encoder scoring ──
    # Fetch 8 candidates, rerank down to 3.
    # This is the most impactful single improvement in Advanced RAG.
    reranker = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2",
        top_n=3,
    )

    # ── RETRIEVER WITH RERANKING ──
    base_retriever = index.as_retriever(similarity_top_k=8)
    reranked_engine = RetrieverQueryEngine.from_args(
        retriever=base_retriever,
        node_postprocessors=[reranker],  # Applied after retrieval
    )

    # ── QUERY DECOMPOSITION with SubQuestionQueryEngine ──
    # Wrap the base engine as a "tool" that the decomposer can call
    engine_tools = [
        QueryEngineTool(
            query_engine=reranked_engine,
            metadata=ToolMetadata(
                name="company_knowledge_base",
                description=(
                    "Searches company documents for policies, procedures, "
                    "and product information."
                ),
            ),
        ),
    ]

    # SubQuestionQueryEngine decomposes complex questions into sub-questions,
    # runs each against the available tools, and synthesizes a final answer
    # from all sub-answers.
    decomposed_engine = SubQuestionQueryEngine.from_defaults(
        query_engine_tools=engine_tools,
        use_async=True,  # Sub-queries run in parallel when possible
    )

    # ── HYDE: Hypothetical Document Embedding ──
    hyde_transform = HyDEQueryTransform(include_original=True)
    hyde_engine = TransformQueryEngine(
        query_engine=reranked_engine,
        query_transform=hyde_transform,
    )

    # ── LANGGRAPH TOOLS: different engines for different patterns ──
    @tool
    def search_with_decomposition(query: str) -> str:
        """Search for answers to complex questions that may require combining
        information from multiple documents. Automatically breaks the question into
        sub-questions and merges the results. Best for comparison questions,
        multi-part questions, and anything requiring synthesis across different
        topics."""
        response = decomposed_engine.query(query)
        return str(response)

    @tool
    def search_with_hyde(query: str) -> str:
        """Search the knowledge base using hypothetical document embedding.
        More effective than standard search for abstract or exploratory questions
        where the exact terminology in the answer differs from the terminology in
        the question."""
        response = hyde_engine.query(query)
        return str(response)

    # The LangGraph agent now has specialized retrieval tools, and its LLM
    # decides which retrieval strategy each question needs.
    tools = [search_with_decomposition, search_with_hyde]

Do not implement all of these techniques at once. Start with the single intervention most likely to help your specific failure mode. Reranking is almost always the highest-value first addition. Query decomposition is second. HyDE is a good third step for conceptual or abstract corpora. Add complexity incrementally, and measure recall after each addition.

Agentic RAG is the architecture that this entire series has been building toward. It does not just improve on how you retrieve; it changes who makes the retrieval decisions.

In all four previous architectures, retrieval is a pipeline: a fixed, predetermined sequence of operations that runs the same way every time. In Agentic RAG, retrieval is a loop: an LLM agent that decides what to search for, evaluates what it found, decides whether to search again, and keeps going until it has enough to answer, or until it determines the question is unanswerable.

This is exactly what LangGraph was designed to do. The agent node, the tool node, the conditional edge: the entire seven-module structure from Part 1 of this series is an Agentic RAG scaffold. What changes is the richness of the tool suite you give it and the sophistication of the routing logic you build around it.

The real power of Agentic RAG in LangGraph is that you can give the agent access to every retrieval strategy discussed in this article simultaneously. The agent’s LLM decides which tool to use for each sub-question.
    # ============================================================
    # AGENTIC RAG: COMPLETE MULTI-TOOL TEMPLATE
    # This is the full production scaffold: all five architectures
    # available to a single agent, which selects dynamically.
    # ============================================================

    # ── MODULE 1: IMPORTS & CONFIGURATION ──
    import os
    from typing import Literal

    from langchain_openai import ChatOpenAI
    from langchain_core.messages import HumanMessage, SystemMessage
    from langchain_core.tools import tool
    from langgraph.graph import StateGraph, MessagesState, START, END
    from langgraph.prebuilt import ToolNode
    from langgraph.checkpoint.memory import MemorySaver
    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        VectorStoreIndex,
        StorageContext,
        load_index_from_storage,
        PropertyGraphIndex,
    )
    from llama_index.core.indices.property_graph import SimpleLLMPathExtractor
    from llama_index.core.postprocessor import SentenceTransformerRerank
    from llama_index.core.query_engine import RetrieverQueryEngine, SubQuestionQueryEngine
    from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
    from llama_index.core.tools import QueryEngineTool, ToolMetadata
    from llama_index.llms.openai import OpenAI as LlamaOpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.retrievers.bm25 import BM25Retriever

    # The agent's main reasoning model (LangGraph)
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # LlamaIndex internal configuration
    Settings.llm = LlamaOpenAI(model="gpt-4o-mini", temperature=0.1)
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50

    # ── MODULE 2: STATE ──
    class AgentState(MessagesState):
        # Extend with retrieval metadata if you need observability.
        # TypedDict fields cannot carry defaults, so read this with .get(..., 0).
        retrieval_count: int  # Track how many retrieval calls were made

    # ── MODULE 3: KNOWLEDGE BASES (all built at startup) ──
    VECTOR_DIR = "./storage/vector"
    GRAPH_DIR = "./storage/graph"
    documents = SimpleDirectoryReader("./data").load_data()

    # Vector index (for Naive, Hybrid, Advanced)
    if os.path.exists(VECTOR_DIR):
        ctx = StorageContext.from_defaults(persist_dir=VECTOR_DIR)
        vector_index = load_index_from_storage(ctx)
    else:
        vector_index = VectorStoreIndex.from_documents(documents, show_progress=True)
        vector_index.storage_context.persist(persist_dir=VECTOR_DIR)

    # Property Graph index (for Graph RAG)
    if os.path.exists(GRAPH_DIR):
        ctx = StorageContext.from_defaults(persist_dir=GRAPH_DIR)
        graph_index = load_index_from_storage(ctx)
    else:
        graph_index = PropertyGraphIndex.from_documents(
            documents,
            kg_extractors=[SimpleLLMPathExtractor(llm=Settings.llm)],
            show_progress=True,
        )
        graph_index.storage_context.persist(persist_dir=GRAPH_DIR)

    # ── Set up individual engines ──
    reranker = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
    )

    # Tool 1: Naive / vector search
    vector_engine = vector_index.as_query_engine(
        similarity_top_k=3,
        node_postprocessors=[reranker],
    )

    # Tool 2: Hybrid (vector + BM25)
    hybrid_retriever = QueryFusionRetriever(
        retrievers=[
            VectorIndexRetriever(index=vector_index, similarity_top_k=5),
            BM25Retriever.from_defaults(index=vector_index, similarity_top_k=5),
        ],
        similarity_top_k=3,
        mode="reciprocal_rerank",
        use_async=True,
    )
    hybrid_engine = RetrieverQueryEngine.from_args(
        retriever=hybrid_retriever,
        node_postprocessors=[reranker],
    )

    # Tool 3: Graph traversal
    # (retriever_mode: verify that your installed version accepts it here)
    graph_engine = graph_index.as_query_engine(
        include_text=True,
        retriever_mode="hybrid",
        similarity_top_k=3,
    )

    # Tool 4: Decomposed (for complex multi-part questions)
    decomposed_engine = SubQuestionQueryEngine.from_defaults(
        query_engine_tools=[
            QueryEngineTool(
                query_engine=vector_engine,
                metadata=ToolMetadata(
                    name="docs",
                    description="Company documents and policies.",
                ),
            ),
        ],
        use_async=True,
    )

    # ── MODULE 3 (cont.): ALL TOOLS ──
    @tool
    def search_documents(query: str) -> str:
        """Search company documents by semantic meaning. Best for conceptual
        questions where the exact wording in the answer may differ from the
        question."""
        return str(vector_engine.query(query))

    @tool
    def search_exact_terms(query: str) -> str:
        """Search using both keyword and semantic matching. Best when the query
        contains specific terminology, product codes, names, or exact phrases
        that must appear in the result."""
        return str(hybrid_engine.query(query))

    @tool
    def search_relationships(query: str) -> str:
        """Search the knowledge graph for questions about how things connect:
        dependencies, impact chains, organizational links, and multi-hop reasoning.
        Use when the answer requires tracing a relationship across multiple
        entities."""
        return str(graph_engine.query(query))

    @tool
    def search_complex_question(query: str) -> str:
        """For multi-part questions requiring synthesis across several topics.
        Automatically decomposes the question into sub-queries, retrieves each
        independently, and combines the results."""
        return str(decomposed_engine.query(query))

    tools = [
        search_documents,
        search_exact_terms,
        search_relationships,
        search_complex_question,
    ]
    llm_with_tools = llm.bind_tools(tools)
    tool_node = ToolNode(tools)

    # ── MODULE 4: NODES ──
    def agent_node(state: AgentState) -> dict:
        system_prompt = SystemMessage(content=(
            "You are a precise research assistant with access to four retrieval tools:\n\n"
            "1. search_documents - semantic search over company documents\n"
            "2. search_exact_terms - hybrid semantic + keyword search\n"
            "3. search_relationships - graph traversal for relationship questions\n"
            "4. search_complex_question - decomposed retrieval for multi-part questions\n\n"
            "Think step by step. Use the tool that best fits the question type. "
            "You may call multiple tools if a question has multiple parts. "
            "Only answer when you have retrieved sufficient evidence."
        ))
        response = llm_with_tools.invoke([system_prompt] + state["messages"])
        return {
            "messages": [response],
            # Add the tool calls requested on this turn to the running total
            "retrieval_count": state.get("retrieval_count", 0) + len(response.tool_calls),
        }

    # ── MODULE 5: ROUTING ──
    def should_continue(state: AgentState) -> Literal["tools", "__end__"]:
        last = state["messages"][-1]
        if hasattr(last, "tool_calls") and last.tool_calls:
            return "tools"
        return "__end__"

    # ── MODULE 6: GRAPH ASSEMBLY ──
    builder = StateGraph(AgentState)
    builder.add_node("agent", agent_node)
    builder.add_node("tools", tool_node)
    builder.add_edge(START, "agent")
    builder.add_conditional_edges(
        "agent", should_continue, {"tools": "tools", "__end__": END}
    )
    builder.add_edge("tools", "agent")
    graph = builder.compile(checkpointer=MemorySaver())

    # ── MODULE 7: ENTRYPOINT ──
    if __name__ == "__main__":
        config = {"configurable": {"thread_id": "agentic-session-001"}}
        print("Agentic RAG ready. Ask anything.\n")
        while True:
            user_text = input("You: ").strip()
            if not user_text or user_text.lower() in ("exit", "quit"):
                break
            response = graph.invoke(
                {"messages": [HumanMessage(content=user_text)]},
                config=config,
            )
            print(f"\nAgent: {response['messages'][-1].content}\n")

The critical difference between Agentic RAG and every other architecture in this article is self-correction. A pipeline cannot realize it retrieved the wrong thing. An agent can.

If the first retrieval returns weak results, the agent recognizes this in its next reasoning step and issues a different query with different search terms. If a question has an unexpected dependency, the agent discovers this mid-answer and makes an additional retrieval call to resolve it. If the question was ambiguous, the agent can ask for clarification before searching at all.
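The self-correction behaviour described above can be pictured as a bounded retry loop. The sketch below is a simplified, framework-free illustration, not how LangGraph implements it: in a real agent the LLM itself plays the roles of the `is_sufficient` and `rewrite` callables, which here are hypothetical stand-ins, as are the tiny corpus and the `max_attempts` cap.

```python
from typing import Callable, List, Optional


def retrieve_with_self_correction(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite: Callable[[str, int], str],
    max_attempts: int = 3,
) -> Optional[List[str]]:
    """Retry retrieval with rewritten queries until the evidence passes a check.

    Returns the accepted chunks, or None when every attempt fails, so the
    caller can report the question as unanswerable instead of guessing.
    """
    query = question
    for attempt in range(1, max_attempts + 1):
        chunks = retrieve(query)
        if is_sufficient(question, chunks):
            return chunks
        query = rewrite(question, attempt)  # try different search terms
    return None


# Hypothetical stand-ins for the index and for the LLM's judgement.
CORPUS = {
    "gold-tier uptime commitment": [
        "Gold-tier customers are guaranteed a 99.9% uptime commitment."
    ],
}


def fake_retrieve(query: str) -> List[str]:
    return CORPUS.get(query, [])


def fake_is_sufficient(question: str, chunks: List[str]) -> bool:
    return len(chunks) > 0


def fake_rewrite(question: str, attempt: int) -> str:
    return "gold-tier uptime commitment"


result = retrieve_with_self_correction(
    "What's the SLA for tier-1 clients?",
    fake_retrieve,
    fake_is_sufficient,
    fake_rewrite,
)
print(result)
```

The first attempt finds nothing because the question's wording does not match the corpus, the rewrite supplies different terms, and the second attempt succeeds. The attempt cap matters in production: without it, an agent that never finds sufficient evidence would keep searching indefinitely.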
This is the architecture to reach for when the cost of a wrong answer is high (compliance, legal, financial, medical) because you can add verification steps, confidence thresholds, and human-in-the-loop checkpoints from Part 2 of this series directly into the agent graph.

Agentic RAG is genuinely slower. A single-tool pipeline runs in 200 to 500 milliseconds. An agent that makes three retrieval calls before answering may take 8 to 12 seconds. For real-time user-facing interfaces, this is often too slow for the primary interaction path. The two production patterns that resolve this are: streaming intermediate steps to the user so they see progress rather than silence, and running agentic retrieval asynchronously to pre-fetch answers for anticipated follow-up questions.

Every retrieval problem has a right answer among these five. Here is how to find it.

One of the most important things to understand about this family is that the architectures are composable. You do not pick one and discard the others. The most common production pattern is a stack.

In LangGraph, this stacking pattern translates directly to the tool list. An Agentic RAG agent with access to a Naive tool, a Hybrid tool, a Graph tool, and a Decomposed tool is exactly the five-architecture stack: the agent (Layer 5) selects from and orchestrates the others (Layers 1 to 4) on every turn.

An extension of the reference cards from Parts 1 through 5.
NAIVE RAG
SimpleDirectoryReader: Load files into Document objects
VectorStoreIndex.from_documents: Build the embed-and-store index
index.as_query_engine: Full retrieve-and-answer pipeline
index.as_retriever: Retrieve only (no answer generation)
Settings.chunk_size: Token size per Node
Settings.chunk_overlap: Token overlap between adjacent Nodes

HYBRID RAG
BM25Retriever: Keyword-based sparse retriever
VectorIndexRetriever: Dense embedding retriever
QueryFusionRetriever: Merges multiple retrievers (RRF algorithm)
SentenceTransformerRerank: Cross-encoder reranker for post-retrieval

GRAPH RAG
PropertyGraphIndex: Builds a knowledge graph from documents
SimpleLLMPathExtractor: LLM-based entity and relation extraction
ImplicitPathExtractor: Heuristic-based entity extraction (fast)

ADVANCED RAG
SubQuestionQueryEngine: Decomposes complex queries into sub-queries
HyDEQueryTransform: Hypothetical Document Embedding transform
TransformQueryEngine: Wraps any engine with a query transform
node_postprocessors: Where rerankers and filters attach

AGENTIC RAG (LANGGRAPH LAYER)
@tool: The bridge. Every LlamaIndex engine becomes a LangGraph tool through this decorator
ToolNode: Executes whatever tool the agent selects
bind_tools: Gives the agent LLM its tool registry
MemorySaver / SqliteSaver: Thread-level memory across turns (Part 2)
interrupt: Human approval checkpoint before retrieval (Part 3)

The five architectures in this article are not five ways to do the same thing. They are five answers to five different retrieval problems, and they sit in a clean progression from simple to sophisticated.

Naive RAG is fast, cheap, and right for most document Q&A problems. Hybrid RAG is the production default for anything with specialized terminology. Graph RAG is the answer when relationships matter more than individual documents. Advanced RAG is the pattern for when accuracy needs to go up and the problem is retrieval quality. Agentic RAG is the architecture for open-ended, high-stakes, autonomous reasoning tasks.
Combined, with LlamaIndex handling the data layer and LangGraph handling the orchestration layer, these five patterns cover the overwhelming majority of what a production AI application built on retrieval actually needs. The seam between the two frameworks remains exactly what Part 4 taught: one @tool-decorated function. Everything else is a choice about what goes

Bessie Delight Kekeli, AI engineer. Writing about what actually works in production. Connect on LinkedIn: linkedin.com/in/delight-bessie

The 5 RAG Architectures and Exactly When to Use Each One in Production (https://pub.towardsai.net/the-5-rag-architectures-and-exactly-when-to-use-each-one-in-production-d73c9acedbf7) was originally published in Towards AI (https://pub.towardsai.net) on Medium.