RAG Without the Guesswork: A Standardized LangGraph + LlamaIndex Pattern.

A new standardized pattern combining LangGraph and LlamaIndex for Retrieval-Augmented Generation (RAG) eliminates guesswork by using LlamaIndex for data indexing and retrieval, and LangGraph for orchestration. The approach treats RAG as a capability added to an agent, not a new orchestration pattern, and provides a five-stage pipeline for building production-ready RAG systems.

Part 6 of the LangGraph Mental Model series: a focused detour into Retrieval-Augmented Generation, connected back to the canonical structure from Parts 1–4.

For other parts of the series: Part 0, Part 1, Part 2, Part 3, Part 4, Part 5.

What this article assumes: You're comfortable with the seven-module structure and know how to give a LangGraph agent a tool (Part 1, Module 3). That's genuinely all you need. If not, you can still follow along. We are deliberately rolling complexity back down to a single agent here: no supervisors, no subgraphs, no Send. RAG is a capability you add to an agent, not a new orchestration pattern, and this article treats it that way.

Every agent you've built so far in this series has one blind spot: it only knows what the LLM was trained on. Ask it about your company's internal pricing policy, last quarter's financial report, or a product manual you wrote last week, and it will either say "I don't know" or, worse, confidently make something up.

Retrieval-Augmented Generation (RAG) fixes this. Instead of relying purely on what the model memorized during training, RAG retrieves relevant snippets from your own documents at query time and feeds them to the LLM as context before it answers. The LLM is no longer guessing; it's reading.

You could build a RAG pipeline by hand inside LangGraph: chunk documents yourself, generate embeddings yourself, build a vector index yourself, write the retrieval logic yourself. People do this. It's a lot of plumbing for something that has become a solved problem.

LlamaIndex is a framework built specifically to solve that plumbing problem. It is one of the most mature and widely used tools for the "data" half of an AI application: loading, chunking, embedding, indexing, and retrieving. LangGraph, as you already know, is built for the "orchestration" half: state, control flow, multi-step reasoning.
This article shows you how to use each tool for what it's best at: LlamaIndex builds and serves the knowledge base. LangGraph wraps it as a tool and decides when to use it. This is a common pattern in production RAG systems today, and it's the pattern we'll build, step by step.

Before writing code, you need the same kind of mental model for LlamaIndex that you built for LangGraph in Part 1. Let's build it the same way: concept, then keywords, then template.

Think of LlamaIndex as a librarian for your LLM. Imagine you hired a brilliant research assistant (the LLM) who has read most of the internet but has never seen your company's files. You can't just hand them a 10,000-page binder and say "remember all of this": they don't have room to hold it all in their head at once. So instead, you hire a librarian (LlamaIndex). The librarian's job has four parts: read all your documents, organize them into small, searchable index cards (chunks), file those cards in a system that allows fast lookup by topic (a vector index), and, when your research assistant has a question, fetch only the 3–5 most relevant cards and hand them over. The research assistant never has to memorize the binder. They just ask the librarian, get a tiny handful of relevant snippets, and reason over those.

Every RAG system, regardless of framework, follows this same five-stage pipeline. LlamaIndex gives you a clean abstraction for each stage:

1. LOAD → Read raw files (PDF, Word, web pages, databases) into Documents
2. CHUNK → Split Documents into small, retrievable Nodes
3. EMBED → Convert each Node's text into a vector (a list of numbers representing meaning)
4. STORE → Save those vectors in a Vector Index for fast lookup
5. RETRIEVE → At query time, embed the user's question, find the most similar Nodes, and return them as context

Stages 1–4 happen once, when you build your knowledge base (this is called indexing). Stage 5 happens every time a user asks a question (this is called querying).
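To make the five stages concrete before any framework is involved, here is a minimal pure-Python sketch. It is not how LlamaIndex works internally: the bag-of-words `embed` function is a toy stand-in for a real embedding model, one sentence per chunk is a simplification, and `RAW_DOCS` and every function name are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory "files"; a real loader would read them from disk.
RAW_DOCS = {
    "refunds.txt": "Enterprise customers may request a refund within 60 days. "
                   "Refunds are issued to the original payment method.",
    "remote.txt": "Employees may work remotely up to three days per week. "
                  "Remote work requires manager approval.",
}

def load(raw):
    """Stage 1, LOAD: wrap each raw text in a document record with metadata."""
    return [{"source": name, "text": text} for name, text in raw.items()]

def chunk(documents):
    """Stage 2, CHUNK: split each document into one node per sentence."""
    nodes = []
    for doc in documents:
        for sentence in re.split(r"(?<=\.)\s+", doc["text"]):
            if sentence:
                nodes.append({"source": doc["source"], "text": sentence})
    return nodes

def embed(text):
    """Stage 3, EMBED: toy bag-of-words vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(raw):
    """Indexing time: stages 1-4, run once. Stage 4, STORE, is the returned list."""
    return [(embed(node["text"]), node) for node in chunk(load(raw))]

def retrieve(index, question, top_k=2):
    """Query time: stage 5, run on every question."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(query_vector, entry[0]), reverse=True)
    return [node for _, node in ranked[:top_k]]

index = build_index(RAW_DOCS)  # happens once
top_nodes = retrieve(index, "What is the refund policy for enterprise customers?")
for node in top_nodes:
    print(node["source"], "->", node["text"])
```

Note that `build_index` is called once while `retrieve` can be called as often as needed against the same index. That split is the part worth carrying over to the real framework.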
This distinction, indexing time vs. query time, is the single most important mental model in all of RAG. Keep it in your head as you read everything that follows.

Document: a raw piece of source content (one PDF, one web page, one database row) plus its metadata. This is the unprocessed, full-length unit of data.

Node: a chunk of a Document. LlamaIndex splits each Document into multiple smaller Node objects because LLMs and embedding models work better with focused, bite-sized text than with entire documents at once. Nodes carry metadata back to their parent document (so you can always trace a chunk back to its source).

SimpleDirectoryReader: the most common loader. Point it at a folder and it automatically reads PDFs, text files, Word docs, and more, turning each file into one or more Document objects.

VectorStoreIndex: the core indexing class. Takes a list of Document or Node objects, embeds them, and builds a searchable vector index in memory (or in an external vector database, if you configure one).

Settings: a global configuration object. Instead of passing your LLM and embedding model into every function call, you set them once on Settings and every LlamaIndex component uses them automatically.

QueryEngine: the object you actually "ask questions" to. Created by calling .as_query_engine() on an index. Internally it embeds the query, runs a similarity search, and feeds the retrieved context to the LLM to generate a final answer, all in one call.

Retriever: a lower-level object created by calling .as_retriever() on an index. Unlike a QueryEngine, a Retriever only does the search step: it returns the raw matching Node objects without generating an LLM answer. You'll use this when you want retrieval and generation to be separate, controllable steps (which is exactly what we want once we bring in LangGraph).

Let's build a working LlamaIndex pipeline on its own first, with no LangGraph yet.
This mirrors how Part 1 taught LangGraph's modules in isolation before assembling them.

    # Core package + OpenAI LLM and embedding integrations (the common starting setup)
    pip install llama-index-core llama-index-llms-openai llama-index-embeddings-openai

    # Readers for common file types (PDF, Word, etc.)
    pip install llama-index-readers-file pypdf

LlamaIndex is intentionally split into many small packages. llama-index-core has the framework logic. Everything else (which LLM, which embedding model, which vector database, which file reader) is its own package that you add only when you need it. This keeps your installs lean.

    # --- llamaindex_config.py ---
    import os
    from llama_index.core import Settings
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    # Settings is global: configure once, used everywhere in LlamaIndex
    Settings.llm = OpenAI(
        model="gpt-4o-mini",   # Used for generating final answers from retrieved context
        temperature=0.1,       # Low temperature: factual, not creative
    )
    Settings.embed_model = OpenAIEmbedding(
        model="text-embedding-3-small",  # Used to convert text into vectors
    )

    # Controls how documents are split into Nodes (chunks)
    Settings.chunk_size = 512     # Max tokens per chunk
    Settings.chunk_overlap = 50   # Overlap between consecutive chunks, to preserve context across boundaries

A quick but important note on chunk size: too large, and each chunk contains so much text that retrieval becomes imprecise (you fetch a chunk that's only 10% relevant). Too small, and you lose surrounding context that the LLM needs to understand a fact. 512 tokens with a 50-token overlap is a reasonable, widely used starting point; tune it based on your documents.
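The effect of chunk size and overlap is easier to see on a tiny example. The sketch below is a simplified sliding window over whitespace-separated words. It is an illustration of the idea only: real splitters count model tokens rather than words and try to respect sentence boundaries, and `chunk_with_overlap` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

# Words stand in for tokens here (assumption made to keep the example readable).
words = ("refunds are issued within sixty days of purchase "
         "for enterprise customers only").split()

chunks = chunk_with_overlap(words, chunk_size=5, chunk_overlap=2)
for piece in chunks:
    print(" ".join(piece))
```

Each chunk starts with the last two words of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk. Raising `chunk_overlap` buys more of that safety at the cost of more chunks to embed and store.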
    # --- build_knowledge_base.py ---
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    # --- LOAD: Read all files in a folder into Document objects ---
    documents = SimpleDirectoryReader("./data").load_data()
    print(f"Loaded {len(documents)} documents")

    # --- CHUNK + EMBED + STORE: all three happen inside this one call ---
    # VectorStoreIndex automatically:
    #   1. Splits each Document into Nodes (using Settings.chunk_size)
    #   2. Embeds each Node (using Settings.embed_model)
    #   3. Stores the vectors in an in-memory index
    index = VectorStoreIndex.from_documents(documents, show_progress=True)

    # --- RETRIEVE + GENERATE: ask a question ---
    query_engine = index.as_query_engine(
        similarity_top_k=3,  # Retrieve the 3 most relevant chunks for each query
    )
    response = query_engine.query("What is our refund policy for enterprise customers?")
    print(response)

That's a complete, working RAG system in about ten lines. This is genuinely the "hello world" of LlamaIndex, and it's a fair demonstration of why the framework exists: compare this to manually writing a chunker, an embedding loop, a similarity search function, and a prompt template yourself.

Building the index calls the embedding model for every chunk, which costs money and time. You don't want to rebuild it every time your app starts. Persist it to disk.

    # --- Save the index after building it ---
    index.storage_context.persist(persist_dir="./storage")

    # --- Load it back later without re-embedding anything ---
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)

storage_context.persist() writes the vector data, the node metadata, and the document store to a local folder. load_index_from_storage() reconstructs the same index from those files: no re-embedding, no API calls.
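The saving from build once, persist, reload can be demonstrated without any API key. The following is a toy sketch under stated assumptions: `fake_embed` is a hypothetical stand-in for a paid embedding call, the JSON file is a stand-in for the storage folder, and `build_or_load` is an illustrative helper, not part of LlamaIndex.

```python
import json
import os
import tempfile

embed_calls = 0  # counts how often the (pretend) paid embedding call runs

def fake_embed(text):
    """Hypothetical stand-in for an embedding API call; returns a tiny fake vector."""
    global embed_calls
    embed_calls += 1
    return [len(text), sum(map(ord, text)) % 97]

def build_or_load(chunks, path):
    """Reload a persisted index if one exists; otherwise embed, persist, and return it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)  # no embedding calls on this branch
    index = [{"text": chunk, "vector": fake_embed(chunk)} for chunk in chunks]
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)
    return index

chunks = ["Refunds within 60 days.", "Remote work needs approval."]
path = os.path.join(tempfile.mkdtemp(), "index.json")

first = build_or_load(chunks, path)   # first run: embeds every chunk
calls_after_build = embed_calls
second = build_or_load(chunks, path)  # second run: reads the file instead

print("embedding calls after build:", calls_after_build)
print("embedding calls after reload:", embed_calls)
```

The call counter does not move on the second run, which is the whole point: startup cost after the first build is a file read, not a round of embedding requests.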
Persisting and reloading the index is the standalone equivalent of the checkpointer pattern you learned in Part 2: build once, persist, reload cheaply.

The in-memory index above is fine for prototypes, but for production you typically want a dedicated vector database: something built to handle millions of vectors, concurrent reads, and metadata filtering at scale.

Vector DB: Chroma (https://www.trychroma.com/)

    # --- Using Chroma as a persistent, production-grade vector store ---
    # pip install llama-index-vector-stores-chroma chromadb
    import chromadb
    from llama_index.vector_stores.chroma import ChromaVectorStore
    from llama_index.core import StorageContext, VectorStoreIndex

    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    chroma_collection = chroma_client.get_or_create_collection("my_knowledge_base")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Build the index directly into Chroma
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

    # Later, in a different process, reconnect without re-indexing:
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

The pattern is identical to what you learned with checkpointers in Part 2: swap one object (MemorySaver → SqliteSaver, or here, in-memory → ChromaVectorStore) and everything downstream, including your query engine and your retrieval logic, stays exactly the same. The same swap pattern works for Pinecone, Qdrant, Weaviate, and most other vector databases LlamaIndex supports.

Now for the part this whole article has been building toward. We need to bridge two frameworks that speak different languages: LlamaIndex's QueryEngine is not a LangGraph tool. LangGraph's agent expects tools built with LangChain's @tool decorator.

The integration pattern is refreshingly simple once you see it: write a regular Python function, decorated with @tool, whose body calls query_engine.query(...). That's the entire bridge.
LangGraph doesn't need to know anything about LlamaIndex internals. It just sees a tool that takes a string and returns a string, exactly like every other tool from Part 1.

    # --- MODULE 3: TOOLS (LlamaIndex-backed) ---
    from langchain_core.tools import tool

    # The query_engine built in Part B, created once at startup.
    # In a real app, you'd load this from persisted storage, not rebuild it every time.

    @tool
    def search_knowledge_base(query: str) -> str:
        """Search the internal knowledge base for company policies, product
        documentation, and internal procedures. Use this whenever the user asks
        a question that might be answered by internal company documents rather
        than general knowledge.

        Args:
            query: A natural-language question to search for.

        Returns:
            A synthesized answer based on the most relevant retrieved documents.
        """
        response = query_engine.query(query)
        return str(response)

That's it. The docstring matters exactly as much as it did in Part 1: it's how the LLM in your LangGraph agent decides when to call this tool versus answering from its own knowledge or calling a different tool.

Here is the full canonical agent template from Part 1, with a LlamaIndex-backed RAG tool slotted directly into Module 3. Nothing else changes.
    # ============================================================
    # LANGGRAPH + LLAMAINDEX RAG AGENT: COMPLETE TEMPLATE
    # Extends: Part 1 core structure
    # ============================================================

    # --- MODULE 1: IMPORTS & CONFIGURATION ---
    import os
    from typing import Literal

    # LangChain / LangGraph imports (the orchestration layer)
    from langchain_openai import ChatOpenAI
    from langchain_core.messages import HumanMessage, SystemMessage, BaseMessage
    from langchain_core.tools import tool
    from langgraph.graph import StateGraph, MessagesState, START, END
    from langgraph.prebuilt import ToolNode
    from langgraph.checkpoint.memory import MemorySaver

    # LlamaIndex imports (the retrieval / data layer)
    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        VectorStoreIndex,
        StorageContext,
        load_index_from_storage,
    )
    from llama_index.llms.openai import OpenAI as LlamaOpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    # LangGraph's chat model, used by the agent's reasoning
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # LlamaIndex's model config, used internally by the query engine.
    # Note: these are SEPARATE from the LangGraph llm above.
    # Each framework manages its own model instances; they don't share state.
    Settings.llm = LlamaOpenAI(model="gpt-4o-mini", temperature=0.1)
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    Settings.chunk_size = 512

    # --- MODULE 2: STATE ---
    class State(MessagesState):
        pass  # messages field inherited; extend if your agent needs more

    # --- MODULE 3: TOOLS (RAG-backed) ---
    # Build or load the LlamaIndex knowledge base ONCE, at startup
    PERSIST_DIR = "./storage"

    if os.path.exists(PERSIST_DIR):
        # Reload existing index: no re-embedding, fast startup
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        index = load_index_from_storage(storage_context)
    else:
        # First run: build the index and persist it
        documents = SimpleDirectoryReader("./data").load_data()
        index = VectorStoreIndex.from_documents(documents, show_progress=True)
        index.storage_context.persist(persist_dir=PERSIST_DIR)

    query_engine = index.as_query_engine(similarity_top_k=3)

    @tool
    def search_knowledge_base(query: str) -> str:
        """Search internal company documents for policies, product specs,
        procedures, and other domain-specific information. Use this for any
        question that requires knowledge specific to this organization rather
        than general world knowledge."""
        response = query_engine.query(query)
        return str(response)

    tools = [search_knowledge_base]
    llm_with_tools = llm.bind_tools(tools)
    tool_node = ToolNode(tools)

    # --- MODULE 4: NODES ---
    def agent_node(state: State) -> dict:
        """The reasoning node. Decides whether to answer directly or
        search the knowledge base first."""
        system_prompt = SystemMessage(content=(
            "You are a helpful assistant with access to an internal knowledge base. "
            "Use the search_knowledge_base tool when the user asks about company-specific "
            "information. For general questions, answer directly."
        ))
        messages = [system_prompt] + state["messages"]
        response = llm_with_tools.invoke(messages)
        return {"messages": [response]}

    # --- MODULE 5: ROUTING ---
    def should_continue(state: State) -> Literal["tools", "__end__"]:
        last_message = state["messages"][-1]
        if hasattr(last_message, "tool_calls") and last_message.tool_calls:
            return "tools"
        return "__end__"

    # --- MODULE 6: GRAPH ASSEMBLY ---
    graph_builder = StateGraph(State)
    graph_builder.add_node("agent", agent_node)
    graph_builder.add_node("tools", tool_node)
    graph_builder.add_edge(START, "agent")
    graph_builder.add_conditional_edges(
        "agent", should_continue, {"tools": "tools", "__end__": END}
    )
    graph_builder.add_edge("tools", "agent")
    graph = graph_builder.compile(checkpointer=MemorySaver())

    # --- MODULE 7: ENTRYPOINT ---
    if __name__ == "__main__":
        config = {"configurable": {"thread_id": "session-001"}}
        print("RAG agent ready. Ask about your documents, or anything else.\n")
        while True:
            user_text = input("You: ").strip()
            if not user_text or user_text.lower() == "exit":
                break
            response = graph.invoke(
                {"messages": [HumanMessage(content=user_text)]},
                config=config,
            )
            print(f"Agent: {response['messages'][-1].content}\n")

It's worth tracing through a single query so the two frameworks' division of labor is completely clear:

    User: "What's our policy on remote work?"
      ↓
    agent_node: LangGraph's LLM reads the message, recognizes this needs
                internal info, decides to call search_knowledge_base
      ↓
    tools: ToolNode executes search_knowledge_base("What's our policy on remote work?")
      ↓
    Inside the tool: query_engine.query(...) runs. This is 100% LlamaIndex,
    invisible to LangGraph:
        1. Embeds the query
        2. Searches the vector index for the 3 closest chunks
        3. Feeds those chunks + the question to Settings.llm
        4. Returns a synthesized answer string
      ↓
    agent_node: LangGraph's LLM receives the tool's string result,
                and crafts the final response shown to the user
      ↓
    Response to user

Notice the clean separation: LangGraph never sees a Node, an embedding, or a vector. It just sees a tool that returns a string, exactly like the search_web or calculate tools from Part 1. LlamaIndex never sees messages, state, or a thread_id; it just receives a query string and returns an answer. Each framework does exactly what it's good at, and the seam between them is one function call.

The query_engine.query() pattern above is the simplest integration, and it's right for most cases. But sometimes you want more control. For example, you may want your LangGraph agent's LLM to be the one synthesizing the final answer (using your existing system prompt and conversation context), rather than letting LlamaIndex's internal Settings.llm generate a separate, disconnected answer.

For this, use a Retriever instead of a QueryEngine. The retriever does only the search step: it returns raw chunks, and you decide what to do with them.

    # --- Retriever-only tool: returns raw chunks, not a synthesized answer ---
    retriever = index.as_retriever(similarity_top_k=3)

    @tool
    def retrieve_documents(query: str) -> str:
        """Retrieve relevant document excerpts from the internal knowledge base.
        Returns raw excerpts for you to read and reason over yourself. Use this
        when you need to cite specific sources or combine information from
        multiple documents."""
        nodes = retriever.retrieve(query)

        # Format each retrieved chunk with its source for transparency
        formatted_chunks = []
        for i, node in enumerate(nodes):
            source = node.metadata.get("file_name", "unknown source")
            formatted_chunks.append(f"[Excerpt {i+1} from {source}]\n{node.text}")
        return "\n\n---\n\n".join(formatted_chunks)

With this pattern, your LangGraph agent_node's own LLM call (the one using llm_with_tools, with your full conversation history and system prompt) is the one that reads these raw excerpts and writes the final answer. This gives you a few real advantages: the agent can cite which document a fact came from, it can combine the retrieved excerpts with earlier conversation context, and you have one consistent "voice" for the agent rather than two different LLMs answering in two different styles.

Use QueryEngine (the Part C pattern) when you want the simplest possible integration and don't mind LlamaIndex's LLM generating the retrieval-based answer as a self-contained string. Good for most chatbots and assistants.

Use Retriever (this pattern) when you need source citations, want the agent's main LLM to maintain a consistent voice and reasoning style across both retrieved and non-retrieved answers, or want to combine retrieved chunks with other tool results before generating a final answer.

This extends the keyword cards from Parts 1–4 with LlamaIndex-specific terms.

LlamaIndex Core Keywords

Document: a raw loaded source (one file, one page). The unprocessed unit of data.

Node: a chunk of a Document. The retrievable unit. Carries metadata back to its source.

SimpleDirectoryReader("./path").load_data(): loads all supported files in a folder into Document objects.

Settings: global configuration singleton. Set Settings.llm, Settings.embed_model, and Settings.chunk_size once; every LlamaIndex component uses them.
VectorStoreIndex.from_documents(documents): builds a complete index: chunks, embeds, and stores in one call.

Querying Keywords

index.as_query_engine(similarity_top_k=3): creates a QueryEngine. Retrieves top-k chunks AND generates a synthesized answer in one call.

index.as_retriever(similarity_top_k=3): creates a Retriever. Retrieves top-k chunks only, with no answer generation. Use when you want your own LLM to synthesize.

query_engine.query("..."): runs the full retrieve-then-generate pipeline. Returns a Response object (use str(response) to get text).

retriever.retrieve("..."): runs only the retrieval step. Returns a list of Node objects with .text and .metadata.

Persistence Keywords

index.storage_context.persist(persist_dir="./storage"): saves the index to disk. Avoids re-embedding on every restart.

StorageContext.from_defaults(persist_dir="./storage") + load_index_from_storage(...): reloads a previously persisted index.

ChromaVectorStore / PineconeVectorStore / QdrantVectorStore: external vector database backends. Swap-in replacements for the default in-memory store, following the same pattern as LangGraph's checkpointers from Part 2.

The Bridge (LlamaIndex → LangGraph)

@tool + a function body calling query_engine.query(...): the entire integration pattern. LangGraph sees a normal tool; LlamaIndex internals stay fully encapsulated.

Two separate LLM configs: Settings.llm (LlamaIndex's internal model, used only inside the query engine) and your LangGraph llm (the agent's main reasoning model) are independent. They do not share conversation history or state.

Your agent only needs general knowledge, math, or live web search → You don't need LlamaIndex. Stick with the Part 1 template and standard tools.

Your agent needs to answer questions from a fixed set of documents (policies, manuals, reports) that don't change often → This article's Part C pattern (QueryEngine wrapped as a tool) is exactly right.
Your agent needs to cite sources, or combine retrieved facts with multi-turn conversation reasoning → Use the Retriever pattern from Part D instead.

Your knowledge base is large (100,000+ documents), needs frequent updates, or needs to scale to many concurrent users → Use an external vector database (Chroma, Pinecone, Qdrant) instead of the in-memory index, exactly as shown in Part B, Step 5.

You need the retrieval step itself to be a reviewable, pausable action (e.g., compliance requires a human to approve what gets searched) → Combine this article with Part 3's interrupt pattern. Wrap the retrieval call inside a node that pauses for approval before querying, just like the tool-approval pattern for any other sensitive tool.

The lesson underneath this entire article is one you've now seen twice: in Part 4, you learned that a complex multi-agent system is really just several single-agent graphs composed together. Here, you've learned that adding real-world knowledge to an agent isn't a LangGraph problem at all. It's a data problem, and the right tool for a data problem is a data framework.

LlamaIndex does the indexing and retrieval. LangGraph does the reasoning and orchestration. The entire integration between them is a single @tool-decorated function. That's not a limitation; it's the right amount of coupling. Each framework stays fully in its lane, and you can upgrade either side independently: swap your vector database without touching your graph, or add a supervisor pattern from Part 4 without touching your retrieval logic.

You now have the complete production scaffold across five articles: canonical structure (Part 1), memory management (Part 2), human-in-the-loop safety (Part 3), multi-agent orchestration (Part 4), and real-world knowledge via RAG (Part 5). Combined, these five patterns cover the overwhelming majority of what a serious, production-grade LangGraph application actually needs.
RAG Without the Guesswork: A Standardized LangGraph + LlamaIndex Pattern (https://pub.towardsai.net/rag-without-the-guesswork-a-standardized-langgraph-llamaindex-pattern-bcaf14f9c811) was originally published in Towards AI (https://pub.towardsai.net) on Medium.