# RAG Without the Guesswork: A Standardized LangGraph + LlamaIndex Pattern.

> Source: <https://pub.towardsai.net/rag-without-the-guesswork-a-standardized-langgraph-llamaindex-pattern-bcaf14f9c811?source=rss----98111c9905da---4>
> Published: 2026-06-21 16:01:01+00:00

*Part 6 of the LangGraph Mental Model series , a focused detour into Retrieval-Augmented Generation, connected back to the canonical structure from Parts 1–4*

*For other parts of the series : **Part 0** , **Part 1** , **Part 2** , **Part 3** *, *Part 4** , **Part 5*

What this article assumes:You’re comfortable with the seven-module structure and know how to give a LangGraph agent a tool (Part 1, Module 3). That’s genuinely all you need. If not, you can still follow along very easily. We are deliberately rolling complexity back down to a single agent here, no supervisors, no subgraphs, noSend. RAG is acapabilityyou add to an agent, not a new orchestration pattern, and this article treats it that way.

Every agent you’ve built so far in this series has one blind spot: it only knows what the LLM was trained on. Ask it about your company’s internal pricing policy, last quarter’s financial report, or a product manual you wrote last week, and it will either say “I don’t know” or — worse — confidently make something up.

**Retrieval-Augmented Generation (RAG)** fixes this. Instead of relying purely on what the model memorized during training, RAG retrieves relevant snippets from *your own documents* at query time and feeds them to the LLM as context before it answers. The LLM is no longer guessing ,it’s reading.

You could build a RAG pipeline by hand inside LangGraph: chunk documents yourself, generate embeddings yourself, build a vector index yourself, write the retrieval logic yourself. People do this. It’s a lot of plumbing for something that’s become a solved problem.

**LlamaIndex** is a framework built specifically to solve that plumbing problem. It is, by a wide margin, the most mature and widely used tool for the “data” half of an AI application ,loading, chunking, embedding, indexing, and retrieving. LangGraph, as you already know, is built for the “orchestration” half — state, control flow, multi-step reasoning.

This article shows you how to use each tool for what it’s best at: **LlamaIndex builds and serves the knowledge base. LangGraph wraps it as a tool and decides when to use it.** This is the dominant pattern in real production RAG systems today, and it’s the pattern we’ll build, step by step.

Before writing code, you need the same kind of mental model for LlamaIndex that you built for LangGraph in Part 1. Let’s build it the same way: concept, then keywords, then template.

Think of LlamaIndex as a **librarian for your LLM**. Imagine you hired a brilliant research assistant (the LLM) who has read most of the internet but has never seen your company’s files. You can’t just hand them a 10,000-page binder and say “remember all of this” — they don’t have room to hold it all in their head at once.

So instead, you hire a librarian (LlamaIndex). The librarian’s job has four parts: **read** all your documents, **organize** them into small, searchable index cards (chunks), **file** those cards in a system that allows fast lookup by topic (a vector index), and when your research assistant has a question, **fetch** only the 3–5 most relevant cards and hand them over.

The research assistant never has to memorize the binder. They just ask the librarian, get a tiny handful of relevant snippets, and reason over those.

Every RAG system, regardless of framework, follows this same five-stage pipeline. LlamaIndex gives you a clean abstraction for each stage:

```
1. LOAD       → Read raw files (PDF, Word, web pages, databases) into Documents2. CHUNK      → Split Documents into small, retrievable Nodes3. EMBED      → Convert each Node's text into a vector (a list of numbers                representing meaning)4. STORE      → Save those vectors in a Vector Index for fast lookup5. RETRIEVE   → At query time, embed the user's question, find the most                  similar Nodes, and return them as context
```

Stages 1–4 happen once, when you build your knowledge base (this is called **indexing**). Stage 5 happens every time a user asks a question (this is called **querying**). This distinction — indexing time vs. query time — is the single most important mental model in all of RAG. Keep it in your head as you read everything that follows.

**Document** — a raw piece of source content (one PDF, one web page, one database row) plus its metadata. This is the unprocessed, full-length unit of data.

**Node** — a chunk of a Document. LlamaIndex splits each Document into multiple smaller Node objects because LLMs and embedding models work better with focused, bite-sized text rather than entire documents at once. Nodes carry metadata back to their parent document (so you can always trace a chunk back to its source).

**SimpleDirectoryReader** — the most common loader. Points at a folder and automatically reads PDFs, text files, Word docs, and more, turning each file into one or more Document objects.

**VectorStoreIndex** — the core indexing class. Takes a list of Document (or Node) objects, embeds them, and builds a searchable vector index in memory (or in an external vector database, if you configure one).

**Settings** — a global configuration object. Instead of passing your LLM and embedding model into every function call, you set them once on Settings and every LlamaIndex component uses them automatically.

**QueryEngine** — the object you actually "ask questions" to. Created by calling .as_query_engine() on an index. Internally it does embedding the query, similarity search, and feeding retrieved context to the LLM to generate a final answer — all in one call.

**Retriever** — a lower-level object created by calling .as_retriever() on an index. Unlike a QueryEngine, a Retriever only does the *search* step — it returns the raw matching Node objects without generating an LLM answer. You'll use this when you want retrieval and generation to be separate, controllable steps (which is exactly what we want once we bring in LangGraph).

Let’s build a working LlamaIndex pipeline on its own first — no LangGraph yet. This mirrors how Part 1 taught LangGraph’s modules in isolation before assembling them.

```
# Core package + OpenAI LLM and embedding integrations (the common starting setup)pip install llama-index-core llama-index-llms-openai llama-index-embeddings-openai# Readers for common file types (PDF, Word, etc.)pip install llama-index-readers-file pypdf
```

LlamaIndex is intentionally split into many small packages. llama-index-core has the framework logic. Everything else : which LLM, which embedding model, which vector database, which file reader is its own package you add only when you need it. This keeps your installs lean.

``` python
# ── llamaindex_config.py ─────────────────────────────────────import osfrom llama_index.core import Settingsfrom llama_index.llms.openai import OpenAIfrom llama_index.embeddings.openai import OpenAIEmbedding# Settings is global - configure once, used everywhere in LlamaIndexSettings.llm = OpenAI(    model="gpt-4o-mini",       # Used for generating final answers from retrieved context    temperature=0.1,            # Low temperature: factual, not creative)Settings.embed_model = OpenAIEmbedding(    model="text-embedding-3-small",   # Used to convert text into vectors)# Controls how documents are split into Nodes (chunks)Settings.chunk_size = 512        # Max tokens per chunkSettings.chunk_overlap = 50      # Overlap between consecutive chunks, to preserve context across boundaries
```

A quick but important note on chunk_size: too large, and each chunk contains so much text that retrieval becomes imprecise (you fetch a chunk that's only 10% relevant). Too small, and you lose surrounding context that the LLM needs to understand a fact. 512 tokens with a 50-token overlap is a reasonable, widely-used starting point ,tune it based on your documents.

```
# ── build_knowledge_base.py ──────────────────────────────────from llama_index.core import SimpleDirectoryReader, VectorStoreIndex# ── LOAD: Read all files in a folder into Document objects ──documents = SimpleDirectoryReader("./data").load_data()print(f"Loaded {len(documents)} documents")# ── CHUNK + EMBED + STORE: all three happen inside this one call ──# VectorStoreIndex automatically:#   1. Splits each Document into Nodes (using Settings.chunk_size)#   2. Embeds each Node (using Settings.embed_model)#   3. Stores the vectors in an in-memory indexindex = VectorStoreIndex.from_documents(documents, show_progress=True)# ── RETRIEVE + GENERATE: ask a question ──────────────────────query_engine = index.as_query_engine(    similarity_top_k=3,   # Retrieve the 3 most relevant chunks for each query)response = query_engine.query("What is our refund policy for enterprise customers?")print(response)
```

That’s a complete, working RAG system in about ten lines. This is genuinely the “hello world” of LlamaIndex, and it’s a fair demonstration of why the framework exists — compare this to manually writing a chunker, an embedding loop, a similarity search function, and a prompt template yourself.

Building the index calls the embedding model for every chunk — which costs money and time. You don’t want to rebuild it every time your app starts. Persist it to disk.

```
# ── Save the index after building it ─────────────────────────index.storage_context.persist(persist_dir="./storage")# ── Load it back later without re-embedding anything ──────────from llama_index.core import StorageContext, load_index_from_storagestorage_context = StorageContext.from_defaults(persist_dir="./storage")index = load_index_from_storage(storage_context)
```

storage_context.persist() writes the vector data, the node metadata, and the document store to a local folder. load_index_from_storage() reconstructs the exact same index from those files — no re-embedding, no API calls. This is the standalone equivalent of the checkpointer pattern you learned in Part 2: build once, persist, reload cheaply.

The in-memory index above is fine for prototypes, but for production you typically want a dedicated vector database — something built to handle millions of vectors, concurrent reads, and metadata filtering at scale. Vector DB: [Chroma](https://www.trychroma.com/)

```
# ── Using Chroma as a persistent, production-grade vector store ──# pip install llama-index-vector-stores-chroma chromadbimport chromadbfrom llama_index.vector_stores.chroma import ChromaVectorStorefrom llama_index.core import StorageContext, VectorStoreIndexchroma_client = chromadb.PersistentClient(path="./chroma_db")chroma_collection = chroma_client.get_or_create_collection("my_knowledge_base")vector_store = ChromaVectorStore(chroma_collection=chroma_collection)storage_context = StorageContext.from_defaults(vector_store=vector_store)# Build the index directly into Chromaindex = VectorStoreIndex.from_documents(    documents,    storage_context=storage_context)# Later, in a different process, reconnect without re-indexing:index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
```

The pattern is identical to what you learned with checkpointers in Part 2: swap one object (MemorySaver → SqliteSaver, or here, in-memory → ChromaVectorStore) and everything downstream,your query engine, your retrieval logic — stays exactly the same. The same swap pattern works for Pinecone, Qdrant, Weaviate, and most other vector databases LlamaIndex supports.

Now for the part this whole article has been building toward. We need to bridge two frameworks that speak different languages: LlamaIndex’s QueryEngine is not a LangGraph tool. LangGraph's agent expects tools built with LangChain's @tool decorator.

The integration pattern is refreshingly simple once you see it: **write a regular Python function, decorated with ****@tool, whose body calls ****query_engine.query(...).** That's the entire bridge. LangGraph doesn't need to know anything about LlamaIndex internals — it just sees a tool that takes a string and returns a string, exactly like every other tool from Part 1.

```
# ── MODULE 3: TOOLS (LlamaIndex-backed) ─────────────────────from langchain_core.tools import tool# The query_engine built in Part B - created once, at startup# (In a real app, you'd load this from persisted storage, not rebuild it every time)@tooldef search_knowledge_base(query: str) -> str:    """Search the internal knowledge base for company policies, product     documentation, and internal procedures. Use this whenever the user asks     a question that might be answered by internal company documents rather     than general knowledge.        Args:        query: A natural-language question to search for.        Returns:        A synthesized answer based on the most relevant retrieved documents.    """    response = query_engine.query(query)    return str(response)
```

That’s it. The docstring matters exactly as much as it did in Part 1 — it’s how the LLM in your LangGraph agent decides *when* to call this tool versus answering from its own knowledge or calling a different tool.

Here is the full canonical agent template from Part 1, with a LlamaIndex-backed RAG tool slotted directly into Module 3. Nothing else changes.

```
# ============================================================# LANGGRAPH + LLAMAINDEX RAG AGENT — COMPLETE TEMPLATE# Extends: Part 1 (core structure)# ============================================================# ── MODULE 1: IMPORTS & CONFIGURATION ───────────────────────import osfrom typing import Literal# LangChain / LangGraph imports (the orchestration layer)from langchain_openai import ChatOpenAIfrom langchain_core.messages import HumanMessage, SystemMessage, BaseMessagefrom langchain_core.tools import toolfrom langgraph.graph import StateGraph, MessagesState, START, ENDfrom langgraph.prebuilt import ToolNodefrom langgraph.checkpoint.memory import MemorySaver# LlamaIndex imports (the retrieval / data layer)from llama_index.core import (1    Settings, SimpleDirectoryReader, VectorStoreIndex,    StorageContext, load_index_from_storage)from llama_index.llms.openai import OpenAI as LlamaOpenAIfrom llama_index.embeddings.openai import OpenAIEmbedding# LangGraph's chat model - used by the agent's reasoningllm = ChatOpenAI(model="gpt-4o", temperature=0)# LlamaIndex's model config - used internally by the query engine# Note: these are SEPARATE from the LangGraph llm above. Each framework# manages its own model instances; they don't share state.Settings.llm = LlamaOpenAI(model="gpt-4o-mini", temperature=0.1)Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")Settings.chunk_size = 512# ── MODULE 2: STATE ──────────────────────────────────────────class State(MessagesState):    pass  # messages field inherited; extend if your agent needs more# ── MODULE 3: TOOLS (RAG-backed) ─────────────────────────────# Build or load the LlamaIndex knowledge base ONCE, at startupPERSIST_DIR = "./storage"if os.path.exists(PERSIST_DIR):    # Reload existing index - no re-embedding, fast startup    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)    index = load_index_from_storage(storage_context)else:    # First run - build the index and persist it    documents = SimpleDirectoryReader("./data").load_data()    index = VectorStoreIndex.from_documents(documents, show_progress=True)    index.storage_context.persist(persist_dir=PERSIST_DIR)query_engine = index.as_query_engine(similarity_top_k=3)@tooldef search_knowledge_base(query: str) -> str:    """Search internal company documents for policies, product specs,     procedures, and other domain-specific information. Use this for any     question that requires knowledge specific to this organization rather     than general world knowledge."""    response = query_engine.query(query)    return str(response)tools = [search_knowledge_base]llm_with_tools = llm.bind_tools(tools)tool_node = ToolNode(tools)# ── MODULE 4: NODES ──────────────────────────────────────────def agent_node(state: State) -> dict:    """The reasoning node. Decides whether to answer directly or     search the knowledge base first."""    system_prompt = SystemMessage(content=(        "You are a helpful assistant with access to an internal knowledge base. "        "Use the search_knowledge_base tool when the user asks about company-specific "        "information. For general questions, answer directly."    ))    messages = [system_prompt] + state["messages"]    response = llm_with_tools.invoke(messages)    return {"messages": [response]}# ── MODULE 5: ROUTING ────────────────────────────────────────def should_continue(state: State) -> Literal["tools", "__end__"]:    last_message = state["messages"][-1]    if hasattr(last_message, "tool_calls") and last_message.tool_calls:        return "tools"    return "__end__"# ── MODULE 6: GRAPH ASSEMBLY ─────────────────────────────────graph_builder = StateGraph(State)graph_builder.add_node("agent", agent_node)graph_builder.add_node("tools", tool_node)graph_builder.add_edge(START, "agent")graph_builder.add_conditional_edges(    "agent", should_continue,    {"tools": "tools", "__end__": END})graph_builder.add_edge("tools", "agent")graph = graph_builder.compile(checkpointer=MemorySaver())# ── MODULE 7: ENTRYPOINT ──────────────────────────────────────if __name__ == "__main__":    config = {"configurable": {"thread_id": "session-001"}}        print("RAG agent ready. Ask about your documents, or anything else.\n")        while True:        user_text = input("You: ").strip()        if not user_text or user_text.lower() == "exit":            break                response = graph.invoke(            {"messages": [HumanMessage(content=user_text)]},            config=config        )        print(f"Agent: {response['messages'][-1].content}\n")
```

It’s worth tracing through a single query so the two frameworks’ division of labor is completely clear:

```
User: "What's our policy on remote work?"        ↓[agent_node] — LangGraph's LLM reads the message, recognizes this needs               internal info, decides to call search_knowledge_base        ↓[tools] — ToolNode executes search_knowledge_base("What's our policy on remote work?")        ↓        Inside the tool: query_engine.query(...) runs —        this is 100% LlamaIndex, invisible to LangGraph:          1. Embeds the query          2. Searches the vector index for the 3 closest chunks          3. Feeds those chunks + the question to Settings.llm          4. Returns a synthesized answer string        ↓[agent_node] — LangGraph's LLM receives the tool's string result,               and crafts the final response shown to the user        ↓Response to user
```

Notice the clean separation: LangGraph never sees a Node, an embedding, or a vector. It just sees a tool that returns a string — exactly like the search_web or calculate tools from Part 1. LlamaIndex never sees messages, state, or a thread_id — it just receives a query string and returns an answer. Each framework does exactly what it's good at, and the seam between them is one function call.

The query_engine.query() pattern above is the simplest integration, and it's right for most cases. But sometimes you want more control — for example, you want *your* LangGraph agent's LLM to be the one synthesizing the final answer (using your existing system prompt and conversation context), rather than letting LlamaIndex's internal Settings.llm generate a separate, disconnected answer.

For this, use a **Retriever** instead of a QueryEngine. The retriever does *only* the search step — it returns raw chunks, and you decide what to do with them.

```
# ── Retriever-only tool: returns raw chunks, not a synthesized answer ──retriever = index.as_retriever(similarity_top_k=3)@tooldef retrieve_documents(query: str) -> str:    """Retrieve relevant document excerpts from the internal knowledge base.    Returns raw excerpts for you to read and reason over yourself -     use this when you need to cite specific sources or combine information     from multiple documents."""        nodes = retriever.retrieve(query)        # Format each retrieved chunk with its source for transparency    formatted_chunks = []    for i, node in enumerate(nodes):        source = node.metadata.get("file_name", "unknown source")        formatted_chunks.append(f"[Excerpt {i+1} from {source}]\n{node.text}")        return "\n\n---\n\n".join(formatted_chunks)
```

With this pattern, your LangGraph agent_node's own LLM call (the one using llm_with_tools, with your full conversation history and system prompt) is the one that reads these raw excerpts and writes the final answer. This gives you a few real advantages: the agent can cite which document a fact came from, it can combine the retrieved excerpts with earlier conversation context, and you have one consistent "voice" for the agent rather than two different LLMs answering in two different styles.

**Use ****QueryEngine** (the Part C pattern) when you want the simplest possible integration and don't mind LlamaIndex's LLM generating the retrieval-based answer as a self-contained string. Good for most chatbots and assistants.

**Use ****Retriever** (this pattern) when you need source citations, want the agent's main LLM to maintain a consistent voice and reasoning style across both retrieved and non-retrieved answers, or want to combine retrieved chunks with other tool results before generating a final answer.

This extends the keyword cards from Parts 1–4 with LlamaIndex-specific terms.

**LlamaIndex Core Keywords** Document — a raw loaded source (one file, one page). The unprocessed unit of data. Node — a chunk of a Document. The retrievable unit. Carries metadata back to its source. SimpleDirectoryReader("./path").load_data() — loads all supported files in a folder into Document objects. Settings — global configuration singleton. Set Settings.llm, Settings.embed_model, Settings.chunk_size once; every LlamaIndex component uses them. VectorStoreIndex.from_documents(documents) — builds a complete index: chunks, embeds, and stores in one call.

**Querying Keywords** index.as_query_engine(similarity_top_k=3) — creates a QueryEngine. Retrieves top-k chunks AND generates a synthesized answer in one call. index.as_retriever(similarity_top_k=3) — creates a Retriever. Retrieves top-k chunks only — no answer generation. Use when you want your own LLM to synthesize. query_engine.query("...") — runs the full retrieve-then-generate pipeline. Returns a Response object (use str(response) to get text). retriever.retrieve("...") — runs only the retrieval step. Returns a list of Node objects with .text and .metadata.

**Persistence Keywords** index.storage_context.persist(persist_dir="./storage") — saves the index to disk. Avoids re-embedding on every restart. StorageContext.from_defaults(persist_dir="./storage") + load_index_from_storage(...) — reloads a previously persisted index. ChromaVectorStore / PineconeVectorStore / QdrantVectorStore — external vector database backends. Swap-in replacements for the default in-memory store, following the same pattern as LangGraph's checkpointers from Part 2.

**The Bridge (LlamaIndex → LangGraph)** @tool + a function body calling query_engine.query(...) — the entire integration pattern. LangGraph sees a normal tool; LlamaIndex internals stay fully encapsulated. **Two separate LLM configs** — Settings.llm (LlamaIndex's internal model, used only inside the query engine) and your LangGraph llm (the agent's main reasoning model) are independent. They do not share conversation history or state.

**Your agent only needs general knowledge, math, or live web search** → You don’t need LlamaIndex. Stick with the Part 1 template and standard tools.

**Your agent needs to answer questions from a fixed set of documents (policies, manuals, reports) that don’t change often** → This article’s Part C pattern (QueryEngine wrapped as a tool) is exactly right.

**Your agent needs to cite sources, or combine retrieved facts with multi-turn conversation reasoning** → Use the Retriever pattern from Part D instead.

**Your knowledge base is large (100,000+ documents), needs frequent updates, or needs to scale to many concurrent users** → Use an external vector database (Chroma, Pinecone, Qdrant) instead of the in-memory index, exactly as shown in Part B, Step 5.

**You need the retrieval step itself to be a reviewable, pausable action** (e.g., compliance requires a human to approve what gets searched) → Combine this article with Part 3’s interrupt() pattern. Wrap the retrieval call inside a node that pauses for approval before querying, just like the tool-approval pattern for any other sensitive tool.

The lesson underneath this entire article is one you’ve now seen twice: in Part 4, you learned that a complex multi-agent system is really just several single-agent graphs composed together. Here, you’ve learned that adding real-world knowledge to an agent isn’t a LangGraph problem at all — it’s a *data* problem, and the right tool for a data problem is a data framework.

LlamaIndex does the indexing and retrieval. LangGraph does the reasoning and orchestration. The entire integration between them is a single @tool-decorated function. That's not a limitation — it's the right amount of coupling. Each framework stays fully in its lane, and you can upgrade either side independently: swap your vector database without touching your graph, or add a supervisor pattern from Part 4 without touching your retrieval logic.

You now have the complete production scaffold across five articles: canonical structure (Part 1), memory management (Part 2), human-in-the-loop safety (Part 3), multi-agent orchestration (Part 4), and real-world knowledge via RAG (Part 5). Combined, these five patterns cover the overwhelming majority of what a serious, production-grade LangGraph application actually needs.

*For other parts of the series : **Part 0** , **Part 1** , **Part 2** , **Part 3** *, *Part 4** , Part 5,*

[RAG Without the Guesswork: A Standardized LangGraph + LlamaIndex Pattern.](https://pub.towardsai.net/rag-without-the-guesswork-a-standardized-langgraph-llamaindex-pattern-bcaf14f9c811) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.