
RAG Without the Guesswork: A Standardized LangGraph + LlamaIndex Pattern.

A new standardized pattern combining LangGraph and LlamaIndex for Retrieval-Augmented Generation (RAG) eliminates guesswork by using LlamaIndex for data indexing and retrieval, and LangGraph for orchestration. The approach treats RAG as a capability added to an agent, not a new orchestration pattern, and provides a five-stage pipeline for building production-ready RAG systems.

17 min read · Published Jun 21, 2026

Part 5 of the LangGraph Mental Model series: a focused detour into Retrieval-Augmented Generation, connected back to the canonical structure from Parts 1–4.

For other parts of the series: Part 0, Part 1, Part 2, Part 3, Part 4, Part 5

What this article assumes: You’re comfortable with the seven-module structure and know how to give a LangGraph agent a tool (Part 1, Module 3). That’s genuinely all you need. If not, you can still follow along very easily. We are deliberately rolling complexity back down to a single agent here: no supervisors, no subgraphs, no Send. RAG is a capability you add to an agent, not a new orchestration pattern, and this article treats it that way.

Every agent you’ve built so far in this series has one blind spot: it only knows what the LLM was trained on. Ask it about your company’s internal pricing policy, last quarter’s financial report, or a product manual you wrote last week, and it will either say “I don’t know” or — worse — confidently make something up.

Retrieval-Augmented Generation (RAG) fixes this. Instead of relying purely on what the model memorized during training, RAG retrieves relevant snippets from your own documents at query time and feeds them to the LLM as context before it answers. The LLM is no longer guessing; it’s reading.

You could build a RAG pipeline by hand inside LangGraph: chunk documents yourself, generate embeddings yourself, build a vector index yourself, write the retrieval logic yourself. People do this. It’s a lot of plumbing for something that’s become a solved problem.

LlamaIndex is a framework built specifically to solve that plumbing problem. It is, by a wide margin, the most mature and widely used tool for the “data” half of an AI application: chunking, embedding, indexing, and retrieving. LangGraph, as you already know, is built for the “orchestration” half: state, control flow, multi-step reasoning.

This article shows you how to use each tool for what it’s best at: LlamaIndex builds and serves the knowledge base. LangGraph wraps it as a tool and decides when to use it. This is the dominant pattern in real production RAG systems today, and it’s the pattern we’ll build, step by step.

Before writing code, you need the same kind of mental model for LlamaIndex that you built for LangGraph in Part 1. Let’s build it the same way: concept, then keywords, then template.

Think of LlamaIndex as a librarian for your LLM. Imagine you hired a brilliant research assistant (the LLM) who has read most of the internet but has never seen your company’s files. You can’t just hand them a 10,000-page binder and say “remember all of this” — they don’t have room to hold it all in their head at once.

So instead, you hire a librarian (LlamaIndex). The librarian’s job has four parts: read all your documents, organize them into small, searchable index cards (chunks), file those cards in a system that allows fast lookup by topic (a vector index), and when your research assistant has a question, fetch only the 3–5 most relevant cards and hand them over.

The research assistant never has to memorize the binder. They just ask the librarian, get a tiny handful of relevant snippets, and reason over those.

Every RAG system, regardless of framework, follows this same five-stage pipeline. LlamaIndex gives you a clean abstraction for each stage:

1. LOAD → Read raw files (PDF, Word, web pages, databases) into Documents
2. CHUNK → Split Documents into small, retrievable Nodes
3. EMBED → Convert each Node's text into a vector (a list of numbers representing meaning)
4. STORE → Save those vectors in a Vector Index for fast lookup
5. RETRIEVE → At query time, embed the user's question, find the most similar Nodes, and return them as context

Stages 1–4 happen once, when you build your knowledge base (this is called indexing). Stage 5 happens every time a user asks a question (this is called querying). This distinction — indexing time vs. query time — is the single most important mental model in all of RAG. Keep it in your head as you read everything that follows.
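To make the indexing-time versus query-time split concrete, here is a toy sketch of the five stages using only the Python standard library. It is not how LlamaIndex works internally: the word-count `embed` function stands in for a real embedding model, the word-window `chunk` function stands in for a real splitter, and the two sample documents are made up for illustration.

```python
import math
from collections import Counter


def chunk(text, chunk_size=8, overlap=2):
    """CHUNK: split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """EMBED (toy): word counts stand in for a real embedding vector."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Indexing time (stages 1-4): runs once, when the knowledge base is built.
documents = [  # LOAD: hypothetical in-memory stand-ins for files
    "Remote work is allowed three days per week for all employees.",
    "Refunds are processed within ten business days of the request.",
]
nodes = [piece for doc in documents for piece in chunk(doc)]  # CHUNK
index = [(node, embed(node)) for node in nodes]               # EMBED + STORE


# Query time (stage 5): runs for every question.
def retrieve(question, top_k=1):
    """RETRIEVE: embed the question and return the most similar chunks."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(query_vector, entry[1]), reverse=True)
    return [node for node, _ in ranked[:top_k]]
```

Everything above `retrieve` is paid for once; only `retrieve` runs per question, which is why persisting the index matters so much in a real system.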

Document — a raw piece of source content (one PDF, one web page, one database row) plus its metadata. This is the unprocessed, full-length unit of data.

Node — a chunk of a Document. LlamaIndex splits each Document into multiple smaller Node objects because LLMs and embedding models work better with focused, bite-sized text rather than entire documents at once. Nodes carry metadata back to their parent document (so you can always trace a chunk back to its source).

SimpleDirectoryReader is the most common data loader. It points at a folder and automatically reads PDFs, text files, Word docs, and more, turning each file into one or more Document objects.

VectorStoreIndex — the core indexing class. Takes a list of Document (or Node) objects, embeds them, and builds a searchable vector index in memory (or in an external vector database, if you configure one).

Settings — a global configuration object. Instead of passing your LLM and embedding model into every function call, you set them once on Settings and every LlamaIndex component uses them automatically.

QueryEngine is the object you actually "ask questions" to. It is created by calling .as_query_engine() on an index. Internally it embeds the query, runs a similarity search, and feeds the retrieved context to the LLM to generate a final answer, all in one call.

Retriever — a lower-level object created by calling .as_retriever() on an index. Unlike a QueryEngine, a Retriever only does the search step — it returns the raw matching Node objects without generating an LLM answer. You'll use this when you want retrieval and generation to be separate, controllable steps (which is exactly what we want once we bring in LangGraph).

Let’s build a working LlamaIndex pipeline on its own first — no LangGraph yet. This mirrors how Part 1 taught LangGraph’s modules in isolation before assembling them.

LlamaIndex is intentionally split into many small packages. llama-index-core has the framework logic. Everything else (which LLM, which embedding model, which vector database, which file reader) is its own package you add only when you need it. This keeps your installs lean.

A quick but important note on chunk_size: too large, and each chunk contains so much text that retrieval becomes imprecise (you fetch a chunk that's only 10% relevant). Too small, and you lose surrounding context that the LLM needs to understand a fact. 512 tokens with a 50-token overlap is a reasonable, widely used starting point; tune it based on your documents.
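A minimal sketch of such a pipeline, with configuration, loading, indexing, and querying in one place. It assumes the llama-index-core, llama-index-llms-openai, and llama-index-embeddings-openai packages, an OPENAI_API_KEY in the environment, and a hypothetical ./data folder. The model names and the `build_query_engine` helper are illustrative choices, not something the article prescribes.

```python
def build_query_engine(data_dir="./data", top_k=3):
    """Load files, build an in-memory vector index, and return a query engine."""
    # Imported inside the function so this module still imports cleanly when
    # the optional llama-index packages are not installed.
    from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.llms.openai import OpenAI

    # Global configuration: set once, used by every LlamaIndex component.
    Settings.llm = OpenAI(model="gpt-4o-mini")  # assumed model name
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")  # assumed
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50

    documents = SimpleDirectoryReader(data_dir).load_data()  # LOAD
    index = VectorStoreIndex.from_documents(documents)       # CHUNK + EMBED + STORE
    return index.as_query_engine(similarity_top_k=top_k)     # RETRIEVE + answer
```

Calling `build_query_engine().query("What is our policy on remote work?")` returns a Response object; `str(response)` gives the answer text.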

That’s a complete, working RAG system in about ten lines. This is genuinely the “hello world” of LlamaIndex, and it’s a fair demonstration of why the framework exists — compare this to manually writing a chunker, an embedding loop, a similarity search function, and a prompt template yourself.

Building the index calls the embedding model for every chunk — which costs money and time. You don’t want to rebuild it every time your app starts. Persist it to disk.

storage_context.persist() writes the vector data, the node metadata, and the document store to a local folder. load_index_from_storage() reconstructs the exact same index from those files — no re-embedding, no API calls. This is the standalone equivalent of the checkpointer pattern you learned in Part 2: build once, persist, reload cheaply.
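The build-once, reload-cheaply flow might be sketched as follows. The `load_or_build_index` helper and the ./data and ./storage paths are hypothetical; the LlamaIndex calls are the ones named above. The directory check is a simplification: it assumes that if the folder exists, it holds a complete persisted index.

```python
import os


def load_or_build_index(data_dir="./data", persist_dir="./storage"):
    """Reload a persisted index if one exists; otherwise build it and persist it."""
    # Imported lazily so the module imports without llama-index installed.
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if os.path.isdir(persist_dir):
        # Cheap path: no re-embedding, no embedding API calls.
        # Assumption: an existing folder contains a complete persisted index.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)

    # Expensive path: embeds every chunk, then saves the result for next time.
    documents = SimpleDirectoryReader(data_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

If the source documents change, delete the persist folder (or use a new one) so the index is rebuilt rather than reloaded stale.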

The in-memory index above is fine for prototypes, but for production you typically want a dedicated vector database: something built to handle millions of vectors, concurrent reads, and metadata filtering at scale. The example used here is Chroma.

The pattern is identical to what you learned with checkpointers in Part 2: swap one object (MemorySaver → SqliteSaver, or here, in-memory → ChromaVectorStore) and everything downstream (your query engine, your retrieval logic) stays exactly the same. The same swap pattern works for Pinecone, Qdrant, Weaviate, and most other vector databases LlamaIndex supports.
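A sketch of the Chroma swap, assuming the chromadb and llama-index-vector-stores-chroma packages are installed. The `build_chroma_index` helper, the ./chroma_db path, and the collection name are hypothetical.

```python
def build_chroma_index(documents, db_path="./chroma_db", collection_name="knowledge_base"):
    """Build a vector index whose vectors live in a local Chroma database."""
    # Imported lazily so the module imports without these optional packages.
    import chromadb
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    # Chroma persists vectors on disk under db_path.
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(collection_name)

    # The one object that gets swapped: the vector store behind the index.
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Same call as the in-memory version, plus the storage_context argument.
    return VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

The returned index supports the same `.as_query_engine()` and `.as_retriever()` calls as the in-memory one. On later runs, `VectorStoreIndex.from_vector_store(vector_store)` can reconnect to the existing collection without re-embedding.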

Now for the part this whole article has been building toward. We need to bridge two frameworks that speak different languages: LlamaIndex’s QueryEngine is not a LangGraph tool. LangGraph's agent expects tools built with LangChain's @tool decorator.

The integration pattern is refreshingly simple once you see it: write a regular Python function, decorated with @tool, whose body calls query_engine.query(...). That's the entire bridge. LangGraph doesn't need to know anything about LlamaIndex internals; it just sees a tool that takes a string and returns a string, exactly like every other tool from Part 1.

That’s it. The docstring matters exactly as much as it did in Part 1 — it’s how the LLM in your LangGraph agent decides when to call this tool versus answering from its own knowledge or calling a different tool.
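A minimal sketch of that bridge, assuming langchain-core is installed. The two helper names (`make_search_fn`, `as_langchain_tool`) and the docstring wording are hypothetical. The function is built by a small factory so the query engine is passed in explicitly, and `tool(fn)` is the same thing as stacking `@tool` on the function.

```python
def make_search_fn(query_engine):
    """Return a plain function that answers questions via the given query engine."""

    def search_knowledge_base(query: str) -> str:
        """Search the internal company knowledge base (policies, manuals, reports).

        Use this for questions about internal documents. Do not use it for
        general knowledge, math, or current events.
        """
        response = query_engine.query(query)  # retrieval + synthesis, all LlamaIndex
        return str(response)                  # LangGraph only ever sees a string

    return search_knowledge_base


def as_langchain_tool(fn):
    """Apply LangChain's @tool decorator to a documented, type-hinted function."""
    # Imported lazily so the module imports without langchain-core installed.
    from langchain_core.tools import tool

    return tool(fn)
```

With a real query engine, `as_langchain_tool(make_search_fn(query_engine))` yields an object that can go straight into the agent's tools list.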

Here is the full canonical agent template from Part 1, with a LlamaIndex-backed RAG tool slotted directly into Module 3. Nothing else changes.
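A compact sketch of an agent graph of this shape, assuming the langgraph package and a LangChain chat model that supports `bind_tools`. It does not reproduce the seven-module layout from Part 1; the helper names are hypothetical, and the node names follow the ones used in the rest of this article (`agent_node`, `tools`).

```python
def make_agent_node(llm_with_tools):
    """Build the agent node: one LLM call over the conversation so far."""

    def agent_node(state):
        reply = llm_with_tools.invoke(state["messages"])
        return {"messages": [reply]}  # appended to state by the messages reducer

    return agent_node


def build_rag_agent(llm, tools):
    """Wire agent_node and a ToolNode into the usual agent/tools loop."""
    # Imported lazily so the module imports without langgraph installed.
    from langgraph.checkpoint.memory import MemorySaver
    from langgraph.graph import START, MessagesState, StateGraph
    from langgraph.prebuilt import ToolNode, tools_condition

    builder = StateGraph(MessagesState)
    builder.add_node("agent_node", make_agent_node(llm.bind_tools(tools)))
    builder.add_node("tools", ToolNode(tools))  # runs the RAG tool like any other

    builder.add_edge(START, "agent_node")
    # Route to "tools" when the LLM requested a tool call, otherwise finish.
    builder.add_conditional_edges("agent_node", tools_condition)
    builder.add_edge("tools", "agent_node")

    return builder.compile(checkpointer=MemorySaver())
```

Because a checkpointer is attached, each call needs a thread id, for example `graph.invoke({"messages": [("user", "What's our policy on remote work?")]}, config={"configurable": {"thread_id": "demo"}})`.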

It’s worth tracing through a single query so the two frameworks’ division of labor is completely clear:

User: "What's our policy on remote work?"
        ↓
[agent_node]: LangGraph's LLM reads the message, recognizes this needs
              internal info, decides to call search_knowledge_base
        ↓
[tools]: ToolNode executes search_knowledge_base("What's our policy on remote work?")
        ↓
        Inside the tool: query_engine.query(...) runs.
        This is 100% LlamaIndex, invisible to LangGraph:
          1. Embeds the query
          2. Searches the vector index for the 3 closest chunks
          3. Feeds those chunks + the question to Settings.llm
          4. Returns a synthesized answer string
        ↓
[agent_node]: LangGraph's LLM receives the tool's string result
              and crafts the final response shown to the user
        ↓
Response to user

Notice the clean separation: LangGraph never sees a Node, an embedding, or a vector. It just sees a tool that returns a string — exactly like the search_web or calculate tools from Part 1. LlamaIndex never sees messages, state, or a thread_id — it just receives a query string and returns an answer. Each framework does exactly what it's good at, and the seam between them is one function call.

The query_engine.query() pattern above is the simplest integration, and it's right for most cases. But sometimes you want more control — for example, you want your LangGraph agent's LLM to be the one synthesizing the final answer (using your existing system prompt and conversation context), rather than letting LlamaIndex's internal Settings.llm generate a separate, disconnected answer.

For this, use a Retriever instead of a QueryEngine. The retriever does only the search step — it returns raw chunks, and you decide what to do with them.

With this pattern, your LangGraph agent_node's own LLM call (the one using llm_with_tools, with your full conversation history and system prompt) is the one that reads these raw excerpts and writes the final answer. This gives you a few real advantages: the agent can cite which document a fact came from, it can combine the retrieved excerpts with earlier conversation context, and you have one consistent "voice" for the agent rather than two different LLMs answering in two different styles.
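A sketch of the retriever-backed variant. The helper names and the excerpt format are hypothetical. The code relies only on each retrieved item exposing `.text` and `.metadata`, and it assumes the source file is recorded under the `file_name` metadata key, which is what SimpleDirectoryReader attaches by default.

```python
def format_excerpts(results):
    """Render retrieved chunks as numbered excerpts with their source files."""
    if not results:
        return "No relevant excerpts found."
    blocks = []
    for number, item in enumerate(results, start=1):
        # Assumption: the loader stored the originating file under "file_name".
        source = item.metadata.get("file_name", "unknown source")
        blocks.append(f"[{number}] (source: {source})\n{item.text}")
    return "\n\n".join(blocks)


def make_retrieval_fn(retriever):
    """Return a plain function that fetches raw excerpts, with no LLM synthesis."""

    def search_knowledge_base(query: str) -> str:
        """Retrieve raw excerpts from internal company documents.

        Returns numbered excerpts with source file names. Read them and write
        the answer yourself, citing the sources you used.
        """
        return format_excerpts(retriever.retrieve(query))

    return search_knowledge_base
```

With a real index, `make_retrieval_fn(index.as_retriever(similarity_top_k=3))` gives a function that can be wrapped with `@tool` in the same way as the QueryEngine version. The numbered source labels are what let the agent's own LLM cite where a fact came from.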

Use QueryEngine (the Part C pattern) when you want the simplest possible integration and don't mind LlamaIndex's LLM generating the retrieval-based answer as a self-contained string. Good for most chatbots and assistants.

Use Retriever (this pattern) when you need source citations, want the agent's main LLM to maintain a consistent voice and reasoning style across both retrieved and non-retrieved answers, or want to combine retrieved chunks with other tool results before generating a final answer.

This extends the keyword cards from Parts 1–4 with LlamaIndex-specific terms.

LlamaIndex Core Keywords

Document: a raw loaded source (one file, one page). The unprocessed unit of data.
Node: a chunk of a Document. The retrievable unit. Carries metadata back to its source.
SimpleDirectoryReader("./path").load_data(): loads all supported files in a folder into Document objects.
Settings: global configuration singleton. Set Settings.llm, Settings.embed_model, Settings.chunk_size once; every LlamaIndex component uses them.
VectorStoreIndex.from_documents(documents): builds a complete index (chunks, embeds, and stores) in one call.

Querying Keywords

index.as_query_engine(similarity_top_k=3): creates a QueryEngine. Retrieves top-k chunks AND generates a synthesized answer in one call.
index.as_retriever(similarity_top_k=3): creates a Retriever. Retrieves top-k chunks only, with no answer generation. Use when you want your own LLM to synthesize.
query_engine.query("..."): runs the full retrieve-then-generate pipeline. Returns a Response object (use str(response) to get text).
retriever.retrieve("..."): runs only the retrieval step. Returns a list of NodeWithScore objects, each wrapping a Node and exposing .text, .metadata, and a similarity .score.

Persistence Keywords

index.storage_context.persist(persist_dir="./storage"): saves the index to disk. Avoids re-embedding on every restart.
StorageContext.from_defaults(persist_dir="./storage") + load_index_from_storage(...): reloads a previously persisted index.
ChromaVectorStore / PineconeVectorStore / QdrantVectorStore: external vector database backends. Swap-in replacements for the default in-memory store, following the same pattern as LangGraph's checkpointers from Part 2.

The Bridge (LlamaIndex → LangGraph)

@tool + a function body calling query_engine.query(...): the entire integration pattern. LangGraph sees a normal tool; LlamaIndex internals stay fully encapsulated.
Two separate LLM configs: Settings.llm (LlamaIndex's internal model, used only inside the query engine) and your LangGraph llm (the agent's main reasoning model) are independent. They do not share conversation history or state.

Your agent only needs general knowledge, math, or live web search → You don’t need LlamaIndex. Stick with the Part 1 template and standard tools.

Your agent needs to answer questions from a fixed set of documents (policies, manuals, reports) that don’t change often → This article’s Part C pattern (QueryEngine wrapped as a tool) is exactly right.

Your agent needs to cite sources, or combine retrieved facts with multi-turn conversation reasoning → Use the Retriever pattern from Part D instead.

Your knowledge base is large (100,000+ documents), needs frequent updates, or needs to scale to many concurrent users → Use an external vector database (Chroma, Pinecone, Qdrant) instead of the in-memory index, exactly as shown in Part B, Step 5.

You need the retrieval step itself to be a reviewable, pausable action (e.g., compliance requires a human to approve what gets searched) → Combine this article with Part 3’s interrupt() pattern. Wrap the retrieval call inside a node that pauses for approval before querying, just like the tool-approval pattern for any other sensitive tool.
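For the last case, an approval gate around retrieval might be sketched like this. It assumes `interrupt` from langgraph.types and a graph compiled with a checkpointer. The `make_gated_search` helper, the payload shape, and the "approve" convention are hypothetical and may differ from the pattern used in Part 3.

```python
def make_gated_search(search_fn, ask_human=None):
    """Wrap a search function so every query waits for a human decision first."""

    def gated_search(query: str) -> str:
        """Search the knowledge base, but only after a reviewer approves the query."""
        ask = ask_human
        if ask is None:
            # Default: pause the graph run until it is resumed with a decision.
            # Imported lazily so the module imports without langgraph installed.
            from langgraph.types import interrupt

            ask = interrupt

        # The payload is what the reviewer sees; the return value is their answer.
        decision = ask({"action": "search_knowledge_base", "query": query})
        if decision != "approve":
            return "The search was not approved by the reviewer."
        return search_fn(query)

    return gated_search
```

Inside a running graph, the paused run would be resumed with something like `graph.invoke(Command(resume="approve"), config=...)`. The optional `ask_human` argument exists only so the gate can be exercised without a graph.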

The lesson underneath this entire article is one you’ve now seen twice: in Part 4, you learned that a complex multi-agent system is really just several single-agent graphs composed together. Here, you’ve learned that adding real-world knowledge to an agent isn’t a LangGraph problem at all — it’s a data problem, and the right tool for a data problem is a data framework.

LlamaIndex does the indexing and retrieval. LangGraph does the reasoning and orchestration. The entire integration between them is a single @tool-decorated function. That's not a limitation — it's the right amount of coupling. Each framework stays fully in its lane, and you can upgrade either side independently: swap your vector database without touching your graph, or add a supervisor pattern from Part 4 without touching your retrieval logic.

You now have the complete production scaffold across five articles: canonical structure (Part 1), memory management (Part 2), human-in-the-loop safety (Part 3), multi-agent orchestration (Part 4), and real-world knowledge via RAG (Part 5). Combined, these five patterns cover the overwhelming majority of what a serious, production-grade LangGraph application actually needs.


RAG Without the Guesswork: A Standardized LangGraph + LlamaIndex Pattern. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
