Choosing the Right RAG Strategy: A Complete Decision Guide to Chunking, Agentic RAG, and GraphRAG

Poor performance in Retrieval-Augmented Generation (RAG) systems is typically caused by inadequate document chunking or a mismatched retrieval architecture, not by the embedding model or the LLM. This guide describes chunking as the process of dividing large documents into meaningful, self-contained units (like LEGO pieces) that preserve context and relationships, which is critical for effective retrieval and accurate answers. It covers the main chunking strategies and two advanced architectures, Agentic RAG and GraphRAG, and provides a decision framework to help you select the best combination for your use case.

Introduction

Here is a scenario many RAG builders know well: you wire up a pipeline, load your documents, ask a question, and the answer is wrong, vague, or confidently hallucinated. The information was right there in your knowledge base. So what went wrong?

In most cases the problem is not your embedding model. It is not your LLM. It is how you cut up your documents before storing them (the underappreciated craft called chunking) and whether the retrieval architecture you chose actually matches the complexity of your queries.

This blog walks you through every major chunking strategy, explains how retrieval and augmentation work on top of those chunks, covers two advanced architectures (Agentic RAG and GraphRAG), and, most importantly, gives you a complete decision framework so you can walk away knowing exactly which combination fits your use case.

🐘 The Elephant & The LEGO Pieces

Your document is an elephant: a 200-plus-page legal contract, a dense research paper, a massive product manual, or years of enterprise knowledge. It is large, complex, interconnected, and full of valuable information.

A Large Language Model cannot effectively consume the entire elephant at once because of:

- Context window limitations
- Retrieval precision constraints
- Latency considerations
- Token cost optimization
- Context dilution and retrieval noise

So the elephant must be divided into smaller pieces. But this is where most RAG systems fail. If you cut the elephant randomly, you destroy meaning. Sentences lose context. Ideas become fragmented. Relationships disappear. Retrieval quality collapses.

Good chunking is not about making text smaller. It is about preserving meaning while making retrieval efficient. That is why chunking is better understood as turning the elephant into LEGO pieces.
LEGO pieces are:

- Modular: each piece can stand on its own
- Structured: pieces connect cleanly to related pieces
- Consistent: standardized enough for reliable retrieval
- Meaningful: each piece preserves semantic value
- Composable: you assemble only the pieces needed for the task

Good chunking works the same way. A well-designed chunk should preserve structure, semantics, relationships, and surrounding context while remaining small enough for efficient retrieval and generation.

The real goal of chunking in RAG systems is not simply to split documents or make them smaller. The actual goals are:

- Preserve semantic meaning
- Improve retrieval precision
- Reduce hallucinations
- Optimize context windows
- Improve grounding quality
- Balance latency and cost

In practice, better chunks lead to better retrieval, better prompts, and better answers. The goal is to retrieve:

- the right piece,
- with the right context,
- from the right section,
- at the right time.

That is the foundation of effective Retrieval-Augmented Generation (RAG).

The RAG Pipeline: End to End

Every RAG system, regardless of complexity, follows the same four-stage flow. Understanding each stage makes chunking and architecture decisions obvious rather than arbitrary.

Stage 1: Document

Your raw source material: PDFs, Word files, web pages, transcripts, database exports. It is too large to pass directly to an LLM and needs to be broken into chunks before it can be indexed or searched.

Stage 2: Chunking and Embedding

Documents are cut into units, and each unit is converted into a vector embedding, a numerical representation of its meaning. These embeddings are stored in a vector database and form your searchable index. Your chunking strategy here determines everything that follows.

Stage 3: Retrieval

When a user asks a question, the query is also embedded. The vector database returns the chunks whose embeddings are closest in meaning to the query.
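The retrieval step can be illustrated with a minimal sketch, assuming toy three-dimensional vectors in place of real embeddings. The names cosine_similarity and top_k and the sample index are hypothetical; a vector database applies the same ranking idea at scale, usually with approximate search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are closest to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy index of (chunk text, embedding) pairs; the vectors are made up.
index = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping takes 5 days", [0.1, 0.9, 0.0]),
    ("Returns require a receipt", [0.8, 0.2, 0.1]),
]

query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded user question
results = top_k(query_vec, index, k=2)
```

Here the two refund-related chunks rank above the shipping chunk because their vectors point in nearly the same direction as the query vector.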
The chunks returned by the vector database are your retrieved LEGO pieces.

Stage 4: Augmentation and Generation

The retrieved chunks, along with surrounding parent context, are assembled into a prompt and sent to the LLM. The model generates an accurate, grounded answer from the material it receives.

Core insight: The quality of your answer is bounded by retrieval quality, which is bounded by chunk quality. Better chunks → better retrieval → better answers. Every architectural decision downstream is built on this foundation.

1. Fixed-Size Chunking

The simplest and most widely used strategy. Documents are split into equal-sized blocks by token count, character count, or word count, without regard for meaning, sentence boundaries, or document structure.

LangChain Methods

- CharacterTextSplitter: splits on a single separator (default \n\n), then packs the resulting pieces into chunks of up to chunk_size characters.
- TokenTextSplitter: splits by token count using a tokenizer (e.g. tiktoken for OpenAI models); more accurate for LLM context budgets than character-based splitting.

```python
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter

# Character-based
splitter = CharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=200,  # characters repeated at chunk boundaries
    separator="\n\n",
)

# Token-based
splitter = TokenTextSplitter(
    chunk_size=512,     # max tokens per chunk
    chunk_overlap=50,   # tokens repeated at chunk boundaries
)
```

Overlap guidance: A 10–20% overlap is typical. For chunk_size=1000, set chunk_overlap between 100 and 200. Overlap reduces the risk of a relevant answer being split across two chunks, at the cost of minor redundancy.

Strengths: Simple to implement, fast, predictable, easy to scale.

Weaknesses: Frequently breaks sentences midway, degrading semantic continuity and retrieval quality on complex documents.

Best for: Logs, telemetry, JSON, CSV, and other uniform structured content.

2. Recursive Chunking

Rather than splitting blindly, recursive chunking respects natural document structure.
It works down a priority list of separators (\n\n, then \n, then . / ! / ?, then spaces), moving to a finer separator only when a chunk still exceeds the size limit. This is the recommended default strategy in LangChain for most document types.

LangChain Methods

- RecursiveCharacterTextSplitter: the primary implementation; tries each separator in the list before falling back to the next.
- RecursiveCharacterTextSplitter.from_language(): pre-configured separator lists for specific languages (Python, JS, Markdown, HTML, etc.).

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

# General prose
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", "!", "?", " ", ""],
)

# Language-aware (e.g. Python source code)
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)
```

Overlap guidance: 10–15% overlap works well for most prose. For code, keep overlap low (50–100 characters with the character-based splitter shown above) to avoid duplicating function signatures across chunks.

Strengths: Better semantic retention than fixed-size chunking; good general-purpose strategy; improves retrieval coherence.

Weaknesses: Structure-aware rather than meaning-aware; performance depends on document formatting quality.

Best for: Documentation, PDFs, articles, knowledge bases, and web pages.

3. Semantic Chunking

Instead of asking "how large should the chunk be?", semantic chunking asks "which sentences belong together?" Sentences are converted into vector embeddings, similarity is measured between adjacent sentences, and chunk boundaries are drawn where similarity drops below a threshold, which indicates a topic transition.

LangChain Methods

- SemanticChunker (from langchain_experimental): supports three breakpoint detection strategies: percentile, standard deviation, and interquartile.
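To make the boundary rule concrete, here is a minimal sketch of breakpoint detection, assuming made-up two-dimensional sentence vectors and a fixed distance threshold. The helper names are hypothetical, and the fixed threshold is a simplification: a percentile-style strategy would derive it from the distribution of adjacent-sentence distances instead.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def semantic_chunks(sentences, vectors, threshold):
    """Open a new chunk wherever adjacent sentences are farther apart than threshold."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append([])  # topic shift detected: start a new chunk
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "Cats are small mammals.",
    "They are popular pets.",
    "Bond yields rose today.",
    "Investors expect rate cuts.",
]
# Made-up vectors: the first two point one way, the last two another.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

chunks = semantic_chunks(sentences, vectors, threshold=0.3)
```

The only large distance falls between the second and third sentences, so the text splits into one chunk about cats and one about bond markets.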
A typical SemanticChunker configuration:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,          # top 5% of similarity drops become boundaries
)
```

Overlap guidance: Semantic chunking does not use a fixed chunk_overlap. Boundaries are drawn on meaning, so overlapping would undermine the approach. If continuity is needed at boundaries, consider appending the last sentence of the previous chunk manually.

Strengths: High retrieval relevance; strong semantic continuity; well-suited to precision-sensitive systems.

Weaknesses: Computationally expensive; requires an embedding model at chunking time; similarity thresholds need tuning per dataset.

Best for: Enterprise knowledge systems, research platforms, policy documents, and AI assistants requiring contextual precision.

4. Hierarchical Chunking

Creates two levels of chunks: large parent chunks for context, and smaller child chunks for precision. Retrieval targets the child level to find relevant passages, then expands to the parent level to return surrounding context. This directly addresses the core RAG trade-off: small chunks improve precision, large chunks preserve context.

LangChain Methods

- ParentDocumentRetriever: stores parent chunks in a document store and child chunks in a vector store, then links them at retrieval time.
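The parent-child link can be sketched in plain Python. This is an illustration under stated assumptions, not how any particular library implements it: sentences stand in for child chunks, paragraphs for parents, and a word-overlap score (the hypothetical score function) replaces vector similarity.

```python
def build_index(parents):
    """Map each child sentence to the id of the parent chunk it came from."""
    children = []
    for parent_id, parent in enumerate(parents):
        for sentence in parent.split(". "):
            children.append((sentence.strip(". "), parent_id))
    return children

def score(query, text):
    """Toy relevance: count of shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parent(query, parents, children):
    """Match against small child chunks, then return the larger parent for context."""
    best_child, parent_id = max(children, key=lambda c: score(query, c[0]))
    return best_child, parents[parent_id]

parents = [
    "The warranty lasts two years. It covers manufacturing defects. Water damage is excluded.",
    "Shipping is free over fifty dollars. Delivery takes five days.",
]
children = build_index(parents)
child, parent = retrieve_parent("is water damage covered", parents, children)
```

The query matches the short sentence about water damage, but the caller receives the whole warranty paragraph, so the LLM also sees what the warranty does cover.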
Wiring the two splitters into a ParentDocumentRetriever:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Assumed embedding model; the original snippet left `embeddings` undefined.
embeddings = OpenAIEmbeddings()

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # large context chunks
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # precise retrieval chunks

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=embeddings),
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```

Overlap guidance: Apply overlap only on the child splitter (typically 10–15%). Parent chunks are retrieved wholesale for context, so overlap there adds noise rather than value.

Strengths: Strong retrieval precision without sacrificing context; effective for long documents.

Weaknesses: More complex to index and retrieve; requires additional storage and orchestration.

Best for: Legal documents, technical manuals, books, enterprise documentation, and compliance systems.

5. Structure and Metadata-Aware Chunking

Uses the document's own structure (titles, headers, sections, tables, and page layout) as natural chunk boundaries rather than treating the document as plain text. This is especially important for enterprise PDFs and structured reports, where layout carries semantic meaning that arbitrary splits would destroy.

LangChain Methods

- MarkdownHeaderTextSplitter: splits on Markdown heading levels and attaches header text as metadata to each chunk.
- HTMLHeaderTextSplitter: same pattern for HTML documents, splitting on <h1> through <h4> tags.
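The idea of header-bounded chunks with attached metadata can be sketched without any library. The function split_by_headers below is a hypothetical, simplified illustration that handles only the first two Markdown heading levels; real splitters handle more levels and edge cases.

```python
def split_by_headers(markdown_text):
    """Split Markdown on # and ## headings, attaching the active headings as metadata."""
    chunks, current, meta = [], [], {}

    def flush():
        # Emit the text collected so far under the headings currently in effect.
        body = "\n".join(current).strip()
        if body:
            chunks.append({"content": body, "metadata": dict(meta)})
        current.clear()

    for line in markdown_text.splitlines():
        if line.startswith("## "):
            flush()
            meta["h2"] = line[3:].strip()
        elif line.startswith("# "):
            flush()
            meta.clear()  # a new top-level section resets nested headings
            meta["h1"] = line[2:].strip()
        else:
            current.append(line)
    flush()
    return chunks

doc = """# Billing
Invoices are sent monthly.
## Refunds
Refunds take 30 days.
# Support
Email us any time."""

chunks = split_by_headers(doc)
```

Each chunk keeps its place in the document hierarchy: the refund sentence carries both the Billing and Refunds headings, which a retriever can later use for filtering or for rebuilding context.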
Both splitters take a list of (header, metadata key) pairs:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter, HTMLHeaderTextSplitter

# Markdown
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
# `markdown_text` is assumed to hold the Markdown document as a string.
chunks = md_splitter.split_text(markdown_text)
# Each chunk carries metadata: {"h1": "Section Title", "h2": "Subsection"}

# HTML
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "h1"), ("h2", "h2")],
)
```

Overlap guidance: These splitters produce structurally bounded chunks rather than size-bounded ones. If the resulting chunks are still too large, pipe the output into a RecursiveCharacterTextSplitter with a modest overlap (100–150 characters) as a second pass.

Strengths: Preserves layout semantics; keeps tables intact; improves retrieval quality for structured enterprise documents.

Weaknesses: Requires a capable document parser; parser quality directly limits performance.

Best for: Financial reports, compliance documents, technical PDFs, medical documentation, and enterprise records.

6. Hybrid Chunking

Applies different chunking strategies based on content type within the same corpus: fixed-size for logs, recursive for documentation, semantic for research papers, structure-aware for Markdown or HTML.

LangChain does not have a dedicated hybrid splitter. Hybrid pipelines are composed manually using the building blocks above.
```python
from langchain.text_splitter import (
    TokenTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Assumed embedding model; the original snippet left `embeddings` undefined.
embeddings = OpenAIEmbeddings()

def hybrid_chunk(doc):
    """Route a single Document to a splitter based on its metadata "type"."""
    content_type = doc.metadata.get("type")
    if content_type == "log":
        # split_documents expects a list of Documents, hence [doc]
        return TokenTextSplitter(chunk_size=512, chunk_overlap=0).split_documents([doc])
    elif content_type == "markdown":
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2")]
        ).split_text(doc.page_content)
    elif content_type == "research":
        return SemanticChunker(
            embeddings=embeddings, breakpoint_threshold_type="percentile"
        ).split_documents([doc])
    else:
        return RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=150
        ).split_documents([doc])
```

Overlap guidance: Set overlap per strategy based on content type. Logs and structured data: zero or minimal overlap. Prose and documentation: 10–15%. Code: 5–10%.

Strengths: Flexible and adaptable; better performance across mixed-content corpora.

Weaknesses: Higher engineering complexity; harder to evaluate and tune consistently.

Best for: Enterprise AI platforms, large mixed-content corpora, knowledge management systems, and multi-source RAG pipelines.

7. Agentic Chunking

An emerging approach where an LLM dynamically determines what information belongs together, how chunks should be formed, and how retrieval should adapt to user intent. This transforms chunking from static preprocessing into query-aware reasoning at inference time.

LangChain supports this through its agent and chain abstractions rather than a dedicated splitter class.

```python
import json

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = PromptTemplate.from_template("""
You are a document analyst. Split the following text into coherent topical sections.
Return ONLY a JSON list of objects, each with a "title" and "content" key.

Text: {text}
""")

chain = LLMChain(llm=llm, prompt=prompt)

def agentic_chunk(text):
    """Ask the LLM for topical sections and parse its JSON reply."""
    result = chain.run(text=text)
    # json.loads raises if the model returns anything other than bare JSON.
    return json.loads(result)
```

Overlap guidance: Not applicable in the traditional sense, because the LLM determines boundaries based on meaning. To preserve continuity between sections, include a brief summary of the prior section in the prompt context.

Strengths: Highly adaptive; strong semantic preservation; query-aware retrieval.

Weaknesses: Higher compute cost and latency; requires orchestration and guardrails; not yet widely proven in production at scale.

Best for: AI copilots, multi-agent systems, research assistants, and enterprise reasoning workflows.

8. Agentic RAG

Not to be confused with Agentic Chunking (section 7). Agentic Chunking is about how documents are split at index time. Agentic RAG is about how an LLM decides what to retrieve at query time, and whether what it found is good enough to answer with.

Standard RAG pipelines are static: a query comes in, a fixed retrieval step runs, the top-k chunks are passed to the LLM, and an answer comes out. Agentic RAG breaks that linearity. An LLM agent decides when to retrieve, what to search for, whether the results are sufficient, and whether to re-query with a refined question before generating an answer.

Common patterns built on this idea include Corrective RAG (CRAG), which scores retrieved documents for relevance and falls back to a web search if they are poor, and Self-RAG, where the LLM reflects on its own output and decides whether it needs to retrieve again.

LangChain Methods

- create_retriever_tool: wraps any retriever as a tool an agent can call on demand.
- AgentExecutor: the classic LangChain agent loop; the agent decides which tools to call and when.
- LangGraph: the recommended approach for production Agentic RAG; models retrieval as a stateful graph of nodes (retrieve → grade → rewrite → retrieve again) with explicit conditional edges.
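The retrieve → grade → rewrite loop can be sketched without any framework. This is a minimal illustration in which every callable (retrieve, grade, rewrite, generate) is a hypothetical stand-in for a retriever or an LLM call; the rewrite cap is the piece that keeps the loop from running forever.

```python
def corrective_rag(question, retrieve, grade, rewrite, generate, max_rewrites=2):
    """Retrieve, keep only relevant docs, and rewrite the query when none survive."""
    rewrites = 0
    while True:
        docs = [d for d in retrieve(question) if grade(question, d)]
        if docs or rewrites >= max_rewrites:
            return generate(question, docs), rewrites
        question = rewrite(question)  # refine the query and try again
        rewrites += 1

# Hypothetical stand-ins for the retriever and the LLM calls.
KNOWLEDGE = {"graphrag risks": ["GraphRAG indexing is costly."]}

def retrieve(q):
    return KNOWLEDGE.get(q, ["Unrelated text."])

def grade(q, doc):
    return "GraphRAG" in doc

def rewrite(q):
    return "graphrag risks"

def generate(q, docs):
    return docs[0] if docs else "I don't know."

answer, rewrites = corrective_rag(
    "what could go wrong?", retrieve, grade, rewrite, generate
)
```

The first retrieval returns nothing relevant, so the question is rewritten once and the second pass succeeds. If every pass failed, the cap would stop the loop after two rewrites and generation would fall back to "I don't know."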
The Corrective RAG pattern as a LangGraph workflow:

```python
from typing import List, TypedDict

from langchain.tools.retriever import create_retriever_tool
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# `vector_store` is assumed to be an existing, populated vector store.
# Wrap the retriever as a tool (for tool-calling agents; the graph below
# calls the vector store directly).
retriever_tool = create_retriever_tool(
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    name="search_documents",
    description="Search the knowledge base for relevant information.",
)

# --- LangGraph: Corrective RAG pattern ---
class AgentState(TypedDict):
    question: str
    documents: List[Document]
    generation: str
    rewrite_count: int

def retrieve(state: AgentState):
    docs = vector_store.similarity_search(state["question"], k=5)
    return {"documents": docs}

def grade_documents(state: AgentState):
    # The LLM scores each doc for relevance; poor ones are filtered out.
    relevant = []
    for doc in state["documents"]:
        prompt = (
            f"Is this document relevant to the question '{state['question']}'? "
            f"Answer yes or no.\n\n{doc.page_content}"
        )
        if "yes" in llm.invoke(prompt).content.lower():
            relevant.append(doc)
    return {"documents": relevant}

def rewrite_query(state: AgentState):
    # If docs were poor, rewrite the question before re-retrieving.
    rewritten = llm.invoke(
        f"Rewrite this question to improve retrieval: {state['question']}"
    ).content
    return {"question": rewritten, "rewrite_count": state["rewrite_count"] + 1}

def generate(state: AgentState):
    context = "\n\n".join(d.page_content for d in state["documents"])
    answer = llm.invoke(
        f"Answer using this context:\n{context}\n\nQuestion: {state['question']}"
    ).content
    return {"generation": answer}

def should_rewrite(state: AgentState):
    # Cap rewrites at 2 so the graph cannot loop indefinitely.
    if len(state["documents"]) == 0 and state["rewrite_count"] < 2:
        return "rewrite"
    return "generate"

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges(
    "grade", should_rewrite, {"rewrite": "rewrite", "generate": "generate"}
)
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", END)

app = workflow.compile()
result = app.invoke({"question": "What are the risks of GraphRAG?", "rewrite_count": 0})
```

Overlap guidance: Overlap is set on the underlying retriever's chunking strategy, not on the agent itself. The agent layer operates above chunking. Use whatever overlap matches the chunking strategy feeding the vector store (typically 10–15% for recursive or fixed-size chunks).

Strengths: Handles multi-step and ambiguous queries that single-pass retrieval fails on; self-corrects when initial retrieval is poor; can combine multiple retrieval sources (vector DB, web search, SQL) in one query cycle.

Weaknesses: Higher latency per query due to multiple LLM calls; harder to debug than a linear pipeline; requires careful graph design to avoid infinite retrieval loops.

Best for: Complex Q&A systems, enterprise copilots where queries are open-ended, research assistants, and any pipeline where retrieval quality is highly variable.

9. GraphRAG

GraphRAG, originally developed by Microsoft Research, moves beyond treating documents as flat text sequences. Instead of chunking text into linear passages, it extracts entities and relationships from documents and stores them as a knowledge graph. Retrieval then traverses the graph to answer questions that require connecting information across multiple sources or document sections, something vector search alone handles poorly.

There are two primary retrieval modes: local search, which answers specific entity-level questions by traversing nearby graph nodes, and global search, which synthesizes themes across the entire corpus using community summaries generated at indexing time.
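The traversal behind local search can be illustrated with a small in-memory graph. This is a simplified sketch, not the GraphRAG implementation: the triples are hypothetical examples of what an extraction step might produce, and local_search is a plain breadth-first walk limited to a fixed number of hops.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples.
triples = [
    ("Alice", "works_on", "Project Apollo"),
    ("Bob", "works_on", "Project Apollo"),
    ("Project Apollo", "uses", "Neo4j"),
    ("Carol", "works_on", "Project Zeus"),
]

def build_graph(triples):
    """Undirected adjacency map: entity -> set of (relation, neighbor)."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, set()).add((relation, obj))
        graph.setdefault(obj, set()).add((relation, subject))
    return graph

def local_search(graph, entity, max_hops=2):
    """Breadth-first traversal returning {entity: hop distance} within max_hops."""
    seen, queue = {entity: 0}, deque([entity])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _, neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(entity)  # exclude the starting entity itself
    return seen

graph = build_graph(triples)
nearby = local_search(graph, "Alice", max_hops=2)
```

Starting from Alice, the walk reaches her project in one hop and her collaborator Bob and the Neo4j dependency in two, while Carol's unrelated project is never visited. Connecting Alice to Bob this way is the kind of multi-hop link that a similarity search over isolated chunks tends to miss.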
LangChain Methods

LangChain integrates with graph databases (Neo4j, Amazon Neptune, ArangoDB) and provides tooling to build graph-based RAG pipelines.

- LLMGraphTransformer: uses an LLM to extract entities and relationships from text and convert them into graph documents.
- Neo4jGraph + GraphCypherQAChain: store the graph in Neo4j and query it in natural language via generated Cypher queries.
- Neo4jVector: hybrid approach that combines vector similarity search with graph traversal on a Neo4j backend.

```python
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Step 1: Extract entities and relationships from chunks.
# `documents` is assumed to be a list of already-chunked Document objects.
transformer = LLMGraphTransformer(llm=llm)
graph_docs = transformer.convert_to_graph_documents(documents)

# Step 2: Store in Neo4j
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
graph.add_graph_documents(graph_docs)

# Step 3: Query the graph in natural language.
# Recent releases may also require allow_dangerous_requests=True here.
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    return_intermediate_steps=True,
)
response = chain.invoke({"query": "Which authors collaborated with researchers at MIT?"})
```

For hybrid vector + graph retrieval:

```python
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

# Store chunks as vectors alongside the graph
vector_store = Neo4jVector.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
    index_name="document_chunks",
    node_label="Chunk",
    embedding_node_property="embedding",
)
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
```

Overlap guidance: GraphRAG does not rely on chunk overlap for continuity; relationships between entities bridge that gap structurally.
When pre-chunking documents before graph extraction, use a RecursiveCharacterTextSplitter with modest overlap (100–150 characters) so that entity mentions near chunk boundaries appear whole in at least one chunk before the LLM extracts them.

Strengths: Excels at multi-hop reasoning (e.g. "find all projects involving X that also relate to Y"); surfaces cross-document relationships invisible to vector search; global search enables corpus-wide thematic synthesis.

Weaknesses: Significantly higher indexing cost and complexity; graph quality depends on LLM extraction accuracy; Cypher query generation can be brittle on complex schemas; not well-suited to simple factual lookups where vector search is faster and cheaper.

Best for: Knowledge graphs, research corpora, compliance and regulatory systems, enterprise wikis with dense cross-references, and any domain where answering questions requires connecting facts across multiple documents.

The Core Trade-Off

A common misconception is that smaller chunks always improve retrieval. In practice, chunks that are too small lose context, fragment meaning, and can increase hallucinations. Chunking is a balancing act across four competing factors.

There is no universally optimal strategy. The right choice depends on your data characteristics, query patterns, retrieval architecture, and business requirements.

Quick Reference

(table omitted)

Final Thoughts

The strongest production RAG systems rarely rely on a single chunking strategy. A robust architecture typically combines:

- Recursive chunking for general prose
- Semantic chunking for precision-sensitive content
- Hierarchical retrieval for long or dense documents
- Structure-aware parsing for enterprise PDFs
- Hybrid orchestration where content types vary

As enterprise AI matures, retrieval architecture is becoming just as important as model selection. And intelligent retrieval begins with intelligent chunking.