The Complete Guide to RAG Strategies: 25 Techniques Every Researcher and Engineer Must Know

A comprehensive guide published in 2026 details 25 retrieval-augmented generation (RAG) strategies, organized into five pipeline layers, to help engineers and researchers move beyond naive RAG implementations that fail in production. The guide provides code examples, use cases, and limitations for each strategy, addressing common issues like low retrieval precision and multi-hop fact retrieval.

Retrieval-Augmented Generation is no longer just a “vector search + LLM” trick. In 2026 it is an entire ecosystem of architectures, retrieval patterns, and reasoning pipelines. Whether you are building production systems or doing research, this guide covers the 25 most important RAG strategies, from the basics to the bleeding edge. No fluff. No theory-only explanations. Every strategy comes with what it is, when to use it, real code solving a real problem, and honest limitations.

The 25 strategies are grouped into five layers that mirror a real RAG pipeline.

Layer 1: Foundational Architectures. The big paradigms, strategies 1 to 4.
Layer 2: Retrieval Strategies. How and when to retrieve, strategies 5 to 12.
Layer 3: Chunking and Indexing. How to prepare your documents, strategies 13 to 17.
Layer 4: Query-Side Strategies. Fixing the query before it hits the retriever, strategies 18 to 21.
Layer 5: Post-Retrieval and Generation Strategies. What happens after retrieval, strategies 22 to 25.

Layer 1: Foundational Architectures

These are the skeleton. Everything else in this guide hangs off one of these four paradigms.

1. Naive RAG

The starting point every engineer knows and most should not stop at.

You have a pile of PDFs. You want users to ask questions about them. You split the documents into chunks, embed them, throw them into a vector database, and at query time you find the top-k nearest chunks and feed them to an LLM. Done in an afternoon. Impressive in a demo. Then it goes to real users. It starts missing obvious answers. It retrieves vaguely related chunks. It confidently answers from the wrong document. This is not a bug in your implementation. It is the ceiling of naive RAG itself.

The problem being solved: You have a company knowledge base with 500 documents and users asking questions about policies, products, and procedures. You need a working system fast.
    from langchain.vectorstores import Chroma
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import RetrievalQA
    from langchain.chat_models import ChatOpenAI

    # Your 500 company policy documents are loaded as `docs`

    # 1. Chunk
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(docs)

    # 2. Embed + store
    vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())

    # 3. Retrieve + generate
    retriever = vectordb.as_retriever(search_kwargs={"k": 4})
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-4o"),
        retriever=retriever,
        return_source_documents=True,
    )
    result = qa({"query": "What is the refund policy for enterprise customers?"})
    print(result["result"])

Works fine for simple queries. Falls apart for anything nuanced.

When to use it: Building a proof of concept. Establishing a baseline before adding complexity. Simple Q&A over small, well-structured document sets.

Limitation: Fixed-size chunking severs sentences mid-thought. A single retrieval pass misses multi-hop facts. There is no way to verify whether the retrieved docs are actually relevant.

2. Advanced RAG

Not one technique. A systematic set of upgrades at every stage of the pipeline.

Advanced RAG is the answer to “naive RAG works in demos but fails in production.” It applies deliberate improvements before retrieval (better indexing, query rewriting), during retrieval (hybrid search, dense embeddings), and after retrieval (reranking, compression). Think of it as engineering discipline applied to each pipeline stage.

The problem being solved: Your naive RAG system is live. Retrieval precision is around 60%. Users are complaining that answers feel vague. You need a systematic way to debug and improve without rebuilding everything.
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain.retrievers import BM25Retriever, EnsembleRetriever
    from langchain.vectorstores import Chroma
    from langchain.embeddings import OpenAIEmbeddings
    import cohere

    # PRE: Semantic chunking instead of fixed-size
    semantic_splitter = SemanticChunker(
        embeddings=OpenAIEmbeddings(),
        breakpoint_threshold_type="percentile",
    )
    chunks = semantic_splitter.split_documents(docs)

    # DURING: Hybrid search combining BM25 and dense retrieval
    bm25 = BM25Retriever.from_documents(chunks, k=10)
    dense = Chroma.from_documents(chunks, OpenAIEmbeddings()).as_retriever(
        search_kwargs={"k": 10}
    )
    hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
    raw_results = hybrid.get_relevant_documents("enterprise refund policy")

    # POST: Rerank the combined results
    co = cohere.Client("your-api-key")
    reranked = co.rerank(
        query="enterprise refund policy",
        documents=[r.page_content for r in raw_results],
        top_n=5,
        model="rerank-english-v3.0",
    )
    final_docs = [raw_results[r.index] for r in reranked.results]

When to use it: When you have identified specific failure modes and want to address them layer by layer rather than throwing everything at the wall.

Limitation: Still a fixed pipeline. Cannot dynamically adapt to different query types or reroute based on what gets retrieved.

3. Modular RAG

Stop thinking of RAG as a pipeline. Start thinking of it as LEGO blocks.

Modular RAG decomposes the system into discrete, swappable components. Retrievers, rerankers, memory modules, and generators are all independent. You compose them differently depending on the use case. The same query router can send legal questions to a graph retriever and support questions to a dense retriever, with both going through the same reranker and generator.

The problem being solved: You are building a product that serves both a medical team and a legal team. Medical queries need a specialized biomedical retriever. Legal queries need a graph-based retriever for case relationships. Both teams use the same frontend.
    class ModularRAG:
        def __init__(self, router, retrievers: dict, reranker=None, generator=None):
            self.router = router          # decides which retriever to use
            self.retrievers = retrievers
            self.reranker = reranker
            self.generator = generator

        def run(self, query: str) -> str:
            domain = self.router.classify(query)
            retriever = self.retrievers.get(domain, self.retrievers["default"])
            docs = retriever.get_relevant_documents(query)
            if self.reranker:
                docs = self.reranker.rerank(query, docs, top_n=5)
            return self.generator.generate(query, docs)

    # Swap retrievers without touching anything else.
    # DomainClassifier, BiomedicalRetriever, GraphRetriever, CohereReranker, and
    # GPT4Generator stand in for your own component implementations.
    rag = ModularRAG(
        router=DomainClassifier(),
        retrievers={
            "medical": BiomedicalRetriever(),
            "legal": GraphRetriever(),
            "default": hybrid_retriever,
        },
        reranker=CohereReranker(),
        generator=GPT4Generator(),
    )
    answer = rag.run("What are the contraindications for warfarin?")

When to use it: Multi-domain systems. Any production system where requirements will evolve and you need to swap components without touching everything else.

Limitation: More complex to design and maintain. Requires well-defined interfaces between modules, or the flexibility becomes a liability.

4. Agentic RAG

What if retrieval was not a step your system takes but a decision the model makes itself?

In all three paradigms above, the retrieval flow is fixed. Agentic RAG hands that control to an LLM. The agent decides when to retrieve, what to search for, whether the results are sufficient, and whether to retrieve again. It uses a ReAct loop: Reason, then Act (retrieve), then Observe the results, then loop back or produce a final answer.

The problem being solved: A user asks: “Compare our Q3 revenue with the industry average from last quarter and flag any red flags.” That requires hitting your internal financial docs, a web search for industry data, and multi-step reasoning. No fixed pipeline handles this.
    from langchain.agents import initialize_agent, AgentType, Tool
    from langchain.chat_models import ChatOpenAI

    # `vectordb` is your existing vector store; `web_search_api` stands in for
    # whatever web search client you use.
    tools = [
        Tool(
            name="InternalDocSearch",
            func=lambda q: vectordb.similarity_search(q, k=4),
            description="Search internal financial documents, reports, and policies",
        ),
        Tool(
            name="WebSearch",
            func=lambda q: web_search_api(q),
            description="Search the internet for current industry data and news",
        ),
    ]

    agent = initialize_agent(
        tools=tools,
        llm=ChatOpenAI(model="gpt-4o", temperature=0),
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        max_iterations=6,  # hard stop to prevent infinite loops
        verbose=True,
    )

    result = agent.run(
        "Compare our Q3 revenue with the industry average from last quarter "
        "and flag any red flags."
    )

When to use it: Complex multi-step questions. Queries that span multiple data sources. Research-style tasks that require exploration before answering.

Limitation: Much higher latency and cost. Can loop indefinitely without a proper stopping condition. Hard to debug when it goes wrong because the failure mode is reasoning, not retrieval.

Layer 2: Retrieval Strategies

This is the layer with the most direct impact on answer quality. Most RAG failures trace back here.

5. Hybrid Search

The single most reliable retrieval upgrade. Almost every production RAG system uses this.

Dense semantic search understands meaning but misses exact tokens like part numbers, names, IDs, and technical acronyms. BM25 keyword search catches those but misses paraphrases and synonyms. Hybrid search runs both in parallel and merges the results with Reciprocal Rank Fusion (RRF). You get the best of both.

The problem being solved: Users at an e-commerce company are searching for products by name (“Nike Air Max 270”), by description (“lightweight running shoe”), and by product ID (“SKU-4821-BK”). A pure semantic search fails on IDs. A pure keyword search fails on descriptions.
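To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is an illustration under stated assumptions, not the code any particular library runs internally: the function name `reciprocal_rank_fusion`, the document IDs, and the two hit lists are hypothetical, and `k=60` is simply a commonly used smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Only ranks are used, so the raw scores of
    the underlying retrievers never need to be normalized.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a keyword retriever and a dense retriever.
bm25_hits = ["sku-4821-bk", "air-max-270", "trail-shoe"]
dense_hits = ["air-max-270", "lightweight-runner", "sku-4821-bk"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear near the top of both lists rise above documents that appear in only one, which is why the fused ranking tends to handle names, descriptions, and IDs with a single retriever.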
    from langchain.retrievers import BM25Retriever, EnsembleRetriever
    from langchain.vectorstores import Chroma
    from langchain.embeddings import OpenAIEmbeddings

    # Build both indexes from the same document set
    bm25_retriever = BM25Retriever.from_documents(docs)
    bm25_retriever.k = 10
    dense_retriever = Chroma.from_documents(docs, OpenAIEmbeddings()).as_retriever(
        search_kwargs={"k": 10}
    )

    # RRF merges the ranked lists without needing score normalization
    hybrid = EnsembleRetriever(
        retrievers=[bm25_retriever, dense_retriever],
        weights=[0.4, 0.6],  # weight dense higher for semantic queries
    )

    # Now this single retriever handles all three query types
    results = hybrid.get_relevant_documents("Nike Air Max 270")
    results = hybrid.get_relevant_documents("lightweight running shoe for long distance")
    results = hybrid.get_relevant_documents("SKU-4821-BK")

When to use it: Almost always. If you only make one upgrade to your RAG system, make it this one.

Limitation: Requires maintaining two separate index types. Adds slight latency from running parallel searches.

6. Iterative Retrieval

One retrieval pass is not always enough. Let each round inform the next.

Rather than a single retrieve-then-generate pass, iterative retrieval performs multiple rounds. The output of each round enriches the query for the next. The system builds up context progressively until it has enough to answer confidently.

The problem being solved: A user asks “Who founded the company that acquired the startup mentioned in our latest board memo?” That requires finding the memo, finding the startup name, then finding the acquisition, then finding the founder. One retrieval pass gets you maybe the first step.
    def iterative_rag(query: str, retriever, llm, max_rounds: int = 3) -> str:
        accumulated_context = ""

        for round_num in range(max_rounds):
            # Each round enriches the search with what we learned so far
            search_query = query
            if accumulated_context:
                search_query = (
                    f"{query}\n\nRelated context already found:\n{accumulated_context}"
                )

            docs = retriever.get_relevant_documents(search_query)
            new_context = "\n\n".join(d.page_content for d in docs)
            accumulated_context += f"\n\n{new_context}"

            # Ask the LLM if it has enough to answer
            check = llm.predict(
                f"Given this context, can you fully answer '{query}'? "
                f"Reply 'yes' or 'no'.\n\nContext: {accumulated_context}"
            )
            if check.strip().lower().startswith("yes"):
                break

        return llm.predict(
            f"Answer this question completely:\n{query}\n\nContext:\n{accumulated_context}"
        )

When to use it: Multi-hop questions where facts from different documents need to be chained together.

Limitation: Multiplies latency and cost by the number of rounds. Needs an explicit stopping condition (here, max_rounds) or it runs forever.

7. Recursive Retrieval

Like drilling into a document. Each result becomes the starting point for a deeper search.

Recursive retrieval builds a tree of retrievals. The top-level search finds the most relevant documents. For each of those, a sub-retrieval goes deeper on the specific section or entity that was found. This is useful when your knowledge base has nested structure, where high-level documents point to detailed annexes.

The problem being solved: A legal team asks questions about compliance regulations. The answer lives in a regulation document, which references a specific clause, which points to an interpretation guideline. No single query surfaces all three.
    def recursive_retrieve(query: str, retriever, max_depth: int = 2,
                           current_depth: int = 0) -> list:
        if current_depth >= max_depth:
            return []

        top_docs = retriever.get_relevant_documents(query)
        all_docs = list(top_docs)

        # Only recurse on the top 2 results to avoid explosion
        for doc in top_docs[:2]:
            # Use the opening of the doc as a rough proxy for its most specific topic
            sub_topic = doc.page_content[:150]
            sub_query = f"More details specifically about: {sub_topic}"
            deeper_docs = recursive_retrieve(
                sub_query, retriever,
                max_depth=max_depth,
                current_depth=current_depth + 1,
            )
            all_docs.extend(deeper_docs)

        # Deduplicate by content
        seen = set()
        unique_docs = []
        for d in all_docs:
            if d.page_content not in seen:
                seen.add(d.page_content)
                unique_docs.append(d)
        return unique_docs

When to use it: Legal documents with cross-references. Technical manuals with appendices. Any corpus where sections point to sub-sections.

Limitation: Depth can explode without strict limits. Can introduce irrelevant tangents if the sub-queries drift from the original intent.

8. Adaptive Retrieval

Smart enough to know when not to retrieve at all.

Not every query needs a vector search. “What is 2 + 2?” does not need to hit your database. “What does our SLA say about uptime guarantees?” absolutely does. Adaptive retrieval classifies query complexity first, then routes the query to the appropriate strategy. Simple queries go straight to the LLM. Complex ones get multi-hop retrieval.

The problem being solved: Your RAG system serves 10,000 queries a day. About 40% are simple greetings, basic definitions, or questions the LLM already knows. Running vector search on all of them wastes money and adds latency to responses that do not need it.
    def adaptive_rag(query: str, retriever, llm) -> str:
        # Step 1: Classify query complexity
        classification = llm.predict(
            "Classify this query into exactly one of: simple, moderate, complex.\n"
            "simple = general knowledge, no specific documents needed\n"
            "moderate = needs one retrieval pass from documents\n"
            "complex = needs multiple retrievals or cross-document reasoning\n\n"
            f"Query: {query}\n\nClassification:"
        ).strip().lower()

        # Step 2: Route to the appropriate strategy
        if "simple" in classification:
            return llm.predict(query)
        elif "moderate" in classification:
            docs = retriever.get_relevant_documents(query)
            context = "\n\n".join(d.page_content for d in docs[:4])
            return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")
        else:
            # iterative_rag is the function defined under Iterative Retrieval
            return iterative_rag(query, retriever, llm, max_rounds=3)

When to use it: High-traffic production systems with mixed query types where retrieval cost adds up.

Limitation: The routing classifier can misclassify. A wrongly classified complex query sent down the simple path gives a confidently wrong answer.

9. GraphRAG

When your knowledge lives in relationships, not paragraphs.

GraphRAG builds an entity-relationship graph from your documents instead of a flat vector index. Querying traverses graph neighborhoods, surfacing connected facts that no single chunk could contain. Microsoft’s open-source implementation also creates hierarchical “community reports” via community detection for high-level summarization queries.

The problem being solved: A financial analyst asks “Which companies in our portfolio have executives who previously worked at firms that were later acquired by competitors?” The answer requires traversing entity relationships across dozens of documents. No vector search gets there.
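The idea of traversing a graph neighborhood can be shown with a toy example. The sketch below is not how Microsoft's library works internally; it assumes a hand-built adjacency dictionary, and the entities, relation names, and the `neighborhood` helper are all hypothetical. It collects every fact reachable within a fixed number of hops from a starting entity, which is the kind of connected context a flat chunk index cannot return in one lookup.

```python
from collections import deque

# Hypothetical entity graph: entity -> list of (relation, neighbor) edges.
GRAPH = {
    "Acme Corp": [("employs", "Jane Doe")],
    "Jane Doe": [("previously_at", "Initech")],
    "Initech": [("acquired_by", "Globex")],
    "Globex": [],
}


def neighborhood(graph, start, max_hops=2):
    """Return (source, relation, target) facts reachable within max_hops of start."""
    facts = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in graph.get(node, []):
            facts.append((node, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts


facts = neighborhood(GRAPH, "Acme Corp", max_hops=3)
for fact in facts:
    print(fact)
```

With three hops, the chain from the portfolio company to the executive, to her previous employer, to that employer's acquirer comes back as one connected set of facts that can be handed to the LLM as context.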
    # Using Microsoft's graphrag library
    # pip install graphrag
    import asyncio
    from graphrag.query.cli import run_global_search, run_local_search

    # After running: python -m graphrag.index --root ./your_data
    # Note: the graphrag query interface has changed between releases, so check
    # these function names and arguments against your installed version.

    # Global search: uses community summaries for big-picture questions
    result = asyncio.run(run_global_search(
        config_dir="./your_data",
        data_dir="./your_data/output",
        root_dir="./your_data",
        community_level=2,
        response_type="multiple paragraphs",
        query="Which portfolio companies have overlapping executive networks?",
    ))

    # Local search: traverses entity neighborhoods for specific facts
    result = asyncio.run(run_local_search(
        config_dir="./your_data",
        data_dir="./your_data/output",
        root_dir="./your_data",
        community_level=2,
        response_type="single paragraph",
        query="Who are the board members of Acme Corp and what other companies are they associated with?",
    ))
    print(result)

When to use it: Medical knowledge graphs. Legal networks. Org charts. Supply chains. Any domain where the answer lives in a relationship rather than a paragraph.

Limitation: Expensive to build and maintain the graph. Entity extraction is imperfect. Not worth it if your domain is mostly independent documents with no relational structure.

10. Corrective RAG (CRAG)

Retrieval is the first draft. Corrective RAG decides if it is good enough to use.

After retrieval, CRAG scores each document’s relevance against the query. If the top results fall below a confidence threshold, it discards them and fires a new search with a reformulated query. Sometimes it falls back to web search entirely. Only then does generation happen.

The problem being solved: You are building a legal research assistant. The user asks about a specific jurisdiction’s statute. Your retriever returns a similar-looking but wrong statute from a different jurisdiction. Without correction, this becomes a confident wrong answer in a high-stakes context.
    def corrective_rag(query: str, retriever, web_search_fn, llm,
                       relevance_threshold: float = 0.7) -> str:
        docs = retriever.get_relevant_documents(query)

        # Score each retrieved document for relevance
        scored_docs = []
        for doc in docs:
            score_str = llm.predict(
                "Score how relevant this document is to the query on a scale of 0.0 to 1.0.\n"
                "Reply with only the number.\n\n"
                f"Query: {query}\n\nDocument: {doc.page_content[:400]}"
            )
            try:
                score = float(score_str.strip())
            except ValueError:
                score = 0.0  # treat unparseable scores as irrelevant
            scored_docs.append((doc, score))

        good_docs = [d for d, s in scored_docs if s >= relevance_threshold]

        # If not enough good docs, fall back to web search.
        # web_search_fn is expected to return objects with a page_content attribute.
        if len(good_docs) < 2:
            web_results = web_search_fn(query)
            good_docs.extend(web_results)

        context = "\n\n".join(d.page_content for d in good_docs[:5])
        return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")

When to use it: Legal research. Medical Q&A. Academic writing. Anywhere a wrong source is worse than no source at all.

Limitation: The relevance evaluator can itself be wrong. Adds latency from the evaluation and potential re-retrieval loop. Can get stuck in correction loops on genuinely ambiguous queries.

11. Self-RAG

The model grades its own retrieval from the inside.

Self-RAG trains an LLM to emit special reflection tokens during generation. The model pauses and asks itself: should I even retrieve for this? Is this passage relevant? Is my answer actually supported by what I retrieved? The reflection happens inside the generation process, not as an external wrapper.

The problem being solved: You want retrieval quality to be the model’s responsibility, not something you orchestrate externally. The model should know when it is guessing versus when it is grounded.

    SELF_RAG_PROMPT = """You are answering a question using retrieved documents.
    Follow this process exactly:

    RETRIEVE_NEEDED: yes/no - Do you need to look this up?
    RELEVANCE: relevant/irrelevant - Is the retrieved document useful?
    ANSWER: Your answer based only on relevant documents.
    SUPPORTED: supported/not supported - Is your answer grounded in the doc?

    Question: {question}

    Retrieved document:
    {document}

    Now follow the process:"""

    def self_rag_answer(question: str, retriever, llm) -> str:
        docs = retriever.get_relevant_documents(question)
        for doc in docs[:3]:
            response = llm.predict(
                SELF_RAG_PROMPT.format(question=question, document=doc.page_content[:600])
            )
            # Read the verdict after the last "SUPPORTED:" label. Checking the
            # prefix keeps "not supported" from counting as "supported".
            verdict = response.lower().split("supported:")[-1].strip()
            if verdict.startswith("supported"):
                return response
        return "Could not find a well-supported answer in the available documents."

When to use it: Research settings. When you can fine-tune. When you want the model to own retrieval quality rather than relying on external scoring logic.

Limitation: Requires a specially fine-tuned model. Not plug-and-play with off-the-shelf GPT-4 or Claude. The prompting simulation above approximates the behavior but does not match the full fine-tuned version.

12. RAG Fusion

Generate four different ways to ask the question, search all four, and merge the best results.

RAG Fusion generates multiple semantically diverse reformulations of the original query, runs parallel vector searches for each, and merges all ranked lists using Reciprocal Rank Fusion (RRF). It surfaces both literal and related knowledge that a single query would miss.

The problem being solved: A product manager asks “how do users feel about the checkout experience?” Your documents use phrases like “cart abandonment,” “payment friction,” “purchase flow,” and “conversion drop.” A single embedding of the original query will not match all of them. RAG Fusion generates all those variants automatically.
    from collections import defaultdict

    def rag_fusion(query: str, vectordb, llm, num_variants: int = 4) -> list:
        # Step 1: Generate semantically diverse query variants
        raw = llm.predict(
            f"Write {num_variants} different ways to search for the answer to this question.\n"
            "Each variant should use different vocabulary and framing.\n"
            "Output one per line, no numbering.\n\n"
            f"Original question: {query}"
        )
        variants = [query] + [
            v.strip() for v in raw.strip().split("\n") if v.strip()
        ][:num_variants]

        # Step 2: Retrieve for each variant independently
        all_ranked_lists = []
        for variant in variants:
            results = vectordb.similarity_search(variant, k=6)
            all_ranked_lists.append(results)

        # Step 3: Merge with Reciprocal Rank Fusion (k=60 is standard)
        rrf_scores = defaultdict(float)
        doc_map = {}
        for ranked_list in all_ranked_lists:
            for rank, doc in enumerate(ranked_list):
                key = doc.page_content[:100]  # use content prefix as key
                rrf_scores[key] += 1.0 / (rank + 60)
                doc_map[key] = doc

        sorted_keys = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
        return [doc_map[k] for k in sorted_keys[:5]]

When to use it: Ambiguous queries. Research questions. Any case where user phrasing diverges significantly from how documents are written.

Limitation: N query variants means N retrieval calls. Can bloat the context window with tangentially related results if the generated queries drift too far.

Layer 3: Chunking and Indexing

The most overlooked layer. Most engineers spend 80% of their time on retrieval and generation and almost none on how documents are prepared. That is backwards. Bad chunking causes retrieval failures that no amount of reranking can fix.

13. Context-Aware Chunking

Fixed-size chunking is the number one source of retrieval failures. Stop using it.

Splitting documents at fixed token counts severs sentences mid-thought, breaks tables in half, and separates headings from their content. Context-aware chunking splits at natural semantic or structural boundaries. Each chunk is a coherent unit of meaning.
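The simplest form of boundary-aware splitting can be sketched without any libraries. The example below is a minimal illustration, assuming that sentence-ending punctuation is an acceptable boundary; the helper name `chunk_by_sentences`, the 80-character budget, and the sample text are hypothetical. It packs whole sentences into chunks and never cuts inside one, which is the property fixed-size splitting lacks.

```python
import re


def chunk_by_sentences(text, max_chars=80):
    """Pack whole sentences into chunks of at most max_chars where possible.

    A sentence longer than max_chars becomes its own oversized chunk
    instead of being cut in half.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk at a sentence end
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical manual excerpt used only for illustration.
manual = (
    "Check the pressure gauge before starting. The maximum load is 450kg. "
    "Only authorized personnel may disable the safety override. "
    "Report faults to the shift lead."
)
chunks = chunk_by_sentences(manual)
for chunk in chunks:
    print(repr(chunk))
```

Semantic and structure-aware splitters apply the same principle with better boundary detection: they choose where to cut based on meaning or document layout rather than a character count.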
The problem being solved: You chunked a technical manual with a 500-token splitter. A chunk ends mid-sentence: “The maximum load capacity is 450kg when the safety override is…” The next chunk starts “…disabled only by authorized personnel.” Both chunks are retrieved separately and neither makes sense alone.

    from langchain_experimental.text_splitter import SemanticChunker
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # BAD: fixed-size splits mid-sentence constantly
    # splitter = CharacterTextSplitter(chunk_size=500)

    # BETTER: splits where embedding distance spikes (semantic boundary)
    semantic_splitter = SemanticChunker(
        embeddings=OpenAIEmbeddings(),
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=95,  # split on the top 5% biggest semantic jumps
    )

    # ALSO GOOD: structure-aware splitting respects markdown/heading boundaries.
    # Heading markers come first; adjust them to the heading depth your documents use.
    structural_splitter = RecursiveCharacterTextSplitter(
        separators=["\n# ", "\n## ", "\n\n", "\n", ". ", " "],
        chunk_size=600,
        chunk_overlap=80,
    )

    chunks = semantic_splitter.split_documents(docs)
    print(f"Created {len(chunks)} semantically coherent chunks")

When to use it: Every serious RAG system. If you are still using CharacterTextSplitter with a fixed chunk size in production, this is your first fix.

Limitation: Semantic chunking requires an embedding pass over all documents at indexing time. The initial index is slower and more expensive to build.

14. RAPTOR

Build a pyramid of summaries over your corpus. Answer from any level.

RAPTOR clusters documents, summarizes each cluster, then recursively clusters and summarizes those summaries, building a tree from raw passages up to a high-level overview of everything. At query time, retrieval can pull from any level. Abstract thematic questions hit the top. Specific fact-finding drills to the leaves.

The problem being solved: You have 3,000 research papers in your database.
A researcher asks “What are the dominant themes in recent work on transformer efficiency?” That is a question about the whole corpus, not a specific paper. Naive retrieval returns random chunks. RAPTOR answers from the summary tree.

    from sklearn.mixture import GaussianMixture
    from langchain.schema import Document
    import numpy as np

    def build_raptor_tree(docs: list, embeddings, llm, levels: int = 2) -> dict:
        tree = {"level_0": docs}
        current_level_docs = docs

        for level in range(1, levels + 1):
            if len(current_level_docs) < 2:
                break  # nothing left to cluster
            print(f"Building RAPTOR level {level}...")

            # Embed all docs at this level
            doc_embeddings = np.array(
                [embeddings.embed_query(d.page_content) for d in current_level_docs]
            )

            # Cluster them
            n_clusters = max(2, len(current_level_docs) // 5)
            gm = GaussianMixture(n_components=n_clusters, random_state=42)
            labels = gm.fit_predict(doc_embeddings)

            # Summarize each cluster into a new document
            summaries = []
            for cluster_id in range(n_clusters):
                cluster_docs = [
                    d for d, l in zip(current_level_docs, labels) if l == cluster_id
                ]
                if not cluster_docs:
                    continue  # a mixture component can end up empty
                combined_text = "\n\n".join(d.page_content for d in cluster_docs[:6])
                summary = llm.predict(
                    "Summarize these documents into a single coherent paragraph:\n\n"
                    f"{combined_text}"
                )
                summaries.append(
                    Document(
                        page_content=summary,
                        metadata={"level": level, "cluster": cluster_id},
                    )
                )

            tree[f"level_{level}"] = summaries
            current_level_docs = summaries

        return tree

When to use it: Long documents. Research paper corpora. Legal document collections. Any corpus where you need both high-level thematic answers and specific passage-level retrieval.

Limitation: Expensive to build. Requires LLM summarization of every cluster at every level. Index rebuild time is much longer than flat indexing.

15. Parent-Child Retrieval

Retrieve small. Return large. Precision at search time, context at generation time.

Index small, granular chunks (children) for high-precision similarity search. When a child chunk is matched, automatically retrieve its full parent section for generation.
Sentence Window Retrieval is a variant where you match at the sentence level and then return a window of surrounding sentences.

The problem being solved: You are building a customer support bot over a 200-page product manual. Small chunks give precise retrieval, but when the answer is spread across a paragraph the LLM only sees half of it. Large chunks give full context, but their embeddings are diluted and retrieval precision drops.

    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import Chroma
    from langchain.embeddings import OpenAIEmbeddings

    # Small chunks for retrieval precision
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

    # Large chunks for generation context
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=100)

    # The store holds full parent documents
    store = InMemoryStore()

    retriever = ParentDocumentRetriever(
        vectorstore=Chroma(embedding_function=OpenAIEmbeddings()),
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    retriever.add_documents(docs, ids=None)

    # Retrieves via small child chunks but returns the full parent section
    results = retriever.get_relevant_documents(
        "What happens if the safety sensor is triggered during operation?"
    )
    print(f"Returned {len(results)} parent sections with full context")

When to use it: Documents with clear hierarchical structure like reports, manuals, and codebases. Anywhere you need both search precision and context richness.

Limitation: Parent chunks can be large and crowd the context window when multiple children from different parents are retrieved simultaneously.

16. Late Chunking

Embed the whole document first. Chunk the embeddings afterward. Context preserved.

Traditional approaches chunk text first then embed each chunk independently, losing all cross-chunk context.
Late chunking runs the full document through the encoder first, preserving long-range context in the token representations, then pools embeddings over chunk boundaries. Each chunk’s embedding carries awareness of the whole document.

The problem being solved: A contract reads “The party referred to in Section 2.1 shall not…” Chunked naively, the chunk containing this sentence has no idea who “the party referred to in Section 2.1” is because that section is in a different chunk. Late chunking embeds both sections together so the reference is preserved.

    # Late chunking is natively supported in Jina Embeddings v3.
    # The HTTP API only needs the requests package: pip install requests
    import requests

    def late_chunk_embed(text: str, chunk_size_chars: int = 512) -> list:
        # Split text into chunks first for boundary definitions
        chunks = [
            text[i:i + chunk_size_chars]
            for i in range(0, len(text), chunk_size_chars)
        ]

        # Call Jina's late chunking endpoint, which encodes the full doc then chunks
        response = requests.post(
            "https://api.jina.ai/v1/embeddings",
            headers={"Authorization": "Bearer YOUR_JINA_KEY"},
            json={
                "model": "jina-embeddings-v3",
                "input": chunks,
                "late_chunking": True,  # this is the key flag
                "task": "retrieval.passage",
            },
        )
        response.raise_for_status()
        embeddings = [item["embedding"] for item in response.json()["data"]]
        return list(zip(chunks, embeddings))

    chunk_embedding_pairs = late_chunk_embed(long_contract_text)

When to use it: Documents where important context spans multiple sections. Contracts, scientific papers, narrative reports. Works best with long-context embedding models.

Limitation: Requires a model that supports long-context encoding at the embedding stage. More memory-intensive at indexing time.

17. Document Summary Index

Index the idea of the document. Retrieve the full document for generation.

Generate a concise LLM summary for each document and index that instead of raw chunks. Retrieval matches at the summary level, which is rich in concepts and free of noise. But the LLM gets the full document text for generation. This separates the finding problem from the reading problem.
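The split between finding and reading can be sketched with a toy index. Everything in this example is an assumption made for illustration: the summaries are hand-written stand-ins for LLM output, word overlap stands in for embedding similarity, and the names `build_summary_index` and `retrieve_full_document` are hypothetical. What it shows is the data flow: the query is matched against summaries, and the full text of the winning document is what gets returned.

```python
import re


def tokenize(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_summary_index(documents, summaries):
    """Map each document ID to (summary tokens, full text)."""
    return {
        doc_id: (tokenize(summaries[doc_id]), full_text)
        for doc_id, full_text in documents.items()
    }


def retrieve_full_document(index, query):
    """Match the query against summaries, then return the full matching document."""
    query_tokens = tokenize(query)
    best_id = max(index, key=lambda doc_id: len(query_tokens & index[doc_id][0]))
    return best_id, index[best_id][1]


# Hypothetical articles and hand-written summaries.
documents = {
    "incident-response": "Page the on-call engineer. Open a ticket. Notify the duty manager.",
    "expenses": "Submit receipts through the finance portal. Attach manager approval.",
}
summaries = {
    "incident-response": "Our approach to incident response and escalation.",
    "expenses": "How to file travel and expense reimbursements.",
}

index = build_summary_index(documents, summaries)
doc_id, full_text = retrieve_full_document(index, "What is our approach to incident response?")
print(doc_id)
print(full_text)
```

The full text never has to resemble the question. Only the summary does, which is why conceptual queries tend to work better against this kind of index than against raw chunks.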
The problem being solved: You have 500 internal knowledge base articles. Users ask conceptual questions like “What is our approach to incident response?” A specific chunk from the incident response article might not embed near that phrasing. But a summary of the article will.

    from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    Settings.llm = OpenAI(model="gpt-4o")
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

    # Load your 500 knowledge base articles
    documents = SimpleDirectoryReader("./knowledge_base").load_data()

    # Build the summary index (one LLM summary per document at indexing time).
    # DocumentSummaryIndex is the LlamaIndex class that stores per-document summaries.
    summary_index = DocumentSummaryIndex.from_documents(documents)

    # Retrieval uses the summaries; the LLM gets the full document
    query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        verbose=True,
    )
    response = query_engine.query(
        "What is our overall approach to incident response and escalation?"
    )
    print(response)

When to use it: Queries that are thematic or conceptual rather than exact. Knowledge bases, policy documents, research collections.

Limitation: Summaries can lose specific details. If the answer is a specific number or a specific clause, the summary may not surface it and retrieval fails silently.

Layer 4: Query-Side Strategies

Real users write terrible queries. “thing broke,” “pricing,” “the document from last week.” These strategies fix the query before it ever touches your retriever.

18. Query Expansion

Users type lazy queries. This adds the missing words before searching.

Before retrieval, expand the query with related terms, synonyms, and rephrasings. This bridges the vocabulary gap between how users phrase questions and how documents are written. A user types “my order is late.” Your document says “shipment delay notification procedures.” Without expansion these never match.

The problem being solved: You are building a search system for an HR platform.
An employee types “can I work from abroad.” Your policy documents use “remote work policy,” “international employment,” “tax implications for overseas work,” and “work permit requirements.” A single embedding of the original query retrieves at most one of these.

    # llm and retriever are assumed to be LangChain objects created earlier
    def expand_query(query: str, llm) -> str:
        expanded = llm.predict(
            f"Expand this search query to improve document retrieval.\n"
            f"Add relevant synonyms, related terms, and alternative phrasings.\n"
            f"Keep it concise. Output only the expanded query, nothing else.\n\n"
            f"Original query: {query}\n\n"
            f"Expanded query:"
        )
        return expanded.strip()

    def step_back_query(query: str, llm) -> str:
        # Step-back prompting: ask the broader question first
        return llm.predict(
            f"What is the broader, more general topic behind this specific question?\n"
            f"Return only the broader question, nothing else.\n\n"
            f"Specific: {query}\n"
            f"Broader:"
        )

    original = "can I work from abroad"

    expanded = expand_query(original, llm)
    # → "remote work abroad international employment overseas work policy
    #    work permit tax implications working from another country"

    broader = step_back_query(original, llm)
    # → "What are the rules and policies for working outside my home country?"

    docs = retriever.get_relevant_documents(expanded)

When to use it: Any system with real users. This is one of the easiest wins in the pipeline, with a very low implementation cost.

Limitation: Poorly expanded queries add noise. An LLM expansion call adds latency. Overly broad expansions can pull in irrelevant documents.

Do not search with your question. Search with a hypothetical answer.

HyDE prompts the LLM to write a fake ideal document that would answer the query. That fabricated document is embedded and used as the search vector instead of the raw question. Since it is written in document-style language, it lands much closer to real documents in embedding space than a short user query would.

The problem being solved: You are building a RAG system over medical literature.
A doctor asks “best treatment protocol for resistant hypertension in diabetic patients.” That short clinical question has a very different embedding than a multi-paragraph clinical guideline that actually answers it. HyDE generates a fake guideline and searches with that instead.

    def hyde_retrieval(query: str, vectordb, llm, k: int = 5) -> list:
        # Step 1: Generate a hypothetical ideal document
        hypothetical_doc = llm.predict(
            f"Write a detailed clinical paragraph that directly answers this question.\n"
            f"Use the same formal language and vocabulary as medical literature.\n"
            f"Do not indicate that this is hypothetical.\n\n"
            f"Question: {query}\n\n"
            f"Clinical answer:"
        )
        print(f"Hypothetical doc preview: {hypothetical_doc[:200]}...")

        # Step 2: Search using the hypothetical document as the query vector
        results = vectordb.similarity_search(hypothetical_doc, k=k)
        return results

    docs = hyde_retrieval(
        "best treatment protocol for resistant hypertension in diabetic patients",
        medical_vectordb,
        llm,
    )

When to use it: Out-of-domain queries. Zero-shot retrieval. Any scenario where user questions and document vocabulary are stylistically very different.

Limitation: The hypothetical document can contain hallucinated facts. Those invented facts then skew the search vector toward documents that happen to echo the same claims. Handle with care in high-stakes domains.

One question. Four search angles. Four times the coverage.

Multi-Query RAG generates multiple semantically distinct reformulations of the original query, runs parallel retrievals for each, deduplicates the combined result set, and passes everything to the LLM. Unlike RAG Fusion, which focuses on merging ranked lists with RRF, Multi-Query focuses on exploring different semantic facets of the question.

The problem being solved: A user asks “how do I cancel my subscription and get a refund?” That single question actually contains two sub-intents.
Multi-Query generates “subscription cancellation process,” “refund policy for cancelled accounts,” “how to end a plan,” and “getting money back after cancelling” and searches all of them.

    # Import paths follow the classic langchain layout; newer releases
    # provide ChatOpenAI in the langchain_openai package.
    from langchain.retrievers.multi_query import MultiQueryRetriever
    from langchain.chat_models import ChatOpenAI

    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
        llm=ChatOpenAI(model="gpt-4o", temperature=0.4),
    )

    # Under the hood this generates several query variants (how many depends
    # on the prompt), runs all of them, and deduplicates the combined result set
    results = multi_query_retriever.get_relevant_documents(
        "how do I cancel my subscription and get a refund?"
    )
    print(f"Retrieved {len(results)} unique documents across all query variants")

    # Example variants searched internally:
    # → "subscription cancellation steps"
    # → "refund policy after cancellation"
    # → "how to terminate an account and receive a refund"
    # → "cancel plan and billing reversal process"

When to use it: Questions where the right framing is not obvious. Questions that contain multiple sub-intents.

Limitation: N query variants means N retrieval calls. It can also bloat the context window when different variants pull in different irrelevant documents, since deduplication only removes exact repeats.

Give every chunk a memory of where it came from.

Before indexing, use an LLM to prepend a short document-aware context summary to each chunk. A chunk that used to say “Revenue declined by 12%” now says “From the Acme Corp Q3 2024 Earnings Report executive summary: Revenue declined by 12%.” The chunk carries its origin as part of its content, dramatically reducing context loss during retrieval.

The problem being solved: You are indexing 200 quarterly earnings reports. Without context, a chunk like “Operating margins improved to 23.4%” could be from any company in any quarter. With contextual retrieval the chunk reads “From Tesla Q2 2024 earnings, operations section: Operating margins improved to 23.4%.” The embedding is far more specific.
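The effect of the prepended context on matching can be seen with a toy similarity measure. The sketch below is a rough illustration under stated assumptions, not a benchmark: `bow`, `cosine`, and `contextualize` are hypothetical helpers, the bag-of-words vector stands in for a real embedding model, and the context sentence is hard-coded where the strategy would have an LLM write it.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def contextualize(chunk: str, context: str) -> str:
    """Prepend a situating context sentence to the chunk before it is embedded."""
    return f"{context}\n\n{chunk}"


chunk = "Operating margins improved to 23.4%."
context = "From Tesla Q2 2024 earnings, operations section:"  # hard-coded for the sketch
query = "Tesla operating margins Q2 2024"

bare_score = cosine(bow(query), bow(chunk))
contextual_score = cosine(bow(query), bow(contextualize(chunk, context)))

# The contextualized chunk shares the company and quarter terms with the query.
print(f"bare chunk: {bare_score:.2f}, contextualized chunk: {contextual_score:.2f}")
```

The chunk text itself is unchanged; only the text that gets embedded grows, which is also why this approach inflates chunk size.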
    # Assumes Document, Chroma, and OpenAIEmbeddings are imported from LangChain,
    # and that raw_chunks and llm were created earlier.
    def add_context_to_chunk(full_doc_text: str, chunk_text: str, llm) -> str:
        context = llm.predict(
            f"Here is a full document:\n"
            f"<document>\n{full_doc_text[:3000]}\n</document>\n\n"
            f"Here is a specific chunk from that document:\n"
            f"<chunk>\n{chunk_text}\n</chunk>\n\n"
            f"Write 1-2 sentences that situate this chunk within the document. "
            f"Be specific: mention the document name, section, and what the chunk is about. "
            f"Output only the context sentences, no preamble."
        )
        return f"{context.strip()}\n\n{chunk_text}"

    # Apply at indexing time across all chunks.
    # Each chunk's metadata is expected to carry the full source text under "source_doc".
    enriched_chunks = []
    for chunk in raw_chunks:
        enriched_content = add_context_to_chunk(
            full_doc_text=chunk.metadata.get("source_doc", ""),
            chunk_text=chunk.page_content,
            llm=llm,
        )
        enriched_chunks.append(
            Document(page_content=enriched_content, metadata=chunk.metadata)
        )

    vectordb = Chroma.from_documents(enriched_chunks, OpenAIEmbeddings())

When to use it: Large document collections where chunks become meaningless without their surrounding context. Financial reports, legal contracts, technical manuals.

Limitation: Requires one LLM call per chunk at indexing time. For millions of chunks this is expensive. It also inflates chunk size, which can affect retrieval precision.

You have retrieved your documents. Now what happens between “here are your chunks” and “here is your answer” determines whether users trust the system.

Retrieve 50. Rerank to 5. The highest-ROI single addition to any RAG pipeline.

After fast bi-encoder retrieval, a cross-encoder model scores each candidate chunk against the original query with full attention rather than just dot-product similarity. The cross-encoder reads both the query and the document together before scoring, making it far more accurate. Retrieve 50 candidates, rerank, pass only the top 5 to the LLM. This consistently adds 5 to 15 percent accuracy on top of hybrid retrieval.

The problem being solved: Your hybrid search retrieves 20 documents. Ranks 1 and 2 are decent.
Rank 7 is actually the perfect answer. The LLM only sees ranks 1 to 5 and gives a mediocre answer. The reranker would have put rank 7 at position 1.

    import cohere
    from sentence_transformers import CrossEncoder

    def rerank_pipeline(
        query: str,
        vectordb,
        top_k_retrieve: int = 50,
        top_k_return: int = 5,
        use_cohere: bool = True,  # switch between Option A and Option B
    ):
        # Step 1: Fast broad retrieval - cast a wide net
        candidates = vectordb.similarity_search(query, k=top_k_retrieve)

        if use_cohere:
            # Option A: Cohere reranker (API-based, high quality)
            co = cohere.Client("your-api-key")
            reranked = co.rerank(
                query=query,
                documents=[c.page_content for c in candidates],
                top_n=top_k_return,
                model="rerank-english-v3.0",
            )
            return [candidates[r.index] for r in reranked.results]

        # Option B: Open-source cross-encoder (free, runs locally)
        model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        pairs = [(query, c.page_content) for c in candidates]
        scores = model.predict(pairs)
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        return [doc for _, doc in ranked[:top_k_return]]

    final_docs = rerank_pipeline(
        "maximum upload file size for enterprise plan", vectordb
    )

When to use it: Always. The cost and accuracy tradeoff almost always favors including a reranker. This and hybrid search are the two changes that will have the biggest immediate impact.

Limitation: Cross-encoders are slower than bi-encoders. They cannot run on the full corpus, only on a pre-retrieved candidate set. Reranking adds 100 to 400ms depending on candidate set size and model.

Do not search everything. Search summaries first, then drill in.

A two-stage retrieval funnel. First search over document or section summaries (fast and coarse). Then retrieve detailed chunks only from the top-ranked documents. This avoids wasting your retrieval budget on irrelevant documents in large corpora.

The problem being solved: You have 10,000 documents. A broad vector search returns 20 results. Half of them are from documents that are completely unrelated to the domain of the query. You are burning context window space on noise.
Hierarchical RAG first identifies the 3 most relevant documents, then retrieves specific chunks only from those.

    def hierarchical_retrieve(
        query: str,
        summary_vectordb,  # index of document summaries
        chunk_vectordb,    # index of fine-grained chunks
        top_documents: int = 3,
        top_chunks: int = 8,
    ) -> list:
        # Stage 1: Find the most relevant documents via their summaries
        relevant_summaries = summary_vectordb.similarity_search(query, k=top_documents)
        shortlisted_doc_ids = [s.metadata["doc_id"] for s in relevant_summaries]
        print(f"Stage 1: shortlisted {len(shortlisted_doc_ids)} documents")

        # Stage 2: Retrieve chunks only from those shortlisted documents.
        # The filter syntax is vector-store specific; adapt it to your store.
        final_chunks = chunk_vectordb.similarity_search(
            query,
            k=top_chunks,
            filter={"doc_id": {"$in": shortlisted_doc_ids}},
        )
        print(f"Stage 2: retrieved {len(final_chunks)} chunks from shortlisted docs")
        return final_chunks

    results = hierarchical_retrieve(
        "What were the key risk factors disclosed in annual reports from 2023?",
        summary_vectordb=doc_summary_db,
        chunk_vectordb=fine_chunk_db,
    )

When to use it: Corpora with thousands of documents where initial retrieval returns too much noise. Anywhere retrieval precision needs to improve at the document level before the chunk level.

Limitation: If the coarse document-level search misses the right document, the second stage never recovers it. Hard failures are harder to debug than soft ones.

After the answer is generated, ask: was it actually grounded in the documents?

After generating a response, a second LLM pass evaluates whether the answer is supported by the retrieved context. If not, it triggers re-retrieval with a refined query or flags the response as low-confidence. This uses the LLM-as-judge pattern without requiring any fine-tuning.

The problem being solved: You are building a compliance assistant. The LLM generates a confident answer that sounds great but is actually a blend of retrieved content and training data hallucination. Without self-reflection you never know.
With it, the system catches that the answer is not fully grounded before returning it.

    def self_reflective_rag(
        query: str, retriever, llm, max_correction_attempts: int = 2
    ) -> dict:
        docs = retriever.get_relevant_documents(query)
        context = "\n\n".join(d.page_content for d in docs[:5])

        for attempt in range(max_correction_attempts + 1):
            # Generate an answer
            answer = llm.predict(
                f"Answer this question using only the provided context.\n"
                f"Context:\n{context}\n\n"
                f"Question: {query}\n\nAnswer:"
            )

            # Self-check: is this answer grounded?
            verdict = llm.predict(
                f"Is every claim in this answer directly supported by the context?\n"
                f"Reply with only 'supported' or 'not supported'.\n\n"
                f"Context:\n{context}\n\n"
                f"Answer: {answer}\n\nVerdict:"
            ).strip().lower()

            # "not supported" also contains "supported", so rule it out explicitly
            if "supported" in verdict and "not supported" not in verdict:
                return {"answer": answer, "grounded": True, "attempts": attempt + 1}

            # Not supported - re-retrieve with a more specific query
            if attempt < max_correction_attempts:
                refined_query = f"{query} - specific evidence and sources"
                docs = retriever.get_relevant_documents(refined_query)
                context = "\n\n".join(d.page_content for d in docs[:5])

        return {
            "answer": answer,
            "grounded": False,
            "attempts": max_correction_attempts + 1,
        }

When to use it: Medical, legal, or financial applications. Anywhere a confident wrong answer is a hard failure, not just an inconvenience.

Limitation: At least doubles the LLM calls per query, and every correction attempt adds two more. The judge LLM can itself be wrong about what is and is not supported.

Generic embeddings fail in specialized domains. Train your own retriever.

Generic embedding models are trained on general web text. Biomedical literature, legal contracts, internal company jargon, and financial filings all have vocabulary that generic models embed poorly. Fine-tuning an embedding model on domain-specific query-document pairs fixes this at the source.

The problem being solved: You are building a RAG system over internal engineering documentation.
Terms like “flaky-ci-timeout,” “shard-rebalance-lag,” and “canary-deployment-rollback” mean nothing to a generic embedding model. It groups them by surface similarity to generic English. A fine-tuned model learns that a query about “canary rollback failures” should retrieve your specific incident postmortems.

    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader

    # Your labeled training pairs: (user query, matching document excerpt)
    training_pairs = [
        InputExample(texts=[
            "canary deployment rollback after high error rate",
            "Canary rollback procedure: when error rate exceeds 2% in canary tier, "
            "trigger immediate rollback via the deployment controller...",
        ]),
        InputExample(texts=[
            "shard rebalance causing query latency spike",
            "Shard rebalance operations temporarily increase p99 query latency by 40-60ms "
            "due to data movement across nodes...",
        ]),
        # ...collect hundreds to thousands of these from your actual query logs
    ]

    # Start from a strong general-purpose base model
    model = SentenceTransformer("BAAI/bge-base-en-v1.5")

    train_dataloader = DataLoader(training_pairs, shuffle=True, batch_size=16)
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
        show_progress_bar=True,
        output_path="./engineering-docs-embeddings-v1",
    )

    # Now use this domain-specific model in your retriever.
    # HuggingFaceEmbeddings is assumed to be imported from LangChain's Hugging Face integration.
    custom_embeddings = HuggingFaceEmbeddings(
        model_name="./engineering-docs-embeddings-v1"
    )

When to use it: You have implemented the other 24 strategies and still see retrieval mismatches on domain-specific terminology. This is the last optimization to reach for, not the first.

Limitation: Requires labeled query-document pairs for training. Expensive to train and retrain when the domain evolves. Total overkill for most use cases.

If you read this whole guide and are wondering where to begin, here is the honest answer.
Do these three things first and you will close 80 percent of the gap between a prototype and a production-grade system.

Fix your chunking (strategy 13). Use semantic or structure-aware splitting instead of fixed-size splitting. This is free to implement and removes the single biggest source of retrieval failure.

Add hybrid search (strategy 5). Combine BM25 with your dense retriever and merge with RRF. One afternoon of work, immediately measurable improvement.

Add a reranker (strategy 22). Retrieve 50 candidates, pass the top 5 to your LLM. Consistently adds 5 to 15 percent accuracy. Cohere has a free tier to start.

After that, add contextual retrieval (strategy 21) and query expansion (strategy 18). Then look at your specific failure modes and pick the strategy that addresses them directly. The last thing you should reach for is fine-tuned embeddings (strategy 25). It is almost never the bottleneck early on.

The RAG landscape moves fast. Agentic architectures, multimodal retrieval, and RL-trained search agents are already changing what this field looks like. Treat this guide as a foundation, not a ceiling.

The Complete Guide to RAG Strategies: 25 Techniques Every Researcher and Engineer Must Know (https://pub.towardsai.net/the-complete-guide-to-rag-strategies-25-techniques-every-researcher-and-engineer-must-know-3459ed4bc036) was originally published in Towards AI (https://pub.towardsai.net) on Medium.