Source: pub.towardsai.net

The Complete Guide to RAG Strategies: 25 Techniques Every Researcher and Engineer Must Know

A comprehensive guide published in 2026 details 25 retrieval-augmented generation (RAG) strategies, organized into five pipeline layers, to help engineers and researchers move beyond naive RAG implementations that fail in production. The guide provides code examples, use cases, and limitations for each strategy, addressing common issues like low retrieval precision and multi-hop fact retrieval.

33 min read · Published Jun 15, 2026

Retrieval-Augmented Generation is no longer just a “vector search + LLM” trick. In 2026 it is an entire ecosystem of architectures, retrieval patterns, and reasoning pipelines. Whether you are building production systems or doing research, this guide covers the 25 most important RAG strategies from the basics to the bleeding edge.

No fluff. No theory-only explanations. Every strategy comes with what it is, when to use it, real code solving a real problem, and honest limitations.

The 25 strategies are grouped into five layers that mirror a real RAG pipeline.

Layer 1: Foundational Architectures (the big paradigms, strategies 1 to 4)

Layer 2: Retrieval Strategies (how and when to retrieve, strategies 5 to 12)

Layer 3: Chunking and Indexing (how to prepare your documents, strategies 13 to 17)

Layer 4: Query-Side Strategies (fixing the query before it hits the retriever, strategies 18 to 21)

Layer 5: Post-Retrieval and Generation Strategies (what happens after retrieval, strategies 22 to 25)

Layer 1: Foundational Architectures

These are the skeleton. Everything else in this guide hangs off one of these four paradigms.

1. Naive RAG

The starting point every engineer knows and most should not stop at.

You have a pile of PDFs. You want users to ask questions about them. You split the documents into chunks, embed them, throw them into a vector database, and at query time you find the top-k nearest chunks and feed them to an LLM. Done in an afternoon. Impressive in a demo.

Then it goes to real users.

It starts missing obvious answers. It retrieves vaguely related chunks. It confidently answers from the wrong document. This is not a bug in your implementation. It is the ceiling of naive RAG itself.

The problem being solved: You have a company knowledge base with 500 documents and users asking questions about policies, products, and procedures. You need a working system fast.

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Your 500 company policy documents are loaded as `docs`

# 1. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(docs)

# 2. Embed + store
vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 3. Retrieve + generate
retriever = vectordb.as_retriever(search_kwargs={"k": 4})
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=retriever,
    return_source_documents=True
)
result = qa("What is the refund policy for enterprise customers?")
print(result["result"])
# Works fine for simple queries. Falls apart for anything nuanced.

When to use it: Building a proof of concept. Establishing a baseline before adding complexity. Simple Q&A over small, well-structured document sets.

Limitation: Fixed-size chunking severs sentences mid-thought. A single retrieval pass misses multi-hop facts. There is no way to verify if retrieved docs are actually relevant.

2. Advanced RAG

Not one technique. A systematic set of upgrades at every stage of the pipeline.

Advanced RAG is the answer to “naive RAG works in demos but fails in production.” It applies deliberate improvements before retrieval (better indexing, query rewriting), during retrieval (hybrid search, dense embeddings), and after retrieval (reranking, compression). Think of it as engineering discipline applied to each pipeline stage.

The problem being solved: Your naive RAG system is live. Retrieval precision is around 60%. Users are complaining that answers feel vague. You need a systematic way to debug and improve without rebuilding everything.

from langchain_experimental.text_splitter import SemanticChunker
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
import cohere

# PRE: Semantic chunking instead of fixed-size
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
chunks = semantic_splitter.split_documents(docs)

# DURING: Hybrid search combining BM25 and dense retrieval
bm25 = BM25Retriever.from_documents(chunks, k=10)
# The vector store retriever takes its top-k through search_kwargs
dense = Chroma.from_documents(chunks, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 10}
)
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
raw_results = hybrid.get_relevant_documents("enterprise refund policy")

# POST: Rerank the combined results
co = cohere.Client("your-api-key")
reranked = co.rerank(
    query="enterprise refund policy",
    documents=[r.page_content for r in raw_results],
    top_n=5,
    model="rerank-english-v3.0"
)
# Each rerank result carries the index of the document it refers to
final_docs = [raw_results[r.index] for r in reranked.results]

When to use it: When you have identified specific failure modes and want to address them layer by layer rather than throwing everything at the wall.

Limitation: Still a fixed pipeline. Cannot dynamically adapt to different query types or reroute based on what gets retrieved.

3. Modular RAG

Stop thinking of RAG as a pipeline. Start thinking of it as LEGO blocks.

Modular RAG decomposes the system into discrete swappable components. Retrievers, rerankers, memory modules, and generators are all independent. You compose them differently depending on the use case. The same query router can send legal questions to a graph retriever and support questions to a dense retriever, with both going through the same reranker and generator.

The problem being solved: You are building a product that serves both a medical team and a legal team. Medical queries need a specialized biomedical retriever. Legal queries need a graph-based retriever for case relationships. Both teams use the same frontend.

class ModularRAG:
    def __init__(self, router, retrievers: dict, reranker=None, generator=None):
        self.router = router        # decides which retriever to use
        self.retrievers = retrievers
        self.reranker = reranker
        self.generator = generator

    def run(self, query: str) -> str:
        domain = self.router.classify(query)
        # Unknown domains fall back to the "default" retriever
        retriever = self.retrievers.get(domain, self.retrievers["default"])
        docs = retriever.get_relevant_documents(query)
        if self.reranker:
            docs = self.reranker.rerank(query, docs, top_n=5)
        return self.generator.generate(query, docs)

# Swap retrievers without touching anything else.
# DomainClassifier, BiomedicalRetriever, GraphRetriever, CohereReranker and
# GPT4Generator stand for your own components; hybrid_retriever is any
# retriever you have already built.
rag = ModularRAG(
    router=DomainClassifier(),
    retrievers={
        "medical": BiomedicalRetriever(),
        "legal": GraphRetriever(),
        "default": hybrid_retriever
    },
    reranker=CohereReranker(),
    generator=GPT4Generator()
)
answer = rag.run("What are the contraindications for warfarin?")

When to use it: Multi-domain systems. Any production system where requirements will evolve and you need to swap components without touching everything else.

Limitation: More complex to design and maintain. Requires well-defined interfaces between modules or the flexibility becomes a liability.

4. Agentic RAG

What if retrieval was not a step your system takes but a decision the model makes itself?

In all three paradigms above, the retrieval flow is fixed. Agentic RAG hands that control to an LLM. The agent decides when to retrieve, what to search for, whether the results are sufficient, and whether to retrieve again. It uses a ReAct loop: Reason, then Act (retrieve), then Observe the results, then loop back or produce a final answer.

The problem being solved: A user asks: “Compare our Q3 revenue with the industry average from last quarter and flag any red flags.” That requires hitting your internal financial docs, a web search for industry data, and multi-step reasoning. No fixed pipeline handles this.

from langchain.agents import initialize_agent, AgentType, Tool
from langchain.chat_models import ChatOpenAI

# `vectordb` is the vector store built earlier; `web_search_api` stands for
# whatever web search client you use.
tools = [
    Tool(
        name="InternalDocSearch",
        func=lambda q: vectordb.similarity_search(q, k=4),
        description="Search internal financial documents, reports, and policies"
    ),
    Tool(
        name="WebSearch",
        func=lambda q: web_search_api(q),
        description="Search the internet for current industry data and news"
    ),
]

agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=6,       # hard stop to prevent infinite loops
    verbose=True
)
result = agent.run(
    "Compare our Q3 revenue with the industry average from last quarter "
    "and flag any red flags."
)

When to use it: Complex multi-step questions. Queries that span multiple data sources. Research-style tasks that require exploration before answering.

Limitation: Much higher latency and cost. Can loop indefinitely without a proper stopping condition. Hard to debug when it goes wrong because the failure mode is reasoning, not retrieval.

Layer 2: Retrieval Strategies

This is the layer with the most direct impact on answer quality. Most RAG failures trace back here.

5. Hybrid Search

The single most reliable retrieval upgrade. Almost every production RAG system uses this.

Dense semantic search understands meaning but misses exact tokens like part numbers, names, IDs, and technical acronyms. BM25 keyword search catches those but misses paraphrases and synonyms. Hybrid search runs both in parallel and merges results with Reciprocal Rank Fusion (RRF). You get the best of both.

The problem being solved: Users at an e-commerce company are searching for products by name (“Nike Air Max 270”), by description (“lightweight running shoe”), and by product ID (“SKU-4821-BK”). A pure semantic search fails on IDs. A pure keyword search fails on descriptions.

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Build both indexes from the same document set
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10
dense_retriever = Chroma.from_documents(
    docs, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 10})

# RRF merges the ranked lists without needing score normalization
hybrid = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6]   # weight dense higher for semantic queries
)

# Now this single retriever handles all three query types
results = hybrid.get_relevant_documents("Nike Air Max 270")
results = hybrid.get_relevant_documents("lightweight running shoe for long distance")
results = hybrid.get_relevant_documents("SKU-4821-BK")
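To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. The document IDs and the two ranked lists are made up for illustration, `reciprocal_rank_fusion` is a hypothetical helper rather than a LangChain function, and k=60 is the smoothing constant commonly used with RRF. It is not a copy of what `EnsembleRetriever` does internally, only the underlying idea.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, so BM25 scores and cosine
    similarities never have to be put on the same scale.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for the query "SKU-4821-BK"
bm25_hits = ["sku_4821_bk", "sku_4821_wh", "air_max_270"]
dense_hits = ["air_max_270", "sku_4821_bk", "running_shoe_guide"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

A document that ranks well in both lists beats one that tops only a single list, which is why the exact-match SKU page comes out ahead here.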

When to use it: Almost always. If you only make one upgrade to your RAG system make it this one.

Limitation: Requires maintaining two separate index types. Adds slight latency from running parallel searches.

6. Iterative Retrieval

One retrieval pass is not always enough. Let each round inform the next.

Rather than a single retrieve-then-generate pass, iterative retrieval performs multiple rounds. The output of each round enriches the query for the next. The system builds up context progressively until it has enough to answer confidently.

The problem being solved: A user asks “Who founded the company that acquired the startup mentioned in our latest board memo?” That requires finding the memo, finding the startup name, then finding the acquisition, then finding the founder. One retrieval pass gets you maybe the first step.

def iterative_rag(query: str, retriever, llm, max_rounds: int = 3) -> str:
    accumulated_context = ""
    for round_num in range(max_rounds):
        # Each round enriches the search with what we learned so far
        search_query = query
        if accumulated_context:
            search_query = (
                f"{query}\n\nRelated context already found:\n{accumulated_context}"
            )
        docs = retriever.get_relevant_documents(search_query)
        new_context = "\n\n".join(d.page_content for d in docs)
        accumulated_context += f"\n\n{new_context}"

        # Ask the LLM whether everything gathered so far is enough to answer
        check = llm.predict(
            f"Given this context, can you fully answer '{query}'? "
            f"Reply 'yes' or 'no'.\n\nContext: {accumulated_context}"
        )
        # Accept replies such as "Yes." as well as a bare "yes"
        if check.strip().lower().startswith("yes"):
            break

    return llm.predict(
        f"Answer this question completely:\n{query}\n\nContext:\n{accumulated_context}"
    )

When to use it: Multi-hop questions where facts from different documents need to be chained together.

Limitation: Multiplies latency and cost by the number of rounds. Needs an explicit stopping condition or it runs forever.

7. Recursive Retrieval

Like drilling into a document. Each result becomes the starting point for a deeper search.

Recursive retrieval builds a tree of retrievals. The top-level search finds the most relevant documents. For each of those, a sub-retrieval goes deeper on the specific section or entity that was found. Useful when your knowledge base has nested structure where high-level documents point to detailed annexes.

The problem being solved: A legal team asks questions about compliance regulations. The answer lives in a regulation document, which references a specific clause, which points to an interpretation guideline. No single query surfaces all three.

def recursive_retrieve(
    query: str,
    retriever,
    max_depth: int = 2,
    current_depth: int = 0
) -> list:
    if current_depth >= max_depth:
        return []

    top_docs = retriever.get_relevant_documents(query)
    all_docs = list(top_docs)

    # Only recurse on the top 2 results to avoid explosion
    for doc in top_docs[:2]:
        # Use the opening of the doc as a rough stand-in for its main topic
        sub_topic = doc.page_content[:150]
        sub_query = f"More details specifically about: {sub_topic}"
        deeper_docs = recursive_retrieve(
            sub_query, retriever,
            max_depth=max_depth,
            current_depth=current_depth + 1
        )
        all_docs.extend(deeper_docs)

    # Deduplicate by content, keeping first-seen order
    seen = set()
    unique_docs = []
    for d in all_docs:
        if d.page_content not in seen:
            seen.add(d.page_content)
            unique_docs.append(d)
    return unique_docs

When to use it: Legal documents with cross-references. Technical manuals with appendices. Any corpus where sections point to sub-sections.

Limitation: Depth can explode without strict limits. Can introduce irrelevant tangents if the sub-queries drift from the original intent.

8. Adaptive Retrieval

Smart enough to know when not to retrieve at all.

Not every query needs a vector search. “What is 2 + 2?” does not need to hit your database. “What does our SLA say about uptime guarantees?” absolutely does. Adaptive retrieval classifies query complexity first, then routes it to the appropriate strategy. Simple queries go direct to the LLM. Complex ones get multi-hop retrieval.

The problem being solved: Your RAG system serves 10,000 queries a day. About 40% are simple greetings, basic definitions, or questions the LLM already knows. Running vector search on all of them is wasting money and adding latency to responses that do not need it.

def adaptive_rag(query: str, retriever, llm) -> str:
    # Step 1: Classify query complexity
    classification = llm.predict(
        f"Classify this query into exactly one of: simple, moderate, complex.\n"
        f"simple = general knowledge, no specific documents needed\n"
        f"moderate = needs one retrieval pass from documents\n"
        f"complex = needs multiple retrievals or cross-document reasoning\n\n"
        f"Query: {query}\n\nClassification:"
    ).strip().lower()

    # Step 2: Route to the appropriate strategy
    if "simple" in classification:
        return llm.predict(query)
    elif "moderate" in classification:
        docs = retriever.get_relevant_documents(query)
        context = "\n\n".join(d.page_content for d in docs[:4])
        return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")
    else:
        return iterative_rag(query, retriever, llm, max_rounds=3)

When to use it: High-traffic production systems with mixed query types where retrieval cost adds up.

Limitation: The routing classifier can misclassify. A wrongly classified complex query sent down the simple path gives a confidently wrong answer.

9. GraphRAG

When your knowledge lives in relationships, not paragraphs.

GraphRAG builds an entity-relationship graph from your documents instead of a flat vector index. Querying traverses graph neighborhoods, surfacing connected facts that no single chunk could contain. Microsoft’s open-source implementation also creates hierarchical “community reports” via community detection for high-level summarization queries.

The problem being solved: A financial analyst asks “Which companies in our portfolio have executives who previously worked at firms that were later acquired by competitors?” The answer requires traversing entity relationships across dozens of documents. No vector search gets there.
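A minimal sketch of the traversal idea, using a tiny hand-built graph in plain Python. The entities, relation names, and the `neighborhood` helper are all hypothetical. A real GraphRAG system would extract the triples with an LLM and usually store them in a graph database, and Microsoft's implementation additionally builds community summaries, which this sketch leaves out.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relation, object) triples, as an extraction step
# might produce them from several different documents.
triples = [
    ("Acme Corp", "in_portfolio", "Our Fund"),
    ("Jane Doe", "executive_at", "Acme Corp"),
    ("Jane Doe", "previously_at", "Initech"),
    ("Initech", "acquired_by", "Globex"),
    ("Globex", "competitor_of", "Acme Corp"),
]

# Index every triple under both endpoints so traversal can follow edges either way
graph = defaultdict(list)
for triple in triples:
    subj, _, obj = triple
    graph[subj].append((obj, triple))
    graph[obj].append((subj, triple))

def neighborhood(start: str, max_hops: int = 2) -> list:
    """Collect every triple reachable within max_hops of start (breadth-first)."""
    seen_nodes = {start}
    facts = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for other, triple in graph[node]:
            if triple not in facts:
                facts.append(triple)
            if other not in seen_nodes:
                seen_nodes.add(other)
                queue.append((other, depth + 1))
    return facts

# The connected facts become the context handed to the LLM
context = neighborhood("Jane Doe", max_hops=2)
for subj, rel, obj in context:
    print(f"{subj} --{rel}--> {obj}")
```

Two hops from one executive already link the portfolio company, the previous employer, and the acquirer, a chain that sits in separate documents and would rarely share a single chunk.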

When to use it: Medical knowledge graphs. Legal networks. Org charts. Supply chains. Any domain where the answer lives in a relationship rather than a paragraph.

Limitation: Expensive to build and maintain the graph. Entity extraction is imperfect. Not worth it if your domain is mostly independent documents with no relational structure.

10. Corrective RAG (CRAG)

Retrieval is the first draft. Corrective RAG decides if it is good enough to use.

After retrieval, CRAG scores each document’s relevance against the query. If the top results fall below a confidence threshold it discards them and fires a new search with a reformulated query. Sometimes it falls back to web search entirely. Only then does generation happen.

The problem being solved: You are building a legal research assistant. The user asks about a specific jurisdiction’s statute. Your retriever returns a similar-looking but wrong statute from a different jurisdiction. Without correction this becomes a confident wrong answer in a high-stakes context.

def corrective_rag(
    query: str,
    retriever,
    web_search_fn,
    llm,
    relevance_threshold: float = 0.7
) -> str:
    docs = retriever.get_relevant_documents(query)

    # Score each retrieved document for relevance
    scored_docs = []
    for doc in docs:
        score_str = llm.predict(
            f"Score how relevant this document is to the query on a scale of 0.0 to 1.0.\n"
            f"Reply with only the number.\n\n"
            f"Query: {query}\n\nDocument: {doc.page_content[:400]}"
        )
        try:
            score = float(score_str.strip())
        except ValueError:
            score = 0.0   # an unparseable reply counts as irrelevant
        scored_docs.append((doc, score))

    good_docs = [d for d, s in scored_docs if s >= relevance_threshold]

    # If not enough good docs, fall back to web search.
    # web_search_fn is expected to return objects with a page_content attribute.
    if len(good_docs) < 2:
        web_results = web_search_fn(query)
        good_docs.extend(web_results)

    context = "\n\n".join(d.page_content for d in good_docs[:5])
    return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:")

When to use it: Legal research. Medical Q&A. Academic writing. Anywhere a wrong source is worse than no source at all.

Limitation: The relevance evaluator can itself be wrong. Adds latency from the evaluation and potential re-retrieval loop. Can get stuck in correction loops on genuinely ambiguous queries.

11. Self-RAG

The model grades its own retrieval from the inside.

Self-RAG trains an LLM to emit special reflection tokens during generation. The model pauses and asks itself: should I even retrieve for this? Is this passage relevant? Is my answer actually supported by what I retrieved? The reflection happens inside the generation process, not as an external wrapper.

The problem being solved: You want retrieval quality to be the model’s responsibility, not something you orchestrate externally. The model should know when it is guessing versus when it is grounded.

SELF_RAG_PROMPT = """You are answering a question using retrieved documents.
Follow this process exactly:

[RETRIEVE NEEDED: yes/no] - Do you need to look this up?
[RELEVANCE: relevant/irrelevant] - Is the retrieved document useful?
[ANSWER]: Your answer based only on relevant documents.
[SUPPORTED: supported/not supported] - Is your answer grounded in the doc?

Question: {question}

Retrieved document:
{document}

Now follow the process:"""

def self_rag_answer(question: str, retriever, llm) -> str:
    docs = retriever.get_relevant_documents(question)
    for doc in docs[:3]:
        response = llm.predict(
            SELF_RAG_PROMPT.format(
                question=question,
                document=doc.page_content[:600]
            )
        )
        # Read the verdict after the final "[SUPPORTED:" tag. A plain substring
        # test would also match "not supported", so check the prefix instead.
        lowered = response.lower()
        if "[supported:" in lowered:
            verdict = lowered.rsplit("[supported:", 1)[1].strip()
            if verdict.startswith("supported"):
                return response
    return "Could not find a well-supported answer in the available documents."

When to use it: Research settings. When you can fine-tune. When you want the model to own retrieval quality rather than relying on external scoring logic.

Limitation: Requires a specially fine-tuned model. Not plug-and-play with off-the-shelf GPT-4 or Claude. The prompting simulation above approximates the behavior but does not match the full fine-tuned version.

12. RAG Fusion

Generate four different ways to ask the question, search all four, and merge the best results.

RAG Fusion generates multiple semantically diverse reformulations of the original query, runs parallel vector searches for each, and merges all ranked lists using Reciprocal Rank Fusion. It surfaces both literal and related knowledge that a single query would miss.

The problem being solved: A product manager asks “how do users feel about the checkout experience?” Your documents use phrases like “cart abandonment,” “payment friction,” “purchase flow,” and “conversion drop.” A single embedding of the original query will not match all of them. RAG Fusion generates all those variants automatically.

from collections import defaultdict

def rag_fusion(query: str, vectordb, llm, num_variants: int = 4) -> list:
    # Step 1: Generate semantically diverse query variants
    raw = llm.predict(
        f"Write {num_variants} different ways to search for the answer to this question.\n"
        f"Each variant should use different vocabulary and framing.\n"
        f"Output one per line, no numbering.\n\n"
        f"Original question: {query}"
    )
    # Keep the original query alongside the generated variants
    variants = [query] + [v.strip() for v in raw.strip().split("\n") if v.strip()][:num_variants]

    # Step 2: Retrieve for each variant independently
    all_ranked_lists = []
    for variant in variants:
        results = vectordb.similarity_search(variant, k=6)
        all_ranked_lists.append(results)

    # Step 3: Merge with Reciprocal Rank Fusion (k=60 is standard)
    rrf_scores = defaultdict(float)
    doc_map = {}
    for ranked_list in all_ranked_lists:
        for rank, doc in enumerate(ranked_list, start=1):   # RRF ranks are 1-based
            key = doc.page_content[:100]   # use content prefix as key
            rrf_scores[key] += 1.0 / (60 + rank)
            doc_map[key] = doc

    sorted_keys = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
    return [doc_map[k] for k in sorted_keys[:5]]

When to use it: Ambiguous queries. Research questions. Any case where user phrasing diverges significantly from how documents are written.

Limitation: N query variants equals N retrieval calls. Can bloat the context window with tangentially related results if the generated queries drift too far.

Layer 3: Chunking and Indexing

The most overlooked layer. Most engineers spend 80% of their time on retrieval and generation and almost none on how documents are prepared. That is backwards. Bad chunking causes retrieval failures that no amount of reranking can fix.

13. Context-Aware Chunking

Fixed-size chunking is the number one source of retrieval failures. Stop using it.

Splitting documents at fixed token counts severs sentences mid-thought, breaks tables in half, and separates headings from their content. Context-aware chunking splits at natural semantic or structural boundaries. Each chunk is a coherent unit of meaning.

The problem being solved: You chunked a technical manual with a 500-token splitter. A chunk ends mid-sentence: “The maximum load capacity is 450kg when the safety override is…” The next chunk starts “…disabled only by authorized personnel.” Both chunks are retrieved separately and neither makes sense alone.

from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# BAD: fixed-size splits mid-sentence constantly
# splitter = CharacterTextSplitter(chunk_size=500)

# BETTER: splits where embedding distance spikes (semantic boundary)
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95   # split on the top 5% biggest semantic jumps
)

# ALSO GOOD: structure-aware splitting respects markdown/heading boundaries
structural_splitter = RecursiveCharacterTextSplitter(
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
    chunk_size=600,
    chunk_overlap=80
)

chunks = semantic_splitter.split_documents(docs)
print(f"Created {len(chunks)} semantically coherent chunks")

When to use it: Every serious RAG system. If you are still using CharacterTextSplitter with a fixed chunk size in production, this is your first fix.

Limitation: Semantic chunking requires an embedding pass over all documents at indexing time. Slower and more expensive to build the initial index.

14. RAPTOR

Build a pyramid of summaries over your corpus. Answer from any level.

RAPTOR clusters documents, summarizes each cluster, then recursively clusters and summarizes those summaries, building a tree from raw passages up to a high-level overview of everything. At query time retrieval can pull from any level. Abstract thematic questions hit the top. Specific fact-finding drills to the leaves.

The problem being solved: You have 3,000 research papers in your database. A researcher asks “What are the dominant themes in recent work on transformer efficiency?” That is a question about the whole corpus, not a specific paper. Naive retrieval returns random chunks. RAPTOR answers from the summary tree.

from sklearn.mixture import GaussianMixture
from langchain.schema import Document
import numpy as np

def build_raptor_tree(docs: list, embeddings, llm, levels: int = 2) -> dict:
    tree = {"level_0": docs}
    current_level_docs = docs
    for level in range(1, levels + 1):
        # Clustering needs at least two documents; stop once the tree has converged
        if len(current_level_docs) < 2:
            break
        print(f"Building RAPTOR level {level}...")

        # Embed all docs at this level in one batch
        doc_embeddings = np.array(
            embeddings.embed_documents([d.page_content for d in current_level_docs])
        )

        # Cluster them (never ask for more clusters than there are documents)
        n_clusters = min(len(current_level_docs), max(2, len(current_level_docs) // 5))
        gm = GaussianMixture(n_components=n_clusters, random_state=42)
        labels = gm.fit_predict(doc_embeddings)

        # Summarize each cluster into a new document
        summaries = []
        for cluster_id in range(n_clusters):
            cluster_docs = [
                d for d, l in zip(current_level_docs, labels) if l == cluster_id
            ]
            if not cluster_docs:
                continue   # a mixture component can end up with no members
            # Only the first 6 docs of a cluster go into the summary prompt
            combined_text = "\n\n".join(d.page_content for d in cluster_docs[:6])
            summary = llm.predict(
                f"Summarize these documents into a single coherent paragraph:\n\n{combined_text}"
            )
            summaries.append(Document(
                page_content=summary,
                metadata={"level": level, "cluster": cluster_id}
            ))

        tree[f"level_{level}"] = summaries
        current_level_docs = summaries
    return tree
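The query-time half can be sketched as well. One simple option, assumed here, is to rank the nodes of every level together and keep the best matches, so thematic queries land on summaries and specific ones land on leaves. The toy tree, its two-dimensional vectors, and the `collapsed_tree_search` helper are hypothetical stand-ins; real nodes would carry embeddings from your embedding model.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def collapsed_tree_search(query_vec: list, tree: dict, top_k: int = 2) -> list:
    """Rank nodes from every level together and return the best top_k."""
    candidates = [
        (cosine(query_vec, node["vector"]), level, node["text"])
        for level, nodes in tree.items()
        for node in nodes
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(level, text) for _, level, text in candidates[:top_k]]

# Hypothetical tree: level_0 holds raw passages, level_1 a cluster summary
toy_tree = {
    "level_0": [
        {"text": "Paper A prunes attention heads.", "vector": [0.9, 0.1]},
        {"text": "Paper B quantizes weights to 4 bits.", "vector": [0.7, 0.3]},
    ],
    "level_1": [
        {"text": "Summary: efficiency work centers on pruning and quantization.",
         "vector": [0.2, 0.98]},
    ],
}

thematic_hits = collapsed_tree_search([0.1, 0.99], toy_tree, top_k=1)
specific_hits = collapsed_tree_search([0.95, 0.05], toy_tree, top_k=1)
print(thematic_hits)
print(specific_hits)
```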

When to use it: Long documents. Research paper corpora. Legal document collections. Any corpus where you need both high-level thematic answers and specific passage-level retrieval.

Limitation: Expensive to build. Requires LLM summarization of every cluster at every level. Index rebuild time is much longer than flat indexing.

15. Parent-Child Retrieval

Retrieve small. Return large. Precision at search time, context at generation time.

Index small granular chunks (children) for high-precision similarity search. When a child chunk is matched, automatically retrieve its full parent section for generation. Sentence Window Retrieval is a variant where you match at the sentence level then return a window of surrounding sentences.

The problem being solved: You are building a customer support bot over a 200-page product manual. Small chunks give precise retrieval but when the answer is spread across a paragraph the LLM only sees half of it. Large chunks give full context but their embeddings are diluted and retrieval precision drops.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Small chunks for retrieval precision
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Large chunks for generation context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=100)

# The store holds full parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=OpenAIEmbeddings()),
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs, ids=None)

# Retrieves via small child chunks but returns the full parent section
results = retriever.get_relevant_documents(
    "What happens if the safety sensor is triggered during operation?"
)
print(f"Returned {len(results)} parent sections with full context")
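The Sentence Window variant mentioned above can be sketched in a few lines of plain Python. This is an illustration under loud assumptions: word overlap stands in for embedding similarity, the sentence splitter is deliberately naive, and the manual text and helper names are invented for the example.

```python
import re

def split_sentences(text: str) -> list:
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, sentence: str) -> int:
    # Stand-in for embedding similarity: number of shared words
    return len(tokenize(query) & tokenize(sentence))

def sentence_window_retrieve(query: str, text: str, window: int = 1) -> str:
    """Match on a single sentence, then return it with `window` neighbors per side."""
    sentences = split_sentences(text)
    best = max(range(len(sentences)), key=lambda i: overlap_score(query, sentences[i]))
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])

manual = (
    "The conveyor runs at two speeds. "
    "If the safety sensor is triggered, the motor stops immediately. "
    "Operation resumes only after a supervisor resets the panel. "
    "Clean the belt weekly."
)
result = sentence_window_retrieve(
    "What happens when the safety sensor is triggered?", manual, window=1
)
print(result)
```

The match happens on one precise sentence, but the generator also sees the reset procedure in the sentence that follows it.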

When to use it: Documents with clear hierarchical structure like reports, manuals, and codebases. Anywhere you need search precision but context richness.

Limitation: Parent chunks can be large and dilute the context window when multiple children from different parents are retrieved simultaneously.

16. Late Chunking

Embed the whole document first. Chunk the embeddings afterward. Context preserved.

Traditional approaches chunk text first then embed each chunk independently, losing all cross-chunk context. Late chunking runs the full document through the encoder first, preserving long-range context in the token representations, then pools embeddings over chunk boundaries. Each chunk’s embedding carries awareness of the whole document.

The problem being solved: A contract reads “The party referred to in Section 2.1 shall not…” Chunked naively, the chunk containing this sentence has no idea who “the party referred to in Section 2.1” is because that section is in a different chunk. Late chunking embeds both sections together so the reference is preserved.
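The pooling step can be sketched without any model. The token vectors below are made-up numbers standing in for the output of a long-context encoder that has already processed the whole document in one pass, and `late_chunk` and `mean_pool` are hypothetical helpers. In practice the vectors would come from the embedding model's token-level outputs.

```python
def mean_pool(vectors: list) -> list:
    """Average a list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings: list, chunk_spans: list) -> list:
    """Pool contextualized token embeddings into one vector per chunk.

    token_embeddings: one vector per token, from encoding the FULL document once
    chunk_spans: (start, end) token indices per chunk, end exclusive
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]

# Hypothetical 2-dimensional token embeddings for a 6-token document
token_embeddings = [
    [1.0, 0.0], [0.8, 0.2], [0.6, 0.4],   # tokens 0-2, e.g. Section 2.1
    [0.4, 0.6], [0.2, 0.8], [0.0, 1.0],   # tokens 3-5, e.g. the later clause
]
chunk_spans = [(0, 3), (3, 6)]

chunk_vectors = late_chunk(token_embeddings, chunk_spans)
print(chunk_vectors)
```

The order of operations is the whole point: because the encoder saw the entire document before any boundary was drawn, each token vector being pooled already reflects the other section.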

When to use it: Documents where important context spans multiple sections. Contracts, scientific papers, narrative reports. Works best with long-context embedding models.

Limitation: Requires a model that supports long-context encoding at the embedding stage. More memory-intensive at indexing time.

17. Summary Indexing

Index the idea of the document. Retrieve the full document for generation.

Generate a concise LLM summary for each document and index that instead of raw chunks. Retrieval matches at the summary level which is rich in concepts and free of noise. But the LLM gets the full document text for generation. This separates the finding problem from the reading problem.

The problem being solved: You have 500 internal knowledge base articles. Users ask conceptual questions like “What is our approach to incident response?” A specific chunk from the incident response article might not embed near that phrasing. But a summary of the article will.

    from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    Settings.llm = OpenAI(model="gpt-4o")
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

    # Load your 500 knowledge base articles
    documents = SimpleDirectoryReader("./knowledge_base").load_data()

    # Build the summary index (one LLM call per document at indexing time).
    # DocumentSummaryIndex generates and indexes a summary per document;
    # SummaryIndex, despite its name, only stores nodes in a flat list.
    summary_index = DocumentSummaryIndex.from_documents(documents)

    # Retrieval uses the summaries; the LLM gets the full document
    query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        verbose=True
    )

    response = query_engine.query(
        "What is our overall approach to incident response and escalation?"
    )
    print(response)

When to use it: Queries that are thematic or conceptual rather than exact. Knowledge bases, policy documents, research collections.

Limitation: Summaries can lose specific details. If the answer is a specific number or clause, the summary may not surface it, and retrieval fails silently.

Real users write terrible queries. “thing broke,” “pricing,” “the document from last week.” These strategies fix the query before it ever touches your retriever.

Users type lazy queries. This adds the missing words before searching.

Before retrieval, expand the query with related terms, synonyms, and rephrasings. This bridges the vocabulary gap between how users phrase questions and how documents are written. A user types “my order is late.” Your document says “shipment delay notification procedures.” Without expansion these never match.

The problem being solved: You are building a search system for an HR platform. An employee types “can I work from abroad.” Your policy documents use “remote work policy,” “international employment,” “tax implications for overseas work,” and “work permit requirements.” A single embedding of the original query is unlikely to land near all of these phrasings, so it may surface only one or two of them.

    def expand_query(query: str, llm) -> str:
        expanded = llm.predict(
            f"Expand this search query to improve document retrieval.\n"
            f"Add relevant synonyms, related terms, and alternative phrasings.\n"
            f"Keep it concise. Output only the expanded query, nothing else.\n\n"
            f"Original query: {query}\n\n"
            f"Expanded query:"
        )
        return expanded.strip()


    def step_back_query(query: str, llm) -> str:
        # Step-back prompting: ask the broader question first
        return llm.predict(
            f"What is the broader, more general topic behind this specific question?\n"
            f"Return only the broader question, nothing else.\n\n"
            f"Specific: {query}\n"
            f"Broader:"
        )


    original = "can I work from abroad"

    expanded = expand_query(original, llm)
    # → "remote work abroad international employment overseas work policy
    #    work permit tax implications working from another country"

    broader = step_back_query(original, llm)
    # → "What are the rules and policies for working outside my home country?"

    docs = retriever.get_relevant_documents(expanded)

When to use it: Any system with real users. This is one of the easiest wins in the pipeline with a very low implementation cost.

Limitation: Poorly expanded queries add noise. An LLM expansion call adds latency. Overly broad expansions can pull in irrelevant documents.

Do not search with your question. Search with a hypothetical answer.

HyDE prompts the LLM to write a fake ideal document that would answer the query. That fabricated document is embedded and used as the search vector instead of the raw question. Since it is written in document-style language it lands much closer to real documents in embedding space than a short user query would.

The problem being solved: You are building a RAG system over medical literature. A doctor asks “best treatment protocol for resistant hypertension in diabetic patients.” That short clinical question has a very different embedding than a multi-paragraph clinical guideline that actually answers it. HyDE generates a fake guideline and searches with that instead.

    def hyde_retrieval(query: str, vectordb, llm, k: int = 5) -> list:
        # Step 1: Generate a hypothetical ideal document
        hypothetical_doc = llm.predict(
            f"Write a detailed clinical paragraph that directly answers this question.\n"
            f"Use the same formal language and vocabulary as medical literature.\n"
            f"Do not indicate that this is hypothetical.\n\n"
            f"Question: {query}\n\n"
            f"Clinical answer:"
        )
        print(f"Hypothetical doc preview: {hypothetical_doc[:200]}...")

        # Step 2: Search using the hypothetical document as the query vector
        results = vectordb.similarity_search(hypothetical_doc, k=k)
        return results


    docs = hyde_retrieval(
        "best treatment protocol for resistant hypertension in diabetic patients",
        medical_vectordb,
        llm
    )

When to use it: Out-of-domain queries. Zero-shot retrieval. Any scenario where user questions and document vocabulary are stylistically very different.

Limitation: The hypothetical document can contain hallucinated facts. Those facts then skew the search vector toward documents that happen to echo them. Handle with care in high-stakes domains.

One question. Four search angles. Four times the coverage.

Multi-Query RAG generates multiple semantically distinct reformulations of the original query, runs parallel retrievals for each, deduplicates the combined result set, and passes everything to the LLM. Unlike RAG Fusion, which focuses on merging ranked lists with RRF, Multi-Query focuses on exploring different semantic facets of the question.

The problem being solved: A user asks “how do I cancel my subscription and get a refund?” That single question actually contains two sub-intents. Multi-Query generates “subscription cancellation process,” “refund policy for cancelled accounts,” “how to end a plan,” and “getting money back after cancelling” and searches all of them.

    from langchain.retrievers.multi_query import MultiQueryRetriever
    from langchain.chat_models import ChatOpenAI

    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
        llm=ChatOpenAI(model="gpt-4o", temperature=0.4)
    )

    # Under the hood this generates several query variants (the count depends
    # on the prompt in use), runs all of them, and deduplicates the combined
    # result set automatically
    results = multi_query_retriever.get_relevant_documents(
        "how do I cancel my subscription and get a refund?"
    )
    print(f"Retrieved {len(results)} unique documents across all query variants")

    # Example variants it might search:
    # → "subscription cancellation steps"
    # → "refund policy after cancellation"
    # → "how to terminate an account and receive a refund"
    # → "cancel plan and billing reversal process"

When to use it: Questions where the right framing is not obvious. Questions that contain multiple sub-intents.

Limitation: N query variants mean N retrieval calls. Deduplication removes repeats, but the merged set can still bloat the context window when each variant pulls in its own irrelevant documents.
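To make the merge step concrete, here is a minimal, library-free sketch of what happens after the variants are generated: one search per variant, a de-duplicated union, and a cap on the merged set. The `search` callable, the `"id"` field, the `max_docs` cap, and the toy index are all assumptions for illustration. They are not LangChain's internals, and the loop runs sequentially rather than in parallel to keep the sketch short.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


def multi_query_retrieve(
    query_variants: List[str],
    search: Callable[[str], List[Doc]],
    max_docs: int = 10,
) -> List[Doc]:
    """Run one search per variant and return the de-duplicated union."""
    seen_ids = set()
    merged = []
    for variant in query_variants:
        for doc in search(variant):
            if doc["id"] in seen_ids:
                continue  # already returned by an earlier variant
            seen_ids.add(doc["id"])
            merged.append(doc)
    # Cap the union so extra variants cannot bloat the context window
    return merged[:max_docs]


# Hypothetical in-memory "index": query variant -> ranked hits
TOY_INDEX = {
    "subscription cancellation steps": [
        {"id": "kb-12", "text": "How to cancel a plan"},
        {"id": "kb-40", "text": "Billing overview"},
    ],
    "refund policy after cancellation": [
        {"id": "kb-31", "text": "Refund eligibility"},
        {"id": "kb-12", "text": "How to cancel a plan"},
    ],
}


def toy_search(query: str) -> List[Doc]:
    return TOY_INDEX.get(query, [])


docs = multi_query_retrieve(list(TOY_INDEX), toy_search)
print([d["id"] for d in docs])  # → ['kb-12', 'kb-40', 'kb-31']
```

The cap is a blunt tool. If ordering matters, reranking the merged set before truncating is usually a better choice than relying on the order in which variants happened to run.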

Give every chunk a memory of where it came from.

Before indexing, use an LLM to prepend a short document-aware context summary to each chunk. A chunk that used to say “Revenue declined by 12%” now says “From the Acme Corp Q3 2024 Earnings Report executive summary: Revenue declined by 12%.” The chunk carries its origin as part of its content, dramatically reducing context loss during retrieval.

The problem being solved: You are indexing 200 quarterly earnings reports. Without context, a chunk like “Operating margins improved to 23.4%” could be from any company in any quarter. With contextual retrieval the chunk reads “From Tesla Q2 2024 earnings, operations section: Operating margins improved to 23.4%.” The embedding is far more specific.

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.schema import Document
    from langchain.vectorstores import Chroma


    def add_context_to_chunk(full_doc_text: str, chunk_text: str, llm) -> str:
        # Only the first 3,000 characters of the document are sent, to bound prompt size
        context = llm.predict(
            f"Here is a full document:\n"
            f"<document>\n{full_doc_text[:3000]}\n</document>\n\n"
            f"Here is a specific chunk from that document:\n"
            f"<chunk>\n{chunk_text}\n</chunk>\n\n"
            f"Write 1-2 sentences that situate this chunk within the document. "
            f"Be specific: mention the document name, section, and what the chunk is about. "
            f"Output only the context sentences, no preamble."
        )
        return f"{context.strip()}\n\n{chunk_text}"


    # Apply at indexing time across all chunks
    enriched_chunks = []
    for chunk in raw_chunks:
        enriched_content = add_context_to_chunk(
            full_doc_text=chunk.metadata.get("source_doc", ""),
            chunk_text=chunk.page_content,
            llm=llm
        )
        enriched_chunks.append(Document(
            page_content=enriched_content,
            metadata=chunk.metadata
        ))

    vectordb = Chroma.from_documents(enriched_chunks, OpenAIEmbeddings())

When to use it: Large document collections where chunks become meaningless without their surrounding context. Financial reports, legal contracts, technical manuals.

Limitation: Requires one LLM call per chunk at indexing time. For millions of chunks this is expensive. It also inflates chunk size, which can affect retrieval precision.

You have retrieved your documents. Now what happens between “here are your chunks” and “here is your answer” determines whether users trust the system.

Retrieve 50. Rerank to 5. The highest-ROI single addition to any RAG pipeline.

After fast bi-encoder retrieval, a cross-encoder model scores each candidate chunk against the original query with full attention rather than just dot-product similarity. The cross-encoder reads both the query and the document together before scoring, making it far more accurate. Retrieve 50 candidates, rerank, pass only the top 5 to the LLM. This consistently adds 5 to 15 percent accuracy on top of hybrid retrieval.

The problem being solved: Your hybrid search retrieves 20 documents. Ranks 1 and 2 are decent. Rank 7 is actually the perfect answer. The LLM only sees ranks 1 to 5 and gives a mediocre answer. The reranker would have put rank 7 at position 1.

    import cohere
    from sentence_transformers import CrossEncoder


    def rerank_pipeline(query: str, vectordb, top_k_retrieve: int = 50, top_k_return: int = 5):
        # Step 1: Fast broad retrieval - cast a wide net
        candidates = vectordb.similarity_search(query, k=top_k_retrieve)

        # Option A: Cohere reranker (API-based, high quality)
        co = cohere.Client("your-api-key")
        reranked = co.rerank(
            query=query,
            documents=[c.page_content for c in candidates],
            top_n=top_k_return,
            model="rerank-english-v3.0"
        )
        return [candidates[r.index] for r in reranked.results]

        # Option B: Open-source cross-encoder (free, runs locally)
        # model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        # pairs = [(query, c.page_content) for c in candidates]
        # scores = model.predict(pairs)
        # ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        # return [doc for _, doc in ranked[:top_k_return]]


    final_docs = rerank_pipeline("maximum upload file size for enterprise plan", vectordb)

When to use it: Always. The cost and accuracy tradeoff almost always favors including a reranker. This and hybrid search are the two changes that will have the biggest immediate impact.

Limitation: Cross-encoders are slower than bi-encoders. Cannot run on the full corpus, only on a pre-retrieved candidate set. Adds 100 to 400ms depending on candidate set size and model.

Do not search everything. Search summaries first, then drill in.

A two-stage retrieval funnel. First search over document or section summaries (fast and coarse). Then retrieve detailed chunks only from the top-ranked documents. This avoids wasting your retrieval budget on irrelevant documents in large corpora.

The problem being solved: You have 10,000 documents. A broad vector search returns 20 results. Half of them are from documents that are completely unrelated to the domain of the query. You are burning context window space on noise. Hierarchical RAG first identifies the 3 most relevant documents, then retrieves specific chunks only from those.

    def hierarchical_retrieve(
        query: str,
        summary_vectordb,     # index of document summaries
        chunk_vectordb,       # index of fine-grained chunks
        top_documents: int = 3,
        top_chunks: int = 8
    ) -> list:
        # Stage 1: Find the most relevant documents via their summaries
        relevant_summaries = summary_vectordb.similarity_search(query, k=top_documents)
        shortlisted_doc_ids = [s.metadata["doc_id"] for s in relevant_summaries]
        print(f"Stage 1: shortlisted {len(shortlisted_doc_ids)} documents")

        # Stage 2: Retrieve chunks only from those shortlisted documents
        final_chunks = chunk_vectordb.similarity_search(
            query,
            k=top_chunks,
            filter={"doc_id": {"$in": shortlisted_doc_ids}}
        )
        print(f"Stage 2: retrieved {len(final_chunks)} chunks from shortlisted docs")
        return final_chunks


    results = hierarchical_retrieve(
        "What were the key risk factors disclosed in annual reports from 2023?",
        summary_vectordb=doc_summary_db,
        chunk_vectordb=fine_chunk_db
    )

When to use it: Corpora with thousands of documents where initial retrieval returns too much noise. Anywhere retrieval precision needs to improve at the document level before the chunk level.

Limitation: If the coarse document-level search misses the right document, the second stage never recovers it. Hard failures are harder to debug than soft ones.

After the answer is generated, ask: was it actually grounded in the documents?

After generating a response, a second LLM pass evaluates whether the answer is supported by the retrieved context. If not, it triggers re-retrieval with a refined query or flags the response as low-confidence. This uses the LLM-as-judge pattern without requiring any fine-tuning.

The problem being solved: You are building a compliance assistant. The LLM generates a confident answer that sounds great but is actually a blend of retrieved content and training data hallucination. Without self-reflection you never know. With it, the system catches that the answer is not fully grounded before returning it.

    def self_reflective_rag(
        query: str,
        retriever,
        llm,
        max_correction_attempts: int = 2
    ) -> dict:
        docs = retriever.get_relevant_documents(query)
        context = "\n\n".join(d.page_content for d in docs[:5])

        for attempt in range(max_correction_attempts + 1):
            # Generate an answer
            answer = llm.predict(
                f"Answer this question using only the provided context.\n"
                f"Context:\n{context}\n\n"
                f"Question: {query}\n\nAnswer:"
            )

            # Self-check: is this answer grounded?
            verdict = llm.predict(
                f"Is every claim in this answer directly supported by the context?\n"
                f"Reply with only 'supported' or 'not supported'.\n\n"
                f"Context:\n{context}\n\n"
                f"Answer: {answer}\n\nVerdict:"
            ).strip().lower()

            # Match on the prefix: "not supported" also contains "supported",
            # so a substring check would accept every verdict
            if verdict.startswith("supported"):
                return {"answer": answer, "grounded": True, "attempts": attempt + 1}

            # Not supported - re-retrieve with a more specific query
            if attempt < max_correction_attempts:
                refined_query = f"{query} - specific evidence and sources"
                docs = retriever.get_relevant_documents(refined_query)
                context = "\n\n".join(d.page_content for d in docs[:5])

        return {"answer": answer, "grounded": False, "attempts": max_correction_attempts + 1}

When to use it: Medical, legal, or financial applications. Anywhere a confident wrong answer is a hard failure, not just an inconvenience.

Limitation: At least doubles the LLM calls per query (one generation plus one judgment), and each correction attempt adds two more. The judge LLM can itself be wrong about what is and is not supported.

Generic embeddings fail in specialized domains. Train your own retriever.

Generic embedding models are trained on general web text. Biomedical literature, legal contracts, internal company jargon, and financial filings all have vocabulary that generic models embed poorly. Fine-tuning an embedding model on domain-specific query-document pairs fixes this at the source.

The problem being solved: You are building a RAG system over internal engineering documentation. Terms like “flaky-ci-timeout,” “shard-rebalance-lag,” and “canary-deployment-rollback” mean nothing to a generic embedding model. It groups them by surface similarity to generic English. A fine-tuned model learns that a query about “canary rollback failures” should retrieve your specific incident postmortems.

    from langchain.embeddings import HuggingFaceEmbeddings
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader

    # Your labeled training pairs: (user query, matching document excerpt)
    training_pairs = [
        InputExample(texts=[
            "canary deployment rollback after high error rate",
            "Canary rollback procedure: when error rate exceeds 2% in canary tier, "
            "trigger immediate rollback via the deployment controller..."
        ]),
        InputExample(texts=[
            "shard rebalance causing query latency spike",
            "Shard rebalance operations temporarily increase p99 query latency by 40-60ms "
            "due to data movement across nodes..."
        ]),
        # ...collect hundreds to thousands of these from your actual query logs
    ]

    # Start from a strong general-purpose base model
    model = SentenceTransformer("BAAI/bge-base-en-v1.5")

    train_data = DataLoader(training_pairs, shuffle=True, batch_size=16)
    # Treats the other documents in each batch as negatives for a given query
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(
        train_objectives=[(train_data, train_loss)],
        epochs=3,
        warmup_steps=100,
        show_progress_bar=True,
        output_path="./engineering-docs-embeddings-v1"
    )

    # Now use this domain-specific model in your retriever
    custom_embeddings = HuggingFaceEmbeddings(
        model_name="./engineering-docs-embeddings-v1"
    )

When to use it: You have implemented the other 24 strategies and still see retrieval mismatches on domain-specific terminology. This is the last optimization to reach for, not the first.

Limitation: Requires labeled query-document pairs for training. Expensive to train and retrain when the domain evolves. Total overkill for most use cases.

If you read this whole guide and are wondering where to begin, here is the honest answer.

Do these three things first and you will close 80 percent of the gap between a prototype and a production-grade system.

Fix your chunking (strategy 13). Semantic or structure-aware splitting instead of fixed-size. This is free to implement and removes the single biggest source of retrieval failure.

Add hybrid search (strategy 5). Combine BM25 with your dense retriever and merge with RRF. One afternoon of work, immediately measurable improvement.

Add a reranker (strategy 22). Retrieve 50 candidates, pass the top 5 to your LLM. Consistently adds 5 to 15 percent accuracy. Cohere has a free tier to start.
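For the hybrid search step, the merge itself is only a few lines. The sketch below shows Reciprocal Rank Fusion over two ranked lists of document IDs, assuming the BM25 and dense retrievers have already run. The function name, the document IDs, and the two input rankings are made up for illustration; `k = 60` is the commonly used smoothing constant and should be treated as tunable.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs: score(d) = sum of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of a keyword retriever and a dense retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF uses only rank positions, so the BM25 and cosine scores never need to be put on a common scale. Documents that both retrievers rank highly rise to the top.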

After that, add contextual retrieval (21) and query expansion (18). Then look at your specific failure modes and pick the strategy that addresses them directly.

The last thing you should reach for is fine-tuned embeddings (25). It is almost never the bottleneck early on.

The RAG landscape moves fast. Agentic architectures, multimodal retrieval, and RL-trained search agents are already changing what this field looks like. Treat this guide as a foundation, not a ceiling.

The Complete Guide to RAG Strategies: 25 Techniques Every Researcher and Engineer Must Know was originally published in Towards AI on Medium.
