When Does HyDE Help RAG? I Tested 3 Query Types and It Failed on Two

A developer tested HyDE (Hypothetical Document Embeddings) against standard retrieval on three query types and found it improved conceptual searches but failed on internal company policy and exact product name queries due to hallucination and signal dilution.

I ran the same queries through two retrieval pipelines, one standard and one using HyDE, and the results split cleanly. HyDE crushed it on conceptual questions where the user’s words didn’t match the document’s language. But when I asked about internal company policy, the LLM hallucinated a fake answer, and the search went looking for that instead. And when I searched for an exact product name, all the extra generated text just diluted the signal. Here is exactly what happened, with code and scores.

HyDE stands for Hypothetical Document Embeddings. It comes from a 2022 paper by Gao et al. (https://arxiv.org/abs/2212.10496), and the core idea is deceptively simple.

In standard retrieval, you take the user’s short query, embed it, and search your vector database for similar documents. The problem? A short question like “Why is the sky blue?” can land in a very different region of embedding space than a detailed paragraph explaining Rayleigh scattering. The query and the answer just don’t look alike as vectors.

HyDE flips this. Instead of embedding the query, you first ask an LLM to generate a hypothetical document, a fake answer to the question. Then you embed that fake document and use it to search. A fake answer looks a lot more like a real answer in vector space than a short question ever could.

[Diagram: standard retrieval vs. HyDE retrieval]

The diagram above shows the difference. Standard retrieval has two steps: embed the query, search. HyDE has three: generate a fake document, embed the fake document, search.

The fake document doesn’t need to be correct. It just needs to be stylistically similar to what a real answer would look like. The embedding model does the rest.

HyDE sounds great in the paper. But papers test on academic benchmarks. I wanted to know what happens in the situations I actually encounter when building RAG systems: conceptual questions whose wording doesn’t match the documents, questions about internal company data, and lookups of exact identifiers. So I built a small notebook to test all three.

I created a small knowledge base with 7 documents deliberately chosen to cover different retrieval challenges.
I know 7 documents is tiny. I treat this as a behavior probe, not a benchmark. The goal is to isolate how HyDE changes retrieval behavior in different scenarios, not to produce statistically significant numbers.

```python
documents = [
    "The Federal Reserve increased interest rates by 0.25% in Q3…",
    "SpaceX successfully landed the Falcon 9 booster…",
    "The mitochondria is the powerhouse of the cell…",
    "Apple's latest flagship device features… the new 3nm A17 Pro chip.",
    "The capital of Australia is Canberra…",
    "In Python, the Global Interpreter Lock (GIL) is a mutex…",
    "Acme Corp's internal policy states that all travel expenses exceeding $500 must be approved by a Level 3 manager or above.",
]
```

The mix is intentional. Some are public knowledge (the GIL, mitochondria). One is proprietary (Acme Corp’s internal policy). Some contain specific identifiers (A17 Pro). This lets us test where HyDE helps and where it fails.

I used Azure OpenAI with text-embedding-3-small for embeddings and gpt-4o-mini for generating hypothetical documents. The retrieval is pure cosine similarity: no HNSW, no reranking, nothing fancy. I wanted to isolate the effect of HyDE itself.

The two retrieval functions look like this.

Standard retrieval: embed the query, compare to all documents:

```python
# get_embedding, cosine_similarity, and doc_embeddings are defined earlier in the notebook.
def standard_retrieve(query, top_k=2):
    query_emb = get_embedding(query)
    similarities = []
    for i, doc_emb in enumerate(doc_embeddings):
        sim = cosine_similarity(query_emb, doc_emb)
        similarities.append((sim, i))
    # Sort by descending similarity
    similarities.sort(reverse=True, key=lambda x: x[0])
    results = []
    for sim, idx in similarities[:top_k]:
        results.append({"score": sim, "text": documents[idx]})
    return results
```

HyDE retrieval: generate a fake document first, then embed that:

```python
def hyde_retrieve(query, top_k=2):
    fake_doc = generate_hypothetical_document(query)
    fake_doc_emb = get_embedding(fake_doc)
    # ... same cosine similarity search as above
```

The generate_hypothetical_document function simply prompts the LLM:

```python
prompt = f"""Please write a short paragraph answering the following question or explaining the topic.
Do not write an introduction or conclusion, just write the factual answer as if it were a snippet from an article or Wikipedia page.

Question/Topic: {query}"""
```

Now let’s see what happens.

Query: “Why is my multithreaded code in Python still slow?”

Target document: “In Python, the Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.”

This is the classic case where the user describes a symptom (slow multithreaded code) but the document uses formal terminology (GIL, mutex, bytecodes). The words barely overlap.

What happened: Standard retrieval found the right document, but the similarity score was modest. The query’s vocabulary (“slow”, “multithreaded code”) doesn’t match the document’s language (“mutex”, “bytecodes”) very well.

HyDE generated something like: “In Python, multithreading is often limited by the Global Interpreter Lock (GIL), which prevents true parallel execution of threads. Even on multi-core systems, only one thread can execute Python bytecode at a time, making CPU-bound multithreaded programs effectively single-threaded…”

This fake document shares much of the real document’s vocabulary (GIL, threads, bytecode). When embedded, it lands right next to the GIL document in vector space. The similarity score jumped noticeably.

Why it worked: The LLM “knows” about the GIL. It generates text that looks like the real document. The embedding of the fake answer is much closer to the embedding of the real answer than the short question ever was.
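To make the scoring step concrete, here is a minimal, API-free sketch of the cosine-similarity search both pipelines share. It is not the notebook’s code: the 3-dimensional vectors are invented for illustration, `retrieve` is a hypothetical helper, and this `cosine_similarity` is my own plain-Python stand-in for whatever the notebook uses. The point is only that the same search runs in both pipelines, and the one thing HyDE changes is which vector goes in.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, doc_embeddings, documents, top_k=2):
    """Score every document against one embedding and return the best top_k."""
    scored = [
        (cosine_similarity(query_emb, doc_emb), idx)
        for idx, doc_emb in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"score": s, "text": documents[i]} for s, i in scored[:top_k]]


# Hypothetical toy "embeddings", invented for illustration only.
toy_documents = ["GIL document", "Canberra document", "Falcon 9 document"]
toy_doc_embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]

# Stand-ins for the two vectors the pipelines would search with.
short_query_emb = [0.6, 0.5, 0.3]      # the embedded user question
fake_answer_emb = [0.85, 0.15, 0.05]   # the embedded hypothetical document

standard_hit = retrieve(short_query_emb, toy_doc_embeddings, toy_documents, top_k=1)[0]
hyde_hit = retrieve(fake_answer_emb, toy_doc_embeddings, toy_documents, top_k=1)[0]

print("standard:", standard_hit["text"], round(standard_hit["score"], 3))
print("hyde:    ", hyde_hit["text"], round(hyde_hit["score"], 3))
```

With these made-up vectors, both searches return the same document, but the fake-answer vector sits closer to it, which mirrors the GIL scenario above. Real embeddings have far more dimensions, so treat this as a picture of the mechanism rather than a prediction of actual scores.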
Query: “What is the manager level required to approve a $600 flight?”

Target document: “Acme Corp’s internal policy states that all travel expenses exceeding $500 must be approved by a Level 3 manager or above.”

This is where things get interesting.

What happened: The LLM has never seen Acme Corp’s internal policy. So it hallucinated something generic: “Typically, travel expenses require approval from a department manager or supervisor. Many organizations set thresholds where expenses over a certain amount need higher-level authorization, such as a director or VP…”

This fake document talks about “directors”, “VPs”, and “department managers”, none of which appear in the actual document. The real document says “Level 3 manager.” The hallucinated text pulled the search away from the correct answer.

Meanwhile, standard retrieval just embedded the query directly. The words “manager level”, “$600”, and “approve” had enough semantic overlap with the real document’s language to find it with a reasonable score.

Why it failed: HyDE assumes the LLM can generate something that resembles the real document. But for proprietary data, the LLM is guessing blind. Its guess has the wrong vocabulary, the wrong specifics, and the wrong structure. The embedding of the hallucination ends up in the wrong neighborhood.

Query: “A17 Pro”

Target document: “Apple’s latest flagship device features a brushed titanium body, a periscope telephoto lens, and the new 3nm A17 Pro chip.”

What happened: Standard retrieval embedded “A17 Pro”: two words, very specific. The embedding homed in on exactly the document containing that term. High similarity.

HyDE generated a full paragraph: “The A17 Pro is Apple’s latest system-on-chip, built on a 3-nanometer process.
It features a 6-core CPU with improved performance and efficiency cores, a 6-core GPU with hardware-accelerated ray tracing, and a 16-core Neural Engine capable of 35 trillion operations per second…”

This is factually decent, but it’s a whole paragraph. The embedding now represents a broad concept, Apple chip specifications, rather than the laser-focused signal of “A17 Pro.” All that extra text dilutes the embedding, spreading the vector’s attention across many concepts. Standard retrieval’s score was higher because its embedding was pure signal, no noise.

Why it failed: When the user already has the exact term, there is no semantic gap to bridge. Adding more text tends to hurt. The embedding of two precise words is sharper than the embedding of a paragraph.

Here is the bar chart from the notebook, comparing cosine similarity scores across all three scenarios:

[Bar chart: cosine similarity scores for standard retrieval vs. HyDE across the three scenarios]

The pattern is clear: HyDE scored higher only on the conceptual query, while standard retrieval scored higher on the internal-policy and exact-identifier queries. Scores are cosine similarity from text-embedding-3-small. Higher = closer match to the target document. Your numbers will vary slightly across runs because the generated hypothetical document changes with LLM generation temperature.

HyDE won one out of three. And importantly, when it lost, it lost for structural reasons that apply broadly, not because of bad luck with the prompt.

Based on this experiment and the original paper, HyDE is most useful when the LLM can plausibly imagine the answer and the query’s wording differs from the documents’ wording.

After running these experiments, this is the mental model I use now:

Use HyDE when the query is vague, conceptual, or written in different language than the documents.

Avoid HyDE when the query contains exact identifiers, internal policy terms, product names, error codes, customer names, or proprietary facts.

In production, route the query first: send exact-identifier and proprietary lookups to standard retrieval, and vague or conceptual questions to HyDE. This simple routing decision is worth more than picking one approach for all queries.

In production, you also don’t have to choose. Run both pipelines, merge their candidates, and let a reranker order the combined list. This way you get HyDE’s semantic gap bridging and standard search’s precision on exact matches, and the reranker handles the sorting.

This is a toy experiment with 7 documents.
It leaves a lot untested, including larger corpora, approximate indexes such as HNSW, and reranking.

The full notebook is available on GitHub (https://github.com/ameynarwadkar/medium-notebooks). To run it:

1. Clone the repo (https://github.com/ameynarwadkar/medium-notebooks) or download HyDE.ipynb (https://github.com/ameynarwadkar/medium-notebooks/blob/main/HyDE.ipynb).
2. Set up your Azure OpenAI credentials in a `.env` file:

   ```
   AZURE_OPENAI_API_KEY=your_key
   AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
   AZURE_CHAT_DEPLOYMENT=gpt-4o-mini
   AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-small
   AZURE_OPENAI_API_VERSION=2025-04-01-preview
   ```

3. Install dependencies:

   ```
   pip install openai numpy python-dotenv matplotlib
   ```

4. Run the cells.

You can swap in your own documents, change the queries, or try different embedding models. The comparison function makes it easy to test any scenario.

HyDE is not a magic retrieval upgrade. It is a vocabulary bridge. When the LLM can imagine the right kind of answer, HyDE helps. When the answer lives inside your private data or inside an exact identifier, HyDE starts guessing and retrieval follows the guess.

That is the real lesson: in RAG, better generation before retrieval is only useful when the generation points in the right direction.

If you run this notebook on a different dataset, I would be curious whether you see the same pattern, especially on proprietary data. That is where most real-world RAG systems live, and it is exactly where HyDE struggles the most.

Originally published in Towards AI (https://pub.towardsai.net) on Medium: https://pub.towardsai.net/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two-c8946453de34
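As a postscript to the routing advice above, here is one way the “route the query first” decision could be sketched. Everything in it is an assumption for illustration and none of it comes from the notebook: the function names `choose_pipeline` and `routed_retrieve`, the three-word cutoff, the regular expression, and the `internal_terms` list are all hypothetical and would need tuning against real traffic.

```python
import re

# Hypothetical surface cues (not from the notebook): tokens that mix letters
# and digits, such as model numbers, and currency amounts such as "$600".
IDENTIFIER_PATTERN = re.compile(r"\b[A-Za-z]+\d+[A-Za-z0-9]*\b|\$\d+")


def choose_pipeline(query, internal_terms=("acme", "internal policy")):
    """Return "standard" or "hyde" for a query, using cheap heuristics."""
    lowered = query.lower()
    # Very short queries are usually lookups, so there is no gap to bridge.
    if len(query.split()) <= 3:
        return "standard"
    # Exact identifiers and specific amounts favour direct embedding.
    if IDENTIFIER_PATTERN.search(query):
        return "standard"
    # Terms that signal private data the LLM cannot know about.
    if any(term in lowered for term in internal_terms):
        return "standard"
    # Everything else is treated as a vague or conceptual question.
    return "hyde"


def routed_retrieve(query, standard_fn, hyde_fn):
    """Dispatch to one of two retrieval callables and report which one ran."""
    pipeline = choose_pipeline(query)
    retrieve_fn = standard_fn if pipeline == "standard" else hyde_fn
    return pipeline, retrieve_fn(query)


for q in [
    "Why is my multithreaded code in Python still slow?",
    "What is the manager level required to approve a $600 flight?",
    "A17 Pro",
]:
    print(choose_pipeline(q), "<-", q)
```

On the three queries from this post, the heuristics send the conceptual question to HyDE and the other two to standard retrieval. Keyword rules like these are brittle, so a small classifier or an LLM call could replace `choose_pipeline` without changing `routed_retrieve`.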