{"slug": "when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two", "title": "When Does HyDE Help RAG? I Tested 3 Query Types and It Failed on Two", "summary": "A developer tested HyDE (Hypothetical Document Embeddings) against standard retrieval on three query types and found it improved conceptual searches but failed on internal company policy and exact product name queries due to hallucination and signal dilution.", "body_md": "I ran the same queries through two retrieval pipelines, one standard and one using HyDE, and the results split cleanly. HyDE crushed it on conceptual questions where the user’s words didn’t match the document’s language. But when I asked about internal company policy, the LLM hallucinated a fake answer, and the search went looking for *that* instead. And when I searched for an exact product name, all the extra generated text just diluted the signal.\n\nHere is exactly what happened, with code and scores.\n\nHyDE stands for **Hypothetical Document Embeddings**. It comes from a [2022 paper by Gao et al.](https://arxiv.org/abs/2212.10496), and the core idea is deceptively simple.\n\nIn standard retrieval, you take the user’s short query, embed it, and search your vector database for similar documents. The problem? A short question like *“Why is the sky blue?”* often lands in a different region of embedding space than a detailed paragraph explaining Rayleigh scattering. The query and the answer just don’t look alike as vectors.\n\nHyDE flips this. Instead of embedding the query, you first ask an LLM to **generate a hypothetical document**, a fake answer to the question. Then you embed *that* fake document and use it to search. A fake answer looks a lot more like a real answer in vector space than a short question ever could.\n\nThe diagram above shows the difference. Standard retrieval has two steps: embed the query, search. 
HyDE has three: generate a fake document, embed the fake document, search.\n\nThe fake document doesn’t need to be *correct*. It just needs to be *stylistically similar* to what a real answer would look like. The embedding model does the rest.\n\nHyDE sounds great in the paper. But papers test on academic benchmarks. I wanted to know what happens in the situations I actually encounter when building RAG systems: a conceptual question phrased differently from the document, a question about proprietary data, and a lookup by exact identifier.\n\nSo I built a small notebook to test all three.\n\nI created a small knowledge base with 7 documents deliberately chosen to cover different retrieval challenges. I know 7 documents is tiny, so I treat this as a **behavior probe**, not a benchmark.\n\nThe goal is to isolate how HyDE changes retrieval behavior in different scenarios, not to produce statistically significant numbers.\n\n``` python\n# Knowledge base: seven short documents (\"…\" marks elided text)\ndocuments = [\n    \"The Federal Reserve increased interest rates by 0.25% in Q3…\",\n    \"SpaceX successfully landed the Falcon 9 booster…\",\n    \"The mitochondria is the powerhouse of the cell…\",\n    \"Apple’s latest flagship device features… the new 3nm A17 Pro chip.\",\n    \"The capital of Australia is Canberra…\",\n    \"In Python, the Global Interpreter Lock (GIL) is a mutex…\",\n    \"Acme Corp’s internal policy states that all travel expenses exceeding $500 must be approved by a Level 3 manager or above.\",\n]\n```\n\nThe mix is intentional. Some are public knowledge (the GIL, mitochondria). One is **proprietary** (Acme Corp’s internal policy). Some contain **specific identifiers** (A17 Pro). This lets us test where HyDE helps and where it fails.\n\nI used **Azure OpenAI** with text-embedding-3-small for embeddings and gpt-4o-mini for generating hypothetical documents. The retrieval is brute-force cosine similarity over all documents: no HNSW, no reranking, nothing fancy. 
I wanted to isolate the effect of HyDE itself.\n\nThe two retrieval functions look like this:\n\n**Standard retrieval** embeds the query and compares it to every document:\n\n``` python\ndef standard_retrieve(query, top_k=2):\n    # get_embedding, cosine_similarity, doc_embeddings, and documents\n    # are defined elsewhere in the notebook.\n    query_emb = get_embedding(query)\n    similarities = []\n    for i, doc_emb in enumerate(doc_embeddings):\n        sim = cosine_similarity(query_emb, doc_emb)\n        similarities.append((sim, i))\n    # Sort by descending similarity\n    similarities.sort(reverse=True, key=lambda x: x[0])\n\n    results = []\n    for sim, idx in similarities[:top_k]:\n        results.append({\"score\": sim, \"text\": documents[idx]})\n    return results\n```\n\n**HyDE retrieval** generates a fake document first, then embeds *that*:\n\n``` python\ndef hyde_retrieve(query, top_k=2):\n    # Ask the LLM for a hypothetical answer, then embed it instead of the query\n    fake_doc = generate_hypothetical_document(query)\n    fake_doc_emb = get_embedding(fake_doc)\n    # ... same cosine similarity search as above\n```\n\nThe generate_hypothetical_document function simply prompts the LLM:\n\n``` python\nprompt = f\"\"\"Please write a short paragraph answering the following question or explaining the topic. Do not write an introduction or conclusion, just write the factual answer as if it were a snippet from an article or Wikipedia page.\n\nQuestion/Topic: {query}\"\"\"\n```\n\nNow let’s see what happens.\n\n**Query:** *“Why is my multithreaded code in Python still slow?”*\n\n**Target document:** *“In Python, the Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.”*\n\nThis is the classic case where the user describes a *symptom* (slow multithreaded code), but the document uses *formal terminology* (GIL, mutex, bytecodes). The words barely overlap.\n\n**What happened:**\n\nStandard retrieval found the right document, but the similarity score was modest. 
The query’s vocabulary (“slow”, “multithreaded code”) doesn’t match the document’s language (“mutex”, “bytecodes”) very well.\n\nHyDE generated something like:\n\n“In Python, multithreading is often limited by the Global Interpreter Lock (GIL), which prevents true parallel execution of threads. Even on multi-core systems, only one thread can execute Python bytecode at a time, making CPU-bound multithreaded programs effectively single-threaded…”\n\nThis fake document shares the key vocabulary of the real document (“Global Interpreter Lock (GIL)”, “bytecode”, “threads”). When embedded, it lands much closer to the GIL document in vector space. The similarity score jumped noticeably.\n\n**Why it worked:** The LLM “knows” about the GIL. It generates text that resembles the real document. The embedding of the fake answer is much closer to the embedding of the real answer than the short question ever was.\n\n**Query:** *“What is the manager level required to approve a $600 flight?”*\n\n**Target document:** *“Acme Corp’s internal policy states that all travel expenses exceeding $500 must be approved by a Level 3 manager or above.”*\n\nThis is where things get **interesting**.\n\n**What happened:**\n\nThe LLM has never seen Acme Corp’s internal policy. So it hallucinated something generic:\n\n“Typically, travel expenses require approval from a department manager or supervisor. Many organizations set thresholds where expenses over a certain amount need higher-level authorization, such as a director or VP…”\n\nThis fake document talks about “directors”, “VPs”, and “department managers”, none of which appear in the actual document. The real document says “Level 3 manager.” The hallucinated text pulled the search *away* from the correct answer.\n\nMeanwhile, standard retrieval just embedded the query directly. 
The phrases “manager level”, “$600”, and “approve” had enough semantic overlap with the real document’s language to find it with a reasonable score.\n\n**Why it failed:** HyDE assumes the LLM can generate something that *resembles* the real document. But for proprietary data, the LLM is guessing blind. Its guess has the wrong vocabulary, the wrong specifics, and the wrong structure. The embedding of the hallucination ends up in the wrong neighborhood.\n\n**Query:** *“A17 Pro”*\n\n**Target document:** *“Apple’s latest flagship device features a brushed titanium body, a periscope telephoto lens, and the new 3nm A17 Pro chip.”*\n\n**What happened:**\n\nStandard retrieval embedded “A17 Pro”, just two very specific words. The embedding homed in on exactly the document containing that term. High similarity.\n\nHyDE generated a full paragraph:\n\n“The A17 Pro is Apple’s latest system-on-chip, built on a 3-nanometer process. It features a 6-core CPU with improved performance and efficiency cores, a 6-core GPU with hardware-accelerated ray tracing, and a 16-core Neural Engine capable of 35 trillion operations per second…”\n\nThis is factually decent, but it’s a whole paragraph. The embedding now represents a broad concept (Apple chip specifications) rather than the laser-focused signal of “A17 Pro.” All that extra text dilutes the embedding, spreading the vector across many concepts.\n\nStandard retrieval’s score was higher because its embedding was pure signal, no noise.\n\n**Why it failed:** When the user already has the exact term, there is no semantic gap to bridge. Adding more text tends to hurt. The embedding of two precise words is sharper than the embedding of a paragraph.\n\nHere is the bar chart from the notebook, comparing cosine similarity scores across all three scenarios:\n\nThe pattern is clear: HyDE scored higher on the conceptual query, while standard retrieval scored higher on the proprietary-policy and exact-identifier queries.\n\n*Scores are cosine similarity from text-embedding-3-small. Higher means a closer match to the target document. 
The HyDE scores will vary slightly across runs because the generated document changes with the LLM’s sampling temperature.*\n\nHyDE won one out of three. And importantly, when it lost, it lost for *structural reasons* that apply broadly, not because of bad luck with the prompt.\n\nBased on this experiment and the original paper, HyDE is most useful when:\n\nAfter running these experiments, this is the mental model I use now:\n\n**Use HyDE** when the query is vague, conceptual, or phrased in different vocabulary from the documents.\n\n**Avoid HyDE** when the query contains exact identifiers, internal policy terms, product names, error codes, customer names, or proprietary facts.\n\n**In production, route the query first:**\n\nThis simple routing decision is worth more than picking one approach for all queries.\n\nIn production, you don’t have to choose. Run both:\n\nThis way you get HyDE’s semantic gap bridging *and* standard search’s precision on exact matches, and the reranker handles the sorting.\n\nThis is a toy experiment with 7 documents. A few things this does not test:\n\nThe full notebook is available on [GitHub](https://github.com/ameynarwadkar/medium-notebooks). To run it:\n\n1. Clone the [repo](https://github.com/ameynarwadkar/medium-notebooks) or download the [HyDE.ipynb](https://github.com/ameynarwadkar/medium-notebooks/blob/main/HyDE.ipynb)\n\n2. Set up your Azure OpenAI credentials in a ‘.env’ file:\n\n```\nAZURE_OPENAI_API_KEY=your_key\nAZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/\nAZURE_CHAT_DEPLOYMENT=gpt-4o-mini\nAZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-small\nAZURE_OPENAI_API_VERSION=2025-04-01-preview\n```\n\n3. Install dependencies:\n\n```\npip install openai numpy python-dotenv matplotlib\n```\n\n4. Run the cells. You can swap in your own documents, change the queries, or try different embedding models. The comparison function makes it easy to test any scenario.\n\nHyDE is not a magic retrieval upgrade. 
It is a vocabulary bridge.\n\nWhen the LLM can imagine the right kind of answer, HyDE helps. When the answer lives inside your private data or inside an exact identifier, HyDE starts guessing and retrieval follows the guess.\n\nThat is the real lesson: in RAG, better generation before retrieval is only useful when the generation points in the right direction.\n\nIf you run this notebook on a different dataset, I would be curious whether you see the same pattern, especially on proprietary data. That is where most real-world RAG systems live, and it is exactly where HyDE struggles the most.\n\n[When Does HyDE Help RAG? I Tested 3 Query Types and It Failed on Two](https://pub.towardsai.net/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two-c8946453de34) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two", "canonical_source": "https://pub.towardsai.net/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two-c8946453de34?source=rss----98111c9905da---4", "published_at": "2026-06-29 23:01:01+00:00", "updated_at": "2026-06-29 23:23:13.240665+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-research", "ai-tools", "natural-language-processing"], "entities": ["HyDE", "Gao et al.", "Azure OpenAI", "gpt-4o-mini", "text-embedding-3-small", "Acme Corp", "SpaceX", "Apple"], "alternates": {"html": "https://wpnews.pro/news/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two", "markdown": "https://wpnews.pro/news/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two.md", "text": "https://wpnews.pro/news/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two.txt", "jsonld": 
"https://wpnews.pro/news/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two.jsonld"}}