# When Does HyDE Help RAG? I Tested 3 Query Types and It Failed on Two

> Source: <https://pub.towardsai.net/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two-c8946453de34?source=rss----98111c9905da---4>
> Published: 2026-06-29 23:01:01+00:00

I ran the same queries through two retrieval pipelines: one standard, one using HyDE and the results split cleanly. HyDE crushed it on conceptual questions where the user’s words didn’t match the document’s language. But when I asked about internal company policy, the LLM hallucinated a fake answer, and the search went looking for *that* instead. And when I searched for an exact product name, all the extra generated text just diluted the signal.

Here is exactly what happened, with code and scores.

HyDE stands for **Hypothetical Document Embeddings**. It comes from a [2022 paper by Gao et al.](https://arxiv.org/abs/2212.10496), and the core idea is deceptively simple.

In standard retrieval, you take the user’s short query, embed it, and search your vector database for similar documents. The problem? A three-word question like *“Why is the sky blue?”* lives in a completely different region of embedding space than a detailed paragraph explaining Rayleigh scattering. The query and the answer just don’t look alike as vectors.

HyDE flips this. Instead of embedding the query, you first ask an LLM to **generate a hypothetical document, **a fake answer to the question. Then you embed *that* fake document and use it to search. A fake answer looks a lot more like a real answer in vector space than a short question ever could.

The diagram above shows the difference. Standard retrieval has two steps: embed the query, search. HyDE has three: generate a fake document, embed the fake document, search.

The fake document doesn’t need to be *correct*. It just needs to be *stylistically similar* to what a real answer would look like. The embedding model does the rest.

HyDE sounds great in the paper. But papers test on academic benchmarks. I wanted to know what happens in the situations I actually encounter when building RAG systems:

So I built a small notebook to test all three.

I created a small knowledge base with 7 documents deliberately chosen to cover different retrieval challenges. I know 7 documents is tiny — I treat this as a **behavior probe**, not a benchmark.

The goal is toisolatehowHyDE changes retrieval behavior in different scenarios, not to produce statistically significant numbers.

```
documents = [ “The Federal Reserve increased interest rates by 0.25% in Q3…”, “SpaceX successfully landed the Falcon 9 booster…”, “The mitochondria is the powerhouse of the cell…”, “Apple’s latest flagship device features… the new 3nm A17 Pro chip.”, “The capital of Australia is Canberra…”, “In Python, the Global Interpreter Lock (GIL) is a mutex…”, “Acme Corp’s internal policy states that all travel expenses  exceeding $500 must be approved by a Level 3 manager or above.”]
```

The mix is intentional. Some are public knowledge (the GIL, mitochondria). One is **proprietary** (Acme Corp’s internal policy). Some contain **specific identifiers** (A17 Pro). This lets us test where HyDE helps and where it fails.

I used **Azure OpenAI** with text-embedding-3-small for embeddings and gpt-4o-mini for generating hypothetical documents. The retrieval is pure cosine similarity — no HNSW, no reranking, nothing fancy. I wanted to isolate the effect of HyDE itself.

The two retrieval functions look like this:

**Standard retrieval** — embed the query, compare to all documents:

``` python
def standard_retrieve(query, top_k=2):    query_emb = get_embedding(query)    similarities = []    for i, doc_emb in enumerate(doc_embeddings):        sim = cosine_similarity(query_emb, doc_emb)        similarities.append((sim, i))    # Sort by descending similarity    similarities.sort(reverse=True, key=lambda x: x[0])        results = []    for sim, idx in similarities[:top_k]:        results.append({"score": sim, "text": documents[idx]})    return results
```

**HyDE retrieval** — generate a fake document first, then embed *that*:

``` python
def hyde_retrieve(query, top_k=2):    fake_doc = generate_hypothetical_document(query)    fake_doc_emb = get_embedding(fake_doc)    # ... same cosine similarity search as above
```

The generate_hypothetical_document function simply prompts the LLM:

```
prompt = f"""Please write a short paragraph answering the following question or explaining the topic. Do not write an introduction or conclusion, just write the factual answer as if it were a snippet from an article or Wikipedia page.Question/Topic: {query}"""
```

Now let’s see what happens.

**Query:** *“Why is my multithreaded code in Python still slow?”*

**Target document:** *“In Python, the Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.”*

This is the classic case where the user describes a *symptom: *slow multithreaded cod, but the document uses *formal terminology = *GIL, mutex, bytecodes. The words barely overlap.

**What happened:**

Standard retrieval found the right document, but the similarity score was modest. The query’s vocabulary (“slow”, “multithreaded code”) doesn’t match the document’s language (“mutex”, “bytecodes”) very well.

HyDE generated something like:

“In Python, multithreading is often limited by the Global Interpreter Lock (GIL), which prevents true parallel execution of threads. Even on multi-core systems, only one thread can execute Python bytecode at a time, making CPU-bound multithreaded programs effectively single-threaded…”

This fake document uses the exact same vocabulary as the real document. When embedded, it lands right next to the GIL document in vector space. The similarity score jumped noticeably.

**Why it worked:** The LLM “knows” about the GIL. It generates text that looks like what the real document looks like. The embedding of the fake answer is much closer to the embedding of the real answer than the short question ever was.

**Query:** *“What is the manager level required to approve a $600 flight?”*

**Target document:** *“Acme Corp’s internal policy states that all travel expenses exceeding $500 must be approved by a Level 3 manager or above.”*

This is where things get** interesting**.

**What happened:**

The LLM has never seen Acme Corp’s internal policy. So it hallucinated something generic:

“Typically, travel expenses require approval from a department manager or supervisor. Many organizations set thresholds where expenses over a certain amount need higher-level authorization, such as a director or VP…”

This fake document talks about “directors”, “VPs”, “department managers” — none of which appear in the actual document. The real document says “Level 3 manager.” The hallucinated text pulled the search *away* from the correct answer.

Meanwhile, standard retrieval just embedded the query directly. The words “manager level”, “$600”, “approve” had enough semantic overlap with the real document’s language to find it with a reasonable score.

**Why it failed:** HyDE assumes the LLM can generate something that *resembles* the real document. But for proprietary data, the LLM is guessing blind. Its guess has the wrong vocabulary, the wrong specifics, and the wrong structure. The embedding of the hallucination ends up in the wrong neighborhood.

**Query:** *“A17 Pro”*

**Target document:** *“Apple’s latest flagship device features a brushed titanium body, a periscope telephoto lens, and the new 3nm A17 Pro chip.”*

**What happened:**

Standard retrieval embedded “A17 Pro” — two words, very specific. The embedding honed in on exactly the document containing that term. High similarity.

HyDE generated a full paragraph:

“The A17 Pro is Apple’s latest system-on-chip, built on a 3-nanometer process. It features a 6-core CPU with improved performance and efficiency cores, a 6-core GPU with hardware-accelerated ray tracing, and a 16-core Neural Engine capable of 35 trillion operations per second…”

This is factually decent, but it’s a whole paragraph. The embedding now represents a broad concept — Apple chip specifications — rather than the laser-focused signal of “A17 Pro.” All that extra text dilutes the embedding, spreading the vector’s attention across many concepts.

Standard retrieval’s score was higher because its embedding was pure signal, no noise.

**Why it failed:** When the user already has the exact term, there is no semantic gap to bridge. Adding more text can only hurt. The embedding of two precise words is sharper than the embedding of a paragraph.

Here is the bar chart from the notebook, comparing cosine similarity scores across all three scenarios:

The pattern is clear:

*Scores are cosine similarity from **text-embedding-3-small. Higher = closer match to the target document. Your numbers will vary slightly across runs due to LLM generation temperature.*

HyDE won one out of three. And importantly, when it lost, it lost for *structural reasons* that apply broadly — not because of bad luck with the prompt.

Based on this experiment and the original paper, HyDE is most useful when:

After running these experiments, this is the mental model I use now:

**Use HyDE** when the query is vague, conceptual, or written in different language than the documents.

**Avoid HyDE** when the query contains exact identifiers, internal policy terms, product names, error codes, customer names, or proprietary facts.

**In production, route the query first:**

This simple routing decision is worth more than picking one approach for all queries.

In production, you don’t have to choose. Run both:

This way you get HyDE’s semantic gap bridging *and* standard search’s precision on exact matches — and the reranker handles the sorting.

This is a toy experiment with 7 documents. A few things this does not test:

The full notebook is available on [GitHub](https://github.com/ameynarwadkar/medium-notebooks). To run it:

1. Clone the [repo](https://github.com/ameynarwadkar/medium-notebooks) or download the[ HyDE.ipynb](https://github.com/ameynarwadkar/medium-notebooks/blob/main/HyDE.ipynb)

2. Set up your Azure OpenAI credentials in a ‘.env’ file:

```
AZURE_OPENAI_API_KEY=your_keyAZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/AZURE_CHAT_DEPLOYMENT=gpt-4o-miniAZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-smallAZURE_OPENAI_API_VERSION=2025–04–01-preview
```

3. Install dependencies:

```
pip install openai numpy python-dotenv matplotlib`
```

4. Run the cells: You can swap in your own documents, change the queries, or try different embedding models. The comparison function makes it easy to test any scenario.

HyDE is not a magic retrieval upgrade. It is a vocabulary bridge.

When the LLM can imagine the right kind of answer, HyDE helps. When the answer lives inside your private data or inside an exact identifier, HyDE starts guessing and retrieval follows the guess.

That is the real lesson: in RAG, better generation before retrieval is only useful when the generation points in the right direction.

If you run this notebook on a different dataset, I would be curious whether you see the same pattern, especially on proprietary data. That is where most real-world RAG systems live, and it is exactly where HyDE struggles the most.

[When Does HyDE Help RAG? I Tested 3 Query Types and It Failed on Two](https://pub.towardsai.net/when-does-hyde-help-rag-i-tested-3-query-types-and-it-failed-on-two-c8946453de34) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.
