# Curing LLM Hallucinations: Building a Production-Grade Medical RAG with PubMed and Hybrid Search

> Source: <https://dev.to/beck_moulton/curing-llm-hallucinations-building-a-production-grade-medical-rag-with-pubmed-and-hybrid-search-3fjc>
> Published: 2026-05-31 00:41:00+00:00

Ever asked an AI for a medical dosage recommendation only to get a confident-sounding but dangerously incorrect answer? In the world of healthcare, **LLM hallucinations** aren't just "bugs"—they are critical risks. To bridge the gap between static training data and the rapidly evolving world of clinical research, we need a robust **Medical RAG (Retrieval-Augmented Generation)** system.

By implementing **Hybrid Search** (combining the keyword precision of BM25 with the semantic depth of Vector Search), we can ground our models in peer-reviewed evidence from the **PubMed API**. In this guide, we will leverage **LlamaIndex**, **Pinecone**, and **Elasticsearch** to build a Clinical Decision Support system that prioritizes accuracy and real-time knowledge retrieval. 🚀

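Standard RAG pipelines often rely solely on cosine similarity in a vector space. Medical queries are different: exact terms matter as much as overall meaning, which is why this system pairs keyword matching with semantic search.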

Here is how our system handles a medical query, ensuring we get the best of both worlds: keyword matching and semantic context.

``` mermaid
graph TD
  User((User Query)) --> Router{LlamaIndex Router}

  subgraph Retrieval_Layer [Hybrid Search Layer]
    Router -->|Keyword Search| ES[Elasticsearch - BM25]
    Router -->|Semantic Search| PC[Pinecone - Vector DB]
  end

  ES -->|Top K Results| Reranker[Cross-Encoder Re-ranker]
  PC -->|Top K Results| Reranker

  subgraph Knowledge_Source [Data Ingestion]
    PM[PubMed API] --> Clean[Data Cleaning]
    Clean --> ES
    Clean --> PC
  end

  Reranker -->|Contextual Chunks| LLM[GPT-4o / Clinical LLM]
  LLM -->|Evidence-Based Response| Output((Final Answer + Citations))
```
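To make the keyword branch of the diagram more concrete, here is a minimal pure-Python sketch of BM25 scoring. It is not Elasticsearch's implementation, just the same family of formula. The toy documents, the whitespace `tokenize` helper, and the `k1`/`b` values are assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (assumption; real analyzers do much more)
    return text.lower().split()

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up example strings, not real PubMed content
docs = [
    "spironolactone lowers blood pressure in resistant hypertension",
    "lifestyle changes help manage high blood pressure",
    "statins reduce cholesterol in adults",
]
scores = bm25_scores("spironolactone hypertension", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first document contains the exact query terms, so it is the only one with a non-zero score. The second document is about the same topic but shares no query word, which is the kind of match the vector branch is there to catch.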

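To follow this tutorial, you'll need Python plus the tools used below: LlamaIndex, Biopython, a Pinecone index, an Elasticsearch instance, and access to an LLM such as GPT-4o.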

We use the PubMed API to fetch the latest research papers. Using `Biopython` or direct REST calls, we extract the title and abstract.

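``` python
from Bio import Entrez
from llama_index.core import Document

def fetch_pubmed_abstracts(query, max_results=10):
    # NCBI asks for a contact email on E-utilities requests
    Entrez.email = "your@email.com"

    # Step 1: search PubMed for matching article IDs
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    ids = record["IdList"]
    if not ids:
        return []  # nothing matched, so skip the fetch call

    # Step 2: fetch the records for those IDs as XML
    handle = Entrez.efetch(db="pubmed", id=",".join(ids), rettype="abstract", retmode="xml")
    articles = Entrez.read(handle)
    handle.close()

    documents = []
    for article in articles["PubmedArticle"]:
        info = article["MedlineCitation"]["Article"]
        # Structured abstracts come in several sections; join them all
        # instead of keeping only the first one
        sections = info.get("Abstract", {}).get("AbstractText", [])
        abstract = " ".join(str(section) for section in sections)
        if not abstract:
            continue  # skip records that have no abstract text
        title = str(info["ArticleTitle"])
        documents.append(Document(text=abstract, metadata={"title": title, "source": "PubMed"}))
    return documents
```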
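If you take the direct REST route instead of `Biopython`, you get the XML back as text and have to parse it yourself. Here is a minimal sketch using only the standard library. `SAMPLE_XML` is a hand-written, heavily trimmed stand-in for an efetch response (placeholder PMID and text), and `parse_pubmed_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for an efetch XML response (assumption)
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>Example trial title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Background text.</AbstractText>
          <AbstractText Label="RESULTS">Results text.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def parse_pubmed_xml(xml_text):
    """Extract PMID, title, and the full abstract from each article."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        parts = []
        for section in article.iter("AbstractText"):
            # itertext() also collects text nested inside inline markup
            text = "".join(section.itertext()).strip()
            label = section.get("Label")
            parts.append(f"{label}: {text}" if label else text)
        records.append({
            "pmid": article.findtext(".//PMID", default=""),
            "title": article.findtext(".//ArticleTitle", default=""),
            "abstract": " ".join(parts),
        })
    return records

records = parse_pubmed_xml(SAMPLE_XML)
print(records[0]["abstract"])
```

Keeping the section labels in the text is a design choice: it tells the LLM whether a sentence came from the background or the results of a study.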

The secret sauce is the `QueryFusionRetriever`. It takes results from both **Elasticsearch** (BM25) and **Pinecone** (Vector) and merges them using Reciprocal Rank Fusion (RRF).

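``` python
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.vector_stores.pinecone import PineconeVectorStore

# 1. Vector Store (Pinecone)
# `pinecone_index` is assumed to be an existing Pinecone index handle
# that already holds the embedded PubMed abstracts
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
vector_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
vector_retriever = vector_index.as_retriever(similarity_top_k=5)

# 2. Keyword Store (BM25)
# Note: BM25Retriever scores `nodes` in memory, so it stands in here for
# the Elasticsearch BM25 index; `nodes` are the parsed PubMed chunks
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

# 3. Hybrid Fusion
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    num_queries=1,  # Set to >1 for query expansion/rewrite
    mode="reciprocal_rerank",  # merge both ranked lists with RRF
    similarity_top_k=5,  # number of fused results to return
)
```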
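To show what the `reciprocal_rerank` mode is doing conceptually, here is a minimal standalone sketch of Reciprocal Rank Fusion. It is not LlamaIndex's internal code. The function name, the document IDs, and the tie-breaking rule are assumptions, and `k=60` is simply the constant commonly used for RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top hit.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically (assumption)
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n] if top_n else fused

# Hypothetical result lists from the two retrievers
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF only looks at ranks, it never has to compare a BM25 score with a cosine similarity, which live on different scales. Documents that both retrievers agree on (`doc_b`, `doc_a`) rise to the top.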

Finally, we feed the fused context into the LLM. We enforce a strict prompt template that requires the model to cite the "Source Title" from the metadata.

``` python
from llama_index.core import PromptTemplate
from llama_index.core.query_engine import RetrieverQueryEngine

prompt_template = (
    "You are a specialized Medical Assistant.\n"
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Always cite your sources using the 'title' metadata.\n"
    "If the answer is not in the context, state that you do not know.\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    # Pass the template explicitly; otherwise the default QA prompt is used
    text_qa_template=PromptTemplate(prompt_template),
)

response = query_engine.query("What are the latest treatments for drug-resistant hypertension?")
print(response)
```
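To see what the prompt template turns into once context is filled in, and how you might check that an answer really cites a retrieved source, here is a small framework-free sketch. `build_prompt` and `cited_titles` are hypothetical helpers, the chunk dictionaries are placeholders, and the `answer` string stands in for a real LLM reply.

```python
PROMPT_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Always cite your sources using the 'title' metadata.\n"
    "If the answer is not in the context, state that you do not know.\n"
    "Query: {query_str}\n"
    "Answer: "
)

def build_prompt(query, chunks):
    """Render retrieved chunks (dicts with 'title' and 'text') into the prompt."""
    blocks = [f"title: {c['title']}\n{c['text']}" for c in chunks]
    return PROMPT_TEMPLATE.format(context_str="\n\n".join(blocks), query_str=query)

def cited_titles(answer, chunks):
    """Return the retrieved titles that appear verbatim in the answer."""
    return [c["title"] for c in chunks if c["title"] in answer]

# Placeholder chunks, not real PubMed records
chunks = [
    {"title": "Placeholder Study A", "text": "Placeholder abstract text A."},
    {"title": "Placeholder Study B", "text": "Placeholder abstract text B."},
]

prompt = build_prompt("What does study A report?", chunks)
answer = "According to 'Placeholder Study A', ..."  # stand-in for an LLM reply
found = cited_titles(answer, chunks)
print(found)  # ['Placeholder Study A']
```

An empty result from `cited_titles` is a cheap signal that the answer should be flagged or retried rather than shown as evidence-based.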

Building a prototype is easy, but making it production-ready for a clinical environment involves handling PII (Personally Identifiable Information), ensuring HIPAA compliance, and implementing sophisticated "Agentic RAG" loops.

For more advanced patterns on architecting healthcare AI and production-ready data pipelines, I highly recommend checking out the technical deep dives at the **WellAlly Blog**. They cover everything from optimizing embedding models for medical jargon to handling large-scale document ingestion workflows.

By combining the precision of **Elasticsearch** with the semantic capabilities of **Pinecone**, and orchestrating it all via **LlamaIndex**, we've built a system that doesn't just "guess"—it "researches."

In medicine, the stakes are high. Moving from a generic LLM to a **PubMed-grounded Hybrid RAG** is the first step toward building AI tools that doctors can actually trust. 🩺💻

**What are your thoughts?** Have you struggled with hallucination in specific domains? Drop a comment below or share your favorite re-ranking strategy!
