[ARTICLE · art-14152] src=dev.to pub= topic=generative-ai verified=true sentiment=↑ positive

Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀

A developer has built an end-to-end Retrieval-Augmented Generation (RAG) pipeline using LangChain, Milvus, reranking, and Azure OpenAI to reduce hallucination in large language models. The system retrieves relevant documents from external sources, processes them through chunking and embedding into a vector database, then applies similarity search and reranking before providing context to the LLM for grounded responses. The pipeline supports multiple document formats including PDFs and text files, with metadata tracking for enterprise traceability.

7 min read · published May 26, 2026

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.

Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue: hallucination.

Hallucination means:

The model confidently generates incorrect information.

Example:

Question:

Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer.

This is where RAG (Retrieval-Augmented Generation) becomes useful.

Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.

RAG stands for:

Retrieval-Augmented Generation

Instead of:

Question → LLM → Answer

We do:

Question
   ↓
Retrieve Relevant Documents
   ↓
Provide Context to LLM
   ↓
Generate Grounded Response
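
The flow above can be sketched without any external services. This is only a toy illustration: word overlap stands in for real vector search, the documents are made up, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical, not part of LangChain.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, k=2):
    """Toy retrieval: rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion:\n{question}"

# Made-up company documents, for illustration only
documents = [
    "Acme Corp sells anvils.",
    "Jane Doe is the CEO of Acme Corp.",
    "The office is closed on Sundays.",
]

question = "Who is the CEO of Acme Corp?"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)  # in a real pipeline, this prompt is what gets sent to the LLM
```

The rest of this article replaces each toy piece with a real component: embeddings and Milvus for retrieval, a reranker, and Azure OpenAI for generation.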

This makes responses:

✅ More accurate

✅ Context-aware

✅ Less hallucinated

✅ Enterprise-ready

Documents (PDFs, DOCX, TXT)
            ↓
      Document Loader
            ↓
         Chunking
            ↓
         Embeddings
            ↓
      Vector Database
            ↓
      Similarity Search
            ↓
         Reranking
            ↓
       Context Building
            ↓
            LLM
            ↓
         Final Answer
            ↓
     Monitoring & Evaluation

Before starting, install all dependencies.

pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv

project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt

Never hardcode API keys.

Create a .env file.

NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com

LangChain stores documents in a standardized format.

A document contains two parts: page_content and metadata.

page_content contains the actual text.

Example:

page_content = "Generative AI is growing rapidly."

metadata stores additional information about the document, such as its source file, author, or page count.

Example:

from langchain_core.documents import Document

doc = Document(
    page_content="""
    Generative AI is a subset of Artificial Intelligence
    focused on creating content.
    """,
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10
    }
)

print(doc)
Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)

Why does metadata matter?

In enterprise AI:

You often want:

“Show answer from document X page 5”

Metadata helps with traceability.
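
As a rough sketch of what that traceability can look like, the snippet below turns metadata into a citation string. The `Doc` dataclass is only a stand-in for a LangChain `Document`, and `format_citation` is a hypothetical helper, not a LangChain function.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_citation(doc):
    """Build a human-readable source reference from a document's metadata."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    if page is None:
        return source
    return f"{source}, page {page}"

doc = Doc(
    page_content="RAG grounds answers in retrieved text.",
    metadata={"source": "genai.pdf", "page": 5},
)

print(format_citation(doc))
```

Appending a string like this to each answer lets a reader check the claim against the original file.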

Before processing documents, we must load them.

LangChain provides multiple loaders.

TextLoader is used for .txt files.

from langchain_community.document_loaders import TextLoader

loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8"
)

documents = loader.load()

print(documents)

DirectoryLoader loads multiple files from a folder.

It is useful when you have many documents, for example:

100 PDFs
50 TXT files

from langchain_community.document_loaders import (
    DirectoryLoader,
    TextLoader
)

loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={
        "encoding": "utf-8"
    }
)

documents = loader.load()

print(documents)

Most enterprise RAG systems use PDFs.

LangChain supports several PDF loaders. This example uses PyPDFLoader, which is simple and fast.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    "data/pdf/rag_guide.pdf"
)

documents = loader.load()

print(documents[0])

Each page becomes:

Document(
    page_content="Page text",
    metadata={"page":1}
)

Chunking is one of the most important parts of RAG.

Why?

Because LLMs have token limits.

You cannot send a 500-page PDF to GPT in a single prompt.

Instead, we split documents into smaller chunks.

Bad chunking causes:

❌ poor retrieval

❌ hallucination

❌ context loss

Good chunking improves:

✅ retrieval quality

✅ relevance

✅ accuracy

RecursiveCharacterTextSplitter is the most commonly used splitter.

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)

text_splitter = (
    RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",
            "\n",
            " ",
            ""
        ]
    )
)

chunks = text_splitter.split_documents(
    documents
)

print(len(chunks))

chunk_size sets how large each chunk can be.

Example:

chunk_size=500

means:

At most 500 characters per chunk, because length_function=len counts characters.

chunk_overlap prevents context loss by repeating the end of one chunk at the start of the next.

Example:

Chunk 1:

Artificial Intelligence is...

Chunk 2 starts with:

Intelligence is...

This preserves continuity.
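
To make size and overlap concrete, here is a minimal sliding-window chunker in plain Python. It is a simplified sketch, not how RecursiveCharacterTextSplitter works internally: that splitter also tries to break on the separators listed earlier, while this hypothetical `chunk_text` helper cuts at fixed character positions.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "Artificial Intelligence is transforming how software is built."
chunks = chunk_text(text, chunk_size=30, chunk_overlap=10)

for chunk in chunks:
    print(repr(chunk))

# The last 10 characters of one chunk are the first 10 of the next
print(chunks[0][-10:] == chunks[1][:10])
```

A larger overlap preserves more continuity but stores more duplicated text, so it increases embedding and storage cost.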

Recommended:

chunk_size = 300–800
chunk_overlap = 30–100

Once chunking is completed, we need to convert text into a format machines can understand.

Machine learning models operate on:

Numbers (Vectors)

Not raw text.

This is where Embeddings come in.

Embeddings convert text into numerical vector representations.

Example:

Text:

"Artificial Intelligence"

becomes:

[0.24, -0.76, 0.88, ....]

These vectors help us find text with similar meaning.

Example:

What is AI?

and

Explain Artificial Intelligence

have similar meanings.

Embedding models place them close together in vector space.

Without embeddings:

Search becomes:

Keyword matching

Example:

Searching:

CEO

Only returns exact keyword matches.

With embeddings:

Search becomes:

Semantic Search

Meaning-based retrieval.

Even if wording differs.
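
"Close together in vector space" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embedding vectors come from the embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, NOT real model output
what_is_ai = [0.9, 0.1, 0.2]      # "What is AI?"
explain_ai = [0.8, 0.2, 0.3]      # "Explain Artificial Intelligence"
pizza_recipe = [0.1, 0.9, 0.1]    # an unrelated text

similar = cosine_similarity(what_is_ai, explain_ai)
unrelated = cosine_similarity(what_is_ai, pizza_recipe)

print(round(similar, 3), round(unrelated, 3))
```

The two AI questions score much higher against each other than against the unrelated text, even though they share almost no keywords.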

We will use:

NVIDIA Llama Nemotron Embedding Model

Advantages:

✅ Fast

✅ High-quality embeddings

✅ Good semantic understanding

✅ Free developer tier

import os

from dotenv import load_dotenv

from langchain_nvidia_ai_endpoints import (
    NVIDIAEmbeddings
)
load_dotenv()
embedding_model = (
    NVIDIAEmbeddings(
        model=
        "nvidia/llama-nemotron-embed-vl-1b-v2",

        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)

Before embedding:

We only need:

page_content

from chunks.

texts = [
    chunk.page_content
    for chunk in chunks
]
embedded_vectors = (
    embedding_model.embed_documents(
        texts
    )
)
print(
    len(
        embedded_vectors
    )
)

print(
    len(
        embedded_vectors[0]
    )
)

Output:

50
2048

Meaning:

50 chunks
2048 dimensional vector

User questions also need embeddings.

Example:

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)

Now query and document vectors can be compared.
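
In its simplest form, that comparison is a brute-force scan: score every document vector against the query and keep the best ones. The following is a sketch with made-up 2-dimensional vectors and a hypothetical `top_k` helper. A vector database does the same job, but uses indexes so it does not have to scan every vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]

# Made-up vectors for three chunks and one query
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
query_vec = [0.9, 0.1]

hits = top_k(query_vec, doc_vecs, k=2)
for index, score in hits:
    print(index, round(score, 3))
```

The cost of this scan grows linearly with the number of stored vectors, which is the scaling problem the next section addresses.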

Imagine storing:

Millions of embeddings

in SQL.

Very slow.

Traditional databases are not optimized for:

Similarity Search

We need a vector database.

We will use Milvus.

Why?

✅ Fast retrieval

✅ Open-source

✅ Enterprise-ready

✅ Optimized for vectors

pip install pymilvus

# A local file path as the URI runs Milvus Lite, so no separate server is needed.
from pymilvus import (
    MilvusClient
)
client = MilvusClient(
    uri="milvus_demo.db"
)

print(
    "Connected Successfully"
)

A collection is like:

SQL Table

for vector data.

try:

    client.create_collection(
        collection_name=
        "rag_collection",

        dimension=2048
    )

    print(
        "Collection Created"
    )

except Exception as e:

    print(e)

Embedding vector size:

2048

Collection dimension must match embedding dimension.

Otherwise:

Insertion will fail
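
One way to catch a mismatch early is to check vector lengths before inserting. This is a small sketch using a hypothetical `validate_dimensions` helper, not a Milvus API.

```python
def validate_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has dimension {len(vec)}, expected {expected_dim}"
            )
    return True

# Tiny made-up vectors; in the pipeline, expected_dim would be 2048
good_vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
bad_vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]]

print(validate_dimensions(good_vectors, expected_dim=4))

try:
    validate_dimensions(bad_vectors, expected_dim=4)
except ValueError as error:
    print(error)
```

In the pipeline you could call it as `validate_dimensions(embedded_vectors, 2048)` right before building the rows to insert.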

For each chunk we store its id, embedding vector, and text:

data = []

for i, (
    chunk,
    embedding
) in enumerate(
    zip(
        chunks,
        embedded_vectors
    )
):

    data.append({

        "id": i,

        "vector":
        embedding,

        "text":
        chunk.page_content
    })
client.insert(
    collection_name=
    "rag_collection",

    data=data
)

print(
    "Inserted Successfully"
)
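
The rows above keep only the text. Since metadata matters for traceability, you may also want to carry the source and page along. The sketch below builds such rows in plain Python. It assumes the collection accepts extra fields, and the field names `source` and `page` as well as the `build_rows` helper are hypothetical choices for illustration.

```python
def build_rows(chunks, vectors):
    """Combine (text, metadata) pairs with their embeddings into insertable rows."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    rows = []
    for i, ((text, metadata), vector) in enumerate(zip(chunks, vectors)):
        rows.append({
            "id": i,
            "vector": vector,
            "text": text,
            # Extra fields for traceability (assumed field names)
            "source": metadata.get("source", ""),
            "page": metadata.get("page", -1),
        })
    return rows

# Made-up chunks and tiny made-up vectors
chunks_with_meta = [
    ("RAG retrieves documents.", {"source": "rag_guide.pdf", "page": 1}),
    ("Milvus stores vectors.", {"source": "rag_guide.pdf", "page": 2}),
]
vectors = [[0.1, 0.2], [0.3, 0.4]]

rows = build_rows(chunks_with_meta, vectors)
print(rows[1]["source"], rows[1]["page"])
```

With LangChain chunks, the pairs could be built as `(chunk.page_content, chunk.metadata)`, and the extra fields would then need to be listed in `output_fields` when searching.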

Now comes the real magic.

When user asks:

"What is RAG?"

We embed the query, then search the collection:

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)
results = client.search(

    collection_name=
    "rag_collection",

    data=[
        query_embedding
    ],

    limit=5,

    output_fields=[
        "text"
    ]
)

limit sets how many chunks to retrieve.

Example:

limit=5

returns:

Top 5 relevant chunks

output_fields lists the fields to return.

Example:

"text"

returns chunk text.

for result in results[0]:

    print(
        result["entity"]
        ["text"]
    )

    print(
        "----------------"
    )

Sometimes:

Top results are not the best.

Example:

Query:

What is RAG?

Retrieved:

Machine Learning

instead of:

Retrieval-Augmented Generation

This happens because:

Vector similarity is approximate.

Solution?

Reranking improves retrieval quality.

Instead of trusting:

Top K vectors

We re-score chunks.

Without reranking:

Bad chunks may enter context.

Result:

❌ hallucination

❌ irrelevant answers

With reranking:

Only most relevant chunks are sent to LLM.
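
The two-stage idea can be shown with a toy scorer. In this sketch, word overlap stands in for a real reranking model, which would score each query and chunk pair with a neural network instead. The `overlap_score` and `rerank` names are hypothetical, and the candidate chunks are made up.

```python
import re

def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    q_words = set(re.findall(r"\w+", query.lower()))
    t_words = set(re.findall(r"\w+", text.lower()))
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def rerank(query, candidates, top_n=2, score_fn=overlap_score):
    """Re-score the retrieved candidates and keep the top_n best."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]

# Stage 1 output: chunks in the order vector search returned them (made up)
candidates = [
    "Machine learning models learn patterns from data.",
    "RAG is Retrieval-Augmented Generation: retrieve documents, then generate.",
    "A vector database stores embeddings.",
]

query = "What is RAG?"
best = rerank(query, candidates, top_n=1)
print(best[0])
```

The design point is that stage 1 is cheap and broad, while stage 2 is slower but looks at the query and each chunk together.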

from langchain_nvidia_ai_endpoints import (
    NVIDIARerank
)
reranker = (
    NVIDIARerank(
        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)

Reranker expects:

LangChain Documents

not strings.

from langchain_core.documents import (
    Document
)

retrieved_docs = [

    Document(
        page_content=
        r["entity"]
        ["text"]
    )

    for r in results[0]
]
reranked_docs = (
    reranker.compress_documents(

        documents=
        retrieved_docs,

        query=query
    )
)
for doc in reranked_docs:

    print(
        doc.page_content
    )

The chunks are now ordered by relevance to the query, which typically improves answer quality.

Finally:

We generate answer.

from langchain_openai import (
    AzureChatOpenAI
)
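llm = AzureChatOpenAI(

    azure_endpoint=
    os.getenv(
        "AZURE_OPENAI_ENDPOINT"
    ),

    api_key=
    os.getenv(
        "AZURE_OPENAI_KEY"
    ),

    # Read the deployment name from .env instead of hardcoding it
    azure_deployment=
    os.getenv(
        "AZURE_OPENAI_DEPLOYMENT"
    ),

    # Assumed API version; use one that your Azure resource supports
    api_version="2024-06-01",

    temperature=0.2
)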

A lower temperature, such as:

temperature=0.2

means:

More deterministic, less random answers.

Good for:

RAG systems
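
To see why a low temperature makes output more predictable, here is a small sketch of temperature-scaled softmax, the standard way sampling temperature is applied to a model's token scores. The scores below are made up, and `softmax_with_temperature` is an illustrative helper, not an Azure OpenAI function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

low = softmax_with_temperature(logits, temperature=0.2)
high = softmax_with_temperature(logits, temperature=1.0)

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At 0.2 almost all probability lands on the top-scoring token, so the model rarely wanders. Note that this makes answers more consistent, not automatically more correct: grounding still depends on the retrieved context.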

# Join the reranked chunks into a single context string
context = "\n".join([

    doc.page_content

    for doc in reranked_docs
])
prompt = f"""

Answer ONLY
from context.

Context:

{context}

Question:

{query}

"""

A strict prompt:

Reduces hallucination, though it cannot fully prevent it.

response = llm.invoke(
    prompt
)

print(
    response.content
)
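
The strict prompt can also be wrapped in a small helper that numbers each chunk and tells the model what to do when the context has no answer. This is one possible sketch: the exact wording and the `build_grounded_prompt` name are assumptions, not a prescribed format.

```python
def build_grounded_prompt(query, chunks):
    """Build a prompt that restricts the model to numbered context chunks."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer ONLY from the context below.\n"
        'If the context does not contain the answer, reply "I don\'t know."\n'
        "Cite the chunk numbers you used, for example [1].\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question:\n{query}\n"
    )

# Made-up chunks standing in for the reranked documents
example_chunks = [
    "RAG retrieves documents before generating an answer.",
    "Milvus stores embedding vectors.",
]

example_prompt = build_grounded_prompt("What is RAG?", example_chunks)
print(example_prompt)
```

Numbered chunks make it possible to trace each part of the answer back to a source, and the explicit fallback gives the model an alternative to guessing.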

Production AI systems require monitoring.

Questions:

Did retrieval work?
Did hallucination happen?
Was response relevant?

Langfuse solves this.

pip install langfuse

# Create a Langfuse client from the keys in .env
from langfuse import (
    Langfuse
)
langfuse = Langfuse(

    public_key=
    os.getenv(
        "LANGFUSE_PUBLIC_KEY"
    ),

    secret_key=
    os.getenv(
        "LANGFUSE_SECRET_KEY"
    ),

    host=
    os.getenv(
        "LANGFUSE_BASE_URL"
    )
)
langfuse.create_event(

    name="retrieval",

    input={
        "query":
        query
    },

    output={
        "chunks":
        context
    }
)

We evaluate:

Were chunks relevant?

Was answer grounded?

Did model invent information?

Did answer actually solve query?
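
Before reaching for an LLM-based judge, a crude automatic check can flag obviously ungrounded answers. The sketch below measures what share of the answer's words also appear in the context. It is only a rough heuristic with a hypothetical `grounded_fraction` helper and made-up example text: it cannot detect paraphrased or subtly wrong claims.

```python
import re

def grounded_fraction(answer, context):
    """Share of distinct answer words that also occur in the context (0.0 to 1.0)."""
    answer_words = set(re.findall(r"\w+", answer.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

# Made-up context and answers
sample_context = "RAG retrieves relevant documents and passes them to the LLM."
grounded_answer = "RAG retrieves relevant documents."
invented_answer = "RAG was created in 1995 by NASA."

print(grounded_fraction(grounded_answer, sample_context))
print(round(grounded_fraction(invented_answer, sample_context), 2))
```

A low score is a signal to inspect the answer, not proof of hallucination, so it works best alongside the prompt-based evaluation.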

Example evaluation prompt:

evaluation_prompt = f"""

Evaluate:

Question:
{query}

Answer:
{response.content}

Context:
{context}

Score:
1. faithfulness
2. hallucination
3. relevance
"""

PDFs
 ↓
Document Loaders
 ↓
Chunking
 ↓
Embeddings
 ↓
Milvus
 ↓
Retrieval
 ↓
Reranking
 ↓
Prompt Building
 ↓
GPT-4o
 ↓
Answer
 ↓
Langfuse Monitoring
 ↓
Evaluation

Fix:

✅ Better chunking

✅ Reranking

✅ Hybrid Search

Fix:

✅ Strict prompts

✅ Low temperature

✅ Better retrieval

Fix:

✅ Chunking strategy

✅ Metadata filtering

One chunk → multiple embeddings.

Better retrieval.

Generate hypothetical answer first.

Then search.

Hierarchical retrieval tree.

Better long document understanding.

Route query dynamically.

Token-level retrieval.

Highly accurate.

Basic RAG:

Retrieve → Generate

Production RAG:

Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve

That is how enterprise AI systems are built 🚀