Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀


Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.

Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue: hallucination.

Hallucination means the model confidently generates incorrect information.

Example:

Question:

Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer.

This is where RAG (Retrieval-Augmented Generation) becomes useful.

Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.

RAG stands for:

Retrieval-Augmented Generation

Instead of:

Question → LLM → Answer

We do:

Question
   ↓
Retrieve Relevant Documents
   ↓
Provide Context to LLM
   ↓
Generate Grounded Response
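
The flow above can be sketched in a few lines of plain Python. This is a toy illustration only: word overlap stands in for real vector search, the LLM call is left out, and `DOCS`, `tokenize`, `retrieve`, and `build_prompt` are hypothetical names with made-up sample data, not part of LangChain.

```python
# Toy sketch of the RAG flow: retrieve -> provide context -> (LLM call omitted).
# Word overlap stands in for vector similarity; all names and data are illustrative.

DOCS = [
    "Our company CEO is Jane Doe.",
    "The office is closed on public holidays.",
    "RAG retrieves documents before generating an answer.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Place retrieved context ahead of the question for the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


question = "Who is the CEO of my company?"
prompt = build_prompt(question, retrieve(question, DOCS))
print(prompt)
```

The prompt that reaches the model now carries the retrieved sentence, so the answer can be grounded in it rather than in pretrained knowledge alone.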

This makes responses:

✅ More accurate

✅ Context-aware

✅ Less prone to hallucination

✅ Enterprise-ready

Documents (PDFs, DOCX, TXT)
            ↓
      Document Loader
            ↓
         Chunking
            ↓
         Embeddings
            ↓
      Vector Database
            ↓
      Similarity Search
            ↓
         Reranking
            ↓
       Context Building
            ↓
            LLM
            ↓
         Final Answer
            ↓
     Monitoring & Evaluation

Before starting, install all dependencies.

pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv

project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt

Never hardcode API keys.

Create a .env file.

NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
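
Before running the pipeline, it can help to fail fast when a setting is missing. The sketch below is one possible check, assuming the key names from the .env file above; `REQUIRED_KEYS`, `missing_settings`, and `example_env` are hypothetical names introduced for illustration.

```python
# Sketch: report which required settings are absent or empty.
# In the real pipeline you would call load_dotenv() first and pass os.environ.

REQUIRED_KEYS = [
    "NVIDIA_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]


def missing_settings(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key)]


# A deliberately incomplete example environment.
example_env = {"NVIDIA_API_KEY": "your_key", "AZURE_OPENAI_KEY": ""}
print(missing_settings(example_env, REQUIRED_KEYS))
```

An empty list means every required setting is present.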

LangChain stores documents in a standardized format.

A document contains two parts: page_content and metadata.

page_content contains the actual text.

Example:

page_content = "Generative AI is growing rapidly."

metadata stores additional information, such as the source file, the author, and the page count.

Example:

from langchain_core.documents import Document

doc = Document(
    page_content="""
    Generative AI is a subset of Artificial Intelligence
    focused on creating content.
    """,
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10
    }
)

print(doc)

Output (abbreviated):

Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)

Why does metadata matter?

In enterprise AI, you often want answers such as:

“Show answer from document X page 5”

Metadata makes this kind of traceability possible.
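
As a rough illustration of traceability, the sketch below turns chunk metadata into a citation string. Plain dicts stand in for LangChain Document objects so the example runs without any dependency; `format_citation` and the `"page"` key are assumptions made for this sketch.

```python
# Sketch: turning chunk metadata into a human-readable citation.
# Plain dicts stand in for LangChain Document objects here.

def format_citation(metadata: dict) -> str:
    """Build a citation such as 'genai.pdf, page 5' from metadata."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    return f"{source}, page {page}" if page is not None else source


retrieved = [
    {"page_content": "Generative AI is growing rapidly.",
     "metadata": {"source": "genai.pdf", "page": 5}},
    {"page_content": "RAG grounds answers in documents.",
     "metadata": {"source": "notes.txt"}},
]

for doc in retrieved:
    print(f'{doc["page_content"]} [{format_citation(doc["metadata"])}]')
```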

Before processing documents, we must load them.

LangChain provides multiple loaders.

TextLoader is used for .txt files.

from langchain_community.document_loaders import TextLoader

loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8"
)

documents = loader.load()

print(documents)

DirectoryLoader loads multiple files from a folder.

It is useful when you have:

100 PDFs
50 TXT files
many documents

from langchain_community.document_loaders import (
    DirectoryLoader,
    TextLoader
)

loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={
        "encoding": "utf-8"
    }
)

documents = loader.load()

print(documents)
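
To make the idea of directory loading concrete, here is a dependency-free sketch of what such a loader does conceptually: match files by pattern, read each one, and record where it came from. It is not LangChain's implementation; `load_text_dir` is a hypothetical helper, and the demo writes its own throwaway files.

```python
from pathlib import Path
import tempfile


def load_text_dir(folder: str, pattern: str = "*.txt") -> list[dict]:
    """Read every file matching pattern into a {page_content, metadata} dict."""
    docs = []
    for path in sorted(Path(folder).glob(pattern)):
        docs.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": path.name},
        })
    return docs


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    Path(tmp, "skip.md").write_text("ignored", encoding="utf-8")
    loaded = load_text_dir(tmp)

print([d["metadata"]["source"] for d in loaded])
```

Only the two .txt files are picked up; the .md file does not match the glob pattern.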

Most enterprise RAG systems use PDFs.

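LangChain supports several PDF loaders, including PyPDFLoader and PyMuPDFLoader.

PyPDFLoader is simple and fast.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    "data/pdf/rag_guide.pdf"
)

documents = loader.load()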

print(documents[0])

Each page becomes:

Document(
    page_content="Page text",
    metadata={"page":1}
)

Chunking is one of the most important parts of RAG.

Why?

Because LLMs have token limits.

You cannot send a 500-page PDF to GPT in a single prompt.

Instead, we split documents into smaller chunks.

Bad chunking causes:

❌ poor retrieval

❌ hallucination

❌ context loss

Good chunking improves:

✅ retrieval quality

✅ relevance

✅ accuracy

RecursiveCharacterTextSplitter is the most commonly used splitter.

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    # Try paragraph breaks first, then lines, then words, then characters
    separators=[
        "\n\n",
        "\n",
        " ",
        ""
    ]
)

chunks = text_splitter.split_documents(
    documents
)

print(len(chunks))

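chunk_size sets how large each chunk should be.

Example:

chunk_size=500

means at most 500 characters per chunk, because length_function=len counts characters.

chunk_overlap prevents context loss by repeating the end of one chunk at the start of the next.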

Example:

Chunk 1:

Artificial Intelligence is...

Chunk 2 starts with:

Intelligence is...

This preserves continuity.

Recommended:

chunk_size = 300–800
chunk_overlap = 30–100

Once chunking is completed, we need to convert text into a format machines can understand.

Models operate on numbers (vectors), not raw text.

This is where Embeddings come in.

Embeddings convert text into numerical vector representations.

Example:

Text:

"Artificial Intelligence"

becomes:

[0.24, -0.76, 0.88, ....]

These vectors help us find text with similar meaning.

Example:

What is AI?

and

Explain Artificial Intelligence

have similar meanings.

Embedding models place them close together in vector space.

Without embeddings:

Search becomes:

Keyword matching

Example:

Searching:

CEO

Only returns exact keyword matches.

With embeddings:

Search becomes:

Semantic Search

Meaning-based retrieval.

Even if wording differs.

We will use:

NVIDIA Llama Nemotron Embedding Model

Advantages:

✅ Fast

✅ High-quality embeddings

✅ Good semantic understanding

✅ Free developer tier

import os

from dotenv import load_dotenv
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Read NVIDIA_API_KEY from the .env file
load_dotenv()

embedding_model = NVIDIAEmbeddings(
    model="nvidia/llama-nemotron-embed-vl-1b-v2",
    nvidia_api_key=os.getenv("NVIDIA_API_KEY")
)

Before embedding, we only need page_content from each chunk.

texts = [
    chunk.page_content
    for chunk in chunks
]

embedded_vectors = embedding_model.embed_documents(texts)

print(len(embedded_vectors))     # number of chunks embedded
print(len(embedded_vectors[0]))  # dimension of each vector

Output:

50
2048

Meaning:

50 chunks
One 2048-dimensional vector per chunk

User questions also need embeddings.

Example:

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)

Now query and document vectors can be compared.
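
One common way to compare two vectors is cosine similarity, which measures how closely they point in the same direction. The sketch below uses made-up three-dimensional vectors for readability; the article does not state which metric its search uses, so treat cosine here as an assumption. `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
query_vec = [1.0, 0.0, 1.0]
similar_doc = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_doc = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, similar_doc))
print(cosine_similarity(query_vec, unrelated_doc))
```

A score near 1 means the two texts point the same way in vector space, while a score near 0 means they are unrelated.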

Imagine storing millions of embeddings in a SQL table. A naive similarity search would compare the query against every row, which is very slow.

Traditional databases are not optimized for similarity search.

We need a vector database.

We will use Milvus.

Why?

✅ Fast retrieval

✅ Open-source

✅ Enterprise-ready

✅ Optimized for vectors

pip install pymilvus

from pymilvus import MilvusClient

# A local file path as the URI stores the data in that file
client = MilvusClient(
    uri="milvus_demo.db"
)

print("Connected Successfully")

A collection is like a SQL table for vector data.

try:

    client.create_collection(
        collection_name=
        "rag_collection",

        dimension=2048
    )

    print(
        "Collection Created"
    )

except Exception as e:

    print(e)

Embedding vector size: 2048

The collection dimension must match the embedding dimension.

Otherwise, insertion will fail.
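
A cheap guard against this failure is to check vector lengths before inserting. The sketch below is one way to do that in plain Python; `check_dimensions` is a hypothetical helper, and the vectors are tiny made-up examples rather than real 2048-dimensional embeddings.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> list[int]:
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Tiny made-up vectors; in the article expected_dim would be 2048.
sample_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8]]
bad = check_dimensions(sample_vectors, expected_dim=3)
print(bad)  # [1]
```

An empty result means every vector matches the collection dimension.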

We store an id, the vector, and the text for each chunk:

data = []

for i, (chunk, embedding) in enumerate(
    zip(chunks, embedded_vectors)
):
    data.append({
        "id": i,
        "vector": embedding,
        "text": chunk.page_content
    })

client.insert(
    collection_name="rag_collection",
    data=data
)

print("Inserted Successfully")

Now comes the real magic.

When a user asks:

"What is RAG?"

we embed the question and search the collection for the closest chunks:

query = "What is RAG?"

# Embed the question with the same model used for the chunks
query_embedding = embedding_model.embed_query(query)

results = client.search(
    collection_name="rag_collection",
    data=[query_embedding],
    limit=5,
    output_fields=["text"]
)

limit sets how many chunks to retrieve.

Example:

limit=5

returns the top 5 most relevant chunks.

output_fields lists the fields to return.

Example:

"text"

returns the chunk text.
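
To show what a limit means in practice, here is a brute-force top-k search in plain Python. It scores every stored vector and keeps the best few. This is only a conceptual sketch: a vector database typically relies on an index rather than a full scan, and `top_k` and `stored` are hypothetical names with made-up data.

```python
import heapq


def top_k(query: list[float], vectors: dict[int, list[float]], limit: int) -> list[tuple[int, float]]:
    """Score every stored vector by inner product and keep the best `limit`."""
    scored = (
        (vec_id, sum(q * v for q, v in zip(query, vec)))
        for vec_id, vec in vectors.items()
    )
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


# Made-up 2-dimensional vectors keyed by id.
stored = {
    0: [1.0, 0.0],
    1: [0.5, 0.5],
    2: [0.0, 1.0],
    3: [0.9, 0.1],
}

print(top_k([1.0, 0.0], stored, limit=2))  # [(0, 1.0), (3, 0.9)]
```

Raising the limit returns more candidates, at the cost of passing more (and possibly less relevant) text to later stages.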

for result in results[0]:
    print(result["entity"]["text"])
    print("----------------")
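
The retrieved chunks still need to be combined into a single context for the LLM. The sketch below shows one possible way, assuming hits shaped like the entries read in the loop above (a dict with an "entity" holding the "text" field). `build_context`, `max_chars`, and the sample hits are hypothetical and hand-written for illustration.

```python
def build_context(hits: list[dict], max_chars: int = 1000) -> str:
    """Join hit texts into a numbered context block, stopping at max_chars."""
    parts = []
    used = 0
    for rank, hit in enumerate(hits, start=1):
        text = hit["entity"]["text"]
        if used + len(text) > max_chars:
            break  # keep the context within the character budget
        parts.append(f"[{rank}] {text}")
        used += len(text)
    return "\n".join(parts)


# Hand-written hits in the same shape the loop above reads from results[0].
sample_hits = [
    {"id": 0, "distance": 0.91, "entity": {"text": "RAG retrieves documents."}},
    {"id": 3, "distance": 0.85, "entity": {"text": "Context is passed to the LLM."}},
]

print(build_context(sample_hits))
```

Numbering the chunks makes it easier to ask the model to cite which chunk supported its answer.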

Sometimes:
