# Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀

> Source: <https://dev.to/sridhar_s_dfc5fa7b6b295f9/master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking-azure-openai-118c>
> Published: 2026-05-26 07:23:51+00:00

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.

Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue: hallucination.

Hallucination means the model confidently generates incorrect information.

Example:

**Question:**

Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer.

This is where **RAG (Retrieval-Augmented Generation)** becomes useful.

Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.

RAG stands for:

**Retrieval-Augmented Generation**

Instead of:

```
Question → LLM → Answer
```

We do:

```
Question
   ↓
Retrieve Relevant Documents
   ↓
Provide Context to LLM
   ↓
Generate Grounded Response
```

This makes responses:

✅ More accurate

✅ Context-aware

✅ Less hallucinated

✅ Enterprise-ready

```
Documents (PDFs, DOCX, TXT)
            ↓
      Document Loading
            ↓
         Chunking
            ↓
         Embeddings
            ↓
      Vector Database
            ↓
      Similarity Search
            ↓
         Reranking
            ↓
       Context Building
            ↓
            LLM
            ↓
         Final Answer
            ↓
     Monitoring & Evaluation
```

Before starting, install all dependencies.

```
pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv
```

A suggested project layout:

```
project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt
```

Never hardcode API keys.

Create a `.env` file.

```
NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
```
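A quick check at startup can catch a missing key before any API call is made. The sketch below uses only the standard library. `require_env` is a hypothetical helper name, and it assumes the variables are already in the process environment (for example after calling `load_dotenv()` from `python-dotenv`).

``` python
import os

# Variable names taken from the .env file above.
REQUIRED_VARS = [
    "NVIDIA_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]


def require_env(names, environ=None):
    """Return the requested variables as a dict, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env(REQUIRED_VARS)` right after loading the `.env` file fails fast with a readable message instead of an authentication error later in the pipeline.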

LangChain stores documents in a standardized format.

A document contains two parts: `page_content` and `metadata`.

`page_content` contains the actual text.

Example:

```
page_content = "Generative AI is growing rapidly."
```

`metadata` stores additional information, such as the source file, the author, and the page count.

Example:

``` python
from langchain_core.documents import Document

doc = Document(
    page_content="""
    Generative AI is a subset of Artificial Intelligence
    focused on creating content.
    """,
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10
    }
)

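print(doc)
```

The printed document looks roughly like this (exact formatting varies by LangChain version):

```
Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)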
```

Why does metadata matter?

In enterprise AI, you often want:

“Show answer from document X page 5”

Metadata helps with traceability.
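As a rough illustration of traceability, the sketch below turns metadata into a citation string. `SimpleDoc` is a stand-in dataclass so the example runs without LangChain, and `format_citation` and the `page` key are illustrative assumptions, not LangChain APIs.

``` python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Minimal stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citation(doc):
    """Build a readable source reference from whatever metadata is present."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    return f"{source}, page {page}" if page is not None else source


doc = SimpleDoc(
    "RAG grounds answers in retrieved text.",
    {"source": "genai.pdf", "page": 5},
)
print(format_citation(doc))  # genai.pdf, page 5
```

Appending a string like this to each answer lets a reader check the claim against the original file.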

Before processing documents, we must load them.

LangChain provides multiple loaders.

`TextLoader` is used for `.txt` files.

``` python
from langchain_community.document_loaders import TextLoader
loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8"
)

documents = loader.load()

print(documents)
```

`DirectoryLoader` loads multiple files from a folder.

It is useful when you have:

```
100 PDFs
50 TXT files
many documents
```

``` python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={
        "encoding":"utf-8"
    }
)

documents = loader.load()

print(documents)
```

Most enterprise RAG systems use PDFs.

LangChain supports several PDF loaders. `PyPDFLoader` is simple and fast.

``` python
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    "data/pdf/rag_guide.pdf"
)

documents = loader.load()

print(documents[0])
```

Each page becomes:

```
Document(
    page_content="Page text",
    metadata={"page":1}
)
```

Chunking is one of the most important parts of RAG.

Why?

Because LLMs have token limits.

You cannot send:

```
500 page PDF
```

to GPT.

Instead:

We split documents into smaller chunks.

Bad chunking causes:

❌ poor retrieval

❌ hallucination

❌ context loss

Good chunking improves:

✅ retrieval quality

✅ relevance

✅ accuracy

`RecursiveCharacterTextSplitter` is the most commonly used splitter.

``` python
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)
text_splitter = (
    RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",
            "\n",
            " ",
            ""
        ]
    )
)

chunks = text_splitter.split_documents(
    documents
)

print(len(chunks))
```

`chunk_size` sets how large each chunk can be.

Example:

```
chunk_size=500
```

means at most 500 characters per chunk.

`chunk_overlap` prevents context loss by repeating the end of one chunk at the start of the next.

Example:

Chunk 1:

```
Artificial Intelligence is...
```

Chunk 2 starts with:

```
Intelligence is...
```

This preserves continuity.

Recommended:

```
chunk_size = 300–800
chunk_overlap = 30–100
```
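To see how the two parameters interact, here is a plain-Python sliding-window chunker. It is a simplified sketch, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and lines); `chunk_text` is a hypothetical name.

``` python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the continuity that overlap is meant to preserve.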

Once chunking is completed, we need to convert text into a format machines can understand.

Machines work with:

```
Numbers (Vectors)
```

Not raw text.

This is where **Embeddings** come in.

Embeddings convert text into numerical vector representations.

Example:

Text:

```
"Artificial Intelligence"
```

becomes:

```
[0.24, -0.76, 0.88, ....]
```

These vectors help us find text with similar meaning.

Example:

```
What is AI?
```

and

```
Explain Artificial Intelligence
```

have similar meanings.

Embedding models place them close together in vector space.

Without embeddings:

Search becomes:

```
Keyword matching
```

Example:

Searching:

```
CEO
```

Only returns exact keyword matches.

With embeddings:

Search becomes:

```
Semantic Search
```

Meaning-based retrieval.

Even if wording differs.

We will use:

```
NVIDIA Llama Nemotron Embedding Model
```

Advantages:

✅ Fast

✅ High-quality embeddings

✅ Good semantic understanding

✅ Free developer tier

``` python
import os

from dotenv import load_dotenv
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Load variables from .env into the environment
load_dotenv()

embedding_model = NVIDIAEmbeddings(
    model="nvidia/llama-nemotron-embed-vl-1b-v2",
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)
```

Before embedding:

We only need:

```
page_content
```

from chunks.

```
texts = [
    chunk.page_content
    for chunk in chunks
]
embedded_vectors = (
    embedding_model.embed_documents(
        texts
    )
)
print(
    len(
        embedded_vectors
    )
)

print(
    len(
        embedded_vectors[0]
    )
)
```

Example output (your chunk count will differ):

```
50
2048
```

Meaning:

```
50 chunks
2048 dimensional vector
```

User questions also need embeddings.

Example:

```
query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)
```

Now query and document vectors can be compared.
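A common way to compare two vectors is cosine similarity. The sketch below computes it in plain Python on made-up two-dimensional vectors; real embeddings have far more dimensions, and the vector values here are purely illustrative.

``` python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]      # toy embedding of the query
similar_doc = [0.9, 0.1]    # toy embedding pointing in nearly the same direction
unrelated_doc = [0.0, 1.0]  # toy embedding pointing in a different direction

print(cosine_similarity(query_vec, similar_doc) > cosine_similarity(query_vec, unrelated_doc))
# True
```

A score near 1 means the vectors point the same way, and a score near 0 means they are unrelated.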

Imagine storing:

```
Millions of embeddings
```

in SQL.

Very slow.

Traditional databases are not optimized for:

```
Similarity Search
```

We need a vector database.

We will use Milvus.

Why?

✅ Fast retrieval

✅ Open-source

✅ Enterprise-ready

✅ Optimized for vectors

```
pip install pymilvus
```

``` python
from pymilvus import (
    MilvusClient
)
client = MilvusClient(
    uri="milvus_demo.db"
)

print(
    "Connected Successfully"
)
```

A collection is like:

```
SQL Table
```

for vector data.

```
try:

    client.create_collection(
        collection_name=
        "rag_collection",

        dimension=2048
    )

    print(
        "Collection Created"
    )

except Exception as e:

    print(e)
```

Embedding vector size:

```
2048
```

Collection dimension must match embedding dimension.

Otherwise:

```
Insertion will fail
```
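A small guard before inserting can surface a mismatch with a clearer message than a failed insert. This is a minimal sketch; `check_dimensions` is a hypothetical helper and is not part of `pymilvus`.

``` python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError for the first vector whose length differs from expected_dim."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Vector {index} has dimension {len(vector)}, expected {expected_dim}"
            )
    return len(vectors)  # number of vectors that passed the check


# Toy vectors with 3 dimensions stand in for real 2048-dimensional embeddings.
print(check_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim=3))  # 2
```

In the pipeline, the call would take `embedded_vectors` and the same dimension passed to `create_collection`.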

We store each chunk's ID, its embedding vector, and its text:

```
data = []

for i, (
    chunk,
    embedding
) in enumerate(
    zip(
        chunks,
        embedded_vectors
    )
):

    data.append({

        "id": i,

        "vector":
        embedding,

        "text":
        chunk.page_content
    })
client.insert(
    collection_name=
    "rag_collection",

    data=data
)

print(
    "Inserted Successfully"
)
```

Now comes the real magic.

When a user asks:

```
"What is RAG?"
```

We do:

```
query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)
results = client.search(

    collection_name=
    "rag_collection",

    data=[
        query_embedding
    ],

    limit=5,

    output_fields=[
        "text"
    ]
)
```

`limit` sets how many chunks to retrieve.

Example:

```
limit=5
```

returns:

```
Top 5 relevant chunks
```

`output_fields` lists the fields to return.

Example:

```
"text"
```

returns chunk text.

```
for result in results[0]:

    print(
        result["entity"]
        ["text"]
    )

    print(
        "----------------"
    )
```
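Each hit also carries a similarity score, which can be used to drop weak matches. The sketch below runs on mock data shaped like the results printed above. It assumes each hit exposes a `distance` field and that a higher value means more similar (as with a cosine metric), so confirm both for your collection. `filter_hits` and the threshold are illustrative.

``` python
def filter_hits(hits, min_score):
    """Keep only hits whose score is at least min_score (higher = more similar)."""
    return [hit for hit in hits if hit["distance"] >= min_score]


# Mock hits; the scores and texts are made up for illustration.
mock_hits = [
    {"id": 0, "distance": 0.82, "entity": {"text": "RAG retrieves documents before generating."}},
    {"id": 7, "distance": 0.41, "entity": {"text": "Machine learning is a field of AI."}},
]

kept = filter_hits(mock_hits, min_score=0.5)
print([hit["entity"]["text"] for hit in kept])
# ['RAG retrieves documents before generating.']
```

A threshold like this is a blunt tool, so it is worth tuning on real queries before relying on it.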

Sometimes:

Top results are not the best.

Example:

Query:

```
What is RAG?
```

Retrieved:

```
Machine Learning
```

instead of:

```
Retrieval-Augmented Generation
```

This happens because:

Vector similarity is approximate.

Solution?

Reranking improves retrieval quality.

Instead of trusting:

```
Top K vectors
```

We re-score chunks.

Without reranking:

Bad chunks may enter context.

Result:

❌ hallucination

❌ irrelevant answers

With reranking:

Only the most relevant chunks are sent to the LLM.
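To make the idea concrete, here is a toy reranker that re-scores candidates by word overlap with the query. It is only a stand-in: real rerankers score each query-document pair with a trained model. `tokenize`, `overlap_score`, and `rerank` are hypothetical names.

``` python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, text):
    """Fraction of query words that also appear in the text (0.0 to 1.0)."""
    query_words = tokenize(query)
    if not query_words:
        return 0.0
    return len(query_words & tokenize(text)) / len(query_words)


def rerank(query, candidates, top_n=3):
    """Sort candidates by overlap score, best first, and keep the top_n."""
    ranked = sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:top_n]


candidates = [
    "Machine learning models learn patterns from data.",
    "RAG is Retrieval-Augmented Generation: what the model retrieves grounds its answer.",
]
print(rerank("What is RAG?", candidates, top_n=1))
```

The second candidate wins because it shares every query word, even though it was listed last.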

``` python
from langchain_nvidia_ai_endpoints import (
    NVIDIARerank
)
reranker = (
    NVIDIARerank(
        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)
```

Reranker expects:

```
LangChain Documents
```

not strings.

```
from langchain_core.documents import (
    Document
)

retrieved_docs = [

    Document(
        page_content=
        r["entity"]
        ["text"]
    )

    for r in results[0]
]
reranked_docs = (
    reranker.compress_documents(

        documents=
        retrieved_docs,

        query=query
    )
)
for doc in reranked_docs:

    print(
        doc.page_content
    )
```

Reranking typically improves the quality of the context sent to the LLM.

Finally, we generate the answer.

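``` python
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    # Read the deployment name from .env instead of hardcoding it
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
    # Azure requires an API version. OPENAI_API_VERSION is not in the .env
    # shown earlier; add it with the version your Azure resource supports.
    api_version=os.getenv("OPENAI_API_VERSION"),
    temperature=0.2,
)
```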

A lower temperature, such as:

```
temperature=0.2
```

makes output more deterministic and less varied, which suits RAG systems.

Next, build the context and the prompt:

```
context = "\n".join([

    doc.page_content

    for doc in reranked_docs
])
prompt = f"""

Answer ONLY
from context.

Context:

{context}

Question:

{query}

"""
```

A strict prompt reduces hallucination, though it cannot eliminate it.
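One way to tighten the prompt further is to number the chunks and give the model an explicit fallback when the context lacks the answer. The sketch below is an assumption about wording, not a tested prompt, and `build_prompt` is a hypothetical helper.

``` python
def build_prompt(query, chunks):
    """Assemble a grounded prompt from the query and a list of context strings."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "If the context does not contain the answer, reply: I don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\n"
    )


example_prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves documents first.", "Then the LLM answers."],
)
print(example_prompt)
```

Numbered chunks also make it easier to ask the model which chunk supports its answer.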

```
response = llm.invoke(
    prompt
)

print(
    response.content
)
```

Production AI systems require monitoring.

Questions:

```
Did retrieval work?
Did hallucination happen?
Was response relevant?
```

Langfuse solves this.

```
pip install langfuse
```

``` python
from langfuse import (
    Langfuse
)
langfuse = Langfuse(

    public_key=
    os.getenv(
        "LANGFUSE_PUBLIC_KEY"
    ),

    secret_key=
    os.getenv(
        "LANGFUSE_SECRET_KEY"
    ),

    host=
    os.getenv(
        "LANGFUSE_BASE_URL"
    )
)
langfuse.create_event(

    name="retrieval",

    input={
        "query":
        query
    },

    output={
        "chunks":
        context
    }
)
```
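To see what such an event can capture, independent of any SDK, here is a standard-library sketch that records one event per pipeline stage along with its latency. `record_stage` and the event fields are illustrative assumptions and do not mirror Langfuse's schema.

``` python
import time


def record_stage(log, name, func, *args):
    """Run func(*args), append an event describing the call to log, return the result."""
    started = time.perf_counter()
    result = func(*args)
    log.append({
        "name": name,
        "input": args,
        "output": result,
        "latency_s": time.perf_counter() - started,
    })
    return result


events = []
# A lambda stands in for a real retrieval function.
chunks_found = record_stage(events, "retrieval", lambda q: [q.upper()], "what is rag?")
print(events[0]["name"], chunks_found)
```

Wrapping retrieval, reranking, and generation the same way gives a per-stage trace that can later be forwarded to a monitoring tool.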

We evaluate:

Were chunks relevant?

Was answer grounded?

Did model invent information?

Did answer actually solve query?
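A cheap lexical check can give a first signal on groundedness before involving an LLM judge. The sketch below computes the share of answer words that also occur in the context. It is a crude heuristic offered as an assumption, not a standard metric, and `grounded_ratio` is a hypothetical name.

``` python
import re


def words(text):
    """Lowercase the text and return its alphanumeric words in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def grounded_ratio(answer, context):
    """Fraction of answer words that also occur in the context (0.0 to 1.0)."""
    answer_words = words(answer)
    if not answer_words:
        return 0.0
    context_words = set(words(context))
    supported = sum(1 for word in answer_words if word in context_words)
    return supported / len(answer_words)


sample_context = "RAG retrieves relevant documents and passes them to the LLM."
print(grounded_ratio("RAG retrieves relevant documents.", sample_context))  # 1.0
print(grounded_ratio("Bananas are yellow.", sample_context))  # 0.0
```

A low ratio does not prove hallucination, and a high one does not prove correctness, but outliers are worth a manual look.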

Example evaluation prompt:

```
evaluation_prompt = f"""

Evaluate:

Question:
{query}

Answer:
{response.content}

Context:
{context}

Score:
1. faithfulness
2. hallucination
3. relevance
"""
```

The complete pipeline:

```
PDFs
 ↓
Loaders
 ↓
Chunking
 ↓
Embeddings
 ↓
Milvus
 ↓
Retrieval
 ↓
Reranking
 ↓
Prompt Building
 ↓
GPT-4o
 ↓
Answer
 ↓
Langfuse Monitoring
 ↓
Evaluation
```

Poor retrieval? Fix:

✅ Better chunking

✅ Reranking

✅ Hybrid Search

Hallucination? Fix:

✅ Strict prompts

✅ Low temperature

✅ Better retrieval

Context loss? Fix:

✅ Chunking strategy

✅ Metadata filtering

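Multi-vector retrieval: one chunk → multiple embeddings, for better retrieval.

HyDE (Hypothetical Document Embeddings): generate a hypothetical answer first, then search with it.

Hierarchical retrieval tree (as in RAPTOR): better long-document understanding.

Query routing: route each query dynamically to the most suitable retriever or index.

Token-level retrieval (as in ColBERT): compares query and document at the token level, which can improve accuracy.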

Basic RAG:

```
Retrieve → Generate
```

Production RAG:

```
Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve
```

That is how enterprise AI systems are built 🚀
