Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀

A developer has built an end-to-end Retrieval-Augmented Generation (RAG) pipeline using LangChain, Milvus, reranking, and Azure OpenAI to reduce hallucination in large language models. The system loads documents from external sources, chunks and embeds them into a vector database, then applies similarity search and reranking before passing context to the LLM for grounded responses. The pipeline supports multiple document formats, including PDFs and text files, with metadata tracking for enterprise traceability.

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI. Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue: hallucination.

Hallucination means the model confidently generates incorrect information.

Example:

Question: Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer. This is where RAG becomes useful. Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides it to the LLM as context before the model generates a response.

Instead of:

Question → LLM → Answer

We do:

Question → Retrieve Relevant Documents → Provide Context to LLM → Generate Grounded Response

This makes responses:

✅ More accurate
✅ Context-aware
✅ Less prone to hallucination
✅ Enterprise-ready

The full pipeline looks like this:

Documents (PDFs, DOCX, TXT) → Document Loading → Chunking → Embeddings → Vector Database → Similarity Search → Reranking → Context Building → LLM → Final Answer → Monitoring & Evaluation

Before starting, install all dependencies.
```bash
pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv
```

The project is laid out as follows:

```
project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt
```

Never hardcode API keys. Create a .env file:

```
NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
```

LangChain stores documents in a standardized format. A document contains two parts: page_content and metadata.

page_content contains the actual text. Example:

page_content = "Generative AI is growing rapidly."

metadata stores additional information about the text, such as its source, author, and page count.

```python
from langchain_core.documents import Document

doc = Document(
    page_content="""
Generative AI is a subset of Artificial Intelligence
focused on creating content.
""",
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10,
    },
)

print(doc)
```

Output:

```
Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)
```

Why does metadata matter? In enterprise AI you often want requests like “Show answer from document X page 5”. Metadata makes that traceability possible.

Before processing documents, we must load them. LangChain provides multiple loaders.

TextLoader is used for .txt files.

```python
from langchain_community.document_loaders import TextLoader

loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8",
)

documents = loader.load()
print(documents)
```

DirectoryLoader loads multiple files from a folder.
It is useful when you have many documents, for example 100 PDFs or 50 TXT files.

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
    "data/text",
    glob="**/*.txt",  # match every .txt file under data/text
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)

documents = loader.load()
print(documents)
```

Most enterprise RAG systems use PDFs. LangChain supports several PDF loaders. PyPDFLoader is simple and fast.

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/pdf/rag_guide.pdf")

documents = loader.load()
print(documents[0])
```

Each page becomes its own document:

```python
Document(page_content="Page text", metadata={"page": 1})
```

Chunking is one of the most important parts of RAG. Why? Because LLMs have token limits. You cannot send a 500-page PDF to GPT in one request. Instead, we split documents into smaller chunks.

Bad chunking causes:

❌ poor retrieval
❌ hallucination
❌ context loss

Good chunking improves:

✅ retrieval quality
✅ relevance
✅ accuracy

RecursiveCharacterTextSplitter is the most commonly used splitter.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)

chunks = text_splitter.split_documents(documents)
print(len(chunks))
```

chunk_size sets how large each chunk may be. Because length_function is len, chunk_size=500 means at most 500 characters per chunk.

chunk_overlap prevents context loss by repeating the end of one chunk at the start of the next. Example: Chunk 1 ends with "Artificial Intelligence is..." and Chunk 2 starts with "Intelligence is...". This preserves continuity across chunk boundaries.

Recommended starting values:

chunk_size = 300–800
chunk_overlap = 30–100

Once chunking is complete, we need to convert the text into a format machines can work with. Models operate on numbers and vectors, not raw text. This is where embeddings come in.

Embeddings convert text into numerical vector representations. Example: the text "Artificial Intelligence" becomes a vector such as [0.24, -0.76, 0.88, ...].

These vectors help us find text with similar meaning. Example: "What is AI?" and "Explain Artificial Intelligence" have similar meanings, so embedding models place them close together in vector space.
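To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made-up toy values chosen for illustration, not real model output; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumed values, not produced by any model).
what_is_ai = [0.9, 0.1, 0.0]
explain_ai = [0.8, 0.2, 0.1]
banana_recipe = [0.0, 0.1, 0.9]

similar = cosine_similarity(what_is_ai, explain_ai)
unrelated = cosine_similarity(what_is_ai, banana_recipe)

# The two AI questions score much higher than the unrelated text.
print(similar > unrelated)
```

A score near 1 means the vectors point in almost the same direction, and a score near 0 means they are unrelated. Vector databases run this kind of comparison at scale.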
Without embeddings, search becomes keyword matching. Example: searching for "CEO" only returns exact keyword matches.

With embeddings, search becomes semantic search: meaning-based retrieval that works even if the wording differs.

We will use the NVIDIA Llama Nemotron embedding model. Advantages:

✅ Fast
✅ High-quality embeddings
✅ Good semantic understanding
✅ Free developer tier

```python
import os

from dotenv import load_dotenv
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

load_dotenv()

embedding_model = NVIDIAEmbeddings(
    model="nvidia/llama-nemotron-embed-vl-1b-v2",
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)
```

Before embedding, we only need the page_content from each chunk.

```python
texts = [chunk.page_content for chunk in chunks]

embedded_vectors = embedding_model.embed_documents(texts)

print(len(embedded_vectors))
print(len(embedded_vectors[0]))
```

Output:

```
50
2048
```

Meaning: 50 chunks, each represented by a 2048-dimensional vector.

User questions also need embeddings. Example:

```python
query = "What is RAG?"

query_embedding = embedding_model.embed_query(query)
```

Now the query and document vectors can be compared.

Imagine storing millions of embeddings in SQL. Search would be very slow, because traditional databases are not optimized for similarity search. We need a vector database. We will use Milvus. Why?

✅ Fast retrieval
✅ Open-source
✅ Enterprise-ready
✅ Optimized for vectors

```bash
pip install pymilvus
```

```python
from pymilvus import MilvusClient

# A local file URI stores the data in a single file on disk.
client = MilvusClient(uri="milvus_demo.db")

print("Connected Successfully")
```

A collection is like a SQL table for vector data.

```python
try:
    client.create_collection(
        collection_name="rag_collection",
        dimension=2048,
    )
    print("Collection Created")
except Exception as e:
    print(e)
```

The embedding vector size is 2048, and the collection dimension must match the embedding dimension. Otherwise, insertion will fail.

For each chunk we store an id, its vector, and its text.

```python
data = []

for i, (chunk, embedding) in enumerate(zip(chunks, embedded_vectors)):
    data.append({
        "id": i,
        "vector": embedding,
        "text": chunk.page_content,
    })

client.insert(
    collection_name="rag_collection",
    data=data,
)

print("Inserted Successfully")
```

Now comes the real magic.
When a user asks "What is RAG?", we embed the query and search the collection for the nearest vectors.

```python
query = "What is RAG?"

query_embedding = embedding_model.embed_query(query)

results = client.search(
    collection_name="rag_collection",
    data=[query_embedding],
    limit=5,
    output_fields=["text"],
)
```

limit sets how many chunks to retrieve. Example: limit=5 returns the top 5 relevant chunks.

output_fields sets which fields to return. Example: "text" returns the chunk text.

```python
for result in results[0]:
    print(result["entity"]["text"])
    print("----------------")
```

Sometimes the top results are not the best. Example: for the query "What is RAG?", the search may retrieve a chunk about Machine Learning instead of one about Retrieval-Augmented Generation. This happens because vector similarity is only an approximate measure of relevance.

The solution is reranking, which improves retrieval quality. Instead of trusting the top K vectors as returned, we re-score the chunks against the query.

Without reranking, bad chunks may enter the context. Result:

❌ hallucination
❌ irrelevant answers

With reranking, only the most relevant chunks are sent to the LLM.

```python
from langchain_nvidia_ai_endpoints import NVIDIARerank

reranker = NVIDIARerank(
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)
```

The reranker expects LangChain Documents, not strings.

```python
from langchain_core.documents import Document

# Wrap each retrieved chunk in a Document so the reranker can score it.
retrieved_docs = [
    Document(page_content=r["entity"]["text"])
    for r in results[0]
]

reranked_docs = reranker.compress_documents(
    documents=retrieved_docs,
    query=query,
)

for doc in reranked_docs:
    print(doc.page_content)
```

The chunks are now ordered by their relevance to the query, which usually gives the LLM a better context.

Finally, we generate the answer.

```python
from langchain_openai import AzureChatOpenAI

# The API version is read from the OPENAI_API_VERSION environment
# variable unless it is passed explicitly as api_version.
llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
    temperature=0.2,
)
```

A lower temperature such as temperature=0.2 makes the output more deterministic and factual, which is good for RAG systems.

```python
context = "\n".join(doc.page_content for doc in reranked_docs)

prompt = f"""
Answer ONLY from context.

Context:
{context}

Question:
{query}
"""
```

A strict prompt helps prevent hallucination.

```python
response = llm.invoke(prompt)

print(response.content)
```

Production AI systems require monitoring. Questions:

Did retrieval work?
Did hallucination happen?
Was the response relevant?

Langfuse helps answer these questions.

```bash
pip install langfuse
```

```python
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_BASE_URL"),
)

# Log the retrieval step: the query that came in and the context that went out.
langfuse.create_event(
    name="retrieval",
    input={"query": query},
    output={"chunks": context},
)
```

We evaluate:

Were the chunks relevant?
Was the answer grounded?
Did the model invent information?
Did the answer actually solve the query?

Example evaluation prompt:

```python
evaluation_prompt = f"""
Evaluate:

Question:
{query}

Answer:
{response.content}

Context:
{context}

Score:
1. faithfulness
2. hallucination
3. relevance
"""
```

The final architecture:

PDFs → Loaders → Chunking → Embeddings → Milvus → Retrieval → Reranking → Prompt Building → GPT-4o → Answer → Langfuse Monitoring → Evaluation

Common problems and their fixes:

Fix:
✅ Better chunking
✅ Reranking
✅ Hybrid Search

Fix:
✅ Strict prompts
✅ Low temperature
✅ Better retrieval

Fix:
✅ Chunking strategy
✅ Metadata filtering

Advanced techniques to explore next:

One chunk → multiple embeddings. Better retrieval.

Generate a hypothetical answer first, then search.

Hierarchical retrieval tree. Better long-document understanding.

Route the query dynamically.

Token-level retrieval. Highly accurate.

Basic RAG: Retrieve → Generate

Production RAG: Retrieve → Rerank → Evaluate → Monitor → Improve

That is how enterprise AI systems are built 🚀
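The production flow above can be sketched as a plain function chain. This is a toy illustration under stated assumptions, not the pipeline from this article: production_rag and every toy_* helper are hypothetical stand-ins with no Milvus, NVIDIA, Azure, or Langfuse calls, so the snippet runs without any API keys. In the real pipeline each stage would be replaced by the corresponding component.

```python
from typing import Callable, Dict, List, Tuple


def production_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str, List[str]], float],
    monitor: Callable[[Dict], None],
) -> Tuple[str, float]:
    """Retrieve -> Rerank -> Generate -> Evaluate -> Monitor."""
    candidates = retrieve(query)
    context = rerank(query, candidates)
    answer = generate(query, context)
    score = evaluate(query, answer, context)
    monitor({"query": query, "context": context, "answer": answer, "score": score})
    return answer, score


# Hypothetical stand-ins for the real components.
corpus = [
    "RAG retrieves documents before generating.",
    "Milvus stores vectors.",
    "Bananas are yellow.",
]
log: List[Dict] = []


def toy_retrieve(query: str) -> List[str]:
    return list(corpus)  # a real retriever would run a vector search


def toy_rerank(query: str, docs: List[str]) -> List[str]:
    words = set(query.lower().split())
    # Score each document by how many query words it shares, keep the best one.
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:1]


def toy_generate(query: str, context: List[str]) -> str:
    return f"Based on the context: {context[0]}"  # a real system would call the LLM


def toy_evaluate(query: str, answer: str, context: List[str]) -> float:
    return 1.0 if context[0] in answer else 0.0  # crude groundedness check


answer, score = production_rag(
    "what is rag", toy_retrieve, toy_rerank, toy_generate, toy_evaluate, log.append
)
print(answer, score)
```

Passing each stage in as a function keeps the stages swappable, so a better reranker or evaluator can be dropped in while the monitoring log continues to record every run for later improvement.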