{"slug": "master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking", "title": "Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀", "summary": "A developer has built an end-to-end Retrieval-Augmented Generation (RAG) pipeline using LangChain, Milvus, reranking, and Azure OpenAI to reduce hallucination in large language models. The system retrieves relevant documents from external sources, processes them through chunking and embedding into a vector database, then applies similarity search and reranking before providing context to the LLM for grounded responses. The pipeline supports multiple document formats including PDFs and text files, with metadata tracking for enterprise traceability.", "body_md": "Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.\n\nLarge Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue:\n\nHallucination means:\n\nThe model confidently generates incorrect information.\n\nExample:\n\n**Question:**\n\nWho is the CEO of my company?\n\nWithout access to your internal company data, an LLM may generate a completely wrong answer.\n\nThis is where **RAG (Retrieval-Augmented Generation)** becomes useful.\n\nInstead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.\n\nRAG stands for:\n\n**Retrieval-Augmented Generation**\n\nInstead of:\n\n```\nQuestion → LLM → Answer\n```\n\nWe do:\n\n```\nQuestion\n   ↓\nRetrieve Relevant Documents\n   ↓\nProvide Context to LLM\n   ↓\nGenerate Grounded Response\n```\n\nThis makes responses:\n\n✅ More accurate\n\n✅ Context-aware\n\n✅ Less hallucinated\n\n✅ Enterprise-ready\n\n```\nDocuments (PDFs, DOCX, TXT)\n            ↓\n      Document Loading\n            ↓\n         Chunking\n            ↓\n         Embeddings\n           
 ↓\n      Vector Database\n            ↓\n      Similarity Search\n            ↓\n         Reranking\n            ↓\n       Context Building\n            ↓\n            LLM\n            ↓\n         Final Answer\n            ↓\n     Monitoring & Evaluation\n```\n\nBefore starting, install all dependencies.\n\n```\npip install langchain\npip install langchain-community\npip install langchain-core\npip install langchain-openai\npip install langchain-text-splitters\npip install langchain-nvidia-ai-endpoints\npip install pymilvus\npip install pymupdf\npip install pypdf\npip install langfuse\npip install python-dotenv\n```\n\nA suggested project layout:\n\n```\nproject/\n│\n├── data/\n│   ├── pdf/\n│   └── text/\n│\n├── .env\n├── rag_pipeline.py\n└── requirements.txt\n```\n\nNever hardcode API keys.\n\nCreate a `.env` file.\n\n```\nNVIDIA_API_KEY=your_key\nAZURE_OPENAI_ENDPOINT=your_endpoint\nAZURE_OPENAI_KEY=your_key\nAZURE_OPENAI_DEPLOYMENT=gpt-4o\n\nLANGFUSE_PUBLIC_KEY=your_key\nLANGFUSE_SECRET_KEY=your_key\nLANGFUSE_BASE_URL=https://cloud.langfuse.com\n```\n\nLangChain stores documents in a standardized format.\n\nA document contains two parts: `page_content` and `metadata`.\n\n`page_content` contains the actual text.\n\nExample:\n\n```\npage_content = \"Generative AI is growing rapidly.\"\n```\n\n`metadata` stores additional information, such as the source file, author, and page count.\n\nExample:\n\n``` python\nfrom langchain_core.documents import Document\n\ndoc = Document(\n    page_content=\"\"\"\n    Generative AI is a subset of Artificial Intelligence\n    focused on creating content.\n    \"\"\",\n    metadata={\n        \"source\": \"genai.pdf\",\n        \"author\": \"Sridhar\",\n        \"pages\": 10\n    }\n)\n\nprint(doc)\n```\n\nThe resulting document, in abridged form:\n\n```\nDocument(\n    page_content='Generative AI...',\n    metadata={\n        'source': 'genai.pdf',\n        'author': 'Sridhar',\n        'pages': 10\n    }\n)\n```\n\nWhy does metadata matter?\n\nIn enterprise AI, you often want:\n\n“Show answer from document X page 5”\n\nMetadata helps with traceability.\n\nBefore processing 
documents, we must load them.\n\nLangChain provides multiple loaders.\n\n`TextLoader` is used for `.txt` files.\n\n``` python\nfrom langchain_community.document_loaders import TextLoader\n\nloader = TextLoader(\n    \"data/text/sample.txt\",\n    encoding=\"utf-8\"\n)\n\ndocuments = loader.load()\n\nprint(documents)\n```\n\n`DirectoryLoader` loads multiple files from a folder.\n\nIt is useful when you have:\n\n```\n100 PDFs\n50 TXT files\nmany documents\n```\n\n``` python\nfrom langchain_community.document_loaders import (\n    DirectoryLoader,\n    TextLoader\n)\n\nloader = DirectoryLoader(\n    \"data/text\",\n    glob=\"*.txt\",\n    loader_cls=TextLoader,\n    loader_kwargs={\n        \"encoding\": \"utf-8\"\n    }\n)\n\ndocuments = loader.load()\n\nprint(documents)\n```\n\nMany enterprise RAG systems work with PDFs.\n\nLangChain supports several PDF loaders. `PyPDFLoader` is simple and fast.\n\n``` python\nfrom langchain_community.document_loaders import PyPDFLoader\n\nloader = PyPDFLoader(\n    \"data/pdf/rag_guide.pdf\"\n)\n\ndocuments = loader.load()\n\nprint(documents[0])\n```\n\nEach page becomes:\n\n```\nDocument(\n    page_content=\"Page text\",\n    metadata={\"page\":1}\n)\n```\n\nChunking is one of the most important parts of RAG.\n\nWhy?\n\nBecause LLMs have token limits.\n\nYou cannot send:\n\n```\n500 page PDF\n```\n\nto GPT.\n\nInstead, we split documents into smaller chunks.\n\nBad chunking causes:\n\n❌ poor retrieval\n\n❌ hallucination\n\n❌ context loss\n\nGood chunking improves:\n\n✅ retrieval quality\n\n✅ relevance\n\n✅ accuracy\n\n`RecursiveCharacterTextSplitter` is the most commonly used splitter.\n\n``` python\nfrom langchain_text_splitters import (\n    RecursiveCharacterTextSplitter\n)\n\ntext_splitter = (\n    RecursiveCharacterTextSplitter(\n        chunk_size=500,\n        chunk_overlap=50,\n        length_function=len,\n        separators=[\n            \"\\n\\n\",\n            \"\\n\",\n            \" \",\n            \"\"\n        ]\n    )\n)\n\nchunks = text_splitter.split_documents(\n    documents\n)\n\nprint(len(chunks))\n```\n\n`chunk_size` sets how large each chunk should 
be.\n\nExample:\n\n```\nchunk_size=500\n```\n\nmeans:\n\nAt most 500 characters per chunk, because `length_function=len` counts characters.\n\n`chunk_overlap` prevents context loss at chunk boundaries.\n\nExample:\n\nChunk 1:\n\n```\nArtificial Intelligence is...\n```\n\nChunk 2 starts with:\n\n```\nIntelligence is...\n```\n\nThis preserves continuity.\n\nA common starting point:\n\n```\nchunk_size = 300–800\nchunk_overlap = 30–100\n```\n\nOnce chunking is completed, we need to convert text into a format machines can understand.\n\nModels operate on:\n\n```\nNumbers (Vectors)\n```\n\nNot raw text.\n\nThis is where **Embeddings** come in.\n\nEmbeddings convert text into numerical vector representations.\n\nExample:\n\nText:\n\n```\n\"Artificial Intelligence\"\n```\n\nbecomes:\n\n```\n[0.24, -0.76, 0.88, ....]\n```\n\nThese vectors help us find text with similar meaning.\n\nExample:\n\n```\nWhat is AI?\n```\n\nand\n\n```\nExplain Artificial Intelligence\n```\n\nhave similar meanings.\n\nEmbedding models place them close together in vector space.\n\nWithout embeddings:\n\nSearch becomes:\n\n```\nKeyword matching\n```\n\nExample:\n\nSearching:\n\n```\nCEO\n```\n\nOnly returns exact keyword matches.\n\nWith embeddings:\n\nSearch becomes:\n\n```\nSemantic Search\n```\n\nThat is meaning-based retrieval, which works even if the wording differs.\n\nWe will use:\n\n```\nNVIDIA Llama Nemotron Embedding Model\n```\n\nAdvantages:\n\n✅ Fast\n\n✅ High-quality embeddings\n\n✅ Good semantic understanding\n\n✅ Free developer tier\n\n``` python\nimport os\n\nfrom dotenv import load_dotenv\n\nfrom langchain_nvidia_ai_endpoints import (\n    NVIDIAEmbeddings\n)\n\n# Load the API keys defined in .env\nload_dotenv()\n\nembedding_model = (\n    NVIDIAEmbeddings(\n        model=\n        \"nvidia/llama-nemotron-embed-vl-1b-v2\",\n\n        nvidia_api_key=\n        os.getenv(\n            \"NVIDIA_API_KEY\"\n        )\n    )\n)\n```\n\nBefore embedding:\n\nWe only need:\n\n```\npage_content\n```\n\nfrom chunks.\n\n```\ntexts = [\n    chunk.page_content\n    for chunk in chunks\n]\n\nembedded_vectors = (\n    embedding_model.embed_documents(\n        texts\n    )\n)\n\nprint(\n    
len(\n        embedded_vectors\n    )\n)\n\nprint(\n    len(\n        embedded_vectors[0]\n    )\n)\n```\n\nExample output:\n\n```\n50\n2048\n```\n\nMeaning:\n\n```\n50 chunks\n2048-dimensional vectors\n```\n\nUser questions also need embeddings.\n\nExample:\n\n```\nquery = (\n    \"What is RAG?\"\n)\n\nquery_embedding = (\n    embedding_model.embed_query(\n        query\n    )\n)\n```\n\nNow query and document vectors can be compared.\n\nImagine storing:\n\n```\nMillions of embeddings\n```\n\nin SQL.\n\nSimilarity queries over them would be very slow.\n\nTraditional databases are not optimized for:\n\n```\nSimilarity Search\n```\n\nWe need a vector database.\n\nWe will use Milvus.\n\nWhy?\n\n✅ Fast retrieval\n\n✅ Open-source\n\n✅ Enterprise-ready\n\n✅ Optimized for vectors\n\n```\npip install pymilvus\n```\n\n``` python\nfrom pymilvus import (\n    MilvusClient\n)\n\nclient = MilvusClient(\n    uri=\"milvus_demo.db\"\n)\n\nprint(\n    \"Connected Successfully\"\n)\n```\n\nA collection is like:\n\n```\nSQL Table\n```\n\nfor vector data.\n\n```\ntry:\n\n    client.create_collection(\n        collection_name=\n        \"rag_collection\",\n\n        dimension=2048\n    )\n\n    print(\n        \"Collection Created\"\n    )\n\nexcept Exception as e:\n\n    print(e)\n```\n\nEmbedding vector size:\n\n```\n2048\n```\n\nCollection dimension must match embedding dimension.\n\nOtherwise:\n\n```\nInsertion will fail\n```\n\nWe store an ID, the embedding vector, and the chunk text:\n\n```\ndata = []\n\nfor i, (\n    chunk,\n    embedding\n) in enumerate(\n    zip(\n        chunks,\n        embedded_vectors\n    )\n):\n\n    data.append({\n\n        \"id\": i,\n\n        \"vector\":\n        embedding,\n\n        \"text\":\n        chunk.page_content\n    })\n\nclient.insert(\n    collection_name=\n    \"rag_collection\",\n\n    data=data\n)\n\nprint(\n    \"Inserted Successfully\"\n)\n```\n\nNow comes the real magic.\n\nWhen a user asks:\n\n```\n\"What is RAG?\"\n```\n\nWe embed the query and search the collection:\n\n```\nquery = (\n    \"What is RAG?\"\n)\n\nquery_embedding = (\n    embedding_model.embed_query(\n        query\n    
)\n)\n\nresults = client.search(\n\n    collection_name=\n    \"rag_collection\",\n\n    data=[\n        query_embedding\n    ],\n\n    limit=5,\n\n    output_fields=[\n        \"text\"\n    ]\n)\n```\n\n`limit` sets how many chunks to retrieve.\n\nExample:\n\n```\nlimit=5\n```\n\nreturns:\n\n```\nTop 5 relevant chunks\n```\n\n`output_fields` lists the fields to return.\n\nExample:\n\n```\n\"text\"\n```\n\nreturns chunk text.\n\n```\nfor result in results[0]:\n\n    print(\n        result[\"entity\"]\n        [\"text\"]\n    )\n\n    print(\n        \"----------------\"\n    )\n```\n\nSometimes the top results are not the best.\n\nExample:\n\nQuery:\n\n```\nWhat is RAG?\n```\n\nRetrieved:\n\n```\nMachine Learning\n```\n\ninstead of:\n\n```\nRetrieval-Augmented Generation\n```\n\nThis happens because vector similarity is only an approximate measure of relevance.\n\nThe solution is reranking.\n\nReranking improves retrieval quality.\n\nInstead of trusting:\n\n```\nTop K vectors\n```\n\nWe re-score the retrieved chunks against the query.\n\nWithout reranking:\n\nBad chunks may enter context.\n\nResult:\n\n❌ hallucination\n\n❌ irrelevant answers\n\nWith reranking:\n\nOnly the most relevant chunks are sent to the LLM.\n\n``` python\nfrom langchain_nvidia_ai_endpoints import (\n    NVIDIARerank\n)\n\nreranker = (\n    NVIDIARerank(\n        nvidia_api_key=\n        os.getenv(\n            \"NVIDIA_API_KEY\"\n        )\n    )\n)\n```\n\nReranker expects:\n\n```\nLangChain Documents\n```\n\nnot strings.\n\n```\nfrom langchain_core.documents import (\n    Document\n)\n\n# Wrap each retrieved chunk in a Document\nretrieved_docs = [\n\n    Document(\n        page_content=\n        r[\"entity\"]\n        [\"text\"]\n    )\n\n    for r in results[0]\n]\n\nreranked_docs = (\n    reranker.compress_documents(\n\n        documents=\n        retrieved_docs,\n\n        query=query\n    )\n)\n\nfor doc in reranked_docs:\n\n    print(\n        doc.page_content\n    )\n```\n\nRetrieval quality typically improves after this step.\n\nFinally, we generate the answer.\n\n``` python\nfrom langchain_openai import (\n    AzureChatOpenAI\n)\nllm = AzureChatOpenAI(\n\n    
azure_endpoint=\n    os.getenv(\n        \"AZURE_OPENAI_ENDPOINT\"\n    ),\n\n    api_key=\n    os.getenv(\n        \"AZURE_OPENAI_KEY\"\n    ),\n\n    # Read the deployment from .env, falling back to gpt-4o\n    deployment_name=\n    os.getenv(\n        \"AZURE_OPENAI_DEPLOYMENT\",\n        \"gpt-4o\"\n    ),\n\n    # An API version is also required: set OPENAI_API_VERSION\n    # in .env or pass api_version=... here\n\n    temperature=0.2\n)\n```\n\nLower:\n\n```\ntemperature=0.2\n```\n\nmeans:\n\nMore deterministic, less random answers.\n\nGood for:\n\n```\nRAG systems\n```\n\nNext, build the context and the prompt:\n\n```\ncontext = \"\\n\".join([\n\n    doc.page_content\n\n    for doc in reranked_docs\n])\n\nprompt = f\"\"\"\n\nAnswer ONLY\nfrom context.\n\nContext:\n\n{context}\n\nQuestion:\n\n{query}\n\n\"\"\"\n```\n\nA strict prompt reduces hallucination.\n\n```\nresponse = llm.invoke(\n    prompt\n)\n\nprint(\n    response.content\n)\n```\n\nProduction AI systems require monitoring.\n\nQuestions:\n\n```\nDid retrieval work?\nDid hallucination happen?\nWas response relevant?\n```\n\nLangfuse helps answer these questions.\n\n```\npip install langfuse\n```\n\n``` python\nfrom langfuse import (\n    Langfuse\n)\n\nlangfuse = Langfuse(\n\n    public_key=\n    os.getenv(\n        \"LANGFUSE_PUBLIC_KEY\"\n    ),\n\n    secret_key=\n    os.getenv(\n        \"LANGFUSE_SECRET_KEY\"\n    ),\n\n    host=\n    os.getenv(\n        \"LANGFUSE_BASE_URL\"\n    )\n)\n\n# Log the query and the retrieved context\nlangfuse.create_event(\n\n    name=\"retrieval\",\n\n    input={\n        \"query\":\n        query\n    },\n\n    output={\n        \"chunks\":\n        context\n    }\n)\n\n# Send buffered events before the script exits\nlangfuse.flush()\n```\n\nWe evaluate:\n\nWere chunks relevant?\n\nWas answer grounded?\n\nDid model invent information?\n\nDid answer actually solve query?\n\nExample evaluation prompt:\n\n```\nevaluation_prompt = f\"\"\"\n\nEvaluate:\n\nQuestion:\n{query}\n\nAnswer:\n{response.content}\n\nContext:\n{context}\n\nScore:\n1. faithfulness\n2. hallucination\n3. 
relevance\n\"\"\"\n```\n\nThe complete pipeline:\n\n```\nPDFs\n ↓\nLoaders\n ↓\nChunking\n ↓\nEmbeddings\n ↓\nMilvus\n ↓\nRetrieval\n ↓\nReranking\n ↓\nPrompt Building\n ↓\nGPT-4o\n ↓\nAnswer\n ↓\nLangfuse Monitoring\n ↓\nEvaluation\n```\n\nFix:\n\n✅ Better chunking\n\n✅ Reranking\n\n✅ Hybrid Search\n\nFix:\n\n✅ Strict prompts\n\n✅ Low temperature\n\n✅ Better retrieval\n\nFix:\n\n✅ Chunking strategy\n\n✅ Metadata filtering\n\nOne chunk → multiple embeddings.\n\nBetter retrieval.\n\nGenerate hypothetical answer first.\n\nThen search.\n\nHierarchical retrieval tree.\n\nBetter long document understanding.\n\nRoute query dynamically.\n\nToken-level retrieval.\n\nHighly accurate.\n\nBasic RAG:\n\n```\nRetrieve → Generate\n```\n\nProduction RAG:\n\n```\nRetrieve\n→ Rerank\n→ Evaluate\n→ Monitor\n→ Improve\n```\n\nThat is how enterprise AI systems are built 🚀", "url": "https://wpnews.pro/news/master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking", "canonical_source": "https://dev.to/sridhar_s_dfc5fa7b6b295f9/master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking-azure-openai-118c", "published_at": "2026-05-26 07:23:51+00:00", "updated_at": "2026-05-26 07:33:37.836276+00:00", "lang": "en", "topics": ["generative-ai", "large-language-models", "artificial-intelligence", "natural-language-processing", "ai-tools"], "entities": ["GPT-4", "Claude", "LLaMA", "Gemini", "LangChain", "Milvus", "Azure OpenAI"], "alternates": {"html": "https://wpnews.pro/news/master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking", "markdown": "https://wpnews.pro/news/master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking.md", "text": "https://wpnews.pro/news/master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking.txt", "jsonld": "https://wpnews.pro/news/master-rag-systems-build-an-end-to-end-langchain-pipeline-with-milvus-reranking.jsonld"}}