Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.
Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue: hallucination.
Hallucination means:
The model confidently generates incorrect information.
Example:
Question:
Who is the CEO of my company?
Without access to your internal company data, an LLM may generate a completely wrong answer.
This is where RAG (Retrieval-Augmented Generation) becomes useful.
Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.
RAG stands for:
Retrieval-Augmented Generation
Instead of:
Question → LLM → Answer
We do:
Question
↓
Retrieve Relevant Documents
↓
Provide Context to LLM
↓
Generate Grounded Response
This makes responses:
- More accurate
- Context-aware
- Less prone to hallucination
- Enterprise-ready
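The flow above can be sketched in a few lines of plain Python. This is a toy illustration, not the pipeline built later in this guide: the documents are made up, the hypothetical retrieve function ranks by simple word overlap instead of embeddings, and build_prompt stops short of calling a real LLM.

```python
# Made-up documents standing in for internal company data.
DOCS = [
    "Acme Corp's CEO is Jane Doe.",
    "Acme Corp was founded in 1999.",
    "The cafeteria opens at 8 am.",
]


def retrieve(question, docs, k=2):
    """Return the k documents sharing the most words with the question."""
    q_words = set(question.lower().replace("?", "").split())

    def overlap(doc):
        return len(q_words & set(doc.lower().replace(".", "").split()))

    return sorted(docs, key=overlap, reverse=True)[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Answer ONLY from context.\nContext:\n{context}\nQuestion:\n{question}"


question = "Who is the CEO of Acme Corp?"
prompt = build_prompt(question, retrieve(question, DOCS, k=1))
print(prompt)
```

The prompt now carries the fact the model needs, so the answer can be grounded in retrieved text rather than in pretrained knowledge.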
Documents (PDFs, DOCX, TXT)
↓
Document Loaders
↓
Chunking
↓
Embeddings
↓
Vector Database
↓
Similarity Search
↓
Reranking
↓
Context Building
↓
LLM
↓
Final Answer
↓
Monitoring & Evaluation
Before starting, install all dependencies.
pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv
project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt
Never hardcode API keys.
Create a .env file.
NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
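A missing key usually surfaces later as a confusing authentication error, so it can help to check the environment up front. The sketch below uses only the standard library; the helper name missing_keys is hypothetical, and the key list simply mirrors the .env file above.

```python
import os

# Keys this guide expects to find in the environment (mirrors the .env above).
REQUIRED_KEYS = [
    "NVIDIA_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_BASE_URL",
]


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in env."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing settings:", ", ".join(missing))
```

In the real pipeline this check would run right after load_dotenv(), before any client is created.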
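LangChain stores documents in a standardized format.
A document contains two parts: page_content and metadata.
page_content contains the actual text.
Example:
page_content = "Generative AI is growing rapidly."
Metadata stores additional information about that text.
Examples: the source file, the author, and the number of pages.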
from langchain_core.documents import Document

# Build a Document by hand: text plus descriptive metadata
doc = Document(
    page_content="""
Generative AI is a subset of Artificial Intelligence
focused on creating content.
""",
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10,
    },
)
print(doc)

This creates a Document with the following contents:
Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)
Why does metadata matter?
In enterprise AI, you often want answers such as:
"Show answer from document X, page 5"
Metadata makes this kind of traceability possible.
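As a rough illustration of that traceability, the sketch below turns a metadata dictionary into a citation string. It uses plain dictionaries rather than LangChain Documents so it runs on its own. The helper name format_citation and the "page" key are assumptions made for the example.

```python
def format_citation(metadata):
    """Build a human-readable source reference from chunk metadata."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    if page is None:
        return source
    return f"{source}, page {page}"


# Each retrieved chunk carries the metadata of the document it came from.
chunk = {
    "text": "Generative AI is a subset of Artificial Intelligence.",
    "metadata": {"source": "genai.pdf", "page": 5},
}
print(f'{chunk["text"]} [{format_citation(chunk["metadata"])}]')
```

Appending a reference like this to each chunk in the prompt lets the final answer point back to a specific file and page.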
Before processing documents, we must load them.
LangChain provides multiple loaders.
TextLoader is used for .txt files.
from langchain_community.document_loaders import TextLoader

# Load a single UTF-8 text file as a list of Documents
loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8",
)
documents = loader.load()
print(documents)
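DirectoryLoader loads multiple files from a folder.
It is useful when you have many documents, for example 100 PDFs or 50 TXT files.
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file in data/text, using one TextLoader per file
loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)
documents = loader.load()
print(documents)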
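Most enterprise RAG systems use PDFs.
LangChain supports several PDF loaders.
PyPDFLoader is simple and fast.
from langchain_community.document_loaders import PyPDFLoader

# Load a PDF; each page becomes one Document
loader = PyPDFLoader("data/pdf/rag_guide.pdf")
documents = loader.load()
print(documents[0])
Each page becomes a Document along these lines (the page number in metadata is zero-based):
Document(
    page_content="Page text",
    metadata={"source": "data/pdf/rag_guide.pdf", "page": 0}
)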
Chunking is one of the most important parts of RAG.
Why?
Because LLMs have token limits.
You cannot send a 500-page PDF to GPT in a single prompt.
Instead, we split documents into smaller chunks.
Bad chunking causes:
- poor retrieval
- hallucination
- context loss
Good chunking improves:
- retrieval quality
- relevance
- accuracy
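RecursiveCharacterTextSplitter is the most commonly used splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # maximum characters per chunk
    chunk_overlap=50,     # characters shared between neighboring chunks
    length_function=len,  # measure length in characters
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
)
chunks = text_splitter.split_documents(documents)
print(len(chunks))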
chunk_size sets how large each chunk may be.
Example:
chunk_size=500
means at most 500 characters per chunk.
chunk_overlap prevents context loss at chunk boundaries.
Example:
Chunk 1:
Artificial Intelligence is...
Chunk 2 starts with:
Intelligence is...
This preserves continuity.
Recommended:
chunk_size = 300–800
chunk_overlap = 30–100
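To make the two parameters concrete, here is a minimal fixed-width splitter in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on the configured separators). The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the continuity that chunk_overlap is meant to preserve.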
Once chunking is completed, we need to convert text into a format machines can understand.
Models work with numbers (vectors), not raw text.
This is where embeddings come in.
Embeddings convert text into numerical vector representations.
Example:
Text:
"Artificial Intelligence"
becomes:
[0.24, -0.76, 0.88, ...]
These vectors help us find semantically similar text.
Example:
What is AI?
and
Explain Artificial Intelligence
have similar meanings.
Embedding models place them close together in vector space.
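"Close together" needs a measure. One common choice, given here as an illustration rather than as the metric this guide's stack necessarily uses, is cosine similarity between two embedding vectors a and b of dimension d:

```latex
\[
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{d} a_i b_i}{\sqrt{\sum_{i=1}^{d} a_i^{2}} \; \sqrt{\sum_{i=1}^{d} b_i^{2}}}
\]
```

The value ranges from -1 to 1. Texts with similar meanings should score close to 1, and unrelated texts should score noticeably lower.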
Without embeddings:
Search becomes:
Keyword matching
Example:
Searching:
CEO
Only returns exact keyword matches.
With embeddings:
Search becomes:
Semantic Search
Meaning-based retrieval.
Even if wording differs.
We will use the NVIDIA Llama Nemotron embedding model.
Advantages:
- Fast
- High-quality embeddings
- Good semantic understanding
- Free developer tier
import os

from dotenv import load_dotenv
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Read NVIDIA_API_KEY (and the other keys) from .env
load_dotenv()

embedding_model = NVIDIAEmbeddings(
    model="nvidia/llama-nemotron-embed-vl-1b-v2",
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)
Before embedding, we only need page_content from the chunks.
# Extract the raw text of every chunk
texts = [chunk.page_content for chunk in chunks]

# One embedding vector per chunk
embedded_vectors = embedding_model.embed_documents(texts)

print(len(embedded_vectors))     # number of chunks
print(len(embedded_vectors[0]))  # vector dimension
Example output:
50
2048
Meaning:
50 chunks, each represented by a 2048-dimensional vector.
User questions also need embeddings.
Example:
query = "What is RAG?"

# Embed the question with the same model used for the chunks
query_embedding = embedding_model.embed_query(query)
Now query and document vectors can be compared.
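The comparison itself can be sketched without any vector database. The example below scores a query vector against a handful of document vectors with cosine similarity and returns the indices of the best matches. The two-dimensional vectors are made up, and the helper names cosine and top_k are hypothetical. A real system delegates this step to an index instead of scanning every vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k document vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), index) for index, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Tiny made-up vectors; real embeddings have hundreds or thousands of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```

This brute-force scan costs one comparison per stored vector for every query, so it becomes slow as the collection grows.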
Imagine storing millions of embeddings in a traditional SQL database and comparing a query against every row.
That would be very slow.
Traditional databases are not optimized for similarity search.
We need a vector database.
We will use Milvus.
Why?
- Fast retrieval
- Open-source
- Enterprise-ready
- Optimized for vectors
pip install pymilvus
from pymilvus import MilvusClient

# A local file path as the URI runs Milvus Lite, an embedded single-file database
client = MilvusClient(uri="milvus_demo.db")
print("Connected Successfully")
A collection is like a SQL table for vector data.
try:
    # dimension must equal the embedding size (2048 for the model above)
    client.create_collection(
        collection_name="rag_collection",
        dimension=2048,
    )
    print("Collection Created")
except Exception as e:
    print(e)
Embedding vector size:
2048
The collection dimension must match the embedding dimension.
Otherwise, insertion will fail.
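A cheap guard against that failure is to verify vector lengths before inserting. The sketch below is independent of Milvus. The helper name check_dimensions is hypothetical, and the four-dimensional vectors are placeholders.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector does not have expected_dim entries."""
    bad = [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]
    if bad:
        raise ValueError(
            f"{len(bad)} vector(s) do not have dimension {expected_dim}; "
            f"first bad index: {bad[0]}"
        )


# Passes silently: both vectors have the expected length.
check_dimensions([[0.0] * 4, [0.0] * 4], expected_dim=4)

# Fails loudly before anything reaches the database.
try:
    check_dimensions([[0.0] * 4, [0.0] * 3], expected_dim=4)
except ValueError as err:
    print(err)
```

In this guide the call would be check_dimensions(embedded_vectors, 2048), placed just before the insert step.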
We store an id, the embedding vector, and the chunk text for each chunk.
# Build one record per chunk
data = []
for i, (chunk, embedding) in enumerate(zip(chunks, embedded_vectors)):
    data.append({
        "id": i,
        "vector": embedding,
        "text": chunk.page_content,
    })

client.insert(
    collection_name="rag_collection",
    data=data,
)
print("Inserted Successfully")
Now comes the real magic.
When the user asks:
"What is RAG?"
We do:
query = "What is RAG?"
query_embedding = embedding_model.embed_query(query)

# Find the stored vectors closest to the query vector
results = client.search(
    collection_name="rag_collection",
    data=[query_embedding],
    limit=5,
    output_fields=["text"],
)
limit sets how many chunks to retrieve.
Example:
limit=5
returns the top 5 most similar chunks.
output_fields lists the fields to return.
Example:
"text"
returns the chunk text.
# results[0] holds the hits for the first (and only) query vector
for result in results[0]:
    print(result["entity"]["text"])
    print("----------------")
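Each hit also carries a score alongside the text, which can be used to drop weak matches before they reach the prompt. The sketch below runs on mock data shaped like the search result above (one list of hits per query, each hit with id, distance, and entity). It assumes a metric where a higher score means a closer match. The 0.5 threshold and the helper name keep_confident_hits are hypothetical.

```python
# Mock data shaped like a search result: one list of hits per query vector.
mock_results = [[
    {"id": 3, "distance": 0.91, "entity": {"text": "RAG stands for Retrieval-Augmented Generation."}},
    {"id": 7, "distance": 0.55, "entity": {"text": "Machine learning is a subfield of AI."}},
    {"id": 1, "distance": 0.12, "entity": {"text": "The cafeteria opens at 8 am."}},
]]


def keep_confident_hits(results, min_score):
    """Return (text, score) pairs for hits scoring at least min_score.

    Assumes higher scores mean more similar; invert the test for
    distance metrics where smaller is better.
    """
    return [
        (hit["entity"]["text"], hit["distance"])
        for hit in results[0]
        if hit["distance"] >= min_score
    ]


for text, score in keep_confident_hits(mock_results, min_score=0.5):
    print(f"{score:.2f}  {text}")
```

A sensible threshold depends on the embedding model and the metric, so it is best tuned on real queries rather than copied from this example.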
Sometimes the top results are not the best.
Example:
Query:
What is RAG?
Retrieved:
Machine Learning
instead of:
Retrieval-Augmented Generation
This happens because embedding similarity only approximates relevance to the question.
Solution?
Reranking improves retrieval quality.
Instead of trusting the top K vectors as they are, we re-score the retrieved chunks against the query.
Without reranking, bad chunks may enter the context.
Result:
- hallucination
- irrelevant answers
With reranking, only the most relevant chunks are sent to the LLM.
from langchain_nvidia_ai_endpoints import NVIDIARerank

reranker = NVIDIARerank(
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)
The reranker expects LangChain Documents, not strings.
from langchain_core.documents import Document

# Wrap each retrieved chunk in a Document
retrieved_docs = [
    Document(page_content=r["entity"]["text"])
    for r in results[0]
]

# Re-score the candidates against the query
reranked_docs = reranker.compress_documents(
    documents=retrieved_docs,
    query=query,
)
for doc in reranked_docs:
    print(doc.page_content)
The chunks are now ordered by relevance to the query, which usually gives the LLM better context.
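The mechanics of reranking are simple even though the scoring model is not: score every candidate against the query, sort, and keep the top few. The sketch below shows that loop with a deliberately crude word-overlap score standing in for a real reranking model. Both function names are hypothetical, and the stand-in scorer is far weaker than a real reranker.

```python
def overlap_score(query, text):
    """Stand-in relevance score: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query, candidates, score_fn, top_n=3):
    """Re-score candidates against the query and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


candidates = [
    "machine learning is a field of ai",
    "rag combines retrieval with generation",
    "rag is retrieval-augmented generation",
]
print(rerank("what is rag", candidates, overlap_score, top_n=1))
```

Swapping overlap_score for a call to a real reranking model keeps the surrounding loop unchanged.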
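Finally, we generate the answer.
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    # Deployment name comes from .env, falling back to "gpt-4o"
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
    # An API version is also required. OPENAI_API_VERSION is not in the
    # .env shown earlier; add it with a version your Azure resource supports.
    api_version=os.getenv("OPENAI_API_VERSION"),
    temperature=0.2,
)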
A lower temperature, such as temperature=0.2, makes the output more deterministic and less likely to drift away from the context.
This is a good fit for RAG systems.
# Join the reranked chunks into one context string
context = "\n".join(doc.page_content for doc in reranked_docs)

prompt = f"""
Answer ONLY
from context.
Context:
{context}
Question:
{query}
"""
A strict prompt like this reduces hallucination, although it cannot rule it out.
response = llm.invoke(prompt)
print(response.content)
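One possible refinement of the prompt, offered as a sketch rather than as part of the pipeline above, is to number the chunks and tell the model what to do when the context has no answer. The function name build_grounded_prompt and the exact wording are assumptions.

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt with numbered context chunks and an explicit fallback."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, reply \"I don't know\".\n"
        "Cite the chunk numbers you used.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question:\n{question}\n"
    )


print(build_grounded_prompt(
    "What is RAG?",
    ["RAG retrieves documents before generating.", "Milvus stores vectors."],
))
```

The numbered chunks give the model something concrete to cite, and the fallback instruction gives it an alternative to guessing.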
Production AI systems require monitoring.
Key questions:
- Did retrieval work?
- Did hallucination happen?
- Was the response relevant?
Langfuse helps answer them by recording what each step received and produced.
pip install langfuse
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_BASE_URL"),
)

# Record what the retrieval step received and returned
langfuse.create_event(
    name="retrieval",
    input={"query": query},
    output={"chunks": context},
)

# Send buffered events before a short-lived script exits
langfuse.flush()
We evaluate:
- Were the chunks relevant?
- Was the answer grounded in the context?
- Did the model invent information?
- Did the answer actually solve the query?
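Before involving an LLM judge, the grounding question can be approximated with a very rough lexical check: what fraction of the answer's words also appear in the context? This is a crude heuristic under stated assumptions (the function name grounded_fraction is hypothetical, and word overlap is not the same as factual support), but it is cheap enough to log for every request.

```python
import re


def grounded_fraction(answer, context):
    """Fraction of answer tokens that also occur in the context (0.0 to 1.0)."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set(tokenize(context))
    hits = sum(1 for token in answer_tokens if token in context_tokens)
    return hits / len(answer_tokens)


context = "RAG retrieves relevant documents before generating a response."
print(grounded_fraction("RAG retrieves documents", context))
print(grounded_fraction("The CEO is Jane Doe", context))
```

A low score does not prove hallucination and a high score does not prove faithfulness, so a signal like this works best as a trigger for closer review.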
Example evaluation prompt:
evaluation_prompt = f"""
Evaluate:
Question:
{query}
Answer:
{response.content}
Context:
{context}
Score:
1. faithfulness
2. hallucination
3. relevance
"""
PDFs
↓
Loaders
↓
Chunking
↓
Embeddings
↓
Milvus
↓
Retrieval
↓
Reranking
↓
Prompt Building
↓
GPT-4o
↓
Answer
↓
Langfuse Monitoring
↓
Evaluation
Fixes for poor retrieval:
- Better chunking
- Reranking
- Hybrid search
Fixes for hallucination:
- Strict prompts
- Low temperature
- Better retrieval
Fixes for context loss:
- Chunking strategy
- Metadata filtering
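Hybrid search, mentioned among the fixes above, combines a keyword ranking with a vector ranking. One widely used way to merge the two lists is reciprocal rank fusion (RRF), sketched below. This guide does not say which fusion method it has in mind, so treat RRF as an assumption. The document ids are made up, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; earlier ranks contribute more."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_ranking = ["a", "b", "c"]  # e.g. from keyword matching
vector_ranking = ["b", "c", "a"]   # e.g. from similarity search
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
```

Document "b" comes out first because it ranks near the top of both lists, even though neither list is trusted on its own.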
- One chunk → multiple embeddings. Better retrieval.
- Generate a hypothetical answer first, then search with it.
- Hierarchical retrieval tree. Better long-document understanding.
- Route the query dynamically.
- Token-level retrieval. Highly accurate.
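The "hypothetical answer first" idea (often called HyDE) is easy to express as a function that takes the three building blocks as parameters. In the sketch below, generate, embed, and search are placeholders for an LLM call, an embedding call, and a vector search. The lambdas are stubs so the example runs on its own, and the name hyde_search is hypothetical.

```python
def hyde_search(question, generate, embed, search, k=5):
    """Search with the embedding of a generated answer instead of the raw question."""
    hypothetical = generate(f"Write a short passage that answers: {question}")
    return search(embed(hypothetical), k)


# Stub components standing in for a real LLM, embedding model, and vector store.
def fake_generate(prompt):
    return "RAG retrieves documents and passes them to an LLM."


def fake_embed(text):
    return [float(len(text))]


def fake_search(vector, k):
    return [("chunk-1", vector)][:k]


print(hyde_search("What is RAG?", fake_generate, fake_embed, fake_search, k=1))
```

The intuition is that a generated passage tends to resemble the stored chunks more closely than a short question does, at the cost of one extra LLM call per query.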
Basic RAG:
Retrieve → Generate
Production RAG:
Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve
That is how enterprise AI systems are built.