Retrieval-Augmented Generation (RAG) is a pattern for building applications that can query, understand, and extract insights from your own documents (such as PDFs, resumes, and reports) by retrieving the relevant passages and passing them as context to a Large Language Model (LLM).
This guide walks you through building a complete RAG API step by step, explaining the architecture, the code, and the debugging lessons learned along the way.
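A typical RAG pipeline is divided into two parts:
Ingestion: load the document, split it into chunks, embed each chunk, and store the vectors in PostgreSQL with the pgvector extension.
Retrieval and generation: embed the question, fetch the most similar chunks, and pass them to the LLM as context.
requirements.txt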
Dependencies include FastAPI (API framework), LangChain (orchestration library), Google GenAI integration, and database drivers for PostgreSQL/pgvector.
fastapi
uvicorn
python-dotenv
langchain
langchain-community
langchain-postgres
langchain-google-genai
langchain-text-splitters
pypdf
psycopg[binary]
pgvector
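.env
Environment variables: the database connection string and the Google AI Studio API key. The postgresql+psycopg scheme selects the psycopg 3 driver that psycopg[binary] installs; a bare postgresql:// URL makes SQLAlchemy look for psycopg2, which is not listed in requirements.txt.
DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/ragdb
GOOGLE_API_KEY=YOUR_GEMINI_API_KEY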
app/config.py
Loads variables from .env to make them accessible across modules.
from dotenv import load_dotenv
import os
load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
DATABASE_URL = os.getenv("DATABASE_URL")
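Because os.getenv returns None for a missing variable, a typo in .env tends to surface only later as an obscure connection or authentication error. A minimal sketch of a fail-fast check follows; require_env is a hypothetical helper, not part of the original project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper, not part of the original project.
    """
    value = os.getenv(name)
    if not value:
        # Covers both "unset" (None) and "set but empty" ("").
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

In app/config.py, DATABASE_URL = require_env("DATABASE_URL") could then stand in for the bare os.getenv call, so a missing value stops the application at import time.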
app/database.py
Sets up the SQLAlchemy engine instance to connect to PostgreSQL.
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()

# Shared SQLAlchemy engine for direct access to the PostgreSQL database.
engine = create_engine(
    os.getenv("DATABASE_URL")
)
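SQLAlchemy chooses the database driver from the URL scheme: postgresql:// without a driver suffix selects psycopg2, while postgresql+psycopg:// selects psycopg 3, which is what psycopg[binary] installs. One defensive option is to normalize the URL before it reaches create_engine. The sketch below uses only string handling; with_psycopg3_driver is a hypothetical helper name.

```python
def with_psycopg3_driver(url: str) -> str:
    """Rewrite a PostgreSQL URL so SQLAlchemy uses the psycopg 3 driver.

    Hypothetical helper: URLs that already name a driver are left untouched.
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"Not a database URL: {url!r}")
    if scheme in ("postgresql", "postgres"):
        # No explicit driver given, so pin it to psycopg 3.
        return f"postgresql+psycopg://{rest}"
    return url


print(with_psycopg3_driver("postgresql://postgres:postgres@localhost:5432/ragdb"))
```

The result could be passed to create_engine in place of the raw environment value.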
app/vector_store.py
Instantiates the embeddings model (models/gemini-embedding-2) and connects it to PostgreSQL via PGVector to index and search embeddings.
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_postgres import PGVector

from config import DATABASE_URL

# Embedding model used for both the stored chunks and the incoming questions.
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-2"
)

# Vector store backed by PostgreSQL; metadata is stored as JSONB.
vector_store = PGVector(
    embeddings=embeddings,
    collection_name="financial_documents",
    connection=DATABASE_URL,
    use_jsonb=True,
)
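Conceptually, a similarity search compares the query embedding with every stored embedding and returns the closest rows. The sketch below reproduces that idea in memory, using cosine similarity as the example metric. The three-dimensional vectors and the top_k function are made up for illustration; in the real pipeline the comparison happens inside PostgreSQL through pgvector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def top_k(query: list[float], rows: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        rows,
        key=lambda row_id: cosine_similarity(query, rows[row_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings; real ones have hundreds or thousands of dimensions.
rows = {
    "education": [0.9, 0.1, 0.0],
    "work_history": [0.1, 0.9, 0.1],
    "contact": [0.0, 0.1, 0.9],
}
print(top_k([0.2, 0.8, 0.0], rows, k=2))
```

The query vector points mostly along the second axis, so the "work_history" row ranks first, followed by "education".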
app/ingest.py
This script reads the PDF, sanitizes the text, chunks it, enriches the chunks with metadata, and saves the vectors into the database.
[!NOTE]
PostgreSQL NUL constraint: Standard Python PDF loaders might parse special formatting as \x00 (NUL characters). PostgreSQL handles strings as C-style null-terminated values and its text columns cannot store NUL bytes, so attempting to write a raw \x00 results in a write error. We explicitly remove them before chunking.
Context Enrichment: When chunking splits the document, text from the middle of a page may lack context (like the candidate's name). Prepending "Candidate: {title}" to every chunk helps search queries containing the subject name rank these chunks accurately.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

from vector_store import vector_store


def ingest_pdf(pdf_path: str):
    # Load the PDF into LangChain Document objects.
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()

    # PostgreSQL rejects NUL characters, so strip them before chunking.
    for doc in docs:
        doc.page_content = doc.page_content.replace("\x00", "")

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    chunks = splitter.split_documents(docs)

    # Prefix every chunk with the subject so name-based queries match it.
    for chunk in chunks:
        # The fallback name is hard-coded for this particular resume.
        title = chunk.metadata.get("title") or "Aditya Kumar"
        chunk.page_content = f"Candidate: {title}\n{chunk.page_content}"

    vector_store.add_documents(documents=chunks)
    print(f"Stored {len(chunks)} chunks")


if __name__ == "__main__":
    ingest_pdf("documents/aditya_resume.pdf")
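To make the three ingestion steps concrete, here is a dependency-free sketch of sanitizing, chunking with overlap, and enriching. It uses a plain fixed-size sliding window, which is simpler than RecursiveCharacterTextSplitter (that splitter prefers to break on separators such as paragraph and line boundaries). The function names and the tiny sample string are hypothetical.

```python
def sanitize(text: str) -> str:
    """Drop NUL characters, which PostgreSQL text columns reject."""
    return text.replace("\x00", "")


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks


def enrich(chunk: str, title: str) -> str:
    """Prefix a chunk with the document subject so the name is searchable."""
    return f"Candidate: {title}\n{chunk}"


raw = "abcdefghij\x00klmnopqrst"
chunks = [enrich(c, "Aditya Kumar") for c in split_with_overlap(sanitize(raw), 8, 2)]
for chunk in chunks:
    print(repr(chunk))
```

With a window of 8 and an overlap of 2, the 20 sanitized characters become three chunks ("abcdefgh", "ghijklmn", "mnopqrst"), each sharing two characters with its neighbor and each carrying the candidate prefix.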
app/chat.py
Queries the database for matching chunks, constructs the prompt context, feeds it to the LLM (gemini-2.5-flash), and compiles the source page metadata.
from langchain_google_genai import ChatGoogleGenerativeAI

from vector_store import vector_store

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash"
)


def ask_question(question: str):
    # Retrieve the three chunks closest to the question.
    docs = vector_store.similarity_search(question, k=3)
    context = "\n\n".join(doc.page_content for doc in docs)

    prompt = f"""
You are a resume assistant.
Answer ONLY from the provided context.
If the answer does not exist in the context, say "I don't know".

Context:
{context}

Question:
{question}
"""
    response = llm.invoke(prompt)

    return {
        "answer": response.content,
        # Page and file of every chunk that was passed to the model.
        "source": [
            {
                "page": doc.metadata.get("page"),
                "source": doc.metadata.get("source"),
            }
            for doc in docs
        ],
    }
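Since several of the k retrieved chunks can come from the same page, the source list returned by ask_question may contain duplicates. A hedged sketch of collapsing them while keeping retrieval order follows; unique_sources is a hypothetical helper that takes plain metadata dictionaries (the doc.metadata values) rather than LangChain objects, and the sample metadata is invented.

```python
def unique_sources(metadatas: list[dict]) -> list[dict]:
    """Collapse repeated (source, page) pairs, preserving first-seen order."""
    seen = set()
    result = []
    for meta in metadatas:
        key = (meta.get("source"), meta.get("page"))
        if key not in seen:
            seen.add(key)
            result.append({"page": key[1], "source": key[0]})
    return result


# Invented metadata, shaped like the "page" and "source" keys read above.
retrieved = [
    {"source": "documents/aditya_resume.pdf", "page": 0},
    {"source": "documents/aditya_resume.pdf", "page": 0},
    {"source": "documents/aditya_resume.pdf", "page": 1},
]
print(unique_sources(retrieved))
```

In ask_question, the list comprehension could be swapped for unique_sources([doc.metadata for doc in docs]) if deduplicated citations are preferred.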
app/main.py
Hosts the FastAPI server. It appends the app/ directory to sys.path at startup so that sibling modules such as chat resolve when the server is launched from the project root (for example with uvicorn app.main:app).
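import os
import sys

# Make sibling modules (chat, vector_store, config) importable from the project root.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

from fastapi import FastAPI
from pydantic import BaseModel

from chat import ask_question

app = FastAPI()


class QuestionRequest(BaseModel):
    question: str


# POST rather than GET, because the question travels in a JSON request body.
@app.post("/chat")
def ask(request: QuestionRequest):
    # A plain def lets FastAPI run the blocking LLM call in its thread pool.
    return ask_question(request.question)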
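To exercise the endpoint from Python without extra dependencies, a request can be built with the standard library. This sketch assumes the server listens on http://localhost:8000 (uvicorn's default port) and that /chat accepts a JSON body via POST; build_chat_request and ask are hypothetical helpers.

```python
import json
import urllib.request


def build_chat_request(
    question: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a JSON POST request for the /chat endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> dict:
    """Send the question to a running server and decode the JSON reply."""
    with urllib.request.urlopen(build_chat_request(question)) as response:
        return json.loads(response.read())
```

With the API running, ask("Where does Aditya Kumar work?") would return the dictionary produced by ask_question, including the answer and its sources.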
Debugging learnings
Model names: the models available to an API key can be listed with client.models.list().
Quota limits: calling gemini-2.5-pro on unpaid tiers can result in 429 RESOURCE_EXHAUSTED (quota limit of 0). Switching to gemini-2.5-flash provides a cost-effective, high-quota alternative.
NUL characters: text extracted from PDFs can contain \x00 markers. When these raw strings are written to the database, PostgreSQL will fail. Implementing a simple .replace('\x00', '') filter is mandatory.
Retrieval without context: for a query like "Where does Aditya Kumar work?", chunks containing "Aditya Kumar" (like the footer/header) rank high, while relevant work history chunks lacking his name rank extremely low.
Context enrichment (prepending "Candidate: Aditya Kumar" to each chunk) lets the retriever find the correct chunk and enables accurate generation.
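The effect of enrichment can be illustrated with a deliberately crude stand-in for embedding similarity: counting the words a question and a chunk share. The chunk text and the word_overlap function are invented for this example, and real embedding models capture far more than exact word matches, so this only shows why adding the name gives the retriever something to latch onto.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def word_overlap(question: str, chunk: str) -> int:
    """Number of distinct words the question and the chunk share."""
    return len(words(question) & words(chunk))


question = "Where does Aditya Kumar work?"
# Invented work-history chunk that never mentions the candidate by name.
plain = "Backend engineer at Acme Corp since 2021, building payment APIs."
enriched = f"Candidate: Aditya Kumar\n{plain}"

print(word_overlap(question, plain))
print(word_overlap(question, enriched))
```

The plain chunk shares no words with the question, while the enriched chunk shares two ("aditya" and "kumar"), which mirrors the ranking problem and its fix described above.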