Building a Production-Ready RAG Application with LangChain, pgvector, and Gemini

A developer built a production-ready Retrieval-Augmented Generation (RAG) application using LangChain, pgvector, and Google's Gemini models. The application ingests PDF documents, splits them into chunks enriched with context, stores embeddings in PostgreSQL via pgvector, and queries them with Gemini to generate answers. Key learnings include handling PostgreSQL NUL character constraints and improving search relevance by prepending subject names to chunks.

Retrieval-Augmented Generation (RAG) is a powerful pattern for building applications that can query, understand, and extract insights from your custom documents (PDFs, resumes, reports) by feeding them as context to Large Language Models (LLMs). This guide walks you through building a complete RAG API step by step, explaining the architecture, code, and debugging learnings along the way.

A typical RAG pipeline is divided into two parts: ingestion (load the document, chunk it, embed the chunks, store the vectors) and retrieval with generation (find the chunks most similar to a question and pass them to the LLM as context). The storage layer here is PostgreSQL with the pgvector extension.

requirements.txt

Dependencies include FastAPI (API framework), LangChain (orchestration library), the Google GenAI integration, and database drivers for PostgreSQL/pgvector.

```
fastapi
uvicorn
python-dotenv
langchain
langchain-community
langchain-postgres
langchain-google-genai
langchain-text-splitters
pypdf
psycopg[binary]
pgvector
```

.env (Environment Variables)

Store database credentials and the Google AI Studio API key. The `postgresql+psycopg://` scheme selects the psycopg 3 driver installed by `psycopg[binary]`; with a bare `postgresql://` URL, SQLAlchemy defaults to psycopg2.

```
DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/ragdb
GOOGLE_API_KEY=YOUR_GEMINI_API_KEY
```

app/config.py

Loads variables from .env to make them accessible across modules.

```python
from dotenv import load_dotenv
import os

# Read .env and copy its entries into the process environment
load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
DATABASE_URL = os.getenv("DATABASE_URL")
```

app/database.py

Sets up the SQLAlchemy engine instance to connect to PostgreSQL.

```python
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os

load_dotenv()

# Engine shared by any module that needs direct SQL access
engine = create_engine(os.getenv("DATABASE_URL"))
```

app/vector_store.py

Instantiates the embeddings model (models/gemini-embedding-2) and connects it to PostgreSQL via PGVector to index and search embeddings.
```python
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_postgres import PGVector
from config import DATABASE_URL

# Set up the embeddings generator
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-2")

# Connect embeddings to PostgreSQL collection
vector_store = PGVector(
    embeddings=embeddings,
    collection_name="financial_documents",
    connection=DATABASE_URL,
    use_jsonb=True,
)
```

app/ingest.py

This script reads the PDF, sanitizes the text, chunks it, enriches the chunks with metadata, and saves the vectors into the database.

NOTE (PostgreSQL NUL constraint): Standard Python PDF loaders might parse special formatting as `\x00` (NUL) characters. Since PostgreSQL uses C-style null-terminated strings, attempting to write a raw `\x00` results in a write error. We explicitly remove them before chunking.

Context Enrichment: When chunking splits the document, text from the middle of a page may lack context such as the candidate's name. Prepending "Candidate: {title}" to every chunk ensures that search queries containing the subject name rank these chunks accurately.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from vector_store import vector_store


def ingest_pdf(pdf_path: str):
    # 1. Load document (one Document per page)
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()

    # 2. Sanitize null bytes (\x00), which PostgreSQL does not support
    for doc in docs:
        doc.page_content = doc.page_content.replace("\x00", "")

    # 3. Chunk the document
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)

    # 4. Context Enrichment: fall back to a fixed name when the PDF has no title
    for chunk in chunks:
        title = chunk.metadata.get("title") or "Aditya Kumar"
        chunk.page_content = f"Candidate: {title}\n{chunk.page_content}"

    # 5. Insert into pgvector
    vector_store.add_documents(documents=chunks)
    print(f"Stored {len(chunks)} chunks")


if __name__ == "__main__":
    ingest_pdf("documents/aditya_resume.pdf")
```

app/chat.py

Queries the database for matching chunks, constructs the prompt context, feeds it to the LLM (gemini-2.5-flash), and compiles the source page metadata.

```python
from langchain_google_genai import ChatGoogleGenerativeAI
from vector_store import vector_store

# Initialize Chat Model
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")


def ask_question(question: str):
    # 1. Query vector database for top-3 most similar chunks
    docs = vector_store.similarity_search(question, k=3)

    # 2. Combine chunk text contents into single context block
    context = "\n\n".join(doc.page_content for doc in docs)

    # 3. Prompt instructions enforcing zero-shot constraints
    prompt = f"""
You are a resume assistant.
Answer ONLY from the provided context.
If the answer does not exist in the context, say "I don't know".

Context:
{context}

Question:
{question}
"""

    # 4. Request generation from LLM
    response = llm.invoke(prompt)

    return {
        "answer": response.content,
        # One entry per retrieved chunk, so answers can be traced to a page
        "source": [
            {"page": doc.metadata.get("page"), "source": doc.metadata.get("source")}
            for doc in docs
        ],
    }
```

app/main.py

Hosts the FastAPI server. It appends the current directory path dynamically so that imports resolve cleanly when the server is run from the root project directory.

```python
import sys
import os

# Ensure the root directory imports resolve correctly
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

from fastapi import FastAPI
from pydantic import BaseModel
from chat import ask_question

app = FastAPI()


class QuestionRequest(BaseModel):
    question: str


# POST rather than GET, because the question travels in a JSON request body
@app.post("/chat")
async def ask(request: QuestionRequest):
    return ask_question(request.question)
```

Debugging learnings

Model availability and quota: The models available to an API key can be listed with `client.models.list()`. Calling `gemini-2.5-pro` on unpaid tiers can result in `429 RESOURCE_EXHAUSTED` (quota limit of 0). Switching to `gemini-2.5-flash` provides a cost-effective, high-quota alternative.

NUL characters: PDF loaders can leave `\x00` markers in the extracted text. Writing these raw strings to PostgreSQL fails, so a simple `.replace("\x00", "")` filter is mandatory.

Retrieval relevance: For a query such as "Where does Aditya Kumar work?", chunks containing "Aditya Kumar" (like the footer/header) rank high, while relevant work-history chunks lacking his name rank extremely low. Prepending "Candidate: Aditya Kumar" to each chunk lets the search find the correct chunk and enables accurate generation.
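The ranking effect of prepending the candidate name can be illustrated with a toy example. The sketch below is not the pipeline's actual scoring: it swaps Gemini embeddings for a simple word-overlap cosine score, and the two chunk texts and the helper names (`words`, `overlap_score`, `rank`) are hypothetical. It only shows the mechanism: once every chunk carries the name, the name stops discriminating between chunks and the remaining query terms decide the order. Real embedding scores will differ.

```python
import math
import re


def words(text: str) -> set:
    """Lowercased set of words; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(a: set, b: set) -> float:
    """Cosine similarity between two binary word vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def rank(query: str, chunks: list) -> list:
    """Return chunk indices ordered from most to least similar to the query."""
    q = words(query)
    scores = [overlap_score(q, words(c)) for c in chunks]
    return sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)


# Hypothetical chunks: index 0 is a page footer, index 1 is the work history.
chunks = [
    "Aditya Kumar - Resume",
    "Work: engineer at Example Corp",
]
query = "Where does Aditya Kumar work?"

# Without enrichment, only the footer contains the name, so it wins.
before = rank(query, chunks)

# With enrichment, both chunks contain the name; "work" now breaks the tie.
enriched = [f"Candidate: Aditya Kumar\n{c}" for c in chunks]
after = rank(query, enriched)

print(before)  # [0, 1]
print(after)   # [1, 0]
```

With these two texts the footer scores about 0.52 against 0.20 for the work history before enrichment, and about 0.45 against 0.47 afterwards, which is why the order flips. The margin is small in this toy, so it is worth checking the actual `similarity_search` results on your own documents.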