RAG and Vector Search with pgvector and Amazon Bedrock (Part 4) To implement retrieval-augmented generation (RAG) using the pgvector extension within an existing PostgreSQL database, combined with Amazon Bedrock's Titan embedding model and Claude for answer generation. It details the full query pipeline, including how user questions are converted to vectors, how pgvector performs similarity searches using IVFFlat indexes, and how the system produces cited answers while maintaining row-level security. The post emphasizes that using the same embedding model for both ingestion and querying is essential, and provides specific implementation details such as using 1024-dimensional normalized vectors and IAM-based authentication. RAG and Vector Search with pgvector and Amazon Bedrock Part 4 How to build retrieval-augmented generation that actually cites its sources — without a vector database subscription. Most RAG tutorials reach for Pinecone, Chroma, or Weaviate as the vector store. Those are all fine services, but they add another cost line, another auth boundary, and a dependency you don't control. If you're already running Postgres — and for multi-tenant SaaS, you should be — the pgvector extension gives you vector similarity search inside your existing database, protected by the same Row-Level Security policies you already have. This post covers the full query path in Sift: how a user's question becomes a vector, how pgvector finds the closest document chunks, and how Claude turns those chunks into a cited answer. What RAG Actually Does The core idea is simple. At query time: - Embed the user's question with the same model used to embed the documents - Find the document chunks whose embeddings are closest to the question embedding - Send those chunks to an LLM, tell it to answer the question using only that context - Return the answer with numbered citations linking back to the source text That's it. The sophistication is in the details of each step. Embeddings with Bedrock Titan Embed v2 Both the pipeline at ingest time and the chat handler at query time use the same embedding model: amazon.titan-embed-text-v2:0 . Using the same model for both sides of the search is a hard requirement — embeddings from different models live in incompatible vector spaces. The Python implementation in the pipeline's shared module: php EMBED MODEL ID = "amazon.titan-embed-text-v2:0" def embed text: str - list float : payload = json.dumps {"inputText": text, "dimensions": 1024, "normalize": True} response = get client .invoke model modelId=EMBED MODEL ID, contentType="application/json", accept="application/json", body=payload, return json.loads response "body" .read "embedding" Two parameters worth noting. dimensions: 1024 — Titan Embed v2 supports multiple output sizes 256, 512, or 1024 dimensions . Fewer dimensions mean smaller storage and faster search at the cost of some precision. 1024 is the maximum and gives the best retrieval quality; for a demo at this scale, there's no reason to trade it away. normalize: True — this asks Bedrock to return a unit-length vector. Normalized embeddings mean cosine similarity is equivalent to dot product. pgvector can compute dot products slightly faster than cosine distance, and it simplifies reasoning about scores. More importantly, it means you don't have to normalize manually — if you skip it and your embeddings have different magnitudes, your similarity scores will be skewed by vector length rather than semantic meaning. Authentication is IAM. The Lambda execution role has bedrock:InvokeModel permission via its attached policy — no API keys, no secrets to rotate. Schema: Storing Vectors in Postgres The document chunks table has a vector 1024 column — the native pgvector type: CREATE EXTENSION IF NOT EXISTS vector; CREATE TABLE document chunks id UUID PRIMARY KEY DEFAULT gen random uuid , document id UUID NOT NULL REFERENCES documents id ON DELETE CASCADE, tenant id UUID NOT NULL, chunk index INT NOT NULL, content TEXT NOT NULL, embedding vector 1024 , created at TIMESTAMPTZ NOT NULL DEFAULT NOW ; The 1024 in the column type is a hard constraint — Postgres will reject inserts with a vector of any other dimension. That's a useful guardrail: if the embedding model changes and the dimension changes with it, the insert fails loudly rather than silently storing mismatched vectors. The IVFFlat Index An exact nearest-neighbor search scans every vector in the table and computes distance to the query vector. For a small dataset that's fine. At tens of millions of chunks it becomes expensive. IVFFlat Inverted File Flat is an approximate nearest-neighbor index. It clusters the vectors into groups called "lists" at index build time. At query time, it only searches the most promising lists rather than the entire table: CREATE INDEX ON document chunks USING ivfflat embedding vector cosine ops WITH lists = 100 ; vector cosine ops tells the index to use cosine distance as its metric, which matches the <= operator in the query. The lists = 100 parameter controls how many clusters to build — the pgvector docs recommend roughly sqrt rows as a starting point. The IVFFlat gotcha: the index needs data to exist when it's built. An IVFFlat index built on an empty table is useless. In Sift, the initial migration creates the index after the schema is established, and the seed data runs in the same migration. For a production system where the table grows continuously, HNSW is a better choice — it maintains good search quality as data is inserted without needing a rebuild. Inserting Vectors from Python The psycopg2 driver doesn't natively understand the pgvector type. Rather than adding the pgvector Python package which requires a compiled extension and adds deploy complexity , the pipeline constructs a Postgres vector literal as a plain string and casts it: vector literal = " " + ",".join str v for v in embedding + " " cur.execute """ INSERT INTO document chunks document id, tenant id, chunk index, content, embedding VALUES %s, %s, %s, %s, %s::vector ON CONFLICT DO NOTHING """, document id, tenant id, chunk index, content, vector literal , The ::vector cast in the SQL converts the string to the native vector type at insert time. This works on any Postgres driver, any Lambda architecture x86 or ARM , without native extensions. The ON CONFLICT DO NOTHING handles at-least-once delivery from the Step Functions Map state — if an EmbedChunk Lambda retries, it won't create duplicate chunks. Similarity Search At query time, the C ChatService embeds the user's question and runs the search. The same vector literal approach works from the .NET side: private async Task