Getting Started with Vector Databases Using Amazon Aurora PostgreSQL + pgvector

Satoshi Kaneyasu, a DevOps engineer at Serverworks, has published a guide explaining vector databases and their implementation using Amazon Aurora PostgreSQL with the pgvector extension. The tutorial covers how vector databases store data as multidimensional arrays and perform semantic similarity searches, contrasting them with traditional relational databases that rely on exact or partial text matching. The guide also details common use cases including RAG (Retrieval-Augmented Generation), semantic search, recommendation systems, and image search, while explaining the vectorization process that converts both stored data and search queries into numerical representations.

Hello, I'm Satoshi Kaneyasu, a DevOps engineer at Serverworks.

In this article, I'll introduce the basic concepts and terminology of vector databases. It is aimed at beginners: you may have heard that vector databases are related to LLMs and RAG, but aren't quite sure what they actually are. This article is written with that kind of reader in mind.

A vector database is a database that stores data as vectors (arrays of numbers) and searches for data using the "distance" or "similarity" between vectors.

Traditional relational databases search for data using exact matches or partial matches (LIKE queries), but vector databases can search for things that are semantically similar. For example, searching for "weather in Tokyo" might return results like "temperature in Tokyo" or "weather conditions in Kanto": data that differs as a string but is semantically related.

In a vector database, all data is represented as points in a multidimensional space. When searching, the query is also converted into a vector, and data that is "close in distance" within that space is retrieved. This diagram represents it in two dimensions, but in a real vector database, proximity and distance are defined across many dimensions.

Vector databases are used across a wide range of applications:

| Use Case | Description |
|---|---|
| RAG (Retrieval-Augmented Generation) | Knowledge base search to provide external knowledge to LLMs. Allows internal documents and up-to-date information to be reflected in LLM responses |
| Semantic Search | Searching internal documents or FAQs by meaning rather than keywords. Handles spelling variations and synonyms |
| Recommendation | Recommending products and content whose vectors are close to a user's preference vector. Used as an alternative or complement to collaborative filtering |
| Image Search | Searching for similar images (face recognition, product image matching). Images are vectorized using an embedding model and compared |
| Anomaly Detection | Detecting data that deviates far from the vector of normal patterns. Used in log analysis and security monitoring |
| Duplicate Detection | Detecting similar documents or code. Used for plagiarism detection and content deduplication |

The most common use case is RAG.

RAG (Retrieval-Augmented Generation) is a technique that improves LLM response accuracy by searching for relevant information from external data sources before generating a response, then including that information in the prompt. LLMs cannot accurately respond to information not included in their training data (internal documents, recent news, specialized technical information, etc.). With RAG, you can have the LLM reference external knowledge stored in a vector database to generate more accurate and up-to-date responses.

When using Amazon Bedrock as the LLM for RAG, there is a fully managed RAG feature called Knowledge Bases. With Knowledge Bases, you simply register documents stored in S3 and AWS manages everything: vectorization, vector database setup, and search. Since you don't need to set up a vector database yourself, this is ideal when you want to try RAG quickly or minimize infrastructure management. Since this article focuses on the vector database itself, we'll proceed without using Knowledge Bases.

The RAG process follows this flow:

As you can see, the vector database plays a central role in RAG as the "search engine for external knowledge." From here, let's dive deeper into the "vector database search" step.

The vector database search flow works as follows:

In a vector database, data is represented as multidimensional numbers. Therefore, stored data is converted to numbers at insertion time, and search queries are converted at search time. This is called vectorization, or embedding. The key point of vector database search is that the search query itself is also vectorized.
Instead of searching with raw text, the query is converted to a vector using an embedding model (described later), and data that is close in vector space is retrieved.

From here, I'll use implementation examples with Aurora PostgreSQL + pgvector (referred to by this short name throughout) and Python code. There are multiple options for building a vector database on AWS, but I find Aurora PostgreSQL + pgvector to be the most approachable starting point, and it's a great way to feel the difference between a conventional relational database and a vector database.

Here is an implementation example using Aurora PostgreSQL + pgvector:

```python
# ① Vectorize the text query (handler.py)
embedding_result = generate_embedding(query)
query_embedding = embedding_result.embedding  # 1024-dimensional vector

# ② Search the DB with the vectorized query (logic.py)
with connection.cursor() as cur:
    cur.execute(
        # Calculate cosine distance between the query vector and the DB vectors,
        # and return the top_k results in ascending distance order
        "SELECT content, embedding <=> %s::vector AS distance "
        "FROM embeddings ORDER BY distance LIMIT %s;",
        (query_embedding, top_k),
    )
    results = cur.fetchall()
```

The `<=>` operator here is pgvector's cosine distance operator. A smaller value means higher similarity. Because we're using Aurora PostgreSQL + pgvector, we can use SQL to query the vector DB. This code uses a parameterized query to safely pass the vectorized search text and the result count (`top_k`) into the `%s` placeholders.

Several terms have appeared in this simple search, so let me explain them.

Embedding refers to the process of converting data such as text or images into a numerical vector. It is also called "vectorization." Humans intuitively know that "Tokyo weather forecast" and "Tokyo temperature" are similar, but computers can only compare strings. By numerically representing meaning through embedding, computers can mathematically calculate "semantic closeness."
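To make "mathematically calculating semantic closeness" a little more concrete, here is a minimal sketch of what the `ORDER BY distance LIMIT k` search does conceptually. The 3-dimensional vectors are invented by hand for illustration (real embeddings come from an embedding model and have far more dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def nearest(query, docs, k):
    """Return the k (text, distance) pairs closest to the query vector."""
    ranked = sorted(docs.items(), key=lambda item: cosine_distance(query, item[1]))
    return [(text, cosine_distance(query, vec)) for text, vec in ranked[:k]]


# Hand-made toy vectors (assumption: NOT produced by a real embedding model)
documents = {
    "temperature in Tokyo": [0.9, 0.1, 0.2],
    "weather conditions in Kanto": [0.7, 0.4, 0.1],
    "how to bake bread": [0.1, 0.9, 0.7],
}
query_vector = [1.0, 0.2, 0.1]  # stands in for "weather in Tokyo"

for text, distance in nearest(query_vector, documents, k=2):
    print(f"{distance:.3f}  {text}")
```

The database performs the same kind of ranking, only over many more vectors and dimensions.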
For example, embedding converts text as follows:

Before: "Tokyo weather forecast"
After: [0.0231, -0.0142, 0.0567, ..., 0.0412] ← 1024 numbers

Here is an implementation example using Amazon Bedrock's Titan Embeddings V2. The `generate_embedding` function implemented here is called at step ① in the search implementation code above.

```python
import json
import time


def generate_embedding(text: str) -> EmbeddingResult:
    """Vectorize text using Bedrock Titan Embeddings V2."""
    # get_bedrock_client() and EmbeddingResult are defined elsewhere in the project.
    client = get_bedrock_client()
    started = time.perf_counter()

    body = json.dumps({
        "inputText": text,     # Before: text
        "dimensions": 1024,    # Output dimensions
        "normalize": True,     # Normalize (set vector length to 1)
    })

    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
    )

    response_body = json.loads(response["body"].read())
    embedding = response_body["embedding"]  # After: float × 1024

    elapsed_ms = (time.perf_counter() - started) * 1000
    return EmbeddingResult(embedding=embedding, time_ms=elapsed_ms)
```

Specifying `normalize=True` normalizes the output vector length to 1. This makes cosine similarity calculation equivalent to a dot product calculation, improving search efficiency.

In the embedding implementation code, there was a keyword called "dimensions." Dimensions refer to the number of numbers in a single vector.

3-dimensional vector: [0.5, -0.3, 0.8] ← 3 numbers
1024-dimensional vector: [0.023, -0.014, ..., 0.041] ← 1024 numbers

More dimensions allow for finer representation of "meaning," but storage consumption increases accordingly. The sizes below assume 4 bytes (one 32-bit float) per dimension:

| Dimensions | Size per vector | Size for 100k records |
|---|---|---|
| 256 | 1 KB | ~100 MB |
| 1024 | 4 KB | ~400 MB |
| 1536 | 6 KB | ~600 MB |
| 3072 | 12 KB | ~1.2 GB |

The number of dimensions is determined by the embedding model you use. Titan Embeddings V2 lets you choose from 256, 512, or 1024, allowing you to balance accuracy and cost based on your use case.

Embedding models, the specialized models that convert text to vectors, are distinct from LLMs (generative models). They specialize in generating representations for computing semantic similarity.
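Returning to the `normalize` option above, the claim that cosine similarity reduces to a dot product for unit-length vectors can be checked with a few lines of arithmetic. This is a small sketch with arbitrary example vectors; the helper names are hypothetical and not part of the article's code.

```python
import math


def dot(a, b):
    """Dot (inner) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v so that its length (L2 norm) becomes 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    """Full cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Arbitrary example vectors, chosen only so the arithmetic is easy to follow
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = normalize(a), normalize(b)

print(cosine_similarity(a, b))  # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))      # plain dot product of the normalized vectors
```

Both calls compute the same quantity, but the second skips the two square roots and the division per comparison, which is where the efficiency gain comes from.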
| Model | Provider | Dimensions | Features |
|---|---|---|---|
| Titan Embeddings V2 | AWS Bedrock | 256/512/1024 | AWS native. Has normalization option. High affinity with AWS environments |
| Cohere Embed v3 | AWS Bedrock | 1024 | Multilingual support. Evaluated as highly accurate for Japanese |
| text-embedding-3-small | OpenAI | 256 to 1536 | Lightweight and low cost. Multilingual support. Best for cost-sensitive use cases |
| text-embedding-3-large | OpenAI | 256 to 3072 | High accuracy and multilingual support. Flexible dimension selection |

An important note: you must use the same model for both search and registration. Vectors generated by different models don't exist in the same space, so distance calculations between them are meaningless.

Amazon Titan Text Embeddings V2 - Bedrock Documentation https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html

Cosine similarity represents "how much two vectors point in the same direction" as a number between -1 and 1. Closer to 1 means more semantically similar, closer to 0 means unrelated, and closer to -1 means semantically opposite.

Cosine distance is defined as `1 - cosine similarity` and ranges from 0 to 2. A smaller value means higher similarity, and pgvector's `<=>` operator returns this cosine distance. "Distance" and "similarity" are just opposite representations of the same concept.

| Metric | Range | "More similar" direction | Use case |
|---|---|---|---|
| Cosine Similarity | -1 to 1 | Larger value (closer to 1) | Threshold judgment (e.g., "hit if >= 0.95") |
| Cosine Distance | 0 to 2 | Smaller value (closer to 0) | ORDER BY in SQL, KNN search |

The search implementation code (`embedding <=> %s::vector`) sorts by cosine distance, while the threshold judgment in semantic cache described later (`similarity >= 0.95`) uses cosine similarity.

`top_k` is the number of highest-ranked results to return from a search. Set an appropriate value based on the use case.
In RAG, it is common to pass the full set of `top_k` results as context to the LLM. Be aware that making `top_k` too large will lengthen the context, increasing the LLM's token consumption and latency.

Normalization is the process of setting the length (norm) of a vector to 1. With Titan Embeddings V2, specifying `normalize=True` automatically normalizes the output vector. Cosine similarity between normalized vectors becomes equivalent to a simple dot product. Since dot products have lower computational cost than cosine similarity, this leads to more efficient search. Also, by standardizing vector lengths, distance comparisons purely reflect "differences in direction," which stabilizes search result quality.

Of course, data must be registered in advance before you can search a vector database. Let's now look at data registration in a vector database.

Data registration in a vector database follows this flow:

As with the search explanation, I'll use implementation examples with Aurora PostgreSQL + pgvector and Python. The following table and index are created on Aurora PostgreSQL with the pgvector extension enabled:

```sql
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- embeddings table (storage for vector data)
CREATE TABLE IF NOT EXISTS embeddings (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1024) NOT NULL
);

-- HNSW index (speeds up ANN search)
CREATE INDEX IF NOT EXISTS idx_embeddings_embedding
    ON embeddings USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

The `content` column in the `embeddings` table stores the text data, and the `embedding` column stores the vectorized text. An HNSW index is then created on the `embedding` column.

Vector databases have indexes too, and in Aurora PostgreSQL + pgvector, you create indexes with the CREATE INDEX statement just like regular indexes. Here, `ON embeddings USING hnsw` specifies something called the index algorithm.
The index algorithm is closely related to the search algorithm, and these two algorithms are critical in vector databases.

There are two main types of search methods in vector databases:

| Search Method | Full Name | Features |
|---|---|---|
| KNN | K-Nearest Neighbor | Compares against all data exhaustively. Results are exact, but computation cost increases linearly as data grows, making it slow |
| ANN | Approximate Nearest Neighbor | Searches approximately. Slightly lower accuracy but can search at high speed even with large volumes of data |

In practical systems, ANN is almost always used. KNN is fine for small-scale data of a few thousand records, but ANN becomes essential when dealing with tens of thousands of records or more.

The data structures used to implement ANN are called index algorithms, and there are several types:

| Algorithm | Mechanism | Features |
|---|---|---|
| HNSW | Builds a hierarchical graph structure and progressively narrows the search range from upper to lower layers | High accuracy and high speed. Higher memory consumption but currently the most widely used |
| IVF | Clusters data and performs partial search only on clusters close to the query | Memory-efficient. Suitable for large-scale data but may have lower accuracy than HNSW |

Currently, the ANN + HNSW combination is the standard for building vector databases. AWS offers multiple ways to build vector databases, and Aurora PostgreSQL + pgvector, OpenSearch, and MemoryDB all support HNSW.
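For comparison with ANN, the exhaustive KNN approach described above can be sketched in a few lines: score every stored vector against the query and keep the k closest. This is only an illustration of the idea in plain Python with hypothetical names, not how pgvector implements a sequential scan.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_search(query, vectors, k):
    """Exact KNN: compute the distance to every vector, return the k smallest.

    Returns a list of (distance, index) tuples in ascending distance order.
    The work grows linearly with len(vectors), which is why this gets slow
    as the data grows.
    """
    scored = ((cosine_distance(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 2-dimensional data, invented for illustration
stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
for distance, index in knn_search([1.0, 0.0], stored, k=2):
    print(index, round(distance, 4))
```

An ANN index such as HNSW avoids visiting every vector, trading a small amount of recall for a large reduction in the number of distance computations.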
```sql
-- HNSW index (speeds up ANN search)
CREATE INDEX IF NOT EXISTS idx_embeddings_embedding
    ON embeddings USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

The WITH clause in the index creation SQL specifies the HNSW index parameters:

| Parameter | Meaning | Effect when increased | Typical value |
|---|---|---|---|
| m | Connections per node | Search accuracy ↑ / Memory consumption ↑ / Build time ↑ | 16 |
| ef_construction | Search width during construction | Search accuracy ↑ / Build time ↑ | 64 to 200 |

Here is the Python code to register a substantial amount of data into Aurora PostgreSQL + pgvector:

```python
import time

import psycopg2.extensions

# MAX_RETRIES, RETRY_DELAY_SECONDS, logger, and generate_vector() are defined
# elsewhere in the project.


class AuroraIngester:
    """Batch INSERT data into Aurora pgvector.

    Efficiently inserts vector data using batch INSERT of 500 records at a time.
    """

    def __init__(self, connection: psycopg2.extensions.connection) -> None:
        self._connection = connection

    def ingest_batch(self, start_index: int, end_index: int) -> int:
        """Batch INSERT records in the specified range.

        Args:
            start_index: Start index (inclusive)
            end_index: End index (exclusive)

        Returns:
            Number of records inserted
        """
        values_parts: list[str] = []
        params: list[str | list[float]] = []

        # Build one multi-row INSERT: one "(%s, %s::vector)" group per record
        for i in range(start_index, end_index):
            values_parts.append("(%s, %s::vector)")
            params.append(f"doc-{i}")
            params.append(generate_vector(seed=i))

        sql = f"INSERT INTO embeddings (content, embedding) VALUES {', '.join(values_parts)};"

        with self._connection.cursor() as cur:
            cur.execute(sql, params)
        self._connection.commit()

        return end_index - start_index

    def ingest_all(self, record_count: int, batch_size: int = 500) -> int:
        """Insert all records into Aurora in batches.

        Args:
            record_count: Total number of records to insert
            batch_size: Number of records per batch (default 500)

        Returns:
            Total number of records inserted
        """
        log = logger.bind(database="aurora_pgvector")
        total_inserted = 0

        for start in range(0, record_count, batch_size):
            end = min(start + batch_size, record_count)

            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    count = self.ingest_batch(start, end)
                    total_inserted += count
                    break
                except Exception as e:
                    # Roll back so the connection is usable for the next attempt
                    self._connection.rollback()
                    log.warning("batch_insert_retry", start=start, end=end,
                                attempt=attempt, error=str(e))
                    if attempt == MAX_RETRIES:
                        log.error("batch_insert_failed", start=start, end=end, error=str(e))
                        break
                    time.sleep(RETRY_DELAY_SECONDS)

        log.info("ingest_all_complete", total_inserted=total_inserted)
        return total_inserted
```

```python
def run_database_ingestion(index_manager, ingester, record_count):
    """Execute bulk data insertion into the database.

    Args:
        index_manager: Object managing index drop and creation (implementation omitted)
        ingester: Object that inserts data in batches (described above)
        record_count: Total number of records to insert
    """
    # ① Drop index (speeds up registration)
    index_manager.drop_index()
    # SQL executed internally:
    #   DROP INDEX IF EXISTS embeddings_hnsw_idx;
    #   TRUNCATE TABLE embeddings;

    # ② Batch registration (500 records at a time)
    ingester.ingest_all(record_count, batch_size=500)

    # ③ Bulk index creation
    index_manager.create_index()
    # SQL executed internally:
    #   CREATE INDEX embeddings_hnsw_idx ON embeddings
    #     USING hnsw (embedding vector_cosine_ops)
    #     WITH (m = 16,                -- Connections per node (more = higher accuracy, more memory)
    #           ef_construction = 64); -- Search width during construction (more = higher accuracy, slower build)
```

The reason for dropping the index first, registering data, and then recreating the index is that registering data while an index exists makes processing time unpredictable, because every inserted row also has to be added to the index. This technique is commonly used in relational databases and applies equally to Aurora PostgreSQL + pgvector.
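The article omits the index manager's implementation. Purely as an illustration, the drop / load / recreate pattern could be wrapped as in the sketch below. The class name, method bodies, and the choice to commit after each step are my assumptions; only the SQL strings are taken from the comments in the code above. It accepts any DB-API style connection whose cursor works as a context manager (as psycopg2's does).

```python
class IndexManager:
    """Hypothetical stand-in for the omitted index manager (not the author's code)."""

    # SQL taken from the "SQL executed internally" comments in the article
    DROP_SQL = "DROP INDEX IF EXISTS embeddings_hnsw_idx;"
    TRUNCATE_SQL = "TRUNCATE TABLE embeddings;"
    CREATE_SQL = (
        "CREATE INDEX embeddings_hnsw_idx ON embeddings "
        "USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64);"
    )

    def __init__(self, connection):
        # Assumption: a DB-API connection, e.g. one returned by psycopg2.connect()
        self._connection = connection

    def _run(self, *statements):
        """Execute the given statements in order, then commit once."""
        with self._connection.cursor() as cur:
            for sql in statements:
                cur.execute(sql)
        self._connection.commit()

    def drop_index(self):
        """Drop the HNSW index and empty the table before a bulk load."""
        self._run(self.DROP_SQL, self.TRUNCATE_SQL)

    def create_index(self):
        """Build the HNSW index in one pass over the loaded data."""
        self._run(self.CREATE_SQL)
```

Building the index once after loading means the HNSW graph is constructed in a single pass, rather than being updated on every inserted row.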
For more details, see:

Index Considerations When Bulk-Inserting Large Amounts of Data into a Database (Japanese) https://blog.serverworks.co.jp/database-bulk-insert-index-strategy

One technique for speeding up search and data retrieval is caching. For vector databases, there is a technology called semantic cache that differs slightly from conventional caching.

Semantic cache is a mechanism that uses the embedding vector of a query as a key to cache past search results or FM (Foundation Model) responses, and quickly returns results from the cache for semantically similar queries. Comparing it with conventional caching reveals its unique characteristics:

| | Conventional Cache | Semantic Cache |
|---|---|---|
| Key | Exact string match | Vector similarity |
| Hit condition | Only the exact same query | Semantically similar queries also hit |
| Example | Only "weather in Tokyo" hits | "Tokyo weather forecast" and "What's the weather in Tokyo today?" also hit |

With conventional caching, "weather in Tokyo" and "Tokyo weather forecast" are treated as different keys, resulting in lower cache hit rates. Semantic cache can group semantically equivalent queries together for caching, dramatically improving hit rates.

When implementing semantic cache on AWS, Amazon ElastiCache or Amazon MemoryDB are the typical options. Here, I'll introduce a semantic cache implementation using Amazon MemoryDB (hereafter, MemoryDB), referencing the following documentation:

Amazon MemoryDB - Vector Search Examples https://docs.aws.amazon.com/memorydb/latest/devguide/vector-search-examples.html

Setting aside RAG with a vector database for a moment, if you introduce semantic cache for Foundation Model queries, the processing flow would look like this:

Note: MemoryDB is a Redis-compatible key-value store and does not have "tables" like RDBs. Data is stored in Hash-type keys, and the search schema is defined as an "index" using the FT.CREATE command.
In this repository, the following FT.CREATE command creates the index for semantic cache:

```
FT.CREATE semantic_cache_idx ON HASH PREFIX 1 cache: SCHEMA
  embedding VECTOR HNSW 10 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE M 16 EF_CONSTRUCTION 512
  query_text TAG
  result TEXT
  created_at NUMERIC
  ttl NUMERIC
```

| Field | Type | Description |
|---|---|---|
| embedding | VECTOR HNSW | Query embedding vector (1024 dimensions). Target for KNN search |
| query_text | TAG | Original query text. For exact match filtering |
| result | TEXT | FM response result (cached answer) |
| created_at | NUMERIC | Cache entry creation time (UNIX timestamp) |
| ttl | NUMERIC | Cache expiration time (seconds) |

- `PREFIX 1 cache:` means only Hashes whose key name starts with `cache:` are indexed
- `EF_CONSTRUCTION=512` is set higher than for Aurora pgvector (64). Since MemoryDB operates in-memory, build cost is relatively low, so accuracy is prioritized

The threshold for semantic cache is the cosine similarity value used to determine cache hits.

| Threshold | Characteristics | Recommended Use Case |
|---|---|---|
| 0.95 to 1.0 | Only nearly identical queries hit | Accuracy-focused. When you want to minimize the risk of returning incorrect cached responses |
| 0.80 to 0.90 | Synonymous phrasing variations also hit | Practical balance. Recommended for most use cases |
| 0.70 to 0.80 | Related queries also broadly hit | Hit rate-focused. However, the risk of returning unrelated results increases |

The appropriate threshold depends on business requirements, so I think it's safe to start with a high threshold around 0.95 and gradually lower it while monitoring cache hit rates.

The implementation also relies on HSET, EXPIRE, and the `cache:` key prefix. These are not keywords specific to vector databases or semantic cache; they come from Redis, the engine underlying MemoryDB.

HSET is a command that saves field-value pairs together in a Hash-type key. Multiple fields like `embedding`, `query_text`, `result`, and `created_at` can be stored as a single entry.
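As a rough illustration of how one cache entry might be written with HSET (plus EXPIRE, the Redis command that puts a time-to-live on a key), here is a minimal sketch. The article does not show its own `store_cache_entry`, so the key scheme (a SHA-256 prefix of the query text) and the helper `build_cache_entry` are assumptions of mine. The `hset(name, mapping=...)` and `expire(name, time)` calls are standard redis-py client methods, and the embedding is packed as little-endian float32 to match `TYPE FLOAT32` in the index definition.

```python
import hashlib
import struct
import time


def build_cache_entry(query_text, query_embedding, fm_result, ttl_seconds, now=None):
    """Return (key, field mapping) for one semantic-cache entry.

    Assumption: the key is "cache:" plus a short hash of the query text,
    so that it matches the PREFIX 1 cache: setting of the index.
    """
    key = "cache:" + hashlib.sha256(query_text.encode("utf-8")).hexdigest()[:16]
    mapping = {
        # Raw little-endian float32 bytes, 4 bytes per dimension
        "embedding": struct.pack(f"<{len(query_embedding)}f", *query_embedding),
        "query_text": query_text,
        "result": fm_result,
        "created_at": int(now if now is not None else time.time()),
        "ttl": ttl_seconds,
    }
    return key, mapping


def store_cache_entry(redis_client, query_text, query_embedding, fm_result, ttl_seconds):
    """Write the entry with HSET, then set its expiry with EXPIRE."""
    key, mapping = build_cache_entry(query_text, query_embedding, fm_result, ttl_seconds)
    redis_client.hset(key, mapping=mapping)  # HSET: all fields in one Hash
    redis_client.expire(key, ttl_seconds)    # EXPIRE: delete the key after the TTL
    return key
```

Separating the pure `build_cache_entry` step from the two client calls keeps the entry layout easy to unit-test without a running MemoryDB cluster.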
In Redis / MemoryDB, it's conventional to use colon-separated naming like `cache:abc123` for key names. This simply means "entry abc123 in the cache category"; the colon itself has no special function. The `PREFIX 1 cache:` in the index definition is a setting to make only keys starting with this prefix subject to search.

EXPIRE is a command that sets an expiration time (TTL) on a key. After the specified number of seconds, the key is automatically deleted. This prevents stale cache entries from accumulating.

The implementation code got a bit long, but what it does is the same as typical cache-based data retrieval: use the cache if available, otherwise search and save the result to cache. I'll introduce the implementation code in three stages.

```python
def handler(event, context):
    query = event["query"]  # "What is AWS S3?"

    # ① Vectorize the query (Bedrock Titan V2)
    embedding_result = generate_embedding(query)
    query_embedding = embedding_result.embedding

    # ② Cache lookup via MemoryDB → FM call
    cache_result = process_query(
        query_text=query,
        query_embedding=query_embedding,
        redis_client=redis_client,
        threshold=0.95,    # Environment variable SIMILARITY_THRESHOLD
        ttl_seconds=3600,  # Environment variable CACHE_TTL
    )

    # ③ Return response with metrics
    return {"statusCode": 200, "body": {...}}
```

```python
def process_query(query_text, query_embedding, redis_client, threshold, ttl_seconds):
    # ① Query MemoryDB cache (FT.SEARCH KNN search)
    search_results = search_similar(redis_client, query_embedding)

    if search_results:
        key, similarity, fields = search_results[0]
        # ② Cache hit → Return result from cache (no FM call)
        if similarity >= threshold:
            return CacheResult(hit=True, source="cache", result=fields["result"])

    # ③ Cache miss → Query FM directly and get result
    fm_result = invoke_fm(query_text)

    # ④ Save result to cache (HSET + EXPIRE)
    store_cache_entry(redis_client, query_text, query_embedding, fm_result, ttl_seconds)

    return CacheResult(hit=False, source="fm", result=fm_result)
```

```python
import struct

from redis.commands.search.query import Query


def search_similar(redis_client, query_embedding, top_k=1):
    """Execute KNN vector search with FT.SEARCH."""
    # Pack the embedding as little-endian float32 bytes (matches TYPE FLOAT32)
    query_vec = struct.pack(f"<{len(query_embedding)}f", *query_embedding)

    query = (
        Query(f"*=>[KNN {top_k} @embedding $query_vec AS score]")
        .return_fields("query_text", "result", "created_at", "score")
        .sort_by("score", asc=True)
        .paging(0, top_k)
        .dialect(2)
        .timeout(3000)  # 3-second timeout
    )

    results = redis_client.ft("semantic_cache_idx").search(
        query, query_params={"query_vec": query_vec}
    )

    # Convert cosine distance to similarity (distance = 1 - similarity).
    # The fields dict is rebuilt from the returned document attributes
    # (the original snippet elided how it is constructed).
    return [
        (doc.id, 1.0 - float(doc.score),
         {"query_text": doc.query_text, "result": doc.result})
        for doc in results.docs
    ]
```

- MemoryDB's FT.SEARCH command is compatible with Redis's RediSearch module and natively supports KNN vector search.
- `score` is returned as cosine distance (`1 - cosine similarity`), theoretically in the range 0 to 2. `1.0 - score` converts it to cosine similarity.
- Normalizing vectors with Titan V2's `normalize=True` does not by itself narrow this range, because cosine similarity does not depend on vector length. In practice, however, text embeddings of natural-language queries rarely point in opposite directions, so observed scores typically fall in the range 0 to 1, and the converted similarity also stays in roughly the 0 to 1 range.

Here are the measured results under the following conditions:

| Item | Value |
|---|---|
| FM (Foundation Model) | Claude 3 Haiku (anthropic.claude-3-haiku-20240307-v1:0) |
| Embedding Model | Titan Embeddings V2 (1024 dimensions) |
| Cache Store | Amazon MemoryDB |
| Similarity Threshold | 0.95 |
| Test Query | "What is AWS S3?" (same query run twice) |

The threshold is set high at 0.95. Please treat these measurement results as reference values to demonstrate that semantic cache has a certain level of effectiveness.

| Metric | Cache Miss (1st run) | Cache Hit (2nd run) | Reduction |
|---|---|---|---|
| Total Response Time | 4,573 ms | 279 ms | 94% |
| Embedding Generation | 194 ms | 192 ms | - |
| Cache Lookup | 4 ms | 3 ms | - |
| FM Call | 4,375 ms | 0 ms | 100% |

When there's a cache hit, the FM call is completely skipped, reducing response time by 94%. Since only embedding generation (~190 ms) and cache lookup (~3 ms) are needed to complete the response, user experience is dramatically improved.
Skipping the FM call also directly translates to reduced API usage costs.

Semantic cache can be integrated into a RAG system. In that case, the processing flow would look like this:

In this article, I covered everything from the basic concepts of vector databases to implementation on AWS and optimization with semantic cache.

That's all for this time. Thank you for reading this lengthy article.