Getting Started with Vector Databases Using Amazon Aurora PostgreSQL + pgvector

Hello!

I'm Satoshi Kaneyasu, DevOps engineer at Serverworks.

In this article, I'll introduce the basic concepts and terminology of vector databases for those who are just starting to learn about them.

It is aimed at readers who have heard that vector databases are related to LLMs and RAG but aren't quite sure what they actually are.

A vector database is a database that stores data as vectors (arrays of numbers) and searches for data using "distance" or "similarity" between vectors.

Traditional relational databases search for data using "exact match" or "partial match" (LIKE queries), but vector databases can search for things that are semantically similar.

For example, searching for "weather in Tokyo" might return results like "temperature in Tokyo" or "weather conditions in Kanto" — data that differs as a string but is semantically related.

In a vector database, all data is represented as points in a multidimensional space. When searching, the query is also converted into a vector, and data that is "close in distance" within that space is retrieved.

This diagram represents it in two dimensions, but in a real vector database, proximity and distance are defined across many dimensions.

Vector databases are used across a wide range of applications:

Use Case	Description
RAG (Retrieval-Augmented Generation)	Knowledge base search to provide external knowledge to LLMs. Allows internal documents and up-to-date information to be reflected in LLM responses
Semantic Search	Searching internal documents or FAQs by meaning rather than keywords. Handles spelling variations and synonyms
Recommendation	Recommending products and content whose vectors are close to a user's preference vector. Used as an alternative or complement to collaborative filtering
Image Search	Searching for similar images (face recognition, product image matching). Images are vectorized using an embedding model and compared
Anomaly Detection	Detecting data that deviates far from the vector of normal patterns. Used in log analysis and security monitoring
Duplicate Detection	Detecting similar documents or code. Used for plagiarism detection and content deduplication

The most common use case is RAG.

RAG (Retrieval-Augmented Generation) is a technique that improves LLM response accuracy by searching for relevant information from external data sources before generating a response, then including that information in the prompt.

LLMs cannot accurately respond to information not included in their training data (internal documents, recent news, specialized technical information, etc.).

With RAG, you can have the LLM reference external knowledge stored in a vector database to generate more accurate and up-to-date responses.
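As a minimal sketch of the "include that information in the prompt" step, the function below joins retrieved passages with the user's question. The name build_rag_prompt and the instruction wording are illustrative assumptions, not part of any AWS API.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user question into one prompt.

    Hypothetical helper: the instruction wording is an assumption.
    """
    # Number each passage so the model can tell the sources apart
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "What is the weather in Tokyo?",
    ["Tokyo is sunny today.", "The temperature in Tokyo is 25 degrees."],
)
print(prompt)
```

In a real system the passages would come from the vector database search described in the rest of this article.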

When using Amazon Bedrock as the LLM for RAG, there is a fully managed RAG feature called Knowledge Bases.

With Knowledge Bases, you simply register documents stored in S3 and AWS manages everything — vectorization, vector database setup, and search.

Since you don't need to set up a vector database yourself, this is ideal when you want to try RAG quickly or minimize infrastructure management.

Since this article focuses on the vector database itself, we'll proceed without using Knowledge Bases.

The RAG process follows this flow:

As you can see, the vector database plays a central role in RAG as the "search engine for external knowledge."

From here, let's dive deeper into the "vector database search" step.

The vector database search flow works as follows:

In a vector database, data is represented as multidimensional numbers.

Therefore, data is converted to numbers at insertion time, and search queries are converted at search time.

This is called vectorization, or Embedding.

The key point of vector database search is that the search query itself is also vectorized.

Instead of searching with raw text, it is converted to a vector using an embedding model (described later), and data that is close in vector space is retrieved.

From here, I'll use abbreviated implementation examples in Python with Aurora PostgreSQL + pgvector.

There are multiple options for building a vector database on AWS, but I find Aurora PostgreSQL + pgvector to be the most approachable starting point, and it's a great way to feel the difference between a conventional relational database and a vector database.

Here is an implementation example using Aurora PostgreSQL + pgvector:

# ① Vectorize the search query (generate_embedding is shown later)
embedding_result = generate_embedding(query)
query_embedding = embedding_result.embedding  # 1024-dimensional vector

# Retrieve the top_k rows closest to the query by cosine distance (<=>)
with connection.cursor() as cur:
    cur.execute(
        "SELECT content, embedding <=> %s::vector AS distance "
        "FROM embeddings ORDER BY distance LIMIT %s;",
        (query_embedding, top_k),
    )
    results = cur.fetchall()

The <=> operator here is pgvector's cosine distance operator.

A smaller value means higher similarity.

Because we're using Aurora PostgreSQL + pgvector, we can use SQL to query the vector DB.

This code uses a parameterized query to safely pass the vectorized search text and the result count (top_k) into the %s placeholders.
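The code above passes the Python list straight to the %s::vector placeholder. Another option is to pass pgvector's text form, which looks like [0.1,0.2,0.3]. The helper below is a small sketch of that formatting; the name to_pgvector_literal is an assumption for illustration.

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a list of numbers in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


literal = to_pgvector_literal([0.5, -0.3, 0.8])
print(literal)  # [0.5,-0.3,0.8]
```

The resulting string can be bound to the same %s::vector placeholder instead of the list.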

Several terms have appeared in this simple search, so let me explain them.

Embedding refers to the process of converting data such as text or images into a numerical vector.

It is also called "vectorization."

Humans intuitively know that "Tokyo weather forecast" and "Tokyo temperature" are similar, but computers can only compare strings.

By numerically representing meaning through embedding, computers can mathematically calculate "semantic closeness."

Before: "Tokyo weather forecast"
After:  [0.0231, -0.0142, 0.0567, ..., 0.0412]  ← 1024 numbers

Here is an implementation example using Amazon Bedrock's Titan Embeddings V2.

The generate_embedding function implemented here is called at step ① in the "Search Implementation Code" above.

import json
import time

# EmbeddingResult and _get_bedrock_client() are defined elsewhere in the project (omitted here)


def generate_embedding(text: str) -> EmbeddingResult:
    """Vectorize text using Bedrock Titan Embeddings V2."""
    client = _get_bedrock_client()
    body = json.dumps({
        "inputText": text,        # Before: text
        "dimensions": 1024,       # Output dimensions
        "normalize": True,        # Normalize (set vector length to 1)
    })

    started = time.perf_counter()
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
    )
    # Assumed here to measure only the invoke_model call
    elapsed_ms = (time.perf_counter() - started) * 1000

    response_body = json.loads(response["body"].read())
    embedding = response_body["embedding"]  # After: [float] × 1024
    return EmbeddingResult(embedding=embedding, time_ms=elapsed_ms)

Specifying normalize=True normalizes the output vector length to 1.

This makes cosine similarity calculation equivalent to a dot product calculation, improving search efficiency.

In the embedding implementation code, there was a keyword called "dimensions."

Dimensions refer to the number of numbers in a single vector.

3-dimensional vector:    [0.5, -0.3, 0.8]           ← 3 numbers
1024-dimensional vector: [0.023, -0.014, ..., 0.041] ← 1024 numbers

More dimensions allow for finer representation of "meaning," but storage consumption increases accordingly.

Dimensions	Size per vector	Size for 100k records
256	1 KB	~100 MB
1024	4 KB	~400 MB
1536	6 KB	~600 MB
3072	12 KB	~1.2 GB

The number of dimensions is determined by the embedding model you use. Titan Embeddings V2 lets you choose from 256, 512, or 1024, allowing you to balance accuracy and cost based on your use case.
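The sizes in the table follow from simple arithmetic, assuming each dimension is stored as a 4-byte float and ignoring per-row and index overhead. The function names below are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each dimension is a 4-byte float


def vector_bytes(dimensions: int) -> int:
    """Raw size of one vector, ignoring per-row and index overhead."""
    return dimensions * BYTES_PER_FLOAT32


def total_megabytes(dimensions: int, records: int) -> float:
    """Raw size of all vectors in MB (1 MB = 1,000,000 bytes)."""
    return vector_bytes(dimensions) * records / 1_000_000


for dims in (256, 1024, 1536, 3072):
    print(dims, vector_bytes(dims), round(total_megabytes(dims, 100_000), 1))
```

Actual table and index sizes will be larger, since the HNSW index and row headers add their own overhead.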

Embedding models, the specialized models that convert text to vectors, are distinct from LLMs (generative models).

Embedding models specialize in generating representations for computing semantic similarity.

Model	Provider	Dimensions	Features
Titan Embeddings V2	Amazon Bedrock	256/512/1024	AWS native. Has normalization option. High affinity with AWS environments
Cohere Embed v3	Amazon Bedrock	1024	Multilingual support. Evaluated as highly accurate for Japanese
text-embedding-3-small	OpenAI	256~1536	Lightweight and low cost. Multilingual support. Best for cost-sensitive use cases
text-embedding-3-large	OpenAI	256~3072	High accuracy and multilingual support. Flexible dimension selection

An important note: you must use the same model for both search and registration.

Vectors generated by different models don't exist in the same space, so distance calculations are meaningless.

Amazon Titan Text Embeddings V2 - Bedrock Documentation

Cosine similarity represents "how much two vectors point in the same direction" as a number between -1 and 1.

Closer to 1 means more semantically similar, closer to 0 means unrelated, and closer to -1 means semantically opposite.

Cosine distance is defined as 1 - cosine similarity and ranges from 0 to 2.

A smaller value means higher similarity, and pgvector's <=> operator returns this cosine distance.

"Distance" and "similarity" are just opposite representations of the same concept.

Metric	Range	"More similar" direction	Use case
Cosine Similarity	-1 to 1	Larger value (closer to 1)	Threshold judgment (e.g., "hit if >= 0.95")
Cosine Distance	0 to 2	Smaller value (closer to 0)	ORDER BY in SQL, KNN search

The search implementation code (embedding <=> %s::vector) sorts by cosine distance, while the threshold judgment in semantic cache described later (similarity >= 0.95) uses cosine similarity.
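The relationship between the two metrics can be checked with a few lines of plain Python. This is a sketch using hand-made 2-dimensional vectors rather than real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    return 1.0 - cosine_similarity(a, b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

The three results sit at the two ends and the middle of the 0 to 2 range, and vector length has no effect on them.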

top_k is the number of highest-ranked results to return from a search. Set an appropriate value based on the use case.

In RAG, it is common to pass the full set of top_k results as context to the LLM.

Be aware that making top_k too large will lengthen the context, increasing the LLM's token consumption and latency.
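One way to keep the context from growing too long is to cap it with a budget. The sketch below assumes the search returns (content, distance) rows already sorted by distance, and uses a character count as a rough stand-in for tokens; select_context and max_chars are hypothetical names.

```python
def select_context(results: list[tuple[str, float]], max_chars: int) -> list[str]:
    """Keep the closest results that fit within a character budget."""
    selected: list[str] = []
    used = 0
    for content, _distance in results:
        if used + len(content) > max_chars:
            break  # stop before the context grows past the budget
        selected.append(content)
        used += len(content)
    return selected


rows = [("aaaa", 0.10), ("bbbbbb", 0.20), ("cc", 0.30)]
context = select_context(rows, max_chars=10)
print(context)
```

A production system would more likely count tokens with the tokenizer of the LLM in use.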

Normalization is the process of setting the length (norm) of a vector to 1.

With Titan Embeddings V2, specifying normalize=True automatically normalizes the output vector.

Cosine similarity between normalized vectors becomes equivalent to a simple dot product.

Since dot products have lower computational cost than cosine similarity, this leads to more efficient search.

Also, by standardizing vector lengths, distance comparisons purely reflect "differences in direction," which stabilizes search result quality.
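The equivalence between cosine similarity and the dot product of normalized vectors can be verified numerically. This is a sketch with small hand-made vectors, not output from Titan.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector so that its length (norm) becomes 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = normalize(a), normalize(b)

cosine = cosine_similarity(a, b)   # computed on the original vectors
unit_dot = dot(unit_a, unit_b)     # plain dot product on the normalized vectors
print(cosine, unit_dot)
```

Both values agree up to floating-point rounding, which is why the norm computation can be skipped once vectors are normalized.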

Of course, data must be registered in advance before you can search a vector database.

Let's now look at data registration in a vector database.

Data registration in a vector database follows this flow:

As with the search explanation, I'll use implementation examples with Aurora PostgreSQL + pgvector and Python.

The following table and index are created on Aurora PostgreSQL with the pgvector extension enabled:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- embeddings table (storage for vector data)
CREATE TABLE IF NOT EXISTS embeddings (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1024) NOT NULL
);

-- HNSW index (speeds up ANN search)
CREATE INDEX IF NOT EXISTS idx_embeddings_embedding
    ON embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

The content column in the embeddings table stores the text data, and the embedding column stores the vectorized text.

An HNSW index is then created on the embedding column.

Vector databases have indexes too, and in Aurora PostgreSQL + pgvector, you create indexes with the CREATE INDEX statement just like regular indexes.

Here, ON embeddings USING hnsw specifies something called the index algorithm.

The index algorithm is closely related to the search algorithm, and these two algorithms are critical in vector databases.

There are two main types of search methods in vector databases:

Search Method	Full Name	Features
KNN	K-Nearest Neighbor	Compares against all data exhaustively. Accuracy is perfect but computation cost increases linearly as data grows, making it slow
ANN	Approximate Nearest Neighbor	Searches approximately. Slightly lower accuracy but can search at high speed even with large volumes of data

In practical systems, ANN is almost always used.

KNN is fine for small-scale data of a few thousand records, but ANN becomes essential when dealing with tens of thousands of records or more.
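To make the cost of exact KNN concrete, here is a minimal brute-force sketch in plain Python. The vectors and document ids are made up for illustration; the point is that every stored vector is compared with the query.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_search(query: list[float], vectors: dict[str, list[float]], k: int):
    """Exact KNN: one distance calculation per stored vector (linear scan)."""
    scored = sorted((cosine_distance(query, v), doc_id) for doc_id, v in vectors.items())
    return [(doc_id, distance) for distance, doc_id in scored[:k]]


vectors = {
    "tokyo-weather": [0.9, 0.1],
    "tokyo-temperature": [0.8, 0.3],
    "cooking-recipe": [0.1, 0.9],
}
top = knn_search([1.0, 0.0], vectors, k=2)
print(top)
```

The work grows in proportion to the number of stored vectors, which is the cost that ANN indexes are designed to avoid.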

The data structures used to implement ANN are called index algorithms, and there are several types:

Algorithm	Mechanism	Features
HNSW	Builds a hierarchical graph structure and progressively narrows the search range from upper to lower layers	High accuracy and high speed. Higher memory consumption but currently the most widely used
IVF	Clusters data and performs partial search only on clusters close to the query	Memory-efficient. Suitable for large-scale data but may have lower accuracy than HNSW
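The IVF idea of "search only the nearby clusters" can be sketched in a few lines. This toy version assumes the centroids are already computed (for example by k-means) and uses squared Euclidean distance for simplicity; the function names and the probes parameter are illustrative, not a real library API.

```python
def squared_distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors: dict[str, list[float]], centroids: list[list[float]]):
    """Assign every vector to its nearest centroid (one list per cluster)."""
    clusters: dict[int, list[str]] = {i: [] for i in range(len(centroids))}
    for doc_id, v in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: squared_distance(v, centroids[i]))
        clusters[nearest].append(doc_id)
    return clusters


def ivf_search(query, vectors, centroids, clusters, k: int, probes: int = 1):
    """Search only the `probes` clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: squared_distance(query, centroids[i]))
    candidates = [doc_id for i in order[:probes] for doc_id in clusters[i]]
    candidates.sort(key=lambda doc_id: squared_distance(query, vectors[doc_id]))
    return candidates[:k]


centroids = [[0.0, 0.0], [10.0, 10.0]]  # assumed to be precomputed
vectors = {"a": [1.0, 1.0], "b": [0.0, 2.0], "c": [9.0, 9.0], "d": [11.0, 10.0]}
clusters = build_ivf(vectors, centroids)
result = ivf_search([8.0, 8.0], vectors, centroids, clusters, k=2)
print(clusters, result)
```

With probes=1 only one cluster is scanned, so a true neighbor sitting in another cluster would be missed; that is the accuracy trade-off mentioned in the table.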

Currently, the ANN + HNSW combination is the standard for building vector databases.

AWS offers multiple ways to build vector databases, and Aurora PostgreSQL + pgvector, OpenSearch, and MemoryDB all support HNSW.

-- HNSW index (speeds up ANN search)
CREATE INDEX IF NOT EXISTS idx_embeddings_embedding
    ON embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

The WITH clause in the index creation SQL specifies the HNSW index parameters:

Parameter	Meaning	Effect when increased	Typical value
m	Connections per node	Search accuracy ↑ / Memory consumption ↑ / Build time ↑	16
ef_construction	Search width during construction	Search accuracy ↑ / Build time ↑	64~200

Here is the Python code to register a substantial amount of data into Aurora PostgreSQL + pgvector:

class AuroraIngester:
    """Batch INSERT data into Aurora pgvector.

    Efficiently inserts vector data using batch INSERT of 500 records at a time.
    """
    def __init__(self, connection: psycopg2.extensions.connection) -> None:
        self._connection = connection

    def ingest_batch(self, start_index: int, end_index: int) -> int:
        """Batch INSERT records in the specified range.

        Args:
            start_index: Start index (inclusive)
            end_index: End index (exclusive)

        Returns:
            Number of records inserted
        """
        values_parts: list[str] = []
        params: list[str | list[float]] = []
        for i in range(start_index, end_index):
            values_parts.append("(%s, %s::vector)")
            params.append(f"doc-{i}")
            params.append(generate_vector(seed=i))

        sql = f"INSERT INTO embeddings (content, embedding) VALUES {', '.join(values_parts)};"
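        try:
            with self._connection.cursor() as cur:
                cur.execute(sql, params)
            self._connection.commit()
        except Exception:
            # Roll back so that a retry does not run inside an aborted transaction
            self._connection.rollback()
            raise
        return end_index - start_index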

    def ingest_all(self, record_count: int, batch_size: int = 500) -> int:
        """Insert all records into Aurora in batches.

        Args:
            record_count: Total number of records to insert
            batch_size: Number of records per batch (default 500)

        Returns:
            Total number of records inserted
        """
        log = logger.bind(database="aurora_pgvector")
        total_inserted = 0

        for start in range(0, record_count, batch_size):
            end = min(start + batch_size, record_count)
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    count = self.ingest_batch(start, end)
                    total_inserted += count
                    break
                except Exception as e:
                    log.warning("batch_insert_retry", start=start, end=end, attempt=attempt, error=str(e))
                    if attempt == MAX_RETRIES:
                        log.error("batch_insert_failed", start=start, end=end, error=str(e))
                        break
                    time.sleep(RETRY_DELAY_SECONDS)

        log.info("ingest_all_complete", total_inserted=total_inserted)
        return total_inserted

The ingestion is driven by the following function:

def _run_database_ingestion(index_manager, ingester, record_count):
    """Execute bulk data insertion into the database.

    Args:
        index_manager: Object managing index drop and creation (implementation omitted)
        ingester: Object that inserts data in batches (described above)
        record_count: Total number of records to insert
    """
    index_manager.drop_index()

    ingester.ingest_all(record_count, batch_size=500)

    index_manager.create_index()

The reason for dropping the index first, registering data, and then recreating the index is that every insert into an indexed table also has to update the index, which slows the load and makes processing time unpredictable.
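The implementation of index_manager is omitted above. As a rough sketch of what such an object could look like, the class below reuses the index definition from earlier in this article. The class name IndexManager and its structure are assumptions, and the connection is assumed to be a psycopg2 connection.

```python
class IndexManager:
    """Hypothetical helper that drops and recreates the HNSW index."""

    INDEX_NAME = "idx_embeddings_embedding"

    def __init__(self, connection) -> None:
        self._connection = connection  # assumed to be a psycopg2 connection

    def _execute(self, sql: str) -> None:
        with self._connection.cursor() as cur:
            cur.execute(sql)
        self._connection.commit()

    def drop_index(self) -> None:
        """Remove the index so bulk inserts do not have to maintain it."""
        self._execute(f"DROP INDEX IF EXISTS {self.INDEX_NAME};")

    def create_index(self) -> None:
        """Rebuild the index once, after all rows are in place."""
        self._execute(
            f"CREATE INDEX IF NOT EXISTS {self.INDEX_NAME} "
            "ON embeddings USING hnsw (embedding vector_cosine_ops) "
            "WITH (m = 16, ef_construction = 64);"
        )
```

Building the index once at the end lets the database process all rows in a single pass instead of updating the graph on every insert.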

This technique is commonly used in relational databases and applies equally to Aurora PostgreSQL + pgvector.

For more details, see: Index Considerations When Bulk-Inserting Large Amounts of Data into a Database (Japanese)

One technique for speeding up search and data retrieval is caching.

For vector databases, there is a technology called semantic cache that differs slightly from conventional caching.

Semantic cache is a mechanism that uses the embedding vector of a query as a key to cache past search results or FM (Foundation Model) responses, and quickly returns results from the cache for semantically similar queries.

Comparing it with conventional caching reveals its unique characteristics:

Aspect	Conventional Cache	Semantic Cache
Key	Exact string match	Vector similarity
Hit condition	Only the exact same query	Semantically similar queries also hit
Example	Only "weather in Tokyo" hits	"Tokyo weather forecast" and "What's the weather in Tokyo today?" also hit

With conventional caching, "weather in Tokyo" and "Tokyo weather forecast" are treated as different keys, resulting in lower cache hit rates. Semantic cache can group semantically equivalent queries together for caching, dramatically improving hit rates.

When implementing semantic cache on AWS, Amazon ElastiCache or Amazon MemoryDB are the typical options.

Here, I'll introduce a semantic cache implementation using Amazon MemoryDB (hereafter, MemoryDB), referencing the following documentation:

Amazon MemoryDB - Vector Search Examples

Setting RAG with a vector database aside for a moment, if you introduce semantic cache for Foundation Model queries, the processing flow would look like this:

Note: MemoryDB is a Redis-compatible key-value store and does not have "tables" like RDBs. Data is stored in Hash-type keys, and the search schema is defined as an "index" using the FT.CREATE command.

In this repository, the following FT.CREATE command creates the index for semantic cache:

FT.CREATE semantic_cache_idx
  ON HASH
  PREFIX 1 cache:
  SCHEMA
    embedding    VECTOR HNSW 10
                   TYPE FLOAT32
                   DIM 1024
                   DISTANCE_METRIC COSINE
                   M 16
                   EF_CONSTRUCTION 512
    query_text   TAG
    result       TEXT
    created_at   NUMERIC
    ttl          NUMERIC

Field	Type	Description
embedding	VECTOR (HNSW)	Query embedding vector (1024 dimensions). Target for KNN search
query_text	TAG	Original query text. For exact match filtering
result	TEXT	FM response result (cached answer)
created_at	NUMERIC	Cache entry creation time (UNIX timestamp)
ttl	NUMERIC	Cache expiration time (seconds)

PREFIX 1 cache: means only Hashes whose key name starts with cache: are indexed.

EF_CONSTRUCTION=512 is set higher than Aurora pgvector (64). Since MemoryDB operates in-memory, build cost is relatively low, so accuracy is prioritized.

The threshold for semantic cache is the cosine similarity value used to determine cache hits.

Threshold	Characteristics	Recommended Use Case
0.95~1.0	Only nearly identical queries hit	Accuracy-focused. When you want to minimize the risk of returning incorrect cached responses
0.80~0.90	Synonymous phrasing variations also hit	Practical balance. Recommended for most use cases
0.70~0.80	Related queries also broadly hit	Hit rate-focused. However, the risk of returning unrelated results increases

The appropriate threshold depends on business requirements, so I think it's safe to start with a high threshold around 0.95 and gradually lower it while monitoring cache hit rates.
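The effect of lowering the threshold can be previewed offline if you log the best-match similarity of each query. The sketch below uses made-up similarity values purely for illustration; they are not measurements, and hit_rate is a hypothetical helper.

```python
def hit_rate(similarities: list[float], threshold: float) -> float:
    """Fraction of queries whose best-match similarity reaches the threshold."""
    hits = sum(1 for s in similarities if s >= threshold)
    return hits / len(similarities)


# Hypothetical best-match similarities for six queries (illustrative only)
observed = [0.99, 0.97, 0.91, 0.84, 0.76, 0.52]

for threshold in (0.95, 0.85, 0.75):
    print(threshold, round(hit_rate(observed, threshold), 2))
```

A higher hit rate at a lower threshold is only a benefit if the extra hits are actually correct answers, so it is worth reviewing a sample of them before lowering the value.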

These are not keywords specific to vector databases or semantic cache. They are commands and conventions of Redis, the engine underlying MemoryDB.

HSET is a command that saves field-value pairs together in a Hash-type key.

Multiple fields like embedding, query_text, result, and created_at can be stored as a single entry.

In Redis / MemoryDB, it's conventional to use colon-separated naming like cache:abc123 for key names.

This simply means "entry abc123 in the cache category"; the colon itself has no special function.

The PREFIX 1 cache: in the index definition is a setting to make only keys starting with this prefix subject to search.

EXPIRE is a command that sets an expiration time (TTL) on a key. After the specified number of seconds, the key is automatically deleted. This prevents stale cache entries from accumulating.
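The code that writes a cache entry is not shown in this article. As a sketch of how a Hash write and a TTL could be combined with redis-py's hset and expire methods, consider the function below. The function name store_cache_entry and the key scheme (cache: plus a hash of the query text) are assumptions for illustration.

```python
import hashlib
import struct
import time


def store_cache_entry(redis_client, query_text, query_embedding, result, ttl_seconds):
    """Save one cache entry as a Hash and let it expire after ttl_seconds.

    Sketch only: the key scheme (cache:<sha256 prefix>) is an assumption.
    """
    key = "cache:" + hashlib.sha256(query_text.encode("utf-8")).hexdigest()[:16]
    redis_client.hset(
        key,
        mapping={
            # Little-endian FLOAT32 bytes, matching TYPE FLOAT32 in the index
            "embedding": struct.pack(f"<{len(query_embedding)}f", *query_embedding),
            "query_text": query_text,
            "result": result,
            "created_at": int(time.time()),
            "ttl": ttl_seconds,
        },
    )
    # The key is deleted automatically once the TTL elapses
    redis_client.expire(key, ttl_seconds)
    return key
```

Because the key starts with cache:, an index created with PREFIX 1 cache: picks the entry up for vector search.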

The implementation code got a bit long, but what it does is the same as typical cache-based data retrieval: use the cache if available, otherwise search and save the result to cache.

I'll introduce the implementation code in three stages.

def handler(event, context):
    query = event["query"]  # "What is AWS S3?"

    embedding_result = generate_embedding(query)
    query_embedding = embedding_result.embedding

    cache_result = process_query(
        query_text=query,
        query_embedding=query_embedding,
        redis_client=redis_client,
        threshold=0.95,   # Environment variable SIMILARITY_THRESHOLD
        ttl_seconds=3600, # Environment variable CACHE_TTL
    )

    return {"statusCode": 200, "body": {...}}

Next, process_query checks the cache and falls back to the FM on a miss:

def process_query(query_text, query_embedding, redis_client,
                  threshold, ttl_seconds):
    search_results = search_similar(redis_client, query_embedding)

    if search_results:
        key, similarity, fields = search_results[0]

        if similarity >= threshold:
            return CacheResult(hit=True, source="cache",
                               result=fields["result"])

    fm_result = _invoke_fm(query_text)

    _store_cache_entry(redis_client, query_text,
                       query_embedding, fm_result, ttl_seconds)

    return CacheResult(hit=False, source="fm", result=fm_result)

Finally, search_similar runs the KNN search against MemoryDB:

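import struct

from redis.commands.search.query import Query


def search_similar(redis_client, query_embedding, top_k=1):
    """Execute KNN vector search with FT.SEARCH."""
    # Pack the embedding as little-endian FLOAT32 bytes, matching the index TYPE
    query_vec = struct.pack(f"<{len(query_embedding)}f", *query_embedding)

    query = (
        Query(f"*=>[KNN {top_k} @embedding $query_vec AS score]")
        .return_fields("query_text", "result", "created_at", "score")
        .sort_by("score", asc=True)
        .paging(0, top_k)
        .dialect(2)
        .timeout(3000)  # 3-second timeout
    )

    results = redis_client.ft("semantic_cache_idx").search(
        query, query_params={"query_vec": query_vec}
    )

    # Convert cosine distance (score) to similarity and collect the returned fields
    return [
        (
            doc.id,
            1.0 - float(doc.score),
            {"query_text": doc.query_text, "result": doc.result, "created_at": doc.created_at},
        )
        for doc in results.docs
    ]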

MemoryDB's FT.SEARCH command is compatible with Redis's RediSearch module and natively supports KNN vector search.

score is returned as cosine distance (1 - cosine similarity, theoretically in the range 0~2). 1.0 - score converts it to cosine similarity.

Normalization with Titan V2's normalize=True does not by itself narrow this range, but text embeddings rarely point in opposite directions, so in practice scores usually fall in the range 0~1 and the converted similarity also stays in the 0~1 range.

Here are the measured results under the following conditions:

Item	Value
FM (Foundation Model)	Claude 3 Haiku (anthropic.claude-3-haiku-20240307-v1:0)
Embedding Model	Titan Embeddings V2 (1024 dimensions)
Cache Store	Amazon MemoryDB
Similarity Threshold	0.95
Test Query	"What is AWS S3?" (same query run twice)

The threshold is set high at 0.95.

Please treat these measurement results as reference values to demonstrate that semantic cache has a certain level of effectiveness.

Metric	Cache Miss (1st run)	Cache Hit (2nd run)	Reduction
Total Response Time	4,573ms	279ms	94%
Embedding Generation	194ms	192ms	—
Cache Lookup	4ms	3ms	—
FM Call	4,375ms	0ms	100%

When there's a cache hit, the FM call is completely skipped, reducing response time by 94%.

Since only embedding generation (~190ms) and cache lookup (~3ms) are needed to complete the response, user experience is dramatically improved.

Skipping the FM call also directly translates to reduced API usage costs.
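The reduction figure can be reproduced from the numbers in the table. The helper name reduction_percent is just for illustration.

```python
def reduction_percent(before_ms: float, after_ms: float) -> float:
    """Percentage by which the response time shrank."""
    return (before_ms - after_ms) / before_ms * 100


total = reduction_percent(4573, 279)
print(round(total))  # 94
```

The same function applied to the FM call row (4,375ms to 0ms) gives the 100% shown in the table.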

Semantic cache can be integrated into a RAG system.

In that case, the processing flow would look like this:

In this article, I covered everything from the basic concepts of vector databases to implementation on AWS and optimization with semantic cache.

That's all for this time.

Thank you for reading this lengthy article!

Source: original article on dev.to