Vibe-Memory Part 3: pgvector Performance Optimization - How I Cut Query Time From 800ms to 15ms

wpnews.pro

Honestly, when I first built vibe-memory, I thought "how hard can pgvector be?" It's just vector search, right? Everyone's doing it. I added the extension, created a table with an embedding column, threw an index on it, and called it a day.

Three weeks later, with 5,000 memories in my database, I was staring at 800ms+ query times wondering what went wrong. That's slower than generating the AI response! What's the point of having semantic memory if you have to wait forever for it?

So here's the thing — I learned the hard way that pgvector performance isn't automatic. The defaults work great for 100 vectors, but once you start getting serious, you need to actually think about what you're doing. After three days of digging through documentation, experimenting, and pulling my hair out, I got query time down from 800ms to 15ms.

Let me share what actually worked (and what didn't).

First, let's recap what vibe-memory is. It's a personal semantic memory service that stores all your AI conversation context and lets you search it with natural language. Every message gets embedded, stored in PostgreSQL with pgvector, and when you ask a question, it finds the most relevant memories and sends them to the AI.

My initial schema was about as simple as it gets:

type Memory struct {
    ID        string    `json:"id"`
    Content   string    `json:"content"`
    Embedding pgvector.Vec `gorm:"type:vector(1536)" json:"-"`
    CreatedAt time.Time `json:"created_at"`
    UpdatedAt time.Time `json:"updated_at"`
}

// Then I just added...
CREATE INDEX ON memories USING ivffat (embedding vector_cosine_ops);

That's it. That's what every tutorial tells you to do. And with a few hundred vectors, it worked fine. 20-30ms, no problem.

But once I hit 5,000 vectors, queries started taking 400-800ms. And when I hit 10,000? Forget about it — 1.5+ seconds per query. That's unusable.

I'm not a DBA. I'm just a developer who wanted to add semantic memory to my AI assistant. So I had to learn all this the hard way. Here's what I discovered, step by step.

The first thing I learned: the default IVFFlat index I was using isn't actually the best choice for most cases. I had no idea there were two different index types when I started.

IVFFlat is the "inverted file with flat compression" index. It's faster to build, uses less memory, but query time grows linearly with your dataset size. It's great for bulk , but not great for low-latency queries.

HNSW is the "hierarchical navigable small world" index. It's slower to build, uses more memory, but query time is basically constant even as your dataset grows. It's what you want for production.

I switched to HNSW and immediately dropped from 800ms to 150ms. That's a 5x improvement just from changing the index type!

Here's what the migration looked like in Go/GORM:

// Up migration
func up(tx *gorm.DB) error {
    // Drop old IVFFlat index
    _ = tx.Exec(`DROP INDEX IF EXISTS idx_memories_embedding`)

    // Create new HNSW index with cosine distance
    return tx.Exec(`CREATE INDEX idx_memories_embedding ON memories USING hnsw (embedding vector_cosine_ops)`)
}

That's it. Just changing ivffat

to hnsw

gave me a massive improvement.

But wait — why doesn't everyone use HNSW all the time? The tradeoff is build time and memory. HNSW takes longer to create the index and uses more RAM. On my 10k vector dataset with 1536 dimensions, it uses about 50-60MB more RAM. That's nothing for a modern server, but it's something to be aware of.

Pro tip: If you're running a personal project or small service with less than 100k vectors, use HNSW. The memory tradeoff is totally worth it for the query speed. Don't do what I did and start with IVFFlat "because it's the default".

I was using OpenAI's text-embedding-3-small

which gives you 1536 dimensions by default. It's great quality, but do you really need all those dimensions for personal semantic memory?

OpenAI actually lets you reduce the dimensions while keeping most of the performance. I dropped from 1536 to 512 dimensions and barely noticed any difference in result quality. But look what happened to query time: another 2x improvement. From 150ms to 70ms.

Wait, that makes sense — fewer dimensions means less data to process, smaller index size, faster distance calculations.

Here's how that looks in code when you generate embeddings with OpenAI:

req := &openai.EmbeddingRequest{
    Model: openai.SmallEmbedding3,
    Input:  []string{content},
    Dimensions: 512, // This is the magic line
}

That's it. Just add that one parameter. I did several side-by-side comparisons, and for personal use cases like "find my previous conversation about X", 512 dimensions is 99% as good as 1536. Your postgres will thank you.

Pros of dimension reduction:

Cons:

For a personal AI memory project like vibe-memory, this is a no-brainer. I've been using 512 dimensions for months and haven't noticed a single bad search result that I can blame on dimension reduction.

Here's where I wasted half a day — I kept seeing weird query performance after adding the HNSW index. Sometimes it was fast, sometimes it was slow. What gives?

Turns out pgvector (and PostgreSQL in general) needs proper maintenance after you insert a bunch of data. HNSW index doesn't automatically vacuum dead tuples the way you expect. If you've been inserting and deleting a lot of data (which I was doing while developing), you need to manually vacuum.

VACUUM ANALYZE memories;

I ran this and dropped another 20ms — from 70ms to 50ms. More importantly, query time became consistent. No more random slowdowns. That's the thing that surprised me the most — I had no idea this was necessary.

Pro tip: After you do a big batch insert or delete a lot of rows, run VACUUM ANALYZE

on your table. It takes a minute, but it makes a huge difference in consistency.

I made an obvious mistake that I didn't even think about — I was doing vector search first, then filtering by date. That meant pgvector had to scan a bunch of vectors that were going to get filtered out anyway.

Bad approach:

SELECT *, embedding <-> $1 AS distance
FROM memories
WHERE created_at > $2
ORDER BY distance ASC
LIMIT $3;

Wait, no — actually that's what I was doing, but it's not that bad because PostgreSQL's query planner should push the filter before the index scan right? Well, not always. Especially if you have a lot of old data that's rarely queried.

I started explicitly putting my common filters in the query before the vector search, and when I added an index on the filter column (in my case, created_at

), I got another 10-15ms improvement.

Better approach with composite indexing:

// Add composite index for common query patterns
// If you always filter by conversation_id, make a composite index:
CREATE INDEX idx_memories_conversation_id_created_at ON memories (conversation_id, created_at);

-- Then your query becomes:
SELECT id, content, created_at, embedding <-> $1 AS distance
FROM memories
WHERE conversation_id = $2
ORDER BY distance ASC
LIMIT $3;

In vibe-memory, most searches are within a specific conversation, so adding that conversation_id filter early drastically reduces the number of vectors pgvector needs to search.

The general principle is: filter what you can filter before doing the vector search. The fewer vectors you need to compare, the faster your query will be.

After doing all that, I was down to about 30-50ms per query. That's pretty good! But I noticed that people tend to search for similar things repeatedly. When you're having a long conversation about a topic, you're going to be querying for related memories over and over.

So I added a simple in-memory cache for the most recent queries. Nothing fancy — just an LRU cache that stores the query embedding → result IDs mapping.

import "github.com/hashicorp/golang-lru/v2"

type CachedMemorySearch struct {
    cache *lru.Cache[string, []string]
    repo  MemoryRepository
}

func NewCachedMemorySearch(repo MemoryRepository, size int) *CachedMemorySearch {
    cache, _ := lru.New[string, []string](size)
    return &CachedMemorySearch{cache: cache, repo: repo}
}

func (c *CachedMemorySearch) Search(ctx context.Context, embedding []float32, limit int) ([]Memory, error) {
    // Create a cache key from the embedding (we can hash it)
    key := fmt.Sprintf("%v", embedding)

    if ids, ok := c.cache.Get(key); ok {
        // Cache hit - just fetch the memories by ID
        return c.repo.FindByIDs(ctx, ids)
    }

    // Cache miss - do the full search
    results, err := c.repo.Search(ctx, embedding, limit)
    if err == nil {
        // Cache the result IDs
        ids := make([]string, len(results))
        for i, res := range results {
            ids[i] = res.ID
        }
        c.cache.Add(key, ids)
    }
    return results, err
}

This gave me another huge improvement for common usage patterns. Cache hits are sub-1ms. Even cache misses only add a tiny bit of overhead. For my usage, this probably averages out to another 2-3x speedup in practice.

It's the low-hanging fruit that everyone thinks they don't need, but once you add it, you wonder how you lived without it.

After all these optimizations, what do we have?

Stage	Query Time	Improvement
Initial (IVFFlat, 1536d, no vacuum)	800ms	baseline
Switch to HNSW	150ms	5.3x faster
Drop from 1536d to 512d	70ms	11.4x faster
VACUUM ANALYZE	50ms	16x faster
Filter early + composite index	35ms	22.8x faster
Add LRU cache (hit)	<1ms	800x faster
Add LRU cache (average)	15ms	53x faster

That's insane. I started with 800ms and got down to average 15ms. That's faster than the network overhead of most API calls.

Let me be honest — not every optimization is right for every project. Here's what I think after going through this:

It's all standard PostgreSQL — no extra infrastructure, no new services to manage. Just your existing PostgreSQL instance with the pgvector extension. That's huge for a side project like this.

Progressive enhancement — you can start simple and add optimizations as you need them. I started with the basic setup and added each optimization one by one when I hit performance issues. You don't need to do all this day one.

The tradeoffs are worth it — 50MB extra memory for HNSW is nothing for a modern server. 512 dimensions instead of 1536 is barely noticeable in quality for personal use. The cache is like 50 lines of code. All cheap wins.

It scales surprisingly far — with these optimizations, I'm pretty sure I could get to 100k+ memories and still keep query time under 100ms. That's more than enough for any personal knowledge base.

HNSW takes longer to build — if you're bulk importing a million vectors, build time will be slower. But for personal use with 10k-100k vectors, it's still just a few minutes.

Dimension reduction does lose some quality — it's minimal for my use case, but if you're doing production semantic search for customers, you might want to keep full dimensions.

Cache invalidation can get tricky — if you update/delete a memory, you need to invalidate the cache. I keep it simple with a small cache size and it evicts automatically, but if you have lots of writes you need to think about this more carefully.

PostgreSQL still isn't a specialized vector database — if you're doing this at massive scale, you'll eventually outgrow PostgreSQL and need something like Pinecone or Weaviate. But for 99% of personal projects and small services, PostgreSQL + pgvector is more than enough when optimized properly.

Knowing what I know now, here's the starting schema I'd use if I was starting over:

-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with reduced dimensions
CREATE TABLE memories (
    id UUID PRIMARY KEY,
    conversation_id UUID NOT NULL,
    content TEXT NOT NULL,
    embedding vector(512) NOT NULL, -- 512 dimensions from the start
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

-- HNSW index from day one
CREATE INDEX idx_memories_embedding ON memories USING hnsw (embedding vector_cosine_ops);

-- Composite index for common filtering
CREATE INDEX idx_memories_conversation_id_created_at ON memories (conversation_id, created_at);

That's it. That's the optimized starting point that would have saved me three days of debugging. Start with what works for 90% of small projects, don't bother with IVFFlat unless you know you need it.

So here's the thing — pgvector is amazing, but it doesn't magically give you great performance out of the box. A few simple changes made all the difference for me:

Honestly, I'm shocked at how far you can push PostgreSQL + pgvector for a personal AI memory project. I started thinking I'd have to move to a dedicated vector database once I got to a few thousand vectors, and now I'm pretty sure I can keep it here forever.

Have you optimized pgvector for your project? Did you find other tricks that I missed? I'd love to hear about your experience in the comments below — I'm still learning this stuff, and there's probably more I can do to make it even faster!

Vibe-Memory is open source and available on GitHub — go check it out if you want to see the full working code for everything I talked about here.

source & further reading

dev.to — original article Traces show what your agent did - a decision ledger shows what it was allowed to do AI Systems Need Evidence, Not Just Observability Top AI Papers on Hugging Face - 2026-06-25

Vibe-Memory Part 3: pgvector Performance Optimization - How I Cut Query Time From 800ms to 15ms

Run your AI side-project on zahid.host