Give Your AI Agent Persistent Long-Term Memory with Postgres and pgvector

wpnews.pro

Store and retrieve conversation history semantically across sessions using pgvector's HNSW index and OpenAI embeddings — no external vector database required.

Mariana Souza

What You'll Build #

A Python agent that embeds each conversation exchange and persists it in Postgres via pgvector. On every new message, it retrieves the most semantically similar past exchanges and injects them into the system prompt, giving the agent cross-session recall that survives process restarts.

Prerequisites #

Python 3.10+
Docker (used for the pgvector/pgvector

image) - OpenAI API key with access to text-embedding-3-small

andgpt-4o-mini

pgvector 0.6.0+ (bundled in the Docker image below)

Install Python dependencies:

pip install openai psycopg2-binary pgvector numpy python-dotenv

If you're running a self-managed Postgres instance instead of Docker, follow the build-from-source steps at github.com/pgvector/pgvector.

Step 1: Start Postgres with pgvector #

The official image ships with the extension already compiled against the correct Postgres version:

docker run -d \
  --name agent-memory \
  -e POSTGRES_USER=agent \
  -e POSTGRES_PASSWORD=changeme \
  -e POSTGRES_DB=agentdb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Verify the extension is present:

docker exec -it agent-memory psql -U agent -d agentdb \
  -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';"

If no row comes back, connect via psql and run CREATE EXTENSION vector;

manually. That's a one-time operation per database.

Step 2: Create the Schema #

Save this as schema.sql

:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS agent_memories (
    id          BIGSERIAL PRIMARY KEY,
    session_id  TEXT        NOT NULL,
    content     TEXT        NOT NULL,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS memories_hnsw_idx
    ON agent_memories
    USING hnsw (embedding vector_cosine_ops);

Apply it:

docker exec -i agent-memory psql -U agent -d agentdb < schema.sql

vector(1536)

matches text-embedding-3-small

's output dimensionality. The HNSW index is the right choice here: unlike IVFFlat, it requires no training phase and delivers better recall. The tradeoff is slightly higher memory use, which only matters at millions of rows. vector_cosine_ops

tells Postgres to optimize the index for <=>

(cosine distance) queries, which is what you want for OpenAI embeddings.

Step 3: Build the Memory Store #

Create .env

:

OPENAI_API_KEY=sk-...
PG_HOST=localhost
PG_DB=agentdb
PG_USER=agent
PG_PASSWORD=changeme

Then memory_store.py

. Note the load_dotenv()

call at the top of this file. Because client = OpenAI()

runs at import time (module level), the environment variables must be populated before the module is imported anywhere. Calling load_dotenv()

here makes the module self-sufficient regardless of call order in the importer.

import os
import contextlib
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"

@contextlib.contextmanager
def _db():
    conn = psycopg2.connect(
        host=os.getenv("PG_HOST", "localhost"),
        dbname=os.getenv("PG_DB", "agentdb"),
        user=os.getenv("PG_USER", "agent"),
        password=os.environ["PG_PASSWORD"],
    )
    register_vector(conn)
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()

def embed(text: str) -> np.ndarray:
    raw = client.embeddings.create(model=EMBED_MODEL, input=text).data[0].embedding
    return np.array(raw, dtype=np.float32)

def save_memory(session_id: str, content: str) -> None:
    vec = embed(content)
    with _db() as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO agent_memories (session_id, content, embedding) "
            "VALUES (%s, %s, %s)",
            (session_id, content, vec),
        )

def recall(query: str, top_k: int = 4) -> list[str]:
    vec = embed(query)
    with _db() as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT content
            FROM agent_memories
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (vec, top_k),
        )
        return [row[0] for row in cur.fetchall()]

register_vector(conn)

teaches psycopg2 how to serialize numpy arrays into pgvector's text wire format and deserialize results back. It patches adapters on the connection object, so each new connection needs the call. pgvector stores vectors as 32-bit floats internally, so np.float32

matches the storage precision exactly.

recall

queries across all sessions by default. Add WHERE session_id = %s

if you have multiple users who need strict memory isolation.

Step 4: The Agent Loop #

agent.py

. The import order here is intentional: load_dotenv()

must fire before memory_store

is imported. Even though memory_store.py

now also calls load_dotenv()

, making it self-sufficient, keeping this order in agent.py

is defensive and costs nothing.

from dotenv import load_dotenv
load_dotenv()  # must run before memory_store is imported

import uuid
import os
from openai import OpenAI
from memory_store import save_memory, recall

client = OpenAI()
SESSION_ID = str(uuid.uuid4())

def build_system_prompt(memories: list[str]) -> str:
    if not memories:
        return "You are a helpful assistant."
    block = "\n".join(f"- {m}" for m in memories)
    return (
        "You are a helpful assistant.\n"
        "Relevant memories from past conversations:\n"
        f"{block}\n\n"
        "Draw on these only when they're relevant."
    )

def chat(user_message: str) -> str:
    memories = recall(user_message, top_k=4)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": build_system_prompt(memories)},
            {"role": "user", "content": user_message},
        ],
    )
    reply = response.choices[0].message.content
    save_memory(SESSION_ID, f"User: {user_message}\nAssistant: {reply}")
    return reply

if __name__ == "__main__":
    print(f"Session: {SESSION_ID}\nType 'quit' to exit.\n")
    while True:
        msg = input("You: ").strip()
        if not msg or msg.lower() in ("quit", "exit"):
            break
        print(f"Agent: {chat(msg)}\n")

Each turn follows the same sequence: embed the query, pull the top-4 semantically similar memories from Postgres, inject them into the system prompt, generate a response, then store the full exchange as a new memory. Storing the exchange as a single string (User: ... / Assistant: ...

) works well for recall because a single embedding captures both sides of the turn. An alternative: store only LLM-synthesized summaries per turn, cutting embedding calls in half and reducing noise in retrieval.

Verify It Works #

Run a first session and tell the agent something specific:

You: My name is Priya and I'm building a Rust compiler for embedded targets.
Agent: That's a fascinating project, Priya...

Exit with quit

, then restart the script entirely:

python agent.py
You: Do you remember what project I'm working on?
Agent: Yes, you mentioned you're building a Rust compiler for embedded targets.

That recall came from Postgres, not any in-process state. To inspect what's stored:

docker exec -it agent-memory psql -U agent -d agentdb \
  -c "SELECT session_id, left(content, 80), created_at FROM agent_memories ORDER BY created_at DESC LIMIT 5;"

Troubleshooting #

** OpenAIError: The api_key client option must be set** - Your

.env

file isn't being found or OPENAI_API_KEY

is missing from it. Confirm the file is in the same directory you're running the script from. Both memory_store.py

and agent.py

call load_dotenv()

, so one of them should pick it up, but python-dotenv won't override a variable that's already set to an empty string in the shell environment.** ERROR: type "vector" does not exist** - The extension isn't enabled in this database. Run

CREATE EXTENSION vector;

inside psql. One-time per database, not per connection.** connection refused on port 5432** - The container stopped. Check with

docker ps -a

, then docker start agent-memory

.** ERROR: expected 1536 dimensions, not N** - You mixed embedding models or changed the

vector(N)

column after inserting rows. Drop the table, re-apply schema.sql

, and commit to one model for the life of the table.Retrieved memories are irrelevant - With few rows, cosine similarity has little to work with and can surface weak matches. Add a distance threshold: modify the SELECT to WHERE (embedding <=> %s) < 0.4

and tune from there. You can also filter by created_at

to prefer recent memories.

Next Steps #

Memory compression: Periodically summarize older memories with the LLM and replace raw exchanges with a condensed form. Fewer rows, better signal-to-noise.Importance scoring: Add arelevance_score FLOAT

column. Rank retrieved memories by a blend of semantic similarity and recency rather than cosine distance alone.Async: Swap psycopg2 for asyncpg and thepgvector

package's asyncpg adapter for use inside async agent frameworks like LangGraph or Pydantic AI.Structured extraction: Before embedding, extract named entities and key facts from the exchange. Embedding structured facts rather than raw conversation text sharpens retrieval precision considerably.

Mariana Souza· Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.

Discussion 0 #

No comments yet

Be the first to weigh in.

source & further reading

sourcefeed.dev — original article AMD's GLM-5.2 win over Blackwell is a software story The Supply Chain Risk of LLM Code in Dependencies Google Sunsets Consumer Gemini Code Assist for Multi-Agent Pivot