LLM Memory System Pitfalls: A 3-Hour Bug Hunt Solved with Pytest Snapshot Testing

wpnews.pro

cd /news/large-language-models/llm-memory-system-pitfalls-a-3-hour-… · home › topics › large-language-models › article

[ARTICLE · art-24673] src=dev.to ↗ pub=2026-06-12T01:04Z topic=large-language-models verified=true sentiment=↓ negative

LLM Memory System Pitfalls: A 3-Hour Bug Hunt Solved with Pytest Snapshot Testing

A developer spent three hours debugging a production LLM memory system bug where a `rollback` method wiped out entire conversation histories instead of just undoing erroneous operations. The root cause was a code refactor that accidentally cleared the `snapshots` table during rollback, which existing unit tests failed to catch because they always started from empty databases and never simulated cross-session persistence. The developer resolved the issue by implementing snapshot testing that treats the SQLite database file itself as an immutable artifact, enabling tests to verify file-level persistent state across different connections.

read3 min views18 publishedJun 12, 2026

It was 2 a.m. when the alert call jolted me awake — our production Agent had suffered “amnesia” for three consecutive conversations. The context the user had carefully built was gone, and complaints were flooding in. Squinting at the logs, I discovered that the rollback

method in the memory management module had been broken by an innocuous-looking code refactor. Not only did the rollback undo the erroneous operation, it also wiped out the entire conversation history. Worse still, our existing unit tests never caught the bug: they always started from a fresh empty database and could never cover a cross-session scenario like “roll back dirty data to a previous snapshot.” I spent three hours debugging, manually simulating intermediate states, before I finally pinpointed the root cause. That’s when it hit me: we weren't lacking tests — we were missing snapshot tests that capture the entire “memory state.”

Our LLM memory system uses SQLite for local persistence. Each session owns a table that stores conversation turns, vector summaries, and tool-call records. Two critical operations are:

save_snapshot(session_id)

: serializes the full state of a session into the snapshots

table, creating a rollback checkpoint.rollback_to_snapshot(session_id, snapshot_id)

: when something goes wrong, it rebuilds the session table from a snapshot and discards all changes made after that point.This mechanism had been running smoothly — until a refactor I made changed the transaction boundaries inside the rollback logic. After the rollback executed, the conversations

table was rebuilt just fine, but the snapshots

table itself was accidentally wiped out. The next rollback attempt couldn’t find any previous checkpoints.

Why didn’t traditional unit tests catch this? Because the typical test flow looks like this:

def test_rollback():
    db = create_in_memory_db()
    db.save_snapshot("s1")
    db.rollback_to_snapshot("s1", ...)
    assert db.get_conversation("s1") == expected

Everything runs in a single process, inside a single temporary database. However, the production scenario was different: process A saves a snapshot and exits, then process B reopens the same database file and performs the rollback. File-level persistent state, WAL log merging, and even the visibility of the snapshots

table across different connections — none of that was tested. To put it bluntly, we tested the “logic” but never tested the “storage.”

I decided to bring in snapshot testing, but instead of using text-based snapshots, I would treat the SQLite database file itself as an immutable artifact.

Comparison of approaches:

tmp_path

manual comparisonThe architectural idea: provide a snapshot_db

fixture via conftest.py

that:

tests/snapshots/memory_test.sqlite

) exists before the test starts.--snapshot-update

flag) and the test passes immediately.With this approach, our tests truly simulate a “cross-process, cross-connection” persistence effect — each test case receives an independent copy of a database file, performs its operations, and then the entire file state is compared against the expected outcome.

This code clarifies what we intend to test. MemoryManager

wraps the SQLite connection, snapshot saving, and rollback — a simplified version of what we use in production.

import sqlite3
import uuid
from datetime import datetime, timezone

class MemoryManager:
    def __init__(self, db_path: str):
        self.db_path = db_path
        self._init_tables()

    def _get_conn(self) -> sqlite3.Connection:
        conn = sqlite3.connect(self.db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.row_factory = sqlite3.Row
        return conn

    def _init_tables(self):
        with self._get_conn() as conn:
            conn.executescript("""
                CREATE TABLE IF NOT EXISTS conversations (
                    session_id TEXT NOT NULL,
                    turn INTEGER NOT NULL,
                    role TEXT NOT NULL,
                    content TEXT NOT NULL,
                    PRIMARY KEY (session_id, turn)
                );
                CREATE TABLE IF NOT EXISTS snapshots (
                    snapshot_id TEXT PRIMARY KEY,
                    session_id TEXT NOT NULL,
                    created_at TEXT NOT NULL,
                    state_json TEXT NOT NULL
                );
            """)

    def add_message(self, session_id: str, role: str, content: str):
        with self._get_conn() as conn:
            turn = conn.execute(
                "SELECT COALESCE(MAX(turn), 0) + 1 FROM conversations WHERE session_id = ?",

source & further reading

dev.to — original article Vibe Coding: Endgame How I Let a Cloud AI Operate My Home Without Handing Over My Home Technical Debt in Property Management: Why Your Old System Blocks AI

~/api · this article 200

$curl api.wpnews.pro/v1/news/llm-memory-system-pitfal…

Read original on dev.to → dev.to/_eb7f2a654e97a60ae9f96e/llm-memory-system…

mentioned entities

SQLite

Pytest

metadata

slugllm-memory-system-pitfalls-a-3-hour-bug-hunt-solved-with-pytest-snapshot-testing

topic#large-language-models

secondary3 topics

sentimentnegative

canonicaldev.to

navigation

← prevI vibe coded a world cup cheer g…

next →Show HN: A PDF analysis tool for…

── more in #large-language-models 4 stories · sorted by recency

byteiota.com · 28 Jul · #large-language-models

OpenClaw v2026.7.2: Remote Sessions and Crash Recovery

promptcube3.com · 28 Jul · #large-language-models

Agentic AI Deployment: My Team's Transition to LLM Agents

dev.to · 28 Jul · #large-language-models

AI Agents Are Not Magic. They Are Just Good Feedback Loops

dev.to · 28 Jul · #large-language-models

Vibe Coding: Endgame

── more on @sqlite 3 stories trending now

wpnews · 26 Jul · #artificial-intelligence

Nobel laureate Simon Johnson on the AI race and China’s ‘over-automation’ problem

wpnews · 26 Jul · #artificial-intelligence

China’s Moonshot, Z.AI, and DeepSeek are challenging U.S. AI labs—and beating them on cost

wpnews · 26 Jul · #ai-safety

University of Washington study reveals prompt injection risks lurking in AI agent memory

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required