{"slug": "your-agent-has-a-memory-that-runs-while-you-sleep", "title": "Your Agent Has a Memory That Runs While You Sleep", "summary": "A developer built a continuous AI agent memory system called `akm improve` that runs autonomously on local hardware, processing 14,189 memories across 48 scheduled runs in 24 hours with zero failures. The system operates on consumer-grade GPUs like the RTX 4060 Ti, completing consolidation passes in under 5 minutes without cloud APIs or per-token costs. The pipeline promoted 1,361 memories to persistent storage, merged 49 duplicate entries, and surfaced 211 contradictions for review.", "body_md": "This post is part of the akm-knowledge series. [Part ten](https://dev.to/itlackey/the-improvement-loop-how-akm-keeps-your-agent-sharp-2d4d) introduced the improve pipeline: what each phase does and how to schedule it. This post goes deeper on what continuous operation looks like in practice: the hardware numbers, the reliability bugs we hit at 48 runs per day, and the observability layer we built to keep watch.\n\nMost people think of AI agent memory as something that happens during a session. You talk to your agent, it learns things, maybe you save a few notes, the session ends. The next session starts cold.\n\n`akm improve` is built around a different model: a continuous background process that runs on your own hardware, against local models, and quietly curates your agent's knowledge base while you work on other things. No cloud API required. No per-token billing for the maintenance pass. A GPU you already own, a model you already have downloaded, running on a schedule.\n\nThis post covers what 24 hours of autonomous operation actually looks like, how consumer-grade GPUs handle the load, the reliability work that makes continuous operation viable, and the observability layer that lets you know it's working without watching logs.\n\n`akm improve` is a multi-phase pipeline. 
The core pass, consolidation, loads your memory pool, groups related memories into chunks, sends each chunk to a local LLM for a consolidation plan (merge similar memories, promote high-signal ones to your stash, delete redundant ones, surface contradictions), and then executes those plans. After consolidation, memory inference runs a lightweight factual extraction pass, and graph extraction updates the entity-relation index.\n\nThe pipeline is scheduled to run automatically. Here is what one 24-hour window produced:\n\n| Metric | Value |\n|---|---|\n| Runs completed | 48 / 48 (zero failures) |\n| Memories processed | 14,189 |\n| Promoted to stash | 1,361 |\n| Merged (deduplication) | 49 (64 secondaries absorbed) |\n| Contradictions surfaced | 211 |\n| Deleted (redundant) | 31 |\n| Memory inference yield | 69.3% (115 new atomic facts written) |\n| Graph entities extracted | 181 across 9 files |\n| Task fail rate | 0% |\n| Index entries | 7,398 (all embedded, status `ready-vec`) |\n\nEvery run that completes leaves your stash in better shape than before it started. Memories that accumulated across dozens of agent sessions get compressed, merged, and organized without manual intervention. The 1,361 promotions in this window represent memories that were considered significant enough by the local LLM to persist as named stash entries. The 49 merges collapsed near-duplicate content. The 211 contradictions were flagged for review rather than silently overwritten.\n\nThis is the loop. It runs every 30 minutes. You don't have to think about it.\n\nThe consolidation LLM in this setup is `qwen3.5-9b` (or similar) running locally via LM Studio on an OpenAI-compatible endpoint. The model fits comfortably on most modern gaming GPUs. No API key. No per-call cost. 
The inference happens on hardware sitting on your desk.\n\nWe run two LM Studio servers, both serving the same model via OpenAI-compatible endpoints, and benchmarked them head to head.\n\n**Shredder** is a desktop with an RTX 5090. **Splinter** runs an RTX 4060 Ti, a card that launched at $399 and is common in mid-range gaming builds. Same model weights. Same chunk sizes. Different VRAM bandwidth and tensor core counts.\n\n| | RTX 5090 (Shredder) | RTX 4060 Ti (Splinter) |\n|---|---|---|\n| Per-chunk latency | ~6.8s | ~22.6s |\n| 13-chunk consolidation | ~87s (~1.5 min) | ~290s (~4.8 min) |\n| Speed ratio | 1× (baseline) | 3.3× slower |\n| Runs per hour (fits schedule) | ✅ yes | ✅ yes |\n| Approximate street price | ~$2,000 | ~$300 |\n\nThe 5090 is faster, but the 4060 Ti finishes a full consolidation pass in under 5 minutes, well inside the 30-minute run window. Both cards sustain 48 runs per day without missing a cycle.\n\nWhere the gap shows up is in the tail. Because the 24h window included runs on both backends, the aggregate latency numbers reflect both:\n\n| Phase | Median | P95 | What drives the P95 |\n|---|---|---|---|\n| Total (end-to-end) | 7.2 min | 23.4 min | Splinter-routed consolidation runs |\n| Consolidation | 1.4 min | 5.8 min | Chunk count variance + Splinter |\n| Memory inference | 5.9s | 25.3s | Fresh (non-cached) inference attempts |\n| Graph extraction | < 1s | 53s | Cache misses on modified files |\n\nThe median of 7.2 minutes reflects the majority of runs going to Shredder. The P95 of 23.4 minutes is almost entirely Splinter runs with larger chunk windows. A setup running exclusively on a 4060 Ti would see a flatter distribution (median around 10–12 minutes, P95 around 18–20 minutes) with no 5090 runs pulling the median down.\n\nFor most setups, a single mid-range GPU is the right starting point. The consolidation pass is CPU-light and network-light; the bottleneck is token generation throughput on the GPU. 
If you have a second machine with a GPU and spare VRAM, you can point a second LM Studio server at it and split load exactly as we did here.\n\nThe embeddings server is separate (`nomic-embed-text-v1.5` running on localhost) and handles the semantic search index. It stays warm between runs, so re-embedding after promotions adds negligible latency. Any GPU with 4GB+ of VRAM can host it alongside the consolidation model if you have the headroom, or it runs on CPU at acceptable speed for indexing workloads.\n\nThe concern with local models is usually quality: will a 9B parameter model running on a gaming GPU produce consolidation plans good enough to trust with your knowledge base?\n\nThe answer, based on 48 runs and 14,189 memories, is yes, with the right constraints.\n\nThe consolidation prompt is designed to be conservative. The LLM is asked to identify candidates for merge, promote, or delete within a bounded chunk of related memories. It is not given unbounded latitude. Plans are validated against the loaded memory pool before execution: if the model invents a ref that doesn't exist, the op is dropped with a warning. If a promoted memory fails schema validation, it is rejected.\n\nThe 69.3% yield rate on memory inference tells the same story. Out of 166 fresh attempts at factual extraction, 115 produced usable atomic facts. The model is making useful inferences at a rate that justifies running it continuously.\n\nThe practical limit of local 9B models shows up in graph extraction: 2 truncations in the 24h window indicate chunks that exceeded the model's context window. These produce partial rather than failed extractions; the model handles what it can see. Larger models extend this ceiling; a 5090 can hold larger quantizations in VRAM.\n\nRunning 48 times a day means reliability issues that would be minor in a manual workflow become systemic. 
Two bugs were affecting the consolidation pass and wasting inference on every affected run.\n\n**The stale database problem.** After a run that deleted files, the database retained entries pointing to files that no longer existed on disk. The next run loaded those ghost entries, the LLM generated merge plans against them, and Phase B failed silently when the file wasn't found. Every affected secondary in those plans was charged a wasted inference call.\n\nThe fix is a pre-flight filter that runs before the LLM sees anything:\n\n```js\nmemories = memories.filter((m) => fs.existsSync(m.filePath));\n```\n\nStale entries never reach the model. A warning is logged so the count is visible in health output if the filter ever catches something:\n\n```\nPre-flight: filtered 3 stale DB entries (file absent on disk) from memory pool before chunking.\n```\n\n**The hallucination problem.** On certain chunk compositions, particularly when session checkpoint memories and named sessions appear in the same window, the local model would blend naming conventions and produce a merge plan with a primary ref that didn't exist in the pool.\n\nA typical example: `memory:opencode-session-20260529-a1b2` and `memory:checkpoint-20260529T214550` in the same chunk produce a hallucinated primary of `memory:opencode-session-20260529T214550-ses_18a4`. The plan looks reasonable at the chunk level. The ref doesn't exist at the pool level.\n\nBefore the fix, that hallucinated primary would reach Phase B and charge every real secondary (typically 4–8 refs) with a failed merge skip. After the fix, `mergePlans()` validates every primary ref against the loaded pool before execution:\n\n```js\nconst knownRefs = new Set(memories.map((m) => `memory:${m.name}`));\nconst { ops: allOps, warnings: mergeWarnings } = mergePlans(chunkOpsArrays, knownRefs);\n```\n\nReal merge plans proceed. Hallucinated roots are dropped. 
The warning is distinguishable from the stale-DB warning, so health metrics can tell the two apart:\n\n```\nmergePlans: primary memory:... not in loaded memory pool (LLM hallucination) — dropping op before execution.\n```\n\nBoth fixes eliminate wasted inference. On the 4060 Ti at 22.6s per chunk, a single hallucinated primary that would have charged 6 secondaries saves over 2 minutes of inference time per occurrence, time that can go toward real consolidation work instead.\n\nRunning autonomously in the background only helps if you know when something goes wrong. `akm health` provides a structured view of recent improve activity:\n\n```\nakm health --since 4h\nakm health --since 24h --format text\n```\n\nIt surfaces run counts, skip reason breakdowns, consolidation outcomes, memory inference yield, and phase latencies in a single command. The same JSON output feeds automation.\n\nFor continuous monitoring, we built a cron task that posts a rolling 4-hour health report to Discord every hour:\n\n```\n# ~/akm/tasks/akm-health-report.yml\nschedule: 0 * * * *\ncommand: akm env run fwdslsh -- bash ~/akm/scripts/akm-health-discord.sh\nenabled: true\n```\n\nThe script calls `akm health --since 4h` and `--since 8h`, computes deltas for trend context, and posts a Discord embed. Register the task with:\n\n```\nakm tasks sync   # register the cron\n```\n\nThe embed has three inline fields: Output (promoted, merged, MI yield), Failures (chunk failures, skip reason anomalies), and Latency (median, P95, previous-window comparison), plus a Needs Attention section that only appears when something is actually off. The footer includes the hostname and timestamp so reports from multiple machines are distinguishable at a glance.\n\nThe result: a health check fires every 30 minutes from the pipeline, and a visibility report fires every hour to Discord. 
You see degradation before it accumulates.\n\nHere is what autonomous local-model memory curation looks like across a full day:\n\nThe hardware requirement to run this continuously is a mid-range gaming GPU. The model requirement is a 7–9B parameter instruction-tuned model quantized to 4–8 bits. Both are things a lot of developers already have.\n\nThe value is in what compounds. Each run makes the stash slightly more accurate, slightly more consolidated, slightly more consistent. After 48 runs, 14,000 memories have been through a curation pass that would have taken hours to do manually. After a week, the stash is a different kind of asset: not a pile of notes, but a continuously maintained knowledge base that your agent can rely on across sessions.\n\n`akm improve` is part of akm 0.8.x. The full pipeline configuration and local model setup docs are in the [configuration reference](https://github.com/itlackey/akm/blob/main/docs/configuration.md). Hardware requirements and LM Studio setup are covered in the [getting started guide](https://github.com/itlackey/akm/blob/main/docs/getting-started.md).", "url": "https://wpnews.pro/news/your-agent-has-a-memory-that-runs-while-you-sleep", "canonical_source": "https://dev.to/itlackey/your-agent-has-a-memory-that-runs-while-you-sleep-1j76", "published_at": "2026-06-04 00:31:26+00:00", "updated_at": "2026-06-04 00:42:25.146464+00:00", "lang": "en", "topics": ["ai-agents", "ai-infrastructure", "ai-tools", "machine-learning", "mlops"], "entities": ["akm-knowledge", "akm improve", "GPU", "LLM"], "alternates": {"html": "https://wpnews.pro/news/your-agent-has-a-memory-that-runs-while-you-sleep", "markdown": "https://wpnews.pro/news/your-agent-has-a-memory-that-runs-while-you-sleep.md", "text": "https://wpnews.pro/news/your-agent-has-a-memory-that-runs-while-you-sleep.txt", "jsonld": "https://wpnews.pro/news/your-agent-has-a-memory-that-runs-while-you-sleep.jsonld"}}