The Four Types of Memory for AI Agents (and How Claude Code Implements Each)

A Princeton team's CoALA framework defines four types of memory for AI agents—working, semantic, procedural, and episodic—and Claude Code implements each as plain files and commands rather than a vector database, with episodic memory consolidation being the most advanced frontier.

TLDR

- Memory is what separates a chatbot from an agent. A chatbot answers. An agent answers shaped by what it knows about your project and what it learned last time.
- There’s a clean framework for this from a Princeton team called CoALA, and it splits agent memory into four types: working, semantic, procedural, episodic.
- This post defines all four in general, then shows how Claude Code implements each one as something you can point at on disk.
- Claude Code maps every type to plain files and commands, not a vector database. The most interesting frontier is episodic memory and consolidation, the thing Anthropic calls “dreaming.”
- Match the memory to the job. A thermostat needs one type. A coding agent wants all four.

Why I care about agent memory

I don’t write much code by hand anymore. I delegate to a coding agent and review the output. And the thing that decides whether that delegation is great or painful isn’t the model’s raw IQ, it’s what the agent remembers.

Two failure modes I hit constantly:

- I explain a project gotcha, we fix it together, and the next session the agent makes the exact same mistake. Yesterday’s context is gone.
- I dump everything I know into one giant file so it can’t forget, and now the agent is drowning in noise and misses the thing that actually matters.

Both are memory problems. And “memory” turns out not to be one thing. It’s four.

I picked this framing up from an IBM Technology video, The Four Types of Memory Every AI Agent Needs (https://www.youtube.com/watch?v=BacJ6sEhqMo), which is the plain-language on-ramp. The rigorous version is the CoALA framework (https://arxiv.org/abs/2309.02427), Cognitive Architectures for Language Agents, out of Princeton. Both land on the same four buckets and the same human-memory analogy.

I’m going to define each type, then show how Claude Code implements all four, because seeing them as concrete files on disk is where it clicks.

The human version first

The framework lifts straight from how we remember things, so start there:

- Short-term or working memory: what’s active in your head right now. The sentence you’re reading. Volatile, small.
- Semantic memory: factual knowledge. “Python is interpreted.” You just know it, no lookup.
- Procedural memory: learned skills. Riding a bike, driving a car. You don’t re-derive it each time.
- Episodic memory: personal experience. The time I spent three hours debugging a Kubernetes cluster before realizing I was pointing at the wrong cluster the whole time.

Well-designed agents need the same four, and CoALA gives them the same names. The parallel does real work later in the post: humans forget effortlessly and usefully, and consolidate memory in sleep. Agents can’t do either yet, which is exactly what the “dreaming” features are trying to fix.

The four types, defined

Working memory

Working memory is the agent’s active scratchpad for the current decision cycle. CoALA puts it precisely: it “maintains active and readily available information as symbolic variables for the current decision cycle.” In practice it’s everything the model can see right now: the running conversation, the system prompt, tool and command outputs, and any files or memory loaded into the prompt.

The analogy everyone uses is RAM. Fast, immediately accessible, but volatile: when the session ends, it’s gone. And it’s bounded by the context window. Context windows are huge now, a million tokens or more, but “huge” isn’t “infinite.” Stuff too much in there and quality drops as the model loses track of things buried in the middle.

The key thing: working memory is the bridge every other memory passes through. The other three types are stores on disk. Working memory is what they get loaded into.

Every agent has working memory. So does every chatbot: it’s just the context window. So working memory alone doesn’t make something an agent. The question is what else you add.

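A minimal sketch of that bridge idea, in Python. This is not how Claude Code or any particular agent assembles its context; `MemoryItem`, `estimate_tokens`, and `build_working_memory` are hypothetical names, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer. It only illustrates the shape: long-term stores sit on disk, and a bounded budget decides what gets loaded into working memory this turn.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    source: str    # illustrative label: "semantic", "procedural", or "episodic"
    text: str      # the content that would be placed in the prompt
    priority: int  # lower number = loaded first


def estimate_tokens(text: str) -> int:
    """Assumption: roughly 4 characters per token. Swap in a real tokenizer."""
    return max(1, len(text) // 4)


def build_working_memory(
    items: list[MemoryItem], budget: int
) -> tuple[list[MemoryItem], list[MemoryItem]]:
    """Load items in priority order while they fit in the token budget.

    Returns (loaded, skipped). Skipped items stay in their store and can be
    loaded on a later turn if they become relevant.
    """
    loaded: list[MemoryItem] = []
    skipped: list[MemoryItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            loaded.append(item)
            used += cost
        else:
            skipped.append(item)
    return loaded, skipped


stores = [
    MemoryItem("semantic", "Use pnpm, not npm. Tests live in tests/.", priority=0),
    MemoryItem("procedural", "code-review: run the review checklist", priority=1),
    MemoryItem("episodic", "x" * 4000, priority=2),  # an oversized note
]

loaded, skipped = build_working_memory(stores, budget=50)
print("loaded: ", [item.source for item in loaded])
print("skipped:", [item.source for item in skipped])
```

With a 50-token budget the two short items fit and the oversized episodic note stays on disk, which is the trade-off the context window forces on every agent.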
Note: Working memory is the one most people already understand, so I won’t belabor it. If you want the deep version (why your context is “just an array of tokens” and how to keep noise out of it), I wrote that up in Stop Bloating Your CLAUDE.md (/posts/stop-bloating-your-claude-md-progressive-disclosure-ai-coding-tools/).

Semantic memory

Semantic memory is the agent’s stable knowledge about the world and itself. CoALA defines it tersely: it “stores an agent’s knowledge about the world and itself.” These are the general facts, rules, conventions, and reference material that are true independent of any single interaction.

The academic version talks about vector databases and knowledge graphs. RAG counts too. But in a lot of production agentic systems today it’s something much simpler: Markdown files. Project architecture, coding conventions, build commands, which frameworks to use, and just as important, what not to do. Loaded into working memory up front, every session. Without it, the agent has no persistent knowledge to draw from, so it’s doomed to relearn your project every single time.

Procedural memory

Procedural memory is the agent’s knowledge of how to do things. Not facts. Steps.

The paper splits this in two. CoALA is explicit: “Language agents contain two forms of procedural memory: implicit knowledge stored in the LLM weights, and explicit knowledge written in the agent’s code.” You can’t really edit the implicit half. Fine-tuning is the only lever and it’s expensive. The explicit half is the one you control, the agent’s own code and the workflows you author.

The paper is blunt about the risk here: writing to procedural memory is the dangerous kind of learning.
A new fact or note is harmless. A new step changes how the agent behaves. CoALA calls it “significantly riskier than writing to episodic or semantic memory, as it can easily introduce bugs or allow an agent to subvert its designers’ intentions.” So I author and review skills by hand. I don’t let the agent rewrite its own procedures on its own.

Episodic memory

Episodic memory is the agent’s record of specific past experiences. CoALA says it “stores experience from earlier decision cycles.” The paper gives examples: training input-output pairs, history event flows, game trajectories from previous episodes, or other representations of the agent’s experiences. Unlike semantic memory (general facts), episodic memory is tied to particular events the agent lived through, retrieved to inform later behavior.

The naive version is to save every conversation transcript and grep it later. That technically counts, but it’s useless. Nobody wants to search a 45-minute debugging log to find the one line that mattered.

What good systems do is distill. Instead of a full transcript you get:

“Last time we debugged the auth module, the bug was in the middleware layer.”

That’s compressed experience, and it’s where memory starts to look like genuine learning.

Episodic is the hardest type to get fully right, and the hard part isn’t storing. It’s forgetting. What do you delete? When does a note go stale? Humans forget effortlessly, and annoying as that is, it’s useful. For an agent, forgetting is an engineering problem nobody has fully solved. The paper says the same: “modifying and deleting (a case of ‘unlearning’) are understudied in recent language agents.” Keep that line in mind. It’s the gap Claude Code’s newest features are reaching toward.

How Claude Code implements all four

To prove this framework isn’t theoretical, here’s Claude Code implementing all four buckets.
Claude Code (https://code.claude.com/docs/en/memory) is the cleanest tool to learn on, because each memory type is something you can point at on disk: a file, a directory, or a command.

Working: the context window you watch fill up. Run /context to see what’s eating space. /clear wipes it entirely. If you run long, /compact summarizes the history to buy room (or the automatic version kicks in near the ceiling). The docs say it plainly: the context window holds “conversation history, file contents, command outputs, CLAUDE.md, auto memory, loaded skills, and system instructions.”

Semantic: CLAUDE.md. It’s a hierarchy: a user-level ~/.claude/CLAUDE.md for your defaults, a project-level ./CLAUDE.md checked into the repo for the team. Both load every session. You keep it lean with @-imports (See @README.md ... Git workflow: @docs/git-instructions.md) instead of pasting everything inline. There’s no enforced token limit, but shorter is better for adherence. I go deep on the CLAUDE.md-vs-skills split in my Claude Code customization guide (/posts/claude-code-customization-guide-claudemd-skills-subagents/).

Procedural: skills, a folder with a SKILL.md (YAML frontmatter holds a description of what it does and when to use it, the body holds the steps). The clever bit is progressive disclosure. The agent only sees a lightweight index, name and description, about a hundred tokens each. When a task matches, it loads the full instructions. Linked files and scripts load only when the steps call for them. That’s how you give an agent fifty capabilities without paying for fifty on every turn.
    .claude/skills/
      code-review/
        SKILL.md        ← name + description always visible (~100 tokens)
        checklist.md    ← loaded only when the skill runs
        review.py       ← pulled in only if the steps call for it

Episodic: Auto memory (on by default, needs Claude Code v2.1.59+). The agent writes distilled notes for itself to a memory directory you own, “build commands, debugging insights, architecture notes, code style preferences, and workflow habits,” and “intelligently decides what information is useful to remember for future conversations.” That’s distillation, not transcript-dumping. The first 200 lines or 25KB load at session start as an index. Topic files load on demand. Run /memory to browse and prune by hand.

Claude Code’s distinctive choice is minimal-by-default: file hierarchy plus imports for semantic, progressive disclosure for procedural, passive distillation loaded as an index for episodic. And it’s pushing the forgetting frontier with Dreams (Managed Agents API, research preview, billed per token): an on-demand asynchronous job you trigger via the API reads the memory store plus the session transcripts you supply, merges duplicates, replaces stale entries, and writes a new store. It never touches the input, so you review the output and keep it or throw it away. That’s the prune-by-hand step, done by the model, and it’s an on-demand job you invoke, not a schedule that fires on its own.

Claude Code’s four types at a glance

| Memory type | Where it lives in Claude Code | Manage it with |
|---|---|---|
| Working | The context window | /context, /clear, /compact |
| Semantic | CLAUDE.md hierarchy (user + project + @-imports) | Edit the files |
| Procedural | Skills (SKILL.md + progressive disclosure) | Author by hand in .claude/skills/ |
| Episodic | Auto memory (distilled notes, index-loaded) + Dreams consolidation | /memory |

The pattern is the point: every type maps to plain files and commands, not a vector database.
Semantic and procedural are Markdown you write; episodic is Markdown the agent writes for itself. And the newest frontier is consolidation, the “dreaming” layer that automates forgetting, an on-demand Dreams job that merges duplicates and replaces stale entries so you don’t have to prune by hand.

I haven’t stress-tested the Dreams consolidation in production long enough to say whether the output is trustworthy without review. Check the results before you let it run unsupervised.

Episodic memory you throw away

Auto memory is permanent, carried across every session. But episodic memory doesn’t have to live that long. The most useful version I run day to day is scoped to a single task and thrown away when the task is done. Same memory type, deliberately short lifetime.

The approach is simple: tell the agent to keep a progress file for this one job. Peter Steinberger (https://x.com/steipete/status/2065357277880877413) posted a refactor prompt that does exactly this. The gist:

“Goal: refactor until you’re happy with the architecture. Live-test after each significant step, then auto-review and commit. Track progress in /tmp/refactor-{projectname}.md.”

That /tmp path is the whole idea. The file is the episodic memory for this task. After each step the agent appends what it changed, what passed, and what’s left. If the context compacts or you restart, the agent reads the file back and knows exactly where it was. And because it lives in /tmp, the memory evaporates soon after the job’s done: the OS clears that directory on its own schedule. No permanent residue, no stale notes to prune.

This same pattern is the engine behind the Ralph loop, Geoffrey Huntley’s technique (https://ghuntley.com/ralph/) for long-running agents. You run a coding agent in an infinite shell loop: same prompt every iteration, fresh context window every time. The agent has zero conversation memory between loops, so the file system has to be its memory. Two files carry the state:

- fix_plan.md: a prioritized todo list. Each iteration the agent picks one item, implements it, updates the plan, and commits. The instruction is blunt: “ALWAYS KEEP @fix_plan.md up to date with your learnings… especially after wrapping up your turn.”
- AGENT.md: operational learnings, like how to run the build or the examples, kept brief.

Every loop reads those files into the new context, does work, writes them back, and commits. As one writeup puts it, “the agent is no longer amnesiac between runs because the progress file is the memory.” Git history becomes a second episodic record on top of that. Every commit is a timestamped “here’s what I tried.” It costs almost nothing across iterations: the learnings sit on disk, only the relevant slice loads next turn.

Not every agent needs all four

This is the part that made the framework click for me. You don’t bolt on all four types because they exist. Add the ones the job needs.

| Agent | Working | Semantic | Procedural | Episodic |
|---|---|---|---|---|
| Simple reflex (thermostat, routing bot) | ✅ | | | |
| Narrow support agent (password reset) | ✅ | | ✅ | |
| Coding agent | ✅ | ✅ | ✅ | ✅ |

A simple reflex agent (a thermostat, a basic router) needs working memory and basically nothing else. A narrow support agent that resets passwords needs working memory plus the procedural workflow, and that’s about it. A coding agent is the one case that wants all four: a context window to work in, semantic knowledge of your project, a skill system for repeatable workflows, and episodic memory so it stops repeating mistakes.

The paper runs the same exercise on real agents. ReAct, the reason-then-act agent, “lacks semantic or episodic memory and has no retrieval or learning actions.” Working memory plus reasoning, nothing else. The agents that use all four are the ambitious ones: Voyager writes its own code skills to play Minecraft, and Generative Agents runs a whole sandbox simulation of human-like characters.
Both wire up working, semantic, procedural, and episodic memory. Same conclusion: the richer the job, the more memory types it needs.

The two middle rows in the table are my own illustrative extrapolation. The ReAct-vs-Voyager contrast is straight from CoALA’s Table 2.

The paper’s own case studies are the cleanest examples of skipping types:

- Tree of Thoughts solves puzzles like the game of 24 and crosswords. Working memory plus reasoning, and “no long-term memory” at all. A pure problem-solver doesn’t need facts about your world or a record of last time. It just thinks.
- SayCan is a robot fetching a snack in a kitchen. Its long-term memory is procedural only, a fixed library of 551 skills, with “no internal actions of reasoning, retrieval, or learning.” No semantic facts, no episodic history. It scores the available skills and runs the best one.

The same shapes show up in everyday systems:

- A docs Q&A bot (RAG over your handbook): working plus semantic. It needs the knowledge base, not authored skills or any memory of you between questions.
- A password-reset support bot: working plus procedural. One fixed workflow, no project facts, no past tickets.

The coding agent is the rare one that genuinely wants all four. Most agents don’t, and forcing the extra types on them just adds cost and noise.

Wrapping up

- Agent memory isn’t one thing. It’s four: working (the context window), semantic (stable facts), procedural (how-to skills), episodic (learned experience).
- Claude Code implements all four, and maps each to something concrete: the context window, CLAUDE.md, skills, and auto memory.
- The interesting work is in episodic and consolidation: distilled auto-memory plus Dreams, the on-demand job that merges duplicates and prunes stale notes. Automated forgetting is the most convincing answer to the unsolved problem I’ve seen.
- Match the memory to the job. A thermostat needs one type. A coding agent needs all four.

Memory is what lets an agent give you a response shaped by your project, your preferences, and the mistakes it already made, so it doesn’t make them again. Which of the four are you actually using in your own setup? Most people are running on working and semantic and leaving procedural and episodic on the table.