How others build agent memory, and what I took from each

Falconer engineer Ben introduced agent signals to solve the problem of AI agents starting every conversation from zero, unable to remember a user's role, preferences, or communication style. The system automatically extracts durable self-statements from user conversations every six hours and allows manual signal entry, then injects all signals into the system prompt at the start of each conversation. This approach ensures agents answer questions with appropriate context, format responses according to user preferences, and avoid disliked stylistic choices without requiring repeated corrections.

Starting from zero, every time

An engineer at Falconer asks our agent what’s safe to change to ship a new payments retry path. They’ve owned this code for two years. The agent answers like they’re new to the codebase. It explains what the orchestration layer is, where the idempotency keys live, the basics of the retry queue. None of that was useful.

A different user prefers tight bullet lists for meeting summaries. The agent returns three paragraphs of flowing prose. Another user hates em dashes in writing. Every draft the agent produces is laced with them.

None of this is a knowledge problem. The agent can look things up. It’s an identity and style problem the agent has no way to solve from a single conversation. By the time the user has corrected the tone or the framing three or four times, the conversation ends and the next one starts cold. Every interaction starts at zero.

That’s the failure mode agent signals was built to fix.

An agent signal is a short, durable, self-stated thing about the user. Their role. Their team. What they own. How they prefer to communicate. How they like things written. The signals get pulled into the system prompt at the start of every agent conversation, so the agent can write with that context baked in.

Same questions, different answers. The codebase question gets an answer that assumes you know the codebase. The meeting summary comes back as bullets. The draft doesn’t have em dashes.

That’s the whole feature. The substance is everything underneath: how signals get in, what counts as a signal in the first place, and what to do when there are too many to fit. The first two are about prompt design and ingestion plumbing. The last one is where the design choices get interesting, because it turns out every production AI memory system has solved it differently, and the differences matter.

Two write paths, one read path

There are two write paths for agent signals in Falconer, and one read path that gathers them up at conversation start.

Automatic extraction

Every six hours, a background job scans every conversation flagged for extraction. For each user, it loads the user’s messages from those conversations (assistant turns and tool output are ignored), passes them to a deliberately restrictive LLM prompt, and gets back structured actions: create a new signal, update an existing one, or skip. Returning zero signals from a conversation is the expected outcome, not the exception.

The prompt aggressively rejects anything that isn’t a durable self-statement. Task context like “currently debugging payments” gets skipped because it varies across conversations. Org-level facts like “Falconer uses Postgres” get skipped because they apply to everyone in the org. Neither is a signal about who the user is.

Manual entry

Users can add or edit signals directly in a settings page. This path matters more than it sounds. The user is the ground truth about themselves; the extraction LLM is a guess. If the system ever silently overwrites or contradicts something a user typed in, users stop trusting what’s stored about them. So manual signals get treated as sacred. Foreshadowing the tiering design later: they live in their own bucket and the system promises never to consolidate or drop them.

The read path

At the start of every agent conversation, I pull all of the user’s signals and render them as a bullet block under About this user in the system prompt. No retrieval step. No semantic search. The full list goes in.

That last part deserves explaining, because the obvious instinct is “this should be RAG.” It shouldn’t. Tens of signals per user, not thousands. The math says inject everything. The model is better at deciding which signals are relevant to the current turn than any retrieval scheme I’d build.
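The read path described above is small enough to sketch in a few lines. This is only an illustration under stated assumptions: the `Signal` class, `render_about_user`, and `build_system_prompt` are hypothetical names, not Falconer's actual code, and the base prompt is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One short, durable, self-stated fact about a user (hypothetical shape)."""
    content: str


def render_about_user(signals: list[Signal]) -> str:
    """Render every signal as a bullet block for the system prompt.

    There is no retrieval or ranking step: with tens of signals per user,
    the whole list goes in and the model decides what is relevant.
    """
    if not signals:
        return ""
    bullets = "\n".join(f"- {s.content}" for s in signals)
    return f"About this user\n{bullets}"


def build_system_prompt(base_prompt: str, signals: list[Signal]) -> str:
    """Append the signal block to a base prompt, skipping it when empty."""
    block = render_about_user(signals)
    return f"{base_prompt}\n\n{block}" if block else base_prompt


signals = [
    Signal("Owns the payments retry path"),
    Signal("Prefers tight bullet lists for meeting summaries"),
    Signal("Dislikes em dashes in writing"),
]
prompt = build_system_prompt("You are a helpful agent.", signals)
print(prompt)
```

The point of the sketch is what is missing: no embedding lookup, no relevance score, just string formatting over the full list.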
This was the first design decision I made, and the reason I made it was that ChatGPT and Claude Code had both arrived at the same conclusion, independently, at much larger scale. That’s what got me to read everything I could find about how they actually work.

What others have figured out

Three production AI memory systems mattered most to the design: OpenAI’s ChatGPT memory, Anthropic’s Claude Code memory, and the MemGPT / Letta core-memory architecture. Each has a different shape, and each makes a different bet about what’s worth solving in engineering and what’s worth handing to the LLM.

ChatGPT memory

OpenAI’s memory feature stores explicit memories as flat, timestamped one-liners. No categories. No tags. On top of that, ChatGPT periodically generates “User Knowledge Memories”, AI-summarized dense paragraphs about the user that get regenerated when raw memory grows beyond some size.

The AI-summarized dossier gets injected into every conversation, alongside any explicit one-liner memories the user has saved. Neither layer relies on retrieval.

The most useful reverse-engineering I read on this was Simon Willison’s “I really don’t like ChatGPT’s new memory dossier” (https://simonwillison.net/2025/May/21/chatgpt-new-memory/), which pulls apart the actual injected prompt and shows what the dossier looks like at the token level. Shlok Khemani’s “ChatGPT Memory and the Bitter Lesson” (https://www.shloked.com/writing/chatgpt-memory-bitter-lesson) is the better piece on why OpenAI built it this way. The argument is that clever retrieval consistently loses to brute-force injection plus periodic re-summarization, and ChatGPT’s design is a bet on that asymmetry.

Deduplication in ChatGPT isn’t an engineering problem at all. It’s solved by the periodic summarization rewriting the dossier from scratch. If a user has five variants of “prefers concise writing” in their raw memory, the next summary regen collapses them. No similarity threshold, no embedding distance, no merge logic.
The LLM does the work that infrastructure would do in a more traditional system.

What I took: flat-string storage with no categories, full injection instead of retrieval, and the bias toward making the LLM do dedup-flavored work instead of writing similarity code.

Claude Code memory

Claude Code’s memory model looks superficially like ChatGPT’s. Plain text, no database, no vectors. But the structure is meaningfully different and the difference is what made it useful as a reference. There are two separate memory systems sitting side by side, with different writers, different loading strategies, and different lifecycles.

The first is CLAUDE.md files. These are human-authored. They live in the filesystem at four different scopes: a managed system-wide policy file, a per-user file in ~/.claude/, a per-project file in ./CLAUDE.md, and a local override file that’s gitignored. Each scope determines load order. Managed loads first, then user, then project, then local, with later scopes overriding earlier. The scopes also determine version-control behavior. The project file is committed, the local file is not. Every CLAUDE.md file in scope is loaded in full at the start of every session. No index, no on-demand reads.

The second is auto memory. Auto memory is agent-authored. Claude writes notes for itself in ~/.claude/projects/<project>/memory/ as it works. These can be anything: debugging notes, API conventions, gotchas, whatever it decides to remember. There’s a MEMORY.md file that acts as an index of the memory directory, and the first 200 lines (about 25 KB) of the index load at session start. Topic files like debugging.md or patterns.md are not loaded at startup. Claude reads them on demand when the index hints they’re relevant.

The framing that mattered for me, looking at both systems together, was less about labels or taxonomies and more about who writes the memory and how it gets loaded. Human-authored memory is small and stable, so loading it in full is cheap.
Agent-authored memory grows unboundedly, so it needs an index plus on-demand reads or it eats the context window. The split fell out of the writer, not from any imposed category system.

For Falconer’s design, that distinction reframed something I’d been collapsing. Manual signals are like CLAUDE.md: a human typed them, the volume is small, just always inject them. Auto-extracted signals are like Claude Code’s auto memory: agent-authored, grows over time, needs structure to stay manageable. Same human-vs-agent split, same logic about loading strategy.

What I took: the framing that the writer of a memory dictates how it should be loaded and managed. Human-authored signals are small, stable, and load in full. Agent-authored signals are larger, grow over time, and need structure.

MemGPT and Letta’s core memory

The MemGPT paper (Packer et al., 2023) introduced the idea of a fixed-size core memory block that always stays in context, paired with an overflow tier the agent can read on demand. The agent edits its own core memory inline through tool calls during conversations. Letta later productionized this as a memory framework with a full SDK.

The interesting decision in MemGPT, for my purposes, was the explicit two-tier split. Not all memory is equally important, and the system architecture treats the tiers differently rather than trying to find a single relevance score that captures the difference. The core block is small (a couple thousand tokens, configurable) and guaranteed to be in context every turn. Everything else is in a larger overflow tier and only loads when something pulls it in.

The classification of which memories go in which tier is, in MemGPT, done by the agent itself through tool calls. When the agent reads something it decides is important enough to keep around, it calls a tool to write it into core. When core fills up, the agent edits its own block to make room.

What I took: the two-tier split itself.
A small, hard-capped set of the most identity-defining signals, and a larger budget for situational ones. This is the core idea behind the tiering design I shipped.

What I changed: in MemGPT the agent edits core inline. In Falconer, signals are extracted in a background cron, not inline, so the classification happens in the same LLM call that does extraction. The agent never edits its own memory. The cron does.

What everyone agrees on

Across all three systems, the same three opinions show up:

- Brute-force context injection beats clever retrieval at this scale.
- LLMs interpret relevance at read time better than engineering logic does at write time.
- Conservative classification beats aggressive classification.

The disagreements show up in how they protect important memories from getting buried as memory grows. ChatGPT solves it by re-summarizing the dossier. Claude Code solves it by separating the human-written file from the agent-written notes and loading them differently. MemGPT solves it with a hard-capped core block. Three different answers to the same shape of problem.

When flat injection breaks

Falconer’s first version of agent signals was the simplest thing that worked, modeled directly on ChatGPT’s flat-injection approach. One Postgres table, plain text content, and one read query at conversation start: the top fifty signals by recency, dropped into the prompt as bullets.

This was fine for about three months. Then I looked at prod. 487 signals across 8 users. Average of about 60 per user. One user at 210. The extraction LLM was doing its job, and it was also producing near-duplicates faster than dedup could keep up. Five variants of “prefers concise writing” was not unusual.

The failure mode this set up was obvious in hindsight. A signal like “Senior backend engineer at Meta” gets created in the user’s first or second conversation. It’s identity. The agent should know this on every turn, forever.
Then over the next week the user has fifty conversations about random topics, the extraction LLM produces fifty assorted contextual observations, and “Senior backend engineer at Meta” falls off the bottom of the recency window. The system silently forgets what’s most important.

Recency was the only signal that survived the prompt budget. That was fine when there were ten signals. It was actively wrong when there were sixty.

ChatGPT had this exact problem. Their answer was the AI-summarized dossier. That works for them, but Falconer’s signals have a shape ChatGPT’s don’t: provenance (which conversation the signal came from), origin (Slack vs web vs the manual settings page), and a bifurcated write pattern (LLM-extracted vs user-curated). A blind periodic rewrite would lose all of that.

Claude Code’s split also doesn’t quite map. Falconer doesn’t have a clean human-vs-agent boundary on the same memory. Manual and auto-extracted signals live in the same table and the agent reads them together. The human-vs-agent split is about how individual signals were written, not about which file to load.

That left the MemGPT shape: tier the signals, hard-cap the top tier, give the rest a soft budget. That’s roughly what I shipped.

What I shipped, briefly

The architecture rests on three buckets, each filling a different role:

Manual signals. Anything the user typed into the settings page. Sacred. Never consolidated, never demoted, never capped. Always injected. This is the CLAUDE.md equivalent: small, stable, human-authored, full-load.

Inferred core. Durable identity traits the extraction LLM classified as core at extraction time. Hard cap of 20 per user. These are the signals that should survive regardless of how many newer contextual ones pile up. This is the MemGPT core-memory equivalent.

Inferred contextual. Everything else from extraction. Top 50 by recency. The usual eviction rule from v1, but operating in its own budget so it can’t displace the identity tier.

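To make the difference between the v1 recency window and the three buckets concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Signal` shape, the tier labels, and the function names are hypothetical, the selection runs in memory rather than as Postgres queries, and the core cap is taken to be enforced elsewhere, when signals are written.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CONTEXTUAL_BUDGET = 50  # top-N by recency for inferred contextual signals


@dataclass(frozen=True)
class Signal:
    content: str
    tier: str  # "manual", "core", or "contextual" (hypothetical labels)
    created_at: datetime


def select_flat_v1(signals: list[Signal], limit: int = 50) -> list[Signal]:
    """v1 behaviour: one flat list, newest first, cut off at `limit`."""
    return sorted(signals, key=lambda s: s.created_at, reverse=True)[:limit]


def select_tiered(signals: list[Signal]) -> dict[str, list[Signal]]:
    """Tiered behaviour: each bucket is selected against its own budget."""
    newest_first = sorted(signals, key=lambda s: s.created_at, reverse=True)
    return {
        # Manual signals are never capped or dropped.
        "manual": [s for s in newest_first if s.tier == "manual"],
        # Core is assumed to be already capped (20 per user) at write time.
        "core": [s for s in newest_first if s.tier == "core"],
        # Contextual keeps the v1 recency rule, but only within its bucket.
        "contextual": [s for s in newest_first if s.tier == "contextual"][:CONTEXTUAL_BUDGET],
    }


now = datetime(2025, 1, 31)
identity = Signal("Senior backend engineer at Meta", "core", now - timedelta(days=30))
noise = [
    Signal(f"Contextual observation {i}", "contextual", now - timedelta(hours=i))
    for i in range(60)
]
all_signals = [identity, *noise]

flat = select_flat_v1(all_signals)
tiered = select_tiered(all_signals)

print(identity in flat)            # False: pushed out by 60 newer signals
print(identity in tiered["core"])  # True: core has its own budget
print(len(tiered["contextual"]))   # 50
```

The old identity signal disappears from the flat window as soon as more than fifty newer signals exist, while the tiered selection keeps it because contextual signals can only evict each other.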
Classification happens inline at extraction time. The same LLM call that pulls signals out of a conversation also assigns each one a tier, with a “default to contextual when in doubt” bias borrowed straight from ChatGPT’s prompt structure. False-positive cores would defeat the cap; false-negative cores get a second chance through the contextual recency budget.

The 20-cap is enforced at write time, not read time. When new core candidates would push a user over 20, an LLM call reconciles the entire set by keeping, merging, or demoting signals. Putting that call at the write path means the read path stays a plain Postgres query: cheap, predictable, and not multiplied across every conversation a user starts.

On the read side, the three buckets get fetched in parallel and rendered into the system prompt under three sub-headings, one per bucket.

The structure matters more than it might look at first glance. Up to ~120 signals in a flat list dilutes the model’s attention. The same content under labeled sub-sections gives the model a priority signal it can attend to: user-confirmed beats core identity beats recent context. That single framing choice changed agent behavior more than the tiering math did.

What mattered

Almost none of this design is novel. The two-tier idea came from MemGPT. The conservative classification bias and the full-injection model came from ChatGPT. The human-vs-agent framing for how memory should be loaded came from Claude Code. The work, looking back, was much less about engineering and much more about reading carefully and figuring out which decisions from which system applied to my shape of the problem.

The thing that surprised me most was how consistent the production systems are on the load-bearing decisions: flat over hierarchical, full injection over retrieval, LLM judgment over engineering logic. And how different they are in the details. Once you’ve read all three, the design space narrows a lot. The hard part isn’t inventing. It’s choosing.

References

- Packer et al., MemGPT: Towards LLMs as Operating Systems, 2023. https://arxiv.org/abs/2310.08560
- Letta, documentation on the MemGPT architecture. https://docs.letta.com/concepts/memgpt/
- OpenAI, Memory and new controls for ChatGPT, 2024.
- Simon Willison, I really don’t like ChatGPT’s new memory dossier, 2025. https://simonwillison.net/2025/May/21/chatgpt-new-memory/
- Shlok Khemani, ChatGPT Memory and the Bitter Lesson. https://www.shloked.com/writing/chatgpt-memory-bitter-lesson
- Anthropic, Claude Code memory documentation. https://code.claude.com/docs/en/memory