The original LLM Wiki turned reading into knowledge. After running that pattern in practice, the harder problem became turning knowledge into action. This is a field report on agent operating memory: evidence enters, understanding evolves, actions get compressed, and a human audits only where judgment becomes action.
The most dangerous page in an agent's memory is not the messy one. It is the tidy page that is quietly wrong: clean, well linked, confidently written, and weeks out of date. A human wiki fails loudly, by looking abandoned. A memory an agent maintains fails silently, by staying pretty while it goes stale. And once an agent acts on it, the silent failure is the expensive one.
Andrej Karpathy's original LLM Wiki showed how to compile knowledge instead of retrieving it on every query: the LLM builds a persistent, interlinked wiki, so knowledge is written once and kept current instead of rediscovered each time. After running that pattern in practice, I found the next question was different. The moment an agent stops just maintaining the wiki and starts acting on it, what you need is not a better knowledge base. It is operating memory: something you can run a decision on, and still audit.
I mean this as a field report, not a standard. It is written as an idea file, the way the original was. It describes a pattern, not an implementation. Take what is useful and build the rest yourself.
The system this came from is not large by company standards, but it is large enough to rot: a few hundred compiled wiki pages, hundreds of raw sources, meeting transcripts, and email evidence timelines.
First, knowledge bases had one reader: you. You wrote the notes, linked them, reread them, and forgot most of them. Then LLMs became the librarian: the agent reads a source, writes the summary, updates the links, files it, and you stay the reader. This is the LLM Wiki model, and it works.
The next shift is the one that matters here. With agents that take actions, the knowledge base is no longer something a human reads with help from an AI. It is something an agent reads, writes, and acts on, while a human checks the result.
The agent is no longer a tool inside the knowledge system. It is the primary operator of the knowledge system.
Everything else follows from this one move. An operator that never tires will maintain a wiki forever. It will maintain a wrong wiki forever too, with identical diligence, and then act on it. So the design problem is no longer "how do I store more." It is "how do I keep an operator from quietly corrupting its own memory, and how does a human stay in control without reading everything."
The old personal-knowledge question is "how do I remember more." It assumes the bottleneck is storage. The agent-era question is "how does an agent inherit context reliably, without fossilizing errors into long-term memory." The bottleneck is not volume. It is trust: can something that will act tomorrow rely on what was written today.
That changes which failures matter. The ones that bite:
- There is too much to read, so no human reviews the whole base, and by default the human stops being a safety check.
- Automatic capture pollutes: if every chat turn is stored, guesses get the permanence of facts.
- Clean surfaces go stale but stay credible, so they are believed past their expiry.
- Append-only pages rot: new judgment stacks on old, and no one can tell which line is currently true.
- Answers with no evidence chain are black boxes: they can be believed, not audited.
None of these is a storage problem. The reframe: an agent's memory does not need more storage. It needs governance, the kind that keeps bad inputs out instead of just making room for more.
flowchart TD
A([Evidence enters]) -->|"entry discipline: immune system"| B([Understanding evolves])
B -->|"evolution discipline: living pages"| C([Actions get compressed])
C -->|"audit discipline: derived cockpit"| D([Humans audit and decide])
D -->|"new evidence"| A
The loop is the pattern. What makes it governed rather than a pipeline is that every step carries a discipline whose only job is to stop one kind of rot:
- Evidence enters under an entry discipline: an immune system that keeps unconfirmed material out of long-term memory.
- Understanding evolves under an evolution discipline: living pages that get rewritten instead of stacked.
- Actions get compressed under an audit discipline: a small derived surface, kept honest, that is the only thing a human has to read.
These are not requests the agent is asked to remember. They live in a schema file that every agent reads at the start of every session, and can be enforced by a linter on a schedule. Write the rules down once, in a tool-neutral file, and any agent that operates the memory inherits the same governance. The complexity lives there, in the background, not in the daily writing.
Split the system by who is allowed to change what:
raw -> immutable evidence (no one rewrites it, not even the agent)
wiki -> compiled understanding (evolves, merges, corrects itself)
cockpit -> derived operating view (reflects judgment; never originates truth)
This looks like a way to sort files. It is not. The point is not what goes where. It is who is allowed to change what. Raw evidence is the source of truth precisely because nothing may edit it, not even the agent; corrections live in the wiki as errata, never as edits to the source. The wiki is the only place the current best answer lives, and it is expected to move. The cockpit may reflect but never originate. A derived view may reflect the truth. It may never be the truth.
The tradeoff is real: immutability means redundancy, and you follow a link to reconcile a correction with what it corrects. You give up editing in place to keep one layer that has never been quietly rewritten. That layer is the anchor everything else is audited against.
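One way to make "who is allowed to change what" mechanical is a small write guard. The sketch below is an illustration under stated assumptions, not part of the original pattern: the `Actor` names, the `layer_of` and `can_write` helpers, and the idea of a separate generator process for the cockpit are all hypothetical, and paths are assumed to be relative to the memory root.

```python
from enum import Enum
from pathlib import PurePosixPath


class Actor(Enum):
    HUMAN = "human"
    AGENT = "agent"
    GENERATOR = "generator"  # hypothetical: the script that rebuilds the cockpit


def layer_of(path: str) -> str:
    """Map a path relative to the memory root onto one of the three layers."""
    parts = PurePosixPath(path).parts
    first = parts[0] if parts else ""
    if first in ("raw", "wiki"):
        return first
    if first == "cockpit.md":
        return "cockpit"
    return "other"  # anything else is not covered by this sketch


def can_write(actor: Actor, path: str, exists: bool) -> bool:
    """Assumed policy: raw is add-only, wiki is editable, cockpit is generated."""
    layer = layer_of(path)
    if layer == "raw":
        # New evidence may be added; an existing file is never rewritten,
        # not even by the agent.
        return not exists
    if layer == "wiki":
        return actor in (Actor.AGENT, Actor.HUMAN)
    if layer == "cockpit":
        # Regenerated, never hand-edited.
        return actor is Actor.GENERATOR
    return True
```

The point of the guard is that the rule is keyed on the layer and the actor, not on what the content says.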
This is the most important rule, and the least obvious.
Most of what passes between a human and an agent is not knowledge. It is thinking out loud: guesses, frustration, a hypothesis that will die in ten minutes, a number from a call that turns out wrong. If all of it flows automatically into long-term memory, the memory does not get richer. It gets confidently poisoned. A hypothesis floated on Tuesday becomes a "fact" on a page by Friday, sitting next to real facts and indistinguishable from them.
Memory needs an immune system, not more storage.
The rules are few, and they all apply at the moment of writing:
- Discussion does not persist by default. A conversation is not an ingest.
- Nothing enters canonical memory until a human confirms it. Confirmation is a deliberate act: promote the candidate, date it, type it. Silence is not confirmation.
- On confirmation, classify before storing: fact, judgment, hypothesis, decision, or someone else's claim. These are not the same and must not be filed as if they were.
- Unconfirmed material may live as an explicit, marked candidate, never dressed as established.
Many memory designs put their effort into cleaning up after the fact: confidence scores, decay curves, contradiction resolution. This rule sits upstream of all that. The cheapest way to keep bad data out of memory is to not let it in.
This also cuts against the most seductive idea in the original pattern, that good answers should always be filed back so explorations compound. Half right. The missing half: an operator that writes everything down compounds its mistakes as efficiently as its insights. The discipline is not "capture more." It is "capture deliberately." The cost is real, and worth naming: an immune system has friction, and now and then you lose a true signal because it was never confirmed. A slightly forgetful memory you can trust beats a total memory you cannot. For something that acts on its own memory, this is the trade that matters most.
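The entry rules can be sketched as a single promotion step. This is a minimal illustration, assuming a candidate is just text plus a source label; the `Candidate`, `Entry`, and `promote` names are hypothetical, and how a human actually signals confirmation is left open.

```python
from dataclasses import dataclass
from datetime import date

CLAIM_TYPES = {"fact", "judgment", "hypothesis", "decision", "external_claim"}


@dataclass(frozen=True)
class Candidate:
    """Unconfirmed material: explicit and marked, never canonical."""
    text: str
    source: str


@dataclass(frozen=True)
class Entry:
    """A confirmed, dated, typed statement that may enter long-term memory."""
    text: str
    source: str
    claim_type: str
    confirmed_on: date
    confirmed_by: str


def promote(candidate: Candidate, claim_type: str,
            confirmed_by: str, today: date) -> Entry:
    """Turn a candidate into an entry only on a deliberate confirmation."""
    if not confirmed_by:
        # Silence is not confirmation.
        raise PermissionError("no human confirmed this candidate")
    if claim_type not in CLAIM_TYPES:
        # Classify before storing.
        raise ValueError(f"unknown claim type: {claim_type!r}")
    return Entry(candidate.text, candidate.source,
                 claim_type, today, confirmed_by)
```

There is deliberately no code path that creates an `Entry` without a named confirmer and a claim type, which is the friction the trade-off above describes.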
Notes fail by being append-only. Add to a page for a year and it becomes sediment: every opinion ever held, with no marker for which is current. A living page evolves instead. The shape:
## Current Understanding (rewritten in place when judgment changes)
## Evidence Timeline (append-only, dated, each entry cites a source)
## Open Questions
## Sources
No confidence numbers, no decay schedule. Two rules carry it. The current understanding is rewritten, not extended: it may be wrong tomorrow, but it may not be stale today. The timeline is append-only and never edited: a wrong judgment is corrected by a new dated entry, not by deleting the trail. History is preserved, and history is never allowed to impersonate the present.
The original gist says to note where new data contradicts old claims. Living pages decide where: resolved at the top, recorded at the bottom. The risk is that rewriting the top is exactly where a confident but wrong rewrite can launder itself into truth. The defense is structural: the timeline keeps the receipts and the top must cite them, so an unsupported rewrite is visibly unsupported.
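The two rules, and the structural defense, can be sketched as a small data structure. This is an illustration under assumptions: the `LivingPage` and `TimelineEntry` names are hypothetical, sources are compared as plain strings, and "append-only" is expressed only by the class offering no edit or delete method.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class TimelineEntry:
    on: date
    claim_type: str
    text: str
    source: str


@dataclass
class LivingPage:
    current_understanding: str = ""
    cited: Tuple[str, ...] = ()
    timeline: List[TimelineEntry] = field(default_factory=list)

    def append_evidence(self, entry: TimelineEntry) -> None:
        """Append-only: corrections arrive as new dated entries."""
        self.timeline.append(entry)

    def rewrite_understanding(self, text: str, cites: List[str]) -> None:
        """Replace the top section, but only if it cites the timeline."""
        known = {entry.source for entry in self.timeline}
        missing = [c for c in cites if c not in known]
        if not cites or missing:
            raise ValueError(f"unsupported rewrite, missing sources: {missing}")
        # Rewritten in place, not extended: the old text is dropped here
        # because the timeline already keeps the history.
        self.current_understanding = text
        self.cited = tuple(cites)
```

A rewrite that cites nothing in the timeline fails loudly instead of landing on the page, which is the "visibly unsupported" property in code form.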
Prior work foregrounds memory quality and scale. What I have not seen given the same weight is the small, human-audited surface where memory turns into action. Return to the tidy page that is quietly wrong: that is a failure of this layer, and it is the one that costs the most, because the cockpit is what a human actually acts on.
A cockpit is a derived view, never a source of truth. It exists because a human cannot and should not browse the whole memory to learn what is going on. So you compress: out of the entire base, render only the small surface where current judgment becomes a next action, and have the human read just that. What rises to that surface is ranked, not dumped: roughly urgency × consequence × uncertainty × your leverage, with stale-but-important items pushed up rather than buried. Picture what a CEO needs to decide a week: the few live decisions, each with a source to click, and nothing else.
Humans should not browse the whole memory. They should audit the compressed surface where judgment turns into action.
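The rough ranking described above (urgency times consequence times uncertainty times leverage, with stale items pushed up) could look like the sketch below. Everything numeric here is an assumption for illustration: the 0 to 1 scales, the 14-day staleness threshold, the 1.5 boost, and the limit of five items are placeholders, and the `Item`, `score`, and `top_of_cockpit` names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Item:
    title: str
    urgency: float       # assumed scale: 0.0 to 1.0
    consequence: float   # assumed scale: 0.0 to 1.0
    uncertainty: float   # assumed scale: 0.0 to 1.0
    leverage: float      # assumed scale: 0.0 to 1.0
    days_since_confirmed: int = 0


def score(item: Item, stale_after_days: int = 14,
          stale_boost: float = 1.5) -> float:
    """Multiply the four factors; boost items that have gone stale."""
    base = (item.urgency * item.consequence
            * item.uncertainty * item.leverage)
    if item.days_since_confirmed > stale_after_days:
        base *= stale_boost  # stale-but-important rises instead of sinking
    return base


def top_of_cockpit(items: List[Item], limit: int = 5) -> List[Item]:
    """Return the few highest-scoring items, best first."""
    return sorted(items, key=score, reverse=True)[:limit]
```

Because the factors multiply, an item that scores zero on any one of them drops off the surface, which matches the intent of ranking rather than dumping.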
A good cockpit:
- Fits on one screen. If it does not fit, that is a bug, not a big dashboard.
- States its freshness out loud: "last confirmed on this date, N items unverified."
- Cites the pages every block came from, so an audit is one click away.
- Carries current judgment and next actions only, never the evidence itself.
- Is regenerated, never hand-edited. If a page cannot be thrown away and rebuilt from the layers below, it is not a derived view.
The trap is the freshness contract itself. A derived view is only as honest as its regeneration. Stamp a page "fresh" without re-scanning the evidence beneath it and you have manufactured the exact thing the system exists to prevent: a clean, confident page that is silently wrong, the kind a human acts on. The fix is mechanical and cultural, and the split matters. A linter catches the easy half: when a source changed after the page's last_confirmed date, the page is stale and has to say so. It cannot catch the hard half: a regeneration that stamps the page fresh when no one actually re-read the evidence. No tool sees that. Only a standing rule does: never stamp a freshness you did not earn. That rule is easy to write and easy to break, which is why this layer is the highest-leverage one and the easiest to turn into a comfortable lie.
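The easy half, the part a linter can catch, is a date comparison. A minimal sketch, assuming each page knows its last_confirmed date and the change dates of its sources; the `is_stale` and `freshness_banner` names and the banner wording are hypothetical.

```python
from datetime import date
from typing import Iterable


def is_stale(last_confirmed: date, source_dates: Iterable[date]) -> bool:
    """True if any source changed after the page was last confirmed."""
    return any(changed > last_confirmed for changed in source_dates)


def freshness_banner(last_confirmed: date, source_dates: Iterable[date],
                     unverified: int) -> str:
    """State freshness out loud, and say so when the page is stale."""
    line = (f"Last confirmed {last_confirmed.isoformat()}"
            f" · {unverified} inputs unverified")
    if is_stale(last_confirmed, source_dates):
        return "STALE · " + line
    return line
```

Note what this cannot do: if a regeneration simply sets last_confirmed to today without anyone re-reading the evidence, `is_stale` returns False and the banner looks clean. That is the hard half, and it stays a rule for the operator rather than a check in code.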
The recent direction of this lineage has been toward more machinery: typed knowledge graphs, confidence scores on every claim, hybrid vector search, consolidation tiers. All of it is valuable, and at enough scale, correct. I want to argue the other way, for the single operator.
The reason to delay is attention, not capability. If the agent must hand-maintain rich structure on every write, the cost of writing rises and attention drifts from what matters, reading the evidence and deciding the next action, to what does not, keeping the ontology well-formed. Attention spent keeping the ontology well-formed is attention not spent on whether the judgment is right.
Complexity belongs in the background tools, not in the everyday writing loop.
So: start with markdown, backlinks, and light frontmatter. Add a few typed relations only on the core living pages. Use linters and scripts to extract and check structure in the background, not at write time. Build a real graph or search layer only when scale forces it, as a background index over the markdown rather than a thing a human feeds by hand. You give up query power early; retrieval is "read the index, go to the hub, read a few pages," not semantic search over thousands of chunks. Let that break first, then add the machinery when the pain is real.
A minimal skeleton to start from. Plain markdown and a few conventions, no framework.
On disk:
memory/
├── AGENTS.md       # the schema: rules every agent reads first
├── index.md        # catalog, one line per page: the agent's map
├── log.md          # append-only record of what happened and when
├── cockpit.md      # derived operating view (regenerated, never hand-edited)
├── raw/            # immutable evidence (never edited, not even by the agent)
└── wiki/           # compiled understanding (the agent owns this)
    ├── entities/   # people, orgs, products
    ├── topics/     # long-running threads
    └── notes/      # everything else
Claim types. Every timeline entry and cockpit line carries one, so how much a statement has earned is visible at a glance:
- fact: supported by cited evidence the operator judged sufficient.
- judgment: my current call. Expected to change.
- hypothesis: a guess, not yet tested.
- decision: a commitment to act.
- external_claim: someone else asserted it; not yet mine.
A living page:
---
type: entity
updated: 2026-06-21
last_confirmed: 2026-06-21
---
## Current Understanding (rewritten in place)
Phase 2 signed. Likely to renew. Open risk: pricing.
## Evidence Timeline (append-only, dated, typed, cited)
- 2026-06-18 [fact] Phase 2 contract signed · [[2026-06-18 email]]
- 2026-06-12 [judgment] likely to renew · [[2026-06-12 call]]
- 2026-06-05 [hypothesis] price is the real blocker (unconfirmed)
## Open Questions
- Is pricing the blocker, or a proxy for scope?
## Sources
- [[2026-06-18 email]], [[2026-06-12 call]]
The cockpit, regenerated from the wiki, never typed by hand:
---
generated_at: 2026-06-21
last_confirmed: 2026-06-21
---
> Last confirmed 2026-06-21 · 2 inputs unverified
## Decide now
- Renew Client A? Likely yes; pricing risk open.
  next: get sign-off by Friday · [judgment] · [[Client A]]
Five linter checks turn the disciplines into pass or fail, so the rules survive without willpower:
- Every cockpit line links at least one wiki page. No claim without a source.
- The cockpit carries generated_at and last_confirmed. Past the threshold, it must render a loud stale banner or the lint fails.
- raw/ is unchanged. Any edit to an evidence file fails the build.
- Every timeline entry is dated and carries a claim type.
- Every wiki page has at least one inbound link from another content page (links from index.md do not count) and appears in index.md. No orphans hiding behind the catalog.
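Two of these checks are simple enough to sketch directly: timeline entries must be dated and typed, and every cockpit item must link a wiki page. This is a minimal illustration, assuming the line formats shown in the examples above (a `- YYYY-MM-DD [type] text` timeline line and `[[Page]]` wikilinks); the function names and the error wording are hypothetical.

```python
import re
from datetime import date
from typing import List

CLAIM_TYPES = {"fact", "judgment", "hypothesis", "decision", "external_claim"}
TIMELINE_RE = re.compile(r"^- (\d{4}-\d{2}-\d{2}) \[([a-z_]+)\]\s+\S")
WIKILINK_RE = re.compile(r"\[\[[^\[\]]+\]\]")


def lint_timeline(lines: List[str]) -> List[str]:
    """Every timeline entry is dated and carries a known claim type."""
    errors = []
    for n, line in enumerate(lines, start=1):
        match = TIMELINE_RE.match(line)
        if match is None:
            errors.append(f"timeline line {n}: missing date or claim type")
            continue
        try:
            date.fromisoformat(match.group(1))
        except ValueError:
            errors.append(f"timeline line {n}: invalid date")
        if match.group(2) not in CLAIM_TYPES:
            errors.append(f"timeline line {n}: unknown claim type")
    return errors


def cockpit_items(text: str) -> List[str]:
    """Group each '- ' bullet with the continuation lines beneath it."""
    items: List[str] = []
    for line in text.splitlines():
        if line.startswith("- "):
            items.append(line)
        elif items and line.strip() and not line.startswith(("#", ">")):
            items[-1] += "\n" + line
    return items


def lint_cockpit(text: str) -> List[str]:
    """Every cockpit item links at least one wiki page."""
    return [
        f"cockpit item has no wiki link: {item.splitlines()[0]}"
        for item in cockpit_items(text)
        if not WIKILINK_RE.search(item)
    ]
```

An empty list from both functions is a pass; anything else is a failure message a scheduled job can surface. The remaining checks (frontmatter fields, an unchanged raw/, orphan detection) need file access and are left out of this sketch.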
One pass through the loop, end to end:
- An email lands in raw/. It is evidence, so nothing edits it, ever.
- Entry: it does not auto-persist. Until confirmed it is at most an external_claim. On confirmation it gets a claim type.
- Evolution: on Client A's living page, the Current Understanding is rewritten ("Phase 2 signed"), and a dated [fact] line is appended to the timeline with a link back to the email.
- Compression: the cockpit is regenerated. Client A's line updates, freshness is re-stamped, the source is linked.
- Audit: the human reads one line, and clicks through to the email only if they doubt it. The other few hundred pages stay unread. That is the point.
Karpathy's original is the seed: it shows how an LLM can turn raw sources into a persistent, interlinked wiki, so knowledge compiles once instead of being re-retrieved on every query.
A later community branch pushed that seed toward production: confidence scoring, knowledge graphs, hybrid search, lifecycle management. Its center of gravity is making a large memory scale and stay clean.
This goes the other way, and it is not a bid for the next version number. It is a field report on one branch: the layer you need once the wiki becomes operating memory, where knowledge has to drive action for a single operator without becoming quietly untrustworthy. The seed makes knowledge compound. The production branch makes it scale. This branch asks how it acts while staying auditable.
This is not a finished design, and pretending otherwise would be its own stale page:
- How do you automatically detect that a judgment has gone stale, instead of waiting for a human to notice?
- How do you extract structured relations cheaply, without turning every write into metadata work?
- How do you resolve conflicts when several agents write the same memory at once?
- Which judgments need a confidence score, and which are just noise with a number on them?
- At what scale does heavier machinery, like structured graphs or hybrid search, start to pay for itself for a solo operator?
This is not a note-taking method. The operator of your knowledge is now an agent, and the task is to give it a memory it can run on without a human reading everything, and without the system drifting into confident wrongness.
The goal is not a bigger notebook. It is a memory that agents can operate, humans can audit, and both can trust. The difference is not storage. It is an immune system at the door, an honest record of how judgment changed, and a small surface a human actually checks.
The original LLM Wiki gist is the seed and still the place to start. This is the layer I had to add in my own use: what memory becomes once the agent stops being a helper and becomes the operator. Take the pattern and build your own.