My LLM Wiki in Practice — applying Karpathy's LLM Wiki architecture to a professional CRM knowledge base

A product manager at a large B2B e-commerce platform applied Andrej Karpathy's LLM Wiki architecture to a professional CRM knowledge base, extending it with source reliability grading and a multi-file schema to handle contradictory sources. The wiki, built in Obsidian with an LLM as the programmer, uses a three-layer architecture of raw sources, wiki pages, and schema, with a dispatch table for efficient context window usage.

This is an account of how I run a personal knowledge base using LLMs, built on the architecture Andrej Karpathy described in his LLM Wiki gist (https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f). The core pattern (Raw Sources → Wiki → Schema, with Ingest / Query / Lint as primitive operations) is his. What follows is what I’ve learned applying it to a specific professional context over several months, and the extensions I’ve found necessary that his intentionally abstract document leaves open. Karpathy’s document communicates the idea. This one documents an instance.

I’m a product manager at a large B2B e-commerce platform. My domain is a CRM system used by thousands of sales reps. The knowledge I need to manage is not academic research or personal notes; it’s a mix of internal strategy documents, meeting transcripts, sales methodology, competitive intelligence, product specs, and cross-team OKR dependencies. Much of it involves multiple competing stakeholders, changes weekly, and comes from sources of wildly varying reliability.

This is a harder problem than Karpathy’s research wiki use case in one specific way: the sources actively contradict each other, and that’s expected. A VP’s strategy deck says one thing, the sales team’s behavior says another, and the data tells a third story. The wiki needs to hold all three and make the contradictions explicit rather than resolving them prematurely.

The three-layer architecture is exactly right. Raw sources go in raw/ and are never modified. The LLM generates and maintains all wiki pages. The schema (AGENTS.md + rules/) governs how the LLM behaves. I use Obsidian as the IDE, the LLM as the programmer, the wiki as the codebase: his exact metaphor, and it works.

The index.md + log.md pattern is also correct as described. At ~120 files my wiki is still small enough that index-based navigation works without embedding search.
The log is grep-parseable with a strict YYYY-MM-DD operation | title format, which gives me a timeline I can pipe through Unix tools.

The key insight I want to emphasize from his document: the wiki is a persistent, compounding artifact. This single idea changed how I think about AI-assisted work. Before this, every conversation with an LLM was ephemeral, and valuable synthesis disappeared into chat history. Now it gets compiled into the wiki and compounds.

Karpathy describes the schema as a single file (CLAUDE.md or AGENTS.md). For a personal research wiki, that’s fine. For a professional knowledge base with multiple workflows, one file isn’t enough. My schema is three files:

- AGENTS.md: the entry point. Loaded automatically at session start. Contains identity, architecture overview, operation protocols, and a dispatch table pointing to the other two files.
- rules/crm-rules.md: global execution constraints. Information grading, citation obligations, anti-sycophancy rules, naming conventions, documentation standards.
- rules/crm-skills.md: a skill execution manual with a router that dispatches inputs to specialized workflows.

The reason for splitting: context window is expensive. Loading all rules on every session wastes tokens when you’re doing a simple query. The dispatch table in AGENTS.md lets the LLM load only what’s relevant. Think of it as lazy loading: the entry point is always in memory, the details are fetched on demand.

Karpathy doesn’t discuss source reliability, which makes sense for his use case: a research wiki where sources are published papers. In my context, sources range from "official system data" to "something someone said in a meeting that might have been a joke." If the LLM treats them equally, the wiki becomes unreliable.
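
One way to make that distinction mechanical is to attach a grade to every piece of evidence and flag answers that rest only on weak sources. The sketch below is an illustration, not my actual schema: it borrows the four tiers (P0 to P3) listed just below, and the names `Grade`, `Evidence`, `is_decision_ready`, and `annotate`, as well as the example paths, are hypothetical. Treating P3-only evidence the same way as P2-only evidence is also an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    """Reliability tiers; a lower value means a more reliable source."""
    P0 = 0  # first-party system data, approved PRDs, official OKR docs
    P1 = 1  # formal documents from partner teams
    P2 = 2  # meeting transcripts, chat messages, verbal discussions
    P3 = 3  # external references, industry reports


@dataclass(frozen=True)
class Evidence:
    source: str  # hypothetical path under raw/
    grade: Grade


def is_decision_ready(evidence: list[Evidence]) -> bool:
    """True only if at least one supporting source is graded P0 or P1."""
    return any(e.grade <= Grade.P1 for e in evidence)


def annotate(answer: str, evidence: list[Evidence]) -> str:
    """Prefix the answer with a warning when it lacks P0/P1 backing."""
    if is_decision_ready(evidence):
        return answer
    grades = ", ".join(sorted({e.grade.name for e in evidence})) or "none"
    return f"[UNVERIFIED: evidence grades = {grades}] {answer}"
```

With this shape, an answer backed only by a meeting transcript comes back with the `[UNVERIFIED ...]` prefix, while the same answer backed by a transcript plus an approved PRD comes back unchanged.
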
I grade sources P0 through P3:

- P0: first-party system data, approved PRDs, official OKR docs
- P1: formal documents from partner teams
- P2: meeting transcripts, chat messages, verbal discussions
- P3: external references, industry reports

The rule is simple: a P2 source alone cannot serve as the basis for a decision. It must be cross-verified against P0 or P1. The LLM enforces this: when I ask a question and the answer rests on P2-only evidence, it flags that explicitly rather than presenting it as settled fact.

This sounds bureaucratic, but it solves a real problem: meeting transcripts are the most voluminous source I have, and also the least reliable. Without grading, the wiki would be dominated by half-remembered statements from meetings.

Query write-back is the gap I found most clearly after reading Karpathy’s document. He mentions it in one sentence: "good answers can be filed back into the wiki as new pages." True, and important enough to deserve a formal protocol.

In practice, I found that without an explicit write-back step, valuable query outputs were disappearing into session history. The LLM would synthesize something brilliant (a competitive analysis, a decision framework, a connection between two separate threads) and then it would be gone.

My implementation: after every skill execution completes, the LLM runs a judgment step. "Does this output contain reusable knowledge?" The criteria are:

- Contains 3+ structured insights or conclusions
- Introduces a new decision, hypothesis, or verification metric
- Corrects or supplements an existing wiki page
- The user explicitly says "remember this"

If yes, the LLM drafts a wiki page proposal (title, summary, suggested location). If I confirm, it creates the page and updates the index. If no, it writes a one-line entry in the log so there’s at least a breadcrumb.

The key constraint: it never creates a wiki page without my confirmation. The write-back is a proposal, not an automatic action.
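
The judgment step above can be sketched as a small decision function. This is a sketch under stated assumptions, not the prompt logic I actually run: the four criteria are treated as any-of, both a negative judgment and a declined proposal fall through to the log breadcrumb, and `SkillOutput`, `Proposal`, `is_reusable`, `write_back`, and the `confirm` callback (standing in for my manual confirmation) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class SkillOutput:
    title: str
    insight_count: int = 0              # structured insights or conclusions
    introduces_new_item: bool = False   # new decision, hypothesis, or metric
    amends_existing_page: bool = False  # corrects or supplements a wiki page
    user_said_remember: bool = False    # explicit "remember this"


@dataclass
class Proposal:
    title: str
    summary: str
    location: str


def is_reusable(out: SkillOutput) -> bool:
    """Any single criterion is enough (an assumption, see lead-in)."""
    return (
        out.insight_count >= 3
        or out.introduces_new_item
        or out.amends_existing_page
        or out.user_said_remember
    )


def write_back(
    out: SkillOutput,
    summary: str,
    location: str,
    confirm: Callable[[Proposal], bool],
) -> Tuple[str, str]:
    """Return ("page", path) only after confirmation, else ("log", title)."""
    if is_reusable(out):
        proposal = Proposal(out.title, summary, location)
        if confirm(proposal):  # a page is never created without this
            return ("page", f"{proposal.location}/{proposal.title}")
    return ("log", out.title)  # one-line breadcrumb
```

Note that `confirm` is only invoked when the output passes the judgment step, so trivial outputs never interrupt the session with a proposal.
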
This matters because not every good answer deserves a permanent page: some are contextual, some are preliminary, some are useful once but not worth maintaining.

Karpathy’s schema is descriptive: it tells the LLM what the wiki looks like and what conventions to follow. My schema is procedural: it tells the LLM how to execute specific professional workflows with structured inputs and outputs. I have eight skills, dispatched by a router. Among them:

- Report QA: When a subordinate submits a weekly report, the LLM extracts the implicit claim, checks data consistency, compares against prior commitments, generates pointed questions, and identifies what’s conspicuously absent.
- Decision stress-testing: When I’m considering a major decision, the LLM runs it through multiple adversarial perspectives, each with a defined identity, core interests, fears, and argumentation style. Not generic "consider the other side" advice, but specific personas with specific stakes.
- Meeting briefing: Before a meeting where I need to take a position, the LLM generates a one-page brief with decision points, stakeholder interests, collective blind spots, and high-leverage questions.
- Upward reporting: When I need to present to leadership, the LLM drafts in a specific writing style calibrated to my VP’s preferences (data-first, conclusion-led, before→after framing, revenue arithmetic visible).
- Meeting transcript compilation: Raw meeting recordings get processed through signal extraction (strategy shifts, corrected conclusions, new facts, action items, silence signals), routed to the appropriate wiki pages, and presented as a change proposal for my confirmation.

The router pattern is worth emphasizing. The LLM doesn’t need to figure out what to do; it pattern-matches the input against signal words and dispatches to the right skill. "我打算..." ("I’m planning to...") triggers stress-testing. A meeting transcript triggers compilation. A subordinate’s report triggers QA. This makes the system predictable and debuggable.
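
In my setup the router lives in the schema as instructions to the LLM, not as code, but the same idea can be sketched as an ordered table of patterns. Everything here is illustrative: only the "我打算" signal comes from the text above, while the skill names, the timestamp heuristic for transcripts, the "weekly report" pattern, and the `plain_query` fallback are placeholders I made up for the example.

```python
import re

# (skill name, pattern) pairs, checked in order; the first match wins.
# Only "我打算" is a real signal word; the other patterns are placeholders.
ROUTES = [
    ("decision_stress_test", re.compile(r"我打算|I'm planning to", re.IGNORECASE)),
    # Assumed heuristic: transcripts have lines starting with a timestamp.
    ("transcript_compilation", re.compile(r"^\s*\[?\d{1,2}:\d{2}", re.MULTILINE)),
    ("report_qa", re.compile(r"weekly report", re.IGNORECASE)),
]


def route(text: str) -> str:
    """Return the first skill whose signal pattern matches the input."""
    for skill, pattern in ROUTES:
        if pattern.search(text):
            return skill
    return "plain_query"  # hypothetical fallback: no skill, just answer


if __name__ == "__main__":
    print(route("我打算把线索分配改成自动化"))
    print(route("00:01 Alice: let's start\n00:05 Bob: ok"))
    print(route("Weekly report: pipeline review"))
```

Because the table is ordered and each entry is a plain pattern, a misrouted input can be debugged by checking which pattern fired first, which is the predictability the router is meant to provide.
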
Karpathy’s model is: ingest source → update wiki. That’s the right foundation, but for professional documents that will be read by executives, there’s a quality assurance problem. A first draft that goes through stress-testing and editorial review is qualitatively different from one that doesn’t. I formalized this as a five-stage pipeline:

- Draft: initial generation
- Style polish: align to the target reader’s preferences
- Stress test: adversarial review of core claims (lightweight version: 2 perspectives, half a screen)
- Quality check: data consistency, source grading, logical coherence, terminology
- Final: confirmed deliverable + feedback collection

The important protocol: the LLM must prompt the next stage. It cannot silently stop at stage 1 and wait for me to remember that stress-testing exists. I can skip any stage explicitly, but the LLM must at least offer it.

This is overkill for a personal wiki. It’s essential for documents that will influence business decisions.

Karpathy mentions "you and the LLM co-evolve the schema over time." I’ve found this needs more structure than that implies, because the failure mode is schema drift: rules accumulate without pruning, and the schema becomes so large it’s ignored. Four layers of evolution:

- L0 (per-session): After every skill execution, judge whether the output should write back to the wiki. This is the query write-back mechanism described above.
- L1 (per-interaction): After every skill output, the LLM asks "was anything inaccurate or particularly useful?" Feedback gets translated into a patch and applied immediately. No batching.
- L2 (monthly): Full wiki health check. Broken links, misplaced files, stale information, perspective cards needing updated data.
- L3 (quarterly): Prune rules that haven’t generated value in 3 months. Refresh the router’s signal word table. Version bump the schema.

The key insight: rules that aren’t actively maintained become lies.
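
The quarterly pruning pass (L3) can be sketched as a simple staleness filter. This is an illustration under assumptions: it presumes each rule carries a record of the last date it visibly changed an output, which is bookkeeping the text above does not describe, and it approximates "3 months" as 90 days. `Rule`, `last_useful`, and `prune_candidates` are hypothetical names. The function only lists candidates; the pruning itself stays a manual decision.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Rule:
    rule_id: str       # hypothetical identifier for a schema rule
    last_useful: date  # hypothetical: last date the rule changed an output


def prune_candidates(
    rules: list[Rule], today: date, max_idle_days: int = 90
) -> list[str]:
    """Return ids of rules idle for more than max_idle_days, in input order."""
    cutoff = today - timedelta(days=max_idle_days)
    return [r.rule_id for r in rules if r.last_useful < cutoff]
```

A rule last useful exactly 90 days ago is kept; one idle for 91 days shows up as a candidate for review.
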
A rule in the schema that the LLM has been ignoring for two months is worse than no rule: it erodes trust in the entire system. Quarterly pruning is hygiene.

Karpathy’s model is conversational: you talk to the LLM, it updates the wiki. But some knowledge sources update on their own schedule, and I don’t want to manually check them. I have cron jobs that:

- Monitor internal documentation platforms weekly for new articles, and ingest them automatically following the standard pipeline
- Scan for uncompiled meeting transcripts and prompt me when there’s a backlog

These are registered in a dedicated automation.md with explicit IDs, schedules, error handling, and notification templates. The LLM can reference them, modify them, and explain their status. The automation layer turns the wiki from a passive store into an active monitoring system.

I use multiple AI agents: Claude Code, QoderWork, and AccioWork for desktop work, sometimes others for different contexts. The vault is the shared fact source for all of them. Each agent has its own ephemeral memory, but long-term facts live exclusively in the vault.

The rule: agent memory is a session cache. The wiki is the database. If an insight matters beyond the current conversation, it goes into the vault. Agent-local memory just holds pointers and job IDs. This prevents the common failure where important context lives in one agent’s memory and is invisible to another. The vault is the coordination layer.

The biggest value is not retrieval; it’s forcing compilation. When you know the LLM will integrate new information into existing pages, you start noticing connections you’d otherwise miss. The act of ingestion (deciding where something goes, what it contradicts, what it extends) is itself an analytical act. The wiki is a byproduct; the real product is the thinking that happens during compilation.

The schema is the most valuable file in the system. Not the wiki pages, not the raw sources: the schema.
It encodes how I think about problems, what I consider important, what I refuse to tolerate. If I lost all wiki pages but kept the schema, I could rebuild in weeks. If I lost the schema but kept the pages, I’d have an unstructured pile of text.

The LLM’s role is bookkeeping, not thinking. I direct the analysis. I decide what’s important. I make the judgment calls. The LLM does the cross-referencing, filing, formatting, consistency-checking, and maintenance that I would never do manually. This division of labor is stable and sustainable. The moment I start relying on the LLM for judgment rather than bookkeeping, quality degrades.

Start with Schema, not with sources. My instinct was to dump everything in and let the LLM sort it out. Wrong. Start with: What decisions do I need to make? What information do I need to make them? What does good output look like? Then build the schema to produce that output. Then start feeding sources. The schema-first approach means every source you ingest has a clear destination.

Contradictions are features, not bugs. The wiki doesn’t need to be internally consistent. Reality isn’t internally consistent. A wiki that accurately represents "the VP said X, the data shows Y, and sales believes Z" is more useful than one that prematurely resolves the contradiction. Flag it, don’t fix it.

The 120-file ceiling hasn’t been a problem. Karpathy mentions that the index-based approach works at "moderate scale (~100 sources, ~hundreds of pages)." I’m at about 120 files and the index + MOC (Map of Content) approach still works well. I suspect I’ll need proper search around 300-400 pages, but not yet.

The main thing I haven’t solved is multi-user collaboration. The wiki is currently a single-player system: one person’s knowledge, one person’s judgment, one person’s confirmation on all writes. A team wiki maintained by LLMs, as Karpathy mentions, would need consensus mechanisms, conflict resolution, and access control on the schema layer.
I don’t have a good design for that yet.

The other open question is how much of the schema is transferable. I’ve extracted a generic version of this system (the vault structure, the skill router, the pipeline, the evolution layers) into a shareable template. But the most valuable parts (the perspective cards for stress-testing, the information grading rubric, the specific writing style calibration) are deeply domain-specific. The pattern is transferable; the instantiation isn’t.

That’s fine. As Karpathy says: "The document’s only job is to communicate the pattern. Your LLM can figure out the rest."

This document describes the system as of June 2026. The schema version is 2.5. The vault has been active since June 10, 2026: approximately one week of intensive use with daily ingestion and skill execution.