{"slug": "building-a-personal-wiki-from-your-own-digital-exhaust", "title": "Building a personal wiki from your own digital exhaust", "summary": "A developer built a personal wiki from their digital exhaust by processing 55 export directories into ~2,750 structured markdown pages. The project normalizes data from dozens of services into a unified schema, resolves duplicate entities across sources, and enables LLM-powered querying of contacts, health records, and activity logs. The developer emphasizes that data preparation—not the model—is the core challenge.", "body_md": "Someone I haven't talked to in years reaches out. I remember the name, nothing else. I open four apps to reconstruct who they are and what we worked on, lose ten minutes, and still walk into the conversation cold.\n\nWhat I actually wanted was to ask a question and get an answer. Who is this, when did we last talk, what did we work on. And then the harder ones: which people I think of as close friends haven't I had a real conversation with in three months? Was my recovery score worse the weeks my ferritin was low?\n\nYou can't ask your raw data any of that. I read Karpathy's note on building a personal LLM wiki, recognized the problem, and built mine.\n\nYour digital exhaust is spread across dozens of services, each with its own export format. Raw, it's noise: no shared schema, the same person fragmented across sources under different names, lab tests labeled differently by every provider, nothing linked to anything else. Point an LLM at that and it drowns.\n\nThe work is turning that noise into one structured layer an LLM (and plain search) can actually query. That is the whole project. The model is the easy part. Preparing the data so the model can use it is the job.\n\n\"Make it legible\" is concrete. It means: normalize every format to one schema, resolve the same entity across sources, canonicalize labels, link records to each other, and compile the result to clean markdown. 
Raw exports are noise. Structured pages have one schema, consistent labels, and cross-links, and that is what an LLM can reason over.\n\nRaw data exports in. A structured, queryable markdown wiki out.\n\n```\nsources/ -> parsers -> normalize -> dedup -> link -> compile -> wiki/\n```\n\nMine processes 55 export directories into ~2,750 wiki pages.\n\nThis is the schema each parser produces. Define it before writing any code. It is the contract every other component depends on, and it is the unit the LLM ends up reading.\n\n```python\n# Multi-valued fields are Python sets; platform_ids maps each platform to its identifiers.\n{\n    \"name\": \"Andrey Petrov\",\n    \"emails\": {\"andrey.petrov@company.com\"},\n    \"phones\": {\"+14155559876\"},\n    \"platform_ids\": {\n        \"linkedin\": {\"slug\": \"andrew-petrov\"},  # same person, different spelling\n        \"telegram\": {\"user_id\": \"814375234\"},\n    },\n    \"sources\": {\"linkedin\", \"gmail\", \"whatsapp\"},\n    \"orgs\": {\"Acme Corp\"},\n    \"email_count\": 47,\n    \"message_count\": 312,\n    \"last_contact\": \"2025-11-03\",\n}\n```\n\n`sources` tells you which exports merged into this record. `platform_ids` is how entity resolution finds matches across parsers. `email_count` and `message_count` feed the scoring formula. `last_contact` drives recency decay.\n\n**Relationship data.** Professional network export, email archive, direct messages from every platform you use, social media archives, phone contacts.\n\n**Health data.** Lab results, clinical records from your health system if it exports them, wearable sensor data, genetic raw data.\n\n**Activity data.** Calendar events, version control history, video watch history, search history, tasks, bookmarks, GPS workout routes.\n\nStart with relationship data. A professional network export plus email archive gives you ~80% of the value. Add everything else later.\n\nMine has 7,405 contacts merged into 832 person profiles and 167 organization pages. 53,836 emails. 24,809 direct messages. 1,440+ posts. 
1,934 biomarker readings across 78 tests, back to 2015.\n\nEvery step below exists for one reason: to take a pile of inconsistent exports and turn it into something an LLM can answer questions over. Dedup is the hardest of them, so it gets the most space, but it is one transform among several, not the point on its own.\n\n**1. Normalize every format to one schema.**\n\nEach export speaks its own dialect. The parser's only job is to read one format and emit the contact record above. One parser per source, a couple hundred lines each. Most of the codebase is parsers handling format variations, and that is fine: this is where raw noise becomes a uniform shape.\n\n**2. Resolve the same entity across sources (dedup). The hard one.**\n\nI didn't think this would be the hard part. I was wrong.\n\nThe same person appears differently in every export. \"Andrey\" in one messaging app. \"Andrew Petrov\" in a professional network. `andrey.petrov@company.com` in email. A phone number in another app. No source has a primary key.\n\nTwo phases.\n\n*Phase 1: Union-Find on hard identifiers.* Merge records that share an email, phone, platform user ID, or profile slug. Fast, exact, handles most matches.\n\n*Phase 2: Probabilistic matching.* For pairs that survived Phase 1 without a shared identifier, I use Splink, an open-source library for probabilistic record linkage. The comparison is Jaro-Winkler similarity on first and last name, weighted by employer overlap. A match score above 0.85 auto-merges; 0.70-0.85 goes to a review file I reconcile by hand.\n\nOne guard that took me too long to add: exclude single-word name clusters from Phase 2. \"Andrey\" alone will false-positive at scale. Leave it unmerged until a hard identifier shows up.\n\n**3. Canonicalize labels.**\n\nThe same lab test arrives under different names from different providers. I keep a canonical alias table: every incoming name routes through it before any chart renders. 40 variants mapped to 78 canonical tests. 
Without it the biomarker timeline fragments and nothing aggregates. The same idea applies to org names, titles, anything a human typed slightly differently each time.\n\n**4. Link records to each other.**\n\nA profile is more useful when it points at the orgs, people, and events around it. Cross-links are what let a query walk from a person to their company to the events you both attended. In markdown this is just wikilinks, and they are what make the graph navigable for both you and the model.\n\n**5. Compile incrementally, and cache big exports.**\n\nTwo operational lessons I added too late. Build people-only, health-only, and content-only compile targets that finish in under 10 seconds. Once the wiki passed ~1,000 pages, a 3-minute full recompile to preview a one-line edit killed the editing habit. And a large email archive reruns slowly on every compile, so build a structured cache from each big export once, then read from cache forever after.\n\nOnce contacts are merged, rank them.\n\n```\nfrequency * recency_decay * reciprocity * channel_diversity\n```\n\nInputs: message and email counts per contact (frequency), date of last interaction (recency), whether they replied back (reciprocity), number of distinct platforms where you've interacted (channel_diversity).\n\nMap percentile rank to Dunbar tiers: top 5 intimate, top 15 close, top 50 friends, top 150 acquaintances, everyone else weak. Recency half-life: 120 days.\n\nChannel diversity surprised me. A contact I email and message and occasionally see in person ranks higher than someone I email 5x as often on a single channel. That matches actual relationship depth better than raw counts.\n\nNow that the data is structured, you can actually ask it things.\n\nFor exact-term search, BM25 over the wiki directory is fast. For semantic questions, hybrid BM25 + vector search gives better results. qmd handles both and indexes a directory in one command.\n\nI browse in Obsidian. 
Standard markdown with wikilinks; backlinks and graph view make navigation fast. The point of compiling to markdown is exactly this: it is a format both a human reader and an LLM can work with directly, with no extra layer.\n\nThis is the payoff, and the reason the data prep is worth it. Each of these is a join across sources that was impossible while the data was raw.\n\n- \"My ferritin came back low again. Was my recovery score actually worse those weeks, or did I just feel that way?\" Joins lab panel data with wearable recovery scores across the same dates.\n- \"Which people I think of as close friends haven't had a real conversation with me in over three months? Not a like, an actual exchange.\" Joins contact Dunbar tier with message history across every platform.\n- \"I'm going back to a city I lived in years ago. Who do I know there that I worked with closely but haven't talked to since I left?\" Joins org history, location, and interaction recency.\n- \"Who has engaged with things I've published but has never been in direct contact with me?\" Joins post archive with contact interaction history.\n\nAnd the thing I didn't expect: ambient recall. Someone reaches out. I open their profile. Twelve years of context in 10 seconds. I go into every conversation prepared now.\n\n- Define the contact schema first. Everything else depends on it.\n- Write one parser for your richest source. Get raw noise into the schema.\n- Add a second source. Implement Phase 1 dedup on emails and phones.\n- Build the simplest compiler: one markdown file per contact, name and interaction count.\n- Add incremental compile targets before adding more sources.\n- Add Phase 2 probabilistic dedup (Splink) once you have three sources and can evaluate match quality.\n- Add health data last. Most valuable once working, but canonicalizing labels takes real effort.\n\nMy pipeline grew to ~15k lines of Python. Most of that is parsers handling format variations. 
Core dedup, scoring, and compiler logic is ~3,300 lines. One parser for one source is a couple hundred lines. Start there.\n\n*Inspired by Karpathy's framing.*", "url": "https://wpnews.pro/news/building-a-personal-wiki-from-your-own-digital-exhaust", "canonical_source": "https://gist.github.com/wiltodelta/970cf102ea05cc89d5b85653a773cb7f", "published_at": "2026-06-10 23:42:49+00:00", "updated_at": "2026-06-18 03:24:53.959127+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-products"], "entities": ["Andrey Petrov", "Karpathy", "LinkedIn", "Gmail", "WhatsApp", "Telegram", "Acme Corp"], "alternates": {"html": "https://wpnews.pro/news/building-a-personal-wiki-from-your-own-digital-exhaust", "markdown": "https://wpnews.pro/news/building-a-personal-wiki-from-your-own-digital-exhaust.md", "text": "https://wpnews.pro/news/building-a-personal-wiki-from-your-own-digital-exhaust.txt", "jsonld": "https://wpnews.pro/news/building-a-personal-wiki-from-your-own-digital-exhaust.jsonld"}}