# Building a personal wiki from your own digital exhaust

> Source: <https://gist.github.com/wiltodelta/970cf102ea05cc89d5b85653a773cb7f>
> Published: 2026-06-10 23:42:49+00:00

Someone I haven't talked to in years reaches out. I remember the name, nothing else. I open four apps to reconstruct who they are and what we worked on, lose ten minutes, and still walk into the conversation cold.

What I actually wanted was to ask a question and get an answer. Who is this, when did we last talk, what did we work on. And then the harder ones: which of the people I think of as close friends have I not had a real conversation with in three months? Was my recovery score worse the weeks my ferritin was low?

You can't ask your raw data any of that. I read Karpathy's note on building a personal LLM wiki, recognized the problem, and built mine.

Your digital exhaust is spread across dozens of services, each with its own export format. Raw, it's noise: no shared schema, the same person fragmented across sources under different names, lab tests labeled differently by every provider, nothing linked to anything else. Point an LLM at that and it drowns.

The work is turning that noise into one structured layer an LLM (and plain search) can actually query. That is the whole project. The model is the easy part. Preparing the data so the model can use it is the job.

"Make it legible" is concrete. It means: normalize every format to one schema, resolve the same entity across sources, canonicalize labels, link records to each other, and compile the result to clean markdown. Raw exports are noise. Structured pages have one schema, consistent labels, and cross-links, and that is what an LLM can reason over.

Raw data exports in. A structured, queryable markdown wiki out.

```
sources/ -> parsers -> normalize -> dedup -> link -> compile -> wiki/
```

Mine processes 55 export directories into ~2,750 wiki pages.

This is the schema each parser produces. Define it before writing any code. It is the contract every other component depends on, and it is the unit the LLM ends up reading.

```python
{
    "name": "Andrey Petrov",
    "emails": {"andrey.petrov@company.com"},
    "phones": {"+14155559876"},
    "platform_ids": {
        "linkedin": {"slug": "andrew-petrov"},
        "telegram": {"user_id": "814375234"},
    },
    "sources": {"linkedin", "gmail", "whatsapp"},
    "orgs": {"Acme Corp"},
    "email_count": 47,
    "message_count": 312,
    "last_contact": "2025-11-03",
}
```

`sources` tells you which exports merged into this record. `platform_ids` is how entity resolution finds matches across parsers. `email_count` and `message_count` feed the scoring formula. `last_contact` drives recency decay.
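One way to pin that contract down in code is a dataclass. This is a minimal sketch, not the pipeline's actual code: the class name `ContactRecord` and the `merge` policy (keep the longer name, sum the counts, keep the latest date) are assumptions, while the field names mirror the record above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContactRecord:
    """Hypothetical container mirroring the contact schema."""
    name: str
    emails: set = field(default_factory=set)
    phones: set = field(default_factory=set)
    # platform -> {id kind -> value}, e.g. {"linkedin": {"slug": "..."}}
    platform_ids: dict = field(default_factory=dict)
    sources: set = field(default_factory=set)
    orgs: set = field(default_factory=set)
    email_count: int = 0
    message_count: int = 0
    last_contact: Optional[str] = None  # ISO date string, "YYYY-MM-DD"

    def merge(self, other: "ContactRecord") -> None:
        """Fold another record for the same person into this one (assumed policy)."""
        # Assumption: the longer name is the more complete one.
        if len(other.name) > len(self.name):
            self.name = other.name
        self.emails |= other.emails
        self.phones |= other.phones
        self.sources |= other.sources
        self.orgs |= other.orgs
        for platform, ids in other.platform_ids.items():
            self.platform_ids.setdefault(platform, {}).update(ids)
        self.email_count += other.email_count
        self.message_count += other.message_count
        # ISO dates sort correctly as plain strings.
        dates = [d for d in (self.last_contact, other.last_contact) if d]
        self.last_contact = max(dates) if dates else None


a = ContactRecord(name="Andrey", phones={"+14155559876"},
                  sources={"whatsapp"}, message_count=312,
                  last_contact="2025-11-03")
b = ContactRecord(name="Andrey Petrov", emails={"andrey.petrov@company.com"},
                  sources={"gmail"}, email_count=47,
                  last_contact="2025-10-01")
a.merge(b)
```

After the merge, `a` carries both sources, both identifiers, and the later of the two contact dates.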

**Relationship data.** Professional network export, email archive, direct messages from every platform you use, social media archives, phone contacts.

**Health data.** Lab results, clinical records from your health system if it exports them, wearable sensor data, genetic raw data.

**Activity data.** Calendar events, version control history, video watch history, search history, tasks, bookmarks, GPS workout routes.

Start with relationship data. A professional network export plus email archive gives you ~80% of the value. Add everything else later.

Mine has 7,405 contacts merged into 832 person profiles and 167 organization pages. 53,836 emails. 24,809 direct messages. 1,440+ posts. 1,934 biomarker readings across 78 tests, back to 2015.

Every step below exists for one reason: to take a pile of inconsistent exports and turn it into something an LLM can answer questions over. Dedup is the hardest of them, so it gets the most space, but it is one transform among several, not the point on its own.

**1. Normalize every format to one schema.**

Each export speaks its own dialect. The parser's only job is to read one format and emit the contact record above. One parser per source, a couple hundred lines each. Most of the codebase is parsers handling format variations, and that is fine: this is where raw noise becomes a uniform shape.
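To make the parser's job concrete, here is a minimal sketch for a CSV contact export. The column names (`First Name`, `Last Name`, `Email Address`, `Company`) and the function name `parse_contacts_csv` are hypothetical; a real export will use its own headers, and that is exactly the variation each parser absorbs.

```python
import csv
import io


def parse_contacts_csv(text, source):
    """Read one (hypothetical) CSV export and emit schema-shaped records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        name = " ".join(
            part.strip()
            for part in (row.get("First Name"), row.get("Last Name"))
            if part and part.strip()
        )
        email = (row.get("Email Address") or "").strip().lower()
        org = (row.get("Company") or "").strip()
        if not name and not email:
            continue  # nothing to identify this row by
        records.append({
            "name": name,
            "emails": {email} if email else set(),
            "phones": set(),
            "platform_ids": {},
            "sources": {source},
            "orgs": {org} if org else set(),
            "email_count": 0,
            "message_count": 0,
            "last_contact": None,
        })
    return records


sample = (
    "First Name,Last Name,Email Address,Company\n"
    "Andrew,Petrov,Andrey.Petrov@company.com,Acme Corp\n"
    ",,,\n"
)
parsed = parse_contacts_csv(sample, source="linkedin")
```

Lowercasing emails at parse time matters later: exact-match dedup only works if every parser normalizes identifiers the same way.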

**2. Resolve the same entity across sources (dedup). The hard one.**

I didn't think this would be the hard part. I was wrong.

The same person appears differently in every export. "Andrey" in one messaging app. "Andrew Petrov" in a professional network. `andrey.petrov@company.com` in email. A phone number in another app. No source has a primary key.

Two phases.

*Phase 1: Union-Find on hard identifiers.* Merge records that share an email, phone, platform user ID, or profile slug. Fast, exact, handles most matches.
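A minimal sketch of Phase 1, assuming records shaped like the schema above. The helper names (`UnionFind`, `hard_keys`, `cluster_by_hard_ids`) are illustrative, not taken from the original pipeline. Each identifier is namespaced by kind so a phone number can never collide with a platform ID.

```python
from collections import defaultdict


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def hard_keys(record):
    """Yield every exact identifier a record carries, namespaced by kind."""
    for email in record.get("emails", ()):
        yield ("email", email.lower())
    for phone in record.get("phones", ()):
        yield ("phone", phone)
    for platform, ids in record.get("platform_ids", {}).items():
        for kind, value in ids.items():
            yield (platform, kind, value)


def cluster_by_hard_ids(records):
    """Return lists of record indexes that share at least one identifier chain."""
    uf = UnionFind(len(records))
    first_seen = {}  # identifier -> index of the first record that had it
    for i, rec in enumerate(records):
        for key in hard_keys(rec):
            if key in first_seen:
                uf.union(first_seen[key], i)
            else:
                first_seen[key] = i
    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[uf.find(i)].append(i)
    return list(clusters.values())


records = [
    {"name": "Andrey", "phones": {"+14155559876"}},
    {"name": "Andrew Petrov", "emails": {"andrey.petrov@company.com"},
     "platform_ids": {"linkedin": {"slug": "andrew-petrov"}}},
    {"name": "Andrey Petrov", "emails": {"Andrey.Petrov@company.com"},
     "phones": {"+14155559876"}},
    {"name": "Someone Else", "emails": {"else@example.com"}},
]
clusters = cluster_by_hard_ids(records)
```

Records 0 and 1 share nothing directly, but record 2 bridges them through the phone and the email, so all three land in one cluster. That transitivity is the reason to use Union-Find rather than pairwise comparison.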

*Phase 2: Probabilistic matching.* For records that Phase 1 left unmerged because they share no hard identifier, I use Splink, an open-source library for probabilistic record linkage. It compares first and last names with Jaro-Winkler similarity, weighted by employer overlap. A score of 0.85 or higher auto-merges; 0.70-0.85 goes to a review file I reconcile by hand.

One guard that took me too long to add: exclude single-word name clusters from Phase 2. "Andrey" alone will false-positive at scale. Leave it unmerged until a hard identifier shows up.
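The thresholds and the single-word guard can be sketched as a small routing function. This deliberately leaves out Splink itself: `match_probability` stands in for whatever score the linkage model produces, and the function names are hypothetical. Treating exactly 0.85 as an auto-merge is an assumption.

```python
AUTO_MERGE = 0.85  # thresholds from the text
REVIEW = 0.70


def is_single_word(name):
    return len(name.split()) < 2


def route_pair(name_a, name_b, match_probability):
    """Return "merge", "review", or "skip" for one candidate pair."""
    # Guard: a bare first name is too ambiguous to match probabilistically.
    if is_single_word(name_a) or is_single_word(name_b):
        return "skip"
    if match_probability >= AUTO_MERGE:
        return "merge"
    if match_probability >= REVIEW:
        return "review"
    return "skip"


decisions = [
    route_pair("Andrey", "Andrey Petrov", 0.99),         # guard wins over score
    route_pair("Andrew Petrov", "Andrey Petrov", 0.90),
    route_pair("Andrew Petrov", "Andrei Petrow", 0.75),
    route_pair("Andrew Petrov", "Anna Petrova", 0.40),
]
```

Running the guard before the threshold check means a high score can never override it.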

**3. Canonicalize labels.**

The same lab test arrives under different names from different providers. I keep a canonical alias table: every incoming name routes through it before any chart renders. 40 variants mapped to 78 canonical tests. Without it the biomarker timeline fragments and nothing aggregates. The same idea applies to org names, titles, anything a human typed slightly differently each time.
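A minimal sketch of such an alias table. The entries below are made-up examples, not the real table, and `canonical_test` is a hypothetical name. Unknown labels are collected for review instead of being guessed at.

```python
import re
from typing import Optional

# Hypothetical alias table: normalized variant -> canonical test name.
ALIASES = {
    "ferritin": "Ferritin",
    "ferritin serum": "Ferritin",
    "serum ferritin": "Ferritin",
    "vitamin d 25 hydroxy": "Vitamin D (25-OH)",
    "25 oh vitamin d": "Vitamin D (25-OH)",
}


def normalize_label(raw: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()


def canonical_test(raw: str, unmapped: set) -> Optional[str]:
    """Look up the canonical name; record unknown labels for manual mapping."""
    canonical = ALIASES.get(normalize_label(raw))
    if canonical is None:
        unmapped.add(raw)
    return canonical


unmapped = set()
names = [canonical_test(label, unmapped)
         for label in ("Ferritin, Serum", "25-OH Vitamin D", "Mystery Panel")]
```

Normalizing before lookup keeps the table small: case and punctuation differences never need their own entries.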

**4. Link records to each other.**

A profile is more useful when it points at the orgs, people, and events around it. Cross-links are what let a query walk from a person to their company to the events you both attended. In markdown this is just wikilinks, and they are what make the graph navigable for both you and the model.
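As an illustration, a page renderer that emits wikilinks might look like the sketch below. The page layout, the function names, and the related person "Maria Ivanova" are all hypothetical; only the `[[...]]` link syntax comes from the wikilink convention itself.

```python
def wikilink(title):
    return f"[[{title}]]"


def render_person_page(record, related_people):
    """Render one contact record as a markdown page with cross-links."""
    lines = [f"# {record['name']}", ""]
    if record.get("orgs"):
        orgs = ", ".join(wikilink(o) for o in sorted(record["orgs"]))
        lines.append(f"Orgs: {orgs}")
    if related_people:
        people = ", ".join(wikilink(p) for p in related_people)
        lines.append(f"Related: {people}")
    if record.get("last_contact"):
        lines.append(f"Last contact: {record['last_contact']}")
    return "\n".join(lines) + "\n"


page = render_person_page(
    {"name": "Andrey Petrov", "orgs": {"Acme Corp"}, "last_contact": "2025-11-03"},
    related_people=["Maria Ivanova"],
)
```

Sorting the org names keeps the output deterministic, so an unchanged record produces an unchanged page and diffs stay clean.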

**5. Compile incrementally, and cache big exports.**

Two operational lessons I learned too late. First, build people-only, health-only, and content-only compile targets that finish in under 10 seconds. Once the wiki passed ~1,000 pages, a 3-minute full recompile to preview a one-line edit killed the editing habit. Second, a large email archive is slow to reparse on every compile, so build a structured cache from each big export once, then read from cache forever after.
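A minimal sketch of the caching idea, assuming a JSON cache keyed by the export file's size and modification time. The function name `load_with_cache`, the cache layout, and the `parse` callback are hypothetical. Invalidating on a changed fingerprint is a design choice added here so that a fresh export is picked up automatically.

```python
import json
from pathlib import Path


def load_with_cache(export_path, cache_dir, parse):
    """Parse `export_path` once; reuse the JSON cache until the file changes.

    `parse` must return JSON-serializable data (convert sets to lists first).
    """
    stat = export_path.stat()
    fingerprint = f"{stat.st_size}-{stat.st_mtime_ns}"
    cache_file = cache_dir / f"{export_path.name}.json"
    if cache_file.exists():
        cached = json.loads(cache_file.read_text(encoding="utf-8"))
        if cached.get("fingerprint") == fingerprint:
            return cached["records"]  # cache hit: skip the slow parse
    records = parse(export_path)  # the slow step
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(
        json.dumps({"fingerprint": fingerprint, "records": records}),
        encoding="utf-8",
    )
    return records
```

The first call pays the full parsing cost; every later compile reads the cached JSON until the export file itself is replaced.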

Once contacts are merged, rank them.

```
frequency * recency_decay * reciprocity * channel_diversity
```

Inputs: message and email counts per contact (frequency), date of last interaction (recency), whether they replied back (reciprocity), number of distinct platforms where you've interacted (channel_diversity).

Map each contact's rank by score to Dunbar tiers: top 5 intimate, top 15 close, top 50 friends, top 150 acquaintances, everyone else weak. Recency half-life: 120 days.

Channel diversity surprised me. A contact I email and message and occasionally see in person ranks higher than someone I email 5x as often on a single channel. That matches actual relationship depth better than raw counts.
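The formula and tiers can be sketched as follows. Only the four-factor product, the 120-day half-life, and the tier cutoffs come from the text. The exact shape of each factor is an assumption made for illustration: log-damped frequency, a 0.5 penalty for one-way contact, and a plain channel count.

```python
import math
from datetime import date

HALF_LIFE_DAYS = 120  # from the text


def recency_decay(last_contact, today):
    """Halve the weight for every HALF_LIFE_DAYS since the last interaction."""
    days = max((today - last_contact).days, 0)
    return 0.5 ** (days / HALF_LIFE_DAYS)


def score(interactions, last_contact, replied, channels, today):
    frequency = math.log1p(interactions)   # assumption: dampen raw counts
    reciprocity = 1.0 if replied else 0.5  # assumption: one-way counts half
    return frequency * recency_decay(last_contact, today) * reciprocity * channels


TIERS = [(5, "intimate"), (15, "close"), (50, "friends"), (150, "acquaintances")]


def tier_for_rank(rank):
    """`rank` is the 1-based position after sorting by score, highest first."""
    for cutoff, label in TIERS:
        if rank <= cutoff:
            return label
    return "weak"


today = date(2025, 5, 1)
recent = date(2025, 4, 1)
multi_channel = score(100, recent, True, channels=3, today=today)
single_channel = score(500, recent, True, channels=1, today=today)
```

Under these assumed factor shapes, the three-channel contact outranks the single-channel one despite a fifth of the volume, which is the effect described above. A different damping function would shift the balance.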

Now that the data is structured, you can actually ask it things.

For exact-term search, BM25 over the wiki directory is fast. For semantic questions, hybrid BM25 + vector search gives better results. qmd handles both and indexes a directory in one command.
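To show what BM25 ranking does, here is a small self-contained sketch over an in-memory set of pages. It illustrates the scoring function only and says nothing about how qmd works internally. The page paths and contents are made up, and `k1=1.5`, `b=0.75` are common default parameters rather than values from the original.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """docs: {path: text}. Returns [(path, score)] sorted best-first."""
    tokenized = {path: tokenize(text) for path, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / max(n_docs, 1)
    doc_freq = Counter()  # number of documents containing each term
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    results = []
    for path, tokens in tokenized.items():
        counts = Counter(tokens)
        total = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalized via `b`.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            total += idf * tf * (k1 + 1) / norm
        if total > 0:
            results.append((path, total))
    return sorted(results, key=lambda r: r[1], reverse=True)


docs = {
    "people/andrey-petrov.md": "Andrey Petrov. Worked together at Acme Corp.",
    "orgs/acme-corp.md": "Acme Corp. People: Andrey Petrov.",
    "health/ferritin.md": "Ferritin readings by date from lab panels.",
}
hits = bm25_rank("ferritin", docs)
```

Exact-term search like this finds "ferritin" but not a paraphrase such as "iron stores", which is the gap the vector half of hybrid search covers.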

I browse in Obsidian. Standard markdown with wikilinks; backlinks and graph view make navigation fast. The point of compiling to markdown is exactly this: it is a format both a human reader and an LLM can work with directly, with no extra layer.

This is the payoff, and the reason the data prep is worth it. Each of these is a join across sources that was impossible while the data was raw.

- "My ferritin came back low again. Was my recovery score actually worse those weeks, or did I just feel that way?" Joins lab panel data with wearable recovery scores across the same dates.
- "Which people I think of as close friends haven't had a real conversation with me in over three months? Not a like, an actual exchange." Joins contact Dunbar tier with message history across every platform.
- "I'm going back to a city I lived in years ago. Who do I know there that I worked with closely but haven't talked to since I left?" Joins org history, location, and interaction recency.
- "Who has engaged with things I've published but has never been in direct contact with me?" Joins post archive with contact interaction history.
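Once records are structured, a question like the close-friends one reduces to a filter. A minimal sketch, assuming each merged contact already carries a `tier` label from the ranking step and a `last_contact` ISO date. The function name and the sample contacts are hypothetical.

```python
from datetime import date, timedelta


def quiet_close_friends(contacts, today, days=90):
    """Names of intimate/close contacts with no interaction in `days` days."""
    cutoff = today - timedelta(days=days)
    return sorted(
        c["name"]
        for c in contacts
        if c["tier"] in {"intimate", "close"}
        and date.fromisoformat(c["last_contact"]) < cutoff
    )


contacts = [
    {"name": "Andrey Petrov", "tier": "close", "last_contact": "2025-01-10"},
    {"name": "Maria Ivanova", "tier": "close", "last_contact": "2025-05-20"},
    {"name": "Old Colleague", "tier": "weak", "last_contact": "2019-03-02"},
]
overdue = quiet_close_friends(contacts, today=date(2025, 6, 1))
```

The filter is trivial; what makes it possible is that `last_contact` already reflects every platform after dedup, so a recent message on any channel counts.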

And the thing I didn't expect: ambient recall. Someone reaches out. I open their profile. Twelve years of context in 10 seconds. I go into every conversation prepared now.

- Define the contact schema first. Everything else depends on it.
- Write one parser for your richest source. Get raw noise into the schema.
- Add a second source. Implement Phase 1 dedup on emails and phones.
- Build the simplest compiler: one markdown file per contact, name and interaction count.
- Add incremental compile targets before adding more sources.
- Add Phase 2 probabilistic dedup (Splink) once you have three sources and can evaluate match quality.
- Add health data last. Most valuable once working, but canonicalizing labels takes real effort.

My pipeline grew to ~15k lines of Python. Most of that is parsers handling format variations. Core dedup, scoring, and compiler logic is ~3,300 lines. One parser for one source is a couple hundred lines. Start there.

*Inspired by Karpathy's framing.*