{"slug": "causal-graph-memory-for-llms-flat-token-cost-no-matter-how-the-session-runs", "title": "Causal graph memory for LLMs. Flat token cost, no matter how long the session runs", "summary": "Rudi, a new system for LLM memory management, uses a causal graph of decisions to replace the growing transcript, achieving flat token costs regardless of session length. In a 43-turn software architecture session, Rudi used 5.4× fewer tokens than the standard full-transcript approach while maintaining answer quality and passing all six callback traps that tested long-term constraint adherence.", "body_md": "**Causal graph memory for LLMs. Flat token cost, no matter how long the session runs.**\n\nEvery LLM API call re-sends the whole conversation. Cost grows every turn; eventually you hit the context limit. Rudi replaces the growing transcript with a **dependency graph of decisions** — and injects only the slice relevant to the current task. Turn 10,000 costs about the same as turn 10.\n\nIn a **43-turn** software-architecture session (building a Notes API turn by turn), the standard \"re-send the full transcript\" approach was sending **~38,000 input tokens** by the final turn. Rudi sent **6,782** — for the *same task, same model, same answer quality.*\n\n| Turn | Rudi input | Full-transcript input | Savings |\n|---|---|---|---|\n| 1 | 382 | 340 | — |\n| 10 | 1,467 | 6,999 | 4.8× |\n| 20 | 3,581 | 17,385 | 4.9× |\n| 30 | 4,128 | 26,821 | 6.5× |\n| 43 | 6,782 | 38,320 | 5.7× |\n\n**Totals across all 43 turns:** 152,222 input tokens (Rudi) vs 828,369 (full transcript) — **5.4× fewer tokens**. The absolute gap widens as the session grows (among the sampled turns the per-turn ratio peaks at 6.5× at turn 30), because Rudi's curve is bounded while the transcript's grows linearly.\n\nThese numbers are from a run with fold disabled — graph slicing alone. 
See below for the measured fold result.\n\nCost of the entire 43-turn run on Claude Haiku 4.5: **$0.34.**\n\nAt turn 29 of a separate run, fold fired for the first time:\n\n```\nturn 28: input=5,075 tokens   active nodes=24\n[fold] d1–d8   (8 nodes, 20 hard rules) → stub d25\n[fold] d9–d16  (8 nodes, 20 hard rules) → stub d26\n[fold] d17–d21 (5 nodes, 16 hard rules) → stub d27\nturn 29: active nodes=6   (dropped 24 → 6)\nturn 30: input=2,865 tokens   ← down 44% from turn 28\n```\n\n21 live nodes compressed into 3 stubs. **56 hard rules preserved verbatim.** Input tokens nearly halved mid-session, automatically. That's the sawtooth: the graph gets *smaller* as the conversation gets *longer*.\n\nCheap context is worthless if the model forgets the rules. So the same benchmark plants **6 callback traps** late in the session and checks whether decisions made dozens of turns earlier are still honored.\n\n| # | Turn | Trap | Result |\n|---|---|---|---|\n| 1 | 38 | Add logout — must use the exact auth mechanism chosen on turn 1 | ✅ |\n| 2 | 39 | Profile endpoint — must scope via turn-1 auth and turn-2 DB | ✅ |\n| 3 | 40 | Admin CSV export — a rule that was folded away banned cross-user data | ✅ surfaced |\n| 4 | 41 | Email full notes — a folded rule banned note contents in email | ✅ surfaced |\n| 5 | 42 | \"Store the token in localStorage\" — conflicts with turn-1 hard rule | ✅ blocked |\n| 6 | 43 | \"Permanently delete a note\" — turn-11 chose soft-delete | ✅ flagged |\n\n**6 / 6.** *(First benchmark run — fold disabled, slicing only.)* The two that matter most are #3 and #4: those rules had been **compressed out of the active context** by the time the trap was sprung — and the model still caught them, because hard rules are preserved verbatim on the fold stub. 
That's the whole thesis: *forget the prose, keep the constraints.*\n\nEvery model response is parsed into **decision nodes**, each linked backward to the decisions it depends on:\n\n```\nnode = {\n  id, text,\n  depends_on: [...],     # backward edges — what this decision rests on\n  hard_rules: [...],     # binding constraints; the worker must halt if violated\n  revises, exception_to, # full replacement vs. narrow carve-out\n  status, turn, pinned\n}\n```\n\n- **Slice, don't dump.** Before each turn, Rudi injects only the nodes reachable from the current task — not the transcript.\n- **Fold.** When a branch of decisions goes reachability-dead, a background pass compresses it into a one-line stub.\n- **Hard rules survive the fold verbatim**, so a constraint can never be silently lost (see traps #3/#4).\n- **Pin foundations.** Decisions that are reinforced repeatedly, made in the first two turns, or carry exceptions are pinned and never folded.\n- **Hard rules are binding.** If a new task would violate one, the worker stops and asks instead of silently complying (traps #5/#6).\n\n```\ngit clone https://github.com/<you>/rudi\ncd rudi\npip install anthropic flask flask-cors\n\n# your own key — never hardcode it\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"\n\n# run the 43-turn benchmark against local SQLite (no cloud needed)\npython benchmark_long_haiku.py\n```\n\nYou'll watch the input-token curve stay flat while a naive transcript would balloon, and see all 6 callback traps resolve.\n\nTwo calls per turn. 
You keep your own model key; Rudi only manages the graph.\n\n```python\nimport rudi\n\n# 1 — before your LLM call: fetch the relevant slice\ns = rudi.get_slice(task)\n#   → feed s[\"system\"] + s[\"prompt\"] into YOUR LLM call\n\n# 2 — after your LLM call: store what was decided\nrudi.store_decisions(decisions, inject_ids=s[\"inject_ids\"])\n```\n\nOr let Rudi drive the whole turn (LLM call + store + fold) in one shot:\n\n```\nresult = rudi.run_turn(task)   # → {\"display\", \"tokens_in\", \"tokens_out\", ...}\n```\n\nStorage is local SQLite (`store.py`) — one row per decision node. No server, no cloud, no setup.\n\n| Claim | Status |\n|---|---|\n| Graph slicing bounds the token curve | ✅ measured — table above |\n| Decisions recalled 40+ turns later | ✅ 6/6 callbacks |\n| Hard rules survive fold verbatim | ✅ traps #3/#4 |\n| Conflicts blocked, not silently obeyed | ✅ traps #5/#6 |\n| Fold GC compresses dead branches mid-session | ✅ measured — 24 nodes → 6, input −44% at turn 30 |\n| Retrieval fallback above ~80 active nodes | ⏳ built, not yet benchmarked at scale |\n\nNo vapor. The table is what the logs say; the in-progress rows are labeled as such.\n\n**Business Source License 1.1.** Free for personal use, research, development, and self-hosting. Commercial SaaS or managed hosting use requires a paid license from the maintainer.\n\nWant a commercial license? Open an issue or email **raphaelwkago@gmail.com**.", "url": "https://wpnews.pro/news/causal-graph-memory-for-llms-flat-token-cost-no-matter-how-the-session-runs", "canonical_source": "https://github.com/raphaelwkago-sketch/rudi", "published_at": "2026-06-19 12:54:12+00:00", "updated_at": "2026-06-19 13:07:54.713235+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools"], "entities": ["Rudi", "Claude Haiku 4.5", "Notes API"], "alternates": {"html": "https://wpnews.pro/news/causal-graph-memory-for-llms-flat-token-cost-no-matter-how-the-session-runs", "markdown": "https://wpnews.pro/news/causal-graph-memory-for-llms-flat-token-cost-no-matter-how-the-session-runs.md", "text": "https://wpnews.pro/news/causal-graph-memory-for-llms-flat-token-cost-no-matter-how-the-session-runs.txt", "jsonld": "https://wpnews.pro/news/causal-graph-memory-for-llms-flat-token-cost-no-matter-how-the-session-runs.jsonld"}}