Causal graph memory for LLMs. Flat token cost, no matter how long the session runs.

Rudi, a new system for LLM memory management, uses a causal graph of decisions to replace the growing transcript, achieving flat token costs regardless of session length. In a 43-turn software architecture session, Rudi used 5.4× fewer tokens than the standard full-transcript approach while maintaining answer quality and passing all six callback traps that tested long-term constraint adherence.

Every LLM API call re-sends the whole conversation. Cost grows every turn; eventually you hit the context limit.

Rudi replaces the growing transcript with a dependency graph of decisions and injects only the slice relevant to the current task. Turn 10,000 costs about the same as turn 10.

In a 43-turn software-architecture session (building a Notes API turn by turn), the standard "re-send the full transcript" approach was sending ~38,000 input tokens by the final turn. Rudi sent 6,782 for the same task, same model, same answer quality.

| Turn | Rudi input | Full-transcript input | Savings |
|---|---|---|---|
| 1 | 382 | 340 | n/a |
| 10 | 1,467 | 6,999 | 4.8× |
| 20 | 3,581 | 17,385 | 4.9× |
| 30 | 4,128 | 26,821 | 6.5× |
| 43 | 6,782 | 38,320 | 5.7× |

Totals across all 43 turns: 152,222 input tokens (Rudi) vs 828,369 (full transcript), 5.4× fewer tokens. The absolute gap keeps widening as the session runs, because Rudi's curve is bounded while the transcript's is linear.

These numbers are from a run with fold disabled (graph slicing alone). See below for the measured fold result. Cost of the entire 43-turn run on Claude Haiku 4.5: $0.34.

At turn 29 of a separate run, fold fired for the first time:

    turn 28: input=5,075 tokens  active nodes=24
      fold d1–d8   (8 nodes, 20 hard rules) → stub d25
      fold d9–d16  (8 nodes, 20 hard rules) → stub d26
      fold d17–d21 (5 nodes, 16 hard rules) → stub d27
    turn 29: active nodes=6  (dropped 24 → 6)
    turn 30: input=2,865 tokens  ← down 44% from turn 28

21 live nodes compressed into 3 stubs. 56 hard rules preserved verbatim. Input tokens nearly halved mid-session, automatically. That's the sawtooth: the graph gets smaller as the conversation gets longer.

Cheap context is worthless if the model forgets the rules. So the same benchmark plants 6 callback traps late in the session and checks whether decisions made dozens of turns earlier are still honored.
| # | Turn | Trap | Result |
|---|---|---|---|
| 1 | 38 | Add logout: must use the exact auth mechanism chosen on turn 1 | ✅ |
| 2 | 39 | Profile endpoint: must scope via turn-1 auth and turn-2 DB | ✅ |
| 3 | 40 | Admin CSV export: a rule that was folded away banned cross-user data | ✅ surfaced |
| 4 | 41 | Email full notes: a folded rule banned note contents in email | ✅ surfaced |
| 5 | 42 | "Store the token in localStorage": conflicts with turn-1 hard rule | ✅ blocked |
| 6 | 43 | "Permanently delete a note": turn-11 chose soft-delete | ✅ flagged |

6 / 6. First benchmark run (fold disabled, slicing only).

The two that matter most are 3 and 4: those rules had been compressed out of the active context by the time the trap was sprung, and the model still caught them, because hard rules are preserved verbatim on the fold stub. That's the whole thesis: forget the prose, keep the constraints.

Every model response is parsed into decision nodes, each linked backward to the decisions it depends on:

    node = {
      id, text,
      depends_on: ...,         # backward edges: what this decision rests on
      hard_rules: ...,         # binding constraints; the worker must halt if violated
      revises, exception_to,   # full replacement vs. narrow carve-out
      status, turn, pinned
    }

Slice, don't dump. Before each turn, Rudi injects only the nodes reachable from the current task, not the transcript.

Fold. When a branch of decisions goes reachability-dead, a background pass compresses it into a one-line stub. Hard rules survive the fold verbatim, so a constraint can never be silently lost (see traps 3 and 4).

Pin foundations. Decisions that are reinforced repeatedly, made in the first two turns, or carry exceptions are pinned and never folded.

Hard rules are binding. If a new task would violate one, the worker stops and asks instead of silently complying (traps 5 and 6).
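The slice and fold steps described above can be sketched in a few lines of Python. This is an illustrative toy, not Rudi's implementation: `DecisionNode`, `reachable_slice`, and `fold` are hypothetical names, the node texts and rule wording are made up for the demo, and two behaviors are assumptions rather than documented facts: the caller picks the seed nodes for a task, and a later decision that touches a folded area links to the stub.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionNode:
    # Hypothetical, trimmed-down node; the real schema has more fields.
    id: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # backward edges
    hard_rules: list[str] = field(default_factory=list)  # binding constraints
    pinned: bool = False                                  # pinned nodes are never folded


def reachable_slice(graph: dict[str, DecisionNode], seeds: list[str]) -> set[str]:
    """Collect every node id reachable from `seeds` by following depends_on edges."""
    seen: set[str] = set()
    stack = list(seeds)
    while stack:
        node_id = stack.pop()
        if node_id in seen or node_id not in graph:
            continue
        seen.add(node_id)
        stack.extend(graph[node_id].depends_on)
    return seen


def fold(graph: dict[str, DecisionNode], dead_ids: list[str], stub_id: str) -> DecisionNode:
    """Compress the unpinned nodes in `dead_ids` into one stub, keeping hard rules verbatim."""
    folded = [i for i in dead_ids if i in graph and not graph[i].pinned]
    folded_set = set(folded)
    rules: list[str] = []
    external_deps: list[str] = []
    for node_id in folded:
        node = graph.pop(node_id)
        for rule in node.hard_rules:  # copy each rule text unchanged, once
            if rule not in rules:
                rules.append(rule)
        for dep in node.depends_on:  # the stub inherits edges that leave the folded set
            if dep not in folded_set and dep not in external_deps:
                external_deps.append(dep)
    stub = DecisionNode(stub_id, "folded: " + ", ".join(folded), external_deps, rules)
    graph[stub_id] = stub
    # Surviving nodes that pointed into the folded set now point at the stub.
    for node in graph.values():
        if folded_set & set(node.depends_on):
            rewired = [stub_id if d in folded_set else d for d in node.depends_on]
            node.depends_on = list(dict.fromkeys(rewired))
    return stub


# Demo graph with illustrative decisions (assumed wording).
nodes = [
    DecisionNode("d1", "Choose the auth mechanism",
                 hard_rules=["Never store the token in localStorage"], pinned=True),
    DecisionNode("d2", "Scope every notes query to the signed-in user", ["d1"],
                 ["No cross-user data in any response"]),
    DecisionNode("d3", "Paginate the notes list", ["d2"]),
    DecisionNode("d4", "Add logout", ["d1"]),
]
graph = {n.id: n for n in nodes}

live = reachable_slice(graph, ["d4"])        # slice for the logout task
dead = [i for i in graph if i not in live]   # nothing in this task reaches these
stub = fold(graph, dead, "d5")

# A later decision that touches the folded area links to the stub (assumption).
graph["d6"] = DecisionNode("d6", "Admin CSV export", ["d5"])
later = reachable_slice(graph, ["d6"])
rules_in_scope = [r for i in sorted(later) for r in graph[i].hard_rules]
print(sorted(later), rules_in_scope)
```

The design point the sketch tries to capture is that folding drops node text but never rule text: after the fold, the export task's slice no longer contains `d2` or `d3`, yet the cross-user rule still travels with the stub.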
    git clone https://github.com/<you>/rudi
    cd rudi
    pip install anthropic flask flask-cors

    # your own key; never hardcode it
    export ANTHROPIC_API_KEY="sk-ant-..."

    # run the 43-turn benchmark against local SQLite (no cloud needed)
    python benchmark_long_haiku.py

You'll watch the input-token curve stay flat while a naive transcript would balloon, and see all 6 callback traps resolve.

Two calls per turn. You keep your own model key; Rudi only manages the graph.

    import rudi

    # 1. before your LLM call: fetch the relevant slice
    s = rudi.get_slice(task)
    # → feed s["system"] + s["prompt"] into YOUR LLM call

    # 2. after your LLM call: store what was decided
    rudi.store_decisions(decisions, inject_ids=s["inject_ids"])

Or let Rudi drive the whole turn (LLM call + store + fold) in one shot:

    result = rudi.run_turn(task)
    # → {"display", "tokens_in", "tokens_out", ...}

Storage is local SQLite (store.py), one row per decision node. No server, no cloud, no setup.

| Claim | Status |
|---|---|
| Graph slicing bounds the token curve | ✅ measured (table above) |
| Decisions recalled 40+ turns later | ✅ 6/6 callbacks |
| Hard rules survive fold verbatim | ✅ traps 3 and 4 |
| Conflicts blocked, not silently obeyed | ✅ traps 5 and 6 |
| Fold GC compresses dead branches mid-session | ✅ measured: 24 nodes → 6, input −44% at turn 30 |
| Retrieval fallback above ~80 active nodes | ⏳ built, not yet benchmarked at scale |

No vapor. The table is what the logs say; the in-progress rows are labeled as such.

Business Source License 1.1. Free for personal use, research, development, and self-hosting. Commercial SaaS or managed hosting use requires a paid license from the maintainer.

Want a commercial license? Open an issue or email raphaelwkago@gmail.com.