Causal graph memory for LLMs. Flat token cost, no matter how long the session runs.
Every LLM API call re-sends the whole conversation. Cost grows every turn; eventually you hit the context limit. Rudi replaces the growing transcript with a dependency graph of decisions and injects only the slice relevant to the current task. Turn 10,000 costs about the same as turn 10.
In a 43-turn software-architecture session (building a Notes API turn by turn), the standard "re-send the full transcript" approach was sending ~38,000 input tokens by the final turn. Rudi sent 6,782: same task, same model, same answer quality.
| Turn | Rudi input | Full-transcript input | Savings |
|---|---|---|---|
| 1 | 382 | 340 | n/a |
| 10 | 1,467 | 6,999 | 4.8× |
| 20 | 3,581 | 17,385 | 4.9× |
| 30 | 4,128 | 26,821 | 6.5× |
| 43 | 6,782 | 38,320 | 5.7× |
Totals across all 43 turns: 152,222 input tokens (Rudi) vs 828,369 (full transcript), which is 5.4× fewer tokens. The absolute gap keeps widening as the session grows, because Rudi's curve is bounded while the transcript's is linear.
These numbers are from a run with fold disabled, so they reflect graph slicing alone. See below for the measured fold result.
Cost of the entire 43-turn run on Claude Haiku 4.5: $0.34.
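The savings column and the 5.4× total can be re-derived from the figures quoted above. The snippet below is only an arithmetic check on those published numbers; the variable names are illustrative and nothing here comes from Rudi's own code.

```python
# Input tokens copied from the table above: turn -> (rudi, full_transcript)
rows = {
    1: (382, 340),
    10: (1_467, 6_999),
    20: (3_581, 17_385),
    30: (4_128, 26_821),
    43: (6_782, 38_320),
}

def ratio(rudi: int, full: int) -> float:
    """How many times more input tokens the full transcript sends than Rudi."""
    return full / rudi

# Per-turn savings factor and absolute token gap.
per_turn = {turn: round(ratio(r, f), 1) for turn, (r, f) in rows.items()}
abs_gap = {turn: f - r for turn, (r, f) in rows.items()}

# Session totals quoted in the text.
total_rudi, total_full = 152_222, 828_369
total_ratio = round(ratio(total_rudi, total_full), 1)

for turn in rows:
    print(f"turn {turn:>2}: {per_turn[turn]}x, gap {abs_gap[turn]:,} tokens")
print(f"all turns: {total_ratio}x fewer input tokens")
```

Note that the ratio is not monotonic (6.5× at turn 30, 5.7× at turn 43), while the absolute gap grows at every sampled turn.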
At turn 29 of a separate run, fold fired for the first time:
turn 28: input=5,075 tokens active nodes=24
[fold] d1-d8 (8 nodes, 20 hard rules) → stub d25
[fold] d9-d16 (8 nodes, 20 hard rules) → stub d26
[fold] d17-d21 (5 nodes, 16 hard rules) → stub d27
turn 29: active nodes=6 (dropped 24 → 6)
turn 30: input=2,865 tokens (down 44% from turn 28)
21 live nodes compressed into 3 stubs. 56 hard rules preserved verbatim. Input tokens nearly halved mid-session, automatically. That's the sawtooth: the graph gets smaller as the conversation gets longer.
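The summary figures follow directly from the log lines. A small worked check, assuming each stub counts as one active node (an inference, but it is what makes the logged 24 → 6 add up):

```python
# Figures from the fold log: stub id -> (nodes folded, hard rules kept)
folds = {"d25": (8, 20), "d26": (8, 20), "d27": (5, 16)}

nodes_folded = sum(nodes for nodes, _ in folds.values())
rules_kept = sum(rules for _, rules in folds.values())

# Folded nodes leave the active set; each stub takes one slot (assumption).
active_before = 24
active_after = active_before - nodes_folded + len(folds)

# Relative drop in input tokens between turn 28 and turn 30.
tokens_turn_28, tokens_turn_30 = 5_075, 2_865
drop_pct = round(100 * (1 - tokens_turn_30 / tokens_turn_28))

print(nodes_folded, rules_kept, active_after, drop_pct)
```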
Cheap context is worthless if the model forgets the rules. So the same benchmark plants 6 callback traps late in the session and checks whether decisions made dozens of turns earlier are still honored.
| # | Turn | Trap | Result |
|---|---|---|---|
| 1 | 38 | Add logout: must use the exact auth mechanism chosen on turn 1 | ✅ |
| 2 | 39 | Profile endpoint: must scope via turn-1 auth and turn-2 DB | ✅ |
| 3 | 40 | Admin CSV export: a rule that was folded away banned cross-user data | ✅ surfaced |
| 4 | 41 | Email full notes: a folded rule banned note contents in email | ✅ surfaced |
| 5 | 42 | "Store the token in localStorage": conflicts with turn-1 hard rule | ✅ blocked |
| 6 | 43 | "Permanently delete a note": turn-11 chose soft-delete | ✅ flagged |
6 / 6. (First benchmark run: fold disabled, slicing only.) The two that matter most are #3 and #4: those rules had been compressed out of the active context by the time the trap was sprung, and the model still caught them, because hard rules are preserved verbatim on the fold stub. That's the whole thesis: forget the prose, keep the constraints.
Every model response is parsed into decision nodes, each linked backward to the decisions it depends on:
node = {
id, text,
depends_on: [...], # backward edges: what this decision rests on
hard_rules: [...], # binding constraints; the worker must halt if violated
revises, exception_to, # full replacement vs. narrow carve-out
status, turn, pinned
}
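The schema above maps naturally onto a typed record. The following is a minimal sketch, not Rudi's actual class: the field names come from the schema, while the types, the defaults, the `"active"` status value, and the two example decisions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    id: str
    text: str
    turn: int
    depends_on: list[str] = field(default_factory=list)  # backward edges
    hard_rules: list[str] = field(default_factory=list)  # binding constraints
    revises: Optional[str] = None        # id of a decision this one fully replaces
    exception_to: Optional[str] = None   # id of a decision this one narrowly carves out
    status: str = "active"               # assumed value; the real set is not documented here
    pinned: bool = False

# Illustrative decisions, loosely modeled on the Notes API benchmark.
auth = Node(
    id="d1",
    text="Auth via httpOnly session cookie",
    turn=1,
    hard_rules=["Never store the token in localStorage"],
    pinned=True,  # made in the first two turns
)
logout = Node(
    id="d30",
    text="Add POST /logout that clears the session cookie",
    turn=38,
    depends_on=[auth.id],  # rests on the turn-1 auth decision
)
```

Using `default_factory` gives every node its own list, so appending a rule to one node never leaks into another.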
- **Slice, don't dump.** Before each turn, Rudi injects only the nodes reachable from the current task, not the transcript.
- **Fold.** When a branch of decisions goes reachability-dead, a background pass compresses it into a one-line stub. Hard rules survive the fold verbatim, so a constraint can never be silently lost (see traps #3/#4).
- **Pin foundations.** Decisions that are reinforced repeatedly, made in the first two turns, or carry exceptions are pinned and never folded.
- **Hard rules are binding.** If a new task would violate one, the worker stops and asks instead of silently complying (traps #5/#6).
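The mechanisms above can be sketched on a toy graph. This is an illustration under stated assumptions rather than Rudi's implementation: it assumes the current task has already been mapped to seed node ids, that the set of dead nodes is supplied by the caller, and that folded nodes are marked with a `"folded"` status. The helper names (`make_node`, `slice_for`, `fold`, `binding_rules`) and the example decisions are hypothetical.

```python
from collections import deque

def make_node(text, depends_on=(), hard_rules=(), pinned=False):
    """Build a node dict using a subset of the schema's field names."""
    return {"text": text, "depends_on": list(depends_on),
            "hard_rules": list(hard_rules), "pinned": pinned, "status": "active"}

# Toy graph; decisions and rules are invented for illustration.
graph = {
    "d1": make_node("Auth via httpOnly session cookie",
                    hard_rules=["Never store the token in localStorage"], pinned=True),
    "d2": make_node("Store notes in SQLite", pinned=True),
    "d5": make_node("Admin reports are aggregate-only", ["d2"],
                    ["No cross-user data in exports"]),
    "d6": make_node("Emails carry the note title only", ["d1"],
                    ["Never include note contents in email"]),
    "d9": make_node("Add a logout endpoint", ["d1"]),
}

def slice_for(graph, seed_ids):
    """Ids reachable from the seeds by walking depends_on edges backward."""
    seen, queue = set(), deque(seed_ids)
    while queue:
        node_id = queue.popleft()
        if node_id in seen or node_id not in graph:
            continue
        seen.add(node_id)
        queue.extend(graph[node_id]["depends_on"])
    return seen

def fold(graph, dead_ids, stub_id):
    """Mark unpinned dead nodes as folded and add one stub holding their hard rules."""
    foldable = [i for i in dead_ids if not graph[i]["pinned"]]
    rules = [rule for i in foldable for rule in graph[i]["hard_rules"]]  # copied verbatim
    for i in foldable:
        graph[i]["status"] = "folded"
    graph[stub_id] = make_node(f"[stub] {len(foldable)} decisions folded", hard_rules=rules)
    return graph[stub_id]

def binding_rules(graph):
    """Hard rules still in force: those on active nodes, stubs included."""
    return [rule for node in graph.values() if node["status"] == "active"
            for rule in node["hard_rules"]]

logout_slice = slice_for(graph, ["d9"])        # the logout task rests on d1 only
stub = fold(graph, ["d1", "d5", "d6"], "s1")   # d1 is pinned, so it stays live
rules_in_force = binding_rules(graph)
```

After the fold, the prose of `d5` and `d6` is gone from the active set, but both of their rules are still returned by `binding_rules`, which is the property traps #3 and #4 exercise. How stubs are chosen for injection into a later slice is not covered by this sketch.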
git clone https://github.com/<you>/rudi
cd rudi
pip install anthropic flask flask-cors
export ANTHROPIC_API_KEY="sk-ant-..."
python benchmark_long_haiku.py
You'll watch the input-token curve stay flat while a naive transcript would balloon, and see all 6 callback traps resolve.
Two calls per turn. You keep your own model key; Rudi only manages the graph.
import rudi
s = rudi.get_slice(task)
rudi.store_decisions(decisions, inject_ids=s["inject_ids"])
Or let Rudi drive the whole turn (LLM call + store + fold) in one shot:
result = rudi.run_turn(task)  # → {"display", "tokens_in", "tokens_out", ...}
Storage is local SQLite (`store.py`): one row per decision node. No server, no cloud, no setup.
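To make "one row per decision node" concrete, here is a hedged sketch of what such a table could look like using Python's standard `sqlite3` module. The table name, column types, JSON encoding of the list fields, and the helper functions are all assumptions; the actual layout in `store.py` may differ.

```python
import json
import sqlite3

COLUMNS = ("id", "text", "depends_on", "hard_rules", "revises",
           "exception_to", "status", "turn", "pinned")

def open_store(path=":memory:"):
    """Open (or create) the node table; pass a file path for a persistent store."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS nodes (
            id           TEXT PRIMARY KEY,
            text         TEXT NOT NULL,
            depends_on   TEXT NOT NULL,  -- JSON array of node ids
            hard_rules   TEXT NOT NULL,  -- JSON array of rule strings
            revises      TEXT,
            exception_to TEXT,
            status       TEXT NOT NULL,
            turn         INTEGER NOT NULL,
            pinned       INTEGER NOT NULL
        )
    """)
    return conn

def save_node(conn, node):
    """Insert or overwrite the row for one decision node."""
    row = dict(node, depends_on=json.dumps(node["depends_on"]),
               hard_rules=json.dumps(node["hard_rules"]), pinned=int(node["pinned"]))
    conn.execute(
        f"INSERT OR REPLACE INTO nodes ({', '.join(COLUMNS)}) "
        f"VALUES ({', '.join('?' for _ in COLUMNS)})",
        [row.get(col) for col in COLUMNS],
    )
    conn.commit()

def load_node(conn, node_id):
    """Read one node back, decoding the JSON and boolean columns."""
    cur = conn.execute(f"SELECT {', '.join(COLUMNS)} FROM nodes WHERE id = ?", (node_id,))
    values = cur.fetchone()
    if values is None:
        return None
    node = dict(zip(COLUMNS, values))
    node["depends_on"] = json.loads(node["depends_on"])
    node["hard_rules"] = json.loads(node["hard_rules"])
    node["pinned"] = bool(node["pinned"])
    return node

conn = open_store()
original = {"id": "d1", "text": "Auth via httpOnly session cookie", "depends_on": [],
            "hard_rules": ["Never store the token in localStorage"], "revises": None,
            "exception_to": None, "status": "active", "turn": 1, "pinned": True}
save_node(conn, original)
restored = load_node(conn, "d1")
```

Storing the rule list as JSON keeps each rule string byte-for-byte intact on the way in and out, which matters if hard rules are meant to survive verbatim.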
| Claim | Status |
|---|---|
| Graph slicing bounds the token curve | ✅ measured (table above) |
| Decisions recalled 40+ turns later | ✅ 6/6 callbacks |
| Hard rules survive fold verbatim | ✅ traps #3/#4 |
| Conflicts blocked, not silently obeyed | ✅ traps #5/#6 |
| Fold GC compresses dead branches mid-session | ✅ measured: 24 nodes → 6, input −44% at turn 30 |
| Retrieval fallback above ~80 active nodes | ⏳ built, not yet benchmarked at scale |
No vapor. The table is what the logs say; the in-progress rows are labeled as such.
Business Source License 1.1. Free for personal use, research, development, and self-hosting. Commercial SaaS or managed hosting use requires a paid license from the maintainer.
Want a commercial license? Open an issue or email **raphaelwkago@gmail.com**.