Causal graph memory for LLMs. Flat token cost, no matter how long the session runs


[ARTICLE · art-33938] src=github.com ↗ pub=2026-06-19T12:54Z topic=large-language-models verified=true sentiment=↑ positive


Rudi, a new system for LLM memory management, uses a causal graph of decisions to replace the growing transcript, achieving flat token costs regardless of session length. In a 43-turn software architecture session, Rudi used 5.4× fewer tokens than the standard full-transcript approach while maintaining answer quality and passing all six callback traps that tested long-term constraint adherence.


Every LLM API call re-sends the whole conversation. Cost grows every turn; eventually you hit the context limit. Rudi replaces the growing transcript with a dependency graph of decisions — and injects only the slice relevant to the current task. Turn 10,000 costs about the same as turn 10.

In a 43-turn software-architecture session (building a Notes API turn by turn), the standard "re-send the full transcript" approach was sending ~38,000 input tokens by the final turn. Rudi sent 6,782 — for the same task, same model, same answer quality.

Turn	Rudi input	Full-transcript input	Savings
1	382	340	—
10	1,467	6,999	4.8×
20	3,581	17,385	4.9×
30	4,128	26,821	6.5×
43	6,782	38,320	5.7×

Totals across all 43 turns: 152,222 input tokens (Rudi) vs 828,369 (full transcript), or 5.4× fewer tokens. The absolute gap widens as the session runs because Rudi's curve is bounded while the transcript's is linear.
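The savings column can be rechecked from the table itself. The sketch below only redoes that arithmetic on the figures quoted above; the function names are illustrative and are not part of Rudi.

```python
# Per-turn input tokens copied from the table above: (turn, rudi, full_transcript).
ROWS = [
    (1, 382, 340),
    (10, 1_467, 6_999),
    (20, 3_581, 17_385),
    (30, 4_128, 26_821),
    (43, 6_782, 38_320),
]
TOTAL_RUDI = 152_222
TOTAL_FULL = 828_369


def savings(rudi_tokens: int, full_tokens: int) -> float:
    """Return how many times more input tokens the full transcript sends."""
    return full_tokens / rudi_tokens


def report() -> list[str]:
    """Format the ratio and the absolute token gap for each sampled turn."""
    lines = []
    for turn, rudi, full in ROWS:
        ratio = savings(rudi, full)
        lines.append(f"turn {turn:>2}: {ratio:.1f}x, gap {full - rudi:,} tokens")
    lines.append(f"total: {savings(TOTAL_RUDI, TOTAL_FULL):.1f}x")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Note that the ratio dips between turns 30 and 43 (6.5× to 5.7×) while the absolute gap keeps growing, from 22,693 to 31,538 tokens.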

These numbers are from a run with fold disabled — graph slicing alone. See below for the measured fold result.

Cost of the entire 43-turn run on Claude Haiku 4.5: $0.34.

At turn 29 of a separate run, fold fired for the first time:

turn 28: input=5,075 tokens   active nodes=24
[fold] d1–d8   (8 nodes, 20 hard rules) → stub d25
[fold] d9–d16  (8 nodes, 20 hard rules) → stub d26
[fold] d17–d21 (5 nodes, 16 hard rules) → stub d27
turn 29: active nodes=6   (dropped 24 → 6)
turn 30: input=2,865 tokens   ← down 44% from turn 28

21 live nodes compressed into 3 stubs. 56 hard rules preserved verbatim. Input tokens nearly halved mid-session, automatically. That's the sawtooth: the graph gets smaller as the conversation gets longer.
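To make the fold step concrete, here is a minimal sketch of what such a pass could look like. It is not Rudi's implementation: the chunk size of 8, the stub text format, the `d<N>` id scheme, and the function names are assumptions inferred from the log above, and how Rudi decides that a node is dead is not shown.

```python
from itertools import chain


def fold(nodes: list[dict], stub_id: str) -> dict:
    """Compress a run of nodes into one stub, copying hard rules verbatim."""
    return {
        "id": stub_id,
        "text": f"[folded {nodes[0]['id']}-{nodes[-1]['id']}: {len(nodes)} decisions]",
        "folded_ids": [n["id"] for n in nodes],
        # Only the prose is dropped; every hard rule is carried over unchanged.
        "hard_rules": list(chain.from_iterable(n["hard_rules"] for n in nodes)),
        "pinned": False,
    }


def fold_pass(active: list[dict], dead_ids: set[str], chunk_size: int = 8) -> list[dict]:
    """Replace unpinned dead nodes with stubs; leave all other nodes untouched."""
    dead = [n for n in active if n["id"] in dead_ids and not n["pinned"]]
    folded = {n["id"] for n in dead}
    survivors = [n for n in active if n["id"] not in folded]
    # Assumption: ids run d1..dN, so new stubs continue the sequence.
    next_num = len(active) + 1
    stubs = []
    for start in range(0, len(dead), chunk_size):
        stubs.append(fold(dead[start:start + chunk_size], f"d{next_num}"))
        next_num += 1
    return survivors + stubs


if __name__ == "__main__":
    # Made-up data shaped like the log: 24 active nodes, d1..d21 dead.
    active = [
        {"id": f"d{i}", "text": f"decision {i}",
         "hard_rules": [f"rule {i}a", f"rule {i}b"], "pinned": False}
        for i in range(1, 25)
    ]
    after = fold_pass(active, {f"d{i}" for i in range(1, 22)})
    print(len(active), "->", len(after))
```

With this made-up input the active set goes from 24 nodes to 6 (three survivors plus three stubs), matching the shape of the log; the hard-rule counts in the demo are invented and will not match the 56 reported above.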

Cheap context is worthless if the model forgets the rules. So the same benchmark plants 6 callback traps late in the session and checks whether decisions made dozens of turns earlier are still honored.

#	Turn	Trap	Result
1	38	Add logout: must use the exact auth mechanism chosen on turn 1	✅
2	39	Profile endpoint: must scope via turn-1 auth and turn-2 DB	✅
3	40	Admin CSV export: a rule that was folded away banned cross-user data	✅ surfaced
4	41	Email full notes: a folded rule banned note contents in email	✅ surfaced
5	42	"Store the token in localStorage": conflicts with turn-1 hard rule	✅ blocked
6	43	"Permanently delete a note": turn-11 chose soft-delete	✅ flagged

6 / 6. (First benchmark run — fold disabled, slicing only.) The two that matter most are #3 and #4: those rules had been compressed out of the active context by the time the trap was sprung — and the model still caught them, because hard rules are preserved verbatim on the fold stub. That's the whole thesis: forget the prose, keep the constraints.

Every model response is parsed into decision nodes, each linked backward to the decisions it depends on:

node = {
  id, text,
  depends_on: [...],     # backward edges — what this decision rests on
  hard_rules: [...],     # binding constraints; the worker must halt if violated
  revises, exception_to, # full replacement vs. narrow carve-out
  status, turn, pinned
}
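The schema above is shorthand rather than runnable code. One way to write it down in Python is a dataclass, sketched below. The field names come from the schema; the types, defaults, and the status values are assumptions, as is the example content.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DecisionNode:
    """One decision in the graph. Field types are guesses, not Rudi's own."""
    id: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # backward edges
    hard_rules: list[str] = field(default_factory=list)  # binding constraints
    revises: Optional[str] = None       # id of a decision this one fully replaces
    exception_to: Optional[str] = None  # id of a decision this one carves out of
    status: str = "active"              # assumed values, e.g. "active" / "folded"
    turn: int = 0
    pinned: bool = False


if __name__ == "__main__":
    # Illustrative content only; the real benchmark decisions are not shown.
    auth = DecisionNode(
        id="d1",
        text="Auth mechanism chosen on turn 1",
        hard_rules=["Never store the token in localStorage"],
        turn=1,
        pinned=True,
    )
    logout = DecisionNode(id="d30", text="Add logout", depends_on=[auth.id], turn=38)
    print(logout)
```

Using `default_factory` for the list fields avoids sharing one mutable list across every node.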

Slice, don't dump. Before each turn, Rudi injects only the nodes reachable from the current task, not the transcript.

Fold. When a branch of decisions goes reachability-dead, a background pass compresses it into a one-line stub. Hard rules survive the fold verbatim, so a constraint can never be silently lost (see traps #3/#4).

Pin foundations. Decisions that are reinforced repeatedly, made in the first two turns, or carry exceptions are pinned and never folded.

Hard rules are binding. If a new task would violate one, the worker stops and asks instead of silently complying (traps #5/#6).
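The slicing idea can be sketched as a graph traversal over the `depends_on` edges. This is an illustration, not Rudi's code: how a task is mapped to its starting nodes is not described here, so the sketch takes the seed ids as an argument, and `get_slice_ids` is a hypothetical name.

```python
from collections import deque


def get_slice_ids(graph: dict[str, dict], seed_ids: list[str]) -> list[str]:
    """Return ids of every node reachable from seed_ids via depends_on edges."""
    seen: set[str] = set()
    queue = deque(seed_ids)
    while queue:
        node_id = queue.popleft()
        if node_id in seen or node_id not in graph:
            continue  # already visited, or a dangling edge
        seen.add(node_id)
        queue.extend(graph[node_id]["depends_on"])
    # Oldest decisions first, so foundations appear before what builds on them.
    return sorted(seen, key=lambda nid: (graph[nid]["turn"], nid))


if __name__ == "__main__":
    # Made-up graph: d4 rests on d3, which rests on d2; d1 and d5 are unrelated.
    graph = {
        "d1": {"turn": 1, "depends_on": []},
        "d2": {"turn": 2, "depends_on": []},
        "d3": {"turn": 5, "depends_on": ["d2"]},
        "d4": {"turn": 11, "depends_on": ["d3"]},
        "d5": {"turn": 7, "depends_on": []},
    }
    print(get_slice_ids(graph, ["d4"]))
```

The size of the result depends on how deep the dependency chain is, not on how many turns have passed, which is the property the token table relies on.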

git clone https://github.com/<you>/rudi
cd rudi
pip install anthropic flask flask-cors

export ANTHROPIC_API_KEY="sk-ant-..."

python benchmark_long_haiku.py

You'll watch the input-token curve stay flat while a naive transcript would balloon, and see all 6 callback traps resolve.

Two calls per turn. You keep your own model key; Rudi only manages the graph.

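import rudi

# Call 1, before your own LLM call: get the graph slice for this task.
s = rudi.get_slice(task)

# (Call your model here, with your own API key.)

# Call 2, after the response: store the decisions from it.
rudi.store_decisions(decisions, inject_ids=s["inject_ids"])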

Or let Rudi drive the whole turn (LLM call + store + fold) in one shot:

result = rudi.run_turn(task)   # → {"display", "tokens_in", "tokens_out", ...}

Storage is local SQLite (store.py), one row per decision node. No server, no cloud, no setup.
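As an illustration of "one row per decision node", the sketch below stores nodes in SQLite using only the standard library. The table name, column layout, and JSON encoding of the list fields are assumptions; the actual schema in store.py is not shown here and may differ.

```python
import json
import sqlite3
from typing import Optional

COLUMNS = "id, text, depends_on, hard_rules, status, turn, pinned"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a node store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS nodes (
            id TEXT PRIMARY KEY,
            text TEXT NOT NULL,
            depends_on TEXT NOT NULL,  -- JSON array of node ids
            hard_rules TEXT NOT NULL,  -- JSON array of strings
            status TEXT NOT NULL,
            turn INTEGER NOT NULL,
            pinned INTEGER NOT NULL
        )
        """
    )
    return conn


def save_node(conn: sqlite3.Connection, node: dict) -> None:
    """Insert a node, or overwrite the existing row with the same id."""
    conn.execute(
        f"INSERT OR REPLACE INTO nodes ({COLUMNS}) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            node["id"], node["text"],
            json.dumps(node["depends_on"]), json.dumps(node["hard_rules"]),
            node["status"], node["turn"], int(node["pinned"]),
        ),
    )
    conn.commit()


def load_node(conn: sqlite3.Connection, node_id: str) -> Optional[dict]:
    """Return the node as a dict, or None if no row has that id."""
    row = conn.execute(
        f"SELECT {COLUMNS} FROM nodes WHERE id = ?", (node_id,)
    ).fetchone()
    if row is None:
        return None
    return {
        "id": row[0], "text": row[1],
        "depends_on": json.loads(row[2]), "hard_rules": json.loads(row[3]),
        "status": row[4], "turn": row[5], "pinned": bool(row[6]),
    }
```

Passing a file path instead of `":memory:"` to `open_store` would persist the graph between runs.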

No vapor. The table is what the logs say; the in-progress rows are labeled as such.

Business Source License 1.1. Free for personal use, research, development, and self-hosting. Commercial SaaS or managed hosting use requires a paid license from the maintainer.

Want a commercial license? Open an issue or email raphaelwkago@gmail.com.


Read original on github.com → github.com/raphaelwkago-sketch/rudi
