Most RAG demos answer "what's the right chunk?" Very few can answer the
two questions a regulator or an auditor will actually ask:
I got tired of hand-waving at both, so I shipped two pre-registered, deterministic benchmarks alongside
RAB measures whether your audit trail is good enough to replay a
decision, with three deterministic metrics:
| Metric | What it checks | EU AI Act |
|---|---|---|
| AC (Audit Completeness) | Is every decision-relevant event logged? | Art. 10 |
| RF (Replay Fidelity) | Can you re-derive the answer from the log alone? | Art. 12 |
| PC (Provenance Coverage) | Does every claim trace to a source? | Art. 19 |
The three metrics map verbatim to EU AI Act Articles 10, 12, and 19 —
record-keeping obligations that apply from 2026-08-02 (per Article 113).
Scenario S1 result:

```
            AC     RF     PC
JAMES       1.000  1.000  1.000
Baseline-0  0.275  0.000  0.000   (vanilla default-logging)
```
The gap is the whole point. "We have logs" (AC 0.275) is not the same as
"we can replay the decision" (RF 0). Default application logging gets you
a partial event trail and zero replay/provenance — which is exactly the
failure mode an Article 12 audit would surface.
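
To make the "logs vs. replay" distinction concrete, here is a minimal sketch of how three scores of this kind could be computed from a single decision log. The event names in `REQUIRED_EVENTS`, the `AuditLog` shape, the `toy_replay` stand-in, and the formulas themselves are illustrative assumptions, not the definitions RAB actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical event types a replayable RAG decision might need to log.
REQUIRED_EVENTS = ("query", "retrieval", "prompt", "model_config", "answer")


@dataclass
class AuditLog:
    events: dict = field(default_factory=dict)  # event type -> payload
    claims: list = field(default_factory=list)  # [{"text": ..., "source": ...}]


def audit_completeness(log):
    """Share of required event types present in the log."""
    present = sum(1 for name in REQUIRED_EVENTS if name in log.events)
    return present / len(REQUIRED_EVENTS)


def replay_fidelity(log, replay):
    """1.0 if re-running from the log alone reproduces the logged answer."""
    if any(name not in log.events for name in REQUIRED_EVENTS):
        return 0.0  # inputs are missing, so no replay can be attempted
    return 1.0 if replay(log.events) == log.events["answer"] else 0.0


def provenance_coverage(log):
    """Share of claims that carry a source reference."""
    if not log.claims:
        return 0.0
    return sum(1 for c in log.claims if c.get("source")) / len(log.claims)


def toy_replay(events):
    # Stand-in for a deterministic generator: the answer is a pure
    # function of the logged prompt.
    return events["prompt"].upper()


full_log = AuditLog(
    events={"query": "q", "retrieval": ["doc-1"], "prompt": "ctx + q",
            "model_config": {"temperature": 0}, "answer": "CTX + Q"},
    claims=[{"text": "claim", "source": "doc-1"}],
)
# Default application logging: request and response only, no sources.
default_log = AuditLog(
    events={"query": "q", "answer": "CTX + Q"},
    claims=[{"text": "claim", "source": None}],
)

for name, log in (("full", full_log), ("default", default_log)):
    print(name, audit_completeness(log),
          replay_fidelity(log, toy_replay), provenance_coverage(log))
```

In this toy setup the default log still scores above zero on completeness because it records the query and the answer, but it scores zero on replay and provenance because the retrieval results, prompt, and model configuration were never written down.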
RAG facts go stale. A policy is superseded, a price changes, a spec is
revised. LRB asks: when you query as of a point in time, do you
retrieve the fact that was valid then, or whatever overwrote it?
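
A small sketch of that question, assuming each fact version carries a validity interval. The `FactVersion` record, the `refund_window` example data, and both lookup functions are hypothetical illustrations, not the benchmark's data model or JAMES's retrieval code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"


def newest_wins(versions, key):
    """Ignore the query time and return the latest version of the fact."""
    candidates = [v for v in versions if v.key == key]
    return max(candidates, key=lambda v: v.valid_from, default=None)


def as_of(versions, key, when):
    """Return the version whose interval [valid_from, valid_to) contains `when`."""
    for v in versions:
        if v.key != key:
            continue
        if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
            return v
    return None


history = [
    FactVersion("refund_window", "30 days", date(2023, 1, 1), date(2024, 7, 1)),
    FactVersion("refund_window", "14 days", date(2024, 7, 1)),
]

query_time = date(2024, 3, 15)
print(newest_wins(history, "refund_window").value)        # 14 days
print(as_of(history, "refund_window", query_time).value)  # 30 days
```

For a question about March 2024, the overwrite-style lookup returns the policy that replaced the one in force at the time, which is the error LRB is meant to expose.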
Three systems compared:
`reconstruct_graph_at(t)`

The R@1 ordering V < N < J holds across 4 model families × 4 scale points (a 12.5× scale span): time-aware retrieval beats both naive
At publication scale (S3):

```
     R@1
V    0.502
N    0.721
J    0.845
```
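
For readers who want to check numbers like these, here is a minimal sketch of R@1, assuming the usual definition: the share of queries whose top-ranked result is the single gold fact version. The query IDs, fact IDs, and the one-gold-item-per-query setup are assumptions for illustration, since the post does not spell out how the benchmark scores R@1.

```python
def recall_at_1(ranked_results, gold):
    """Fraction of queries whose top-ranked item equals the gold item.

    ranked_results: query id -> list of retrieved ids, best first
    gold:           query id -> the id that was valid at the query time
    """
    if not gold:
        raise ValueError("no queries to score")
    hits = 0
    for query_id, gold_id in gold.items():
        ranking = ranked_results.get(query_id) or []
        if ranking and ranking[0] == gold_id:
            hits += 1
    return hits / len(gold)


# Hypothetical run: four as-of queries, gold = version valid at query time.
gold = {"q1": "policy-v1", "q2": "price-v2", "q3": "spec-v1", "q4": "policy-v2"}
ranked = {
    "q1": ["policy-v1", "policy-v2"],
    "q2": ["price-v2", "price-v1"],
    "q3": ["spec-v2", "spec-v1"],   # returned the version that overwrote the gold one
    "q4": ["policy-v2", "policy-v1"],
}

print(recall_at_1(ranked, gold))  # 0.75
```

Because the score depends only on the rankings and the gold labels, re-running it on the same inputs gives the same number.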
Everything is local: Ollama (`gemma4:e4b` default) + `BAAI/bge-m3` embeddings + ChromaDB. No cloud LLM account.
```bash
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
```
These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a
scenario I designed is a starting line, not proof of general
superiority — the value is that the scenarios, metrics, and baselines are
public and deterministic, so you can run them, disagree, and beat the
numbers.
Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping
hold up under your reading of the text? (b) is "newest wins" the right
Naive-supersede baseline for LRB, or is there a stronger one I should add?