# Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

> Source: <https://dev.to/hashevolution/two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-101219-lrb-time-travel-369h>
> Published: 2026-06-14 06:10:11+00:00

Most RAG demos answer "what's the right chunk?" Very few can answer the two questions a regulator or an auditor will actually ask:

I got tired of hand-waving at both, so I shipped two **pre-registered,
deterministic** benchmarks alongside

RAB measures whether your audit trail is good enough to *replay* a decision, with three deterministic metrics:

| Metric | What it checks | EU AI Act |
|---|---|---|
| AC (Audit Completeness) | Is every decision-relevant event logged? | Art. 10 |
| RF (Replay Fidelity) | Can you re-derive the answer from the log alone? | Art. 12 |
| PC (Provenance Coverage) | Does every claim trace to a source? | Art. 19 |

The three metrics map **verbatim** to EU AI Act Articles 10, 12, and 19: record-keeping obligations that apply from **2026-08-02** (per Article 113).
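To make the three scores concrete, here is a minimal sketch over a toy audit log. Everything in it is an illustrative assumption rather than the RAB scoring code: the `REQUIRED_EVENTS` set, the `Decision` shape, and `toy_replay` are hypothetical names, and the real definitions live in the repo and may differ.

```python
from dataclasses import dataclass, field

# Hypothetical event types a replayable RAG decision would need logged.
REQUIRED_EVENTS = {"query", "retrieval", "prompt", "generation"}


@dataclass
class Decision:
    answer: str                                  # answer originally returned
    events: dict                                 # logged events: type -> payload
    claims: list = field(default_factory=list)   # (claim text, source id or None)


def audit_completeness(d: Decision) -> float:
    """Fraction of required event types present in the log."""
    return len(REQUIRED_EVENTS & set(d.events)) / len(REQUIRED_EVENTS)


def replay_fidelity(d: Decision, replay) -> float:
    """1.0 if re-deriving the answer from the log alone reproduces it, else 0.0."""
    try:
        return 1.0 if replay(d.events) == d.answer else 0.0
    except KeyError:  # the log is missing something the replay needs
        return 0.0


def provenance_coverage(d: Decision) -> float:
    """Fraction of claims that carry a source id."""
    if not d.claims:
        return 0.0
    return sum(1 for _, src in d.claims if src is not None) / len(d.claims)


def toy_replay(events: dict) -> str:
    # Stand-in for a deterministic re-run: a pure function of logged inputs.
    return f"{events['prompt']} -> {events['retrieval'][0]}"


full = Decision(
    answer="refund window? -> 30 days",
    events={"query": "refund window?", "retrieval": ["30 days"],
            "prompt": "refund window?", "generation": "30 days"},
    claims=[("Refunds are accepted for 30 days", "policy-v2")],
)
default_logging = Decision(
    answer="refund window? -> 30 days",
    events={"query": "refund window?"},           # only the request was logged
    claims=[("Refunds are accepted for 30 days", None)],
)

for name, d in [("full", full), ("default", default_logging)]:
    print(name, audit_completeness(d), replay_fidelity(d, toy_replay),
          provenance_coverage(d))
```

The toy numbers are not benchmark results. The point is only that a log can score above zero on completeness while still being impossible to replay.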

**Scenario S1 result:**

```
                 AC      RF      PC
JAMES          1.000   1.000   1.000
Baseline-0     0.275   0.000   0.000   (vanilla default-logging)
```

The gap is the whole point. "We have logs" (AC 0.275) is not the same as "we can replay the decision" (RF 0). Default application logging gets you a partial event trail and zero replay/provenance, which is exactly the failure mode an Article 12 audit would surface.

RAG facts go stale. A policy is superseded, a price changes, a spec is revised. LRB asks: when you query *as of* a point in time, do you retrieve the fact that was **valid then**, or whatever overwrote it?
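A minimal sketch of the difference, assuming each fact version carries a half-open validity interval. `FactVersion`, `as_of`, and `newest_wins` are hypothetical names for illustration, not the JAMES API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"


def as_of(versions, key, t):
    """Return the version of `key` whose interval [valid_from, valid_to) contains t."""
    for v in versions:
        if v.key == key and v.valid_from <= t and (v.valid_to is None or t < v.valid_to):
            return v
    return None


def newest_wins(versions, key):
    """Naive supersede: ignore the query time and return the latest version."""
    candidates = [v for v in versions if v.key == key]
    return max(candidates, key=lambda v: v.valid_from, default=None)


history = [
    FactVersion("refund_window", "14 days", date(2024, 1, 1), date(2025, 1, 1)),
    FactVersion("refund_window", "30 days", date(2025, 1, 1)),
]

print(as_of(history, "refund_window", date(2024, 6, 1)).value)  # 14 days
print(newest_wins(history, "refund_window").value)              # 30 days
```

For a question about mid-2024, only the first answer is correct. The second is what you get when a newer version silently overwrites the old one.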

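Three systems are compared, labeled V, N, and J in the results below. N is the Naive-supersede ("newest wins") baseline; J is JAMES, the time-aware system (`reconstruct_graph_at(t)`).

The R@1 ordering **V < N < J holds across 4 model families × 4 scale points** (a 12.5× scale span): time-aware retrieval beats both naive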

**At publication scale (S3):**

```
        R@1
V       0.502
N       0.721
J       0.845
```
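For readers who want to check the arithmetic, here is a minimal sketch of the metric, assuming R@1 has its usual meaning: the share of queries whose top-ranked result is the gold fact. The function name and data shapes are hypothetical, not taken from the benchmark runners.

```python
def recall_at_1(ranked_results, gold):
    """Share of queries whose top-ranked item equals the gold item.

    ranked_results: query id -> list of retrieved ids, best first
    gold:           query id -> the single correct id for that query
    """
    if not gold:
        raise ValueError("no queries to score")
    hits = sum(
        1 for qid, expected in gold.items()
        if ranked_results.get(qid, [])[:1] == [expected]
    )
    return hits / len(gold)


gold = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
ranked = {
    "q1": ["a", "x"],   # hit
    "q2": ["x", "b"],   # gold is ranked second, so a miss at k=1
    "q3": ["c"],        # hit
    "q4": [],           # nothing retrieved, miss
}
print(recall_at_1(ranked, gold))  # 0.5
```

In a time-travel setting the gold id would be the version valid at the query's as-of time, so retrieving the newest version of the right fact still counts as a miss.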

Everything is local: Ollama (`gemma4:e4b` default) + BAAI/bge-m3 embeddings + ChromaDB. No cloud LLM account.

```
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)
```

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a scenario *I designed* is a starting line, not proof of general superiority. The value is that the scenarios, metrics, and baselines are public and deterministic, so you can run them, disagree, and beat the numbers.

Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping hold up under your reading of the text? (b) is "newest wins" the right Naive-supersede baseline for LRB, or is there a stronger one I should add?