[ARTICLE · art-26772] src=dev.to ↗ pub= topic=artificial-intelligence verified=true sentiment=· neutral

Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

A developer released two pre-registered, deterministic benchmarks for audit-native retrieval-augmented generation (RAG): the RAB (EU AI Act 10/12/19) and LRB (Time-Travel Retrieval). RAB measures audit completeness, replay fidelity, and provenance coverage, mapping directly to EU AI Act Articles 10, 12, and 19, while LRB tests time-aware retrieval against stale facts. The benchmarks are fully local and open-source, with results showing that the JAMES system achieves perfect scores on RAB and outperforms naive and vector-only baselines on LRB.

2 min read · published Jun 14, 2026

Most RAG demos answer "what's the right chunk?" Very few can answer the two questions a regulator or an auditor will actually ask:

I got tired of hand-waving at both, so I shipped two pre-registered, deterministic benchmarks alongside

RAB measures whether your audit trail is good enough to replay a decision, with three deterministic metrics:

Metric                      What it checks                                     EU AI Act
AC (Audit Completeness)     Is every decision-relevant event logged?           Art. 10
RF (Replay Fidelity)        Can you re-derive the answer from the log alone?   Art. 12
PC (Provenance Coverage)    Does every claim trace to a source?                Art. 19

The three metrics map directly to EU AI Act Articles 10, 12, and 19, record-keeping obligations that apply from 2026-08-02 (per Article 113).
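To make the three metrics concrete, here is a minimal sketch of how ratios of this kind could be computed. It assumes each metric is a simple fraction, which the post does not spell out. The names `Claim`, `audit_completeness`, `replay_fidelity`, and `provenance_coverage` are hypothetical and are not taken from the JAMES repository.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One statement in an answer, with the source ids that back it (hypothetical type)."""
    text: str
    source_ids: list[str] = field(default_factory=list)


def audit_completeness(required_events: set[str], logged_events: set[str]) -> float:
    """AC (assumed definition): share of decision-relevant event types present in the log."""
    if not required_events:
        return 1.0
    return len(required_events & logged_events) / len(required_events)


def replay_fidelity(original_answers: list[str], replayed_answers: list[str]) -> float:
    """RF (assumed definition): share of answers that a replay from the log reproduces exactly.

    Decisions that could not be replayed at all are simply missing from
    replayed_answers and therefore count as failures.
    """
    if not original_answers:
        return 1.0
    matches = sum(o == r for o, r in zip(original_answers, replayed_answers))
    return matches / len(original_answers)


def provenance_coverage(claims: list[Claim]) -> float:
    """PC (assumed definition): share of claims that cite at least one source."""
    if not claims:
        return 1.0
    return sum(1 for c in claims if c.source_ids) / len(claims)
```

All three functions are deterministic given the same log, which is the property the benchmark relies on. The exact event taxonomy and matching rules in RAB may differ from this sketch.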

Scenario S1 result:

                 AC      RF      PC
JAMES          1.000   1.000   1.000
Baseline-0     0.275   0.000   0.000   (vanilla default-logging)

The gap is the whole point. "We have logs" (AC 0.275) is not the same as "we can replay the decision" (RF 0). Default application logging gets you a partial event trail and zero replay or provenance, which is exactly the failure mode an Article 12 audit would surface.
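The difference between "we have logs" and "we can replay" can be sketched as follows. This is an illustration under stated assumptions, not the JAMES log format. The record layout, `make_record`, `replay`, and the stand-in generator `toy_generate` are all hypothetical, and the sketch assumes the generation step is deterministic.

```python
import hashlib


def sha256(text: str) -> str:
    """Hex digest used to pin down exact content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_record(query: str, chunks: dict[str, str], model: str, answer: str) -> dict:
    """Store everything needed to re-derive the answer, not just a log message."""
    return {
        "query": query,
        "chunks": [
            {"id": cid, "sha256": sha256(text), "text": text}
            for cid, text in sorted(chunks.items())
        ],
        "model": model,
        "answer_sha256": sha256(answer),
    }


def replay(record: dict, generate) -> bool:
    """Re-run generation from the record alone and compare answer digests."""
    context = [c["text"] for c in record["chunks"]]
    answer = generate(record["query"], context, record["model"])
    return sha256(answer) == record["answer_sha256"]


def toy_generate(query: str, context: list[str], model: str) -> str:
    """Stand-in for a deterministic LLM call (hypothetical)."""
    return f"{model}|{query}|{' '.join(context)}"
```

A default log line such as "answered query 42" carries none of these fields, so there is nothing to feed back into the generator, and a replay check like the one above cannot even start.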

RAG facts go stale. A policy is superseded, a price changes, a spec is revised. LRB asks: when you query as of a point in time, do you retrieve the fact that was valid then, or whatever overwrote it?
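The question LRB asks can be illustrated with a small sketch that contrasts a "newest wins" lookup with an as-of lookup over validity intervals. The `FactVersion` type and both functions are hypothetical and only illustrate the idea. They are not the retrieval code in JAMES.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    """One version of a fact, valid on the half-open interval [valid_from, valid_to)."""
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the version is still valid


def newest_wins(versions: list[FactVersion], key: str) -> Optional[FactVersion]:
    """Return the most recent version, ignoring the query time entirely."""
    candidates = [v for v in versions if v.key == key]
    return max(candidates, key=lambda v: v.valid_from, default=None)


def as_of(versions: list[FactVersion], key: str, t: date) -> Optional[FactVersion]:
    """Return the version that was valid at time t, or None if no version covers t."""
    for v in versions:
        if v.key == key and v.valid_from <= t and (v.valid_to is None or t < v.valid_to):
            return v
    return None


versions = [
    FactVersion("price", "10 EUR", date(2025, 1, 1), date(2026, 1, 1)),
    FactVersion("price", "12 EUR", date(2026, 1, 1)),
]
print(newest_wins(versions, "price").value)             # 12 EUR
print(as_of(versions, "price", date(2025, 6, 1)).value)  # 10 EUR
```

For a query dated mid-2025, the as-of lookup returns the price that was valid then, while "newest wins" returns whatever overwrote it.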

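Three systems compared:

V: vector-only retrieval
N: Naive-supersede ("newest wins")
J: JAMES, time-aware retrieval (reconstruct_graph_at(t))

The R@1 ordering V < N < J holds across 4 model families × 4 scale points (a 12.5× scale span): time-aware retrieval beats both the naive and vector-only baselines.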

At publication scale (S3):

        R@1
V       0.502
N       0.721
J       0.845
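For readers who want to check numbers like these on their own runs, here is a minimal sketch of R@1. It assumes the usual meaning, the fraction of queries whose top-ranked result is the gold item, since the post does not state LRB's exact definition. The function name and input shapes are hypothetical.

```python
def recall_at_1(ranked_results: dict[str, list[str]], gold: dict[str, str]) -> float:
    """Fraction of queries whose first retrieved id equals the gold id.

    ranked_results maps query id -> retrieved ids, best first.
    gold maps query id -> the id of the fact that was valid at query time.
    Queries with no results, or missing from ranked_results, count as misses.
    """
    if not gold:
        raise ValueError("gold must contain at least one query")
    hits = sum(
        1
        for qid, gold_id in gold.items()
        if ranked_results.get(qid, [])[:1] == [gold_id]
    )
    return hits / len(gold)
```

Because the score depends only on the ranked ids and the gold labels, the same retrieval output always yields the same number.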

Everything is local: Ollama (gemma4:e4b default) + BAAI/bge-m3 embeddings + ChromaDB. No cloud LLM account.

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a scenario I designed is a starting line, not proof of general superiority. The value is that the scenarios, metrics, and baselines are public and deterministic, so you can run them, disagree, and beat the numbers.

Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping hold up under your reading of the text? (b) is "newest wins" the right Naive-supersede baseline for LRB, or is there a stronger one I should add?
