Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

A developer released two pre-registered, deterministic benchmarks for audit-native retrieval-augmented generation (RAG): the RAB (EU AI Act 10/12/19) and LRB (Time-Travel Retrieval). RAB measures audit completeness, replay fidelity, and provenance coverage, mapping directly to EU AI Act Articles 10, 12, and 19, while LRB tests time-aware retrieval against stale facts. The benchmarks are fully local and open-source, with results showing that the JAMES system achieves perfect scores on RAB and outperforms naive and vector-only baselines on LRB.

Most RAG demos answer "what's the right chunk?" Very few can answer the two questions a regulator or an auditor will actually ask: can you replay how a decision was made, and can you retrieve the fact that was valid at a given point in time?

I got tired of hand-waving at both, so I shipped two pre-registered, deterministic benchmarks.

RAB measures whether your audit trail is good enough to replay a decision, with three deterministic metrics:

| Metric | What it checks | EU AI Act |
|---|---|---|
| AC (Audit Completeness) | Is every decision-relevant event logged? | Art. 10 |
| RF (Replay Fidelity) | Can you re-derive the answer from the log alone? | Art. 12 |
| PC (Provenance Coverage) | Does every claim trace to a source? | Art. 19 |

The three metrics map directly to EU AI Act Articles 10, 12, and 19: data-governance and record-keeping obligations that apply from 2026-08-02 per Article 113.

Scenario S1 result:

| System | AC | RF | PC |
|---|---|---|---|
| JAMES | 1.000 | 1.000 | 1.000 |
| Baseline-0 (vanilla default-logging) | 0.275 | 0.000 | 0.000 |

The gap is the whole point. "We have logs" (AC 0.275) is not the same as "we can replay the decision" (RF 0). Default application logging gets you a partial event trail and zero replay or provenance, which is exactly the failure mode an Article 12 audit would surface.

RAG facts go stale. A policy is superseded, a price changes, a spec is revised. LRB asks: when you query as of a point in time, do you retrieve the fact that was valid then, or whatever overwrote it?

Three systems compared:

- V: vector-only retrieval
- N: naive supersede ("newest wins")
- J: JAMES, which reconstructs the graph at t

The R@1 ordering V < N < J holds across 4 model families × 4 scale points (a 12.5× scale span): time-aware retrieval beats both the naive and the vector-only baselines.

At publication scale S3:

| System | R@1 |
|---|---|
| V | 0.502 |
| N | 0.721 |
| J | 0.845 |

Everything is local: Ollama (gemma4:e4b by default) + BAAI/bge-m3 embeddings + ChromaDB. No cloud LLM account.

    git clone https://github.com/Hashevolution/James-RAG-Evol
    cd James-RAG-Evol
    cp .env.example .env
    pip install -r requirements.txt
    ollama pull gemma4:e4b
    # benchmark runners live in scripts/research/ lrb run .py, rab

These are benchmarks, not a victory lap.
JAMES hitting 1.0/1.0/1.0 on a scenario I designed is a starting line, not proof of general superiority. The value is that the scenarios, metrics, and baselines are public and deterministic, so you can run them, disagree, and beat the numbers.

Feedback I'd value most: (a) Does the AC/RF/PC ↔ Art. 10/12/19 mapping hold up under your reading of the text? (b) Is "newest wins" the right Naive-supersede baseline for LRB, or is there a stronger one I should add?
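To make question (a) easier to argue about, here is a minimal sketch of how three metrics shaped like AC, RF, and PC could be computed over a toy audit log. Everything in it is an assumption for illustration: the `AuditLog` structure, the `REQUIRED_EVENTS` set, the `toy_replay` function, and the metric formulas themselves are hypothetical and are not taken from the repository's runners.

```python
from dataclasses import dataclass, field

# Hypothetical event types a scenario might require (not from the repository).
REQUIRED_EVENTS = {"query", "retrieval", "prompt", "generation"}


@dataclass
class AuditLog:
    events: list = field(default_factory=list)  # dicts with at least a "type" key
    claims: list = field(default_factory=list)  # dicts with "text" and "source_id"


def audit_completeness(log, required=REQUIRED_EVENTS):
    """AC (assumed definition): share of required event types present in the log."""
    seen = {event["type"] for event in log.events}
    return len(required & seen) / len(required)


def provenance_coverage(log):
    """PC (assumed definition): share of claims that carry a source id."""
    if not log.claims:
        return 0.0
    traced = sum(1 for claim in log.claims if claim.get("source_id"))
    return traced / len(log.claims)


def toy_replay(log):
    """Rebuild the answer from logged retrieval results only (toy generator)."""
    retrieval = next((e for e in log.events if e["type"] == "retrieval"), None)
    if retrieval is None:
        return None  # nothing in the log to replay from
    return " ".join(retrieval["chunks"])


def replay_fidelity(log, original_answer, replay=toy_replay):
    """RF (assumed definition): 1.0 if the replayed answer equals the original."""
    replayed = replay(log)
    return 1.0 if replayed is not None and replayed == original_answer else 0.0


ANSWER = "Policy v2 applies."

full_log = AuditLog(
    events=[
        {"type": "query", "text": "Which policy applies?"},
        {"type": "retrieval", "chunks": ["Policy v2 applies."]},
        {"type": "prompt"},
        {"type": "generation"},
    ],
    claims=[{"text": "Policy v2 applies.", "source_id": "doc-7"}],
)

# Default-style logging: only the incoming query is recorded.
sparse_log = AuditLog(
    events=[{"type": "query", "text": "Which policy applies?"}],
    claims=[{"text": "Policy v2 applies.", "source_id": None}],
)

for name, log in (("full", full_log), ("sparse", sparse_log)):
    print(name, audit_completeness(log), replay_fidelity(log, ANSWER), provenance_coverage(log))
```

The toy numbers are illustrative only and unrelated to the reported S1 scores. The point of the sketch is the shape of the failure: a log that records some events can still score above zero on completeness while giving a replay function nothing to work with.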
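For question (b), a small sketch of why "newest wins" loses on as-of queries may help frame the discussion. It uses plain validity intervals rather than the graph reconstruction JAMES performs, so treat it as a stand-in technique. The `FactVersion` class, the `newest_wins` and `as_of` functions, and the sample data are all hypothetical names and values introduced here, not part of LRB.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"


def newest_wins(versions, key, t):
    """Naive supersede: return the most recent version and ignore the query time t."""
    candidates = [v for v in versions if v.key == key]
    return max(candidates, key=lambda v: v.valid_from, default=None)


def as_of(versions, key, t):
    """Time-aware lookup: return the version whose interval [valid_from, valid_to) contains t."""
    for v in versions:
        if v.key == key and v.valid_from <= t and (v.valid_to is None or t < v.valid_to):
            return v
    return None


def recall_at_1(retrieve, versions, queries):
    """R@1 over (key, query_time, expected_value) triples, one result per query."""
    hits = 0
    for key, t, expected in queries:
        result = retrieve(versions, key, t)
        if result is not None and result.value == expected:
            hits += 1
    return hits / len(queries)


# Illustrative data: a price that was revised on 2025-01-01.
VERSIONS = [
    FactVersion("price", "10 EUR", date(2024, 1, 1), date(2025, 1, 1)),
    FactVersion("price", "12 EUR", date(2025, 1, 1)),
]
QUERIES = [
    ("price", date(2024, 6, 1), "10 EUR"),  # asks about the superseded period
    ("price", date(2025, 6, 1), "12 EUR"),  # asks about the current period
]

naive_r1 = recall_at_1(newest_wins, VERSIONS, QUERIES)
time_aware_r1 = recall_at_1(as_of, VERSIONS, QUERIES)
print(naive_r1, time_aware_r1)  # prints: 0.5 1.0
```

A stronger naive baseline than this one would have to use the query time in some way, for example by filtering on `valid_from <= t` before picking the newest version, which is already most of the way to the time-aware lookup.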