Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)


Source: dev.to · Published 2026-06-14T06:10Z


A developer released two pre-registered, deterministic benchmarks for audit-native retrieval-augmented generation (RAG): the RAB (EU AI Act 10/12/19) and LRB (Time-Travel Retrieval). RAB measures audit completeness, replay fidelity, and provenance coverage, mapping directly to EU AI Act Articles 10, 12, and 19, while LRB tests time-aware retrieval against stale facts. The benchmarks are fully local and open-source, with results showing that the JAMES system achieves perfect scores on RAB and outperforms naive and vector-only baselines on LRB.


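Most RAG demos answer "what's the right chunk?" Very few can answer the two questions a regulator or an auditor will actually ask: can the decision be replayed from its audit trail, and which fact was valid at the time of the query?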

I got tired of hand-waving at both, so I shipped two pre-registered, deterministic benchmarks alongside JAMES: RAB and LRB.

RAB measures whether your audit trail is good enough to replay a decision, with three deterministic metrics:

Metric	What it checks	EU AI Act
AC (Audit Completeness)	Is every decision-relevant event logged?	Art. 10
RF (Replay Fidelity)	Can you re-derive the answer from the log alone?	Art. 12
PC (Provenance Coverage)	Does every claim trace to a source?	Art. 19

The three metrics map directly to EU AI Act Articles 10 (data and data governance), 12 (record-keeping), and 19 (automatically generated logs), obligations that apply from 2026-08-02 (per Article 113).
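To make the three metrics concrete, here is a minimal sketch of how AC, RF, and PC could be computed for a single logged decision. It is not the RAB implementation: the `DecisionRecord` type, the `REQUIRED_EVENTS` list, and the toy `replay_from_log` function are all assumptions made up for illustration, and the real scenarios define their own event lists and replay procedure.

```python
from dataclasses import dataclass, field

# Hypothetical list of event kinds a decision is expected to log.
# The real RAB scenarios define their own.
REQUIRED_EVENTS = ("query", "retrieval", "prompt", "model_output")


@dataclass
class DecisionRecord:
    logged_events: dict   # event kind -> payload
    answer: str           # answer the system actually returned
    claims: list = field(default_factory=list)  # (claim text, source id or None)


def audit_completeness(rec):
    """AC: fraction of required event kinds present in the log."""
    present = sum(1 for kind in REQUIRED_EVENTS if kind in rec.logged_events)
    return present / len(REQUIRED_EVENTS)


def replay_fidelity(rec, replay):
    """RF: 1.0 if replaying from the log alone reproduces the answer, else 0.0."""
    try:
        return 1.0 if replay(rec.logged_events) == rec.answer else 0.0
    except KeyError:
        # The log lacks something the replay needs.
        return 0.0


def provenance_coverage(rec):
    """PC: fraction of claims that carry a source id."""
    if not rec.claims:
        return 0.0
    sourced = sum(1 for _, src in rec.claims if src is not None)
    return sourced / len(rec.claims)


def replay_from_log(events):
    # Toy replay (assumption): the logged model output is the answer.
    return events["model_output"]


full = DecisionRecord(
    logged_events={"query": "q", "retrieval": ["d1"], "prompt": "p", "model_output": "42"},
    answer="42",
    claims=[("the answer is 42", "d1")],
)
sparse = DecisionRecord(
    logged_events={"query": "q"},
    answer="42",
    claims=[("the answer is 42", None)],
)

for name, rec in (("full", full), ("sparse", sparse)):
    print(name, audit_completeness(rec), replay_fidelity(rec, replay_from_log), provenance_coverage(rec))
```

The sparse record shows the pattern the benchmark is built to expose: some events are logged, so AC is above zero, but nothing in the log lets you re-derive the answer or trace a claim to a source.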

Scenario S1 result:

                 AC      RF      PC
JAMES          1.000   1.000   1.000
Baseline-0     0.275   0.000   0.000   (vanilla default-logging)

The gap is the whole point. "We have logs" (AC 0.275) is not the same as "we can replay the decision" (RF 0). Default application logging gets you a partial event trail and zero replay/provenance, which is exactly the failure mode an Article 12 audit would surface.

RAG facts go stale. A policy is superseded, a price changes, a spec is revised. LRB asks: when you query as of a point in time, do you retrieve the fact that was valid then, or whatever overwrote it?
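The as-of question can be illustrated with a small sketch that contrasts a "newest wins" lookup with one that respects validity intervals. This is not how JAMES does it: the `FactVersion` type, the `HISTORY` data, and both lookup functions are hypothetical, and real retrieval would sit on top of an embedding search rather than an exact key match.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still valid


# Hypothetical history: a policy that was superseded on 2025-01-01.
HISTORY = [
    FactVersion("refund_window", "30 days", date(2024, 1, 1), date(2025, 1, 1)),
    FactVersion("refund_window", "14 days", date(2025, 1, 1)),
]


def newest_wins(key, history):
    """Ignore the query time and return the most recent version."""
    versions = [f for f in history if f.key == key]
    if not versions:
        return None
    return max(versions, key=lambda f: f.valid_from).value


def as_of(key, t, history):
    """Return the version whose interval [valid_from, valid_to) contains t."""
    for f in history:
        if f.key == key and f.valid_from <= t and (f.valid_to is None or t < f.valid_to):
            return f.value
    return None


print(newest_wins("refund_window", HISTORY))              # 14 days
print(as_of("refund_window", date(2024, 6, 1), HISTORY))  # 30 days
```

A query dated mid-2024 should return the 30-day policy. The newest-wins lookup returns the 14-day one instead, which is the stale-fact error LRB is designed to count.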

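Three systems compared:

V: vector-only retrieval
N: Naive-supersede ("newest wins")
J: JAMES, time-aware retrieval (reconstruct_graph_at(t))

The R@1 ordering V < N < J holds across 4 model families × 4 scale points (a 12.5× scale span): time-aware retrieval beats both naive supersede and vector-only retrieval.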

At publication scale (S3):

        R@1
V       0.502
N       0.721
J       0.845
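Reading R@1 as recall at rank 1 (an assumption, since the benchmark's exact definition is not given here), the score is the fraction of queries whose top-ranked result is the expected fact. A minimal sketch, with made-up query and document ids:

```python
def recall_at_1(ranked_results, expected):
    """Fraction of queries whose top-ranked id equals the expected id."""
    if not expected:
        raise ValueError("no queries to score")
    hits = sum(
        1
        for qid, gold in expected.items()
        if ranked_results.get(qid) and ranked_results[qid][0] == gold
    )
    return hits / len(expected)


# Hypothetical gold labels and ranked retrieval output.
expected = {"q1": "policy_v1", "q2": "price_v2", "q3": "spec_v1", "q4": "policy_v2"}
ranked = {
    "q1": ["policy_v1", "policy_v2"],  # hit
    "q2": ["price_v3", "price_v2"],    # miss: right fact is only at rank 2
    "q3": ["spec_v1"],                 # hit
    "q4": [],                          # miss: nothing retrieved
}

print(recall_at_1(ranked, expected))  # 0.5
```

Under this reading, a superseded version at rank 1 counts as a full miss even when the correct version sits at rank 2, which is why time-unaware systems lose ground on this metric.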

Everything is local: Ollama (gemma4:e4b default) + BAAI/bge-m3 embeddings + ChromaDB. No cloud LLM account.

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a scenario I designed is a starting line, not proof of general superiority. The value is that the scenarios, metrics, and baselines are public and deterministic, so you can run them, disagree, and beat the numbers.

Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping hold up under your reading of the text? (b) is "newest wins" the right Naive-supersede baseline for LRB, or is there a stronger one I should add?


Read original on dev.to → dev.to/hashevolution/two-pre-registered-benchmar…
