{"slug": "two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-10-12-19-lrb", "title": "Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)", "summary": "A developer released two pre-registered, deterministic benchmarks for audit-native retrieval-augmented generation (RAG): the RAB (EU AI Act 10/12/19) and LRB (Time-Travel Retrieval). RAB measures audit completeness, replay fidelity, and provenance coverage, mapping directly to EU AI Act Articles 10, 12, and 19, while LRB tests time-aware retrieval against stale facts. The benchmarks are fully local and open-source, with results showing that the JAMES system achieves perfect scores on RAB and outperforms naive and vector-only baselines on LRB.", "body_md": "Most RAG demos answer \"what's the right chunk?\" Very few can answer the two questions a regulator or an auditor will actually ask:\n\nI got tired of hand-waving at both, so I shipped two **pre-registered, deterministic** benchmarks alongside\n\nRAB measures whether your audit trail is good enough to *replay* a decision, with three deterministic metrics:\n\n| Metric | What it checks | EU AI Act |\n|---|---|---|\n| AC: Audit Completeness | Is every decision-relevant event logged? | Art. 10 |\n| RF: Replay Fidelity | Can you re-derive the answer from the log alone? | Art. 12 |\n| PC: Provenance Coverage | Does every claim trace to a source? | Art. 19 |\n\nThe three metrics map **verbatim** to EU AI Act Articles 10, 12, and 19, record-keeping obligations that apply from **2026-08-02** (per Article 113).\n\n**Scenario S1 result:**\n\n```\n                 AC      RF      PC\nJAMES          1.000   1.000   1.000\nBaseline-0     0.275   0.000   0.000   (vanilla default-logging)\n```\n\nThe gap is the whole point. \"We have logs\" (AC 0.275) is not the same as \"we can replay the decision\" (RF 0). 
Default application logging gets you a partial event trail and zero replay/provenance, which is exactly the failure mode an Article 12 audit would surface.\n\nRAG facts go stale. A policy is superseded, a price changes, a spec is revised. LRB asks: when you query *as of* a point in time, do you retrieve the fact that was **valid then**, or whatever overwrote it?\n\nThree systems compared:\n\n- **V**: vector-only retrieval\n- **N**: naive supersede (\"newest wins\")\n- **J**: JAMES, time-aware retrieval (`reconstruct_graph_at(t)`)\n\nThe R@1 ordering **V < N < J holds across 4 model families × 4 scale points** (a 12.5× scale span): time-aware retrieval beats both naive supersede and vector-only retrieval.\n\n**At publication scale (S3):**\n\n```\n        R@1\nV       0.502\nN       0.721\nJ       0.845\n```\n\nEverything is local: Ollama (`gemma4:e4b` default) + BAAI/bge-m3 embeddings + ChromaDB. No cloud LLM account.\n\n```\ngit clone https://github.com/Hashevolution/James-RAG-Evol\ncd James-RAG-Evol\ncp .env.example .env\npip install -r requirements.txt\nollama pull gemma4:e4b\n# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)\n```\n\nThese are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a scenario *I designed* is a starting line, not proof of general superiority. The value is that the scenarios, metrics, and baselines are public and deterministic, so you can run them, disagree, and beat the numbers.\n\nFeedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping hold up under your reading of the text? 
(b) is \"newest wins\" the right Naive-supersede baseline for LRB, or is there a stronger one I should add?", "url": "https://wpnews.pro/news/two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-10-12-19-lrb", "canonical_source": "https://dev.to/hashevolution/two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-101219-lrb-time-travel-369h", "published_at": "2026-06-14 06:10:11+00:00", "updated_at": "2026-06-14 06:28:49.728629+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-policy", "ai-safety", "developer-tools"], "entities": ["JAMES", "RAB", "LRB", "EU AI Act", "Ollama", "ChromaDB", "BAAI/bge-m3", "Hashevolution"], "alternates": {"html": "https://wpnews.pro/news/two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-10-12-19-lrb", "markdown": "https://wpnews.pro/news/two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-10-12-19-lrb.md", "text": "https://wpnews.pro/news/two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-10-12-19-lrb.txt", "jsonld": "https://wpnews.pro/news/two-pre-registered-benchmarks-for-audit-native-rag-rab-eu-ai-act-10-12-19-lrb.jsonld"}}