Unified RAG Evaluation Schema: Cross-Supplier Quality Measurement for Amazon Bedrock and Agentic…

Amazon proposes the Unified RAG Evaluation Schema (URES) to standardize cross-supplier quality measurement for RAG and agentic workloads, so that enterprises can compare evaluation results from different toolkits such as RAGAS and Amazon Bedrock model evaluation. The schema defines a common input and output record format with metrics on a unified 0–1 scale, addressing the NIST AI Risk Management Framework's requirement for repeatable, comparable measurement.

Consider an enterprise running three RAG-backed assistants. Team A evaluates with Amazon Bedrock model evaluation through the console [1]. Team B runs RAGAS in a notebook against a curated golden set [2]. Team C writes a custom LLM-as-judge harness and stores results in its own format. All three report a faithfulness score for what the procurement organization is told is “the same” model. The numbers cannot be compared. Each toolkit names metrics differently, scores them on different scales, expects different input shapes, and persists results in incompatible records.

The pattern repeats across the industry. Two teams in one enterprise running ostensibly identical evaluations produce numbers that nobody can compare side by side. A team comparing a Bedrock-hosted judge against an Azure-hosted judge cannot tell whether a score delta reflects model quality or evaluation-method drift. Under the NIST AI Risk Management Framework [3], the Measure function requires repeatable, comparable measurement. That requirement cannot be satisfied without a shared evaluation schema.

Enterprises running RAG and agentic workloads should adopt a single evaluation record schema across teams, suppliers, and model versions. The schema defines the input shape (conversation messages, retrieved contexts, expected outputs, evaluation type, judge model, system under test) and the output shape (named metric scores on a unified 0–1 scale, plus evaluator metadata).
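The input and output shapes described above can be sketched as a pair of typed records. This is a minimal illustration, not the URES definition: the class names, the snake_case field names, the evaluator metadata keys, and the `to_unit_scale` helper are all assumptions made for the example, and linear rescaling is only one possible way to map a toolkit-native score onto the 0–1 scale.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    """One conversation turn (hypothetical shape)."""
    role: str                       # e.g. "human" or "ai"
    content: str
    retrieved_contexts: list[dict] = field(default_factory=list)


@dataclass
class EvaluationInput:
    """Input shape: what was evaluated, by which judge, against what."""
    evaluation_type: str
    messages: list[Message]
    judge_model: str
    system_under_test: str
    expected_answer: Optional[str] = None


@dataclass
class EvaluationOutput:
    """Output shape: named metric scores on a 0-1 scale plus metadata."""
    metrics: dict[str, float]
    evaluator: dict[str, str]       # metadata keys here are assumptions


def to_unit_scale(raw: float, lo: float, hi: float) -> float:
    """Linearly rescale a score from [lo, hi] to [0, 1], clamping outliers.

    Assumption: the native scale is linear and higher is better.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (raw - lo) / (hi - lo)))


# A custom harness that scores faithfulness on a 1-5 scale reports 4.
result = EvaluationOutput(
    metrics={"faithfulness": to_unit_scale(4, lo=1, hi=5)},  # 0.75
    evaluator={"toolkit": "custom-llm-judge"},
)
```

Keeping the rescaling in one small function means each toolkit adapter only has to declare its native score range, which makes the conversion auditable when two teams' numbers disagree.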
Toolkits such as RAGAS, Amazon Bedrock model evaluation, and custom LLM-as-judge harnesses are adapted to the schema rather than the other way around. Quality measurement becomes a property of the enterprise, not of any one toolkit.

In URES, the evaluation record is the architectural contract, ahead of any specific evaluator implementation. Four constraints define a URES-compliant record:

{
  "evaluationId": "eval-2026-05-18-tc042",
  "evaluationType": "MT SESSION",
  "messages": [
    { "role": "human", "content": "..." },
    {
      "role": "ai",
      "content": "...",
      "retrievedContexts": [
        { "title": "...", "url": "...", "snippet": "..." }
      ]
    }
  ],
  "expectedAnswer": "...",
  "goldSourceRef": "s3://eval-artifacts/gold/tc042.json",
  "judgeModel": "bedrock/anthropic.claude-3-5-sonnet",
  "systemUnderTest": "bedrock/anthropic.claude-3-haiku",
  "retrievalEngine": "bedrock-knowledge-base/kb-