Unified RAG Evaluation Schema: Cross-Supplier Quality Measurement for Amazon Bedrock and Agentic…

This article proposes the Unified RAG Evaluation Schema (URES) to standardize cross-supplier quality measurement for RAG and agentic workloads, so that enterprises can compare evaluation results from different toolkits such as RAGAS and Amazon Bedrock model evaluation. The schema defines a common input and output record format with metrics on a unified 0–1 scale, addressing the NIST AI Risk Management Framework's requirement for repeatable, comparable measurement.

Consider an enterprise running three RAG-backed assistants. Team A evaluates with Amazon Bedrock model evaluation through the console [1]. Team B runs RAGAS in a notebook against a curated golden set [2]. Team C writes a custom LLM-as-judge harness and stores results in its own format. All three report a faithfulness score for what the procurement organization is told is “the same” model. The numbers cannot be compared. Each toolkit names metrics differently, scores them on different scales, expects different input shapes, and persists results in incompatible records.

The pattern repeats across the industry. Two teams in one enterprise running ostensibly identical evaluations produce numbers that nobody can place side by side. A team comparing a Bedrock-hosted judge against an Azure-hosted judge cannot tell whether a score delta reflects model quality or evaluation-method drift. Under the NIST AI Risk Management Framework [3], the Measure function requires repeatable, comparable measurement. That requirement is not satisfiable without a shared evaluation schema.

Enterprises running RAG and agentic workloads should adopt a single evaluation record schema across teams, suppliers, and model versions. The schema defines the input shape (conversation messages, retrieved contexts, expected outputs, evaluation type, judge model, system under test) and the output shape (named metric scores on a unified 0–1 scale, plus evaluator metadata). Toolkits such as RAGAS, Amazon Bedrock model evaluation, and custom LLM-as-judge harnesses are adapted to the schema rather than the other way around. Quality measurement becomes a property of the enterprise, not of any one toolkit.

In URES, the evaluation record is the architectural contract, ahead of any specific evaluator implementation. Four constraints define a URES-compliant record:

Example input record:

```json
{
  "evaluationId": "eval-2026-05-18-tc042",
  "evaluationType": "MT_SESSION",
  "messages": [
    { "role": "human", "content": "..." },
    {
      "role": "ai",
      "content": "...",
      "retrievedContexts": [
        { "title": "...", "url": "...", "snippet": "..." }
      ]
    }
  ],
  "expectedAnswer": "...",
  "goldSourceRef": "s3://eval-artifacts/gold/tc042.json",
  "judgeModel": "bedrock/anthropic.claude-3-5-sonnet",
  "systemUnderTest": "bedrock/anthropic.claude-3-haiku",
  "retrievalEngine": "bedrock-knowledge-base/kb-<your-kb-id>",
  "timestamp": "2026-05-18T00:00:00Z"
}
```

The retrievalEngine field names an Amazon Bedrock Knowledge Base [4]; on other suppliers it names the equivalent retrieval target. The field is a string identifier, not a binding, so the schema admits any retrieval surface an enterprise actually runs.

Example output record:

```json
{
  "evaluationId": "eval-2026-05-18-tc042",
  "evaluationType": "MT_SESSION",
  "metrics": {
    "faithfulness": 0.92,
    "answerRelevancy": 0.88,
    "contextPrecision": 0.85,
    "contextRecall": 0.90,
    "conversationCoherence": 0.87,
    "sessionOutcome": "SELF_SERVED_KNOWLEDGE",
    "taskCompletion": 0.95
  },
  "evaluatorMetadata": {
    "framework": "ragas",
    "frameworkVersion": "0.2.x",
    "judgeModel": "bedrock/anthropic.claude-3-5-sonnet"
  },
  "timestamp": "2026-05-18T00:05:12Z"
}
```

A minimal RAGAS adapter reads a URES input record, computes a fixed subset of metrics, and emits a URES output record:

```python
from datetime import datetime, timezone
from importlib.metadata import version

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)


def ragas_adapter(input_record: dict) -> dict:
    # to_ragas_dataset is a helper (not shown here) that maps the URES
    # messages and retrieved contexts onto the dataset shape RAGAS expects.
    dataset = to_ragas_dataset(input_record)
    scores = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    # Indexing the result by metric name is assumed to yield one aggregate
    # score; RAGAS versions that return per-sample lists need averaging first.
    return {
        "evaluationId": input_record["evaluationId"],
        "evaluationType": input_record["evaluationType"],
        "metrics": {
            "faithfulness": float(scores["faithfulness"]),
            "answerRelevancy": float(scores["answer_relevancy"]),
            "contextPrecision": float(scores["context_precision"]),
            "contextRecall": float(scores["context_recall"]),
        },
        "evaluatorMetadata": {
            "framework": "ragas",
            "frameworkVersion": version("ragas"),
            "judgeModel": input_record["judgeModel"],
        },
        # UTC timestamp in the same format as the example records.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

An Amazon Bedrock model evaluation adapter wraps a Bedrock CreateEvaluationJob call [1] and maps its native task-type scores into the same URES output shape. A custom LLM-as-judge adapter dispatches a structured prompt to any Bedrock-hosted model and parses the JSON response into the same shape. The schema is the contract; the rest is replaceable.

The evaluation record is the boundary. Adapters conform to it; consumers depend on it. Replace any toolkit without touching dashboards, CI gates, or governance reports.

URES records persist as JSON in Amazon S3, partitioned by date and evaluation type. An AWS Glue catalog exposes the records as a table queried via Amazon Athena. Amazon QuickSight surfaces cross-team and cross-supplier dashboards. CI/CD pipelines query the same table to enforce regression gates: a faithfulness score below a per-cohort threshold blocks a release. No record transformation occurs between producer and consumer. The same shape works against any object store and query engine; S3/Athena is one realization, not a requirement.

The Athena table is declared once against the URES output schema and reused across every team:

```bash
aws athena start-query-execution \
  --output json \
  --query-string "
    CREATE EXTERNAL TABLE ures_eval_results (
      evaluationId string,
      evaluationType string,
      metrics map<string,string>,
      evaluatorMetadata struct<framework:string, frameworkVersion:string, judgeModel:string>,
      \`timestamp\` string
    )
    PARTITIONED BY (eval_date string, eval_type string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 's3://ures-eval-records/'
    TBLPROPERTIES ('projection.enabled'='true');
  " \
  --work-group "ures-evaluation"
```

Partition projection eliminates per-partition MSCK REPAIR TABLE calls; the table is queryable as soon as the first records land in Amazon S3. Two details matter in practice. First, timestamp is a reserved word in Athena DDL, so the column name is backtick-quoted, and the backticks are escaped because the statement sits inside a double-quoted shell string. Second, projection.enabled alone is not sufficient: each partition column (eval_date, eval_type) also needs its own projection.<column>.* properties in TBLPROPERTIES, which are omitted here.
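
The regression gate mentioned above (a faithfulness score below a per-cohort threshold blocks a release) can be sketched in a few lines of Python. This is a minimal illustration and not part of the URES definition: it assumes the output records have already been loaded as dictionaries with numeric metric values (values read back through the map<string,string> Athena column would need casting first). The function names `validate_metrics` and `gate_release`, the `THRESHOLDS` table, and the use of evaluationType as the cohort key are all hypothetical.

```python
from numbers import Real

# Hypothetical per-cohort minimums, keyed by evaluationType.
# Real values would come from an enterprise's own governance policy.
THRESHOLDS = {"MT_SESSION": {"faithfulness": 0.85, "contextRecall": 0.80}}


def validate_metrics(record: dict) -> dict:
    """Return the numeric metrics of a URES output record, enforcing the 0-1 scale."""
    numeric = {}
    for name, value in record["metrics"].items():
        # Categorical fields such as sessionOutcome are strings and are skipped.
        if isinstance(value, bool) or not isinstance(value, Real):
            continue
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside the unified 0-1 scale")
        numeric[name] = float(value)
    return numeric


def gate_release(records: list[dict], thresholds: dict = THRESHOLDS) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    violations = []
    for record in records:
        scores = validate_metrics(record)
        limits = thresholds.get(record["evaluationType"], {})
        for metric, minimum in limits.items():
            score = scores.get(metric)
            if score is None:
                violations.append(f"{record['evaluationId']}: {metric} missing")
            elif score < minimum:
                violations.append(
                    f"{record['evaluationId']}: {metric} {score:.2f} < {minimum:.2f}"
                )
    return violations


if __name__ == "__main__":
    # Sample taken from the example output record in this article.
    sample = {
        "evaluationId": "eval-2026-05-18-tc042",
        "evaluationType": "MT_SESSION",
        "metrics": {
            "faithfulness": 0.92,
            "answerRelevancy": 0.88,
            "contextPrecision": 0.85,
            "contextRecall": 0.90,
            "conversationCoherence": 0.87,
            "sessionOutcome": "SELF_SERVED_KNOWLEDGE",
            "taskCompletion": 0.95,
        },
    }
    problems = gate_release([sample])
    print("PASS" if not problems else "\n".join(problems))
```

Run as a script, the sample record clears both hypothetical thresholds and the script prints PASS. Because the gate reads only the URES output shape, the same check applies whether the record came from the RAGAS adapter, a Bedrock model evaluation adapter, or a custom judge harness.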
Enterprises that have adopted the Evidence-Logged Agent Loop (EGAL) pattern [5] have a natural upstream source for URES inputs: EGAL-compliant records already capture per-call retrieval context and model responses in structured form. URES is the measurement layer that transforms those evidence records into quality scores. Together with the Stateless HTTP Container Isolation (SHCI) runtime discipline [6], EGAL and URES form a coherent production stack for agentic AI on Amazon Bedrock. EGAL captures the evidence, SHCI keeps the runtime honest, and URES measures the quality of what those two have produced.

Fits when:

Doesn’t fit when:

Operational considerations:

URES does not solve metric selection. The schema mandates how scores are recorded, not which metrics measure quality for a given workload. Multimodal evaluation, tool-using agent evaluation, and multi-agent orchestration evaluation are still maturing, and current URES metric vocabularies cover only the first generation of these. Adopters should expect to extend the schema as those vocabularies stabilize.

URES also does not eliminate judge bias. A unified schema makes judge model identity a first-class field, which improves traceability, but does not remove the systematic bias a particular judge introduces. Enterprises running URES at scale should rotate judge models or run dual-judge consensus on high-stakes evaluations, and treat metric scores as conditioned on judge identity, not absolute. A unified scale is not a unified ground truth.

As US enterprises and federal programs adopt the NIST AI Risk Management Framework, its Measure function requires repeatable, comparable evaluation of AI system characteristics across suppliers and time.
That requirement depends on a shared evaluation schema rather than supplier-specific evaluation surfaces. A schema-first measurement layer is the precondition that lets EGAL’s evidence and SHCI’s runtime discipline be evaluated together, on the same numeric ground, across every supplier an enterprise has reason to compare.

[1] Amazon Web Services, Amazon Bedrock: Model Evaluation. https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html . AWS Documentation.
[2] Exploding Gradients, RAGAS: Available Metrics for RAG Evaluation. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/ . RAGAS Documentation.
[3] National Institute of Standards and Technology, AI Risk Management Framework (AI RMF 1.0). https://www.nist.gov/itl/ai-risk-management-framework . NIST, 2023.
[4] Amazon Web Services, Amazon Bedrock Knowledge Bases. https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html . AWS Documentation.
[5] N. Selvaraj, The Evidence-Logged Agent Loop: Structured Tool-Call Logging for Agentic Systems. https://medium.com/@natarajaninbox/the-evidence-logged-agent-loop-structured-tool-call-logging-for-agentic-systems-dfc511b6735f . Medium, 2026.
[6] N. Selvaraj, Stateless HTTP Container Isolation: Why MCP Servers on Serverless Runtimes Must Disable Session-Based Routing. https://medium.com/@natarajaninbox/stateless-http-container-isolation-why-mcp-servers-on-serverless-runtimes-must-disable-d4c6abe1ac5a . Medium, 2026.

Originally published in Towards AI (https://pub.towardsai.net) on Medium: https://pub.towardsai.net/unified-rag-evaluation-schema-cross-supplier-quality-measurement-for-amazon-bedrock-and-agentic-02b4364351c0