AI Agent Failure Detection and Root Cause Analysis with Strands Evals

AWS introduced detectors in the Strands Evals SDK, which automatically identify failures in AI agent execution traces and perform root cause analysis, reducing diagnosis time from hours to minutes. Detectors categorize failures, identify causal chains, and recommend fixes for system prompts or tool definitions, complementing traditional evaluation scores.

When your AI agent fails in production, knowing that it failed is only the beginning. The harder question is why it failed and what to fix. Traditional evaluation tells you “this agent scored 60 percent on goal completion,” but leaves you manually reviewing execution traces to understand what went wrong. For teams operating agents at scale, this manual diagnosis becomes the bottleneck between detecting a problem and shipping a fix.

Detectors in the Strands Evals SDK (https://github.com/strands-agents/evals) remove this bottleneck by automatically identifying failures in agent execution traces and performing root cause analysis, so you can reduce diagnosis time from hours to minutes.

In this post, we walk you through calling the detector functions to diagnose real agent failures. You learn how to interpret their structured output: categorized failures with confidence scores, causal chains linking root causes to downstream symptoms, and fix recommendations specifying whether a change belongs in your system prompt or tool definitions. You also learn how to integrate detection into your evaluation pipeline for automated diagnosis on every test run. Detectors complement the evaluation framework introduced in a previous post (https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/) by answering not only “how well did the agent do?” but also “why did it fail and how do I fix it?”

Prerequisites

To follow along with this post, you need the following:

- Python 3.10 or later.
- Strands Evals SDK, installed with pip install strands-agents-evals.
- Amazon Bedrock model access enabled (detectors use large language model (LLM)-based analysis).
- For Amazon CloudWatch examples, AWS credentials configured with logs:StartQuery and logs:GetQueryResults permissions.
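
As a rough illustration of the CloudWatch prerequisite, the following sketch builds an IAM policy document that grants the two permissions listed above. The log group ARN, account ID, Region, and statement IDs are hypothetical placeholders. The sketch also assumes that logs:GetQueryResults cannot be scoped to a single log group and therefore uses a wildcard resource, so check the IAM documentation for CloudWatch Logs before adopting it.

```python
import json

# Hypothetical placeholder: replace with your own Region, account ID, and log group.
LOG_GROUP_ARN = "arn:aws:logs:us-east-1:111122223333:log-group:my-agent-traces:*"


def build_logs_insights_policy(log_group_arn: str) -> dict:
    """Return an IAM policy document granting the two permissions from the prerequisites."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Starting a Logs Insights query is scoped to the trace log group.
                "Sid": "StartTraceQueries",
                "Effect": "Allow",
                "Action": "logs:StartQuery",
                "Resource": log_group_arn,
            },
            {
                # Assumption: reading query results is not tied to a log group resource.
                "Sid": "ReadQueryResults",
                "Effect": "Allow",
                "Action": "logs:GetQueryResults",
                "Resource": "*",
            },
        ],
    }


policy = build_logs_insights_policy(LOG_GROUP_ARN)
print(json.dumps(policy, indent=2))
```

Scoping logs:StartQuery to the specific log group keeps the credentials used for diagnosis from querying unrelated logs.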

Why scores alone are not enough

The Strands Evals framework provides reliable quality signals through Cases, Experiments, and Evaluators: goal success rates, tool selection accuracy, and helpfulness scores. These are important for catching regressions and understanding performance at a statistical level.

But consider what happens after you detect a regression. Your agent’s goal success rate drops from 85 percent to 70 percent after a deployment, or after prompt or tool changes in build-time testing. Evaluators confirm the drop. Now what? You must identify which specific behaviors caused failures, distinguish root causes from downstream symptoms, determine whether the fix belongs in the system prompt or tool definitions, and prioritize by impact. This diagnosis workflow has traditionally required senior engineers to manually inspect traces span by span and correlate failures across hundreds of steps, and this process doesn’t scale.

Detectors automate this workflow. Evaluators answer “how well did the agent do?” by producing scores at the per-case level. Detectors answer “why did it fail?” by producing diagnoses at the per-span level with categorized failures, causal chains, and fix recommendations.

How detectors work

The detector pipeline operates in two phases, each powered by LLM-based analysis of the execution trace. Refer to Understand observability for agentic resources in Amazon Bedrock AgentCore (https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-telemetry.html) to learn more about sessions, traces, and spans of agents.

Phase 1: Failure detection scans each span in a session against a comprehensive failure taxonomy organized into nine parent categories: hallucination, incorrect actions, orchestration errors, task instruction non-compliance, execution errors, context handling errors, repetitive behavior, LLM output issues, and configuration mismatch.
For each identified failure, it returns the span location, one or more categories, a confidence score, and evidence extracted from the trace.

Phase 2: Root cause analysis takes the detected failures and traces causal chains between them. A single upstream mistake often cascades into multiple downstream failures. Root cause analysis separates causes from symptoms. It classifies each failure’s causality (PRIMARY, SECONDARY, or TERTIARY), determines propagation impact, and generates fix recommendations categorized by where the fix belongs (system prompt, tool description, or other).

Both phases handle sessions of varying sizes through a tiered strategy:

- Direct analysis for sessions that fit within the context window of the selected detector model.
- Failure path pruning, which retains only ancestor and descendant spans, for moderately large sessions.
- Chunked analysis with merge for very large sessions, which splits the trace into overlapping windows and reconciles the results.

The following diagram shows the end-to-end pipeline with two entry points converging into the same detection and analysis flow.

Figure: Detector pipeline with integrated and standalone entry points flowing into failure detection and root cause analysis.

Getting started with failure detection

The following examples use a session trace from the drug discovery research assistant featured in Evaluating AI agents for production: A practical guide to Strands Evals (https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/). The agent is built on Strands Agents and Amazon Bedrock. To follow along, run your agent with OpenTelemetry (https://opentelemetry.io/) tracing enabled and export the session as JSON, or use the CloudWatchProvider shown later in this post to fetch an existing trace.
Refer to User Simulation in the Strands Agents SDK documentation (https://strandsagents.com/docs/user-guide/evals-sdk/simulators/user simulation/ complete-example-customer-service-evaluation) for how to set up tracing and export sessions.

The detect_failures function takes a Session object (the standard trace format in Strands Evals) and returns structured failures. Each failure includes the span where it occurred, one or more categories from the predefined failure taxonomy, a confidence score, and evidence extracted from the trace.

The example session comes from a research agent that was asked to “Research the impact of energy requirements for powering AI in the real world.” The agent encountered tool configuration issues and progressively degraded. In a single pass, the detector identifies failures at multiple levels: execution errors (tool parameter validation), semantic issues (hallucinating from “general knowledge”), and orchestration problems (full goal deviation). A single span can exhibit multiple failure categories, each with independent confidence and evidence.

Adding root cause analysis

Identifying failures is useful, but understanding why they happened is what drives fixes. The analyze_root_cause function takes detected failures and traces causal chains between them, separating root causes from downstream symptoms and recommending where each fix belongs. If failures aren’t provided to analyze_root_cause, it runs failure detection automatically.

Continuing with the same research agent session, root cause analysis reveals the causal structure. The distinction between fix types is what makes root cause analysis actionable. The tool schema error is a TOOL_DESCRIPTION_FIX because the retrieve tool’s knowledgeBaseId isn’t documented clearly. The downstream hallucination is a SYSTEM_PROMPT_FIX because the system prompt lacks instructions for handling persistent tool failures. Fixing only one category leaves the other unaddressed.
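
To make the relationship between causality and fix type concrete, here is a minimal sketch that orders root causes so that PRIMARY ones come first and groups their recommendations by where the fix belongs. The Causality, FixType, and RootCause types, the plan_fixes helper, the span IDs, and the recommendation text are all hypothetical and exist only for illustration. They are not the SDK's data model, and the records are not real detector output.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Causality(Enum):
    PRIMARY = 1
    SECONDARY = 2
    TERTIARY = 3


class FixType(Enum):
    SYSTEM_PROMPT_FIX = "system prompt"
    TOOL_DESCRIPTION_FIX = "tool description"
    OTHER = "other"


@dataclass(frozen=True)
class RootCause:
    span_id: str
    causality: Causality
    fix_type: FixType
    recommendation: str


def plan_fixes(root_causes):
    """Group recommendations by fix type, visiting root causes before downstream symptoms."""
    grouped = defaultdict(list)
    # sorted() is stable, so PRIMARY causes are handled first and keep their relative order.
    for cause in sorted(root_causes, key=lambda c: c.causality.value):
        # Skip recommendations already recorded for this fix type (simple deduplication).
        if cause.recommendation not in grouped[cause.fix_type]:
            grouped[cause.fix_type].append(cause.recommendation)
    return dict(grouped)


# Illustrative records loosely modeled on the session described above.
causes = [
    RootCause("span-7", Causality.SECONDARY, FixType.SYSTEM_PROMPT_FIX,
              "Tell the agent how to respond when a tool keeps failing."),
    RootCause("span-3", Causality.PRIMARY, FixType.TOOL_DESCRIPTION_FIX,
              "Document the knowledgeBaseId parameter of the retrieve tool."),
]

for fix_type, recommendations in plan_fixes(causes).items():
    print(fix_type.name, recommendations)
```

Because the primary cause is visited first, the tool description change appears at the top of the plan, which matches the idea that fixing the upstream cause may make some downstream recommendations unnecessary.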

Integrated diagnosis with diagnose_session

For convenience, diagnose_session runs both phases as a single pipeline (detect failures, then analyze root causes) and returns a unified DiagnosisResult with deduplicated recommendations. This produces the same failures and root causes as running detect_failures and analyze_root_cause separately, packaged into a single result with recommendations deduplicated across all root causes. From one function call, you get a prioritized list of concrete changes categorized by where they belong.

Integration with evaluation pipelines

Detectors provide additional value when you integrate them into your existing evaluation workflow. The DiagnosisConfig attaches automated diagnosis to any experiment, so that every failing test case automatically produces a diagnosis.

Two trigger modes are available:

- ON_FAILURE (the default) runs diagnosis only when at least one evaluator returns test_pass=False, making it cost-efficient for continuous integration and continuous delivery (CI/CD) regression detection.
- ALWAYS runs diagnosis on every case regardless of outcome, which is useful for identifying suboptimal paths in nominally passing cases.

With this integration, your CI/CD pipeline tells you not only that “3 tests failed” but also why they failed and what to change. This closes the feedback loop: define cases, run the experiment, get scores and diagnosis together, apply the recommended fixes, and re-run to confirm.

Note: Running detectors uses Amazon Bedrock inference for LLM-based analysis, which incurs charges. See Amazon Bedrock pricing (https://aws.amazon.com/bedrock/pricing/) for details. Amazon CloudWatch Logs storage also incurs charges. See Amazon CloudWatch pricing (https://aws.amazon.com/cloudwatch/pricing/) for details. Monitor your usage in AWS Cost Explorer, especially when integrating detectors into CI/CD pipelines that run frequently.
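
The two trigger modes described in this section reduce to a small decision rule, sketched below. DiagnosisTrigger and should_diagnose are hypothetical names used only to illustrate the rule. This is not how the SDK implements DiagnosisConfig, and the cost comment is just a reminder of the trade-off noted above.

```python
from enum import Enum


class DiagnosisTrigger(Enum):
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_diagnose(trigger, evaluator_passes):
    """Decide whether to run diagnosis for one test case.

    evaluator_passes holds one boolean per evaluator (True means the evaluator passed).
    """
    if trigger is DiagnosisTrigger.ALWAYS:
        # Every case is diagnosed, so LLM cost scales with the number of cases.
        return True
    # ON_FAILURE: diagnose only when at least one evaluator reported a failure.
    return any(not passed for passed in evaluator_passes)


print(should_diagnose(DiagnosisTrigger.ON_FAILURE, [True, True]))
print(should_diagnose(DiagnosisTrigger.ON_FAILURE, [True, False]))
print(should_diagnose(DiagnosisTrigger.ALWAYS, [True, True]))
```

Under this rule, a suite in which every evaluator passes triggers no diagnosis calls in ON_FAILURE mode, which is why that mode keeps cost proportional to the failure rate.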

Diagnosing production sessions from CloudWatch

The preceding examples use local session files, but in production your agent traces live in Amazon CloudWatch Logs, exported with OpenTelemetry. The CloudWatchProvider fetches traces directly from Amazon CloudWatch and converts them into Session objects that you can analyze with detectors.

Under the hood, the provider queries Amazon CloudWatch Logs Insights for OTEL records matching the session ID, auto-detects the agent framework (Strands, LangChain, or others) from span metadata, and maps the spans into a standardized Session. Detectors work with any framework that exports OpenTelemetry traces to Amazon CloudWatch, not only Strands Agents.

You can also combine this with the experiment pipeline for offline evaluation: use CloudWatchProvider to evaluate and diagnose historical production sessions without re-running the agent. To retrieve traces from Langfuse or OpenSearch instead, use LangfuseProvider or OpenSearchProvider.

Best practices

Start with MEDIUM confidence. MEDIUM provides a good signal-to-noise ratio for routine use. The LOW threshold catches more potential issues but includes more noise, which is useful for deep investigation of a specific failing case. Reserve HIGH for production monitoring where you only want high-certainty findings.

Use ON_FAILURE in CI/CD and ALWAYS for periodic audits. ON_FAILURE keeps LLM costs proportional to failure rates, making it practical for every test run. Schedule ALWAYS-mode runs weekly or per release to catch suboptimal behaviors hiding in passing cases.

Fix PRIMARY failures first. Secondary and tertiary failures often resolve when their root cause is addressed. Before implementing multiple recommendations, check whether fixing the primary failure removes the downstream ones. This reduces iteration cycles.

Group recommendations by fix type. Batch TOOL_DESCRIPTION_FIX changes together and SYSTEM_PROMPT_FIX changes together.
This makes the impact of each change category independently measurable when you re-run evaluation.

Pass pre-detected failures to analyze_root_cause. If you have already run detect_failures and want to inspect the results before running root cause analysis, pass them directly to avoid redundant detection.

Use the test session for experimentation. The flawed_session.json used in this post is available in the Strands Evals test suite (https://github.com/strands-agents/evals) for you to try detectors locally.

Clean up resources

The detector functions themselves don’t provision any persistent AWS resources. However, if you configured Amazon CloudWatch Logs export for your agent traces, you might want to review the following:

- Amazon CloudWatch log groups: If you created log groups specifically for testing, delete them through the Amazon CloudWatch console or by running aws logs delete-log-group --log-group-name <your-log-group>. Deleting a log group permanently removes all log data and can’t be undone, so confirm that you have exported any logs you need to retain before proceeding.
- Amazon Bedrock model access: The LLM analysis uses Amazon Bedrock. If you enabled model access solely for this walkthrough, revoke it through the Amazon Bedrock console under Model access.

Conclusion

Detectors close the loop between measuring agent quality and improving it. By automating the failure detection and root cause analysis that previously required manual trace inspection, you can go from “test failed” to “here is what to fix” in minutes instead of hours.

To get started, see the Strands Evals SDK Detectors documentation (https://strandsagents.com/docs/user-guide/evals-sdk/detectors/) and the Strands Evals GitHub repository (https://github.com/strands-agents/evals). Try the included sample trace file, then add DiagnosisConfig to one existing test case in your evaluation pipeline to see automated diagnosis in action.