# AI Agent Failure Detection and Root Cause Analysis with Strands Evals

> Source: <https://aws.amazon.com/blogs/machine-learning/ai-agent-failure-detection-and-root-cause-analysis-with-strands-evals/>
> Published: 2026-06-15 18:07:59+00:00


When your AI agent fails in production, knowing *that* it failed is only the beginning. The harder question is *why* it failed and what to fix. Traditional evaluation tells you “this agent scored 60 percent on goal completion,” but leaves you manually reviewing execution traces to understand what went wrong. For teams operating agents at scale, this manual diagnosis becomes the bottleneck between detecting a problem and shipping a fix. Detectors in the [Strands Evals SDK](https://github.com/strands-agents/evals) remove this bottleneck by automatically identifying failures in agent execution traces and performing root cause analysis, so you can reduce diagnosis time from hours to minutes.

In this post, we walk you through calling the detector functions to diagnose real agent failures. You learn how to interpret their structured output: categorized failures with confidence scores, causal chains linking root causes to downstream symptoms, and fix recommendations specifying whether a change belongs in your system prompt or tool definitions. You also learn how to integrate detection into your evaluation pipeline for automated diagnosis on every test run.

Detectors complement the evaluation framework [introduced in a previous post](https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/) by answering not only “how well did the agent do?” but also “why did it fail and how do I fix it?”

## Prerequisites

You must have the following prerequisites to follow along with this post.

- Python 3.10 or later.
- Strands Evals SDK installed with `pip install strands-agents-evals`.
- Amazon Bedrock model access enabled (detectors use large language model (LLM)-based analysis).
- For Amazon CloudWatch examples, AWS credentials configured with `logs:StartQuery` and `logs:GetQueryResults` permissions.

## Why scores alone are not enough

The Strands Evals framework provides reliable quality signals through Cases, Experiments, and Evaluators: goal success rates, tool selection accuracy, and helpfulness scores. These are important for catching regressions and understanding performance at a statistical level. But consider what happens after you detect a regression. Your agent’s goal success rate drops from 85 percent to 70 percent after a deployment or after prompt or tool changes in build-time testing. Evaluators confirm the drop. Now what?

You must identify which specific behaviors caused failures, distinguish root causes from downstream symptoms, determine whether the fix belongs in the system prompt or tool definitions, and prioritize by impact. This diagnosis workflow has traditionally required senior engineers to manually inspect traces span by span and correlate failures across hundreds of steps, and this process doesn’t scale.

Detectors automate this workflow. Evaluators answer “how well did the agent do?” by producing scores at the per-case level. Detectors answer “why did it fail?” by producing diagnoses at the per-span level with categorized failures, causal chains, and fix recommendations.

## How detectors work

The detector pipeline operates in two phases, each powered by LLM-based analysis of the execution trace. Refer to [Understand observability for agentic resources in Amazon Bedrock AgentCore](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-telemetry.html) to learn more about sessions, traces, and spans of agents.

Phase 1: Failure detection scans each span in a session against a comprehensive failure taxonomy organized into nine parent categories: hallucination, incorrect actions, orchestration errors, task instruction non-compliance, execution errors, context handling errors, repetitive behavior, LLM output issues, and configuration mismatch. For each identified failure, it returns the span location, one or more categories, a confidence score, and evidence extracted from the trace.

Phase 2: Root cause analysis takes the detected failures and traces causal chains between them. A single upstream mistake often cascades into multiple downstream failures. Root cause analysis separates causes from symptoms. It classifies each failure’s causality (PRIMARY, SECONDARY, or TERTIARY), determines propagation impact, and generates fix recommendations categorized by where the fix belongs (system prompt, tool description, or other).

Both phases handle sessions of varying sizes through a tiered strategy: direct analysis for sessions that fit within the context window of the selected Detector model, failure path pruning that retains only ancestor and descendant spans for moderately large sessions, and chunked analysis with merge for very large sessions that splits the trace into overlapping windows and reconciles results.
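To make the two reduction steps in this tiered strategy concrete, the following is a minimal sketch of path pruning and overlapping windows. It is not the SDK's implementation: the `Span` dataclass and the functions `prune_to_failure_paths` and `overlapping_windows` are hypothetical names introduced for illustration, and the window and overlap sizes are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span: an ID plus a link to its parent."""
    span_id: str
    parent_id: Optional[str] = None


def prune_to_failure_paths(spans: list[Span], failed_ids: set[str]) -> list[Span]:
    """Keep each failed span plus its ancestors and descendants, in original order."""
    by_id = {s.span_id: s for s in spans}
    children: dict[str, list[str]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s.span_id)

    keep: set[str] = set()
    for failed in failed_ids:
        if failed not in by_id:
            continue
        path: set[str] = set()
        # Walk up to the root to collect the failed span and its ancestors.
        current = by_id[failed]
        while current is not None and current.span_id not in path:
            path.add(current.span_id)
            parent = current.parent_id
            current = by_id.get(parent) if parent is not None else None
        # Walk down to collect every descendant of the failed span.
        stack = list(children.get(failed, []))
        while stack:
            span_id = stack.pop()
            if span_id not in path:
                path.add(span_id)
                stack.extend(children.get(span_id, []))
        keep |= path
    return [s for s in spans if s.span_id in keep]


def overlapping_windows(items: list, size: int, overlap: int) -> list[list]:
    """Split a sequence into windows of `size` that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(items), step):
        windows.append(items[start:start + size])
        if start + size >= len(items):
            break  # The last window already reaches the end of the sequence.
    return windows
```

The overlap means a failure that straddles a window boundary appears whole in at least one window. Reconciling the per-window results, which the real pipeline also has to do, is left out of this sketch.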

The following diagram shows the end-to-end pipeline with two entry points converging into the same detection and analysis flow.

*Figure: Detector pipeline with integrated and standalone entry points flowing into failure detection and root cause analysis.*

## Getting started with failure detection

The following examples use a session trace from the drug discovery research assistant featured in [Evaluating AI agents for production: A practical guide to Strands Evals](https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/). The agent is built on Strands Agents and Amazon Bedrock. To follow along, run your agent with [OpenTelemetry](https://opentelemetry.io/) tracing enabled and export the session as JSON, or use the `CloudWatchProvider` shown later in this post to fetch an existing trace. Refer to [User Simulation in the Strands Agents SDK documentation](https://strandsagents.com/docs/user-guide/evals-sdk/simulators/user_simulation/#complete-example-customer-service-evaluation) for how to set up tracing and export sessions.

The `detect_failures` function takes a `Session` object (the standard trace format in Strands Evals) and returns structured failures. Each failure includes the span where it occurred, one or more categories from the pre-defined failure taxonomy, a confidence score, and evidence extracted from the trace.

The following is output from a research agent that was asked to “Research the impact of energy requirements for powering AI in the real world.” The agent encountered tool configuration issues and progressively degraded:

In a single pass, the detector identifies failures at multiple levels: execution errors (tool parameter validation), semantic issues (hallucinating from “general knowledge”), and orchestration problems (full goal deviation). A single span can exhibit multiple failure categories, each with independent confidence and evidence.

## Adding root cause analysis

Identifying failures is useful, but understanding why they happened is what drives fixes. The `analyze_root_cause` function takes detected failures and traces causal chains between them, separating root causes from downstream symptoms and recommending where each fix belongs. If failures aren’t provided to `analyze_root_cause`, it runs failure detection automatically.

Continuing with the same research agent session, root cause analysis reveals the causal structure:

The distinction between fix types is what makes root cause analysis actionable. The tool schema error is a `TOOL_DESCRIPTION_FIX` because the retrieve tool’s `knowledgeBaseId` isn’t documented clearly. The downstream hallucination is a `SYSTEM_PROMPT_FIX` because of missing instructions for how to handle persistent tool failures. Fixing only one category leaves the other unaddressed.

## Integrated diagnosis with diagnose_session

For convenience, `diagnose_session` runs both phases as a single pipeline (detect failures, then analyze root causes) and returns a unified `DiagnosisResult` with deduplicated recommendations:

This produces the same failures and root causes shown in the preceding examples, packaged into a single result with recommendations deduplicated across all root causes. From one function call, you get a prioritized list of concrete changes categorized by where they belong.
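As a rough illustration of what deduplicating recommendations across root causes can mean, the following sketch drops repeated recommendations while keeping first-seen order. The `Recommendation` dataclass and the `deduplicate` function are hypothetical and are not part of the SDK. How `diagnose_session` actually matches duplicates isn't described in this post, so the normalization rule here (case and whitespace insensitive) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical recommendation shape: where the fix belongs and what to change."""
    fix_type: str
    text: str


def deduplicate(recommendations: list[Recommendation]) -> list[Recommendation]:
    """Drop repeats, ignoring case and whitespace differences, in first-seen order."""
    seen: set[tuple[str, str]] = set()
    unique: list[Recommendation] = []
    for rec in recommendations:
        # Normalize the text so trivially different wordings collapse together.
        key = (rec.fix_type, " ".join(rec.text.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```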

## Integration with evaluation pipelines

Detectors provide additional value when you integrate them into your existing evaluation workflow. The `DiagnosisConfig` attaches automated diagnosis to any experiment, so that every failing test case automatically produces a diagnosis:

Two trigger modes are available. `ON_FAILURE` (default) runs diagnosis only when at least one evaluator returns `test_pass=False`, making it cost-efficient for continuous integration and continuous delivery (CI/CD) regression detection. `ALWAYS` runs diagnosis on every case regardless of outcome, which is useful for identifying suboptimal paths in nominally passing cases.
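The decision rule behind the two trigger modes can be sketched in a few lines. The `Trigger` enum and `should_diagnose` function below are hypothetical stand-ins written for this illustration. They mirror the behavior described above and are not the SDK's own types.

```python
from enum import Enum


class Trigger(Enum):
    """Hypothetical stand-in for the two trigger modes."""
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_diagnose(trigger: Trigger, evaluator_passes: list[bool]) -> bool:
    """Decide whether a case is diagnosed, given each evaluator's test_pass value."""
    if trigger is Trigger.ALWAYS:
        return True
    # ON_FAILURE: diagnose only if at least one evaluator failed the case.
    return any(not passed for passed in evaluator_passes)
```

Under `ON_FAILURE`, a case where every evaluator passes triggers no LLM call, which is why this mode keeps cost tied to the failure rate.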

With this integration, your CI/CD pipeline tells you not only that “3 tests failed” but also why they failed and what to change. This closes the feedback loop: define cases, run the experiment, get scores and diagnosis together, apply the recommended fixes, and re-run to confirm.

**Note:** Running detectors uses Amazon Bedrock inference for LLM-based analysis, which incurs charges. See [Amazon Bedrock pricing](https://aws.amazon.com/bedrock/pricing/) for details. Amazon CloudWatch Logs storage also incurs charges. See [Amazon CloudWatch pricing](https://aws.amazon.com/cloudwatch/pricing/) for details. Monitor your usage in AWS Cost Explorer, especially when integrating detectors into CI/CD pipelines that run frequently.

## Diagnosing production sessions from CloudWatch

The preceding examples use local session files, but in production your agent traces live in Amazon CloudWatch Logs, exported with OpenTelemetry. The `CloudWatchProvider` fetches traces directly from Amazon CloudWatch and converts them into `Session` objects that you can analyze with detectors:

Under the hood, the provider queries Amazon CloudWatch Logs Insights for OTEL records matching the session ID, auto-detects the agent framework (Strands, LangChain, or others) from span metadata, and maps the spans into a standardized `Session`. Detectors work with any framework that exports OpenTelemetry traces to Amazon CloudWatch, not only Strands Agents.
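For readers curious about the kind of Logs Insights round trip the provider makes, the following sketch shows the `StartQuery` and `GetQueryResults` calls that the prerequisite permissions cover. It is not the provider's code. The function name `fetch_session_records` is hypothetical, and the query uses a plain substring match on `@message` because the provider's actual query and the OTEL field that holds the session ID aren't shown in this post. The `logs_client` argument is expected to behave like `boto3.client("logs")`.

```python
import time
from typing import Any


def fetch_session_records(
    logs_client: Any,
    log_group: str,
    session_id: str,
    start_time: int,
    end_time: int,
    poll_seconds: float = 1.0,
) -> list[dict[str, str]]:
    """Run a Logs Insights query for one session ID and return rows as dicts.

    start_time and end_time are epoch seconds. session_id is interpolated
    into the query, so pass only trusted values.
    """
    # Assumption: a substring match stands in for the provider's real filter.
    query = (
        "fields @timestamp, @message "
        f'| filter @message like "{session_id}" '
        "| sort @timestamp asc"
    )
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )
    # Logs Insights queries are asynchronous, so poll until a terminal status.
    while True:
        response = logs_client.get_query_results(queryId=started["queryId"])
        status = response["status"]
        if status == "Complete":
            break
        if status in {"Failed", "Cancelled", "Timeout", "Unknown"}:
            raise RuntimeError(f"Logs Insights query ended with status {status}")
        time.sleep(poll_seconds)
    # Each row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

To try it against a real log group, create the client with `boto3.client("logs")` and pass it as `logs_client`. Taking the client as an argument keeps the function testable without AWS credentials.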

You can also combine this with the experiment pipeline for offline evaluation: use `CloudWatchProvider` to evaluate and diagnose historical production sessions without re-running the agent. You can also retrieve traces from Langfuse or OpenSearch using `LangfuseProvider` or `OpenSearchProvider`.

## Best practices

**Start with MEDIUM confidence.** The `LOW` threshold catches more potential issues but includes more noise, which is useful for deep investigation of a specific failing case. `MEDIUM` provides a good signal-to-noise ratio for routine use. Reserve `HIGH` for production monitoring where you only want high-certainty findings.
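The effect of a confidence threshold can be pictured as a simple filter over detected failures. The sketch below assumes that confidence is reported as one of three ordered levels. The `Confidence` enum, the `Failure` dataclass, and `filter_by_confidence` are hypothetical names for illustration, not the SDK's types.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    """Hypothetical ordered confidence levels (assumed, not taken from the SDK)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Failure:
    """Hypothetical failure record with the fields the post describes."""
    span_id: str
    category: str
    confidence: Confidence
    evidence: str


def filter_by_confidence(failures: list[Failure], threshold: Confidence) -> list[Failure]:
    """Keep failures whose confidence is at or above the threshold."""
    return [f for f in failures if f.confidence >= threshold]
```

Lowering the threshold only ever adds findings, so a `LOW` run is a superset of a `MEDIUM` run on the same failures.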

**Use ON_FAILURE in CI/CD, ALWAYS for periodic audits.** `ON_FAILURE` keeps LLM costs proportional to failure rates, making it practical for every test run. Schedule `ALWAYS`-mode runs weekly or per-release to catch suboptimal behaviors hiding in passing cases.

**Fix PRIMARY failures first.** Secondary and tertiary failures often resolve when their root cause is addressed. Before implementing multiple recommendations, check whether fixing the primary failure removes the downstream ones. This reduces iteration cycles.
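One way to apply this ordering is to sort root causes by causality before working through them. In the following sketch, the `RootCause` dataclass and `prioritize` function are hypothetical. Only the three causality labels come from the detector output described earlier.

```python
from dataclasses import dataclass

# Lower rank means "fix sooner". The labels match the causality classes above.
CAUSALITY_RANK = {"PRIMARY": 0, "SECONDARY": 1, "TERTIARY": 2}


@dataclass(frozen=True)
class RootCause:
    """Hypothetical root cause record used only for this illustration."""
    span_id: str
    causality: str
    recommendation: str


def prioritize(root_causes: list[RootCause]) -> list[RootCause]:
    """Order root causes PRIMARY first. The sort is stable within each class."""
    return sorted(root_causes, key=lambda rc: CAUSALITY_RANK[rc.causality])
```

After applying the first `PRIMARY` fix, re-run the evaluation before touching the rest of the list, because later entries might no longer reproduce.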

**Group recommendations by fix type.** Batch `TOOL_DESCRIPTION_FIX` changes together and `SYSTEM_PROMPT_FIX` changes together. This makes the impact of each change category independently measurable when you re-run evaluation.
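Batching by fix type amounts to a group-by over the recommendations. The sketch below assumes each recommendation is available as a `(fix_type, text)` pair. That shape and the `group_by_fix_type` helper are hypothetical, chosen to keep the example independent of the SDK's result classes.

```python
from collections import defaultdict


def group_by_fix_type(recommendations: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (fix_type, text) pairs into one batch of changes per fix type."""
    batches: dict[str, list[str]] = defaultdict(list)
    for fix_type, text in recommendations:
        batches[fix_type].append(text)
    return dict(batches)
```

Applying one batch, re-running the experiment, and then applying the next batch lets you attribute any score change to a single category of fix.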

**Pass pre-detected failures to analyze_root_cause.** If you have already run `detect_failures` and want to inspect the results before running root cause analysis, pass them directly to avoid redundant detection.

**Use the test session for experimentation.** The `flawed_session.json` used in this post is available in the [Strands Evals test suite](https://github.com/strands-agents/evals) for you to try detectors locally.

## Clean up resources

The detector functions themselves don’t provision any persistent AWS resources. However, if you configured Amazon CloudWatch Logs export for your agent traces, you might want to review the following:

- **Amazon CloudWatch log groups:** Deleting a log group permanently removes all log data and can’t be undone. Confirm that you have exported any logs you need to retain before proceeding. If you created log groups specifically for testing, delete them through the Amazon CloudWatch console or by running `aws logs delete-log-group --log-group-name <your-log-group>`.
- **Amazon Bedrock model access:** The LLM analysis uses Amazon Bedrock. If you enabled model access solely for this walkthrough, revoke it through the Amazon Bedrock console under **Model access**.

## Conclusion

Detectors close the loop between measuring agent quality and improving it. By automating the failure detection and root cause analysis that previously required manual trace inspection, you can go from “test failed” to “here is what to fix” in minutes instead of hours.

To get started, see the [Strands Evals SDK Detectors documentation](https://strandsagents.com/docs/user-guide/evals-sdk/detectors/) and the [Strands Evals GitHub repository](https://github.com/strands-agents/evals). Try the included sample trace file, then add `DiagnosisConfig` to one existing test case in your evaluation pipeline to see automated diagnosis in action.
