AI Agent Failure Detection and Root Cause Analysis with Strands Evals


When your AI agent fails in production, knowing that it failed is only the beginning. The harder question is why it failed and what to fix. Traditional evaluation tells you “this agent scored 60 percent on goal completion,” but leaves you manually reviewing execution traces to understand what went wrong. For teams operating agents at scale, this manual diagnosis becomes the bottleneck between detecting a problem and shipping a fix. Detectors in the Strands Evals SDK remove this bottleneck by automatically identifying failures in agent execution traces and performing root cause analysis, so you can reduce diagnosis time from hours to minutes.

In this post, we walk you through calling the detector functions to diagnose real agent failures. You learn how to interpret their structured output: categorized failures with confidence scores, causal chains linking root causes to downstream symptoms, and fix recommendations specifying whether a change belongs in your system prompt or tool definitions. You also learn how to integrate detection into your evaluation pipeline for automated diagnosis on every test run.

Detectors complement the evaluation framework introduced in a previous post by answering not only “how well did the agent do?” but also “why did it fail and how do I fix it?”

Prerequisites #

You must have the following prerequisites to follow along with this post.

- Python 3.10 or later.
- Strands Evals SDK installed with `pip install strands-agents-evals`.
- Amazon Bedrock model access enabled (detectors use large language model (LLM)-based analysis).
- For Amazon CloudWatch examples, AWS credentials configured with `logs:StartQuery` and `logs:GetQueryResults` permissions.

Why scores alone are not enough #

The Strands Evals framework provides reliable quality signals through Cases, Experiments, and Evaluators: goal success rates, tool selection accuracy, and helpfulness scores. These are important for catching regressions and understanding performance at a statistical level. But consider what happens after you detect a regression. Your agent’s goal success rate drops from 85 percent to 70 percent after a deployment or after prompt or tool changes in build-time testing. Evaluators confirm the drop. Now what?

You must identify which specific behaviors caused failures, distinguish root causes from downstream symptoms, determine whether the fix belongs in the system prompt or tool definitions, and prioritize by impact. This diagnosis workflow has traditionally required senior engineers to manually inspect traces span by span and correlate failures across hundreds of steps, and this process doesn’t scale.

Detectors automate this workflow. Evaluators answer “how well did the agent do?” by producing scores at the per-case level. Detectors answer “why did it fail?” by producing diagnoses at the per-span level with categorized failures, causal chains, and fix recommendations.

How detectors work #

The detector pipeline operates in two phases, each powered by LLM-based analysis of the execution trace. Refer to Understand observability for agentic resources in Amazon Bedrock AgentCore to learn more about sessions, traces, and spans of agents.

Phase 1: Failure detection scans each span in a session against a comprehensive failure taxonomy organized into nine parent categories: hallucination, incorrect actions, orchestration errors, task instruction non-compliance, execution errors, context handling errors, repetitive behavior, LLM output issues, and configuration mismatch. For each identified failure, it returns the span location, one or more categories, a confidence score, and evidence extracted from the trace.
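To make the shape of a Phase 1 result concrete, the following is a minimal sketch in plain Python. The enum lists the nine parent categories named above. The `DetectedFailure` dataclass, its field names, the assumption that confidence is a number between 0 and 1, and the sample records are all hypothetical illustrations, not the SDK's actual types or output.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    """The nine parent categories named in the text (this enum is illustrative)."""
    HALLUCINATION = "hallucination"
    INCORRECT_ACTIONS = "incorrect actions"
    ORCHESTRATION_ERRORS = "orchestration errors"
    TASK_INSTRUCTION_NON_COMPLIANCE = "task instruction non-compliance"
    EXECUTION_ERRORS = "execution errors"
    CONTEXT_HANDLING_ERRORS = "context handling errors"
    REPETITIVE_BEHAVIOR = "repetitive behavior"
    LLM_OUTPUT_ISSUES = "LLM output issues"
    CONFIGURATION_MISMATCH = "configuration mismatch"


@dataclass(frozen=True)
class DetectedFailure:
    """Hypothetical record: one failure located at one span."""
    span_id: str
    categories: tuple[FailureCategory, ...]
    confidence: float  # assumed to be a score in [0, 1]
    evidence: str


def failures_by_span(failures):
    """Group failure records by span so spans with several failures stand out."""
    grouped = {}
    for failure in failures:
        grouped.setdefault(failure.span_id, []).append(failure)
    return grouped


# Made-up sample records, for illustration only.
example = [
    DetectedFailure("span-3", (FailureCategory.EXECUTION_ERRORS,), 0.9, "tool call rejected"),
    DetectedFailure("span-7", (FailureCategory.HALLUCINATION,), 0.8, "claim without source"),
    DetectedFailure("span-7", (FailureCategory.ORCHESTRATION_ERRORS,), 0.6, "goal abandoned"),
]
grouped = failures_by_span(example)
```

Grouping by span is one simple way to see where several failure categories pile up on the same step of the trace.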

Phase 2: Root cause analysis takes the detected failures and traces causal chains between them. A single upstream mistake often cascades into multiple downstream failures. Root cause analysis separates causes from symptoms. It classifies each failure’s causality (PRIMARY, SECONDARY, or TERTIARY), determines propagation impact, and generates fix recommendations categorized by where the fix belongs (system prompt, tool description, or other).
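The idea of separating causes from symptoms can be sketched as a walk along causal links. The example below is an assumption-laden illustration: the `AnalyzedFailure` record, its `caused_by` field, and the sample data are hypothetical, and only the PRIMARY, SECONDARY, and TERTIARY labels come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Review order: root causes first, then their downstream symptoms.
CAUSALITY_ORDER = {"PRIMARY": 0, "SECONDARY": 1, "TERTIARY": 2}


@dataclass(frozen=True)
class AnalyzedFailure:
    """Hypothetical record; field names are assumptions, not SDK types."""
    span_id: str
    causality: str             # "PRIMARY", "SECONDARY", or "TERTIARY"
    caused_by: Optional[str]   # span_id of the upstream failure, if any


def trace_to_root(span_id, failures_by_id):
    """Follow caused_by links upstream; return the chain from symptom to root."""
    chain = [span_id]
    seen = {span_id}
    current = failures_by_id[span_id]
    while current.caused_by is not None and current.caused_by not in seen:
        chain.append(current.caused_by)
        seen.add(current.caused_by)
        current = failures_by_id[current.caused_by]
    return chain


def primary_first(failures):
    """Order failures so root causes are reviewed before their symptoms."""
    return sorted(failures, key=lambda f: CAUSALITY_ORDER[f.causality])


# Made-up sample: one upstream mistake cascading into two downstream failures.
sample = [
    AnalyzedFailure("span-9", "TERTIARY", caused_by="span-7"),
    AnalyzedFailure("span-3", "PRIMARY", caused_by=None),
    AnalyzedFailure("span-7", "SECONDARY", caused_by="span-3"),
]
by_id = {f.span_id: f for f in sample}
chain = trace_to_root("span-9", by_id)
ordered = [f.span_id for f in primary_first(sample)]
```

In this sample the chain from `span-9` leads back to `span-3`, which is the single place a fix would need to land first.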

Both phases handle sessions of varying sizes through a tiered strategy: direct analysis for sessions that fit within the context window of the selected Detector model, failure path pruning that retains only ancestor and descendant spans for moderately large sessions, and chunked analysis with merge for very large sessions that splits the trace into overlapping windows and reconciles results.
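The three tiers can be sketched in a few small functions. This is an illustration of the general techniques only: the function names, the token-count inputs, and the window parameters are assumptions, and the SDK's actual thresholds and merge logic are not shown in this post.

```python
def choose_strategy(session_tokens, context_limit, pruned_tokens):
    """Pick the cheapest tier whose input fits the model's context window."""
    if session_tokens <= context_limit:
        return "direct"
    if pruned_tokens <= context_limit:
        return "failure_path_pruning"
    return "chunked_with_merge"


def prune_to_failure_path(parent_of, failed_span):
    """Keep the failed span, its ancestors, and its descendants."""
    # Walk upward to collect ancestors.
    ancestors = set()
    node = parent_of.get(failed_span)
    while node is not None:
        ancestors.add(node)
        node = parent_of.get(node)

    # Sweep repeatedly to collect everything below the failed span.
    subtree = {failed_span}
    changed = True
    while changed:
        changed = False
        for child, parent in parent_of.items():
            if parent in subtree and child not in subtree:
                subtree.add(child)
                changed = True
    return ancestors | subtree


def overlapping_windows(spans, size, overlap):
    """Split spans into fixed-size windows that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(spans), step):
        windows.append(spans[start:start + size])
        if start + size >= len(spans):
            break
    return windows


# Made-up span tree: child -> parent (None marks the root span).
parent_of = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}
kept = prune_to_failure_path(parent_of, "a")
windows = overlapping_windows(list(range(10)), size=4, overlap=1)
```

The overlap between windows is what gives a later merge step a chance to reconcile failures that straddle a chunk boundary.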

The following diagram shows the end-to-end pipeline with two entry points converging into the same detection and analysis flow.

Figure: Detector pipeline with integrated and standalone entry points flowing into failure detection and root cause analysis.

Getting started with failure detection #

The following examples use a session trace from the drug discovery research assistant featured in Evaluating AI agents for production: A practical guide to Strands Evals. The agent is built on Strands Agents and Amazon Bedrock. To follow along, run your agent with OpenTelemetry tracing enabled and export the session as JSON, or use the `CloudWatchProvider` shown later in this post to fetch an existing trace. Refer to User Simulation in the Strands Agents SDK documentation for how to set up tracing and export sessions.

The `detect_failures` function takes a Session object (the standard trace format in Strands Evals) and returns structured failures. Each failure includes the span where it occurred, one or more categories from the pre-defined failure taxonomy, a confidence score, and evidence extracted from the trace.

The following is output from a research agent that was asked to “Research the impact of energy requirements for powering AI in the real world.” The agent encountered tool configuration issues and progressively degraded:

In a single pass, the detector identifies failures at multiple levels: execution errors (tool parameter validation), semantic issues (hallucinating from “general knowledge”), and orchestration problems (full goal deviation). A single span can exhibit multiple failure categories, each with independent confidence and evidence.

Adding root cause analysis #

Identifying failures is useful, but understanding why they happened is what drives fixes. The `analyze_root_cause` function takes detected failures and traces causal chains between them, separating root causes from downstream symptoms and recommending where each fix belongs. If failures aren’t provided to `analyze_root_cause`, it runs failure detection automatically.

Continuing with the same research agent session, root cause analysis reveals the causal structure:

The distinction between fix types is what makes root cause analysis actionable. The tool schema error is a `TOOL_DESCRIPTION_FIX` because the retrieve tool’s `knowledgeBaseId` isn’t documented clearly. The downstream hallucination is a `SYSTEM_PROMPT_FIX` because the system prompt lacks instructions for handling persistent tool failures. Fixing only one category leaves the other unaddressed.
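One way to act on this distinction is to bucket recommendations by fix type before applying them. The sketch below assumes each recommendation can be reduced to a `(fix_type, text)` pair, which is a hypothetical shape chosen for illustration. The two fix-type names come from the text, and the recommendation wording paraphrases the example above.

```python
from collections import defaultdict


def group_by_fix_type(recommendations):
    """Bucket (fix_type, text) pairs so each kind of change can be applied together."""
    grouped = defaultdict(list)
    for fix_type, text in recommendations:
        grouped[fix_type].append(text)
    return dict(grouped)


# Illustrative recommendations paraphrasing the example in the text.
recs = [
    ("TOOL_DESCRIPTION_FIX", "Document the knowledgeBaseId parameter of the retrieve tool."),
    ("SYSTEM_PROMPT_FIX", "Add instructions for handling persistent tool failures."),
]
grouped_recs = group_by_fix_type(recs)
```

Applying one bucket at a time and re-running evaluation keeps the effect of tool-description changes separate from the effect of prompt changes.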

Integrated diagnosis with diagnose_session #

For convenience, `diagnose_session` runs both phases as a single pipeline (detect failures, then analyze root causes) and returns a unified `DiagnosisResult` with deduplicated recommendations:

This produces the same failures and root causes shown in the preceding examples, packaged into a single result with recommendations deduplicated across all root causes. From one function call, you get a prioritized list of concrete changes categorized by where they belong.
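To illustrate what deduplicating recommendations across root causes can mean, here is a minimal sketch that treats two recommendations as duplicates when they share a fix type and the same text after normalizing case and whitespace. This matching rule is an assumption made for the example; the post does not describe how the SDK performs deduplication.

```python
def deduplicate(recommendations):
    """Drop repeated (fix_type, text) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for fix_type, text in recommendations:
        # Normalize case and whitespace so trivially different copies match.
        key = (fix_type, " ".join(text.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append((fix_type, text))
    return unique


# Made-up input: two root causes produced the same system prompt advice.
raw = [
    ("SYSTEM_PROMPT_FIX", "Add instructions for handling persistent tool failures."),
    ("TOOL_DESCRIPTION_FIX", "Document the knowledgeBaseId parameter."),
    ("SYSTEM_PROMPT_FIX", "add instructions for  handling persistent tool failures."),
]
deduped = deduplicate(raw)
```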

Integration with evaluation pipelines #

Detectors provide additional value when you integrate them into your existing evaluation workflow. The `DiagnosisConfig` attaches automated diagnosis to any experiment, so that every failing test case automatically produces a diagnosis:

Two trigger modes are available. `ON_FAILURE` (default) runs diagnosis only when at least one evaluator returns `test_pass=False`, making it cost-efficient for continuous integration and continuous delivery (CI/CD) regression detection. `ALWAYS` runs diagnosis on every case regardless of outcome, which is useful for identifying suboptimal paths in nominally passing cases.
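The trigger rule described above can be restated as a small decision function. The `Trigger` enum and `should_diagnose` helper below are hypothetical names written for this illustration and are not part of the SDK; only the two mode names and the `test_pass` condition come from the text.

```python
from enum import Enum


class Trigger(Enum):
    """Illustrative stand-in for the two trigger modes named in the text."""
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_diagnose(trigger, evaluator_passes):
    """Decide whether to diagnose one case.

    evaluator_passes holds the test_pass boolean from each evaluator.
    """
    if trigger is Trigger.ALWAYS:
        return True
    # ON_FAILURE: diagnose when at least one evaluator reported a failure.
    return any(not passed for passed in evaluator_passes)
```

Under `ON_FAILURE`, a case where every evaluator passes costs no extra LLM calls, which is why that mode suits frequent CI/CD runs.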

With this integration, your CI/CD pipeline tells you not only that “3 tests failed” but also why they failed and what to change. This closes the feedback loop: define cases, run the experiment, get scores and diagnosis together, apply the recommended fixes, and re-run to confirm.

Note: Running detectors uses Amazon Bedrock inference for LLM-based analysis, which incurs charges. See Amazon Bedrock pricing for details. Amazon CloudWatch Logs storage also incurs charges. See Amazon CloudWatch pricing for details. Monitor your usage in AWS Cost Explorer, especially when integrating detectors into CI/CD pipelines that run frequently.

Diagnosing production sessions from CloudWatch #

The preceding examples use local session files, but in production your agent traces live in Amazon CloudWatch Logs, exported with OpenTelemetry. The `CloudWatchProvider` fetches traces directly from Amazon CloudWatch and converts them into `Session` objects that you can analyze with detectors:

Under the hood, the provider queries Amazon CloudWatch Logs Insights for OTEL records matching the session ID, auto-detects the agent framework (Strands, LangChain, or others) from span metadata, and maps the spans into a standardized `Session`. Detectors work with any framework that exports OpenTelemetry traces to Amazon CloudWatch, not only Strands Agents.
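For readers curious what a Logs Insights lookup of this kind involves, the following sketch uses the CloudWatch Logs `start_query` and `get_query_results` operations, which correspond to the two permissions listed in the prerequisites. It is not the provider's implementation: the `fetch_session_records` helper is hypothetical, and matching the session ID as a substring of `@message` is a simplifying assumption. The function accepts any client object that behaves like `boto3.client("logs")`.

```python
import time


def fetch_session_records(logs_client, log_group, session_id, start_time, end_time,
                          poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return each result row as a dict.

    logs_client is expected to behave like boto3.client("logs").
    start_time and end_time are epoch seconds.
    """
    # Assumption: a substring match on @message is enough to find the
    # session's records; a real provider likely filters on a structured field.
    query = (
        "fields @timestamp, @message "
        f'| filter @message like "{session_id}" '
        "| sort @timestamp asc"
    )
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )
    # Logs Insights queries are asynchronous, so poll until a final status.
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=started["queryId"])
        if response["status"] == "Complete":
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if response["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {response['status']}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")
```

With real credentials, you would pass `boto3.client("logs")` as `logs_client`. Converting the returned rows into a session object is the part the provider handles for you.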

You can also combine this with the experiment pipeline for offline evaluation: use `CloudWatchProvider` to evaluate and diagnose historical production sessions without re-running the agent. You can also retrieve traces from Langfuse or OpenSearch using `LangfuseProvider` or `OpenSearchProvider`.

Best practices #

- Start with MEDIUM confidence. The `LOW` threshold catches more potential issues but includes more noise, which is useful for deep investigation of a specific failing case. `MEDIUM` provides a good signal-to-noise ratio for routine use. Reserve `HIGH` for production monitoring where you only want high-certainty findings.
- Use ON_FAILURE in CI/CD, ALWAYS for periodic audits. `ON_FAILURE` keeps LLM costs proportional to failure rates, making it practical for every test run. Schedule `ALWAYS`-mode runs weekly or per-release to catch suboptimal behaviors hiding in passing cases.
- Fix PRIMARY failures first. Secondary and tertiary failures often resolve when their root cause is addressed. Before implementing multiple recommendations, check whether fixing the primary failure removes the downstream ones. This reduces iteration cycles.
- Group recommendations by fix type. Batch `TOOL_DESCRIPTION_FIX` changes together and `SYSTEM_PROMPT_FIX` changes together. This makes the impact of each change category independently measurable when you re-run evaluation.
- Pass pre-detected failures to analyze_root_cause. If you have already run `detect_failures` and want to inspect the results before running root cause analysis, pass them directly to avoid redundant detection.
- Use the test session for experimentation. The `flawed_session.json` used in this post is available in the Strands Evals test suite for you to try detectors locally.
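The trade-off between the confidence thresholds can be seen in a toy filter. This sketch assumes that confidence is a number between 0 and 1 and that each level maps to a numeric cutoff. The cutoff values, the dictionary shape, and the sample findings are invented for illustration and are not the SDK's.

```python
# Illustrative cutoffs only; the SDK's actual thresholds are not given in this post.
THRESHOLDS = {"LOW": 0.3, "MEDIUM": 0.6, "HIGH": 0.85}


def filter_by_confidence(failures, level="MEDIUM"):
    """Keep failures whose confidence meets the cutoff for the chosen level."""
    cutoff = THRESHOLDS[level]
    return [f for f in failures if f["confidence"] >= cutoff]


# Made-up findings with assumed numeric confidence scores.
findings = [
    {"span_id": "span-3", "confidence": 0.95},
    {"span_id": "span-7", "confidence": 0.70},
    {"span_id": "span-9", "confidence": 0.40},
]
low = filter_by_confidence(findings, "LOW")
medium = filter_by_confidence(findings, "MEDIUM")
high = filter_by_confidence(findings, "HIGH")
```

Lowering the level widens the set of findings, which helps when investigating one failing case but adds noise in routine runs.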

Clean up resources #

The detector functions themselves don’t provision any persistent AWS resources. However, if you configured Amazon CloudWatch Logs export for your agent traces, you might want to review the following:

- Amazon CloudWatch log groups: Deleting a log group permanently removes all log data and can’t be undone. Confirm that you have exported any logs you need to retain before proceeding. If you created log groups specifically for testing, delete them through the Amazon CloudWatch console or by running `aws logs delete-log-group --log-group-name <your-log-group>`.
- Amazon Bedrock model access: The LLM analysis uses Amazon Bedrock. If you enabled model access solely for this walkthrough, revoke it through the Amazon Bedrock console under Model access.

Conclusion #

Detectors close the loop between measuring agent quality and improving it. By automating the failure detection and root cause analysis that previously required manual trace inspection, you can go from “test failed” to “here is what to fix” in minutes instead of hours.

To get started, see the Strands Evals SDK Detectors documentation and the Strands Evals GitHub repository. Try the included sample trace file, then add `DiagnosisConfig` to one existing test case in your evaluation pipeline to see automated diagnosis in action.

Source: aws.amazon.com (original article)