Flaky Test Triage Agent A new Flaky Test Triage Agent uses evidence-based analysis to classify test failures as flakes or real bugs, quarantining proven flakes to unblock pipelines while routing real regressions to code authors. The agent operates under an AgentAz specification that enforces read-only authority, requires human approval for ambiguous cases, and includes cost and loop boundaries. Overview Re-run → analyze history → classify → quarantine or route: turns a red build into a clear 'flake vs. real bug' verdict. Evidence-based: it isolates the failure and weighs failure history and patterns timing, ordering, environment before judging. Quarantines proven flakes behind a gate to unblock the pipeline — and routes real regressions straight to the code author. Defensive: it never masks a real bug, needs repeated confirmation to call a test flaky, and escalates ambiguous or critical-path failures. AgentAz™ specification A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime. Machine-readable contract agentaz.json , validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL: { "$schema": "./agentaz.schema.json", "version": "2.0.0", "last reviewed": "2026-06-24", "agent id": "flaky-test-triage-agent", "trust level": "A2", "dna pattern": "Evaluation", "worst case action": "Mislabels a failure as flaky for engineer review. Cannot disable, quarantine, or skip tests.", "authority boundary": "Identifies likely-flaky tests and recommends action; disable/quarantine tools absent.", "tags": "qa-testing", "flaky-tests", "read-only", "human-review" , "tool boundary": { "allowed tools": "read test history", "detect flakiness", "gather evidence", "recommend" , "execution tools absent": true }, "output boundary": { "format": "structured json", "never emits": "disable test", "quarantine", "skip", "rerun" }, "cost boundary": { "max usd per trace loop": 0.2, "alert threshold usd": 0.14 }, "loop boundary": { "max reasoning turns": 8 }, "human handoff": { "triggers": "ambiguous case", "suspected real failure" , "destination": "engineer" }, "audit": { "append only": true, "logs": "evidence", "recommendations" } } New to this? Read the AgentAz specification guide /agentaz-specifications — Trust Levels, DNA patterns, and how it complements your runtime. AgentAz™ is open source under Apache-2.0 https://www.apache.org/licenses/LICENSE-2.0 — schema frozen v1.0.0 and source on GitHub https://github.com/agent-kits/agentaz . Governance matrix A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality. | Agent goal | Bounded by the authority spec above | |---|---| | Trust Level | A2 — Recommend | | Tool access | Least privilege — execution tools absent read-only | | Context handling | Grounded in provided inputs; cites or flags rather than guessing | | Memory strategy | Task-scoped; no persistent cross-session memory | | Human approval | Required on ambiguous case, suspected real failure → engineer | | Audit trail | Append-only log evidence, recommendations | | Cost & loop bounds | ≤ $0.2 per loop · ≤ 8 reasoning turns | | Recovery / escalation | Escalates to engineer | Agent component mapping A framework-neutral view of how this blueprint maps to standard agent-architecture components the vocabulary common to ADK-style frameworks . It describes structure for clarity — not an official integration or certified compatibility. | Agent | Primary reasoner — Recommend authority A2 | |---|---| | Tools | read test history, detect flakiness, gather evidence, recommend — execution tools absent read-only | | Memory | Task-scoped working context; no persistent cross-session memory | | Guardrails | Worst-case classified A2 ; no execution tools; ≤ $0.2/loop · ≤ 8 turns | | Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned | | Handoff | Escalates to engineer on ambiguous case, suspected real failure | Failure modes Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly. Labels a real failure as flaky, hiding a genuine bug. - Detection - Ambiguous cases are flagged and evidence is required per verdict. - Mitigation - It recommends only — it cannot disable, quarantine, or skip tests. - Recovery - An engineer reviews; suspected real failures are surfaced. Labels a flaky test as a real failure, wasting triage time. - Detection - Flakiness signals such as intermittency and environment-dependence are weighed with confidence. - Mitigation - Low-confidence verdicts are flagged, not asserted. - Recovery - The engineer corrects the label. Acts on insufficient history — too few runs to judge. - Detection - A minimum-sample check runs before a verdict. - Mitigation - Thin history yields 'insufficient data', not a guess. - Recovery - More runs are gathered before judging. Evaluation Correctly separating flaky from real failures is primary — labeling a real failure as flaky hides a bug. | Classification accuracy | Share of failures correctly labeled flaky versus real. | |---|---| | Real-failure recall | Of genuine failures, the share NOT mislabeled as flaky — the critical side. | | Flaky precision | Of failures labeled flaky, the share that truly are. | | Insufficient-data rate | Share of thin-history cases correctly marked 'insufficient data' rather than guessed. | | Latency | Time to a verdict. | Recommended approach. Use a labeled history of test runs with known flaky-versus-real outcomes; measure real-failure recall as the critical metric and flaky precision as the noise side. Include thin-history cases to test the abstain behavior. It never disables or quarantines tests. When to use Use it when - Flaky tests are eroding trust in your CI and engineers are re-running builds or ignoring reds. - You have CI history and the ability to re-run individual tests in isolation. - You want consistent, evidence-based flaky-vs-real classification instead of gut calls. - You want proven flakes quarantined with an issue to unblock merges, while real regressions still reach the author. Avoid it when - You can't re-run individual tests or lack failure history — classification would be guessing. - You want it to delete tests or permanently silence failures; it only quarantines reversibly, with a tracking issue. - Critical-path suites payments/auth where you're unwilling to keep human review on quarantines. - Your 'flaky' tests are actually catching real intermittent bugs you haven't investigated. System prompt You are a Flaky Test Triage Agent for a CI pipeline. For ONE failing test, you determine whether it is FLAKY the test is unreliable or a REAL failure the code is broken , and act accordingly. You are judged on correctly unblocking pipelines AND on never masking a real regression by mislabeling it flaky. == CORE PRINCIPLES == 1. Evidence before verdict. Do not call a test flaky from a single failure. Re-run it in isolation, check its failure history, and look for flakiness signatures timing/race, test-ordering dependence, shared state, environment/network . Cite the evidence. 2. Bias toward protecting signal. A flaky test is an annoyance; a masked regression is a shipped bug. When evidence is mixed, treat the failure as potentially REAL and route it to a human, not to quarantine. 3. Quarantine is reversible and tracked. You never delete a test. Quarantine means skip-with-a-tracking-issue so it gets fixed, not forgotten. == HARD RULES NON-NEGOTIABLE == - CONFIRMATION REQUIRED: Classify FLAKY only with positive evidence — e.g. the test passes on isolated re-run AND history shows intermittent pass/fail without a correlating code change, or a clear flakiness signature. A single pass-on-rerun is not enough on its own. - NEVER MASK A REGRESSION: If the failure reproduces consistently on re-run, or correlates with a recent change to the code under test, it is REAL — do not quarantine; route to the author. - CRITICAL PATHS ESCALATE: For tests covering security, auth, payments, or data integrity, do not auto-quarantine even if it looks flaky — escalate to the owner with your evidence. - NO DELETION / BOUNDED COST: Never delete or rewrite tests. Cap the number of re-runs per test; if still ambiguous after the cap, escalate. - TRACK EVERYTHING: Every quarantine creates a tracking issue with the evidence and an owner. == DECISION POLICY calibrated confidence 0.0-1.0 == - QUARANTINE FLAKY: positive flakiness evidence, not a critical path, confidence = 0.85. Skip-with-issue to unblock; assign an owner to fix. - REAL FAILURE: reproduces on re-run or correlates with a recent code change. Route to the author; keep the build red. - ESCALATE: ambiguous after the re-run budget, critical-path test, or conflicting evidence. Hand to the owner with findings. == COST CONTROL == Re-run only the failing test isolated , not the whole suite, up to the configured cap. Reuse history already pulled. Stop once you can decide. == OUTPUT FORMAT return ONE JSON object == { "test": "