Flaky Test Triage Agent

A new Flaky Test Triage Agent uses evidence-based analysis to classify test failures as flakes or real bugs, quarantining proven flakes to unblock pipelines and routing real regressions to code authors. The agent ships with an AgentAz specification, a design-time document rather than a runtime enforcer, that declares read-only authority, human handoff for ambiguous cases, and cost and loop boundaries.

Overview

- Re-run → analyze history → classify → quarantine or route: turns a red build into a clear 'flake vs. real bug' verdict.
- Evidence-based: it isolates the failure and weighs failure history and patterns (timing, ordering, environment) before judging.
- Quarantines proven flakes behind a gate to unblock the pipeline, and routes real regressions straight to the code author.
- Defensive: it never masks a real bug, needs repeated confirmation to call a test flaky, and escalates ambiguous or critical-path failures.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do, and why, and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema, which is bundled for offline use and published at a permanent URL:

    {
      "$schema": "./agentaz.schema.json",
      "version": "2.0.0",
      "last_reviewed": "2026-06-24",
      "agent_id": "flaky-test-triage-agent",
      "trust_level": "A2",
      "dna_pattern": "Evaluation",
      "worst_case_action": "Mislabels a failure as flaky for engineer review. Cannot disable, quarantine, or skip tests.",
      "authority_boundary": "Identifies likely-flaky tests and recommends action; disable/quarantine tools absent.",
      "tags": ["qa-testing", "flaky-tests", "read-only", "human-review"],
      "tool_boundary": {
        "allowed_tools": ["read_test_history", "detect_flakiness", "gather_evidence", "recommend"],
        "execution_tools_absent": true
      },
      "output_boundary": {
        "format": "structured_json",
        "never_emits": ["disable_test", "quarantine", "skip", "rerun"]
      },
      "cost_boundary": { "max_usd_per_trace_loop": 0.2, "alert_threshold_usd": 0.14 },
      "loop_boundary": { "max_reasoning_turns": 8 },
      "human_handoff": {
        "triggers": ["ambiguous_case", "suspected_real_failure"],
        "destination": "engineer"
      },
      "audit": { "append_only": true, "logs": ["evidence", "recommendations"] }
    }

New to this? Read the AgentAz specification guide (/agentaz-specifications): Trust Levels, DNA patterns, and how it complements your runtime. AgentAz™ is open source under Apache-2.0 (https://www.apache.org/licenses/LICENSE-2.0), with the schema frozen at v1.0.0 and source on GitHub (https://github.com/agent-kits/agentaz).

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship, not new functionality.

| Agent goal | Bounded by the authority spec above |
|---|---|
| Trust Level | A2 (Recommend) |
| Tool access | Least privilege: execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on ambiguous case, suspected real failure → engineer |
| Audit trail | Append-only log (evidence, recommendations) |
| Cost & loop bounds | ≤ $0.20 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to engineer |

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity, not an official integration or certified compatibility.
| Agent | Primary reasoner: Recommend authority (A2) |
|---|---|
| Tools | read test history, detect flakiness, gather evidence, recommend; execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.20/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to engineer on ambiguous case, suspected real failure |

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each: the boundaries that make it safe to run, stated plainly.

Labels a real failure as flaky, hiding a genuine bug.
- Detection: Ambiguous cases are flagged and evidence is required per verdict.
- Mitigation: It recommends only; it cannot disable, quarantine, or skip tests.
- Recovery: An engineer reviews; suspected real failures are surfaced.

Labels a flaky test as a real failure, wasting triage time.
- Detection: Flakiness signals such as intermittency and environment-dependence are weighed with confidence.
- Mitigation: Low-confidence verdicts are flagged, not asserted.
- Recovery: The engineer corrects the label.

Acts on insufficient history (too few runs to judge).
- Detection: A minimum-sample check runs before a verdict.
- Mitigation: Thin history yields 'insufficient data', not a guess.
- Recovery: More runs are gathered before judging.

Evaluation

Correctly separating flaky from real failures is the primary goal: labeling a real failure as flaky hides a bug.

| Classification accuracy | Share of failures correctly labeled flaky versus real. |
|---|---|
| Real-failure recall | Of genuine failures, the share NOT mislabeled as flaky. This is the critical side. |
| Flaky precision | Of failures labeled flaky, the share that truly are. |
| Insufficient-data rate | Share of thin-history cases correctly marked 'insufficient data' rather than guessed. |
| Latency | Time to a verdict. |

Recommended approach. Use a labeled history of test runs with known flaky-versus-real outcomes; measure real-failure recall as the critical metric and flaky precision as the noise side. Include thin-history cases to test the abstain behavior. It never disables or quarantines tests.

When to use

Use it when
- Flaky tests are eroding trust in your CI and engineers are re-running builds or ignoring reds.
- You have CI history and the ability to re-run individual tests in isolation.
- You want consistent, evidence-based flaky-vs-real classification instead of gut calls.
- You want proven flakes quarantined with an issue to unblock merges, while real regressions still reach the author.

Avoid it when
- You can't re-run individual tests or lack failure history; classification would be guessing.
- You want it to delete tests or permanently silence failures; it only quarantines reversibly, with a tracking issue.
- You run critical-path suites (payments/auth) and are unwilling to keep human review on quarantines.
- Your 'flaky' tests are actually catching real intermittent bugs you haven't investigated.

System prompt

You are a Flaky Test Triage Agent for a CI pipeline. For ONE failing test, you determine whether it is FLAKY (the test is unreliable) or a REAL failure (the code is broken), and act accordingly. You are judged on correctly unblocking pipelines AND on never masking a real regression by mislabeling it flaky.

== CORE PRINCIPLES ==
1. Evidence before verdict. Do not call a test flaky from a single failure. Re-run it in isolation, check its failure history, and look for flakiness signatures (timing/race, test-ordering dependence, shared state, environment/network). Cite the evidence.
2. Bias toward protecting signal. A flaky test is an annoyance; a masked regression is a shipped bug.
When evidence is mixed, treat the failure as potentially REAL and route it to a human, not to quarantine.
3. Quarantine is reversible and tracked. You never delete a test. Quarantine means skip-with-a-tracking-issue so it gets fixed, not forgotten.

== HARD RULES (NON-NEGOTIABLE) ==
- CONFIRMATION REQUIRED: Classify FLAKY only with positive evidence, e.g. the test passes on isolated re-run AND history shows intermittent pass/fail without a correlating code change, or a clear flakiness signature. A single pass-on-rerun is not enough on its own.
- NEVER MASK A REGRESSION: If the failure reproduces consistently on re-run, or correlates with a recent change to the code under test, it is REAL. Do not quarantine; route to the author.
- CRITICAL PATHS ESCALATE: For tests covering security, auth, payments, or data integrity, do not auto-quarantine even if the test looks flaky. Escalate to the owner with your evidence.
- NO DELETION / BOUNDED COST: Never delete or rewrite tests. Cap the number of re-runs per test; if still ambiguous after the cap, escalate.
- TRACK EVERYTHING: Every quarantine creates a tracking issue with the evidence and an owner.

== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- QUARANTINE_FLAKY: positive flakiness evidence, not a critical path, confidence >= 0.85. Skip-with-issue to unblock; assign an owner to fix.
- REAL_FAILURE: reproduces on re-run or correlates with a recent code change. Route to the author; keep the build red.
- ESCALATE: ambiguous after the re-run budget, critical-path test, or conflicting evidence. Hand to the owner with findings.

== COST CONTROL ==
Re-run only the failing test (isolated), not the whole suite, up to the configured cap. Reuse history already pulled. Stop once you can decide.

== OUTPUT FORMAT (return ONE JSON object) ==

    {
      "test": "<test id/name>",
      "verdict": "flaky|real|ambiguous",
      "confidence": <0.0-1.0>,
      "evidence": ["<isolated re-run results, history, flakiness signature, change correlation>"],
      "signature": "<timing|ordering|shared_state|environment|none>",
      "critical_path": <bool>,
      "decision": "QUARANTINE_FLAKY|REAL_FAILURE|ESCALATE",
      "actions": [ { "tool": "<tool>", "args": { ... }, "requires_approval": <bool> } ],
      "issue": "<tracking issue title + owner if quarantining, else empty>",
      "author_note": "<message to the code author if REAL, else empty>",
      "escalation": { "needed": <bool>, "reason": "<critical path / ambiguity, or empty>" }
    }

If verdict is ambiguous or the test is critical-path, do NOT quarantine: ESCALATE or route as REAL.

Setup guide

Install and connect CI
Install the agent and connect it to your CI and test reporting.

    pipx install flaky-triage-agent
    flaky-triage-agent connect --ci github-actions --reports junit
    flaky-triage-agent doctor   # verifies isolated re-run capability

Configure confirmation and caps
Set how many isolated re-runs are allowed and how much evidence is needed to call a test flaky. These limits are enforced outside the model.

    cp .env.example .env
    ANTHROPIC_API_KEY=sk-ant-...
    MAX_RERUNS=5
    FLAKY_NEEDS: { isolated_pass: true, history_intermittent: true }
    MODE=advise   # advise (recommend) | act (auto-quarantine proven flakes)

Mark critical-path suites
These never auto-quarantine, even if they look flaky.

    # .flaky.yml
    critical_paths: ["tests/payments/", "tests/auth/", "tests/data_integrity/"]
    quarantine: skip_with_issue   # never delete
    assign_owner_from: CODEOWNERS

Dry-run on recent failures
Replay recent red builds to check verdicts before enabling auto-quarantine.

    flaky-triage-agent backtest --range 14d --explain   # reports flaky/real accuracy and any regressions it would have masked (must be 0)

Wire into the pipeline
Trigger on test failures. Start in advise mode; enable auto-quarantine for non-critical proven flakes once backtests are clean.

    # CI failure hook -> POST https://your-host/flaky/triage (HMAC)
    # promote MODE=act after a clean backtest with zero masked regressions

Architecture

Workflow

1. Intake the failure: Pull the failing test, logs, and the diff under test from CI.
2. Re-run in isolation: Re-run just that test (within the cap) to see if the failure reproduces deterministically or comes and goes.
3. Check history & change correlation: Pull pass/fail history and check whether the failure lines up with a recent change to the covered code.
4. Look for a flakiness signature: Analyze for timing/ordering/shared-state/environment causes that explain instability.
5. Decide on evidence: Call it flaky only with positive evidence; if it reproduces or correlates with a change, it's real; if unclear or critical-path, escalate.
6. Act through the gate: Quarantine proven flakes (skip-with-issue) to unblock, route real failures to the author, and never quarantine critical-path tests automatically.
7. Track & learn: File the tracking issue, update the flake registry, and feed outcomes back so repeat flakes and misclassifications are caught.

Examples

Intermittent, timing-dependent test → quarantine as flaky

A test fails once, passes on isolated re-run, and history shows it flips pass/fail with no related code change: a classic race.

Input

Failure: test_async_notify failed in CI. No changes to the notify module in this PR.

Output

    {
      "test": "test_async_notify",
      "verdict": "flaky",
      "confidence": 0.9,
      "evidence": [
        "isolated re-run: passed 5/5",
        "history: failed 6 of last 80 runs across unrelated PRs",
        "logs show it asserts before an async callback completes (race)",
        "no change to the notify module in this PR"
      ],
      "signature": "timing",
      "critical_path": false,
      "decision": "QUARANTINE_FLAKY",
      "actions": [
        { "tool": "quarantine_test", "args": { "test": "test_async_notify", "mode": "skip_with_issue" }, "requires_approval": false },
        { "tool": "file_issue", "args": { "title": "Flaky: test_async_notify (async race)", "owner": "@notify-team" }, "requires_approval": false }
      ],
      "issue": "Flaky: test_async_notify (async race), assigned @notify-team",
      "author_note": "",
      "escalation": { "needed": false, "reason": "" }
    }

Note: Strong flaky evidence: passes 5/5 in isolation, intermittent history across unrelated PRs, a clear timing/race signature, and no related code change. It quarantines reversibly (skip-with-issue) and assigns an owner, unblocking the build without losing the test.

Reproduces on re-run + correlates with the change → REAL, don't quarantine

The failure looks like it could be flaky, but it reproduces every re-run and the PR changed exactly the code under test. The agent refuses to quarantine.

Input

Failure: test_discount_calc failed. This PR modifies pricing/discount.py.

Output

    {
      "test": "test_discount_calc",
      "verdict": "real",
      "confidence": 0.92,
      "evidence": [
        "isolated re-run: failed 5/5 (deterministic)",
        "this PR modifies pricing/discount.py, the exact code under test",
        "history: passed 100% before this PR",
        "assertion: expected 90.0, got 99.0 (discount not applied)"
      ],
      "signature": "none",
      "critical_path": false,
      "decision": "REAL_FAILURE",
      "actions": [],
      "issue": "",
      "author_note": "test_discount_calc fails deterministically (5/5) and your PR changes pricing/discount.py. Expected 90.0 but got 99.0, so the 10% discount isn't being applied. This looks like a real regression, not a flake. Please check the discount calculation before merging.",
      "escalation": { "needed": false, "reason": "" }
    }

Note: The core safety behavior: it would be tempting (and harmful) to quarantine a red test to unblock the PR. But the failure is deterministic and correlates exactly with the changed code, so the agent calls it REAL, keeps the build red, and hands the author a precise diagnosis. Quarantining here would have shipped a pricing bug.

Looks flaky but it's a payments test → escalate, never auto-quarantine

A test on the payments path shows intermittent behavior. Even though it pattern-matches to flaky, the critical-path rule forces a human.

Input

Failure: test_payment_capture_retry intermittently fails (passed 3/5 on re-run). Covers the payment capture/retry path.

Output

    {
      "test": "test_payment_capture_retry",
      "verdict": "ambiguous",
      "confidence": 0.6,
      "evidence": [
        "isolated re-run: passed 3/5, failed 2/5 (intermittent)",
        "history: occasional failures",
        "covers payment capture + retry logic (critical path)",
        "possible real intermittent bug in retry/idempotency, not just test flakiness"
      ],
      "signature": "timing",
      "critical_path": true,
      "decision": "ESCALATE",
      "actions": [
        { "tool": "escalate_to_owner", "args": { "owner": "@payments", "priority": "high" }, "requires_approval": false }
      ],
      "issue": "",
      "author_note": "",
      "escalation": { "needed": true, "reason": "Intermittent failure on a payments-path test. It looks flaky, but intermittent failures in capture/retry can indicate a real idempotency/race bug. Not auto-quarantining a critical-path test; escalating to @payments to investigate whether the retry logic is actually correct." }
    }

Note: The defining example: the test pattern-matches to flaky (passes 3/5), but it covers payments, where an 'intermittent' failure might be a real idempotency/race bug that only sometimes triggers. Auto-quarantining it could hide a money-losing defect, so the critical-path rule overrides and it escalates to the owner with that exact concern.
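
The three examples above all follow the decision policy stated in the system prompt. The following is a minimal Python sketch of that policy, not the kit's actual implementation: the `Evidence` record, its field names, the `decide` function, and the check order (real before critical-path) are assumptions made for illustration, and the 0.85 threshold is one reading of the policy's confidence requirement.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    QUARANTINE_FLAKY = "quarantine_flaky"
    REAL_FAILURE = "real_failure"
    ESCALATE = "escalate"


@dataclass
class Evidence:
    """Hypothetical evidence record; field names are illustrative only."""
    rerun_passes: int            # isolated re-runs that passed
    rerun_total: int             # isolated re-runs executed, within the cap
    history_intermittent: bool   # pass/fail flips with no related code change
    has_signature: bool          # timing, ordering, shared-state, or environment
    change_correlated: bool      # the covered code changed in this PR
    critical_path: bool          # security, auth, payments, data integrity
    confidence: float            # calibrated 0.0-1.0


QUARANTINE_THRESHOLD = 0.85  # assumed reading of the decision policy


def decide(ev: Evidence) -> Decision:
    # A failure that reproduces on every re-run, or lines up with a change
    # to the code under test, is treated as real and is never quarantined.
    reproduces = ev.rerun_total > 0 and ev.rerun_passes == 0
    if reproduces or ev.change_correlated:
        return Decision.REAL_FAILURE
    # Critical-path tests escalate even when they look flaky.
    if ev.critical_path:
        return Decision.ESCALATE
    # A passing re-run alone is not enough: require a clean isolated pass
    # plus intermittent history or a clear flakiness signature.
    passed_in_isolation = ev.rerun_total > 0 and ev.rerun_passes == ev.rerun_total
    positive = passed_in_isolation and (ev.history_intermittent or ev.has_signature)
    if positive and ev.confidence >= QUARANTINE_THRESHOLD:
        return Decision.QUARANTINE_FLAKY
    # Mixed or thin evidence goes to a human.
    return Decision.ESCALATE


# The three worked examples, encoded as Evidence records.
examples = {
    "test_async_notify": Evidence(5, 5, True, True, False, False, 0.90),
    "test_discount_calc": Evidence(0, 5, False, False, True, False, 0.92),
    "test_payment_capture_retry": Evidence(3, 5, True, True, False, True, 0.60),
}

for name, ev in examples.items():
    print(name, decide(ev).name)
```

Checking for a real failure before the critical-path rule keeps deterministic regressions on critical paths routed to the author; a stricter deployment could swap the order so every critical-path failure escalates.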
Implementation notes

- Require positive, multi-signal evidence to call a test flaky (isolated pass + intermittent history or a clear signature). A single pass-on-rerun must never be enough; that's how real intermittent bugs get masked.
- Always check change-correlation: a failure on a test whose covered code just changed is real until proven otherwise.
- Make critical-path suites (payments/auth/data) escalate-only; the cost of masking a regression there is far higher than a slow build.
- Quarantine reversibly (skip-with-issue) and never delete tests; every quarantine needs a tracking issue and an owner so flakes get fixed, not buried.
- Bound re-runs per test for cost, and escalate (don't guess) once the budget is exhausted on an ambiguous case.
- Maintain a flake registry and track 'masked regressions' as the key safety metric in backtests before enabling auto-quarantine.
- Reserve the strong model for the flaky-vs-real judgment; a cheaper model can parse logs and pull history.

Variations

- Basic (Flake classifier): Re-runs the failure, analyzes history and signatures, and reports a flaky/real verdict with evidence and a recommendation for an engineer. No auto-quarantine.
- Advanced (Guarded auto-quarantine): Auto-quarantines proven non-critical flakes (skip-with-issue + owner) to unblock CI, routes real failures to authors, and escalates critical-path and ambiguous cases.
- Enterprise (Org-wide flake management): Adds a cross-repo flake registry, trend analytics, CODEOWNERS routing, quarantine SLAs and auto-expiry, and threshold tuning from masked-regression metrics at scale.

Download the Agent Blueprint

Download Blueprint (.zip): /downloads/flaky-test-triager.zip

View the source on GitHub: https://github.com/agent-kits/agentaz/tree/main/kits/flaky-test-triager

This blueprint and the AgentAz™ specification live in the central AgentKits registry, open source under Apache-2.0 (code & schema) and CC-BY-4.0 (text).

Frequently asked questions

No, that's its primary safety constraint. It only quarantines with positive flakiness evidence, treats any failure that reproduces or correlates with a code change as real, and escalates (never quarantines) tests on critical paths like payments and auth.

It re-runs the failing test in isolation, checks its pass/fail history, looks for flakiness signatures (timing, ordering, shared state, environment), and correlates against recent code changes. Flaky needs positive evidence; reproducible or change-correlated failures are real.

Never. Quarantine means a reversible skip-with-tracking-issue assigned to an owner, so CI is unblocked but the test still gets fixed rather than forgotten.

Tests covering payments, auth, or data integrity are escalate-only. It won't auto-quarantine them even if they look flaky, because an 'intermittent' failure there can be a real bug.

It re-runs only the single failing test in isolation, up to a configured cap, rather than the whole suite. If a case is still ambiguous after the cap, it escalates instead of burning more runs.

Start in advise mode where it only recommends, backtest on recent failures to confirm it would mask zero regressions, then enable auto-quarantine for non-critical proven flakes.
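
The rollout advice above hinges on a backtest that masks zero regressions. As a rough illustration of how that gate and the evaluation metrics could be computed from a labeled run history, here is a minimal Python sketch. The function names, the label strings, and the sample history are hypothetical; this is not the `backtest` command's implementation.

```python
from typing import Iterable, Tuple


def backtest_metrics(pairs: Iterable[Tuple[str, str]]) -> dict:
    """Score (true_label, predicted_label) pairs from a labeled run history.

    Labels are assumed to be "flaky", "real", or "ambiguous".
    """
    rows = list(pairs)
    real_total = sum(1 for truth, _ in rows if truth == "real")
    # A masked regression is a real failure that was labeled flaky.
    masked = sum(1 for truth, pred in rows if truth == "real" and pred == "flaky")
    flaky_predicted = sum(1 for _, pred in rows if pred == "flaky")
    flaky_correct = sum(
        1 for truth, pred in rows if truth == "flaky" and pred == "flaky"
    )
    return {
        # Of genuine failures, the share NOT mislabeled as flaky.
        "real_failure_recall": (
            (real_total - masked) / real_total if real_total else None
        ),
        # Of failures labeled flaky, the share that truly are.
        "flaky_precision": (
            flaky_correct / flaky_predicted if flaky_predicted else None
        ),
        "masked_regressions": masked,
    }


def safe_to_enable_auto_quarantine(metrics: dict) -> bool:
    # The gate: any masked regression blocks promotion out of advise mode.
    return metrics["masked_regressions"] == 0


# Made-up sample history, for illustration only.
history = [
    ("flaky", "flaky"),
    ("flaky", "ambiguous"),
    ("real", "real"),
    ("real", "flaky"),  # a masked regression
]
metrics = backtest_metrics(history)
print(metrics, safe_to_enable_auto_quarantine(metrics))
```

An "ambiguous" prediction on a real failure does not count against real-failure recall here, since the failure still reaches a human rather than being quarantined.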