Alert Noise Reduction Agent

The Alert Noise Reduction Agent is an open-source agent blueprint that reduces alert noise by ranking alerts on actionability and recommending suppression rules, while never suppressing critical/SEV1 alerts or any alert linked to a real incident. Its AgentAz specification, released under Apache-2.0, defines trust levels, tool boundaries, and human handoff triggers so that operations stay safe and auditable.

Overview

Score → correlate → tune → suppress carefully: turns a noisy alert stream into a quieter, still-trustworthy one.

- Actionability-based: it ranks alerts by how often they're actually acted on or tied to incidents, not by volume alone.
- Recommends concrete tuning (thresholds, grouping, dedup) and only time-box-suppresses alerts proven to be noise.
- Defensive: never suppresses an alert ever linked to a real incident, never auto-touches critical/SEV1 alerts, and keeps every action reversible and audited.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do, and why, and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Machine-readable contract

agentaz.json, validated against the open AgentAz™ JSON Schema, which is bundled for offline use and published at a permanent URL:

    {
      "$schema": "./agentaz.schema.json",
      "version": "2.0.0",
      "last_reviewed": "2026-06-24",
      "agent_id": "alert-noise-reducer-agent",
      "trust_level": "A2",
      "dna_pattern": "Evaluation",
      "worst_case_action": "Recommends suppressing a meaningful alert for review. Cannot auto-suppress; criticals never suppressed.",
      "authority_boundary": "Clusters alerts and recommends suppression rules for approval; autonomous suppression absent.",
      "tags": ["devops-sre", "alerting", "read-only", "human-review"],
      "tool_boundary": {
        "allowed_tools": ["read alerts", "cluster", "dedupe", "recommend rule"],
        "execution_tools_absent": true,
        "never_suppress_critical": true
      },
      "output_boundary": {
        "format": "structured json",
        "never_emits": ["silence alert", "auto suppress", "close alert"]
      },
      "cost_boundary": { "max_usd_per_trace_loop": 0.2, "alert_threshold_usd": 0.14 },
      "loop_boundary": { "max_reasoning_turns": 8 },
      "human_handoff": {
        "triggers": ["uncertain grouping", "critical severity"],
        "destination": "oncall engineer"
      },
      "audit": { "append_only": true, "logs": ["groupings", "recommendations"] }
    }

New to this?
Read the AgentAz specification guide (/agentaz-specifications): Trust Levels, DNA patterns, and how it complements your runtime. AgentAz™ is open source under Apache-2.0 (https://www.apache.org/licenses/LICENSE-2.0); the schema is frozen at v1.0.0 and the source is on GitHub (https://github.com/agent-kits/agentaz).

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship, not new functionality.

| Agent goal | Bounded by the authority spec above |
|---|---|
| Trust Level | A2 (Recommend) |
| Tool access | Least privilege; execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on uncertain grouping, critical severity → oncall engineer |
| Audit trail | Append-only log (groupings, recommendations) |
| Cost & loop bounds | ≤ $0.20 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to oncall engineer |

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity; it is not an official integration or certified compatibility.
| Agent | Primary reasoner; Recommend authority (A2) |
|---|---|
| Tools | read alerts, cluster, dedupe, recommend rule; execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst case classified (A2); no execution tools; ≤ $0.20/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to oncall engineer on uncertain grouping, critical severity |

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each: the boundaries that make it safe to run, stated plainly.

Recommends suppressing an alert that actually mattered.
- Detection: Critical severity is excluded from suppression and grouping confidence is scored.
- Mitigation: Suppression is a recommendation requiring approval; criticals are never suppressed.
- Recovery: An engineer rejects the rule and the alert remains.

Over-groups distinct alerts, masking a second incident.
- Detection: A grouping similarity threshold runs and divergent signals are flagged.
- Mitigation: Uncertain groupings are flagged, not merged silently.
- Recovery: The engineer splits the group.

A suppression rule persists after the underlying issue changes.
- Detection: Rules are time-bounded and reviewed.
- Mitigation: Rules expire and require re-approval.
- Recovery: Stale rules lapse and the alert resurfaces.

Evaluation

Suppression precision with critical-alert safety is the primary measure; suppressing an alert that mattered is the failure.

| Suppression precision | Of alerts recommended for suppression, the share that were genuinely noise. |
|---|---|
| Critical-miss rate | Frequency of critical alerts caught in a suppression recommendation; must be zero. |
| Grouping accuracy | Share of alert groupings that are correct, with no masked second incident. |
| Rule-decay handling | Share of stale suppression rules correctly expired. |
| Latency | Time to a grouping or recommendation. |

Recommended approach. Use a labeled alert stream with known noise versus actionable alerts; measure suppression precision and treat any suppressed critical as a hard failure. Verify that groupings don't merge distinct incidents and that rules expire.

When to use

Use it when
- On-call is drowning in alerts and real signals are getting lost in the noise.
- You have alert history (fire/ack/incident-correlation) the agent can score actionability from.
- You want data-backed tuning recommendations and safe, reversible suppression of proven noise.
- You want to cut fatigue while keeping a hard guarantee that incident-linked and critical alerts are never silenced.

Avoid it when
- You lack alert/incident history, so actionability can't be measured and suppression would be blind.
- You expect it to autonomously silence critical-service alerts; those are recommendation-only.
- Your 'noisy' alerts are actually under-investigated real signals.
- You can't keep suppression reversible, time-boxed, and audited.

System prompt

    You are an Alert Noise Reduction Agent helping an on-call/SRE team cut alert fatigue. You analyze alerts, recommend tuning, and suppress proven noise, WITHOUT ever silencing a real signal. You are judged on reducing non-actionable noise AND on never suppressing an alert that matters.

    == CORE PRINCIPLES ==
    1. Actionability, not volume. Judge an alert by evidence of whether it leads to action: ack rate, time-to-ack, and, most importantly, whether it has ever correlated with a real incident. A high-volume alert that's always acted on is signal, not noise.
    2. Suppress nothing you can't prove is noise. Only recommend/auto-suppress alerts with a strong, evidence-backed non-actionability record. When in doubt, recommend tuning, not silence.
    3. Reversible and time-boxed. Suppression is always temporary, scoped, auditable, and easy to undo.
    You never permanently delete an alert rule.

    == HARD RULES (NON-NEGOTIABLE) ==
    - INCIDENT-LINKED = NEVER SUPPRESS: If an alert has EVER correlated with a real incident (even once), you must not suppress it. At most, recommend tuning (threshold/grouping). This rule is absolute.
    - CRITICAL SERVICES ESCALATE: For alerts on critical/customer-facing services or SEV1-capable signals, you never auto-suppress; you recommend tuning and escalate the decision to a human.
    - EVIDENCE REQUIRED: Auto-suppress only with a clear record (e.g. fired many times over a meaningful window with ~0 acks and 0 incident correlations) on a non-critical signal. State the numbers.
    - BOUNDED SUPPRESSION: Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged. Never an open-ended silence.
    - NO BLIND DEDUP: When grouping/deduping, preserve the ability to see the underlying alerts; never collapse distinct real signals into one that hides a problem.

    == METHOD ==
    - Pull each alert's history: fire count, ack rate, time-to-ack, and incident correlations over a window.
    - Score actionability. Correlate/dedupe related alerts into groups. Identify chronically non-actionable, never-incident-linked, non-critical alerts as noise candidates.
    - For noise candidates: recommend tuning and, if enabled and within guardrails, time-box suppress. For everything else: recommend tuning only or leave as-is.

    == DECISION POLICY (calibrated confidence 0.0-1.0) ==
    - AUTO_SUPPRESS: non-critical, zero incident correlation, strong non-actionable record, confidence >= 0.85. Time-boxed + tracked.
    - RECOMMEND_TUNING: noisy but incident-linked at least once, or critical service, or moderate evidence. Propose thresholds/grouping; do not suppress.
    - ESCALATE: critical-service/SEV1 alerts, conflicting evidence, or anything you're unsure about.

    == COST CONTROL ==
    Pull the history you need to score; reuse it across related alerts. Cap tool calls; if exceeded, recommend based on what you have.
    == OUTPUT FORMAT (return ONE JSON object) ==
    {
      "alert": "<alert name/id or group>",
      "actionability": "<score + the numbers: fires, ack rate, incident correlations over window>",
      "incident_linked": <bool>,
      "critical_service": <bool>,
      "decision": "AUTO_SUPPRESS|RECOMMEND_TUNING|ESCALATE",
      "suppression": { "applied": <bool>, "duration": "<time-box, or empty>", "scope": "<specific alert/condition>", "reversible": true },
      "tuning": ["<concrete recommendation: threshold/grouping/dedup>"],
      "rationale": "<evidence-grounded reason>",
      "escalation": { "needed": <bool>, "reason": "<critical/uncertain, or empty>" }
    }
    If incident_linked is true or critical_service is true, decision must NOT be AUTO_SUPPRESS.

Setup guide

Install and connect alerting

Install the agent and connect it, with read access, to your alerting and incident systems.

    pipx install alert-noise-agent
    alert-noise-agent connect --alerts prometheus,pagerduty --incidents pagerduty
    alert-noise-agent doctor

Configure guardrails

The incident-linked and critical-service protections are enforced here, not by the model.

    cp .env.example .env
    ANTHROPIC_API_KEY=sk-ant-...
    NEVER_SUPPRESS_IF_INCIDENT_LINKED=true
    MAX_SUPPRESSION=24h   # time-box; auto-expires
    MODE=advise           # advise (recommend) | act (auto-suppress proven noise)

Mark critical services

Alerts on these are recommendation-only and never auto-suppressed.

    # .alerts.yml
    critical_services: ["checkout", "auth", "payments", "db-primary"]
    noise_threshold: { window: 30d, min_fires: 50, max_ack_rate: 0.02, incident_correlations: 0 }
    suppression: { reversible: true, max_duration: 24h }

Backtest on alert history

Replay history to confirm it would never have suppressed an incident-linked alert.
    alert-noise-agent backtest --range 90d --explain
    # reports noise found + a hard check: suppressed-incident-linked count must be 0

Wire in advise first

Run scheduled analysis and post recommendations; enable auto-suppression for proven non-critical noise once backtests are clean.

    # scheduled job -> recommendations to sre; tuning PRs to the monitoring repo
    # promote MODE=act after a clean backtest

Architecture

Workflow

1. Ingest the stream. Pull the alert inventory and metadata over the analysis window.
2. Score actionability. For each alert, compute ack rate, time-to-ack, and fire volume, and pull incident correlations.
3. Correlate & dedupe. Group related/duplicate alerts so one root cause isn't ten pages, keeping the underlying alerts visible.
4. Identify noise candidates. Flag chronically non-actionable, never-incident-linked, non-critical alerts, and nothing else.
5. Apply the gate. Forbid suppressing any incident-linked or critical-service alert; those get tuning recommendations or escalation instead.
6. Tune or suppress. Open tuning PRs and apply bounded, reversible, time-boxed suppression only where the evidence and guardrails allow.
7. Audit & review. Log every action with evidence and expiry, and surface a review so on-call sees exactly what was quieted.

Examples

Chronic non-actionable alert → time-boxed auto-suppress + tuning PR

An alert fires constantly, is never acked, and has never been tied to an incident, on a non-critical service. Safe, bounded noise to quiet.

Input

Alert: 'batch-worker-cpu-high' fired 312 times in 30d, ack rate 0.6%, 0 incident correlations, service=batch-worker (non-critical).
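Before reading the agent's decision for this input, it can help to check the numbers against the sample `noise_threshold` values from the setup guide. The following is only a minimal sketch: the helper name `is_noise_candidate` and its parameters are hypothetical and not part of the blueprint's CLI or API.

```python
# Thresholds copied from the sample .alerts.yml noise_threshold (30d window).
MIN_FIRES = 50
MAX_ACK_RATE = 0.02
MAX_INCIDENT_CORRELATIONS = 0


def is_noise_candidate(fires: int, ack_rate: float,
                       incident_correlations: int,
                       critical_service: bool) -> bool:
    """Hypothetical helper: True only when every noise condition holds."""
    # Protected alerts are never noise candidates, however noisy they are.
    if critical_service or incident_correlations > MAX_INCIDENT_CORRELATIONS:
        return False
    return fires >= MIN_FIRES and ack_rate <= MAX_ACK_RATE


# The input above: 312 fires, 0.6% ack rate, 0 incidents, non-critical.
candidate = is_noise_candidate(fires=312, ack_rate=0.006,
                               incident_correlations=0,
                               critical_service=False)
print(candidate)  # True
```

Under these assumed thresholds the alert qualifies as a noise candidate, which is the precondition for the decision reported next.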
Output

    {
      "alert": "batch-worker-cpu-high",
      "actionability": "Very low: 312 fires/30d, 0.6% ack rate, 0 incident correlations",
      "incident_linked": false,
      "critical_service": false,
      "decision": "AUTO_SUPPRESS",
      "suppression": { "applied": true, "duration": "24h", "scope": "batch-worker-cpu-high only", "reversible": true },
      "tuning": [
        "Raise CPU threshold from 80% to 92% for batch-worker, or add a 15m sustained-duration condition",
        "Group with batch-worker-memory into one 'batch-worker saturation' alert"
      ],
      "rationale": "Fires constantly, almost never acked, and has never correlated with an incident on a non-critical service: textbook noise. Suppressing for 24h while a tuning PR adjusts the threshold; suppression auto-expires and is reversible.",
      "escalation": { "needed": false, "reason": "" }
    }

Note: Clear, quantified non-actionability (312 fires, 0.6% acks, 0 incidents) on a non-critical service. The agent applies a bounded 24h suppression (auto-expiring, reversible) and, crucially, also opens a tuning PR so the real fix lands, rather than just muting forever.

Noisy but incident-linked once → recommend tuning, do NOT suppress

An alert is genuinely noisy, but it correlated with a real outage once. The absolute rule blocks suppression; the agent recommends tuning instead.

Input

Alert: 'api-latency-p99-high' fired 140 times in 30d, ack rate 8%, but correlated with 1 real SEV2 incident last month. Service=api (customer-facing).
Output

    {
      "alert": "api-latency-p99-high",
      "actionability": "Mixed: 140 fires/30d, 8% ack rate, but 1 real SEV2 incident correlation",
      "incident_linked": true,
      "critical_service": true,
      "decision": "RECOMMEND_TUNING",
      "suppression": { "applied": false, "duration": "", "scope": "api-latency-p99-high", "reversible": true },
      "tuning": [
        "Add a sustained-duration condition (p99 high for 5m) to cut transient spikes",
        "Raise the page threshold but keep a ticket-level alert",
        "Group flapping instances to reduce repeat pages"
      ],
      "rationale": "It is noisy, but it correlated with a real SEV2 once and sits on a customer-facing service, so it must not be suppressed. Instead, recommend tuning to cut transient noise while preserving the signal that caught the incident.",
      "escalation": { "needed": false, "reason": "" }
    }

Note: The defining rule in action: the alert is noisy enough that muting it would be tempting, but it caught a real SEV2 once, so the incident-linked rule forbids suppression entirely. The agent instead recommends duration/threshold tuning that reduces noise while keeping the signal that matters. Noise reduction never trumps catching incidents.

Critical-service alert, noisy → escalate, never auto-act

A flapping alert on the payments service. Even with no incident link yet, the critical-service rule means a human decides.

Input

Alert: 'payments-error-rate' flapping, 90 fires in 30d, ack rate 30%, 0 incident correlations so far. Service=payments (critical).
Output

    {
      "alert": "payments-error-rate",
      "actionability": "Moderate: 90 fires/30d, 30% ack rate, 0 incident correlations so far",
      "incident_linked": false,
      "critical_service": true,
      "decision": "ESCALATE",
      "suppression": { "applied": false, "duration": "", "scope": "payments-error-rate", "reversible": true },
      "tuning": [
        "Consider grouping by error type",
        "Possibly raise threshold, but only with payments-team sign-off given the criticality"
      ],
      "rationale": "Even though it hasn't correlated with an incident yet and is somewhat acked, this is the payments service. I won't auto-suppress or auto-tune a critical-service alert; the cost of missing a payments issue is too high. Escalating to on-call with tuning options.",
      "escalation": { "needed": true, "reason": "Critical-service payments alert: tuning/suppression decisions require human sign-off regardless of current noise level." }
    }

Note: Critical-service guardrail: payments alerts are recommendation-only, so even a plausibly noisy one is escalated rather than touched. The agent offers tuning options but explicitly defers the decision to the payments team, because the downside of muting a real payments alert dwarfs the annoyance of noise.

Implementation notes

- Make 'never suppress an incident-linked alert' an absolute, deterministic gate: a single past incident correlation permanently disqualifies an alert from suppression, no matter how noisy.
- Score by actionability (acks, time-to-ack, incident correlation), not raw volume; a frequent-but-always-acted-on alert is signal.
- Keep critical-service and SEV1-capable alerts recommendation-only and escalate decisions to humans.
- Make every suppression time-boxed, scoped, reversible, and audited, never an open-ended silence, and pair it with a tuning PR so the root cause gets fixed.
- Preserve visibility when deduping; collapsing distinct signals into one can hide a real problem.
- Backtest with 'suppressed-incident-linked alerts' as a hard zero metric before enabling any auto-suppression.
- The strong model earns its cost on the suppress-vs-tune judgment, while a cheaper model can pull and aggregate history.

Variations

- Basic (Noise analyzer): Scores alerts by actionability, correlates duplicates, and recommends tuning with the supporting numbers for an SRE. No suppression.
- Advanced (Guarded auto-suppression): Adds time-boxed, reversible suppression of proven non-critical noise and tuning PRs, with the absolute incident-linked and critical-service guardrails enforced.
- Enterprise (Org-wide alert hygiene): Adds multi-team alert inventories, monitoring-as-code PR workflows, suppression audit and auto-expiry, on-call load analytics, and tuning from outcomes. Incident-linked alerts are always protected.

Download the Agent Blueprint

- Download Blueprint (.zip): /downloads/alert-noise-reducer.zip
- View the source on GitHub: https://github.com/agent-kits/agentaz/tree/main/kits/alert-noise-reducer

This blueprint and the AgentAz™ specification live in the central AgentKits registry, open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

Frequently asked questions

The agent will not silence an alert tied to a real incident; that's the hard guarantee. Any alert that has ever correlated with a real incident is permanently ineligible for suppression, and critical-service alerts are recommendation-only. It can only auto-suppress proven, never-incident-linked, non-critical noise.

Noise is judged by actionability, not volume: fire count, ack rate, time-to-ack, and incident correlation over a window. A high-volume alert that's consistently acted on is treated as signal, not noise.

Suppression is never permanent or open-ended. Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged, and it's paired with a tuning recommendation/PR so the underlying noise actually gets fixed.

It never auto-suppresses or auto-tunes critical-service alerts.
It surfaces recommendations and escalates the decision to on-call, because missing a real issue on a critical service is far costlier than the noise.

When deduplicating, it groups related/duplicate alerts while preserving visibility into the underlying ones, so a single root cause stops paging ten times without collapsing genuinely distinct signals.

To build trust before enabling it, backtest it on your alert history; the key check is that it would have suppressed zero incident-linked alerts. Start in advise mode (recommendations + tuning PRs) and enable auto-suppression for non-critical proven noise only once that holds.
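The backtest check described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the blueprint's implementation: `AlertHistory`, `decide`, and `backtest` are hypothetical names, the thresholds are the sample `.alerts.yml` values, and the rule order (incident link first, then critical service) is an assumption chosen to reproduce the three worked examples. The model's confidence >= 0.85 condition is left out because it comes from the model rather than from history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertHistory:
    """Per-alert statistics over the analysis window (hypothetical shape)."""
    name: str
    fires: int
    ack_rate: float             # fraction of fires acknowledged, 0.0-1.0
    incident_correlations: int  # incidents this alert has ever been linked to
    critical_service: bool


# Thresholds taken from the sample .alerts.yml noise_threshold.
MIN_FIRES = 50
MAX_ACK_RATE = 0.02


def decide(alert: AlertHistory) -> str:
    """Deterministic gate: the hard rules run before any noise check."""
    if alert.incident_correlations > 0:
        return "RECOMMEND_TUNING"   # ever incident-linked: never suppress
    if alert.critical_service:
        return "ESCALATE"           # critical service: a human decides
    if alert.fires >= MIN_FIRES and alert.ack_rate <= MAX_ACK_RATE:
        return "AUTO_SUPPRESS"      # proven, non-critical noise
    return "RECOMMEND_TUNING"       # moderate evidence: tune, don't silence


def backtest(history: list[AlertHistory]) -> dict:
    """Replay history and count suppressed alerts that should be protected."""
    decisions = {a.name: decide(a) for a in history}
    violations = [
        a.name for a in history
        if decisions[a.name] == "AUTO_SUPPRESS"
        and (a.incident_correlations > 0 or a.critical_service)
    ]
    return {"decisions": decisions, "suppressed_protected": len(violations)}


# The three worked examples from this page, used as a tiny history.
history = [
    AlertHistory("batch-worker-cpu-high", 312, 0.006, 0, False),
    AlertHistory("api-latency-p99-high", 140, 0.08, 1, True),
    AlertHistory("payments-error-rate", 90, 0.30, 0, True),
]
report = backtest(history)
assert report["suppressed_protected"] == 0, "hard check failed: stay in advise mode"
print(report["decisions"])
```

Because the gate is plain code rather than model judgment, the violation count is zero by construction; keeping the check anyway acts as a tripwire if the gate or thresholds are edited later.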