# Alert Noise Reduction Agent

> Source: <https://www.agent-kits.com/kit/alert-noise-reducer>
> Published: 2026-06-21 00:00:00+00:00

## Overview

Score → correlate → tune → suppress (carefully): turns a noisy alert stream into a quieter, still-trustworthy one.

Actionability-based: it ranks alerts by how often they're actually acted on or tied to incidents, not by volume alone.

Recommends concrete tuning (thresholds, grouping, dedup) and applies time-boxed suppression only to alerts proven to be noise.

Defensive: never suppresses an alert ever linked to a real incident, never auto-touches critical/SEV1 alerts, and keeps every action reversible and audited.

## AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Machine-readable contract (`agentaz.json`), validated against the open AgentAz™ JSON Schema, which is bundled for offline use and published at a permanent URL:

```
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "alert-noise-reducer-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Recommends suppressing a meaningful alert for review. Cannot auto-suppress; criticals never suppressed.",
  "authority_boundary": "Clusters alerts and recommends suppression rules for approval; autonomous suppression absent.",
  "tags": [
    "devops-sre",
    "alerting",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_alerts",
      "cluster",
      "dedupe",
      "recommend_rule"
    ],
    "execution_tools_absent": true,
    "never_suppress_critical": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "silence_alert",
      "auto_suppress",
      "close_alert"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.2,
    "alert_threshold_usd": 0.14
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "uncertain_grouping",
      "critical_severity"
    ],
    "destination": "oncall_engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "groupings",
      "recommendations"
    ]
  }
}
```

New to this? Read the [AgentAz specification guide](/agentaz-specifications) — Trust Levels, DNA patterns, and how it complements your runtime.

AgentAz™ is open source under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) — schema (frozen v1.0.0) and source on [GitHub](https://github.com/agent-kits/agentaz).
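Because the spec itself does not enforce anything at runtime, the boundaries have to be checked by whatever policy layer you already run. The following is a minimal sketch of such a check, assuming a trimmed copy of the contract above. The `check_step` helper, its parameters, and its messages are hypothetical and not part of AgentAz™.

```python
import json

# Trimmed copy of the contract shown above; in practice you would load agentaz.json.
SPEC = json.loads("""
{
  "tool_boundary": {"allowed_tools": ["read_alerts", "cluster", "dedupe", "recommend_rule"]},
  "output_boundary": {"never_emits": ["silence_alert", "auto_suppress", "close_alert"]},
  "cost_boundary": {"max_usd_per_trace_loop": 0.2, "alert_threshold_usd": 0.14},
  "loop_boundary": {"max_reasoning_turns": 8}
}
""")


def check_step(spec: dict, tool: str, turn: int, spent_usd: float) -> list[str]:
    """Return boundary problems for one proposed agent step (hypothetical helper)."""
    problems = []
    # Forbidden actions are reported separately from merely unknown tools.
    if tool in spec["output_boundary"]["never_emits"]:
        problems.append(f"forbidden action: {tool}")
    elif tool not in spec["tool_boundary"]["allowed_tools"]:
        problems.append(f"tool not allowed: {tool}")
    if turn > spec["loop_boundary"]["max_reasoning_turns"]:
        problems.append(f"turn {turn} exceeds loop boundary")
    # The alert threshold warns before the hard cost cap is hit.
    if spent_usd > spec["cost_boundary"]["max_usd_per_trace_loop"]:
        problems.append("cost boundary exceeded")
    elif spent_usd >= spec["cost_boundary"]["alert_threshold_usd"]:
        problems.append("cost alert threshold reached")
    return problems
```

An empty list means the step stays inside every documented boundary; anything else would be a reason to stop the loop or hand off to a human.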

## Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

| Dimension | Coverage |
|---|---|
| Agent goal | Bounded by the authority spec above |
| Trust Level | A2 — Recommend |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on uncertain grouping, critical severity → oncall engineer |
| Audit trail | Append-only log (groupings, recommendations) |
| Cost & loop bounds | ≤ $0.2 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to oncall engineer |

## Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

| Component | Mapping |
|---|---|
| Agent | Primary reasoner with Recommend authority (A2) |
| Tools | read alerts, cluster, dedupe, recommend rule — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.2/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to oncall engineer on uncertain grouping, critical severity |

## Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Recommends suppressing an alert that actually mattered.

- Detection: Critical severity is excluded from suppression, and grouping confidence is scored.
- Mitigation: Suppression is a recommendation requiring approval; criticals are never suppressed.
- Recovery: An engineer rejects the rule and the alert remains.

Over-groups distinct alerts, masking a second incident.

- Detection: A grouping similarity threshold runs and divergent signals are flagged.
- Mitigation: Uncertain groupings are flagged, not merged silently.
- Recovery: The engineer splits the group.

A suppression rule persists after the underlying issue changes.

- Detection: Rules are time-bounded and reviewed.
- Mitigation: Rules expire and require re-approval.
- Recovery: Stale rules lapse and the alert resurfaces.
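The expiry behaviour in the last failure mode can be sketched as a rule that carries its own time-box. This is an illustrative model only: `SuppressionRule` and `split_rules` are hypothetical names, and a real deployment would rely on the expiry mechanism of its alerting system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SuppressionRule:
    """An approved suppression rule with a fixed time-box (hypothetical model)."""
    alert: str
    approved_at: datetime
    duration: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.approved_at + self.duration

    def is_active(self, now: datetime) -> bool:
        # A rule stops applying the moment its time-box ends.
        return now < self.expires_at


def split_rules(rules, now):
    """Partition rules into (active, lapsed); lapsed rules need re-approval."""
    active = [r for r in rules if r.is_active(now)]
    lapsed = [r for r in rules if not r.is_active(now)]
    return active, lapsed


approved = datetime(2026, 6, 1, tzinfo=timezone.utc)
rule = SuppressionRule("batch-worker-cpu-high", approved, timedelta(hours=24))
active, lapsed = split_rules([rule], approved + timedelta(days=2))
# Two days later the rule has lapsed, so the alert would resurface.
```

Keeping expiry as data on the rule, rather than as a separate cleanup job, means a missed cleanup cannot leave a silence in place.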

## Evaluation

Suppression precision with critical-alert safety is primary — suppressing an alert that mattered is the failure.

| Metric | Definition |
|---|---|
| Suppression precision | Of alerts recommended for suppression, the share that were genuinely noise. |
| Critical-miss rate | Frequency of critical alerts caught in a suppression recommendation — must be zero. |
| Grouping accuracy | Share of alert groupings that are correct, with no masked second incident. |
| Rule-decay handling | Share of stale suppression rules correctly expired. |
| Latency | Time to a grouping or recommendation. |

**Recommended approach.** Use a labeled alert stream with known noise versus actionable alerts; measure suppression precision and treat any suppressed critical as a hard failure. Verify that groupings don't merge distinct incidents and that rules expire.
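A minimal sketch of the two headline metrics, assuming a labeled stream where each alert records its ground-truth label and the agent's recommendation. The `LabeledAlert` fields and the `evaluate` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledAlert:
    name: str
    is_noise: bool              # ground-truth label from the labeled stream
    is_critical: bool           # critical severity or critical service
    recommended_suppress: bool  # what the agent recommended


def evaluate(alerts):
    """Compute suppression precision and the critical-miss count."""
    suppressed = [a for a in alerts if a.recommended_suppress]
    true_noise = sum(a.is_noise for a in suppressed)
    critical_misses = sum(a.is_critical for a in suppressed)
    # Precision is undefined when nothing was recommended for suppression.
    precision = true_noise / len(suppressed) if suppressed else None
    return {
        "suppression_precision": precision,
        "critical_misses": critical_misses,
        # Any critical alert in a suppression recommendation is a hard failure.
        "passed": critical_misses == 0,
    }
```

Reporting the critical-miss count separately from precision keeps a high precision score from hiding a single suppressed critical.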

## When to use

Use it when

- On-call is drowning in alerts and real signals are getting lost in the noise.
- You have alert history (fire/ack/incident-correlation) the agent can score actionability from.
- You want data-backed tuning recommendations and safe, reversible suppression of proven noise.
- You want to cut fatigue while keeping a hard guarantee that incident-linked and critical alerts are never silenced.

Avoid it when

- You lack alert/incident history, so actionability can't be measured — suppression would be blind.
- You expect it to autonomously silence critical-service alerts; those are recommendation-only.
- Your 'noisy' alerts are actually under-investigated real signals.
- You can't keep suppression reversible, time-boxed, and audited.

## System prompt

```
You are an Alert Noise Reduction Agent helping an on-call/SRE team cut alert fatigue. You analyze alerts, recommend tuning, and suppress proven noise — WITHOUT ever silencing a real signal. You are judged on reducing non-actionable noise AND on never suppressing an alert that matters.

== CORE PRINCIPLES ==
1. Actionability, not volume. Judge an alert by evidence of whether it leads to action: ack rate, time-to-ack, and — most importantly — whether it has ever correlated with a real incident. A high-volume alert that's always acted on is signal, not noise.
2. Suppress nothing you can't prove is noise. Only recommend/auto-suppress alerts with a strong, evidence-backed non-actionability record. When in doubt, recommend tuning, not silence.
3. Reversible and time-boxed. Suppression is always temporary, scoped, auditable, and easy to undo. You never permanently delete an alert rule.

== HARD RULES (NON-NEGOTIABLE) ==
- INCIDENT-LINKED = NEVER SUPPRESS: If an alert has EVER correlated with a real incident (even once), you must not suppress it. At most, recommend tuning (threshold/grouping). This rule is absolute.
- CRITICAL SERVICES ESCALATE: For alerts on critical/customer-facing services or SEV1-capable signals, you never auto-suppress — you recommend tuning and escalate the decision to a human.
- EVIDENCE REQUIRED: Auto-suppress only with a clear record (e.g. fired many times over a meaningful window with ~0 acks and 0 incident correlations) on a non-critical signal. State the numbers.
- BOUNDED SUPPRESSION: Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged. Never an open-ended silence.
- NO BLIND DEDUP: When grouping/deduping, preserve the ability to see the underlying alerts; never collapse distinct real signals into one that hides a problem.

== METHOD ==
- Pull each alert's history: fire count, ack rate, time-to-ack, and incident correlations over a window.
- Score actionability. Correlate/dedupe related alerts into groups. Identify chronically non-actionable, never-incident-linked, non-critical alerts as noise candidates.
- For noise candidates: recommend tuning and, if enabled and within guardrails, time-box suppress. For everything else: recommend tuning only or leave as-is.

== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- AUTO_SUPPRESS: non-critical, zero incident correlation, strong non-actionable record, confidence >= 0.85. Time-boxed + tracked.
- RECOMMEND_TUNING: noisy but incident-linked at least once, or critical service, or moderate evidence. Propose thresholds/grouping; do not suppress.
- ESCALATE: critical-service/SEV1 alerts, conflicting evidence, or anything you're unsure about.

== COST CONTROL ==
Pull the history you need to score; reuse it across related alerts. Cap tool calls; if exceeded, recommend based on what you have.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "alert": "<alert name/id or group>",
  "actionability": "<score + the numbers: fires, ack rate, incident correlations over window>",
  "incident_linked": <bool>,
  "critical_service": <bool>,
  "decision": "AUTO_SUPPRESS|RECOMMEND_TUNING|ESCALATE",
  "suppression": { "applied": <bool>, "duration": "<time-box, or empty>", "scope": "<specific alert/condition>", "reversible": true },
  "tuning": ["<concrete recommendation: threshold/grouping/dedup>"],
  "rationale": "<evidence-grounded reason>",
  "escalation": { "needed": <bool>, "reason": "<critical/uncertain, or empty>" }
}
If incident_linked is true or critical_service is true, decision must NOT be AUTO_SUPPRESS.
```
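The closing rule of the prompt can also be checked deterministically on the returned JSON, so the guarantee does not rest on the model following instructions. The sketch below assumes the output format defined in the prompt; `validate_output` and its error strings are hypothetical. Missing flags are treated as protected so the check fails closed.

```python
import json

ALLOWED_DECISIONS = {"AUTO_SUPPRESS", "RECOMMEND_TUNING", "ESCALATE"}


def validate_output(raw: str) -> list[str]:
    """Return rule violations found in one agent output (hypothetical helper)."""
    out = json.loads(raw)
    errors = []
    decision = out.get("decision")
    if decision not in ALLOWED_DECISIONS:
        errors.append(f"unknown decision: {decision!r}")
    # Fail closed: a missing flag counts as incident-linked or critical.
    protected = out.get("incident_linked", True) or out.get("critical_service", True)
    if protected and decision == "AUTO_SUPPRESS":
        errors.append("AUTO_SUPPRESS on an incident-linked or critical-service alert")
    suppression = out.get("suppression", {})
    if suppression.get("applied"):
        if decision != "AUTO_SUPPRESS":
            errors.append("suppression applied without an AUTO_SUPPRESS decision")
        if not suppression.get("duration"):
            errors.append("suppression applied without a time-box")
    return errors
```

An output with any error would be rejected before it reaches the alerting system, regardless of the confidence the model reported.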

## Setup guide

Install and connect alerting

Install the agent and connect it (read) to your alerting and incident systems.

```
pipx install alert-noise-agent
alert-noise-agent connect --alerts prometheus,pagerduty --incidents pagerduty
alert-noise-agent doctor
```

Configure guardrails

The incident-linked and critical-service protections are enforced here, not by the model.

```
cp .env.example .env
# then set these values in .env:
ANTHROPIC_API_KEY=sk-ant-...
NEVER_SUPPRESS_IF_INCIDENT_LINKED=true
MAX_SUPPRESSION=24h     # time-box; auto-expires
MODE=advise   # advise (recommend) | act (auto-suppress proven noise)
```

Mark critical services

Alerts on these are recommendation-only — never auto-suppressed.

```
# .alerts.yml
critical_services: ["checkout", "auth", "payments", "db-primary"]
noise_threshold: { window: 30d, min_fires: 50, max_ack_rate: 0.02, incident_correlations: 0 }
suppression: { reversible: true, max_duration: 24h }
```
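One plausible reading of the `noise_threshold` block is a deterministic filter applied before any model judgment. The sketch below copies the values from the config above; the `AlertStats` type and `is_noise_candidate` function are hypothetical, and how the real agent evaluates these thresholds is not specified here.

```python
from dataclasses import dataclass

# Values copied from the .alerts.yml example above.
CRITICAL_SERVICES = {"checkout", "auth", "payments", "db-primary"}
MIN_FIRES = 50
MAX_ACK_RATE = 0.02
MAX_INCIDENT_CORRELATIONS = 0


@dataclass(frozen=True)
class AlertStats:
    """Per-alert history over the analysis window (hypothetical shape)."""
    name: str
    service: str
    fires: int
    acks: int
    incident_correlations: int

    @property
    def ack_rate(self) -> float:
        return self.acks / self.fires if self.fires else 0.0


def is_noise_candidate(stats: AlertStats) -> bool:
    # Critical services and incident-linked alerts are excluded outright.
    if stats.service in CRITICAL_SERVICES:
        return False
    if stats.incident_correlations > MAX_INCIDENT_CORRELATIONS:
        return False
    # Only a sustained, almost-never-acked record qualifies as noise.
    return stats.fires >= MIN_FIRES and stats.ack_rate <= MAX_ACK_RATE


batch = AlertStats("batch-worker-cpu-high", "batch-worker", fires=312, acks=2, incident_correlations=0)
payments = AlertStats("payments-error-rate", "payments", fires=90, acks=27, incident_correlations=0)
```

With these numbers, `batch` qualifies as a noise candidate and `payments` does not, because the critical-service check runs before the thresholds are even consulted.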

Backtest on alert history

Replay history to confirm it would never have suppressed an incident-linked alert.

```
alert-noise-agent backtest --range 90d --explain
# reports noise found + a hard check: suppressed-incident-linked count (must be 0)
```

Wire in (advise first)

Run scheduled analysis and post recommendations; enable auto-suppression for proven non-critical noise once backtests are clean.

```
# scheduled job -> recommendations to #sre; tuning PRs to the monitoring repo
# promote MODE=act after a clean backtest
```

## Workflow

1. Ingest the stream

Pull the alert inventory and metadata over the analysis window.

2. Score actionability

For each alert, compute ack rate, time-to-ack, and fire volume, and pull incident correlations.

3. Correlate & dedupe

Group related/duplicate alerts so one root cause isn't ten pages, keeping the underlying alerts visible.

4. Identify noise candidates

Flag chronically non-actionable, never-incident-linked, non-critical alerts — and nothing else.

5. Apply the gate

Forbid suppressing any incident-linked or critical-service alert; those get tuning recommendations or escalation instead.

6. Tune or suppress

Open tuning PRs and apply bounded, reversible, time-boxed suppression only where the evidence and guardrails allow.

7. Audit & review

Log every action with evidence and expiry, and surface a review so on-call sees exactly what was quieted.
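Step 3 depends on grouping that never discards the alerts it merges. A minimal sketch of that idea follows, assuming alerts are grouped on a simple `(service, symptom)` key; the key choice and the `Alert`, `AlertGroup`, and `group_alerts` names are illustrative assumptions, not the agent's actual correlation logic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    id: str
    service: str
    symptom: str


@dataclass
class AlertGroup:
    """A group that keeps every underlying alert visible."""
    key: tuple
    members: list = field(default_factory=list)


def group_alerts(alerts):
    """Group alerts by (service, symptom) without dropping any member."""
    groups = {}
    for alert in alerts:
        key = (alert.service, alert.symptom)
        # Members are appended, never replaced, so nothing is collapsed away.
        groups.setdefault(key, AlertGroup(key)).members.append(alert)
    return list(groups.values())


stream = [
    Alert("a1", "batch-worker", "cpu-high"),
    Alert("a2", "batch-worker", "cpu-high"),
    Alert("a3", "api", "latency-p99-high"),
]
groups = group_alerts(stream)
```

Here the two `batch-worker` alerts share one group while the `api` alert stays separate, and each group still lists its member alerts, so a reviewer can split it again.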

## Examples

Chronic non-actionable alert → time-boxed auto-suppress + tuning PR

An alert fires constantly, is never acked, and has never been tied to an incident, on a non-critical service. Safe, bounded noise to quiet.

Input

```
Alert: 'batch-worker-cpu-high' fired 312 times in 30d, ack rate 0.6%, 0 incident correlations, service=batch-worker (non-critical).
```

Output

```
{
  "alert": "batch-worker-cpu-high",
  "actionability": "Very low: 312 fires/30d, 0.6% ack rate, 0 incident correlations",
  "incident_linked": false,
  "critical_service": false,
  "decision": "AUTO_SUPPRESS",
  "suppression": { "applied": true, "duration": "24h", "scope": "batch-worker-cpu-high only", "reversible": true },
  "tuning": ["Raise CPU threshold from 80% to 92% for batch-worker, or add a 15m sustained-duration condition", "Group with batch-worker-memory into one 'batch-worker saturation' alert"],
  "rationale": "Fires constantly, almost never acked, and has never correlated with an incident on a non-critical service — textbook noise. Suppressing for 24h while a tuning PR adjusts the threshold; suppression auto-expires and is reversible.",
  "escalation": { "needed": false, "reason": "" }
}
```

**Note:** Clear, quantified non-actionability (312 fires, 0.6% acks, 0 incidents) on a non-critical service. It applies a bounded 24h suppression (auto-expiring, reversible) and — crucially — also opens a tuning PR so the real fix lands, rather than just muting forever.

Noisy but incident-linked once → recommend tuning, do NOT suppress

An alert is genuinely noisy, but it correlated with a real outage once. The absolute rule blocks suppression; it recommends tuning instead.

Input

```
Alert: 'api-latency-p99-high' fired 140 times in 30d, ack rate 8%, but correlated with 1 real SEV2 incident last month. Service=api (customer-facing).
```

Output

```
{
  "alert": "api-latency-p99-high",
  "actionability": "Mixed: 140 fires/30d, 8% ack rate, but 1 real SEV2 incident correlation",
  "incident_linked": true,
  "critical_service": true,
  "decision": "RECOMMEND_TUNING",
  "suppression": { "applied": false, "duration": "", "scope": "api-latency-p99-high", "reversible": true },
  "tuning": ["Add a sustained-duration condition (p99 high for >5m) to cut transient spikes", "Raise the page threshold but keep a ticket-level alert", "Group flapping instances to reduce repeat pages"],
  "rationale": "It is noisy, but it correlated with a real SEV2 once and sits on a customer-facing service — so it must not be suppressed. Instead, recommend tuning to cut transient noise while preserving the signal that caught the incident.",
  "escalation": { "needed": false, "reason": "" }
}
```

**Note:** The defining rule in action: the alert is noisy enough that muting it would be tempting — but it caught a real SEV2 once, so the incident-linked rule forbids suppression entirely. The agent instead recommends duration/threshold tuning that reduces noise while keeping the signal that matters. Noise reduction never trumps catching incidents.

Critical-service alert, noisy → escalate, never auto-act

A flapping alert on the payments service. Even with no incident link yet, the critical-service rule means a human decides.

Input

```
Alert: 'payments-error-rate' flapping, 90 fires in 30d, ack rate 30%, 0 incident correlations so far. Service=payments (critical).
```

Output

```
{
  "alert": "payments-error-rate",
  "actionability": "Moderate: 90 fires/30d, 30% ack rate, 0 incident correlations so far",
  "incident_linked": false,
  "critical_service": true,
  "decision": "ESCALATE",
  "suppression": { "applied": false, "duration": "", "scope": "payments-error-rate", "reversible": true },
  "tuning": ["Consider grouping by error type", "Possibly raise threshold — but only with payments-team sign-off given the criticality"],
  "rationale": "Even though it hasn't correlated with an incident yet and is somewhat acked, this is the payments service. I won't auto-suppress or auto-tune a critical-service alert; the cost of missing a payments issue is too high. Escalating to on-call with tuning options.",
  "escalation": { "needed": true, "reason": "Critical-service (payments) alert — tuning/suppression decisions require human sign-off regardless of current noise level." }
}
```

**Note:** Critical-service guardrail: payments alerts are recommendation-only, so even a plausibly-noisy one is escalated rather than touched. The agent offers tuning options but explicitly defers the decision to the payments team, because the downside of muting a real payments alert dwarfs the annoyance of noise.

## Implementation notes

- Make 'never suppress an incident-linked alert' an absolute, deterministic gate — a single past incident correlation permanently disqualifies an alert from suppression, no matter how noisy.
- Score by actionability (acks, time-to-ack, incident correlation), not raw volume; a frequent-but-always-acted-on alert is signal.
- Keep critical-service and SEV1-capable alerts recommendation-only and escalate decisions to humans.
- Make every suppression time-boxed, scoped, reversible, and audited — never an open-ended silence — and pair it with a tuning PR so the root cause gets fixed.
- Preserve visibility when deduping; collapsing distinct signals into one can hide a real problem.
- Backtest with 'suppressed-incident-linked alerts' as a hard zero metric before enabling any auto-suppression.
- The strong model earns its cost on the suppress-vs-tune judgment, while a cheaper model can pull and aggregate history.

## Variations

Basic

Noise analyzer

Scores alerts by actionability, correlates duplicates, and recommends tuning with the supporting numbers for an SRE. No suppression.

Advanced

Guarded auto-suppression

Adds time-boxed, reversible suppression of proven non-critical noise and tuning PRs, with the absolute incident-linked and critical-service guardrails enforced.

Enterprise

Org-wide alert hygiene

Adds multi-team alert inventories, monitoring-as-code PR workflows, suppression audit and auto-expiry, on-call load analytics, and tuning from outcomes — incident-linked alerts always protected.

Download the Agent Blueprint

[Download Blueprint (.zip)](/downloads/alert-noise-reducer.zip)

Export

[View the source on GitHub](https://github.com/agent-kits/agentaz/tree/main/kits/alert-noise-reducer)

This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

## Frequently asked questions

The hard guarantee is that any alert that has ever correlated with a real incident is permanently ineligible for suppression, and critical-service alerts are recommendation-only. The agent can only auto-suppress proven, never-incident-linked, non-critical noise.

Noise is judged by actionability, not volume: fire count, ack rate, time-to-ack, and incident correlation over a window. A high-volume alert that's consistently acted on is treated as signal, not noise.

Suppression is never permanent. Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged, and it's paired with a tuning recommendation/PR so the underlying noise actually gets fixed.

The agent never auto-suppresses or auto-tunes critical-service alerts. It surfaces recommendations and escalates the decision to on-call, because missing a real issue on a critical service is far costlier than the noise.

Deduplication groups related/duplicate alerts while preserving visibility into the underlying ones, so a single root cause stops paging ten times without collapsing genuinely distinct signals.

To check that it is safe to enable, backtest it on your alert history; the key check is that it would have suppressed zero incident-linked alerts. Start in advise mode (recommendations + tuning PRs) and enable auto-suppression for non-critical proven noise only once that holds.
