Overview #
Score → correlate → tune → suppress (carefully): turns a noisy alert stream into a quieter, still-trustworthy one.
Actionability-based: it ranks alerts by how often they're actually acted on or tied to incidents, not by volume alone.
Recommends concrete tuning (thresholds, grouping, dedup) and only time-box-suppresses alerts proven to be noise.
Defensive: never suppresses an alert ever linked to a real incident, never auto-touches critical/SEV1 alerts, and keeps every action reversible and audited.
AgentAz™ specification #
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do, and why, and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema, which is bundled for offline use and published at a permanent URL:
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "alert-noise-reducer-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Recommends suppressing a meaningful alert for review. Cannot auto-suppress; criticals never suppressed.",
  "authority_boundary": "Clusters alerts and recommends suppression rules for approval; autonomous suppression absent.",
  "tags": [
    "devops-sre",
    "alerting",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_alerts",
      "cluster",
      "dedupe",
      "recommend_rule"
    ],
    "execution_tools_absent": true,
    "never_suppress_critical": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "silence_alert",
      "auto_suppress",
      "close_alert"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.2,
    "alert_threshold_usd": 0.14
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "uncertain_grouping",
      "critical_severity"
    ],
    "destination": "oncall_engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "groupings",
      "recommendations"
    ]
  }
}
New to this? Read the AgentAz specification guide: Trust Levels, DNA patterns, and how it complements your runtime.
AgentAz™ is open source under Apache-2.0; the schema (frozen v1.0.0) and source are on GitHub.
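A contract like this can also be sanity-checked in a few lines before review. The sketch below is illustrative only: `check_contract` and its three rules are assumptions made for this example, not part of the AgentAz schema or any AgentAz tooling, and it inspects only a trimmed copy of the contract above.

```python
import json

# Trimmed copy of the contract above; only the fields this sketch inspects.
CONTRACT_JSON = """
{
  "tool_boundary": {
    "allowed_tools": ["read_alerts", "cluster", "dedupe", "recommend_rule"],
    "execution_tools_absent": true,
    "never_suppress_critical": true
  },
  "output_boundary": {
    "never_emits": ["silence_alert", "auto_suppress", "close_alert"]
  },
  "cost_boundary": {"max_usd_per_trace_loop": 0.2, "alert_threshold_usd": 0.14}
}
"""


def check_contract(contract: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed.

    These rules are hypothetical reviewer checks, not AgentAz requirements.
    """
    problems = []
    tools = set(contract["tool_boundary"]["allowed_tools"])
    forbidden = set(contract["output_boundary"]["never_emits"])
    overlap = tools & forbidden
    if overlap:
        problems.append(f"allowed tools also listed in never_emits: {sorted(overlap)}")
    if not contract["tool_boundary"].get("never_suppress_critical", False):
        problems.append("never_suppress_critical should be true")
    cost = contract["cost_boundary"]
    if cost["alert_threshold_usd"] > cost["max_usd_per_trace_loop"]:
        problems.append("cost alert threshold exceeds the per-loop cap")
    return problems


contract = json.loads(CONTRACT_JSON)
problems = check_contract(contract)
print(problems or "contract looks self-consistent")
```

Schema validation still belongs to the published JSON Schema; a check like this only catches cross-field contradictions that a schema usually cannot express.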
Governance matrix #
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship, not new functionality.
| Dimension | Coverage |
|---|---|
| Agent goal | Bounded by the authority spec above |
| Trust Level | A2: Recommend |
| Tool access | Least privilege: execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on uncertain grouping, critical severity → oncall engineer |
| Audit trail | Append-only log (groupings, recommendations) |
| Cost & loop bounds | ≤ $0.2 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to oncall engineer |
Agent component mapping #
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity; it is not an official integration or certified compatibility.
| Component | Mapping |
|---|---|
| Agent | Primary reasoner: Recommend authority (A2) |
| Tools | read alerts, cluster, dedupe, recommend rule; execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.2/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to oncall engineer on uncertain grouping, critical severity |
Failure modes #
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each: the boundaries that make it safe to run, stated plainly.
Recommends suppressing an alert that actually mattered.
- Detection: Critical severity is excluded from suppression and grouping confidence is scored.
- Mitigation: Suppression is a recommendation requiring approval; criticals are never suppressed.
- Recovery: An engineer rejects the rule and the alert remains.
Over-groups distinct alerts, masking a second incident.
- Detection: A grouping similarity threshold runs and divergent signals are flagged.
- Mitigation: Uncertain groupings are flagged, not merged silently.
- Recovery: The engineer splits the group.
A suppression rule persists after the underlying issue changes.
- Detection: Rules are time-bounded and reviewed.
- Mitigation: Rules expire and require re-approval.
- Recovery: Stale rules lapse and the alert resurfaces.
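The third failure mode relies on rules that cannot outlive their time-box. A minimal sketch of that idea follows, assuming a hypothetical `SuppressionRule` type and a 24-hour cap chosen for illustration; neither is taken from the agent's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed cap for this sketch; a real deployment would read it from config.
MAX_SUPPRESSION = timedelta(hours=24)


@dataclass(frozen=True)
class SuppressionRule:
    """Hypothetical approved rule that is only valid inside its time-box."""
    alert: str
    approved_by: str
    created_at: datetime
    duration: timedelta

    def __post_init__(self):
        # Reject open-ended or over-long silences at construction time.
        if self.duration <= timedelta(0) or self.duration > MAX_SUPPRESSION:
            raise ValueError("suppression must be time-boxed within the cap")

    @property
    def expires_at(self) -> datetime:
        return self.created_at + self.duration

    def is_active(self, now: datetime) -> bool:
        # Expiry is exclusive: at expires_at the alert resurfaces.
        return self.created_at <= now < self.expires_at


def active_rules(rules, now):
    """Keep only rules still inside their time-box; stale ones simply lapse."""
    return [r for r in rules if r.is_active(now)]


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
rule = SuppressionRule("batch-worker-cpu-high", "oncall-engineer", t0, timedelta(hours=24))
print(rule.expires_at.isoformat())
```

Because activity is computed from timestamps rather than stored as a flag, a forgotten rule lapses on its own and needs a fresh approval to continue.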
Evaluation #
Suppression precision with critical-alert safety is primary: suppressing an alert that mattered is the failure.
| Metric | Definition |
|---|---|
| Suppression precision | Of alerts recommended for suppression, the share that were genuinely noise. |
| Critical-miss rate | Frequency of critical alerts caught in a suppression recommendation; must be zero. |
| Grouping accuracy | Share of alert groupings that are correct, with no masked second incident. |
| Rule-decay handling | Share of stale suppression rules correctly expired. |
| Latency | Time to a grouping or recommendation. |
Recommended approach. Use a labeled alert stream with known noise versus actionable alerts; measure suppression precision and treat any suppressed critical as a hard failure. Verify that groupings don't merge distinct incidents and that rules expire.
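The two headline metrics can be computed directly from such a labeled stream. The sketch below assumes a hypothetical `LabeledAlert` record; the sample rows are made up for illustration and are not measured results.

```python
from dataclasses import dataclass


@dataclass
class LabeledAlert:
    """Hypothetical evaluation record: ground truth plus the agent's recommendation."""
    name: str
    is_noise: bool               # ground-truth label from the labeled stream
    is_critical: bool
    recommended_suppress: bool   # what the agent recommended


def suppression_precision(alerts):
    """Share of suppression recommendations that were genuinely noise."""
    recommended = [a for a in alerts if a.recommended_suppress]
    if not recommended:
        return None  # undefined when nothing was recommended
    return sum(a.is_noise for a in recommended) / len(recommended)


def critical_misses(alerts):
    """Critical alerts caught in a suppression recommendation; must be empty."""
    return [a.name for a in alerts if a.recommended_suppress and a.is_critical]


# Illustrative data only.
sample = [
    LabeledAlert("batch-worker-cpu-high", is_noise=True, is_critical=False, recommended_suppress=True),
    LabeledAlert("disk-tmp-warning", is_noise=True, is_critical=False, recommended_suppress=True),
    LabeledAlert("cache-evictions", is_noise=False, is_critical=False, recommended_suppress=True),
    LabeledAlert("payments-error-rate", is_noise=False, is_critical=True, recommended_suppress=False),
]

precision = suppression_precision(sample)  # 2 of the 3 recommended alerts are labeled noise
misses = critical_misses(sample)
print(f"precision={precision:.2f}, critical misses={misses}")
```

Treating a non-empty `critical_misses` result as a test failure, rather than as one more number on a dashboard, matches the "must be zero" requirement in the table.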
When to use #
Use it when
- On-call is drowning in alerts and real signals are getting lost in the noise.
- You have alert history (fire/ack/incident-correlation) the agent can score actionability from.
- You want data-backed tuning recommendations and safe, reversible suppression of proven noise.
- You want to cut fatigue while keeping a hard guarantee that incident-linked and critical alerts are never silenced.
Avoid it when
- You lack alert/incident history, so actionability can't be measured and suppression would be blind.
- You expect it to autonomously silence critical-service alerts; those are recommendation-only.
- Your 'noisy' alerts are actually under-investigated real signals.
- You can't keep suppression reversible, time-boxed, and audited.
System prompt #
You are an Alert Noise Reduction Agent helping an on-call/SRE team cut alert fatigue. You analyze alerts, recommend tuning, and suppress proven noise, WITHOUT ever silencing a real signal. You are judged on reducing non-actionable noise AND on never suppressing an alert that matters.
== CORE PRINCIPLES ==
1. Actionability, not volume. Judge an alert by evidence of whether it leads to action: ack rate, time-to-ack, and (most importantly) whether it has ever correlated with a real incident. A high-volume alert that's always acted on is signal, not noise.
2. Suppress nothing you can't prove is noise. Only recommend/auto-suppress alerts with a strong, evidence-backed non-actionability record. When in doubt, recommend tuning, not silence.
3. Reversible and time-boxed. Suppression is always temporary, scoped, auditable, and easy to undo. You never permanently delete an alert rule.
== HARD RULES (NON-NEGOTIABLE) ==
- INCIDENT-LINKED = NEVER SUPPRESS: If an alert has EVER correlated with a real incident (even once), you must not suppress it. At most, recommend tuning (threshold/grouping). This rule is absolute.
- CRITICAL SERVICES ESCALATE: For alerts on critical/customer-facing services or SEV1-capable signals, you never auto-suppress; you recommend tuning and escalate the decision to a human.
- EVIDENCE REQUIRED: Auto-suppress only with a clear record (e.g. fired many times over a meaningful window with ~0 acks and 0 incident correlations) on a non-critical signal. State the numbers.
- BOUNDED SUPPRESSION: Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged. Never an open-ended silence.
- NO BLIND DEDUP: When grouping/deduping, preserve the ability to see the underlying alerts; never collapse distinct real signals into one that hides a problem.
== METHOD ==
- Pull each alert's history: fire count, ack rate, time-to-ack, and incident correlations over a window.
- Score actionability. Correlate/dedupe related alerts into groups. Identify chronically non-actionable, never-incident-linked, non-critical alerts as noise candidates.
- For noise candidates: recommend tuning and, if enabled and within guardrails, time-box suppress. For everything else: recommend tuning only or leave as-is.
== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- AUTO_SUPPRESS: non-critical, zero incident correlation, strong non-actionable record, confidence >= 0.85. Time-boxed + tracked.
- RECOMMEND_TUNING: noisy but incident-linked at least once, or critical service, or moderate evidence. Propose thresholds/grouping; do not suppress.
- ESCALATE: critical-service/SEV1 alerts, conflicting evidence, or anything you're unsure about.
== COST CONTROL ==
Pull the history you need to score; reuse it across related alerts. Cap tool calls; if exceeded, recommend based on what you have.
== OUTPUT FORMAT (return ONE JSON object) ==
{
"alert": "<alert name/id or group>",
"actionability": "<score + the numbers: fires, ack rate, incident correlations over window>",
"incident_linked": <bool>,
"critical_service": <bool>,
"decision": "AUTO_SUPPRESS|RECOMMEND_TUNING|ESCALATE",
"suppression": { "applied": <bool>, "duration": "<time-box, or empty>", "scope": "<specific alert/condition>", "reversible": true },
"tuning": ["<concrete recommendation: threshold/grouping/dedup>"],
"rationale": "<evidence-grounded reason>",
"escalation": { "needed": <bool>, "reason": "<critical/uncertain, or empty>" }
}
If incident_linked is true or critical_service is true, decision must NOT be AUTO_SUPPRESS.
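Since the final constraint in the prompt is only an instruction to the model, a caller may want to check every returned object before acting on it. The following is a minimal sketch of such a check; `validate_output` is a hypothetical helper, and the extra rules about `suppression` are assumptions inferred from the hard rules above rather than a documented contract.

```python
DECISIONS = {"AUTO_SUPPRESS", "RECOMMEND_TUNING", "ESCALATE"}
REQUIRED_KEYS = {
    "alert", "actionability", "incident_linked", "critical_service",
    "decision", "suppression", "tuning", "rationale", "escalation",
}


def validate_output(out: dict) -> list[str]:
    """Return a list of violations; an empty list means the object passed."""
    missing = REQUIRED_KEYS - set(out)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    errors = []
    if out["decision"] not in DECISIONS:
        errors.append(f"unknown decision: {out['decision']!r}")
    # Invariant stated in the prompt: protected alerts are never auto-suppressed.
    if (out["incident_linked"] or out["critical_service"]) and out["decision"] == "AUTO_SUPPRESS":
        errors.append("incident-linked or critical alert must not be AUTO_SUPPRESS")
    suppression = out["suppression"]
    if suppression.get("applied"):
        # Assumed rules: suppression only accompanies AUTO_SUPPRESS and needs a time-box.
        if out["decision"] != "AUTO_SUPPRESS":
            errors.append("suppression applied without an AUTO_SUPPRESS decision")
        if not suppression.get("duration"):
            errors.append("applied suppression has no time-box")
    return errors


sample = {
    "alert": "batch-worker-cpu-high",
    "actionability": "Very low: 312 fires/30d, 0.6% ack rate, 0 incident correlations",
    "incident_linked": False,
    "critical_service": False,
    "decision": "AUTO_SUPPRESS",
    "suppression": {"applied": True, "duration": "24h",
                    "scope": "batch-worker-cpu-high only", "reversible": True},
    "tuning": ["Raise CPU threshold from 80% to 92% for batch-worker"],
    "rationale": "Never acked, never incident-linked, non-critical service.",
    "escalation": {"needed": False, "reason": ""},
}
errors = validate_output(sample)
print(errors or "output passes the checks")
```

Rejecting an object outright when a check fails is the conservative choice: the alert stays untouched and a human sees the discrepancy.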
Setup guide #
Install and connect alerting
Install the agent and connect it (read) to your alerting and incident systems.
pipx install alert-noise-agent
alert-noise-agent connect --alerts prometheus,pagerduty --incidents pagerduty
alert-noise-agent doctor
Configure guardrails
The incident-linked and critical-service protections are enforced here, not by the model.
cp .env.example .env
ANTHROPIC_API_KEY=sk-ant-...
NEVER_SUPPRESS_IF_INCIDENT_LINKED=true
MAX_SUPPRESSION=24h # time-box; auto-expires
MODE=advise # advise (recommend) | act (auto-suppress proven noise)
Mark critical services
Alerts on these are recommendation-only β never auto-suppressed.
critical_services: ["checkout", "auth", "payments", "db-primary"]
noise_threshold: { window: 30d, min_fires: 50, max_ack_rate: 0.02, incident_correlations: 0 }
suppression: { reversible: true, max_duration: 24h }
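As a rough illustration of how the `noise_threshold` values above might be applied, the sketch below hard-codes the same numbers. `NoiseThreshold` and `is_noise_candidate` are hypothetical names, and the function assumes the fire and ack counts were already aggregated over the 30d window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseThreshold:
    """Mirrors the noise_threshold config above (window assumed pre-applied)."""
    min_fires: int = 50
    max_ack_rate: float = 0.02
    max_incident_correlations: int = 0


CRITICAL_SERVICES = {"checkout", "auth", "payments", "db-primary"}


def is_noise_candidate(fires, acks, incident_correlations, service,
                       threshold=NoiseThreshold()):
    """True only for non-critical, never-incident-linked, rarely-acked alerts.

    incident_correlations should be the all-time count, since a single past
    incident disqualifies an alert from suppression.
    """
    if service in CRITICAL_SERVICES:
        return False
    if incident_correlations > threshold.max_incident_correlations:
        return False
    if fires < max(threshold.min_fires, 1):
        return False  # too little evidence (also avoids dividing by zero)
    return acks / fires <= threshold.max_ack_rate


# Counts loosely based on the scenarios in the Examples section.
print(is_noise_candidate(312, 2, 0, "batch-worker"))   # about 0.6% acked, non-critical
print(is_noise_candidate(140, 11, 1, "api"))           # incident-linked once
print(is_noise_candidate(90, 27, 0, "payments"))       # critical service
```

Checking the critical-service and incident-link conditions before the ack rate keeps the protective rules independent of how noisy an alert happens to look.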
Backtest on alert history
Replay history to confirm it would never have suppressed an incident-linked alert.
alert-noise-agent backtest --range 90d --explain
Wire in (advise first)
Run scheduled analysis and post recommendations; enable auto-suppression for proven non-critical noise once backtests are clean.
Architecture #
Tools required #
Workflow #
- Ingest the stream
Pull the alert inventory and metadata over the analysis window.
- Score actionability
For each alert, compute ack rate, time-to-ack, and fire volume, and pull incident correlations.
- Correlate & dedupe
Group related/duplicate alerts so one root cause isn't ten pages, keeping the underlying alerts visible.
- Identify noise candidates
Flag chronically non-actionable, never-incident-linked, non-critical alerts, and nothing else.
- Apply the gate
Forbid suppressing any incident-linked or critical-service alert; those get tuning recommendations or escalation instead.
- Tune or suppress
Open tuning PRs and apply bounded, reversible, time-boxed suppression only where the evidence and guardrails allow.
- Audit & review
Log every action with evidence and expiry, and surface a review so on-call sees exactly what was quieted.
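The "Correlate & dedupe" step is the one most likely to hide information if done carelessly. One way to keep the underlying alerts visible is to group by reference rather than merge, as in this sketch; the `Alert` and `AlertGroup` types and the `fingerprint` field are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    service: str
    fingerprint: str  # hypothetical grouping key, e.g. a suspected root cause


@dataclass
class AlertGroup:
    key: tuple
    members: list = field(default_factory=list)  # underlying alerts stay inspectable


def group_alerts(alerts):
    """Group alerts sharing (service, fingerprint) without discarding any of them."""
    groups = {}
    for alert in alerts:
        key = (alert.service, alert.fingerprint)
        groups.setdefault(key, AlertGroup(key)).members.append(alert)
    return list(groups.values())


alerts = [
    Alert("batch-worker-cpu-high", "batch-worker", "saturation"),
    Alert("batch-worker-memory", "batch-worker", "saturation"),
    Alert("api-latency-p99-high", "api", "latency"),
]
for group in group_alerts(alerts):
    print(group.key, [a.name for a in group.members])
```

A page can then be sent once per group while the review view still lists every member, so a second, distinct problem inside a group remains discoverable.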
Examples #
Chronic non-actionable alert → time-boxed auto-suppress + tuning PR
An alert fires constantly, is never acked, and has never been tied to an incident, on a non-critical service. Safe, bounded noise to quiet.
Input
Alert: 'batch-worker-cpu-high' fired 312 times in 30d, ack rate 0.6%, 0 incident correlations, service=batch-worker (non-critical).
Output
{
  "alert": "batch-worker-cpu-high",
  "actionability": "Very low: 312 fires/30d, 0.6% ack rate, 0 incident correlations",
  "incident_linked": false,
  "critical_service": false,
  "decision": "AUTO_SUPPRESS",
  "suppression": { "applied": true, "duration": "24h", "scope": "batch-worker-cpu-high only", "reversible": true },
  "tuning": ["Raise CPU threshold from 80% to 92% for batch-worker, or add a 15m sustained-duration condition", "Group with batch-worker-memory into one 'batch-worker saturation' alert"],
  "rationale": "Fires constantly, almost never acked, and has never correlated with an incident on a non-critical service: textbook noise. Suppressing for 24h while a tuning PR adjusts the threshold; suppression auto-expires and is reversible.",
  "escalation": { "needed": false, "reason": "" }
}
Note: Clear, quantified non-actionability (312 fires, 0.6% acks, 0 incidents) on a non-critical service. It applies a bounded 24h suppression (auto-expiring, reversible) and, crucially, also opens a tuning PR so the real fix lands, rather than just muting forever.
Noisy but incident-linked once → recommend tuning, do NOT suppress
An alert is genuinely noisy, but it correlated with a real outage once. The absolute rule blocks suppression; it recommends tuning instead.
Input
Alert: 'api-latency-p99-high' fired 140 times in 30d, ack rate 8%, but correlated with 1 real SEV2 incident last month. Service=api (customer-facing).
Output
{
  "alert": "api-latency-p99-high",
  "actionability": "Mixed: 140 fires/30d, 8% ack rate, but 1 real SEV2 incident correlation",
  "incident_linked": true,
  "critical_service": true,
  "decision": "RECOMMEND_TUNING",
  "suppression": { "applied": false, "duration": "", "scope": "api-latency-p99-high", "reversible": true },
  "tuning": ["Add a sustained-duration condition (p99 high for >5m) to cut transient spikes", "Raise the page threshold but keep a ticket-level alert", "Group flapping instances to reduce repeat pages"],
  "rationale": "It is noisy, but it correlated with a real SEV2 once and sits on a customer-facing service, so it must not be suppressed. Instead, recommend tuning to cut transient noise while preserving the signal that caught the incident.",
  "escalation": { "needed": false, "reason": "" }
}
Note: The defining rule in action: the alert is noisy enough that muting it would be tempting, but it caught a real SEV2 once, so the incident-linked rule forbids suppression entirely. The agent instead recommends duration/threshold tuning that reduces noise while keeping the signal that matters. Noise reduction never trumps catching incidents.
Critical-service alert, noisy → escalate, never auto-act
A flapping alert on the payments service. Even with no incident link yet, the critical-service rule means a human decides.
Input
Alert: 'payments-error-rate' flapping, 90 fires in 30d, ack rate 30%, 0 incident correlations so far. Service=payments (critical).
Output
{
  "alert": "payments-error-rate",
  "actionability": "Moderate: 90 fires/30d, 30% ack rate, 0 incident correlations so far",
  "incident_linked": false,
  "critical_service": true,
  "decision": "ESCALATE",
  "suppression": { "applied": false, "duration": "", "scope": "payments-error-rate", "reversible": true },
  "tuning": ["Consider grouping by error type", "Possibly raise threshold, but only with payments-team sign-off given the criticality"],
  "rationale": "Even though it hasn't correlated with an incident yet and is somewhat acked, this is the payments service. I won't auto-suppress or auto-tune a critical-service alert; the cost of missing a payments issue is too high. Escalating to on-call with tuning options.",
  "escalation": { "needed": true, "reason": "Critical-service (payments) alert: tuning/suppression decisions require human sign-off regardless of current noise level." }
}
Note: Critical-service guardrail: payments alerts are recommendation-only, so even a plausibly-noisy one is escalated rather than touched. The agent offers tuning options but explicitly defers the decision to the payments team, because the downside of muting a real payments alert dwarfs the annoyance of noise.
Implementation notes #
- Make 'never suppress an incident-linked alert' an absolute, deterministic gate: a single past incident correlation permanently disqualifies an alert from suppression, no matter how noisy.
- Score by actionability (acks, time-to-ack, incident correlation), not raw volume; a frequent-but-always-acted-on alert is signal.
- Keep critical-service and SEV1-capable alerts recommendation-only and escalate decisions to humans.
- Make every suppression time-boxed, scoped, reversible, and audited (never an open-ended silence), and pair it with a tuning PR so the root cause gets fixed.
- Preserve visibility when deduping; collapsing distinct signals into one can hide a real problem.
- Backtest with 'suppressed-incident-linked alerts' as a hard zero metric before enabling any auto-suppression.
- The strong model earns its cost on the suppress-vs-tune judgment, while a cheaper model can pull and aggregate history.
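The deterministic gate from the first note can live entirely outside the model. The sketch below shows one possible shape, assuming a hypothetical `apply_gate` function that post-processes the model's proposed decision. The order of the checks and the choice of fallback decision are assumptions, picked to match the three worked examples above.

```python
def apply_gate(proposed, *, incident_linked, critical_service, confidence,
               mode="advise", min_confidence=0.85):
    """Downgrade an AUTO_SUPPRESS proposal whenever a guardrail applies.

    Other decisions pass through unchanged; the gate can only make the
    outcome more conservative, never less.
    """
    if proposed != "AUTO_SUPPRESS":
        return proposed
    if incident_linked:
        return "RECOMMEND_TUNING"   # absolute rule: never suppress incident-linked alerts
    if critical_service:
        return "ESCALATE"           # a human decides for critical services
    if confidence < min_confidence or mode != "act":
        return "RECOMMEND_TUNING"   # weak evidence, or auto-suppression not enabled
    return "AUTO_SUPPRESS"


# Proven noise with auto-suppression enabled: the proposal stands.
print(apply_gate("AUTO_SUPPRESS", incident_linked=False, critical_service=False,
                 confidence=0.92, mode="act"))
# The same proposal on an incident-linked alert is downgraded.
print(apply_gate("AUTO_SUPPRESS", incident_linked=True, critical_service=True,
                 confidence=0.92, mode="act"))
```

Keeping the gate as plain code means the guarantee does not depend on prompt wording, and a backtest can run the same function over historical alerts.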
Variations #
Basic
Noise analyzer
Scores alerts by actionability, correlates duplicates, and recommends tuning with the supporting numbers for an SRE. No suppression.
Advanced
Guarded auto-suppression
Adds time-boxed, reversible suppression of proven non-critical noise and tuning PRs, with the absolute incident-linked and critical-service guardrails enforced.
Enterprise
Org-wide alert hygiene
Adds multi-team alert inventories, monitoring-as-code PR workflows, suppression audit and auto-expiry, on-call load analytics, and tuning from outcomes, with incident-linked alerts always protected.
This blueprint and the AgentAz™ specification live in the central AgentKits registry, open source under Apache-2.0 (code & schema) and CC-BY-4.0 (text).
Frequently asked questions #
No, that's the hard guarantee. Any alert that has ever correlated with a real incident is permanently ineligible for suppression, and critical-service alerts are recommendation-only. It can only auto-suppress proven, never-incident-linked, non-critical noise.
By actionability, not volume: fire count, ack rate, time-to-ack, and incident correlation over a window. A high-volume alert that's consistently acted on is treated as signal, not noise.
No. Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged, and it's paired with a tuning recommendation/PR so the underlying noise actually gets fixed.
It never auto-suppresses or auto-tunes them. It surfaces recommendations and escalates the decision to on-call, because missing a real issue on a critical service is far costlier than the noise.
It groups related/duplicate alerts while preserving visibility into the underlying ones, so a single root cause stops paging ten times without collapsing genuinely distinct signals.
Backtest it on your alert history; the key check is that it would have suppressed zero incident-linked alerts. Start in advise mode (recommendations + tuning PRs) and enable auto-suppression for non-critical proven noise only once that holds.