Grafana Alerting Checklist: Wiring AI Anomaly Scores Correctly

Originally published on kuryzhev.cloud

Three days. That's how long a memory leak crept through one of our checkout pods before anyone noticed — because our CPU-at-80% alert never fired. The leak was slow, gradual, and stayed well under every static threshold we had. When we finally caught it, it was a customer complaint that tipped us off, not Prometheus. That incident is the reason we built a Grafana AI anomaly detection pipeline, and it's the reason this checklist exists.

Static thresholds work fine for hard failures — disk full, service down, 500s spiking. They fail at catching slow-burn, multivariate drift: memory creeping up 2% a day, disk I/O latency inching from 4ms to 40ms over a week, a subtle shift in request pattern that only looks wrong when you compare it against the last 30 days of baseline. That's exactly the class of problem anomaly-detection models are good at — they learn "normal" for a given service and flag statistical deviation, not just a fixed number.
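To make that concrete, here is a minimal sketch of the idea, not our production model: compare the last few days against a 30-day baseline instead of a fixed number. The drift_zscore helper and the sample memory figures are made up for illustration, and a plain z-score stands in for what Prophet or PyOD would do in a real pipeline.

```python
from statistics import mean, stdev

def drift_zscore(baseline, recent):
    """How many baseline standard deviations the recent mean sits from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return 0.0 if mean(recent) == mu else float("inf")
    return (mean(recent) - mu) / sigma

# Hypothetical memory usage (% of limit): 30 days of baseline, then the last 3 days.
baseline = [40.0 + (day % 5) * 0.5 for day in range(30)]  # hovers between 40 and 42
recent = [46.0, 48.0, 50.0]                               # creeping up about 2 points a day

STATIC_THRESHOLD = 80.0
static_alert = max(recent) > STATIC_THRESHOLD  # the fixed rule never trips
z = drift_zscore(baseline, recent)             # the baseline comparison does

print(f"static alert fired: {static_alert}, drift z-score: {z:.1f}")
```

The static rule stays quiet because 50% is nowhere near 80%, while the deviation from the learned baseline is already many standard deviations wide.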

But here's the catch nobody tells you upfront: wiring an ML model's output into an alerting system is its own project, separate from building the model. You're not just training Prophet or PyOD on a metric; you're turning a float between 0 and 1 into a reliable page that doesn't wake someone up for nothing. Alert fatigue is real, and a poorly tuned anomaly pipeline generates more noise than the static rules it replaced. The setup cost is only worth it if you're running a large fleet, dealing with seasonal traffic, or managing multi-tenant systems where "normal" varies by customer.

This checklist assumes you already have a trained model (Prophet, PyOD, or Grafana's own Machine Learning plugin) producing a score. What it covers is everything between "the model outputs a number" and "the right person gets paged with useful context."

We run through this list every time we onboard a new service into the anomaly pipeline. Skipping any of these steps is how you end up with silent gaps or 3am pages for nothing.

- [...] scrape_interval. Mismatched intervals create phantom gaps that Grafana reads as "no data."
- [...] --storage.tsdb.retention.time covers it before you even start training.
- [...] remote_write or expose them on a scrape endpoint. We use the latter: a small exporter that Prometheus scrapes like any other target.
- [...] service, instance, and job labels as the underlying metric, or Grafana can't correlate them on one panel.
- [...] /etc/grafana/provisioning/alerting/; never click-ops this in production. GitOps or nothing.
- [...] evaluate every.
- [...] for: pending duration.
- [...] grafana-cli. Silent misconfiguration is the most common failure mode we've seen.

Here's the actual stack we run for this, minus the model training pipeline. The exporter runs the trained model against Prometheus data and exposes a score:

version: "3.8"
services:
  anomaly-exporter:
    image: myorg/anomaly-exporter:1.4.2   # pinned version, avoid :latest
    ports:
      - "127.0.0.1:9187:9187"             # bind to localhost only, internal scrape
    environment:
      - MODEL_PATH=/models/prophet_v3.pkl
      - SCORE_INTERVAL_SECONDS=300        # batch scoring every 5m, cost control
      - PROMETHEUS_URL=http://prometheus:9090
    volumes:
      - ./models:/models:ro
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    volumes:
      - ./provisioning:/etc/grafana/provisioning   # alert rules + datasources as code
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme   # placeholder; replace before deploying
      - GF_UNIFIED_ALERTING_ENABLED=true

volumes:
  prom_data:
  grafana_data:
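For orientation, the scrape side of an exporter like this can be sketched in a few lines of standard-library Python. This is an illustrative stand-in, not the code inside the myorg/anomaly-exporter image: SCORES, render_metrics, and serve are hypothetical names, and the model call that would refresh the scores every SCORE_INTERVAL_SECONDS is left out.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store; a real exporter would refresh it from the model.
SCORES = {("checkout", "cpu"): 0.42}

def render_metrics(scores):
    """Render scores in the Prometheus text exposition format."""
    lines = [
        "# HELP anomaly_score Normalized anomaly score between 0 and 1.",
        "# TYPE anomaly_score gauge",
    ]
    for (service, metric), value in sorted(scores.items()):
        lines.append(f'anomaly_score{{service="{service}",metric="{metric}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SCORES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=9187):
    """Blocking call: serve /metrics until the process is stopped."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint on port 9187. Prometheus adds the job and instance labels at scrape time, so the exporter only has to emit the labels that identify what was scored.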

And the alert rule that consumes the score, provisioned as code rather than clicked into existence in the UI:

apiVersion: 1
groups:
  - orgId: 1
    name: infra-anomaly-group
    folder: AI-Anomaly-Detection
    interval: 5m                     # must match SCORE_INTERVAL_SECONDS above
    rules:
      - uid: anomaly-checkout-cpu
        title: "Anomaly Score - Checkout Service CPU"
        condition: C
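        data:
          - refId: A
            queryType: ""
            relativeTimeRange:         # look back 10 minutes from evaluation time
              from: 600
              to: 0
            datasourceUid: prometheus-uid
            model:
              refId: A
              expr: anomaly_score{service="checkout", metric="cpu"}
          - refId: B                   # reduce the series to its latest value
            queryType: ""
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          - refId: C
            queryType: ""
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B            # threshold conditions need reduced (single-value) input
              conditions:
                - evaluator:
                    type: gt
                    params: [0.8]      # threshold on the normalized 0-1 score, not raw model output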
        for: 5m                       # pending period to avoid single-point noise
        labels:
          severity: warning
        annotations:
          summary: "Anomaly detected in checkout service CPU pattern"
          runbook_url: "https://kuryzhev.cloud/runbooks/anomaly-checkout"

Full syntax for unified alerting provisioning is documented in the Grafana alerting provisioning docs — worth bookmarking, since the schema shifts slightly between minor versions.
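The interaction between the group interval and the for: pending period is easy to misread, so here is a simplified Python model of it. It is a sketch under stated assumptions rather than Grafana's actual state machine (it ignores the NoData and Error states), and alert_states is a hypothetical helper: a breach is Pending until it has persisted for the full pending period, and only then Alerting.

```python
def alert_states(scores, threshold=0.8, interval_s=300, for_s=300):
    """Return the state after each evaluation: Normal, Pending, or Alerting."""
    states = []
    breach_started = None  # evaluation time at which the current breach began
    for i, score in enumerate(scores):
        now = i * interval_s
        if score > threshold:
            if breach_started is None:
                breach_started = now
            # Fire only once the breach has lasted the whole pending period.
            state = "Alerting" if now - breach_started >= for_s else "Pending"
        else:
            breach_started = None
            state = "Normal"
        states.append(state)
    return states

# One isolated spike never pages; two consecutive breaches do.
print(alert_states([0.2, 0.9, 0.3, 0.9, 0.95]))
```

With a 5m interval and a 5m pending period, a single noisy score resolves back to Normal, and a page needs the score to stay above the threshold across two consecutive evaluations.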

Even teams that follow the checklist above hit the same handful of gotchas. Here are the ones that bit us hardest.

Watch out for timestamp mismatches. If your anomaly exporter's timestamps are consistently offset from Prometheus's scrapes of the raw metric, even by a few seconds, Grafana will show gaps that look like "no data" rather than a healthy series. We lost half a day debugging what looked like a broken pipeline; it was just clock drift between the exporter container and the Prometheus host.

Version-pin everything, always. We once let the anomaly exporter float on :latest, and a routine image rebuild silently changed the model's normalization window. Alert behavior shifted overnight with no deploy log entry to explain it. Pin the image tag, pin the model file name, and log the model version in the alert annotation so on-call knows exactly what's scoring.

Give new hosts a cold-start grace period. A pod that just spun up has no baseline history — every metric looks "anomalous" simply because there's nothing to compare against. We now suppress anomaly alerts for the first two hours after a pod's first-seen timestamp. Skip this and you'll get paged every time autoscaling adds a node.
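The grace-period check itself is small. The sketch below assumes you can look up a first-seen timestamp per pod; in_cold_start and GRACE_PERIOD are illustrative names, and where the suppression is enforced (in the exporter, or as a silence) is left open.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=2)  # the two-hour window described above

def in_cold_start(first_seen, now, grace=GRACE_PERIOD):
    """True while a pod is still too young to have a usable baseline."""
    return now - first_seen < grace

first_seen = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical pod start
print(in_cold_start(first_seen, first_seen + timedelta(minutes=30)))  # still suppressed
print(in_cold_start(first_seen, first_seen + timedelta(hours=3)))     # alerts allowed
```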

Set up dedup and grouping in your notification policy. One root-cause anomaly (say, a bad deploy) often triggers correlated score spikes across CPU, memory, and latency simultaneously. Without grouping by service in your notification policy, that's three pages instead of one. I stopped trusting default notification policies after our on-call got seven pings in ninety seconds for a single deploy rollback.
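What grouping buys you can be sketched in a few lines. This mimics, in simplified form, what a notification policy's group-by setting does; group_alerts and the alert dictionaries are illustrative, not Grafana's internal structures.

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("service",)):
    """Bucket alerts by the chosen labels; each bucket becomes one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical alerts fired by a single bad deploy of the checkout service.
alerts = [
    {"labels": {"service": "checkout", "metric": "cpu"}},
    {"labels": {"service": "checkout", "metric": "memory"}},
    {"labels": {"service": "checkout", "metric": "latency"}},
]
notifications = group_alerts(alerts)
print(f"{len(alerts)} alerts -> {len(notifications)} notification(s)")
```

Grouping by service collapses the correlated alerts into a single notification, while adding metric to group_by would split them back out into one notification each.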

One more that's easy to overlook: securing the exporter endpoint. Anomaly score exporters running on ports like :9187 should be internal-only and scraped with a token or mTLS. A raw confidence score leaking to the outside world tells an attacker more about your infra's normal behavior than you'd want them to have.
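On the exporter side, a bearer-token check could look roughly like this. It is a sketch, assuming the token is shared through a secret and sent by Prometheus in the Authorization header (configurable via the authorization block of a scrape config); is_authorized is a hypothetical helper, and mTLS would be handled at the TLS layer instead.

```python
import hmac

def is_authorized(auth_header, expected_token):
    """Check an Authorization header against the expected bearer token."""
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return False
    presented = auth_header[len(prefix):]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())

print(is_authorized("Bearer s3cret", "s3cret"))  # token matches
print(is_authorized(None, "s3cret"))             # header missing
```

A request handler would call this before rendering metrics and answer 401 when it returns False.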

Manually maintaining an anomaly pipeline doesn't scale past a handful of services. Here's what we automated once the pattern proved itself.

We run a nightly CI job (GitHub Actions) that retrains the model against a rolling 90-day window and validates it against a holdout dataset before promoting new normalization thresholds. If the false-positive rate on the holdout exceeds roughly 5%, the pipeline fails the build instead of shipping a worse model — this single gate saved us from at least two bad retrains that would have flooded on-call.
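The gate boils down to one ratio and one comparison. The sketch below assumes the holdout set is labeled with booleans (True meaning a real anomaly) and that the candidate model's predictions use the same encoding; false_positive_rate and gate are illustrative names, not our CI script.

```python
def false_positive_rate(labels, predictions):
    """FP / (FP + TN) over the holdout set; True means anomaly."""
    fp = sum(1 for y, p in zip(labels, predictions) if p and not y)
    tn = sum(1 for y, p in zip(labels, predictions) if not p and not y)
    negatives = fp + tn
    return fp / negatives if negatives else 0.0

def gate(labels, predictions, max_fpr=0.05):
    """Return (passed, fpr); the build fails when passed is False."""
    fpr = false_positive_rate(labels, predictions)
    return fpr <= max_fpr, fpr

# Hypothetical holdout: 100 normal windows followed by 5 real anomalies.
labels = [False] * 100 + [True] * 5
good_model = [True] * 3 + [False] * 97 + [True] * 5  # 3 false alarms
bad_model = [True] * 8 + [False] * 92 + [True] * 5   # 8 false alarms

print(gate(labels, good_model))
print(gate(labels, bad_model))
```

In a CI job, the caller would exit non-zero when the first element of the result is False, which is what stops a worse model from being promoted.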

Alert rules and contact points live in Terraform, synced to the anomaly exporter's service config in the same repo. That way a new microservice onboarding into the anomaly pipeline is a single PR: add the exporter config, add the Grafana rule block, merge, done. No manual clicking through the Grafana UI at 2am to add a rule for a service that just went live.

We also template runbook links directly into alert annotations — the PagerDuty notification includes which model version fired, when it was last retrained, and which features drove the score. It's a small thing, but it cuts investigation time significantly when someone unfamiliar with the model is the one paged.

Last piece: a feedback loop. When someone marks an anomaly alert as a false positive in Grafana, a small script logs that window into a training-exclusion dataset consumed by the next retrain cycle. Without this, the same seasonal blip (Black Friday traffic, month-end batch jobs) keeps triggering the same false alarm every cycle. We wrote more about building these kinds of self-correcting alerting loops in our DevOps_DayS notes, if you want the broader pattern applied outside of anomaly detection specifically.
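The exclusion step can be as simple as filtering samples by time window before the retrain. This sketch assumes samples are (timestamp, value) pairs and exclusion windows are inclusive (start, end) pairs in the same units; drop_excluded is a hypothetical helper, and how windows get recorded from Grafana is not shown.

```python
def drop_excluded(samples, exclusion_windows):
    """Remove samples whose timestamp falls inside any excluded window."""
    def excluded(ts):
        return any(start <= ts <= end for start, end in exclusion_windows)
    return [(ts, value) for ts, value in samples if not excluded(ts)]

# Hypothetical hourly samples (epoch seconds) and one window marked as a false positive.
samples = [(0, 0.2), (3600, 0.9), (7200, 0.95), (10800, 0.3)]
windows = [(3600, 7200)]

print(drop_excluded(samples, windows))
```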

None of this replaces good judgment — a Grafana AI anomaly detection setup is still a tool, not an oracle. But wired correctly, with the checklist above, it catches the slow-burn failures that static thresholds are structurally blind to, and it does so without turning your on-call rotation into a noise-cancellation exercise.

source & further reading

dev.to: original article