How I Built an Autonomous Incident Investigation Agent That Reduced MTTR by 65%

A developer built an autonomous incident investigation agent called FRIDAY that reduced mean time to resolution by 65% on a platform serving 30+ million end users. The agent, triggered by PagerDuty alerts, first checks GitHub for recent changes before analyzing observability data, completing investigations in under 2 minutes. The system uses AWS Lambda with asynchronous invocation to handle the 60-180 second investigation time within API Gateway's 30-second timeout.

It's 2:47 AM. Your phone buzzes and it's a P1 alert. You open your laptop, bleary-eyed, and begin the familiar ritual: This process takes 15–45 minutes for an experienced engineer. For a junior on-call? Sometimes hours. And the cognitive overhead of context-switching between 3-4 tools while sleep-deprived leads to missed signals, false conclusions, and longer outages. I asked myself: So I built one. It's been running in production for months, investigating real incidents on a platform serving 30+ million end users across multiple AWS regions. We call it FRIDAY . When a PagerDuty alert fires, FRIDAY: The entire investigation takes under 2 minutes . The on-call engineer wakes up to a complete analysis instead of a raw alert. ┌──────────────┐ ┌────────────────┐ ┌─────────────────────┐ │ PagerDuty │────▶│ API Gateway │────▶│ Lambda Sync │ │ Webhook │ │ Validate │ │ Parse + Self-Invoke│ └──────────────┘ └────────────────┘ └─────────┬───────────┘ │ Async ▼ ┌─────────────────────┐ │ Lambda Async │ │ Investigation Agent │ │ │ │ ┌────────────────┐ │ │ │ Amazon Bedrock │ │ │ │ Claude Opus │ │ │ │ Tool-Use Loop │ │ │ └───────┬────────┘ │ │ │ │ │ ┌─────┼─────┐ │ │ ▼ ▼ ▼ │ │ GitHub Datadog S3 │ └─────────┬───────────┘ │ ▼ ┌─────────────────────┐ │ Microsoft Teams │ │ Adaptive Card │ └─────────────────────┘ API Gateway has a 30-second hard timeout . A thorough AI investigation takes 60–180 seconds. The solution: the sync Lambda validates the webhook, parses the alert, and immediately self-invokes asynchronously returning 200 OK to PagerDuty within 2 seconds. Sync handler: validate, parse, self-invoke, return immediately lambda client.invoke FunctionName=context.function name, InvocationType="Event", Fire and forget Payload=json.dumps { " async investigate": True, "alert payload": alert payload, } , return {"statusCode": 200, "body": "Investigation started"} The async Lambda runs the full investigation without timeout pressure. This is counterintuitive. Most engineers and most AI systems jump straight to observability data when an alert fires. But in my experience, 80%+ of acute incidents are caused by a preceding change : a deployment, a config update, a replica count change, a memory limit modification. FRIDAY is instructed to check GitHub before touching Datadog: MANDATORY FIRST STEP — GitHub Step 0 : Before touching Datadog, you MUST run these calls in parallel: 1. github search repos — find the repo for the alerted service 2. github list commits — find commits in the 2 hours before the alert fired A deployment or config change is the most likely root cause. Our platform spans multiple AWS regions. A naive agent querying "all 5xx errors" would mix signals from healthy and unhealthy regions, producing confused analysis. FRIDAY's first action is always to lock a target region from the alert metadata: 🌍 Region: Description — resolved from alert hostname Every subsequent Datadog query includes: kube cluster name:region-az- scoped to affected region only This eliminated an entire class of false-positive findings where the AI would cite errors from an unrelated region. FRIDAY's output isn't freeform text. It follows a strict section contract that the Teams integration parses into visual containers: EXECUTIVE SUMMARY 2-3 sentences — what happened, who's affected, what changed KEY FINDINGS Bulleted evidence from GitHub + Datadog WHAT CHANGED Specific commit/PR with timestamp and author ERROR BREAKDOWN Service-by-service error counts with affected tenants ROOT CAUSE Confirmed / Suspected / Unknown — with evidence chain CUSTOMER IMPACT Affected tenants, operations, scope RECOMMENDED ACTIONS Specific next steps for the on-call engineer The on-call engineer can glance at the Teams card and immediately know: what happened, who's affected, what likely caused it, and what to do next without reading a wall of text. FRIDAY uses Claude's tool-use capability in a multi-round loop. The AI doesn't execute a fixed script — it reasons about each alert independently, deciding which tools to call based on what it's learned so far. for round num in range MAX TOOL ROUNDS : Max 25 rounds response = bedrock client.converse modelId="anthropic.claude-opus", messages=messages, toolConfig={"tools": TOOL DEFINITIONS}, if stop reason == "tool use": Execute tools, append results, continue reasoning for tool call in content blocks: result = execute tool tool call "name" , tool call "input" tool results.append result messages.append tool results elif stop reason == "end turn": AI has concluded — extract findings return extract final report content blocks | Tool | Purpose | |---|---| github search repos | Find which repo owns a service | github list commits | What changed before the alert | github get file | Read actual deployment configs | github search code | Find all producers/consumers of a queue | datadog log search | Find specific error messages | datadog log aggregate | Count errors by backend/tenant/path | datadog query metrics | Queue depth, CPU, memory, latency | datadog get monitor | Understand what threshold triggered | The AI typically uses 8–15 tool calls per investigation , batching parallel calls when possible to minimize round-trip time. A cold investigation — where the AI knows nothing about your infrastructure — is slow and imprecise. FRIDAY includes a deterministic training mode that pre-builds architectural knowledge: python def train : """ Deterministic training: ~13 targeted API calls, then one Bedrock synthesis call. Collects: cluster-service maps, HAProxy backends, chronic error baselines, recent planned work. """ Phase 1: Targeted data collection no AI — pure API calls collected = {} for key, tool name, tool input in TRAINING CALLS: collected key = execute tool tool name, tool input Phase 2: Single AI synthesis call knowledge doc = synthesize knowledge collected Phase 3: Save to S3 — injected into system prompt save to s3 knowledge doc The knowledge document contains: One of the hardest problems: distinguishing planned maintenance from real outages. During a Kubernetes cluster migration, you expect 5xx errors as traffic drains. FRIDAY handles this through: "This alert coincides with planned cluster decommission. Errors are expected during traffic drain. No incident action required." What happens when an investigation is complex and approaching the 25-round tool limit? FRIDAY has a graceful degradation mechanism: if rounds remaining <= 3: user content.append { "text": "STOP CALLING TOOLS. Write your FINAL report " "NOW using all data collected so far. Mark " "uncertain findings as 'Suspected' rather " "than skipping them." } This ensures every investigation produces a report — even if incomplete — rather than timing out silently. PagerDuty retries webhooks. FRIDAY handles this at two levels: After running in production for several months: | Metric | Before FRIDAY | After FRIDAY | Improvement | |---|---|---|---| | Mean Time to First Analysis | 15–45 min | 90 sec–3 min | ~90% faster | | MTTR overall | ~60 min | ~15 min | 65% reduction | | AI tool adoption team | 20% | 85% | 4x increase | | Alert noise false escalations | High | Minimal | ~80% reduction | | Auto-generated postmortems | 0% | 100% of P1/P2 | Eliminated manual RCA drafts | The system prompt is the most important file in the codebase. It's not instructions — it's the agent's operating manual . Ours is ~5,000 words covering: Invest in your prompt like you invest in your architecture docs. Before this rule, the AI would spend 10+ rounds querying Datadog, building elaborate theories about traffic patterns — then discover a config change was merged 5 minutes before the alert. Now it finds the root cause in rounds 1-2 for ~80% of change-induced incidents. FRIDAY is explicitly told it does NOT take remediation actions . It investigates, analyzes, and reports. A human validates and acts. This is not a limitation — it's a design choice that builds trust . When on-call engineers trust the AI's analysis, they act on it faster. The two-Lambda pattern sync for webhook receipt, async for investigation is essential. Don't let API Gateway timeouts dictate your AI agent's investigation depth. We're extending this pattern to autonomous security remediation — an agent that ingests vulnerability findings, generates IaC fixes, deploys through GitOps, verifies no impact, and requests human approval before proceeding. Same tool-use architecture, different domain. The pattern is reproducible with: The hard part isn't the code — it's the system prompt . That's where your SRE expertise lives. The AI is the execution engine; your knowledge of your infrastructure is what makes it useful. The name also works as a backronym: F irst R esponder for I ncident D iagnostics and A nal Y sis — but honestly, we just thought the Marvel reference was cooler. I'm Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead the Platform team for a global IAM/SSO platform serving 30M+ users. Currently exploring how agentic AI transforms SRE from reactive firefighting to autonomous, closed-loop operations. This is Part 1 of my AI-Native SRE series. Part 2 will cover JARVIS — an autonomous vulnerability remediation agent that fixes security findings through GitOps with human approval gates.