{"slug": "how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65", "title": "How I Built an Autonomous Incident Investigation Agent That Reduced MTTR by 65%", "summary": "A developer built an autonomous incident investigation agent called FRIDAY that reduced mean time to resolution by 65% on a platform serving 30+ million end users. The agent, triggered by PagerDuty alerts, first checks GitHub for recent changes before analyzing observability data, completing investigations in under 2 minutes. The system uses AWS Lambda with asynchronous invocation to handle the 60-180 second investigation time within API Gateway's 30-second timeout.", "body_md": "It's 2:47 AM. Your phone buzzes and it's a P1 alert. You open your laptop, bleary-eyed, and begin the familiar ritual:\n\nThis process takes **15–45 minutes** for an experienced engineer. For a junior on-call? Sometimes hours. And the cognitive overhead of context-switching between 3-4 tools while sleep-deprived leads to missed signals, false conclusions, and longer outages.\n\nI asked myself:\n\nSo I built one. It's been running in production for months, investigating real incidents on a platform serving **30+ million end users** across multiple AWS regions. We call it **FRIDAY**.\n\nWhen a PagerDuty alert fires, FRIDAY:\n\nThe entire investigation takes **under 2 minutes**. 
The on-call engineer wakes up to a complete analysis instead of a raw alert.\n\n```\n┌──────────────┐     ┌────────────────┐     ┌─────────────────────┐\n│  PagerDuty   │────▶│  API Gateway   │────▶│  Lambda (Sync)      │\n│  Webhook     │     │  (Validate)    │     │  Parse + Self-Invoke│\n└──────────────┘     └────────────────┘     └─────────┬───────────┘\n                                                       │ Async\n                                                       ▼\n                                            ┌─────────────────────┐\n                                            │  Lambda (Async)      │\n                                            │  Investigation Agent │\n                                            │                      │\n                                            │  ┌────────────────┐ │\n                                            │  │ Amazon Bedrock  │ │\n                                            │  │ Claude Opus     │ │\n                                            │  │ (Tool-Use Loop) │ │\n                                            │  └───────┬────────┘ │\n                                            │          │          │\n                                            │    ┌─────┼─────┐   │\n                                            │    ▼     ▼     ▼   │\n                                            │ GitHub Datadog  S3  │\n                                            └─────────┬───────────┘\n                                                      │\n                                                      ▼\n                                            ┌─────────────────────┐\n                                            │  Microsoft Teams    │\n                                            │  (Adaptive Card)    │\n                                            └─────────────────────┘\n```\n\nAPI Gateway has a **30-second hard timeout**. A thorough AI investigation takes 60–180 seconds. 
The solution: the sync Lambda validates the webhook, parses the alert, and immediately self-invokes asynchronously, returning `200 OK` to PagerDuty within 2 seconds.\n\n```\n# Sync handler: validate, parse, self-invoke, return immediately\nlambda_client.invoke(\n    FunctionName=context.function_name,\n    InvocationType=\"Event\",  # Fire and forget\n    Payload=json.dumps({\n        \"_async_investigate\": True,\n        \"alert_payload\": alert_payload,\n    }),\n)\nreturn {\"statusCode\": 200, \"body\": \"Investigation started\"}\n```\n\nThe async Lambda runs the full investigation without timeout pressure.\n\nThis is counterintuitive. Most engineers and most AI systems jump straight to observability data when an alert fires. But in my experience, **80%+ of acute incidents are caused by a preceding change**: a deployment, a config update, a replica count change, a memory limit modification.\n\nFRIDAY is instructed to check GitHub *before* touching Datadog:\n\n```\nMANDATORY FIRST STEP — GitHub (Step 0):\nBefore touching Datadog, you MUST run these calls in parallel:\n1. github_search_repos — find the repo for the alerted service\n2. github_list_commits — find commits in the 2 hours before \n   the alert fired\n\nA deployment or config change is the most likely root cause.\n```\n\nOur platform spans multiple AWS regions. A naive agent querying \"all 5xx errors\" would mix signals from healthy and unhealthy regions, producing confused analysis.\n\nFRIDAY's first action is always to **lock a target region** from the alert metadata:\n\n```\n🌍 Region: Description — resolved from alert hostname\n\nEvery subsequent Datadog query includes:\nkube_cluster_name:region-az-* (scoped to affected region only)\n```\n\nThis eliminated an entire class of false-positive findings where the AI would cite errors from an unrelated region.\n\nFRIDAY's output isn't freeform text. 
It follows a strict section contract that the Teams integration parses into visual containers:\n\n```\n## EXECUTIVE SUMMARY\n[2-3 sentences — what happened, who's affected, what changed]\n\n## KEY FINDINGS\n[Bulleted evidence from GitHub + Datadog]\n\n## WHAT CHANGED\n[Specific commit/PR with timestamp and author]\n\n## ERROR BREAKDOWN\n[Service-by-service error counts with affected tenants]\n\n## ROOT CAUSE\n[Confirmed / Suspected / Unknown — with evidence chain]\n\n## CUSTOMER IMPACT\n[Affected tenants, operations, scope]\n\n## RECOMMENDED ACTIONS\n[Specific next steps for the on-call engineer]\n```\n\nThe on-call engineer can glance at the Teams card and immediately know: *what happened, who's affected, what likely caused it, and what to do next* without reading a wall of text.\n\nFRIDAY uses Claude's tool-use capability in a multi-round loop. The AI doesn't execute a fixed script — it **reasons** about each alert independently, deciding which tools to call based on what it's learned so far.\n\n```\nfor round_num in range(MAX_TOOL_ROUNDS):  # Max 25 rounds\n    response = bedrock_client.converse(\n        modelId=\"anthropic.claude-opus\",\n        messages=messages,\n        toolConfig={\"tools\": TOOL_DEFINITIONS},\n    )\n    stop_reason = response[\"stopReason\"]\n    assistant_message = response[\"output\"][\"message\"]\n    content_blocks = assistant_message[\"content\"]\n    # Keep the assistant turn in the history so tool results can follow it\n    messages.append(assistant_message)\n\n    if stop_reason == \"tool_use\":\n        # Execute tools, append results, continue reasoning\n        tool_results = []\n        for block in content_blocks:\n            if \"toolUse\" not in block:\n                continue  # Skip plain-text blocks\n            tool_call = block[\"toolUse\"]\n            # Assumes execute_tool returns a JSON-serializable dict\n            result = execute_tool(\n                tool_call[\"name\"],\n                tool_call[\"input\"]\n            )\n            tool_results.append({\"toolResult\": {\n                \"toolUseId\": tool_call[\"toolUseId\"],\n                \"content\": [{\"json\": result}],\n            }})\n        messages.append({\"role\": \"user\", \"content\": tool_results})\n\n    elif stop_reason == \"end_turn\":\n        # AI has concluded — extract findings\n        return extract_final_report(content_blocks)\n```\n\n| Tool | Purpose |\n|---|---|\n| `github_search_repos` | Find which repo owns a service |\n| `github_list_commits` | What changed before the alert |\n| `github_get_file` | Read actual deployment configs |\n| `github_search_code` | Find all 
producers/consumers of a queue |\n| `datadog_log_search` | Find specific error messages |\n| `datadog_log_aggregate` | Count errors by backend/tenant/path |\n| `datadog_query_metrics` | Queue depth, CPU, memory, latency |\n| `datadog_get_monitor` | Understand what threshold triggered |\n\nThe AI typically uses **8–15 tool calls per investigation**, batching parallel calls when possible to minimize round-trip time.\n\nA cold investigation — where the AI knows nothing about your infrastructure — is slow and imprecise. FRIDAY includes a **deterministic training mode** that pre-builds architectural knowledge:\n\n``` python\ndef train():\n    \"\"\"\n    Deterministic training:\n    ~13 targeted API calls, then one Bedrock synthesis call.\n\n    Collects: cluster-service maps, HAProxy backends, \n    chronic error baselines, recent planned work.\n    \"\"\"\n    # Phase 1: Targeted data collection (no AI — pure API calls)\n    collected = {}\n    for key, tool_name, tool_input in TRAINING_CALLS:\n        collected[key] = execute_tool(tool_name, tool_input)\n\n    # Phase 2: Single AI synthesis call\n    knowledge_doc = synthesize_knowledge(collected)\n\n    # Phase 3: Save to S3 — injected into system prompt\n    save_to_s3(knowledge_doc)\n```\n\nThe knowledge document contains:\n\nOne of the hardest problems: distinguishing planned maintenance from real outages. During a Kubernetes cluster migration, you *expect* 5xx errors as traffic drains. FRIDAY handles this through:\n\n\"This alert coincides with planned cluster decommission. Errors are expected during traffic drain. No incident action required.\"\n\nWhat happens when an investigation is complex and approaching the 25-round tool limit? FRIDAY has a **graceful degradation** mechanism:\n\n```\nif rounds_remaining <= 3:\n    user_content.append({\n        \"text\": (\n            \"STOP CALLING TOOLS. Write your FINAL report \"\n            \"NOW using all data collected so far. 
Mark \"\n            \"uncertain findings as 'Suspected' rather \"\n            \"than skipping them.\"\n        )\n    })\n```\n\nThis ensures every investigation produces a report — even if incomplete — rather than timing out silently.\n\nPagerDuty retries webhooks. FRIDAY handles this at two levels:\n\nAfter running in production for several months:\n\n| Metric | Before FRIDAY | After FRIDAY | Improvement |\n|---|---|---|---|\n| Mean Time to First Analysis | 15–45 min | 90 sec–3 min | ~90% faster |\n| MTTR (overall) | ~60 min | ~15 min | 65% reduction |\n| AI tool adoption (team) | 20% | 85% | 4x increase |\n| Alert noise (false escalations) | High | Minimal | ~80% reduction |\n| Auto-generated postmortems | 0% | 100% of P1/P2 | Eliminated manual RCA drafts |\n\nThe system prompt is the most important file in the codebase. It's not instructions — it's the agent's **operating manual**. Ours is ~5,000 words covering:\n\n**Invest in your prompt like you invest in your architecture docs.**\n\nBefore this rule, the AI would spend 10+ rounds querying Datadog, building elaborate theories about traffic patterns — then discover a config change was merged 5 minutes before the alert. Now it finds the root cause in rounds 1-2 for ~80% of change-induced incidents.\n\nFRIDAY is explicitly told it **does NOT take remediation actions**. It investigates, analyzes, and reports. A human validates and acts. This is not a limitation — it's a **design choice that builds trust**. When on-call engineers trust the AI's analysis, they act on it faster.\n\nThe two-Lambda pattern (sync for webhook receipt, async for investigation) is essential. Don't let API Gateway timeouts dictate your AI agent's investigation depth.\n\nWe're extending this pattern to **autonomous security remediation** — an agent that ingests vulnerability findings, generates IaC fixes, deploys through GitOps, verifies no impact, and requests human approval before proceeding. 
Same tool-use architecture, different domain.\n\nThe pattern is reproducible with:\n\nThe hard part isn't the code — it's the **system prompt**. That's where your SRE expertise lives. The AI is the execution engine; your knowledge of your infrastructure is what makes it useful.\n\nThe name also works as a backronym: **F**irst **R**esponder for **I**ncident **D**iagnostics and **A**nal**Y**sis, but honestly, we just thought the Marvel reference was cooler.\n\n*I'm Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead the Platform team for a global IAM/SSO platform serving 30M+ users. Currently exploring how agentic AI transforms SRE from reactive firefighting to autonomous, closed-loop operations.*\n\n*This is Part 1 of my AI-Native SRE series. Part 2 will cover JARVIS — an autonomous vulnerability remediation agent that fixes security findings through GitOps with human approval gates.*", "url": "https://wpnews.pro/news/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65", "canonical_source": "https://dev.to/velumal09/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65-42ae", "published_at": "2026-06-18 04:22:12+00:00", "updated_at": "2026-06-18 04:51:33.406306+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "developer-tools", "artificial-intelligence", "mlops"], "entities": ["FRIDAY", "PagerDuty", "AWS Lambda", "API Gateway", "Amazon Bedrock", "Claude Opus", "GitHub", "Datadog"], "alternates": {"html": "https://wpnews.pro/news/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65", "markdown": "https://wpnews.pro/news/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65.md", "text": "https://wpnews.pro/news/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65.txt", "jsonld": 
"https://wpnews.pro/news/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65.jsonld"}}