OpenDevOps – An open-source AI agent that investigates AWS/Azure incidents

OpenDevOps, an open-source multi-cloud DevOps agent for AWS and Azure, was released, enabling incident investigation with any LLM via LiteLLM. In benchmarks, it found root causes for 9 out of 10 incidents in about 52 seconds at $0.03 per investigation, making it roughly 10 times cheaper than AWS's DevOps Agent and 1,000 times cheaper than manual triage. The tool supports air-gapped environments via Ollama and can reuse existing Claude Code subscriptions.
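The cost ratios quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses the approximate per-investigation figures from the project's own benchmark (about $0.03 for OpenDevOps, about $0.43 for the AWS DevOps Agent, and about $50 of engineer time for manual triage); the variable names are illustrative and the inputs are the project's estimates, not independently measured values.

```python
# Approximate per-investigation figures as quoted by the project (USD).
agent_cost = 0.03       # OpenDevOps on a commodity open model
aws_agent_cost = 0.43   # AWS DevOps Agent per-second rate over the same wall-clock time
manual_cost = 50.0      # illustrative cost of ~20-40 min of on-call triage

# How many times cheaper the agent is under these assumptions.
vs_aws = aws_agent_cost / agent_cost
vs_manual = manual_cost / agent_cost

print(f"vs. AWS DevOps Agent: ~{vs_aws:.0f}x cheaper")
print(f"vs. manual triage:    ~{vs_manual:.0f}x cheaper")
```

With these inputs the ratios come out near 14x and 1,667x, so the headline "10 times" and "1,000 times" figures read as order-of-magnitude roundings on the conservative side.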

Open-source multi-cloud DevOps agent (AWS + Azure). Bring any LLM via LiteLLM (OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama for air-gapped / regulated environments), or reuse your existing Claude Code subscription (auto-detected). Investigates incidents, finds root causes, and gives actionable mitigation plans, without the cloud-vendor DevOps-agent price tag.

On a reproducible 10-incident suite (real AWS + Azure resources, scored against ground truth), running on a commodity open model (gpt-oss-120b, no frontier model required):

| Root causes found | Median time | Cost / investigation | vs. AWS DevOps Agent | vs. manual triage |
|---|---|---|---|---|
| 9 / 10 (90%) | ~52 s | ~$0.03 | ~10× cheaper¹ (~$0.03 vs ~$0.43) | ~1,000× cheaper² |

~$0.03 of compute replaces ~$50 of engineer toil, and costs a fraction of a managed cloud DevOps agent, while returning the answer in under a minute instead of half an hour. Reproduce it with `make eval`; see the full benchmark & methodology.

¹ vs. AWS DevOps Agent: its per-second rate applied to the same wall-clock time (~$0.43/investigation); verify against AWS's published pricing.

² Illustrative unit economics vs. ~20–40 min of on-call triage. Cost shown is the provider-dashboard actual; see caveats.

Cloud setup: [AWS IAM](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/iam_setup.md) · [Azure service principal / login](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/azure_setup.md)

Autonomous incident detection: a crashing Lambda is caught automatically, the agent reads the traceback from CloudWatch Logs, finds the root cause, surfaces it on the Monitoring dashboard, and posts the mitigation to Slack. No human in the loop.

[Amazon Q Developer](https://aws.amazon.com/q/developer/) and the [AWS DevOps Agent](https://aws.amazon.com/devops-agent/) are excellent if you live entirely inside the AWS Console with Bedrock-managed models. OpenDevOps is the open-source alternative for everyone else:

- Any LLM, not just Bedrock. LiteLLM-compatible: OpenAI, Anthropic direct, OpenRouter, Groq, Gemini, Mistral, or run Ollama locally for air-gapped / regulated environments. Auto-detects your existing Claude Code subscription so you pay zero incremental LLM cost if you're already on a Max/Pro plan.
- Multi-cloud out of the box. AWS + Azure investigations in the same chat (one organization can connect both clouds at once). AWS-only agents stop at the AWS perimeter.
- Your data stays in your database. Investigations, prompts, and tool outputs persist in your Postgres or SQLite: your VPC, your retention, your encryption. Matters for HIPAA, PCI, FedRAMP, and EU AI Act audits.
- Fully auditable. Every prompt, tool call (args + result), and token is open and streamed live to the UI; nothing is hidden. The AWS agent is a closed black box.
- Customizable. Add tools as plain Python functions, add runbooks by dropping a `SKILL.md` file, modify the system prompt. Fork it if you need to.
- Investigate from anywhere. Built-in MCP server makes it usable from Claude Desktop, Cursor, or any MCP client, not just the AWS Console.

| | OpenDevOps | AWS DevOps Agent / Q Developer |
|---|---|---|
| LLM | Any (LiteLLM, Claude Code, Ollama) | Bedrock-managed only |
| Cloud coverage | AWS + Azure (more coming) | AWS only |
| Data location | Your DB / VPC | AWS-managed, not portable |
| Customization | Open source: modify anything | Closed product |
| Pricing | LLM at retail (or $0 via Ollama / Claude Code) | Per-investigation + Bedrock markup |
| Self-host | Docker / Railway / on-prem / air-gapped | No |

When AWS is the better pick: if you're 100% AWS, never plan to leave, and want zero infrastructure to run, Amazon Q Developer's native Console integration and AWS-only signals (Trusted Advisor, AWS Config, Compute Optimizer) are hard to beat. OpenDevOps is for everyone else.

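The "add tools as plain Python functions" point above relies on a tool's schema being derivable from the function itself. The sketch below shows one way such inference can work using only the standard library; it is not OpenDevOps's or LangChain's actual code, and `get_lambda_error_count` and `infer_schema` are hypothetical names made up for illustration.

```python
import inspect
from typing import Any, get_type_hints


def get_lambda_error_count(function_name: str, minutes: int = 15) -> dict:
    """Hypothetical read-only tool: report recent errors for a Lambda function."""
    # A real tool would query CloudWatch here; this stub only returns a fixed shape.
    return {"function_name": function_name, "window_minutes": minutes, "errors": 0}


# Minimal mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number",
               bool: "boolean", dict: "object", list: "array"}


def infer_schema(func) -> dict[str, Any]:
    """Build a tool description from a function's signature, hints, and docstring."""
    hints = get_type_hints(func)
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the argument is mandatory
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


schema = infer_schema(get_lambda_error_count)
print(schema["parameters"])
```

The type hints and docstring carry everything the LLM needs to call the tool, which is why a plain function with good annotations can be enough under this kind of design.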
- LangChain DeepAgents as the agent framework: planning, tool orchestration, and session memory out of the box
- 21 read-only AWS tools across CloudWatch (6), CloudTrail (2), ECS (4), Lambda (4), EC2 (2), RDS (2), IAM (1), plus bash escape hatch, cross-session history analytics, skills, and `submit_investigation`: plain Python functions, schemas inferred automatically
- Azure support (CLI-first): investigates Azure through the read-only `az` CLI (+ `kubectl` for AKS) and a set of Azure runbook skills (AKS debugging, App Service errors, Azure Monitor/KQL, VM diagnostics), so no separate SDK tools are needed. Read-only; connect via a service principal or `az login`; see [apps/documentation/azure_setup.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/azure_setup.md)
- Sandboxed bash execution tool: the agent can run whitelisted read-only AWS CLI (`aws`), Azure CLI (`az`), `kubectl`, and `docker` commands as a last resort when the structured tools fall short; every command is validated against an allowlist before execution; never uses `shell=True`; hard 30-second timeout
- CloudWatch Logs Insights query (`query_logs_insights`): full query language support (`fields`, `filter`, `stats`, `sort`, `limit`); results include scanned MB
- Streaming responses: FastAPI SSE endpoint streams agent tokens in real time as the LLM reasons; tool calls appear as they complete
- Event-driven incident detection: EventBridge → SQS → long-poll consumer; 9 EventBridge rules cover CloudWatch alarms, ECS task failures, Lambda async errors, RDS events, EC2 state changes, CodePipeline failures, and AWS Health events; uses a DLQ plus database-backed incident claims to avoid duplicate investigations; runs alongside the metric poller; see [apps/documentation/event_detection.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/event_detection.md)
- Context enrichment: before the LLM runs, deterministic boto3 calls fetch facts about the affected resource (alarm details, recent logs, function config, etc.) to reduce tool call count and speed up investigations
- Monitoring dashboard: live incident feed showing all event-driven investigations: confidence level (or FAILED badge), affected service, root cause summary; each alert links back to its original investigation session via "View investigation" so you can follow up without losing context; real-time SSE push keeps the page live without polling; see [apps/documentation/monitoring.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/monitoring.md)
- AWS Configuration settings tab: admin-only editable tab in Settings for SQS Queue URL and AWS Region; shared org-wide via database-backed app config; includes an inline IAM permission checker per service
- Web UI: React + Vite SPA served by FastAPI:
  - Chat page: streaming responses, collapsible tool call inspector, cost/latency card, stop button; supports `?prompt=` deeplink for pre-seeded investigations from the Monitoring dashboard
  - Session history sidebar: lists all past conversations; click any to resume with full tool call inspector and cost card restored; new-chat and (soft) delete buttons
  - Monitoring page: live incident feed from event-driven detection; alert detail with investigate deeplink
  - Dashboard: session counts, tool call stats, cost/latency, context saved, activity chart, service breakdown, root cause distribution, recent sessions
  - History page: keyword search across all past sessions
  - Settings page: AWS Configuration (editable, admin-only), Environment (read-only env vars), Agent config, Integrations
  - Team page: admin-only user management: add, remove, and change roles
- Auth & RBAC: optional password-based auth with admin and user roles; JWT tokens; first registered user auto-becomes admin; disabled by default (set `JWT_SECRET` to enable); see [apps/documentation/auth.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/auth.md)
- Three storage backends: pick one via `CHECKPOINT_BACKEND` in `.env`; see apps/documentation/databases.md
  - `memory`: zero config, no persistence; great for CI and quick testing; autonomous polling/event monitoring is disabled in this mode
  - `sqlite`: local file, no external services; recommended for single-server and personal use
  - `postgres`: full production persistence via psycopg3 + AsyncPostgresSaver
  - Schema: `users`, `sessions`, `messages`, `tool_calls`, `usage_events`; see apps/documentation/schema.md
  - Soft delete: deleted sessions are hidden immediately but data is preserved for the 30-day cleanup job
- Structured logging via Loguru: used consistently across all modules (tools, agent, API, CLI); every request shows agent reasoning, tool calls with args/results, and a done summary with latency + token counts
- CLI: `devops-agent investigate`, `ask`, and `report` commands powered by the same agent
- Any LLM via LiteLLM: OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama (local / air-gapped), or any OpenAI-compatible endpoint. Auto-detects a local Claude Code subscription (`~/.claude` OAuth) so a Max/Pro plan can power the agent at zero incremental cost. Swap models via a single env var (`LLM_MODEL`), no code changes

Install dependencies, create your `.env`, and configure a read-only AWS profile:

    cd apps/backend && uv sync
    cp .env.example .env
    # Edit .env: add your OPENROUTER_API_KEY and set AWS_PROFILE

    aws configure --profile devops-agent-readonly
    # AWS Access Key ID:     <your key id>
    # AWS Secret Access Key: <your secret key>
    # Default region:        us-east-1
    # Default output format: json

    # Verify
    aws sts get-caller-identity --profile devops-agent-readonly

Three options: pick one and add it to `.env`. Full details in [apps/documentation/databases.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/databases.md).

Memory (default): zero config, nothing persists on restart.

    CHECKPOINT_BACKEND=memory

SQLite (recommended for local dev): persists to a file, no external service needed.

    CHECKPOINT_BACKEND=sqlite
    SQLITE_PATH=./data/agent.db   # created automatically on first start

PostgreSQL (recommended for production).

    # Start Postgres with Docker
    docker run -d --name opendevops-pg \
      -e POSTGRES_DB=opendevops \
      -e POSTGRES_USER=dev \
      -e POSTGRES_PASSWORD=dev \
      -p 5433:5432 \
      postgres:16

    # Add to .env
    CHECKPOINT_BACKEND=postgres
    DATABASE_URL=postgresql://dev:dev@localhost:5433/opendevops

    # Create app tables (safe to re-run)
    cd apps/backend && uv run migrate

Option A: Docker Compose (recommended, AWS CLI included).

    docker compose -f deployment/docker-compose/docker-compose.yml up --build
    # Backend:         http://localhost:8000
    # Frontend:        http://localhost:80
    # Postgres (host): localhost:5433

The backend image installs AWS CLI v2 automatically, so the bash execution tool works out of the box. Host AWS credentials (`~/.aws`) are mounted read-only into the container. For production on AWS, remove the volume mount and attach an IAM role to the instance/task instead.

Option B: Local dev (two terminals).

    # Terminal 1: FastAPI backend with hot reload
    cd apps/backend && uv run dev

    # Terminal 2: React frontend (Vite dev server with HMR)
    cd apps/frontend && npm run dev
    # Open http://localhost:5173

Note: local dev requires the `aws` CLI installed on your machine for the bash tool to work. Install it from https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

CLI usage:

    cd apps/backend

    # Investigate an incident
    uv run devops-agent investigate "high error rate on my payment Lambda"

    # With alarm and service hints
    uv run devops-agent investigate "latency spike" --alarm HighLatencyAlarm --service api-service

    # Freeform Q&A
    uv run devops-agent ask "why would a Lambda function suddenly start throttling?"

    # Daily ops health report
    uv run devops-agent report

The agent needs read access across your AWS account, plus optional write access scoped to `opendevops-`-prefixed resources if you use the event-driven monitoring setup wizard. Two least-privilege policies (Operational + Setup) and full step-by-step instructions are in apps/documentation/iam_setup.md.

Project layout:

    apps/
    ├── core/                          # Installable package (opendevops-core): the shared agent brain
    │   └── src/opendevops_core/
    │       ├── agent/                 # DeepAgents setup, prompts, LLM wiring, DB layer (backends + ABC)
    │       ├── tools/                 # bash, history, skills, final-answer + response cap/cache
    │       ├── providers/             # AWS provider: tools, context, poller, event consumer
    │       ├── models/                # Pydantic models: agent, chat, sessions, users
    │       ├── skills/                # Markdown runbooks (lambda-throttling + add your own)
    │       ├── integrations/          # slack_webhook.py, telegram.py
    │       ├── migrations/            # Numbered baseline SQL migrations (001–013), bundled with the wheel
    │       └── config.py              # CoreSettings + get_settings()/configure() injection hook
    ├── backend/                       # OSS web app + CLI; depends on opendevops-core via uv path source
    │   ├── src/
    │   │   ├── api/
    │   │   │   ├── app.py             # FastAPI app factory: mounts routers, serves frontend
    │   │   │   ├── auth.py            # JWT helpers + FastAPI auth dependencies
    │   │   │   └── routers/           # chat, sessions, users, settings, history, dashboard, monitoring
    │   │   ├── cli/                   # Typer CLI commands
    │   │   ├── config/
    │   │   │   └── appsettings.py     # Settings(CoreSettings): adds web/auth-only fields, calls configure()
    │   │   └── mcp_server.py          # MCP server (stdio / HTTP+SSE)
    │   ├── migrations/                # OSS-app-only migrations (currently none; all schema is core baseline)
    │   ├── tests/
    │   └── pyproject.toml
    ├── frontend/
    │   └── src/
    │       ├── pages/                 # ChatPage, DashboardPage, HistoryPage, SettingsPage, UsersPage, LoginPage
    │       └── components/            # Sidebar, Header, ProtectedRoute, AgentMessage, ...
    └── documentation/                 # Feature reference: auth, schema, skills, databases, UI, ...

    deployment/
    ├── docker-compose/                # docker-compose.yml (PostgreSQL + backend + frontend)
    └── railway/                       # Dockerfile.railway + railway.toml (combined single-image deploy)

| Variable | Default | Description |
|---|---|---|
| `LLM_MODEL` | `openrouter/openai/gpt-4o` | LiteLLM model string in `provider/model` format |
| `LLM_API_BASE` | none | Custom base URL for OpenAI-compatible endpoints (e.g. Ollama, vLLM) |
| `LLM_API_KEY` | none | API key for custom endpoints; standard provider keys (e.g. `ANTHROPIC_API_KEY`) are read automatically |
| `OPENROUTER_API_KEY` | none | Required when using any `openrouter/` model |
| `CHECKPOINT_BACKEND` | `memory` | Storage backend: `memory` · `sqlite` · `postgres` |
| `SQLITE_PATH` | `./data/agent.db` | SQLite file path; only used when `CHECKPOINT_BACKEND=sqlite` |
| `DATABASE_URL` | none | PostgreSQL connection string; only used when `CHECKPOINT_BACKEND=postgres` |
| `AWS_REGION` | `us-east-1` | AWS region |
| `AWS_PROFILE` | none | AWS named profile (e.g. `devops-agent-readonly`) |
| `MAX_TOOL_CALLS` | `20` | Hard cap on tool calls per investigation |
| `INVESTIGATION_TIMEOUT` | `120` | Timeout in seconds |
| `TOOL_RESPONSE_MAX_CHARS` | `40000` | Truncate tool responses larger than this before feeding to the LLM; `0` disables |
| `SLACK_WEBHOOK_URL` | none | Slack incoming webhook URL; leave unset to disable notifications |
| `TELEGRAM_BOT_TOKEN` | none | Telegram bot token from @BotFather; leave unset to disable |
| `TELEGRAM_CHAT_ID` | none | Target chat/group/channel ID (negative number for groups) |
| `POLL_INTERVAL_SECONDS` | `0` | Proactive polling interval in seconds; `0` disables the poller |
| `POLL_ERROR_THRESHOLD` | `5.0` | Lambda error rate (%) that triggers an automatic investigation |
| `POLL_REINVESTIGATE_HOURS` | `1` | Cooldown period: skip re-investigating the same alarm within N hours |
| `SUMMARIZATION_ENABLED` | `true` | Auto-compact sessions when they exceed the threshold |
| `SUMMARIZATION_THRESHOLD_CHARS` | `60000` | Total session chars before compaction fires (~15K tokens) |
| `SUMMARIZATION_KEEP_CHARS` | `20000` | Recent chars to preserve intact during compaction (~5K tokens) |
| `JWT_SECRET` | none | Secret key for JWT signing; leave unset to disable auth entirely |
| `JWT_EXPIRE_MINUTES` | `1440` | JWT token lifetime in minutes (default 24 h) |
| `SNS_TOPIC_ARN` | none | SNS topic to publish investigation findings to after each event-driven run |
| `SQS_QUEUE_URL` | none | SQS queue URL for the event consumer to poll; also set via Settings → AWS Configuration |
| `EVENT_CONSUMER_ENABLED` | `false` | Explicitly enable the SQS event consumer (also auto-starts if `SQS_QUEUE_URL` is set) |
| `DATA_DIR` | `data` | Reserved data directory setting; init state is stored in the selected database backend |

- Cache layer: in-process TTL cache (cachetools) on all 19 AWS tool functions; 2-minute TTL, 256 entry max, AWS profile+region included in cache key
- Schema / models layer: centralized `src/models/` package for all Pydantic models: agent domain, memory state, and API request/response schemas
- Soft-deleted session cleanup job: product version only; OSS users manage their own DB
- Investigation history skill: cross-session analysis: recurring errors, most-triggered alarms, patterns across all past sessions for a user
- User roles: admin / user roles with JWT auth, first-user bootstrap, admin-only user management UI; optional (disabled when `JWT_SECRET` is unset); see [apps/documentation/auth.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/auth.md)
- React frontend: rewrite the single-file HTML UI in React; component-based architecture, proper state management, hot reload
- Dashboard: summarized view of troubleshooting activity, recurring incidents, query breakdown by service
- Multi-provider LLM support: 100+ providers via LiteLLM; swap models with a single `LLM_MODEL` env var change; supports OpenRouter, Anthropic, OpenAI, Groq, Ollama, and any OpenAI-compatible endpoint; see [apps/documentation/llm_providers.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/llm_providers.md)
- MCP integration: expose the agent as an MCP server (`devops-agent mcp`); `investigate`, `ask`, and `list_sessions` tools available in Claude Desktop, Cursor, or any MCP-compatible client; stdio and HTTP+SSE transports; see [apps/documentation/mcp_server.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/mcp_server.md)
- Multi-backend storage: `memory` (zero config), `sqlite` (local file, no external service), `postgres` (production); switch with one env var; see [apps/documentation/databases.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/databases.md)
- Skills system: on-demand investigation skills loaded from `src/skills/<name>/SKILL.md`; skill names are injected into the system prompt at startup, and the full content is loaded only when the agent calls `use_skill` with the skill name; ships with the `lambda-throttling` skill; add your own by dropping a `SKILL.md` into `src/skills/<name>/`
- Custom tools via URL: register external tools by pointing at an OpenAPI/HTTP endpoint; the agent discovers and calls them alongside built-in AWS tools
- Bash CLI escape hatch (Phase 1): `run_bash_command` is implemented for read-only AWS CLI, kubectl, and docker commands with strict allowlist validation and timeout.
- Bash sandbox (Phase 2): run each bash command in an isolated throwaway container (`--network none`, read-only FS, non-root, resource limits).
- Tool response capping: truncates oversized AWS tool responses (CloudWatch logs, CloudTrail events) before they reach the LLM context window; configurable via `TOOL_RESPONSE_MAX_CHARS` (default 40 000 chars ≈ 10 K tokens)
- Conversation summarization: automatically summarize old messages when the session approaches the model's context limit; preserves recent exchanges and injects a structured summary so long investigations never fail mid-session; summarization events are tracked in `usage_events.metadata` and surfaced in the dashboard
- Optimize tool loading: pass only relevant tools per investigation context instead of the full 19-tool set
- Message middleware pipeline: compaction, summarization, intent detection, context trimmer
- Guardrails: input/output validation, PII scrubbing, query scope enforcement
- Multi-model escalation: route simple queries to cheaper/smaller models, escalate hard investigations to larger ones
- Fun streaming labels: contextual loading copy ("Digging through CloudTrail…", "Lemonizing metrics…", "Cooking up a root cause…")
- Slack & Telegram notifications: reactive: posts after every investigation to Slack (Block Kit) and/or Telegram (HTML bot message); proactive: background poller checks CloudWatch alarms and Lambda error rates, auto-investigates, and delivers to both channels; set `SLACK_WEBHOOK_URL` and/or `TELEGRAM_BOT_TOKEN` + `TELEGRAM_CHAT_ID` in `.env`; see [apps/documentation/telegram.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/telegram.md)
- Event-driven incident detection: EventBridge → SQS → long-poll consumer; 9 EventBridge rules covering CloudWatch alarms, ECS, Lambda, RDS, EC2, CodePipeline, and AWS Health; runs in parallel with the metric poller; see [apps/documentation/event_detection.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/event_detection.md)
- Context enrichment: deterministic boto3 calls per event type before the LLM runs; reduces tool call count by front-loading relevant resource facts
- Monitoring dashboard: live incident feed with real-time SSE push, per-service health summary (DB-backed, survives restarts), alert detail page; "View investigation" opens the original agent session for follow-up; failed investigations flagged separately; see [apps/documentation/monitoring.md](/AhmadHammad21/OpenDevOps/blob/main/apps/documentation/monitoring.md)
- AWS Configuration settings tab: admin-only editable tab for SQS/region config; shared org-wide via database-backed app config; inline IAM permission checker
- Observability: OpenTelemetry traces for agent steps, tool call latency, LLM token usage
- Follow-up question suggestions: after each investigation completes, generate 3 suggested follow-up questions in the background and surface them in the UI as clickable chips
- Session / user feedback loop: thumbs up/down on investigations, feed signals back to the agent and to an internal quality dashboard
- Knowledge base: attach internal runbooks, post-mortems, and architecture docs so the agent grounds answers in org-specific context
- Multi-account AWS: support multiple AWS profiles per org via the `aws_profiles` table (schema already in place)
- Multi-cloud support: extend tooling to GCP (Cloud Monitoring, Cloud Logging, GKE) and Azure (Monitor, Log Analytics, AKS); unified incident investigation across providers
- Bash sandbox Phase 2 (isolated Docker container): each `run_bash_command` call runs inside a throwaway container: `--network none`, `--read-only` filesystem, `--memory 256m`, `--cpus 0.5`, non-root user; container destroyed immediately after the command completes; the IAM read-only role remains the last line of defense
- Redis cache: replace in-process cachetools with Redis; shared across workers, survives restarts, per-org cache namespacing to prevent data leakage between tenants
- Soft-deleted session cleanup: scheduled job (Inngest or APScheduler) to purge `is_deleted = TRUE` sessions older than a configurable retention window (default 30 days); GDPR right-to-erasure compliance
- Org-scoped AWS credential management: per-org credential vault; agents use org-scoped profiles instead of a single global `AWS_PROFILE`
- Per-org AWS credential store: encrypted credential vault per organization; agents use org-scoped profiles instead of a single global `AWS_PROFILE`
- Billing & usage metering: track token usage and tool calls per org/user; expose cost dashboards; integrate with Stripe for usage-based billing

The bash execution tool runs whitelisted read-only commands only. Every command is validated against an allowlist before execution; anything not on the list is rejected immediately and logged.

Current (Phase 1): allowlist validation + subprocess with hard timeout. No write commands are permitted under any circumstances. `shell=True` is never used.

Phase 2, isolated sandbox (planned):

- Every bash command runs inside a throwaway Docker container
- `--network none`: no internet access from inside the sandbox
- `--read-only` filesystem: the container cannot write to disk
- `--memory 256m` and `--cpus 0.5`: resource caps
- Non-root user inside the container
- Container is destroyed immediately after the command completes
- Even if the LLM misbehaves, the IAM read-only role is the last line of defense

The agent never executes fixes automatically. It investigates, suggests, and waits for human approval before anything changes.

All backend commands run from `apps/backend/`, or use the root `make` targets.

    # Run tests
    cd apps/backend && uv run pytest
    # or: make test

    # Lint + format
    cd apps/backend && uv run ruff check src/ tests/
    cd apps/backend && uv run ruff format src/ tests/
    # or: make lint / make lint-fix
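The Phase 1 bash tool described in the security notes above (allowlist validation, no `shell=True`, hard timeout) can be sketched roughly as follows. This is an illustration of the pattern, not the project's implementation: the `ALLOWED` rules, `validate`, and `run_read_only` are hypothetical, and the real allowlist (which also covers `az` and `docker`) is not shown in this README.

```python
import shlex
import subprocess

TIMEOUT_SECONDS = 30  # matches the hard 30-second timeout mentioned in the README

# Assumed rules: binary -> (index of the verb in argv, predicate on that verb).
ALLOWED = {
    # aws <service> <operation>: only read-style operations
    "aws": (2, lambda op: op.startswith(("describe-", "get-", "list-"))),
    # kubectl <verb>: exact match on a few read-only verbs
    "kubectl": (1, lambda verb: verb in {"get", "describe", "logs"}),
}


def validate(command: str) -> list[str]:
    """Split the command without a shell and reject anything off the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"binary not on allowlist: {command!r}")
    verb_index, is_read_only = ALLOWED[argv[0]]
    if len(argv) <= verb_index or not is_read_only(argv[verb_index]):
        raise PermissionError(f"operation not on allowlist: {command!r}")
    return argv


def run_read_only(command: str) -> str:
    """Run a validated command as an argv list (no shell) with a hard timeout."""
    argv = validate(command)
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=TIMEOUT_SECONDS
    )
    return result.stdout


# Example (requires the aws CLI and credentials):
# print(run_read_only("aws sts get-caller-identity"))
```

Because the command is passed as an argument list, shell metacharacters such as `;` or `&&` are never interpreted; in this sketch they end up inside a token that fails the allowlist check. The sketch is deliberately strict and would also reject commands that place global options before the service name.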