OpenDevOps – An open-source AI agent that investigates AWS/Azure incidents


Open-source multi-cloud DevOps agent (AWS + Azure). Bring any LLM via LiteLLM — OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama for air-gapped / regulated environments, or reuse your existing Claude Code subscription (auto-detected). Investigates incidents, finds root causes, and gives actionable mitigation plans — without the cloud-vendor DevOps-agent price tag.

On a reproducible 10-incident suite (real AWS + Azure resources, scored against ground truth), running on a commodity open model (`gpt-oss-120b`, no frontier model required):

| Root causes found | Median time | Cost / investigation | vs. AWS DevOps Agent | vs. manual triage |
|---|---|---|---|---|
| 9 / 10 (90%) | ~52 s | ~$0.03 | ~10× cheaper¹ (~$0.03 vs ~$0.43) | ~1,000× cheaper² |

~$0.03 of compute replaces ~$50 of engineer toil and costs a fraction of a managed cloud DevOps agent, while returning the answer in under a minute instead of half an hour. Reproduce it with `make eval` → [full benchmark & methodology].

¹ vs. AWS DevOps Agent: its per-second rate applied to the same wall-clock time (~$0.43/investigation); verify against AWS's published pricing. ² Illustrative unit economics vs. ~20–40 min of on-call triage. Cost shown is the provider-dashboard actual; see [caveats].
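The ratios in the table can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the per-second rate is derived from footnote ¹ ($0.43 over a ~52 s run) and is an assumption for illustration, not AWS's published price.

```python
# Figures quoted in the benchmark summary above.
agent_cost_usd = 0.03      # cost per OpenDevOps investigation
managed_cost_usd = 0.43    # footnote 1: managed-agent estimate for the same run
manual_cost_usd = 50.0     # approximate engineer toil per incident
median_seconds = 52        # median investigation wall-clock time

# Per-second rate implied by footnote 1 (derived here, not a published price).
implied_rate_per_second = managed_cost_usd / median_seconds

vs_managed = managed_cost_usd / agent_cost_usd
vs_manual = manual_cost_usd / agent_cost_usd

print(f"implied managed rate: ${implied_rate_per_second:.4f}/s")
print(f"vs. managed agent: {vs_managed:.1f}x cheaper")
print(f"vs. manual triage: {vs_manual:,.0f}x cheaper")
```

With these inputs both ratios come out somewhat above the rounded "~10×" and "~1,000×" shown in the table, so the headline figures read as order-of-magnitude summaries.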

Cloud setup: AWS (IAM) · Azure (service principal / login)

Autonomous incident detection — a crashing Lambda is caught automatically, the agent reads the traceback from CloudWatch Logs, finds the root cause, surfaces it on the Monitoring dashboard, and posts the mitigation to Slack. No human in the loop.

Amazon Q Developer and the AWS DevOps Agent are excellent if you live entirely inside the AWS Console with Bedrock-managed models. OpenDevOps is the open-source alternative for everyone else:

- **Any LLM, not just Bedrock.** LiteLLM-compatible: OpenAI, Anthropic direct, OpenRouter, Groq, Gemini, Mistral, or run Ollama locally for air-gapped / regulated environments. Auto-detects your existing Claude Code subscription so you pay zero incremental LLM cost if you're already on a Max/Pro plan.
- **Multi-cloud out of the box.** AWS + Azure investigations in the same chat (one organization can connect both clouds at once). AWS-only agents stop at the AWS perimeter.
- **Your data stays in your database.** Investigations, prompts, and tool outputs persist in your Postgres or SQLite: your VPC, your retention, your encryption. Matters for HIPAA, PCI, FedRAMP, and EU AI Act audits.
- **Fully auditable.** Every prompt, tool call (args + result), and token is open and streamed live to the UI; nothing is hidden. The AWS agent is a closed black box.
- **Customizable.** Add tools as plain Python functions, add runbooks by dropping a `SKILL.md` file, modify the system prompt. Fork it if you need to.
- **Investigate from anywhere.** Built-in MCP server makes it usable from Claude Desktop, Cursor, or any MCP client, not just the AWS Console.
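To make the "plain Python functions" point concrete, here is a minimal sketch of how a tool schema can be inferred from a function's signature and docstring. It uses only the standard library; `get_queue_depth` and `infer_schema` are hypothetical names for illustration, and the project's actual inference is handled by its agent framework rather than by this code.

```python
import inspect
from typing import get_type_hints


def get_queue_depth(queue_name: str, region: str = "us-east-1") -> dict:
    """Return the approximate number of messages waiting in a queue."""
    # Hypothetical tool body: a real tool would call a read-only cloud API here.
    return {"queue": queue_name, "region": region, "depth": 0}


# Mapping from Python annotations to JSON-Schema-style type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def infer_schema(func) -> dict:
    """Build a JSON-Schema-like description from a function's signature."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


schema = infer_schema(get_queue_depth)
```

The type hints and docstring carry everything the model needs to call the tool, which is why adding a tool amounts to writing an ordinary annotated function.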

| | OpenDevOps | AWS DevOps Agent / Q Developer |
|---|---|---|
| LLM | Any (LiteLLM, Claude Code, Ollama) | Bedrock-managed only |
| Cloud coverage | AWS + Azure (more coming) | AWS only |
| Data location | Your DB / VPC | AWS-managed, not portable |
| Customization | Open source (modify anything) | Closed product |
| Pricing | LLM at retail (or $0 via Ollama / Claude Code) | Per-investigation + Bedrock markup |
| Self-host | Docker / Railway / on-prem / air-gapped | No |

When AWS is the better pick: if you're 100% AWS, never plan to leave, and want zero infrastructure to run, Amazon Q Developer's native Console integration and AWS-only signals (Trusted Advisor, AWS Config, Compute Optimizer) are hard to beat. OpenDevOps is for everyone else.

- **LangChain DeepAgents** as the agent framework: planning, tool orchestration, and session memory out of the box
- **21 read-only AWS tools** across CloudWatch (6), CloudTrail (2), ECS (4), Lambda (4), EC2 (2), RDS (2), IAM (1), plus bash escape hatch, cross-session history analytics, skills, and `submit_investigation`; plain Python functions, schemas inferred automatically
- **Azure support (CLI-first)**: investigates Azure through the read-only `az` CLI + `kubectl` (for AKS) and a set of Azure runbook skills (AKS debugging, App Service errors, Azure Monitor/KQL, VM diagnostics); no separate SDK tools needed. Read-only; connect via a service principal or `az login`; see apps/documentation/azure_setup.md
- **Sandboxed bash execution tool**: agent can run whitelisted read-only AWS CLI (`aws`), Azure CLI (`az`), kubectl, and docker commands as a last resort when the structured tools fall short; every command validated against an allowlist before execution; never uses `shell=True`; hard 30-second timeout
- **CloudWatch Logs Insights** (`query_logs_insights`): full query language support (`fields`, `filter`, `stats`, `sort`, `limit`); results include scanned MB

- **Streaming responses**: FastAPI SSE endpoint streams agent tokens in real time as the LLM reasons; tool calls appear as they complete
- **Event-driven incident detection**: EventBridge → SQS → long-poll consumer; 9 EventBridge rules cover CloudWatch alarms, ECS task failures, Lambda async errors, RDS events, EC2 state changes, CodePipeline failures, and AWS Health events; uses a DLQ plus database-backed incident claims to avoid duplicate investigations; runs alongside the metric poller; see apps/documentation/event_detection.md
- **Context enrichment**: before the LLM runs, deterministic boto3 calls fetch facts about the affected resource (alarm details, recent logs, function config, etc.) to reduce tool call count and speed up investigations
- **Monitoring dashboard**: live incident feed showing all event-driven investigations: confidence level (or FAILED badge), affected service, root cause summary; each alert links back to its original investigation session via "View investigation" so you can follow up without losing context; real-time SSE push keeps the page live without polling; see apps/documentation/monitoring.md
- **AWS Configuration settings tab**: admin-only editable tab in Settings for SQS Queue URL and AWS Region; shared org-wide via database-backed app config; includes an inline IAM permission checker per service
- **Web UI**: React + Vite SPA served by FastAPI:
  - **Chat page**: streaming responses, collapsible tool call inspector, cost/latency card, stop button; supports `?prompt=` deeplink for pre-seeded investigations from the Monitoring dashboard
  - **Session history sidebar**: lists all past conversations; click any to resume with full tool call inspector and cost card restored; new chat and delete (soft) buttons
  - **Monitoring page**: live incident feed from event-driven detection; alert detail with investigate deeplink
  - **Dashboard**: session counts, tool call stats, cost/latency, context saved, activity chart, service breakdown, root cause distribution, recent sessions
  - **History page**: keyword search across all past sessions
  - **Settings page**: AWS Configuration (editable, admin-only), Environment (read-only env vars), Agent config, Integrations
  - **Team page**: admin-only user management: add, remove, and change roles

- **Auth & RBAC**: optional password-based auth with `admin` and `user` roles; JWT tokens; first registered user auto-becomes admin; disabled by default (set `JWT_SECRET` to enable); see apps/documentation/auth.md
- **Three storage backends**: pick one via `CHECKPOINT_BACKEND` in `.env`; see apps/documentation/databases.md
  - `memory`: zero config, no persistence; great for CI and quick testing; autonomous polling/event monitoring is disabled in this mode
  - `sqlite`: local file, no external services; recommended for single-server and personal use
  - `postgres`: full production persistence via psycopg3 + `AsyncPostgresSaver`
- **Schema**: `users`, `sessions`, `messages`, `tool_calls`, `usage_events`; see apps/documentation/schema.md
- **Soft delete**: deleted sessions are hidden immediately but data is preserved for the 30-day cleanup job

- **Structured logging via Loguru**: used consistently across all modules (tools, agent, API, CLI); every request shows agent reasoning, tool calls with args/results, and a done summary with latency + token counts
- **CLI**: `devops-agent investigate`, `ask`, and `report` commands powered by the same agent
- **Any LLM via LiteLLM**: OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama (local / air-gapped), or any OpenAI-compatible endpoint. Auto-detects a local Claude Code subscription (`~/.claude` OAuth) so a Max/Pro plan can power the agent at zero incremental cost. Swap models via a single env var (`LLM_MODEL`); no code changes
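The "database-backed incident claims" used by event-driven detection can be pictured as a uniqueness constraint: whichever worker inserts the incident key first owns the investigation, and any redelivered SQS message for the same incident is skipped. The sketch below shows that idea with SQLite from the standard library; the `incident_claims` table, its columns, and the key format are assumptions for illustration, not the project's schema.

```python
import sqlite3


def open_claims_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and make sure the (hypothetical) claims table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incident_claims ("
        " incident_key TEXT PRIMARY KEY,"
        " claimed_by TEXT NOT NULL)"
    )
    return conn


def try_claim(conn: sqlite3.Connection, incident_key: str, worker_id: str) -> bool:
    """Return True only for the first worker that claims this incident."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO incident_claims (incident_key, claimed_by)"
            " VALUES (?, ?)",
            (incident_key, worker_id),
        )
    # rowcount is 1 when the row was inserted and 0 when the key already existed.
    return cur.rowcount == 1


conn = open_claims_db()
first = try_claim(conn, "alarm:HighLatencyAlarm", "worker-1")
duplicate = try_claim(conn, "alarm:HighLatencyAlarm", "worker-2")
```

Because the check and the insert are a single statement, two consumers racing on the same message cannot both win the claim.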

cd apps/backend && uv sync
cp .env.example .env
aws configure --profile devops-agent-readonly

aws sts get-caller-identity --profile devops-agent-readonly

Three options: pick one and add it to `.env`. Full details in apps/documentation/databases.md.

Memory (default — zero config, nothing persists on restart)

CHECKPOINT_BACKEND=memory

SQLite (recommended for local dev — persists to a file, no external service needed)

CHECKPOINT_BACKEND=sqlite
SQLITE_PATH=./data/agent.db   # created automatically on first start

PostgreSQL (recommended for production)

docker run -d --name opendevops-pg \
  -e POSTGRES_DB=opendevops \
  -e POSTGRES_USER=dev \
  -e POSTGRES_PASSWORD=dev \
  -p 5433:5432 \
  postgres:16

CHECKPOINT_BACKEND=postgres
DATABASE_URL=postgresql://dev:dev@localhost:5433/opendevops

cd apps/backend && uv run migrate

Option A — Docker Compose (recommended, AWS CLI included)

docker compose -f deployment/docker-compose/docker-compose.yml up --build

The backend image installs AWS CLI v2 automatically, so the bash execution tool works out of the box. Host AWS credentials (`~/.aws`) are mounted read-only into the container. For production on AWS, remove the volume mount and attach an IAM role to the instance/task instead.

Option B — Local dev (two terminals)

cd apps/backend && uv run dev
cd apps/frontend && npm run dev

Note: local dev requires the `aws` CLI installed on your machine for the bash tool to work. Install it from https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

CLI

cd apps/backend

uv run devops-agent investigate "high error rate on my payment Lambda"

uv run devops-agent investigate "latency spike" --alarm HighLatencyAlarm --service api-service

uv run devops-agent ask "why would a Lambda function suddenly start throttling?"

uv run devops-agent report

The agent needs read access across your AWS account, plus optional write access scoped to `opendevops-*` resources if you use the event-driven monitoring setup wizard. Two least-privilege policies (Operational + Setup) and full step-by-step instructions are in **apps/documentation/iam_setup.md**.

apps/
├── core/                  # Installable package `opendevops-core` — the shared agent brain
│   └── src/opendevops_core/
│       ├── agent/         # DeepAgents setup, prompts, LLM wiring, DB layer (backends + ABC)
│       ├── tools/         # bash, history, skills, final-answer + response cap/cache
│       ├── providers/     # AWS provider — tools, context, poller, event consumer
│       ├── models/        # Pydantic models: agent, chat, sessions, users
│       ├── skills/        # Markdown runbooks (lambda-throttling + add your own)
│       ├── integrations/  # slack_webhook.py, telegram.py
│       ├── migrations/    # Numbered baseline SQL migrations (001–013) — bundled with the wheel
│       └── config.py      # CoreSettings + get_settings()/configure() injection hook
├── backend/               # OSS web app + CLI — depends on opendevops-core via uv path source
│   ├── src/
│   │   ├── api/
│   │   │   ├── app.py     # FastAPI app factory — mounts routers, serves frontend
│   │   │   ├── auth.py    # JWT helpers + FastAPI auth dependencies
│   │   │   └── routers/   # chat, sessions, users, settings, history, dashboard, monitoring
│   │   ├── cli/           # Typer CLI commands
│   │   ├── config/
│   │   │   └── appsettings.py  # Settings(CoreSettings) — adds web/auth-only fields, calls configure()
│   │   └── mcp_server.py  # MCP server (stdio / HTTP+SSE)
│   ├── migrations/        # OSS-app-only migrations (currently none; all schema is core baseline)
│   ├── tests/
│   └── pyproject.toml
├── frontend/
│   └── src/
│       ├── pages/         # ChatPage, DashboardPage, HistoryPage, SettingsPage, UsersPage, LoginPage
│       └── components/    # Sidebar, Header, ProtectedRoute, AgentMessage, ...
└── documentation/         # Feature reference — auth, schema, skills, databases, UI, ...
deployment/
├── docker-compose/        # docker-compose.yml (PostgreSQL + backend + frontend)
└── railway/               # Dockerfile.railway + railway.toml (combined single-image deploy)

| Variable | Default | Description |
|---|---|---|
| `LLM_MODEL` | `openrouter/openai/gpt-4o` | LiteLLM model string in `provider/model` format |
| `LLM_API_BASE` | none | Custom base URL for OpenAI-compatible endpoints (e.g. Ollama, vLLM) |
| `LLM_API_KEY` | none | API key for custom endpoints; standard provider keys (e.g. `ANTHROPIC_API_KEY`) are read automatically |
| `OPENROUTER_API_KEY` | none | Required when using any `openrouter/` model |
| `CHECKPOINT_BACKEND` | `memory` | Storage backend: `memory` · `sqlite` · `postgres` |
| `SQLITE_PATH` | `./data/agent.db` | SQLite file path; only used when `CHECKPOINT_BACKEND=sqlite` |
| `DATABASE_URL` | none | PostgreSQL connection string; only used when `CHECKPOINT_BACKEND=postgres` |
| `AWS_REGION` | `us-east-1` | AWS region |
| `AWS_PROFILE` | none | AWS named profile (e.g. `devops-agent-readonly`) |
| `MAX_TOOL_CALLS` | `20` | Hard cap on tool calls per investigation |
| `INVESTIGATION_TIMEOUT` | `120` | Timeout in seconds |
| `TOOL_RESPONSE_MAX_CHARS` | `40000` | Truncate tool responses larger than this before feeding to the LLM; `0` disables |
| `SLACK_WEBHOOK_URL` | none | Slack incoming webhook URL; leave unset to disable notifications |
| `TELEGRAM_BOT_TOKEN` | none | Telegram bot token from @BotFather; leave unset to disable |
| `TELEGRAM_CHAT_ID` | none | Target chat/group/channel ID (negative number for groups) |
| `POLL_INTERVAL_SECONDS` | `0` | Proactive polling interval in seconds; `0` disables the poller |
| `POLL_ERROR_THRESHOLD` | `5.0` | Lambda error rate % that triggers an automatic investigation |
| `POLL_REINVESTIGATE_HOURS` | `1` | Cooldown period: skip re-investigating the same alarm within N hours |
| `SUMMARIZATION_ENABLED` | `true` | Auto-compact sessions when they exceed the threshold |
| `SUMMARIZATION_THRESHOLD_CHARS` | `60000` | Total session chars before compaction fires (~15K tokens) |
| `SUMMARIZATION_KEEP_CHARS` | `20000` | Recent chars to preserve intact during compaction (~5K tokens) |
| `JWT_SECRET` | none | Secret key for JWT signing; leave unset to disable auth entirely |
| `JWT_EXPIRE_MINUTES` | `1440` | JWT token lifetime in minutes (default 24 h) |
| `SNS_TOPIC_ARN` | none | SNS topic to publish investigation findings to after each event-driven run |
| `SQS_QUEUE_URL` | none | SQS queue URL for the event consumer to poll; also set via Settings → AWS Configuration |
| `EVENT_CONSUMER_ENABLED` | `false` | Explicitly enable the SQS event consumer (also auto-starts if `SQS_QUEUE_URL` is set) |
| `DATA_DIR` | `data` | Reserved data directory setting; init state is stored in the selected database backend |
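As an illustration of how the two poller settings interact, the sketch below combines an error-rate threshold with a re-investigation cooldown. The function names are hypothetical, and whether the real poller compares with `>=` or `>` is an assumption; only the default values come from the table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

POLL_ERROR_THRESHOLD = 5.0      # percent, default from the table
POLL_REINVESTIGATE_HOURS = 1    # cooldown in hours, default from the table


def error_rate_percent(errors: int, invocations: int) -> float:
    """Error rate as a percentage; zero invocations counts as a 0% rate."""
    return 100.0 * errors / invocations if invocations else 0.0


def should_investigate(
    errors: int,
    invocations: int,
    last_investigated: Optional[datetime],
    now: datetime,
) -> bool:
    """Trigger only above the threshold and outside the cooldown window."""
    if error_rate_percent(errors, invocations) < POLL_ERROR_THRESHOLD:
        return False
    if last_investigated is None:
        return True
    return now - last_investigated >= timedelta(hours=POLL_REINVESTIGATE_HOURS)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=30)
old = now - timedelta(hours=2)
```

The cooldown keeps a flapping alarm from spending the investigation budget repeatedly on the same root cause.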

- **Cache layer**: in-process TTL cache (`cachetools`) on all 19 AWS tool functions; 2-minute TTL, 256 entry max, AWS profile+region included in cache key
- **Schema / models layer**: centralized `src/models/` package for all Pydantic models: agent domain, memory state, and API request/response schemas
- **Soft-deleted session cleanup job**: product version only; OSS users manage their own DB
- **Investigation history skill**: cross-session analysis: recurring errors, most-triggered alarms, patterns across all past sessions for a user
- **User roles**: `admin`/`user` roles with JWT auth, first-user bootstrap, admin-only user management UI; optional (disabled when `JWT_SECRET` unset); see apps/documentation/auth.md
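The cache-layer item describes a TTL cache whose key includes the AWS profile and region, so results from one account or region are never served for another. The sketch below is a small standard-library stand-in for that behavior, not the `cachetools` implementation the project uses; `TTLCache`, `cache_key`, and the tool name `describe_alarms` are illustrative assumptions.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Minimal TTL cache: entries expire after `ttl` seconds, oldest evicted first."""

    def __init__(self, maxsize: int = 256, ttl: float = 120.0, clock=time.monotonic):
        self._maxsize = maxsize
        self._ttl = ttl
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return default
        return value

    def put(self, key, value) -> None:
        self._data.pop(key, None)
        while len(self._data) >= self._maxsize:
            self._data.popitem(last=False)  # evict the oldest insertion
        self._data[key] = (self._clock() + self._ttl, value)


def cache_key(tool_name: str, profile: str, region: str, **kwargs) -> tuple:
    """Profile and region are part of the key, alongside the tool arguments."""
    return (tool_name, profile, region, tuple(sorted(kwargs.items())))


now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(maxsize=256, ttl=120.0, clock=lambda: now[0])
key_us = cache_key("describe_alarms", "devops-agent-readonly", "us-east-1", name="A")
key_eu = cache_key("describe_alarms", "devops-agent-readonly", "eu-west-1", name="A")
cache.put(key_us, {"state": "ALARM"})
```

Injecting the clock keeps expiry testable without sleeping for two minutes.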

- **React frontend**: rewrite the single-file HTML UI in React; component-based architecture, proper state management, hot reload
- **Dashboard**: summarized view of troubleshooting activity, recurring incidents, query breakdown by service
- **Multi-provider LLM support**: 100+ providers via LiteLLM; swap models with a single `LLM_MODEL` env var change; supports OpenRouter, Anthropic, OpenAI, Groq, Ollama, and any OpenAI-compatible endpoint; see apps/documentation/llm_providers.md
- **MCP integration**: expose the agent as an MCP server (`devops-agent mcp`); `investigate`, `ask`, and `list_sessions` tools available in Claude Desktop, Cursor, or any MCP-compatible client; stdio and HTTP+SSE transports; see apps/documentation/mcp_server.md
- **Multi-backend storage**: `memory` (zero config), `sqlite` (local file, no external service), `postgres` (production); switch with one env var; see apps/documentation/databases.md
- **Skills system**: on-demand investigation skills loaded from `src/skills/*/SKILL.md`; skill names injected into system prompt at startup, full content loaded only when agent calls `use_skill(name)`; ships with `lambda-throttling` skill; add your own by dropping a `SKILL.md` into `src/skills/<name>/`
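The skills item describes a two-stage load: only skill names go into the system prompt at startup, and a skill's full text is read when the agent calls `use_skill(name)`. The sketch below mirrors that pattern with `pathlib`; the `SkillRegistry` class and its method names (other than `use_skill`) are hypothetical and do not reflect the project's internals.

```python
from pathlib import Path


class SkillRegistry:
    """Discover skill names eagerly, read SKILL.md bodies lazily."""

    def __init__(self, skills_dir):
        self._dir = Path(skills_dir)
        # Only directory names are collected here; no file contents are read.
        self._names = sorted(p.parent.name for p in self._dir.glob("*/SKILL.md"))

    def names(self) -> list:
        return list(self._names)

    def prompt_section(self) -> str:
        """Short text suitable for injecting into a system prompt."""
        return "Available skills: " + ", ".join(self._names)

    def use_skill(self, name: str) -> str:
        """Load the full runbook only when it is actually requested."""
        if name not in self._names:
            raise KeyError(f"unknown skill: {name}")
        return (self._dir / name / "SKILL.md").read_text(encoding="utf-8")
```

Keeping runbook bodies out of the prompt until they are requested saves context for the investigation itself, which matters more as the number of skills grows.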

- **Custom tools via URL**: register external tools by pointing at an OpenAPI/HTTP endpoint; agent discovers and calls them alongside built-in AWS tools
- **Bash CLI escape hatch (Phase 1)**: `run_bash_command` is implemented for read-only AWS CLI, kubectl, and docker commands with strict allowlist validation and timeout.
- **Bash sandbox Phase 2**: run each bash command in an isolated throwaway container (`--network none`, read-only FS, non-root, resource limits).
- **Tool response capping**: truncates oversized AWS tool responses (CloudWatch logs, CloudTrail events) before they reach the LLM context window; configurable via `TOOL_RESPONSE_MAX_CHARS` (default 40 000 chars ≈ 10 K tokens)
- **Conversation summarization**: automatically summarize old messages when the session approaches the model's context limit; preserves recent exchanges and injects a structured summary so long investigations never fail mid-session; summarization events tracked in `usage_events.metadata` and surfaced in the dashboard
- **Optimize tool**: pass only relevant tools per investigation context instead of the full 19-tool set
- **Message middleware pipeline**: compaction, summarization, intent detection, context trimmer
- **Guardrails**: input/output validation, PII scrubbing, query scope enforcement
- **Multi-model escalation**: route simple queries to cheaper/smaller models, escalate hard investigations to larger ones
- **Fun streaming labels**: contextual copy ("Digging through CloudTrail…", "Lemonizing metrics…", "Cooking up a root cause…")
- **Slack & Telegram notifications**: reactive: posts after every investigation to Slack (Block Kit) and/or Telegram (HTML bot message); proactive: background poller checks CloudWatch alarms and Lambda error rates, auto-investigates, and delivers to both channels; set `SLACK_WEBHOOK_URL` and/or `TELEGRAM_BOT_TOKEN` + `TELEGRAM_CHAT_ID` in `.env`; see apps/documentation/telegram.md
- **Event-driven incident detection**: EventBridge → SQS → long-poll consumer; 9 EventBridge rules covering CloudWatch alarms, ECS, Lambda, RDS, EC2, CodePipeline, and AWS Health; runs in parallel with the metric poller; see apps/documentation/event_detection.md
- **Context enrichment**: deterministic boto3 calls per event type before LLM runs; reduces tool call count by front-loading relevant resource facts
- **Monitoring dashboard**: live incident feed with real-time SSE push, per-service health summary (DB-backed, survives restarts), alert detail page; "View investigation" opens the original agent session for follow-up; failed investigations flagged separately; see apps/documentation/monitoring.md
- **AWS Configuration settings tab**: admin-only editable tab for SQS/region config; shared org-wide via database-backed app config; inline IAM permission checker
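Tool response capping, as described in the list above, comes down to a length check before a tool result is handed to the model. The sketch below is one way to write it; the function name and the wording of the truncation marker are assumptions, while the 40 000-character default and the "`0` disables" behavior come from the configuration table.

```python
TOOL_RESPONSE_MAX_CHARS = 40000  # default from the configuration table


def cap_tool_response(text: str, max_chars: int = TOOL_RESPONSE_MAX_CHARS) -> str:
    """Truncate an oversized tool response; a limit of 0 disables capping."""
    if max_chars <= 0 or len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    # Hypothetical marker so the model knows the output was cut short.
    return text[:max_chars] + f"\n[truncated {omitted} chars]"


log_dump = "x" * 100_000
capped = cap_tool_response(log_dump)
```

Telling the model how much was dropped lets it decide whether to re-query with a narrower filter instead of reasoning over a silently partial log.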

- **Observability**: OpenTelemetry traces for agent steps, tool call latency, LLM token usage
- **Follow-up question suggestions**: after each investigation completes, generate 3 suggested follow-up questions in the background and surface them in the UI as clickable chips
- **Session / user feedback loop**: thumbs up/down on investigations, feed signals back to the agent and to an internal quality dashboard
- **Knowledge base**: attach internal runbooks, post-mortems, and architecture docs so the agent grounds answers in org-specific context
- **Multi-account AWS**: support multiple AWS profiles per org via `aws_profiles` table (schema already in place)
- **Multi-cloud support**: extend tooling to GCP (Cloud Monitoring, Cloud Logging, GKE) and Azure (Monitor, Log Analytics, AKS); unified incident investigation across providers
- **Bash sandbox Phase 2, isolated Docker container**: each `run_bash_command` call runs inside a throwaway container: `--network none`, `--read-only` filesystem, `--memory 256m`, `--cpus 0.5`, non-root user; container destroyed immediately after the command completes; IAM read-only role remains the last line of defense

- **Redis cache**: replace in-process `cachetools` with Redis; shared across workers, survives restarts, per-org cache namespacing to prevent data leakage between tenants
- **Soft-deleted session cleanup**: scheduled job (Inngest or APScheduler) to purge `is_deleted = TRUE` sessions older than a configurable retention window (default 30 days); GDPR right-to-erasure compliance
- **Org-scoped AWS credential management**: per-org credential vault; agents use org-scoped profiles instead of a single global `AWS_PROFILE`
- **Per-org AWS credential store**: encrypted credential vault per organization; agents use org-scoped profiles instead of a single global `AWS_PROFILE`
- **Billing & usage metering**: track token usage and tool calls per org/user; expose cost dashboards; integrate with Stripe for usage-based billing

The bash execution tool runs whitelisted read-only commands only. Every command is validated against an allowlist before execution — anything not on the list is rejected immediately and logged.

Current (Phase 1): allowlist validation + subprocess with hard timeout. No write commands permitted under any circumstances. `shell=True` is never used.
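A minimal sketch of the Phase 1 approach follows: split the command into an argument list, check it against an allowlist, then run it without a shell under a hard timeout. The allowlist rules here (`AWS_READ_OPERATION`, `KUBECTL_READ_VERBS`) and the function names are assumptions for illustration; the project's real allowlist is not reproduced in this text.

```python
import re
import shlex
import subprocess

# Hypothetical allowlist rules; the project's real allowlist is not shown here.
AWS_READ_OPERATION = re.compile(r"(describe|list|get)-[a-z0-9-]+")
KUBECTL_READ_VERBS = {"get", "describe", "logs"}


def validate(command: str) -> list[str]:
    """Split a command into argv and reject anything not on the allowlist."""
    argv = shlex.split(command)
    if len(argv) >= 3 and argv[0] == "aws":
        # Expected shape: aws <service> <operation> [options...]
        allowed = AWS_READ_OPERATION.fullmatch(argv[2]) is not None
    elif len(argv) >= 2 and argv[0] == "kubectl":
        allowed = argv[1] in KUBECTL_READ_VERBS
    else:
        allowed = False
    if not allowed:
        raise PermissionError(f"command not on allowlist: {command!r}")
    return argv


def run_read_only(command: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run a validated command as an argv list (no shell) with a hard timeout."""
    argv = validate(command)
    # Passing a list keeps shell=False, so metacharacters are never interpreted.
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout, check=False)
```

The validator fails closed: anything that does not match a known read-only shape is rejected, including commands that are probably harmless but unrecognized.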

Phase 2: Isolated sandbox (planned):

- Every bash command runs inside a throwaway Docker container
- `--network none`: no internet access from inside the sandbox
- `--read-only` filesystem: container cannot write to disk
- `--memory 256m` and `--cpus 0.5`: resource caps
- Non-root user inside the container
- Container is destroyed immediately after the command completes
- Even if the LLM misbehaves, the IAM read-only role is the last line of defense
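The planned Phase 2 flags map directly onto a `docker run` argument list. The sketch below only builds that list and does not launch anything; the image name `opendevops-sandbox:latest` and the `65534:65534` user are placeholders chosen for illustration, and `--rm` is used here as one way to have the container removed once the command exits.

```python
def sandbox_argv(
    command_argv: list[str],
    image: str = "opendevops-sandbox:latest",  # hypothetical image name
    user: str = "65534:65534",                 # assumed non-root uid:gid
) -> list[str]:
    """Build a docker argv that runs one command in a locked-down container."""
    return [
        "docker", "run",
        "--rm",                 # remove the container when the command exits
        "--network", "none",    # no network access from inside the sandbox
        "--read-only",          # root filesystem mounted read-only
        "--memory", "256m",     # memory cap
        "--cpus", "0.5",        # CPU cap
        "--user", user,         # run as a non-root user
        image,
        *command_argv,
    ]


argv = sandbox_argv(["kubectl", "version", "--client"])
```

Building an argument list rather than a command string keeps this consistent with the no-shell rule from Phase 1.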

The agent never executes fixes automatically. It investigates, suggests, and waits for human approval before anything changes.

All backend commands run from `apps/backend/`, or use root `make` targets.

cd apps/backend && uv run pytest      # or: make test

cd apps/backend && uv run ruff check src/ tests/
cd apps/backend && uv run ruff format src/ tests/   # or: make lint / make lint-fix

source & further reading

github.com — original article
