This layer has three tiers now. The gap between tier 1 and tier 2 is real, and tier 3 is growing fast.
These are the tools most professional developers use daily. The SWE-bench scores tell part of the story; the real picture is more nuanced.
| Tool | Type | Price | SWE-bench | Best For |
|---|---|---|---|---|
| Claude Code | Terminal-native | $20–200/mo (Claude plans) | 87.6% (Opus 4.7) | Terminal-first architectural refactors, 1M context window |
| Cursor | AI-native IDE (VS Code fork) | $20–200/mo | 73.7% (Composer 2) | Best all-in-one agentic IDE, Background Agents (up to 8 parallel) |
| GitHub Copilot | IDE extension + Agent HQ | $10–39/mo | 56% | GitHub-native teams, deepest enterprise governance |
| Windsurf | AI-native IDE (VS Code fork) | $15–200/mo | n/a | Value-conscious, Cascade agent, EU compliant / FedRAMP certified |
What changed this year: Claude Code went from research preview to $2.5B+ run-rate. Cursor crossed 1M paid users. GitHub Copilot switched to credit-based billing (June 2026) and upset a lot of enterprise customers. Windsurf was acquired by Cognition, raising questions about its roadmap independence.
These are the open-source tools that serious developers swear by. They trade polish for control.
| Tool | Type | Price | Key Trait |
|---|---|---|---|
| Aider | Terminal CLI, Apache 2.0 | Free + BYO key | Git-native: every edit is a commit. Pairs with any model. 88% SWE-bench with GPT-5.5 under the hood |
| Cline | VS Code extension, Apache 2.0 | Free + BYO key | 5M+ installs. Plan-and-act workflow, native MCP support, full control over every step |
| Continue | VS Code + JetBrains, Apache 2.0 | Free + BYO key | 20+ model providers including local Ollama. Best for offline/air-gapped setups |
| Kilo Code | VS Code + JetBrains + CLI, OSS | Free BYOK or $15/mo Teams | 500+ models from 60+ providers. True model neutrality across IDEs |
The trend here: BYOK (bring your own key) is standard now. Opaque SaaS-only subscriptions are dying. Developers want to own their model relationship and swap providers freely.
These run in the cloud and operate on their own. Different value proposition entirely — you delegate, not pair-program.
| Tool | Type | Price | Best For |
|---|---|---|---|
| Devin (Cognition) | Cloud autonomous agent | ~$500/mo Team + ACU | Delegate large async backlog tasks, sandboxed VMs |
| Factory | Cloud enterprise agents | Enterprise | Enterprise code generation at scale |
| Bolt.new (StackBlitz) | Browser, instant full-stack | Free / $20–200/mo | Quick prototypes, full-stack apps from prompts |
| Lovable | Browser, visual builder | Free / $20–100/mo | Non-devs building web apps |
| v0 (Vercel) | Browser, UI-focused | Free / $20/mo | React/Next.js component generation |
| Replit Agent | Browser, full-stack | $25/mo | Students, hobbyists, fast iteration loops |
This layer is fragmenting into three sub-categories: pure observability, gateway+observability convergence, and the legacy tools that are being left behind.
| Tool | License | Self-Host | Pricing Entry | Best For |
|---|---|---|---|---|
| LangFuse | MIT core | ✅ Yes | Free → $29/mo → $199/mo → $2,499/mo enterprise | OSS observability with prompt management, 29K ★. ThoughtWorks "Assess" recommendation |
| LangSmith | Closed (MIT SDK) | Enterprise only | Free → $39/seat/mo | LangChain/LangGraph teams. Deepest graph topology capture |
| Arize Phoenix | ELv2 (source-available) | ✅ Yes | Free → $50/mo AX Pro | OpenTelemetry/OpenInference native. Clean local dev workbench |
| Braintrust | Closed SaaS | ❌ | Free → $249/mo Pro | Best eval UI in the market. Polished, closed platform |
| Weights & Biases | Closed SaaS | ❌ | Free → enterprise | Experiment tracking + LLM evaluation. The ML default |
| Datadog LLM Obs | Closed SaaS | ❌ | APM-based | Existing Datadog shops that want LLM traces in the same dashboard |
The key tension here: LangFuse vs LangSmith is becoming the main OSS-vs-closed debate. LangFuse wins on portability and self-hosting; LangSmith wins on LangChain ergonomics. Phoenix has the best OTel story but the ELv2 license is a procurement headache for some enterprises.
A new pattern: tools that handle both routing AND tracing in one stack.
| Tool | License | Key Trait |
|---|---|---|
| Future AGI traceAI | Apache 2.0 | Full-stack: gateway + guardrails + evals + simulation. 14 span kinds, 50+ AI instrumentations |
| Portkey | MIT gateway, closed control plane | Acquired by Palo Alto for $140M (April 2026). 250+ models, governance features, now part of Prisma AIRS |
| LiteLLM | MIT | Most popular OSS proxy. 100+ providers, weighted fallbacks. Pairs with LangFuse or Braintrust for observability |
| OpenLLMetry | Apache 2.0 | DIY OpenTelemetry pipeline. Backend-agnostic. Minimal UI |
| Tool | Status |
|---|---|
| Helicone | Acquired by Mintlify (March 2026) → maintenance mode only. Still works, but no new features. Migration recommended |
| W&B Weave | Superseded by W&B's newer LLM eval platform |
| MLflow (LLM tracing) | Functional but not LLM-native. Better suited for traditional ML workflows |
This layer has seen the most dramatic change in 2026. One of the Big Three is effectively dead, and the provider-native SDKs are maturing fast.
| Framework | Status (June 2026) | License | GitHub ★ | Best For |
|---|---|---|---|---|
| LangGraph | ✅ Active | MIT | ~32K | Explicit state machines, time-travel debugging, human-in-the-loop checkpoints |
| CrewAI | ✅ Active | MIT | ~51K | Role-based crews (researcher, writer, critic). Fastest time-to-first-demo |
| AutoGen | ❌ Maintenance mode | MIT + CC-BY-4.0 | ~58K | Do not start new projects. Last release v0.7.5 (September 2025). Migrate to MAF or AG2 |
What happened to AutoGen: Microsoft merged it into Microsoft Agent Framework (MAF), a combined runtime with Semantic Kernel that offers Python and C# parity, durability, and governance features. MAF is at ~10K ★. The community fork lives on as AG2 (ag2.ai).
The cloud providers are building their own. These are getting good.
| SDK | License | Languages | ★ | Best For |
|---|---|---|---|---|
| OpenAI Agents SDK | Apache 2.0 | Python, TypeScript | ~26K | Cleanest handoff model. Sandboxed execution with workspace snapshots. 3-tier guardrails |
| Google ADK | Apache 2.0 | Python, TS, Java, Go, Kotlin | ~20K | Widest language support. Native A2A protocol. Deploys to Vertex AI Agent Engine |
| Claude Agent SDK | MIT | Python, TypeScript | ~7K | Deepest MCP integration (200+ servers). Built-in file/shell access. Safety-first architecture |
Key trend: All three now support MCP. Google is pushing A2A for cross-vendor agent discovery. OpenAI has the best sandbox story. Anthropic has the deepest OS-level tools.
| Framework | Best For |
|---|---|
| PydanticAI | Type-safe structured outputs, Python-native. Built on Pydantic |
| DSPy (Stanford) | Programmatic prompt optimization. Compile prompts from signatures |
| Semantic Kernel (Microsoft) | Enterprise .NET/Python plugin architecture |
| LlamaIndex | RAG-first agents with data connectors |
| Vercel AI SDK | TypeScript streaming + tool use. Frontend-native |
| Mastra | TypeScript agent framework with built-in workflow engine |
| Agno (ex-Phidata) | Lightweight, memory-aware, multi-modal support |
| Bee Agent (IBM) | ReAct patterns, enterprise-grade tool use |
| Haystack (deepset) | NLP pipelines, RAG, agent nodes |
| Atomic Agents | Minimalist, modular, explicitly anti-framework |
| AG2 | Community fork of AutoGen, keeping it alive |
Two distinct sub-layers that are increasingly being sold together.
| Tool | License | Price | Key Feature |
|---|---|---|---|
| LiteLLM | MIT / BSL 1.1 | Free OSS → $50/mo Cloud | 100+ providers, weighted round-robin, fallback chains |
| Portkey | MIT / Closed CP | Free → $49/mo Prod | 250+ LLMs, governance + guardrails + semantic caching. Now part of Palo Alto Prisma AIRS |
| Kong AI Gateway | Apache 2.0 | Free OSS → Enterprise | Unified API mesh + AI gateway |
| Cloudflare AI Gateway | Closed | Pay-as-you-go | Zero ops, Cloudflare edge ecosystem |
| AWS Bedrock Gateway | AWS-managed | Pay-as-you-go | AWS-native, FedRAMP, HIPAA eligible |
| OpenRouter | Closed | Pay-per-token | 300+ models, single API key, simplest setup |
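For readers who have not used a gateway, here is a rough idea of what "weighted routing with fallback chains" means in practice. This is a plain-Python sketch, not the API of LiteLLM, Portkey, or any other gateway listed above. `Provider` and `route_with_fallback` are hypothetical names, and weighted random selection stands in for true weighted round-robin to keep the example short.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Provider:
    """Hypothetical provider entry: a name, a routing weight, and a call function."""
    name: str
    weight: float
    call: Callable[[str], str]


def route_with_fallback(
    providers: Sequence[Provider],
    prompt: str,
    rng: Optional[random.Random] = None,
) -> tuple[str, str]:
    """Pick a provider by weight; on failure, fall back to the remaining ones.

    Returns (provider_name, response). Raises RuntimeError if every provider fails.
    """
    rng = rng or random.Random()
    remaining = list(providers)
    errors = []
    while remaining:
        # Weighted random choice among the providers that have not failed yet.
        weights = [p.weight for p in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        try:
            return chosen.name, chosen.call(prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            errors.append(f"{chosen.name}: {exc}")
            remaining.remove(chosen)
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A production gateway would add retries, timeouts, and per-provider health tracking on top of this. The sketch only shows the routing decision.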
Supply chain alert: LiteLLM v1.82.7 and v1.82.8 on PyPI contained credential-stealing malware in March 2026 (the TeamPCP attack). The malicious releases were live for ~3 hours, and the NHS issued a national alert. The official Docker images were unaffected. Pin versions and prefer the official Docker images.
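Checksum verification is cheap to add to a build script. The sketch below uses only the Python standard library. `sha256_of` and `verify_sha256` are hypothetical helper names, and the expected digest is assumed to come from a source you already trust, such as a lockfile committed before the incident. pip's hash-checking mode (`--require-hashes`) serves the same purpose at install time.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a digest obtained from a trusted source."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A build step would call `verify_sha256` on each downloaded wheel or tarball and abort on the first `False`.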
| Tool | License | Key Feature |
|---|---|---|
| Guardrails AI | MIT | Output validation: PII, toxicity, custom validators. Pairs with any gateway |
| NeMo Guardrails (Nvidia) | Apache 2.0 | Colang DSL for dialog rails. Topical guardrails, fact-checking |
| Microsoft Agent Governance Toolkit | n/a | Covers 10/10 OWASP Agentic Top 10 (gateways cover 0–1). Governs agent actions, not just LLM outputs |
| Barbacane | n/a | Security-first AI gateway with guardrail integration |
Important architectural distinction from Microsoft's own docs: Guardrails validate LLM outputs. Agent governance controls agent actions (tool calls, identity, sandboxing, crypto auth). These are complementary, not competing.
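The difference is easier to see in code. This sketch uses hypothetical names throughout (`validate_output`, `authorize_action`, `ToolCall`, `Policy`) and is not taken from Guardrails AI, NeMo Guardrails, or Microsoft's toolkit. The first function looks only at text, while the second looks at what the agent is about to do.

```python
import re
from dataclasses import dataclass, field

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def validate_output(text: str) -> list[str]:
    """Guardrail-style check: inspects the model's text only."""
    issues = []
    if EMAIL_RE.search(text):
        issues.append("possible email address in output")
    return issues


@dataclass
class ToolCall:
    """Hypothetical record of an action the agent wants to take."""
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class Policy:
    """Hypothetical policy: an allowlist of tools plus blocked path prefixes."""
    allowed_tools: set[str]
    blocked_path_prefixes: tuple[str, ...] = ("/etc", "/root")


def authorize_action(call: ToolCall, policy: Policy) -> tuple[bool, str]:
    """Governance-style check: inspects the action, not the text."""
    if call.tool not in policy.allowed_tools:
        return False, f"tool '{call.tool}' is not on the allowlist"
    path = str(call.args.get("path", ""))
    if path.startswith(policy.blocked_path_prefixes):
        return False, f"path '{path}' is in a blocked location"
    return True, "ok"
```

An output with no PII can still trigger a dangerous tool call, and a harmless tool call can still return text that leaks PII, which is why the two checks complement each other.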
There's a pattern visible across all four layers above. Every tool either watches or executes. None of them intervene.
| Layer | What It Does | Examples | Limitation |
|---|---|---|---|
| Coding Agents | Write code | Cursor, Copilot, Aider | No built-in failure detection |
| Observability | Records what happened | LangFuse, Phoenix, Braintrust | Post-hoc only: you read reports after the fact |
| Orchestration | Runs the agent graph | LangGraph, CrewAI, ADK | Executes faithfully even when the agent is failing |
| Gateways | Routes requests | LiteLLM, Portkey, OpenRouter | Sees wire-level but not agent behavior |
| Guardrails | Blocks bad output | Guardrails AI, NeMo | Validates text, doesn't understand agent loops/deadlocks/hallucination patterns |
The missing layer: something that watches the agent in real time, detects when it's going off the rails, and intervenes autonomously.
A few projects are starting to fill this gap:
| Project | Language | License | Approach |
|---|---|---|---|
| HarnessForge | Rust (PyO3 + NAPI-RS bindings) | MIT | Open-core SDK. 12 health observers, 16 detectors (loop, staleness, cost anomaly, secret leak, etc.), 14 intervention strategies (nudge → circuit-break). Two-level: session harness + meta-harness that improves its own rules across sessions |
| Microsoft Agent Governance Toolkit | Python | n/a | Governs agent actions, identity, sandboxing. Covers the full OWASP Agentic Top 10. Focused on enterprise policy enforcement |
| Future AGI Protect | Python/TS | Apache 2.0 | Guardrails-as-a-platform with real-time detection. Part of the Future AGI unified stack |
What makes this different from observability: Observability tells you "cost spiked at 2:34 PM." An active runtime detects the spike as it starts at turn 3 and swaps the model, so you save the money before the spike runs its course.
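To make the cost example concrete, here is a minimal sketch of an in-loop cost guard. It is an illustration only, not how HarnessForge or any other listed project works. `CostGuard`, its thresholds, and the model names are all assumptions.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CostGuard:
    """Hypothetical per-session guard: flags a turn whose cost jumps past a baseline."""
    spike_factor: float = 3.0            # assumed threshold: 3x the running average
    min_turns: int = 2                   # turns needed before a baseline exists
    fallback_model: str = "cheap-model"  # placeholder model name
    history: list[float] = field(default_factory=list)

    def observe(self, turn_cost_usd: float, current_model: str) -> str:
        """Record a turn's cost and return the model to use for the next turn."""
        baseline = mean(self.history) if len(self.history) >= self.min_turns else None
        self.history.append(turn_cost_usd)
        if baseline is not None and baseline > 0 and turn_cost_usd > self.spike_factor * baseline:
            return self.fallback_model  # intervene: swap to the cheaper model
        return current_model


guard = CostGuard()
model = "big-model"
for cost in (0.02, 0.03, 0.40):  # the third turn is the spike
    model = guard.observe(cost, model)
# After the third turn, `model` is the fallback, so the fourth turn runs cheaper.
```

The difference from a dashboard is where the check runs: inside the agent loop, where it can change what happens on the next turn.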
What makes this different from guardrails: Guardrails check outputs. An active runtime understands agent behavior — loops, deadlocks, context degradation, goal drift, model mismatch. These aren't output problems; they're behavioral problems.
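A loop is a good example of a behavioral problem, because every individual output can pass validation while the agent goes nowhere. The sketch below is a deliberately simple detector built on assumptions: `LoopDetector` is a hypothetical name, and "same tool with the same arguments repeated within a short window" is only one of many possible definitions of a loop.

```python
from collections import deque
from typing import Deque, Tuple


class LoopDetector:
    """Hypothetical detector: flags an agent that keeps repeating the same tool call."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        # Only the most recent `window` steps are kept.
        self.recent: Deque[Tuple[str, str]] = deque(maxlen=window)

    def record(self, tool: str, args_repr: str) -> bool:
        """Record one step; return True if the step now looks like part of a loop."""
        step = (tool, args_repr)
        self.recent.append(step)
        return self.recent.count(step) >= self.max_repeats
```

When `record` returns `True`, a runtime could escalate from a nudge in the prompt to halting the session. That policy is left out of the sketch.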
Based on 2026 surveys and public engineering blogs, here's what a typical production stack looks like:
| Layer | Typical stack (mid-2026) | Role |
|---|---|---|
| IDE/CLI Agent | Cursor + Claude Code | Daily flow + deep architectural work |
| Observability | LangFuse | Traces, evals, prompt management |
| Gateway | Portkey | Routing, fallback |
| Orchestration | LangGraph or CrewAI | Multi-agent flows |
| Guardrails | NeMo + Guardrails AI | Output validation |
| CI/CD | GitHub Actions | |
| Model Access | OpenRouter or LiteLLM | Multi-model routing |
| Sandbox | Docker / E2B / Modal | Safe code execution |
| Active Runtime (emerging) | HarnessForge or MSFT Agent Gov | Real-time detection + intervention |
No single tool wins. The norm is 2–3 tools per layer, chosen based on team size, compliance requirements, and framework preferences.
| Shift | What Happened | What It Means |
|---|---|---|
| AutoGen → maintenance | Last release Sep 2025. Merged into Microsoft Agent Framework | New projects: choose MAF or AG2 community fork |
| Helicone → maintenance | Acquired by Mintlify (Mar 2026) | Migrate to LiteLLM or Portkey for gateway; pair with LangFuse or Phoenix for observability |
| Portkey acquired ($140M) | Palo Alto Networks, April 2026 | AI gateway+security convergence is the next big acquisition category |
| LiteLLM supply-chain attack | Malicious PyPI packages (Mar 2026) | Pin versions. Use Docker images. Verify checksums |
| Claude Code hits $2.5B run-rate | Anthropic's terminal agent driving massive revenue | Terminal-native agents are a real business, not a niche |
| OpenTelemetry standardization | OTel becoming the common trace format | Reduces switching cost. LangFuse + Phoenix both support OTel ingestion |
| MCP becomes universal | All 3 provider SDKs + most frameworks support MCP now | Tool definitions are portable across frameworks for the first time |
| A2A protocol emerging | Google-led cross-vendor agent communication | Agents from different frameworks can discover and talk to each other |
| Per-user pricing wins | Codacy, CodeRabbit, Snyk all per-dev. LOC-based pricing dying | Predictable costs. Easier procurement |
| 30–70% of code is AI-generated | Depending on language and team | AI code governance is becoming a mandatory CI/CD stage |
| Multi-tool stacks are the norm | Most devs use 2–3 AI tools daily | Integration and unified dashboards matter more than single-tool features |
| EU AI Act Article 14 | Comes into force August 2026 | "Human oversight of high-risk AI" creates compliance demand for intervention tools |
Short term (next 6 months):
Medium term (12–18 months):
Long term (2–3 years):
| Layer | Count | Status |
|---|---|---|
| Coding Agents | 14 | Tier 1 consolidating (Cursor, Copilot, Claude Code). OSS tools (Aider, Cline, Continue) gaining fast |
| Observability | 6 + 4 gateway-converged | LangFuse vs LangSmith is the main debate. 1 deprecated (Helicone) |
| Orchestration | 17 | 1 deprecated (AutoGen). Provider SDKs rising. Too many frameworks; consolidation coming |
| Gateways + Guardrails | 6 + 4 | Convergence accelerating. Portkey acquisition validates the space. Supply chain risk real |
| Active Runtime | 3 | New category. No dominant player yet. HarnessForge (MIT, Rust), MSFT Agent Gov, Future AGI Protect |
This is a point-in-time snapshot. The market is moving fast. I'll update this quarterly.
Disclosure: I'm the author of HarnessForge, one of the tools mentioned in the Active Runtime section. Everything else in this survey is based on publicly available data, vendor documentation, and community analysis.
Found a tool I missed? Drop it in the comments.