# The AI Engineering Tools Landscape — Mid-2026

> Source: <https://dev.to/agrawal_83a0b8e9e8b/every-ai-agent-tool-watches-none-of-them-act-harnessforge-changes-that-3190>
> Published: 2026-06-25 09:57:17+00:00

The coding-agent layer now has three tiers. The gap between tier 1 and tier 2 is real, and tier 3 is growing fast.

Tier 1 is the set of commercial tools most professional developers use daily. SWE-bench scores tell only part of the story; price, form factor, and workflow fit matter just as much.

| Tool | Type | Price | SWE-bench | Best For |
|---|---|---|---|---|
| Claude Code | Terminal-native | $20–200/mo (Claude plans) | 87.6% (Opus 4.7) | Terminal-first architectural refactors, 1M context window |
| Cursor | AI-native IDE (VS Code fork) | $20–200/mo | 73.7% (Composer 2) | Best all-in-one agentic IDE, Background Agents (up to 8 parallel) |
| GitHub Copilot | IDE extension + Agent HQ | $10–39/mo | 56% | GitHub-native teams, deepest enterprise governance |
| Windsurf | AI-native IDE (VS Code fork) | $15–200/mo | — | Value-conscious, Cascade agent, EU compliant / FedRAMP certified |

**What changed this year:** Claude Code went from research preview to $2.5B+ run-rate. Cursor crossed 1M paid users. GitHub Copilot switched to credit-based billing (June 2026) and upset a lot of enterprise customers. Windsurf was acquired by Cognition, raising questions about its roadmap independence.

Tier 2 is the open-source tools that serious developers swear by. They trade polish for control.

| Tool | Type | Price | Key Trait |
|---|---|---|---|
| Aider | Terminal CLI, Apache 2.0 | Free + BYO key | Git-native — every edit is a commit. Pairs with any model. 88% SWE-bench with GPT-5.5 under the hood |
| Cline | VS Code extension, Apache 2.0 | Free + BYO key | 5M+ installs. Plan-and-act workflow, native MCP support, full control over every step |
| Continue | VS Code + JetBrains, Apache 2.0 | Free + BYO key | 20+ model providers including local Ollama. Best for offline/air-gapped setups |
| Kilo Code | VS Code + JetBrains + CLI, OSS | Free BYOK or $15/mo Teams | 500+ models from 60+ providers. True model neutrality across IDEs |

**The trend here:** BYOK (bring your own key) is standard now. Opaque SaaS-only subscriptions are dying. Developers want to own their model relationship and swap providers freely.

Tier 3 tools run in the cloud and operate on their own. The value proposition is different: you delegate work instead of pair-programming.

| Tool | Type | Price | Best For |
|---|---|---|---|
| Devin (Cognition) | Cloud autonomous agent | ~$500/mo Team + ACU | Delegate large async backlog tasks, sandboxed VMs |
| Factory | Cloud enterprise agents | Enterprise | Enterprise code generation at scale |
| Bolt.new (StackBlitz) | Browser, instant full-stack | Free / $20–200/mo | Quick prototypes, full-stack apps from prompts |
| Lovable | Browser, visual builder | Free / $20–100/mo | Non-devs building web apps |
| v0 (Vercel) | Browser, UI-focused | Free / $20/mo | React/Next.js component generation |
| Replit Agent | Browser, full-stack | $25/mo | Students, hobbyists, fast iteration loops |

The observability layer is fragmenting into three sub-categories: pure observability, gateway+observability convergence, and legacy tools that are being left behind.

| Tool | License | Self-Host | Pricing Entry | Best For |
|---|---|---|---|---|
| LangFuse | MIT core | ✅ Yes | Free → $29/mo → $199/mo → $2,499/mo enterprise | OSS observability with prompt management, 29K ★. ThoughtWorks "Assess" recommendation |
| LangSmith | Closed (MIT SDK) | Enterprise only | Free → $39/seat/mo | LangChain/LangGraph teams. Deepest graph topology capture |
| Arize Phoenix | ELv2 (source-available) | ✅ Yes | Free → $50/mo AX Pro | OpenTelemetry/OpenInference native. Clean local dev workbench |
| Braintrust | Closed SaaS | ❌ | Free → $249/mo Pro | Best eval UI in the market. Polished, closed platform |
| Weights & Biases | Closed SaaS | ❌ | Free → enterprise | Experiment tracking + LLM evaluation. The ML default |
| Datadog LLM Obs | Closed SaaS | ❌ | APM-based | Existing Datadog shops that want LLM traces in the same dashboard |

**The key tension here:** LangFuse vs LangSmith is becoming the main OSS-vs-closed debate. LangFuse wins on portability and self-hosting; LangSmith wins on LangChain ergonomics. Phoenix has the best OTel story but the ELv2 license is a procurement headache for some enterprises.

A new pattern: tools that handle both routing AND tracing in one stack.

| Tool | License | Key Trait |
|---|---|---|
| Future AGI traceAI | Apache 2.0 | Full-stack: gateway + guardrails + evals + simulation. 14 span kinds, 50+ AI instrumentations |
| Portkey | MIT gateway, closed control plane | Acquired by Palo Alto for $140M (April 2026). 250+ models, governance features, now part of Prisma AIRS |
| LiteLLM | MIT | Most popular OSS proxy. 100+ providers, weighted fallbacks. Pairs with LangFuse or Braintrust for observability |
| OpenLLMetry | Apache 2.0 | DIY OpenTelemetry pipeline. Backend-agnostic. Minimal UI |

| Tool | Status |
|---|---|
| Helicone | Acquired by Mintlify (March 2026) → maintenance mode only. Still works, but no new features. Migration recommended |
| W&B Weave | Superseded by W&B's newer LLM eval platform |
| MLflow (LLM tracing) | Functional but not LLM-native. Better suited for traditional ML workflows |

The orchestration layer has seen the most dramatic change in 2026. One of the Big Three frameworks (AutoGen) is effectively dead, and the provider-native SDKs are maturing fast.

| Framework | Status (June 2026) | License | GitHub ★ | Best For |
|---|---|---|---|---|
| LangGraph | ✅ Active | MIT | ~32K | Explicit state machines, time-travel debugging, human-in-the-loop checkpoints |
| CrewAI | ✅ Active | MIT | ~51K | Role-based crews (researcher, writer, critic). Fastest time-to-first-demo |
| AutoGen | ❌ Maintenance mode | MIT + CC-BY-4.0 | ~58K | Do not start new projects. Last release v0.7.5 (September 2025). Migrate to MAF or AG2 |

**What happened to AutoGen:** Microsoft merged it into **Microsoft Agent Framework (MAF)** — a combined runtime with Semantic Kernel. Python + C# parity, durability, governance features. ~10K ★. The community fork lives on at **AG2** (ag2.ai).

The model providers are building their own agent SDKs, and these are getting good.

| SDK | License | Languages | ★ | Best For |
|---|---|---|---|---|
| OpenAI Agents SDK | Apache 2.0 | Python, TypeScript | ~26K | Cleanest handoff model. Sandboxed execution with workspace snapshots. 3-tier guardrails |
| Google ADK | Apache 2.0 | Python, TS, Java, Go, Kotlin | ~20K | Widest language support. Native A2A protocol. Deploys to Vertex AI Agent Engine |
| Claude Agent SDK | MIT | Python, TypeScript | ~7K | Deepest MCP integration (200+ servers). Built-in file/shell access. Safety-first architecture |

**Key trend:** All three now support MCP. Google is pushing A2A for cross-vendor agent discovery. OpenAI has the best sandbox story. Anthropic has the deepest OS-level tools.

| Framework | Best For |
|---|---|
| PydanticAI | Type-safe structured outputs, Python-native. Built on Pydantic |
| DSPy (Stanford) | Programmatic prompt optimization. Compile prompts from signatures |
| Semantic Kernel (Microsoft) | Enterprise .NET/Python plugin architecture |
| LlamaIndex | RAG-first agents with data connectors |
| Vercel AI SDK | TypeScript streaming + tool use. Frontend-native |
| Mastra | TypeScript agent framework with built-in workflow engine |
| Agno (ex-Phidata) | Lightweight, memory-aware, multi-modal support |
| Bee Agent (IBM) | ReAct patterns, enterprise-grade tool use |
| Haystack (deepset) | NLP pipelines, RAG, agent nodes |
| Atomic Agents | Minimalist, modular — explicitly anti-framework |
| AG2 | Community fork of AutoGen, keeping it alive |

The gateway and guardrail layer has two distinct sub-layers that are increasingly sold together.

| Tool | License | Price | Key Feature |
|---|---|---|---|
| LiteLLM | MIT / BSL 1.1 | Free OSS → $50/mo Cloud | 100+ providers, weighted round-robin, fallback chains |
| Portkey | MIT / Closed CP | Free → $49/mo Prod | 250+ LLMs, governance + guardrails + semantic caching. Now part of Palo Alto Prisma AIRS |
| Kong AI Gateway | Apache 2.0 | Free OSS → Enterprise | Unified API mesh + AI gateway |
| Cloudflare AI Gateway | Closed | Pay-as-you-go | Zero ops, Cloudflare edge ecosystem |
| AWS Bedrock Gateway | AWS-managed | Pay-as-you-go | AWS-native, FedRAMP, HIPAA eligible |
| OpenRouter | Closed | Pay-per-token | 300+ models, single API key, simplest setup |
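
To make "weighted routing with fallback chains" concrete, here is a minimal sketch in plain Python. It is not the API of LiteLLM, Portkey, or any gateway listed above: `Provider`, `route`, and the provider names are hypothetical, and it uses weighted random selection as a simple stand-in for weighted round-robin.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Provider:
    name: str                   # hypothetical label, e.g. "primary"
    weight: float               # relative share of traffic (must be > 0)
    call: Callable[[str], str]  # stand-in for a real client call


def route(prompt: str, providers: Sequence[Provider], rng: random.Random) -> tuple[str, str]:
    """Pick a provider by weight, then fall back through the rest on failure."""
    remaining = list(providers)
    errors = []
    while remaining:
        # Weighted random pick among the providers that have not failed yet.
        chosen = rng.choices(remaining, weights=[p.weight for p in remaining], k=1)[0]
        try:
            return chosen.name, chosen.call(prompt)
        except Exception as exc:  # a real gateway would retry only on retryable errors
            errors.append(f"{chosen.name}: {exc}")
            remaining.remove(chosen)
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The point of the pattern is that the caller sees one function and one result, while provider choice and failure handling live in one place that can be reconfigured without touching agent code.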

**Supply chain alert:** LiteLLM v1.82.7/1.82.8 on PyPI contained credential-stealing malware in March 2026 (TeamPCP attack). Live for ~3 hours. NHS issued a national alert. Official Docker images were unaffected. Pin versions and prefer Docker.
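
As a rough illustration of the "pin versions" advice, the sketch below checks an installed package against a list of flagged versions. The version numbers come from the alert above; the `KNOWN_BAD` table and the function names are assumptions for illustration, not an official advisory feed.

```python
from importlib import metadata

# Versions named in the alert above. A real check would pull from an advisory source.
KNOWN_BAD = {"litellm": {"1.82.7", "1.82.8"}}


def is_flagged(package: str, version: str) -> bool:
    """Return True if this exact version is on the flagged list."""
    return version in KNOWN_BAD.get(package.lower(), set())


def check_installed(package: str) -> str:
    """Report whether the locally installed version of `package` is flagged."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    if is_flagged(package, version):
        return f"{package}=={version}: FLAGGED, reinstall from a pinned, verified release"
    return f"{package}=={version}: not on the flagged list"


if __name__ == "__main__":
    print(check_installed("litellm"))
```

A check like this only catches versions that are already known to be bad. Pinning exact versions with hashes in a lockfile is what prevents a fresh malicious release from being pulled in automatically.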

| Tool | License | Key Feature |
|---|---|---|
| Guardrails AI | MIT | Output validation — PII, toxicity, custom validators. Pairs with any gateway |
| NeMo Guardrails (Nvidia) | Apache 2.0 | Colang DSL for dialog rails. Topical guardrails, fact-checking |
| Microsoft Agent Governance Toolkit | — | Covers 10/10 OWASP Agentic Top 10 (gateways cover 0–1). Governs agent actions, not just LLM outputs |
| Barbacane | — | Security-first AI gateway with guardrail integration |

**Important architectural distinction from Microsoft's own docs:** Guardrails validate LLM **outputs**. Agent governance controls agent **actions** (tool calls, identity, sandboxing, crypto auth). These are complementary, not competing.
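
The distinction is easier to see side by side. In this minimal sketch, `validate_output` plays the guardrail role (it inspects text after generation) and `authorize_action` plays the governance role (it decides before a tool runs). All names, the regex, and the `ALLOWED_TOOLS` policy are hypothetical and are not taken from Guardrails AI, NeMo Guardrails, or Microsoft's toolkit.

```python
import re
from dataclasses import dataclass

# Deliberately simple pattern; real PII detection is much more involved.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def validate_output(text: str) -> list[str]:
    """Guardrail: inspect the model's text after it has been generated."""
    problems = []
    if EMAIL.search(text):
        problems.append("possible email address in output")
    return problems


@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    tool: str
    argument: str


# Hypothetical policy: which agent identity may call which tool.
ALLOWED_TOOLS = {
    "researcher": {"web_search"},
    "deployer": {"web_search", "shell"},
}


def authorize_action(call: ToolCall) -> bool:
    """Governance: decide before the tool runs, based on who is acting."""
    return call.tool in ALLOWED_TOOLS.get(call.agent_id, set())
```

A clean output can still come from a forbidden action, and an allowed action can still produce a bad output, which is why the two checks complement each other.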

There's a pattern visible across all four layers above. Every tool either watches or executes. None of them intervene.

| Layer | What It Does | Examples | Limitation |
|---|---|---|---|
| Coding Agents | Write code | Cursor, Copilot, Aider | No built-in failure detection |
| Observability | Records what happened | LangFuse, Phoenix, Braintrust | Post-hoc only — you read reports after the fact |
| Orchestration | Runs the agent graph | LangGraph, CrewAI, ADK | Executes faithfully even when the agent is failing |
| Gateways | Routes requests | LiteLLM, Portkey, OpenRouter | Sees wire-level but not agent behavior |
| Guardrails | Blocks bad output | Guardrails AI, NeMo | Validates text, doesn't understand agent loops/deadlocks/hallucination patterns |

The missing layer: something that watches the agent *in real time*, detects when it's going off the rails, and *intervenes autonomously*.

A few projects are starting to fill this gap:

| Project | Language | License | Approach |
|---|---|---|---|
| HarnessForge | Rust (PyO3 + NAPI-RS bindings) | MIT | Open-core SDK. 12 health observers, 16 detectors (loop, staleness, cost anomaly, secret leak, etc.), 14 intervention strategies (nudge → circuit-break). Two-level: session harness + meta-harness that improves its own rules across sessions |
| Microsoft Agent Governance Toolkit | Python | — | Governs agent actions, identity, sandboxing. Covers the full OWASP Agentic Top 10. Focused on enterprise policy enforcement |
| Future AGI Protect | Python/TS | Apache 2.0 | Guardrails-as-a-platform with real-time detection. Part of the Future AGI unified stack |

**What makes this different from observability:** Observability tells you "cost spiked at 2:34 PM." An active runtime detects the spike at turn 3 and swaps the model, so the overspend is cut off while the session is still running.
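
A minimal sketch of that idea, assuming a per-turn cost signal is available. The `CostWatch` class, the 3x threshold, and the model names are all hypothetical; this is not how HarnessForge or any other listed tool implements cost anomaly detection.

```python
from dataclasses import dataclass, field


@dataclass
class CostWatch:
    spike_factor: float = 3.0            # assumed threshold: 3x the running mean
    model: str = "large-model"           # hypothetical model names
    fallback_model: str = "small-model"
    history: list[float] = field(default_factory=list)

    def record_turn(self, cost_usd: float) -> str:
        """Record one turn's cost and return the model to use for the next turn."""
        if self.history:
            baseline = sum(self.history) / len(self.history)
            # Compare against the mean of earlier turns, before adding this one.
            if baseline > 0 and cost_usd > self.spike_factor * baseline:
                self.model = self.fallback_model
        self.history.append(cost_usd)
        return self.model


watch = CostWatch()
watch.record_turn(0.02)         # first turn, no baseline yet
watch.record_turn(0.03)         # within 3x of the 0.02 baseline
print(watch.record_turn(0.40))  # 0.40 > 3 * 0.025, so this prints "small-model"
```

The difference from a dashboard is where the check runs: inside the agent loop, between turns, where its result can change what happens next.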

**What makes this different from guardrails:** Guardrails check outputs. An active runtime understands agent behavior — loops, deadlocks, context degradation, goal drift, model mismatch. These aren't output problems; they're behavioral problems.
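
Loop detection is the simplest behavioral check to sketch. The example below counts identical tool calls in a sliding window and escalates from a nudge to a circuit break. The class name, window size, and thresholds are assumptions for illustration and do not reflect any listed project's actual detectors.

```python
from collections import deque


class LoopDetector:
    def __init__(self, window: int = 6, nudge_at: int = 3, break_at: int = 5):
        self.recent = deque(maxlen=window)  # sliding window of (tool, args) pairs
        self.nudge_at = nudge_at
        self.break_at = break_at

    def observe(self, tool: str, args: str) -> str:
        """Record one tool call and return the intervention to apply."""
        self.recent.append((tool, args))
        # Count how often this exact call appears in the window, including now.
        repeats = sum(1 for call in self.recent if call == (tool, args))
        if repeats >= self.break_at:
            return "circuit_break"  # stop the session
        if repeats >= self.nudge_at:
            return "nudge"          # e.g. inject a hint that the agent is repeating itself
        return "none"
```

Every individual call in such a loop can look perfectly valid to an output validator. The failure only becomes visible across turns, which is the sense in which it is a behavioral problem.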

Based on 2026 surveys and public engineering blogs, here's what a typical production stack looks like:

```
┌──────────────────────────────────────────────────────────────┐
│ TYPICAL PRODUCTION STACK (Mid-2026)                          │
│                                                              │
│  IDE/CLI Agent          Observability          Gateway       │
│  ─────────────          ─────────────          ───────       │
│  Cursor + Claude Code   LangFuse               Portkey       │
│  (daily flow + deep     (traces, evals,        (routing,     │
│   architectural work)    prompt management)     fallback)     │
│                                                              │
│  Orchestration          Guardrails              CI/CD        │
│  ─────────────          ──────────              ─────        │
│  LangGraph or CrewAI    NeMo + Guardrails AI    GitHub       │
│  (multi-agent flows)    (output validation)      Actions      │
│                                                              │
│  Model Access           Sandbox                              │
│  ────────────           ───────                              │
│  OpenRouter or LiteLLM  Docker / E2B / Modal                 │
│  (multi-model routing)  (safe code execution)                │
│                                                              │
│  Active Runtime (emerging)                                   │
│  ─────────────────────────                                   │
│  HarnessForge or MSFT Agent Gov                              │
│  (real-time detection + intervention)                        │
└──────────────────────────────────────────────────────────────┘
```

**No single tool wins.** The norm is 2–3 tools per layer, chosen based on team size, compliance requirements, and framework preferences.

| Shift | What Happened | What It Means |
|---|---|---|
| AutoGen → maintenance | Last release Sep 2025. Merged into Microsoft Agent Framework | New projects: choose MAF or AG2 community fork |
| Helicone → maintenance | Acquired by Mintlify (Mar 2026) | Migrate to LiteLLM or Portkey for gateway; pair with LangFuse or Phoenix for observability |
| Portkey acquired ($140M) | Palo Alto Networks, April 2026 | AI gateway+security convergence is the next big acquisition category |
| LiteLLM supply-chain attack | Malicious PyPI packages (Mar 2026) | Pin versions. Use Docker images. Verify checksums |
| Claude Code hits $2.5B run-rate | Anthropic's terminal agent driving massive revenue | Terminal-native agents are a real business, not a niche |
| OpenTelemetry standardization | OTel becoming the common trace format | Reduces switching cost. LangFuse + Phoenix both support OTel ingestion |
| MCP becomes universal | All 3 provider SDKs + most frameworks support MCP now | Tool definitions are portable across frameworks for the first time |
| A2A protocol emerging | Google-led cross-vendor agent communication | Agents from different frameworks can discover and talk to each other |
| Per-user pricing wins | Codacy, CodeRabbit, Snyk all per-dev. LOC-based pricing dying | Predictable costs. Easier procurement |
| 30-70% of code is AI-generated | Depending on language and team | AI code governance is becoming a mandatory CI/CD stage |
| Multi-tool stacks are the norm | Most devs use 2–3 AI tools daily | Integration and unified dashboards matter more than single-tool features |
| EU AI Act Article 14 | Comes into force August 2026 | "Human oversight of high-risk AI": creates compliance demand for intervention tools |
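
To show what "portable tool definitions" means in practice, here is a small sketch that reshapes one MCP-style tool definition into two provider tool formats. The field names (`inputSchema` for MCP, `function.parameters` for OpenAI Chat Completions style tools, `input_schema` for Anthropic tools) reflect my understanding of the public docs and should be checked against current versions. The `read_file` tool itself is a made-up example.

```python
# A made-up tool, described once in MCP style: name, description, JSON Schema input.
EXAMPLE_TOOL = {
    "name": "read_file",
    "description": "Read a file from the workspace",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}


def mcp_to_openai(tool: dict) -> dict:
    """Reshape an MCP-style tool into an OpenAI Chat Completions style tool entry."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["inputSchema"],
        },
    }


def mcp_to_anthropic(tool: dict) -> dict:
    """Reshape an MCP-style tool into an Anthropic Messages style tool entry."""
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "input_schema": tool["inputSchema"],
    }
```

The JSON Schema body passes through unchanged in both directions, which is why the conversion is a few lines of reshaping and not a rewrite.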

**Short term (next 6 months):**

**Medium term (12–18 months):**

**Long term (2–3 years):**

| Layer | Count | Status |
|---|---|---|
| Coding Agents | 14 | Tier 1 consolidating (Cursor, Copilot, Claude Code). OSS tools (Aider, Cline, Continue) gaining fast |
| Observability | 6 + 4 gateway-converged | LangFuse vs LangSmith is the main debate. 1 deprecated (Helicone) |
| Orchestration | 14 | 1 deprecated (AutoGen). Provider SDKs rising. Too many frameworks; consolidation coming |
| Gateways + Guardrails | 6 + 4 | Convergence accelerating. Portkey acquisition validates the space. Supply chain risk real |
| Active Runtime | 3 | New category. No dominant player yet. HarnessForge (MIT, Rust), MSFT Agent Gov, Future AGI Protect |

*This is a point-in-time snapshot. The market is moving fast. I'll update this quarterly.*

*Disclosure: I'm the author of HarnessForge, one of the tools mentioned in the Active Runtime section. Everything else in this survey is based on publicly available data, vendor documentation, and community analysis.*

**Found a tool I missed? Drop it in the comments.**