Everyone is building “AI agents.” Very few are building ones that survive contact with reality. Most teams run into the same wall: what works in a demo breaks quickly in production.
Most agents today are still fragile compositions: a strong model wrapped around weak tools, incomplete data, fragmented context, and nonexistent governance. They succeed in curated demos and fail in production environments where ambiguity, partial observability, and system constraints dominate.
The core misunderstanding is this: teams over-index on model intelligence and under-invest in system quality.
At scale, agent performance is not primarily a function of the model. It is a function of three coupled systems:
- Data quality (what the model learns and reasons over)
- API quality (what the agent can reliably do)
- Execution quality (how an agent’s decisions are validated, observed, and controlled)
At Postman, we treat agents not as chat interfaces but as distributed systems operating over APIs. Through that lens, the problem becomes clearer and solvable.
The real shift: from intelligence to reliability #
The industry narrative has been: better models → better agents.
What we are seeing in practice is different:
- Marginal gains from model improvements are diminishing in production settings.
- Variance in outcomes is dominated by tool reliability and data ambiguity.
- Most failures are not reasoning failures. They are interface failures.
An agent rarely fails because it “cannot think.” It fails because:
- The API schema is underspecified or inconsistent.
- The data returned is incomplete, stale, or ambiguous.
- The system lacks guardrails to validate or correct actions.
This reframes the problem: building agents is a systems engineering challenge, not just an AI problem.
Three converging frontiers #
Three major shifts are colliding to make this moment unique.
Agents are becoming execution engines, not assistants. Agents are no longer suggesting actions. They are taking them. This introduces hard requirements around correctness, reversibility, and auditability. Planning is easy; safe execution is not.
Data quality is now a first-class bottleneck. Today, the best-performing agent systems are not those with the largest models, but those with the cleanest, most structured, and most semantically rich data. Poor data creates compounding errors across multi-step reasoning chains.
APIs are becoming the control plane for intelligence. APIs are no longer just integration points; they define the action space of agents. If data is the “training substrate,” APIs are the “execution substrate.”
The implication: you cannot decouple model quality from API and data quality. They form a single system.
A new mental model: the agent reliability stack #
To build production-grade agents, we think in terms of a layered system:
- Data layer: structured, labeled, versioned, and observable data
- Interface layer: APIs that are deterministic, typed, and discoverable
- Reasoning layer: models that plan and adapt under uncertainty
- Execution layer: workflows that validate, monitor, and constrain actions
- Governance layer: policies, auditability, and human oversight
Most teams overinvest in the reasoning layer and underinvest everywhere else.
That imbalance is why agents fail.
APIs are not integrations. They are policy surfaces #
The industry still treats APIs as passive endpoints. For agents, APIs must become active contracts.
An “agent-ready” API is not just documented. It is:
- Semantically explicit: clear intent, constraints, and edge cases
- Machine-interpretable: strongly typed inputs and outputs with examples
- Deterministic where possible: minimizing hidden side effects
- Observable: every call produces traceable, inspectable outputs
- Governed: access, rate limits, and policies are enforced consistently
Protocols like the Model Context Protocol (MCP) are an example of formalizing this contract. Instead of forcing models to infer intent from prose, we expose structured capabilities directly.
But the deeper shift is conceptual: APIs are no longer just for developers. They are for autonomous systems. That changes how they must be designed.
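One way to picture “strongly typed inputs” and “semantically explicit” is a tool contract that rejects a malformed call before it reaches the API. The sketch below is illustrative only: `ToolContract`, `Param`, and the `refund_order` tool are hypothetical names invented for this example, not part of Postman or MCP.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Param:
    name: str
    type_: type
    description: str          # intent and constraints, written for a machine reader
    required: bool = True


@dataclass(frozen=True)
class ToolContract:
    name: str
    description: str
    params: tuple[Param, ...]

    def validate(self, args: dict[str, Any]) -> list[str]:
        """Return a list of problems; an empty list means the call is well-formed."""
        problems = []
        known = {p.name for p in self.params}
        for extra in sorted(set(args) - known):
            problems.append(f"unknown parameter: {extra}")
        for p in self.params:
            if p.name not in args:
                if p.required:
                    problems.append(f"missing required parameter: {p.name}")
                continue
            value = args[p.name]
            # bool is a subclass of int in Python, so reject it explicitly for int params.
            wrong_type = not isinstance(value, p.type_) or (
                p.type_ is int and isinstance(value, bool)
            )
            if wrong_type:
                problems.append(
                    f"{p.name}: expected {p.type_.__name__}, got {type(value).__name__}"
                )
        return problems


# Hypothetical tool definition used only for illustration.
REFUND_ORDER = ToolContract(
    name="refund_order",
    description="Refund a single order. Treat as irreversible.",
    params=(
        Param("order_id", str, "Identifier of the order to refund."),
        Param("amount_cents", int, "Refund amount in cents; must not exceed the order total."),
        Param("reason", str, "Free-text reason for the refund.", required=False),
    ),
)
```

Calling `REFUND_ORDER.validate(...)` with a string where an integer is expected returns a structured problem list that the agent can act on, rather than letting an ambiguous request reach the underlying service.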
Data and API quality is the hidden multiplier #
If API quality defines what an agent can do, data quality defines whether it can do it correctly. In multi-step agent workflows, errors compound geometrically. A single ambiguous field or missing constraint can propagate through planning, tool selection, and execution.
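To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes each step in a workflow succeeds independently with the same probability, which is a simplification introduced for this example; the per-step figures are illustrative, not measurements.

```python
def chain_success(p_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent, identical steps."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return p_step ** steps


if __name__ == "__main__":
    # Illustrative per-step reliabilities across chains of different lengths.
    for p in (0.99, 0.95, 0.90):
        row = [round(chain_success(p, n), 3) for n in (1, 5, 10, 20)]
        print(f"per-step {p:.2f}: {row}")
```

Under these assumptions, a step that works 95% of the time yields a ten-step chain that completes only about 60% of the time, which is why a single ambiguous field early in a workflow matters so much.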
High-performing agent systems share common data characteristics:
- Canonical schemas across services (no semantic drift)
- Rich metadata and descriptions (not just field names)
- Versioned datasets with lineage tracking
- Explicit handling of uncertainty (nulls, ranges, confidence)
- Realistic examples that reflect production edge cases
One useful way to think about this:
Training quality determines what an agent knows. Data and API quality determine whether it knows what’s true right now.
Without the latter, even perfect reasoning fails.
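The “explicit handling of uncertainty” characteristic listed above can be sketched as a record that carries its own freshness and confidence, so an agent can tell “unknown” apart from “zero.” The `Observation` type, its fields, and the thresholds are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Observation:
    value: Optional[float]   # None means "unknown", never a silent 0
    unit: str
    confidence: float        # 0.0 to 1.0, hypothetical score supplied by the source
    observed_at: datetime    # when the value was true, not when it was read
    schema_version: str      # lets consumers detect semantic drift


def usable(
    obs: Observation,
    now: datetime,
    max_age: timedelta,
    min_confidence: float,
) -> tuple[bool, str]:
    """Decide whether an agent should act on this value, and say why not."""
    if obs.value is None:
        return False, "value unknown"
    if now - obs.observed_at > max_age:
        return False, "stale"
    if obs.confidence < min_confidence:
        return False, "low confidence"
    return True, "ok"
```

Returning a reason alongside the verdict gives the agent something structured to reason over, for example refreshing stale data instead of guessing.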
Agents inside systems, not beside them #
Agents should not live in chat windows. They should live inside execution paths.
The most effective deployments embed agents directly into:
- CI/CD pipelines (test generation, regression detection)
- Monitoring systems (incident triage, anomaly explanation)
- API workflows (validation, transformation, orchestration)
- Governance layers (policy checks, compliance enforcement)
This eliminates the “copy-paste gap” between insight and action.
In Postman, this shows up as agents operating directly on collections, tests, and flows, not as separate conversational artifacts. The agent is not an interface; it is a capability embedded in the system.
Governance is not optional—it is the system #
Autonomous execution without governance is just automated risk.
Production agents must support:
- Full audit trails of decisions and actions
- Deterministic replay for debugging
- Policy enforcement before execution
- Scoped access to data and APIs
- Human approval for high-impact changes
The key insight here is that governance is not a constraint on agents. It is what makes them usable. Teams that skip this step inevitably roll back deployments after the first serious incident.
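As a rough illustration of “policy enforcement before execution,” scoped access, and an audit trail working together, consider the sketch below. `Gate`, `Action`, and the decision strings are hypothetical names for this example; a real system would also need durable log storage and deterministic replay, which are not shown.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Action:
    tool: str
    args: dict
    high_impact: bool = False


@dataclass
class Gate:
    allowed_tools: frozenset                        # scoped access: tools this agent may call
    audit_log: list = field(default_factory=list)   # append-only JSON records

    def run(
        self,
        action: Action,
        execute: Callable[[Action], Any],
        approved_by: Optional[str] = None,
    ) -> tuple:
        # Policy is evaluated before anything executes.
        if action.tool not in self.allowed_tools:
            decision = "denied:out_of_scope"
        elif action.high_impact and approved_by is None:
            decision = "pending:needs_approval"
        else:
            decision = "executed"
        result = execute(action) if decision == "executed" else None
        # Every decision is recorded, including the ones that did not run.
        self.audit_log.append(
            json.dumps(
                {
                    "tool": action.tool,
                    "args": action.args,
                    "decision": decision,
                    "approved_by": approved_by,
                },
                sort_keys=True,
                default=str,
            )
        )
        return decision, result
```

The design choice worth noting is that denied and pending actions are logged just like executed ones, so the audit trail describes what the agent tried to do and not only what it did.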
Human-in-the-loop (HITL) is a design primitive #
There is a persistent idea that the goal is full autonomy.
In practice, the most robust systems follow a different pattern:
- Humans define intent and constraints
- Agents perform structured execution
- Systems validate outcomes
- Humans approve or override when needed
This is not a temporary compromise. It is a stable architecture.
Fully autonomous agents are brittle because real-world environments are underspecified and constantly changing. Human oversight provides the adaptive layer that models cannot reliably replicate.
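The four-step pattern above can be sketched as a single control loop. Everything named here (`Intent`, `max_changed_endpoints`, `check_constraints`, `run_with_oversight`) is hypothetical and chosen for illustration; the agent and the human are passed in as plain callables so the flow is easy to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    goal: str
    max_changed_endpoints: int  # hypothetical constraint set by a human


def check_constraints(intent: Intent, proposal: dict) -> list:
    """System-side validation: compare the agent's proposal to the stated constraints."""
    changed = proposal.get("changed_endpoints", [])
    if len(changed) > intent.max_changed_endpoints:
        return [
            f"{len(changed)} endpoints changed, limit is {intent.max_changed_endpoints}"
        ]
    return []


def run_with_oversight(intent, agent, validate, ask_human):
    proposal = agent(intent)                 # agent performs structured execution
    violations = validate(intent, proposal)  # system validates the outcome
    if not violations:
        return "auto_accepted", proposal
    # A human is consulted only when validation flags something.
    approved = ask_human(proposal, violations)
    return ("human_approved" if approved else "human_overridden"), proposal
```

In this sketch the human is not in every loop iteration, only on the paths where the system cannot vouch for the outcome, which keeps oversight affordable as volume grows.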
What actually works in practice #
Across teams successfully deploying agents at scale, a few patterns consistently emerge:
- Start with one high-quality workflow, not a general agent. Postman Agent Mode that generates and validates API tests with strict schemas outperforms a general “API assistant.”
- Treat API improvement as agent optimization. Fixing inconsistent parameter naming often yields larger gains than switching models.
- Evaluate agents on system metrics, not prompts. Latency, success rate, rollback frequency, and error propagation matter more than benchmark scores.
- Build feedback loops into execution. Every failure should produce structured signals that improve both data and APIs.
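Evaluating on system metrics can start very simply. The sketch below summarizes a batch of execution records into success rate, rollback rate, and a nearest-rank p95 latency; the `Run` record and its fields are assumptions for this example, and error propagation across steps is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    latency_ms: float
    succeeded: bool
    rolled_back: bool


def summarize(runs: list) -> dict:
    """Aggregate execution records into system-level metrics."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    latencies = sorted(r.latency_ms for r in runs)
    # Nearest-rank percentile: the smallest value with at least 95% of runs at or below it.
    p95 = latencies[max(0, math.ceil(0.95 * n) - 1)]
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "rollback_rate": sum(r.rolled_back for r in runs) / n,
        "p95_latency_ms": p95,
    }
```

Tracking these per workflow, rather than per prompt, makes regressions visible when an API or dataset changes even if the model and prompts stay the same.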
Where this is going #
The next phase of agent systems will not be defined by larger models, but by tighter integration between data, APIs, and execution environments.
We are moving toward:
- Self-healing systems where agents detect and propose fixes for API and data issues
- Continuous evaluation pipelines that measure agent reliability in production
- Cross-agent coordination through shared, governed tool ecosystems
- Standardized capability interfaces that make tools universally discoverable
In this world, we at Postman are defining and building control planes for agent ecosystems where APIs are defined, discovered, governed, and executed safely by both humans and machines.
The practical takeaway #
If you are building agents today, the highest-leverage work is not prompt engineering or model selection. It is:
- Cleaning and structuring your data
- Making your APIs explicit, consistent, and machine-readable
- Adding observability and governance to every execution path
A simple test:
If a new engineer cannot reliably use your API from its specification alone, neither can an agent. And if an agent cannot use your API reliably, no model will fix that.
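A crude, automatable version of that test is to lint the specification itself. The sketch below checks a single operation, represented as a plain dictionary shaped loosely like an OpenAPI operation with `parameters` entries; the function name and the specific rules are assumptions for this example, not an official linter.

```python
def lint_operation(op: dict) -> list:
    """Flag gaps that would force a reader, human or agent, to guess."""
    findings = []
    if not op.get("description", "").strip():
        findings.append("operation: missing description")
    for param in op.get("parameters", []):
        name = param.get("name", "<unnamed>")
        if not param.get("description", "").strip():
            findings.append(f"{name}: missing description")
        if "type" not in param.get("schema", {}):
            findings.append(f"{name}: missing schema type")
    return findings


if __name__ == "__main__":
    # Hypothetical operation with an undocumented, untyped parameter.
    operation = {
        "description": "List orders for a customer.",
        "parameters": [
            {"name": "customer_id", "description": "Customer identifier.",
             "schema": {"type": "string"}},
            {"name": "flag"},
        ],
    }
    for finding in lint_operation(operation):
        print(finding)
```

An empty findings list does not prove an API is agent-ready, but a non-empty one is a cheap early signal that neither a new engineer nor an agent has enough to go on.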