The Seven Capabilities Every Agent Harness Must Provide

Everyone is building AI agents.

From customer support copilots and sales assistants to autonomous research agents, enterprises are moving beyond simple chatbots toward systems that can reason, plan, use tools, access enterprise data, and execute real business workflows. Sales agents prepare account strategies. Customer success agents flag risks and recommend actions. Research agents synthesize across knowledge sources. Workflow agents automate complex operations.

The excitement is understandable.

Yet most organizations discover the same thing once agents move from prototype to production: building an agent is relatively easy. Operating one reliably at scale is not.

An agent that completes a task flawlessly in a demo can fail unpredictably in production. It invokes the wrong tool, retrieves stale information, enters reasoning loops, exceeds its cost budget, or produces an output that violates policy.

The challenge is not the agent itself. The challenge is the infrastructure surrounding it.

This is where an Agent Harness becomes essential — and the cleanest way to think about it is borrowed from infrastructure that already solved a structurally identical problem. Kubernetes separates a control plane that governs from the workloads it schedules. An Agent Harness is the control plane for agents. The agent is the workload.

That reframing matters because of a pattern worth naming: teams keep building identity, governance, memory, routing, evaluation, observability, and oversight into the agent, as features of its code. They are not features of the agent. They are properties of the system around it — and buried in agent code, they do not survive contact with production. The seven capabilities below are the services the control plane must provide instead, none of which an agent can safely provide for itself.

It helps to see where the harness sits relative to the work most teams have already done. Prompt engineering tunes what you say to the model. Context engineering tunes what the model can see at runtime. Harness engineering governs what the agent is allowed to do — and that outer layer is where reliability is actually won.

This is no longer a thought experiment. By recent industry surveys, 57% of companies already run AI agents in production. And the failure rate is sobering: Gartner expects more than 40% of agentic AI projects to be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.

The reliability gap is measurable. On τ-bench, the leading benchmark for tool-and-policy customer-service agents, even strong function-calling models succeed on well under half of airline tasks, and consistency collapses when you run the same task repeatedly: pass^8, the share of tasks solved in all eight of eight trials, falls to roughly 25% in retail. Translating that: an agent that looks like it works 60% of the time may complete a multi-step task reliably only a quarter of the time. The intelligence is fine. The operational substrate is missing.

[Figure: single-run scores hide how unreliable agents are across repeated trials. Source: τ-bench.]
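To make the gap between single-run and repeated-run reliability concrete, here is a minimal sketch of the pass^k estimate as τ-bench describes it: for a task with c successes in n trials, the chance that k sampled trials all succeed is C(c, k) / C(n, k), averaged over tasks. The per-task success counts below are invented for illustration, not benchmark results.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials sampled from `trials` all succeed."""
    if k > trials:
        raise ValueError("k cannot exceed the number of trials")
    # math.comb returns 0 when k > successes, so the estimate is 0 in that case.
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(per_task_successes: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over every task in the benchmark."""
    scores = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(scores) / len(scores)

# Hypothetical data: 5 tasks, 8 trials each, successful trials per task.
successes = [8, 8, 6, 3, 0]
for k in (1, 4, 8):
    print(k, round(benchmark_pass_hat_k(successes, trials=8, k=k), 3))
```

With these made-up counts the average single-run success rate is 62.5%, while only 40% of tasks succeed in all eight trials. The single-run number is the one that shows up in a demo.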

The principle to hold onto: a capability the agent can disable, forget, or route around is not a control. Each of the seven below belongs in the plane the agent cannot reach.

The first question is the oldest one in security, asked about a new kind of actor: what is this agent allowed to touch?

Agents are a genuinely new identity class. They are not human users with sessions, and they are not static service accounts. They spin up on demand, call dozens of tools in a single task, and run unattended. In most large enterprises, non-human identities already outnumber human ones by 40-to-1, and agents are the fastest-growing slice of that population.

The wrong answer is a long-lived API key embedded in the agent. The scale of that mistake is visible in the data: more than 28 million secrets leaked to public GitHub in 2025, with leaks in AI-assisted code running at roughly twice the baseline rate. A static credential inside an autonomous, fast-moving workload is a breach waiting for a destination.

The harness should instead issue short-lived, task-scoped credentials at the moment of access and revoke them when the task ends, so the agent never stores a secret at all. OAuth 2.1 is now mandated by the Model Context Protocol spec for remote tool access, which gives this a real standard to build on. The most useful pattern emerging this year is blended identity: every policy decision evaluates both the agent’s identity and the identity of the human who invoked it, simultaneously. Read-only documentation access for a research agent; scoped write access for a workflow agent; never raw financial records for a customer-facing one — and a complete, replayable audit trail of who asked for what, through which agent.

The structural reason this cannot live in the agent: OAuth was built to validate individual requests, but agents produce sequences of requests. Authority has to be governed across the whole chain, which only a layer outside the agent can see.
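A minimal sketch of what short-lived, task-scoped, blended-identity credentials might look like inside a harness. Everything here is hypothetical: `CredentialBroker`, the scope strings, and the five-minute default are illustrative names and values, not part of OAuth 2.1 or the Model Context Protocol, and a real deployment would delegate token issuance to an actual authorization server.

```python
import secrets
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskCredential:
    token: str
    agent_id: str
    user_id: str
    scopes: frozenset
    expires_at: float

class CredentialBroker:
    """Hypothetical harness service; the agent never holds a long-lived secret."""

    def __init__(self, agent_scopes, user_scopes, ttl_seconds=300.0):
        self._agent_scopes = agent_scopes
        self._user_scopes = user_scopes
        self._ttl = ttl_seconds
        self._live = {}
        self.audit_log = []

    def issue(self, agent_id, user_id, requested, now=None):
        now = time.time() if now is None else now
        # Blended identity: grant only scopes held by BOTH the agent and the human.
        allowed = self._agent_scopes.get(agent_id, set()) & self._user_scopes.get(user_id, set())
        granted = frozenset(requested) & allowed
        cred = TaskCredential(secrets.token_urlsafe(16), agent_id, user_id, granted, now + self._ttl)
        self._live[cred.token] = cred
        self.audit_log.append(("issue", agent_id, user_id, sorted(granted)))
        return cred

    def authorize(self, token, scope, now=None):
        now = time.time() if now is None else now
        cred = self._live.get(token)
        ok = cred is not None and now < cred.expires_at and scope in cred.scopes
        self.audit_log.append(("check", token[:6], scope, ok))
        return ok

    def revoke(self, token):
        # Called by the harness when the task ends.
        self._live.pop(token, None)

broker = CredentialBroker(
    agent_scopes={"research-agent": {"docs:read"}},
    user_scopes={"alice": {"docs:read", "tickets:write"}},
)
cred = broker.issue("research-agent", "alice", {"docs:read", "tickets:write"}, now=0.0)
```

Because the broker sees every check in sequence, it is also the natural place to reason about a chain of requests rather than one request at a time.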

Tools are where agents stop talking and start acting — querying databases, filing tickets, calling APIs, moving money. That is also where the blast radius lives.

The governing distinction is between capability and authorization. An agent that can create a support ticket should not automatically be able to close one. An agent that can read a customer record should not silently be able to modify it. When tool permissions are implicit in “the agent has this function available,” that distinction disappears.

A harness turns tool use from an open capability into a governed process by mediating five things: which tools exist for this agent, under what conditions each may fire, what parameter values are allowed, which calls require approval, and how every invocation is recorded. The industry now treats controlled, explicit, auditable tool access as a defining AgentOps practice rather than a nice-to-have. OWASP’s Top 10 for Agentic Applications, published for 2026, centers precisely on the risks of autonomous systems that plan and act across tools: formal recognition that ungoverned tool access is a named, ranked threat class.

The point is not to slow the agent down. It is to make the answer to “could this agent have done that?” knowable in advance.
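The five mediation points can be sketched as a small gateway that every tool call passes through. This is an illustration under assumptions, not a reference design: `ToolGateway`, `ToolPolicy`, the tool names, and the refund thresholds are all invented, and the "conditions" and "parameter values" checks are folded into one predicate for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable

Params = dict[str, Any]

@dataclass
class ToolPolicy:
    allowed_agents: set                                        # which agents may see this tool
    params_ok: Callable[[Params], bool] = lambda p: True       # conditions and parameter bounds
    needs_approval: Callable[[Params], bool] = lambda p: False # calls that require a human

class ToolGateway:
    """Hypothetical harness component; agents never call tools directly."""

    def __init__(self, policies):
        self._policies = policies
        self.audit_log = []

    def call(self, agent_id, tool_name, params, tool_fn, approved=False):
        policy = self._policies.get(tool_name)
        if policy is None or agent_id not in policy.allowed_agents:
            decision = "denied: tool not granted to this agent"
        elif not policy.params_ok(params):
            decision = "denied: parameters out of bounds"
        elif policy.needs_approval(params) and not approved:
            decision = "held: approval required"
        else:
            decision = "allowed"
        # Every invocation is recorded, including the ones that never ran.
        self.audit_log.append((agent_id, tool_name, dict(params), decision))
        result = tool_fn(**params) if decision == "allowed" else None
        return decision, result

gateway = ToolGateway({
    "create_ticket": ToolPolicy(allowed_agents={"support-agent"}),
    "issue_refund": ToolPolicy(
        allowed_agents={"support-agent"},
        params_ok=lambda p: 0 < p["amount"] <= 500,
        needs_approval=lambda p: p["amount"] > 100,
    ),
})
```

Note that `close_ticket` has no policy entry at all, so a support agent that can create tickets is denied when it tries to close one: capability and authorization stay separate.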

Memory, not raw model capability, has quietly become the limiting factor in agent performance. Users expect agents to remember prior conversations, preferences, and project history; long-horizon tasks demand it. But memory left to each agent to implement produces drift, contradiction, and quiet privacy violations.

The field has converged on a layered model, drawn from the CoALA framework (working memory plus long-term episodic, semantic, and procedural memory) and now baked into tools like LangGraph, Mem0, and Zep.

Why the harness, not the agent? Because the hard parts of memory are policies, not storage: write rules that decide what is worth keeping, provenance-aware retrieval so you know where a “fact” came from, intelligent forgetting, conflict resolution between contradictory memories, and retention limits that satisfy regulators. An agent optimizing for the next response has every incentive to remember too much. Data-retention law says otherwise. Only a layer above the agent can hold that line consistently.
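A minimal sketch of memory as policy rather than storage, assuming a hypothetical harness-side `GovernedMemory` class. The category names, retention periods, and the "newer observation wins" conflict rule are placeholders chosen for illustration; real retention limits come from your legal and compliance requirements.

```python
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    value: str
    source: str        # provenance: where this "fact" came from
    written_at: float
    expires_at: float

class GovernedMemory:
    """Hypothetical harness-side store: policy first, storage second."""

    def __init__(self, retention_seconds):
        self._retention = retention_seconds   # category -> seconds to keep
        self._records = {}

    def write(self, key, value, source, category, now):
        # Write rule: refuse anything without provenance or a retention category.
        if not source or category not in self._retention:
            return False
        existing = self._records.get(key)
        # Conflict resolution (deliberately simple): the newer observation wins.
        if existing is not None and existing.written_at > now:
            return False
        self._records[key] = MemoryRecord(value, source, now, now + self._retention[category])
        return True

    def read(self, key, now):
        record = self._records.get(key)
        if record is None:
            return None
        if now >= record.expires_at:
            del self._records[key]   # forgetting is enforced, not left to the agent
            return None
        return record.value, record.source

memory = GovernedMemory({"preference": 86_400.0, "support_chat": 3_600.0})
memory.write("alice.language", "German", source="crm:profile", category="preference", now=0.0)
```

The agent only ever calls `write` and `read`; it has no way to extend a retention period or store a fact without saying where it came from.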

Mature deployments are rarely one agent. They are fleets — research, analytics, sales, support, compliance, workflow — and the central question becomes which agent, or which model, handles this request?

The dominant production pattern is the supervisor (orchestrator-worker): one coordinator classifies a request, decomposes it, dispatches sub-tasks to specialists, and assembles the result. Industry estimates put this at roughly 70% of production multi-agent deployments, and it is the reference design at companies like Stripe and Mercury. Done well, it is not just organizational tidiness — it is cost control. Routing simple queries to cheap models and reserving frontier models for hard ones cuts inference cost by 30–60%. Wells Fargo uses orchestrated routing to put 1,700 procedures in front of 35,000 bankers in about 30 seconds, down from ten minutes.

But routing belongs in the harness for a sharper reason than efficiency: it is a single point of failure that compounds. The orchestrator misclassifies, the wrong worker runs, and at scale those errors multiply. A three-level hierarchy with a two-second model call at each tier adds six seconds of pure coordination latency before any real work starts. These are system properties — failover, per-branch budget caps, cycle detection, latency budgets — that no individual agent can manage, because no individual agent can see the whole graph.
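The system-level properties named above can be sketched in a few lines. This is a toy under stated assumptions: `Dispatcher`, the model tier names, the difficulty threshold, and the dollar figures are all hypothetical, and a real router would estimate difficulty and cost rather than take them as arguments.

```python
from dataclasses import dataclass, field

class RoutingError(Exception):
    pass

def coordination_latency(levels: int, seconds_per_call: float) -> float:
    """Sequential coordination overhead before any worker starts real work."""
    return levels * seconds_per_call

def choose_model(difficulty: float, threshold: float = 0.7) -> str:
    """Cheap model by default; frontier model only above a difficulty threshold."""
    return "frontier-model" if difficulty >= threshold else "small-model"

@dataclass
class Dispatcher:
    branch_budget_usd: dict
    spent_usd: dict = field(default_factory=dict)

    def dispatch(self, branch, agent, path, cost_usd):
        # Cycle detection: an agent may not reappear in its own call path.
        if agent in path:
            raise RoutingError(f"cycle: {agent} already in {path}")
        # Per-branch budget cap: refuse work the branch cannot pay for.
        spent = self.spent_usd.get(branch, 0.0)
        if spent + cost_usd > self.branch_budget_usd.get(branch, 0.0):
            raise RoutingError(f"budget exceeded on branch {branch}")
        self.spent_usd[branch] = spent + cost_usd
        return path + (agent,)

print(coordination_latency(levels=3, seconds_per_call=2.0))  # 6.0 seconds, as in the text
dispatcher = Dispatcher(branch_budget_usd={"research": 0.50})
path = dispatcher.dispatch("research", "planner", (), cost_usd=0.40)
```

Both checks depend on state the individual agent never sees: the full call path and the branch's running spend.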

Traditional software is testable because it is deterministic: same input, same output. Agents break that assumption. The same request yields different reasoning paths, tool choices, and outcomes across runs — which is exactly why single-run benchmark scores systematically overstate production reliability.

So evaluation cannot be a pre-launch gate you pass once. The harness has to support continuous evaluation, offline before deployment and online in production, across several axes: accuracy, groundedness, task completion, tool-use quality, policy compliance, safety, and cost efficiency. The payoff is concrete — pairing real-time trust scoring of each agent step with a simple fallback (re-generate or escalate) has been shown to cut failure rates on τ²-bench by up to 50%. That is the difference between an agent that fails silently and one that knows when to stop.
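The trust-score-plus-fallback pattern can be sketched as a wrapper around each agent step. This is an illustration of the idea only, not the method used in the cited τ²-bench result: `generate` and `score` are hypothetical callables you would supply, and the 0.8 threshold and two attempts are arbitrary.

```python
def run_step_with_fallback(generate, score, threshold=0.8, max_attempts=2):
    """Return (status, output, attempts).

    generate(attempt) produces a candidate step output.
    score(output) returns a trust score in [0, 1].
    Both are assumed to be provided by the harness, not the agent.
    """
    output = None
    for attempt in range(1, max_attempts + 1):
        output = generate(attempt)
        if score(output) >= threshold:
            return "accepted", output, attempt
    # Re-generation did not help: stop and hand the last output to a human.
    return "escalated", output, max_attempts

# Toy stand-ins: the first draft is untrustworthy, the second is fine.
drafts = {1: "refund $9,000", 2: "refund $90"}
status, output, attempts = run_step_with_fallback(
    generate=lambda attempt: drafts[attempt],
    score=lambda text: 0.95 if text == "refund $90" else 0.30,
)
```

The important design choice is that the loop always terminates in one of two explicit states, so a low-confidence step can never pass through silently.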

This is the capability teams most often skip, and the omission is why so many are flying blind through their own production systems. You cannot manage a regression you have no instrument to detect.

When an agent fails, the question is always why — did retrieval miss, did it pick the wrong tool, did the model hallucinate, did an upstream API lie? You cannot answer any of these without a trace.

Agents fail in a way ordinary software does not: they fail while looking successful — a well-formed but wrong answer, a redundant tool call, a semantically invalid action that returns HTTP 200. So observability has to capture intent and process, not just inputs and outputs: the reasoning trace, the tools considered versus invoked, the arguments passed, tokens and latency at each hop, memory accesses, and policy violations — stitched into one replayable trace.

The genuinely important development here is not a product; it is a standard. The OpenTelemetry GenAI semantic conventions define a common gen_ai.* vocabulary for agent telemetry, already supported natively by Datadog and New Relic and emitted by LangChain, CrewAI, and AutoGen. As of early 2026 much of the spec is still experimental, but the direction is set, and it ends the era of every vendor inventing its own incompatible trace format. One design note that belongs in the harness, not the agent: store prompt and tool content in span events, not span attributes, so personal data can be redacted at the collector before it leaves your network. Observability made cloud operations manageable. It will do the same for agents — and you cannot improve what you cannot see.
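The events-versus-attributes design note can be illustrated without any telemetry SDK. The sketch below uses a plain dataclass as a stand-in for a span; it is not the OpenTelemetry API. The attribute keys follow the gen_ai.* naming pattern but should be checked against the current conventions, and the event name and email-only redaction rule are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)  # low-cardinality metadata only
    events: list = field(default_factory=list)      # prompt and tool content lives here

def redact_at_collector(span: Span) -> Span:
    """Scrub event bodies only; attributes are content-free by convention."""
    clean_events = [
        {**event, "body": EMAIL.sub("[REDACTED_EMAIL]", event["body"])}
        for event in span.events
    ]
    return Span(span.name, dict(span.attributes), clean_events)

span = Span(
    name="execute_tool issue_refund",
    attributes={
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "issue_refund",
        "gen_ai.usage.input_tokens": 412,
    },
    # Hypothetical event name, used here only to carry the content.
    events=[{"name": "tool.arguments", "body": "Refund order 42 for jane.doe@example.com"}],
)
exported = redact_at_collector(span)
```

Keeping content out of attributes means the redaction step has exactly one place to look, and dashboards built on attributes never need to touch personal data.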

Full autonomy is the wrong default for a large class of enterprise actions — financial disbursements, contract changes, regulated decisions, customer commitments, security-sensitive operations. These need accountability and human judgment.

But the binary framing — autonomous or supervised — is itself the trap. The mature model is risk-tiered, selective autonomy. Three modes coexist: human in the loop (approve before acting) for high-stakes, irreversible actions; human on the loop (monitor and intervene) for medium-risk reversible ones; and human after the loop (act, then log for audit and sampling) for routine work. A single workflow may move between all three — an agent that books a low-risk flight and then negotiates a high-risk vendor contract needs different oversight at each step.

This will not be optional for much longer. The EU AI Act’s Article 14 makes demonstrable human oversight a legal requirement for high-risk systems as of August 2, 2026, and the Colorado AI Act took effect in February 2026. And a real critique deserves airing: in April 2026, MIT Technology Review argued that human oversight has in some settings become an illusion, with operators nominally in control of systems they cannot meaningfully audit.

That is precisely why oversight must be a harness service, not a line in the agent. The approval logic you see in nearly every agent codebase looks like this:

    if action.type == "payment":
        interrupt_for_human_approval()

It looks responsible. It also covers exactly the one case the developer thought of and nothing else. When the agent gains a new tool six weeks later — issuing a refund, modifying a contract, changing a permission — the gate is not there, and nobody notices until something expensive happens. Encoded in agent code, approval logic suffers coverage drift the moment a new action type appears. Encoded in the control plane, the policy applies to actions the original developer never anticipated. The goal is not to keep humans busy. It is to spend human judgment where it changes the outcome — and to review the decision, not the entire run.
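By contrast, a control-plane version might decide oversight from declared properties of an action rather than from a hard-coded action name, and fail closed for anything unregistered. The sketch below is one possible shape under assumptions: `ActionProfile`, the registry contents, and the risk labels are hypothetical, and the mapping to the three oversight modes is a simplification.

```python
from dataclasses import dataclass
from enum import Enum

class Oversight(Enum):
    IN_THE_LOOP = "approve before acting"
    ON_THE_LOOP = "monitor and intervene"
    AFTER_THE_LOOP = "act, then log for audit"

@dataclass(frozen=True)
class ActionProfile:
    reversible: bool
    risk: str  # "low", "medium", or "high"

def required_oversight(action_type: str, registry: dict) -> Oversight:
    profile = registry.get(action_type)
    if profile is None:
        # Unregistered action (for example a tool added last week): fail closed.
        return Oversight.IN_THE_LOOP
    if profile.risk == "high" or not profile.reversible:
        return Oversight.IN_THE_LOOP
    if profile.risk == "medium":
        return Oversight.ON_THE_LOOP
    return Oversight.AFTER_THE_LOOP

registry = {
    "payment": ActionProfile(reversible=False, risk="high"),
    "book_flight": ActionProfile(reversible=True, risk="low"),
    "update_crm_note": ActionProfile(reversible=True, risk="medium"),
}

for action in ("book_flight", "update_crm_note", "payment", "issue_refund"):
    print(action, "->", required_oversight(action, registry).value)
```

Here `issue_refund` was never registered, yet it still lands behind an approval gate. That default is the coverage the inline `if` statement cannot provide.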

A growing genre of excellent engineering write-ups documents a single team’s harness in depth — the incidents, the retry budgets, the test layers, the self-healing loops that drag a failing CI run back to green overnight. These are some of the most useful agent content being published. If you are building a harness, read them.

But notice what they optimize for. Nearly all of them are built around one workflow, usually engineering operations: a signal arrives, an agent investigates, a change ships, CI is watched until merge. What they optimize for is getting to merge, autonomously and overnight. That is a genuine achievement — and it is also why the same three capabilities keep turning up in their own “deliberately thin” sections.

The first is evaluation. The recurring position is that production feedback is the evaluation: if the work merges and nobody complains, it worked. That holds for an internal tool with an operator watching every case. It does not survive a regulated workflow, where “nobody has complained yet” is not an audit trail.

The second is identity and access. In the build-logs it usually appears as scar tissue — an expired token that silently killed a pipeline, a misconfigured credential that nearly posted to the wrong channel — patched after the incident, rarely designed in. For a single-operator system that is survivable. For an agent acting on behalf of many users across many systems, identity is not a patch; it is the foundation.

The third is cost governance, typically tracked loosely and bounded by the human in the loop, never made first-class.

None of this makes the build-logs wrong. It makes them narrow by design: deep on one workflow, thin on the capabilities that workflow happens not to stress. The seven here are the superset — what a harness needs when the workload is not one team’s eng-ops pipeline but a fleet of agents acting for the whole enterprise. A production memoir tells you how one harness was built. A capability model tells you whether yours is ready for work its author never imagined.

I am not arguing every agent needs all seven services on day one. There are honest costs. Every control adds a hop: a heavy orchestration hierarchy can add seconds of coordination overhead, and continuous evaluation and full-fidelity tracing both burn compute. For a latency-critical, single-purpose agent, a thinner harness is the right call.

It is also easy to reach for multi-agent routing too early. Princeton researchers found a single agent matched or beat multi-agent systems on 64% of benchmarked tasks, so the fourth capability, routing, earns its place only once you genuinely have a fleet; until then, it is complexity you pay for and do not use.

The way through is to adopt these in order of risk. Start with identity and tool governance, which bound the blast radius. Add observability next, because you cannot improve what you cannot see. Layer in the rest as the deployment matures. A prototype talking to read-only data does not need the full control plane; a production agent moving money does.

The test for whether a capability belongs in your harness rather than your agent stays the same throughout: if the agent could turn it off, forget it, or route around it, it was never a control.

Score your own harness. Seven boxes checked is a control plane; every “it’s in the agent” is a gap to close.

The industry’s attention is still fixed on building more capable agents. As deployments mature, it will shift — as it did from DevOps to MLOps — toward operating those agents safely, observably, and at scale. The teams that win the next phase will not have the smartest agents. They will have the most reliable systems wrapped around ordinary ones.

That is the bet behind the Agent Harness as a control plane. Kubernetes did not make containers smarter; it made them governable, and that is what made them enterprise infrastructure. The same separation — agent as workload, harness as control plane — is what will turn agentic AI from an impressive demo into something you can put your name on in production.

Because in the enterprise, intelligence was never the hard part. Reliability is what earns trust.

The Seven Capabilities Every Agent Harness Must Provide was originally published in Towards AI on Medium.
