Harness Engineering Has No Fixed Address

wpnews.pro

TL;DR— Harness engineering is one thing:getting reliable judgment out of a reasoner you didn't train— given an instruction, bounded by a spec that can override that instruction, and verified before the result ships. It is not "the wrapper around the model." It's apropertyof code, not a place in your stack, and it lives on both sides of every tool call. The generic scaffolding — loops, dispatch, memory — melts into the model a little more every quarter. What survives is the part that staysexternalto the model: the specification of what the agent may do, and the verification that it did. That's the discipline. The rest is plumbing.

I've written before that Agent = Model × Harness, and that the harness is the half you actually engineer. I still believe the formula. But stop there and it oversells two things — so this is me correcting my own framing, precisely, with code.

The × oversells separability. Multiplication implies the factors are independent. They aren't. A better model doesn't leave your harness untouched — it dissolves parts of it. Chain-of-thought prompting became reasoning models. ReAct-style tool loops became native tool use. Half your RAG plumbing is being quietly absorbed by longer context and better-trained retrieval. Every model generation collapses a layer of harness the previous one needed.

"Harness absorbs software engineering" oversells the annexation. It doesn't absorb it. It partly consumes it — pulls in the slice that encodes the agent's behavior — while the deterministic spine (payments, ledgers, auth) and the dumb tools stay exactly what they were. Relocation with an annexation at the seam. Not a takeover.

Both corrections point at the same uncomfortable question: if the model eats the scaffold every quarter, isn't the harness the melting part — the thing you'd be foolish to bet a career on?

Here's the resolution, and it's the whole point.

The harness mechanism melts. The loop, the dispatch, the retry, the memory-stitching — that's the part the model and the frameworks commoditize, and they should.

What doesn't melt is the part that's external to the model:

The model can get arbitrarily good at

runningan agent and still not know what your agent isfor.

Your refund policy isn't in the weights and never will be. Your escalation thresholds, your risk tolerance, what this agent must never do regardless of how it's asked — that's private, contextual, and changing. It's a specification, and specifications live outside the thing they govern. Same for evaluation: a model can't be its own measurement layer without circularity — the thing being judged can't be trusted to judge itself.

So the durable core of harness engineering was never the machinery. It's specification and verification — defining the envelope of allowed behavior, and proving the agent stayed inside it. Both survive every model upgrade for the same reason: they're external to the model, one as constraint, one as adversary.

The tempting picture is tidy: the harness wraps the model, and the tools sit outside it. That picture does not survive contact with agent-facing APIs.

Ask where the refund cap actually lives. In the agent's orchestration guardrails? Or in the refund endpoint that refuses the call? The honest answer is both — and the load-bearing copy is the one in the API, because you never trust a probabilistic reasoner to self-limit. Defense in depth pushes the hard constraints down into the tool. Which means the agent-facing endpoint is doing harness-grade work — specifying and constraining agency — from the far side of the boundary you thought separated harness from tool.

So harness engineering isn't a location in the stack. It's a property:

Does this code exist to bend a fallible reasoner toward a spec?

Run that test and the boundary stops mattering. The model-facing side — the code that talks to the model, elicits its judgment, constrains it — passes wholesale; that's its entire job. The service side passes selectively: a plain database or a human-facing endpoint is just a tool, but an endpoint purpose-built for an agent — idempotent, dry-runnable, policy-enforcing, with errors a model can parse and recover from — is harness. The Stripe-for-agents dev is doing harness engineering. The Stripe-for-humans dev whose endpoint an agent happens to call is not. Membership by function, not address.

And the model itself? Never. It's exogenous — the × you don't own.

Here's the whole discipline as one function. The comment headers are the taxonomy:

def handle_refund(instruction, ctx):
    decision = model.decide(
        instruction=instruction,        # per-task: "refund this customer"
        policy=ctx.policy_summary,      # standing spec, handed to the model
        account=ctx.account,
    )  # -> { action: "refund", amount: 52000, reason: "..." }

    if decision.action == "refund" and decision.amount > ctx.policy.auto_approve_cap:
        return escalate(decision, reason="over auto-approve cap")

    check = evals.verify(decision, ctx.policy)
    if not check.passed:
        log_failure(check)              # outer loop: feeds harness re-tuning across versions
        return escalate(decision, reason=check.violation)

    return refund_api.execute(
        amount=decision.amount,
        idempotency_key=ctx.request_id, # because the model might retry
    )

(model.decide

, evals.verify

, and refund_api.execute

stand in for the real implementations.)

The teaching is in two lines.

The envelope if

is what makes this harness engineering and not prompt engineering. If the cap lived in the prompt, the model could be cajoled past it — flattered, confused, jailbroken. In code, it can't. That single branch is "reliable judgment given an instruction, bounded by a spec that can override it" — compiled.

And evals.verify

before execute

is the judgment getting checked while it's still cheap to be wrong.

Now notice what's absent: the model itself (exogenous), and any human-facing refund UI (a tool, not harness). Run the membership test down every line — does this exist to bend a fallible reasoner to a spec? — and everything that remains is the harness.

The naive read of "reliable behavior" is does the agent do what it's told. That's the easy half, and it's the half models are getting better at on their own.

The durable, hard half is the opposite:

Reliable judgment includes the agent

refusingwhen the instruction would breach the spec.

The refund agent's hardest moment isn't executing "refund this customer." It's declining "refund $50,000" even though it was told to. That refusal doesn't come from the instruction — it comes from the envelope that outranks it. And making a probabilistic reasoner reliably disobey at exactly the right boundary, every time, is the whole safety problem in miniature. It's also the failure that doesn't surface until it's expensive — which is exactly why you verify it in code rather than hope for it in a prompt.

A discipline with a measured error signal you can drive down is engineering. Without it, it's prompt-craft. Evals are what close the loop — and there are two loops, nested at different timescales.

The inner loop is the runtime check that catches the bad judgment before it commits — the evals.verify

above. It feeds the agent, mid-task. It's how the envelope gets enforced, and it saves the money now.

The outer loop is the offline suite that tells you "this agent breaches policy 3% of the time." It feeds you, across versions, and re-tunes the harness over weeks.

Same discipline, two controllers. One keeps a single decision inside the lines; the other keeps the system improving.

Step back and it's a control system, and every piece has a name:

That's harness engineering, fully factored.

Here's the part that reconciles "bet on the harness" with "the harness melts," because both are true.

Every individual harness melts. Each one has the model's shadow baked into it — its specific failure modes, its current reliability — and gets rewritten when the curve moves: fewer guardrails as the model gets dependable, different ones as its failure modes shift. The artifact is disposable.

The discipline doesn't melt. Something must always specify what an autonomous reasoner may do, and prove it stayed inside that — and that something sits external to the model, which is exactly why no model generation can absorb it.

The harness melts. Harness engineering doesn't.

So the bet isn't "build the loop" — the model ships more of the loop every quarter. The bet is: be the person who decides what the agent is allowed to want, and can prove it stayed there. Mechanism commoditizes. Specification and verification don't.

And the surface is only getting wider, because the consumer of software is changing — your users are becoming agents too. Every system that exposes itself to an autonomous reasoner now needs a face built for one. That face is harness work, wherever it sits.

The model half is being handed to all of us for free, and it gets better every quarter. Whether your agent — your product, your role — gets better is a question about the half you engineer: the judgment you specify, and the verification that holds it true.

That's where I'd place the bet. It just doesn't have a fixed address.

source & further reading

dev.to — original article Why Your Agent's Search Results Look Right and Are Wrong: The Index Distribution Problem Stop Telling Your AI to "Be Careful Next Time." It Has No Memory of Yesterday. container escape is becoming an agent workload

Harness Engineering Has No Fixed Address

Run your AI side-project on zahid.host