{"slug": "harness-engineering-has-no-fixed-address", "title": "Harness Engineering Has No Fixed Address", "summary": "Harness engineering is not a fixed location in the stack but a property of code that involves specifying and verifying agent behavior, according to a developer. The durable core of harness engineering is specification and verification, which remain external to the model and survive model upgrades. The developer argues that as models improve, they dissolve parts of the harness, but the external constraints and evaluation layers persist.", "body_md": "TL;DR — Harness engineering is one thing: *getting reliable judgment out of a reasoner you didn't train* — given an instruction, bounded by a spec that can override that instruction, and verified before the result ships. It is not \"the wrapper around the model.\" It's a *property* of code, not a place in your stack, and it lives on both sides of every tool call. The generic scaffolding — loops, dispatch, memory — melts into the model a little more every quarter. What survives is the part that stays *external* to the model: the specification of what the agent may do, and the verification that it did. That's the discipline. The rest is plumbing.\n\nI've written before that **Agent = Model × Harness**, and that the harness is the half you actually engineer. I still believe the formula. But stop there and it oversells two things — so this is me correcting my own framing, precisely, with code.\n\n**The × oversells separability.** Multiplication implies the factors are independent. They aren't. A better model doesn't leave your harness untouched — it *dissolves* parts of it. Chain-of-thought prompting became reasoning models. ReAct-style tool loops became native tool use. Half your RAG plumbing is being quietly absorbed by longer context and better-trained retrieval. 
Every model generation collapses a layer of harness the previous one needed.\n\n**\"Harness absorbs software engineering\" oversells the annexation.** It doesn't absorb it. It *partly consumes* it — pulls in the slice that encodes the agent's behavior — while the deterministic spine (payments, ledgers, auth) and the dumb tools stay exactly what they were. Relocation with an annexation at the seam. Not a takeover.\n\nBoth corrections point at the same uncomfortable question: if the model eats the scaffold every quarter, isn't the harness the *melting* part — the thing you'd be foolish to bet a career on?\n\nHere's the resolution, and it's the whole point.\n\nThe harness *mechanism* melts. The loop, the dispatch, the retry, the memory-stitching — that's the part the model and the frameworks commoditize, and they should.\n\nWhat doesn't melt is the part that's **external to the model**:\n\nThe model can get arbitrarily good at *running* an agent and still not know what your agent is *for*.\n\nYour refund policy isn't in the weights and never will be. Your escalation thresholds, your risk tolerance, what this agent must never do regardless of how it's asked — that's private, contextual, and changing. It's a *specification*, and specifications live outside the thing they govern. Same for evaluation: a model can't be its own measurement layer without circularity — the thing being judged can't be trusted to judge itself.\n\nSo the durable core of harness engineering was never the machinery. It's **specification and verification** — defining the envelope of allowed behavior, and proving the agent stayed inside it. Both survive every model upgrade for the same reason: they're external to the model, one as constraint, one as adversary.\n\nThe tempting picture is tidy: the harness wraps the model, and the tools sit outside it. That picture does not survive contact with agent-facing APIs.\n\nAsk where the refund cap actually lives. In the agent's orchestration guardrails? 
Or in the refund endpoint that refuses the call? The honest answer is *both* — and the load-bearing copy is the one in the API, because you never trust a probabilistic reasoner to self-limit. Defense in depth pushes the hard constraints *down into the tool*. Which means the agent-facing endpoint is doing harness-grade work — specifying and constraining agency — from the far side of the boundary you thought separated harness from tool.\n\nSo harness engineering isn't a *location* in the stack. It's a *property*:\n\nDoes this code exist to bend a fallible reasoner toward a spec?\n\nRun that test and the boundary stops mattering. The **model-facing** side — the code that talks to the model, elicits its judgment, constrains it — passes wholesale; that's its entire job. The **service side** passes *selectively*: a plain database or a human-facing endpoint is just a tool, but an endpoint *purpose-built for an agent* — idempotent, dry-runnable, policy-enforcing, with errors a model can parse and recover from — is harness. The Stripe-for-agents dev is doing harness engineering. The Stripe-for-humans dev whose endpoint an agent happens to call is not. Membership by function, not address.\n\nAnd the model itself? Never. It's exogenous — the × you don't own.\n\nHere's the whole discipline as one function. The comment headers are the taxonomy:\n\n``` python\ndef handle_refund(instruction, ctx):\n    # ── MODEL-FACING (agent side): elicit judgment from the non-deterministic reasoner\n    decision = model.decide(\n        instruction=instruction,        # per-task: \"refund this customer\"\n        policy=ctx.policy_summary,      # standing spec, handed to the model\n        account=ctx.account,\n    )  # -> { action: \"refund\", amount: 52000, reason: \"...\" }\n\n    # ── THE ENVELOPE: a spec that OUTRANKS the instruction.\n    # This is CODE, not a prompt. You can argue with a prompt.\n    # You cannot argue with an if-statement. 
That line is the discipline.\n    if decision.action == \"refund\" and decision.amount > ctx.policy.auto_approve_cap:\n        return escalate(decision, reason=\"over auto-approve cap\")\n\n    # ── RUNTIME EVAL (inner loop): verify the judgment BEFORE the side effect commits\n    check = evals.verify(decision, ctx.policy)\n    if not check.passed:\n        log_failure(check)              # outer loop: feeds harness re-tuning across versions\n        return escalate(decision, reason=check.violation)\n\n    # ── ENVIRONMENT-FACING (service side): agent-optimized tool, idempotent on purpose\n    return refund_api.execute(\n        amount=decision.amount,\n        idempotency_key=ctx.request_id, # because the model might retry\n    )\n```\n\n(`model.decide`, `evals.verify`, and `refund_api.execute` stand in for the real implementations.)\n\nThe teaching is in two lines.\n\nThe envelope `if` is what makes this **harness engineering and not prompt engineering**. If the cap lived in the prompt, the model could be cajoled past it — flattered, confused, jailbroken. In code, it can't. That single branch is \"reliable judgment given an instruction, bounded by a spec that can override it\" — compiled.\n\nAnd `evals.verify` *before* `execute` is the judgment getting checked while it's still cheap to be wrong.\n\nNow notice what's **absent**: the model itself (exogenous), and any human-facing refund UI (a tool, not harness). 
Run the membership test down every line — *does this exist to bend a fallible reasoner to a spec?* — and everything that remains is the harness.\n\nThe naive read of \"reliable behavior\" is *does the agent do what it's told.* That's the easy half, and it's the half models are getting better at on their own.\n\nThe durable, hard half is the opposite:\n\nReliable judgment includes the agent *refusing* when the instruction would breach the spec.\n\nThe refund agent's hardest moment isn't executing \"refund this customer.\" It's declining \"refund $50,000\" *even though it was told to.* That refusal doesn't come from the instruction — it comes from the envelope that outranks it. And making a probabilistic reasoner reliably *disobey* at exactly the right boundary, every time, is the whole safety problem in miniature. It's also the failure that doesn't surface until it's expensive — which is exactly why you verify it in code rather than hope for it in a prompt.\n\nA discipline with a measured error signal you can drive down is engineering. Without it, it's prompt-craft. Evals are what close the loop — and there are two loops, nested at different timescales.\n\nThe **inner loop** is the runtime check that catches the bad judgment *before it commits* — the `evals.verify` above. It feeds the agent, mid-task. It's how the envelope gets enforced, and it saves the money *now*.\n\nThe **outer loop** is the offline suite that tells you \"this agent breaches policy 3% of the time.\" It feeds *you*, across versions, and re-tunes the harness over weeks.\n\nSame discipline, two controllers. One keeps a single decision inside the lines; the other keeps the *system* improving.\n\nStep back and it's a control system, and every piece has a name:\n\nThat's harness engineering, fully factored.\n\nHere's the part that reconciles \"bet on the harness\" with \"the harness melts,\" because both are true.\n\nEvery *individual* harness melts. 
Each one has the model's shadow baked into it — its specific failure modes, its current reliability — and gets rewritten when the curve moves: fewer guardrails as the model gets dependable, different ones as its failure modes shift. The artifact is disposable.\n\nThe *discipline* doesn't melt. Something must always specify what an autonomous reasoner may do, and prove it stayed inside that — and that something sits external to the model, which is exactly why no model generation can absorb it.\n\nThe harness melts. Harness engineering doesn't.\n\nSo the bet isn't \"build the loop\" — the model ships more of the loop every quarter. The bet is: be the person who decides what the agent is allowed to want, and can prove it stayed there. Mechanism commoditizes. Specification and verification don't.\n\nAnd the surface is only getting wider, because the consumer of software is changing — your users are becoming agents too. Every system that exposes itself to an autonomous reasoner now needs a face built for one. That face is harness work, wherever it sits.\n\nThe model half is being handed to all of us for free, and it gets better every quarter. Whether your **agent** — your product, your role — gets better is a question about the half you engineer: the judgment you specify, and the verification that holds it true.\n\nThat's where I'd place the bet. 
It just doesn't have a fixed address.", "url": "https://wpnews.pro/news/harness-engineering-has-no-fixed-address", "canonical_source": "https://dev.to/saurav_bhattacharya/harness-engineering-has-no-fixed-address-2m7a", "published_at": "2026-06-22 00:17:04+00:00", "updated_at": "2026-06-22 00:25:16.962898+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-safety", "developer-tools"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/harness-engineering-has-no-fixed-address", "markdown": "https://wpnews.pro/news/harness-engineering-has-no-fixed-address.md", "text": "https://wpnews.pro/news/harness-engineering-has-no-fixed-address.txt", "jsonld": "https://wpnews.pro/news/harness-engineering-has-no-fixed-address.jsonld"}}