{"slug": "agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it", "title": "Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It", "summary": "A developer argues that the evaluation and observability layers should be part of an agent's harness, not external tools. Using the formula Agent = Model × Harness, they explain that closed-loop agents require built-in lens and eval components to detect and escalate failures, as demonstrated when a crashed worker left a trace and was flagged as invalid until resolved.", "body_md": "There's a formula I keep coming back to when people ask why their slick demo agent falls apart in production:\n\nAgent = Model × Harness\n\nThe model is the raw reasoning — Claude, GPT, whatever. It's swappable, and it's getting better on a curve you don't control. The harness is everything else: the goals, the loops, the tools, the scheduler, the retry logic. Most of the engineering that matters lives in the harness, not the model.\n\nBut here's the part most teams get wrong. They define the harness as *the plumbing to run the model* — goals + loops + tools — and then bolt **evals** and **observability** on the side as external QA. Things you point *at* the agent, after the fact, from outside.\n\nThat's the mistake. **Your eval layer and your trace layer are inside the harness.** They're not tools beside the agent; they're the half of the agent that makes it a closed loop instead of an open one.\n\nHarness = goals + loops + tools + lens + evals. The first three let the agent act. The last two let it know whether the action was any good, which is the only thing that turns an agent that runs into an agent that improves.\n\nLet me make that concrete, because I learned it the unglamorous way: by watching a scheduled agent crash and discovering exactly which parts of my harness saved me.\n\nAn **open-loop** agent acts and moves on. 
It writes the file, hits the API, ships the commit — and nobody, including the agent itself, knows if the outcome was correct. You find out when a human notices something's broken. Most \"autonomous\" agents in the wild are open-loop. They're impressive right up until they silently do the wrong thing for three days.\n\nA **closed-loop** agent has a feedback path:\n\n```\nact → observe → evaluate → correct → act better\n```\n\nThe two pieces that close it are the **lens** (observability — every model call and tool step, resolved inputs, raw outputs, so a run is never a black box) and the **evals** (judgment — the output scored against a standard: deterministic checks, contract validation, model-as-judge). Lens tells you *what the agent did*; evals tell you *whether it was good*. You need both, wired into the harness itself — not run by hand once a week by a human who remembers to check.\n\nI run a fleet of scheduled agents — background workers on cron, each doing a focused job on a timer, each in an isolated session the main process can't watch live.\n\nOne of them crashed mid-run.\n\nIn an open-loop setup, that's a silent disaster: the worker dies, produces nothing, leaves no trace. You discover it days later when the output that should exist doesn't. (If you've run background agents, you know this exact pain — the ones that \"go dark\" for a week before anyone notices.)\n\nHere's what happened instead, because the lens and eval layers were part of the harness:\n\n**1. The lens meant the failure left a record.** Every worker writes its transcript *stub-first*: the moment it starts, before doing any real work, it writes a record with `Outcome: IN-PROGRESS`. So even a worker that dies one second later has left a footprint. The crash wasn't invisible — there was a transcript sitting there, frozen mid-run, saying \"I started and never finished.\"\n\n**2. 
The evals meant the failure got caught — loudly, and repeatedly.** Worker transcripts are validated against a versioned contract. A stub stuck at `IN-PROGRESS` is fine in normal mode (it might still be running), but under the \"finished\" check it becomes an *error*:\n\n```\n# Normal: an unfinished stub is acceptable\nagent-eval validate transcripts/\n\n# Finished mode: a lingering IN-PROGRESS is a hard error\nagent-eval validate transcripts/ --finished\n```\n\nSo the dead run didn't just leave a record — it got *scored as invalid*, and kept showing up in the gate's invalid list every single run until someone dealt with it. The eval layer refused to let the failure quietly blend into the background.\n\nThat's the whole point. **The lens made the failure observable; the evals made it un-ignorable.** Neither is something I went and ran by hand — they're standing parts of the harness that turned a crash from a black hole into a tracked defect.\n\nDetecting the failure is observe + evaluate. The last piece is *correct*.\n\nThe naive fix is \"add a `finally` block so the worker finalizes its own transcript on crash.\" That doesn't work, and the reason is instructive: **a crashed isolated session is gone.** The process that would run the `finally` is the thing that died. You can't ask a dead worker to clean up after itself.\n\nSo the correction step has to live *outside* the worker, as its own small loop. 
I wrote a sweeper: a scheduled janitor that scans for transcripts stuck at `IN-PROGRESS` whose run is *provably over* — older than a threshold set safely past the longest possible run time, so it can never race a worker that's still legitimately going — and finalizes them to `fail`.\n\nThe logic is deliberately boring:\n\n```\nfor each transcript stub still marked IN-PROGRESS:\n    if age > MAX_RUN_TIME + safety_margin:   # the run is provably dead\n        rewrite Outcome -> \"fail (auto-finalized: abandoned mid-run)\"\n```\n\nIdempotent, never deletes anything, runs on a timer. The first time it ran it finalized a backlog of long-dead stubs and dropped the gate's invalid count in finished-mode to just the known, expected exceptions. Now it keeps the loop closed automatically: a worker can still crash, but its corpse gets cleaned up within the hour without me touching anything.\n\nThat's `act → observe → evaluate → correct`, fully wired — and three of those four steps are the lens + eval layers doing their job.\n\nHere's where I have to be honest, because this is the part it's tempting to oversell.\n\nWhat I built is a **self-correcting** harness. The loop closes: the system detects deviations from its own standard and repairs them without a human in the loop. That's real, and it's the floor you want under any fleet of autonomous agents.\n\nBut self-correcting is not **self-improving**. Self-correcting means the system *holds the line* — it keeps itself at the standard. Self-improving would mean the system *moves the line up* on its own: noticing a check has false positives and tightening its own rubric, noticing a worker keeps regressing and adjusting that worker's instructions. My harness doesn't do that. *I* still author every improvement. The loops run themselves; the loops were designed by a human.\n\nAnd honestly — that's the right place to stop, for now. 
A harness that rewrites its own workers and its own eval criteria unsupervised is a different and much riskier thing, precisely because **the evals would be grading work the same system authored.** The moment the thing being judged and the thing doing the judging are the same closed system with no external anchor, \"it passes its own evals\" stops meaning very much. Self-correction *within a human-set contract* is the sound version. Self-modification *of the contract itself* is where you want hard guardrails and a human gate before you flip it on.\n\nSo the honest formula, fully expanded:\n\nAgent = Model × Harness, where Harness = goals + loops + tools + lens + evals, and it's the lens + evals that make the agent self-correcting. Getting from there to self-improving is a real step further, and it's one you should take deliberately, with a human holding the gate.\n\nIf you're building agents and your evals and traces are something you run *occasionally, from outside*, you don't have a closed loop — you have an open-loop agent with some QA scripts nearby. The upgrade isn't more tooling. It's a reframe:\n\nAgent = Model × Harness. The model is improving on its own. Whether your *agent* improves is a question about your harness — and specifically about whether you've wired the loop closed.\n\n*The two layers in this post map to two tools I use as a single unit. **agent-eval** is the eval framework — it scores and gates an agent's **output** (contract validation, drift, hallucination, staleness); the `validate --finished` check in the incident above is its work. **AgentLens** is the trace layer — it captures **how the agent got there** (every model and tool step, resolved inputs, raw outputs) so the eval signal is actually debuggable. agent-eval tells you something broke; AgentLens tells you why. 
They ship together on purpose: two halves of one feedback loop — which is exactly the argument of this post.*", "url": "https://wpnews.pro/news/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it", "canonical_source": "https://dev.to/saurav_bhattacharya/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it-1422", "published_at": "2026-06-20 22:49:09+00:00", "updated_at": "2026-06-20 23:09:04.340091+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "machine-learning", "ai-infrastructure"], "entities": ["Claude", "GPT"], "alternates": {"html": "https://wpnews.pro/news/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it", "markdown": "https://wpnews.pro/news/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it.md", "text": "https://wpnews.pro/news/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it.txt", "jsonld": "https://wpnews.pro/news/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it.jsonld"}}