{"slug": "harness-base-definition-the-control-system-outside-the-model", "title": "Harness Base Definition: The Control System Outside the Model", "summary": "A developer has defined \"Harness\" as the control system outside the model in an AI agent architecture, distinguishing it from the Agent itself. While the Agent judges the next step in a task, Harness makes each step executable, constrained, observable, recoverable, verifiable, and governable in the real environment. The Harness handles engineering responsibilities such as permission, execution environment, session lifecycle, observability logs, verification criteria, and governance policy that are separate from the model's reasoning core.", "body_md": "Previously, we split Agent into several minimal parts:\n\n```\nModel: judge the next step\nLoop: keep the process moving\nTools: interact with the real world\nState: keep the task connected\n```\n\nAt this point, a natural question appears:\n\n**If Agent already has model, loop, tools, and state, why talk about Harness?**\n\nAn even easier confusion is:\n\n```\nIs Harness a higher-level, smarter Agent that manages other Agents?\n```\n\nThat sounds plausible, but it bends the architecture in the wrong direction. Harness is not another Agent. It is not a larger prompt, and it is not a framework name. It is the control system outside the model. Continue with the same small CLI Agent:\n\n```\nUser says: help me figure out why this project's tests are failing, and fix it.\n```\n\nIf this CLI Agent is only a demo, it can be simple:\n\n```\nsend user input to model\nmodel says read file\nprogram reads file\nput result back into prompt\nmodel says edit file\nprogram edits file\nmodel says run tests\nprogram runs tests\n```\n\nThis chain can work once and already look like an Agent. But as soon as someone else really uses it, questions appear. What if the model wants to execute `rm -rf`? What if it wants to read private files under the user's home directory? 
If it runs for ten minutes and the user interrupts, how is the working state saved? After a tool error, should the next model turn see the full log or only a summary? If the same task continues tomorrow, where does the session resume from? If a modification looks successful but no test verified it, how does the system know it is done? If a user says the Agent damaged a file, how do we reconstruct what happened?\n\nThese questions do not belong to the model itself. They should not be left for the model to decide. The model only generates the next-step judgment from the current context. Permission, execution environment, session lifecycle, observability logs, verification criteria, and governance policy are engineering responsibilities outside the model. Together, those responsibilities are Harness.\n\nOne sentence:\n\n**Agent judges the next step in a task; Harness makes each step executable, constrained, observable, recoverable, verifiable, and governable in the real environment.**\n\nFrameworks may provide parts of Harness, but Harness is more a set of model-external engineering responsibilities than a package name or product name.\n\nThis article does not turn Harness into a giant terminology box. It answers one core question:\n\nWhat is the relationship between Harness and Agent? Why is Harness not another Agent?\n\nThe problem sequence:\n\n```\nOnce Agent can call tools\n-> it must distinguish \"model proposal\" from \"system execution\"\n-> system execution needs permission, sandbox, budget, and error handling\n-> once the task becomes long\n-> it needs session, lifecycle, interrupt, and recovery\n-> once the product is used by others\n-> it needs trace, eval, regression, and governance\n-> these model-external responsibilities together\n-> are Harness\n```\n\nHarness does not appear to make architecture look advanced. It is the control plane forced out by reality when Agent enters a real engineering environment. 
A seven-layer map can be remembered as:\n\n```\nETCLOVG\n\nExecution\nTools\nContext\nLifecycle\nObservability\nVerification\nGovernance\n```\n\nThe key is not the seven names. The key is the responsibility boundary:\n\n```\nAgent proposes next step\nHarness decides whether that step can execute, how it executes, how it is recorded, and how it is verified\n```\n\nThe model remains the reasoning core. It understands the user goal, reads context, and proposes the next action. But it does not directly own the filesystem, shell, permissions, or long-term memory and audit records. Those belong to Harness.\n\nA minimal chat app does not need Harness. It only needs to manage messages:\n\n```\nuser input\n-> model answer\n-> show result\n```\n\nHere model output is only text. If text is wrong, the user ignores it. If incomplete, they ask again. Text has no side effects.\n\nAgent is different. When it can call tools, model output is no longer only \"answer\"; it becomes \"action proposal.\" For example:\n\n```\n{\n  \"tool\": \"bash\",\n  \"input\": {\n    \"cmd\": \"npm test\"\n  }\n}\n```\n\nThis is not ordinary text. It is a request to enter the real environment. The system must answer:\n\n```\nDoes this tool exist?\nAre arguments valid?\nDoes this session allow shell?\nWill this command modify files?\nDoes it require user confirmation?\nWhich working directory should it run in?\nWhat is the timeout?\nHow is long output truncated?\nHow is failure fed back to the model?\n```\n\nWithout Harness, these questions get collapsed into:\n\n```\nWhatever the model wants, we help it do.\n```\n\nThat is the danger of many Agent demos. They treat the model's action intent as a system command. In short tasks this may be fine. 
Once connected to a real codebase, it becomes an incident entry point.\n\nIn our CLI Agent, the model may propose:\n\n```\nread package.json\nsearch failed test name\nopen related source file\nmodify implementation\nrun tests\n```\n\nAll sound reasonable, but their risks differ. Reading files differs from writing files. Running `npm test` differs from running arbitrary shell. Modifying the current repo differs from modifying the user's home directory. Running local commands differs from network access.\n\nHarness's first value is turning \"the model said so\" into \"the system reviewed and executed it.\" This difference is crucial. In implementation, model output should be seen as intent. Harness receives intent and turns it into a controlled action. Minimal pseudocode:\n\n```js\nwhile (!session.done) {\n  const modelInput = harness.context.build(session);\n  const intent = await model.next(modelInput);\n\n  const decision = await harness.policy.review(intent, session);\n\n  if (decision.type === \"deny\") {\n    session.appendObservation(decision.reason);\n    continue;\n  }\n\n  if (decision.type === \"ask_user\") {\n    // Wait for the user's answer before asking the model for another step.\n    await session.pauseForApproval(decision.prompt);\n    continue;\n  }\n\n  const observation = await harness.execution.run(decision.action);\n  session.appendObservation(observation);\n}\n```\n\nThe model is not calling tools here. It produces `intent`. Harness places `intent` into policy, execution, session, and observation boundaries.\n\nThat is the first reason Harness is not another Agent:\n\n**Harness does not think of the next step for the model; it places the model's next step inside engineering constraints.**\n\nDraw a runtime boundary around one full turn of \"fix failing tests.\" In that turn, the `Model -->> Harness` arrow matters. The model returns tool intent, not tool result. `Tool Runtime` and `Execution` touch the project environment. `Session Store` saves the factual process. 
The policy layer inside Harness decides whether execution is allowed.\n\nWhen this boundary blurs, three common problems appear.\n\nFirst: Agent becomes \"model plus naked executor\":\n\n``` js\nwhile (true) {\n  const output = await model(prompt);\n  if (output.includes(\"bash\")) {\n    const result = await exec(output.command);\n    prompt += result;\n  }\n}\n```\n\nThis code is short, but hides all key issues: no permission, no structured tool protocol, no interruption recovery, no audit, no verification, no context policy. It only proves \"the model can drive one external action,\" not \"the system can host a real task.\"\n\nSecond: Harness is imagined as \"another Agent supervising the Agent.\" For example, using an outer model to decide whether the inner model may execute a command. This can be part of policy in some scenarios, but it is not the essence of Harness. Harness's key capability is not \"reason again\"; it is deterministic engineering control:\n\n```\npaths must be inside workspace\nfile writes must go through patch\nshell commands must have timeouts\ndangerous commands must ask the user\nevery tool call must land in the event log\ntest verification must bind to final completion state\n```\n\nThese rules should not be entirely left to another model's free judgment. They should be system policy, type constraints, runtime checks, and audit records.\n\nThird: Harness is treated as an optional \"product layer.\" That is also wrong. Harness is not only UI, deployment, account, or billing. It exists from the minimal CLI stage. Once you distinguish:\n\n```\nmodel proposal\nsystem execution\nexecution result written back to state\nnext-turn context assembled again\n```\n\nyou are already writing Harness. Early Harness is only thin. It thickens as Agent faces real tasks.\n\nIf Harness is understood only as a \"wrapper around Agent,\" one layer is still missing.\n\nA wrapper feels like thin adapter code: receive input, call Agent, return output. 
Real Harness is more like a control loop. Before model action, it provides feedforward constraints. After action, it collects feedback signals. Then it uses those signals to adjust the next model input, tool visibility, permission policy, budget, and verification requirements.\n\nIn a CLI Agent run:\n\n```\nFeedforward constraints:\nsystem instruction, visible tools, working directory, budget, permission mode, project rules\n\nModel judgment:\ngenerate text or tool intent\n\nExecution feedback:\ntool result, error type, file changes, cost, latency, user approval result\n\nState update:\nsession event, context projection, trace, verification evidence\n\nNext-turn constraints:\nreduce visible tools, compact context, require verification first, pause for user, end task\n```\n\nHarness is not \"one more model outside the model.\" Its value is placing dynamic model judgment in an engineering system with sensors, constraints, feedback, and state.\n\nWithout this control loop, the system may still run:\n\n```\nmodel -> tool -> model -> tool -> final\n```\n\nBut it does not know whether tool choices are getting worse, why cost rises, whether a failure was permission rejection, tool error, context pollution, or model misjudgment, or what should change next turn.\n\nBoundary sentence:\n\n```\nAgent produces action intent; Harness regulates action conditions.\n```\n\n\"Regulates\" matters. 
Harness not only executes; it constrains, senses, and feeds back.\n\nMature Agent systems often split three things:\n\n```\nSession: source of truth, records what happened in this task.\nHarness: control loop, decides how the next step runs.\nSandbox: execution hand, actually touches files, commands, network, and external systems.\n```\n\nIf these are collapsed into one process object, early writing is fast and later maintenance is painful.\n\nIn a minimal demo, an in-process variable may hold:\n\n```\nmessages\ncwd\ntool results\ncurrent plan\npermission state\ntemporary files\nrunning process handles\nfinal answer\n```\n\nThis works for one run. But if the process crashes, all facts vanish. If the sandbox is cleaned, the session disappears. If the user wants to continue tomorrow, the system can only guess from a compressed summary.\n\nSeparating responsibilities helps.\n\nSession is not messages. Messages are only the projection visible to the next model turn. Session should record a fuller event ledger:\n\n```\nUserMessage\nModelIntent\nToolValidated\nPolicyReviewed\nApprovalRequested\nApprovalGranted\nToolStarted\nToolFinished\nObservationAppended\nContextCompacted\nVerificationRun\nTaskCompleted\nTaskBlocked\n```\n\nHarness can resume around session. Even if the control process crashes, reading the session log reveals the user goal, executed tools, permission decisions, file changes, verification results, and unfinished work.\n\nSandbox is replaceable execution. It may be a local working directory, temporary git worktree, container, remote VM, browser environment, or hosted execution pool. Sandbox crash should not equal task disappearance. It should become a recorded execution failure, then Harness decides retry, environment change, rollback, or asking the user.\n\nThis three-way split avoids a common mistake: treating \"the process currently running the Agent\" as the system source of truth. Processes die. The source of truth should be session. 
The execution hand can be replaced. The control loop can restart.\n\nThe first ETCLOVG layer is Execution. It asks:\n\n```\nWhere, as whom, and under which limits does the model-proposed action run?\n```\n\nIn the CLI Agent, Execution must know:\n\n```\ncurrent working directory\naccessible file range\navailable environment variables\ncommand timeout\nmaximum output length\nwhether network is allowed\nwhether file writes are allowed\nwhether background processes are allowed\n```\n\nWithout an Execution layer, tool calls run directly against the operating system. For a personal demo this may be tolerable. For an Agent used by others, it is dangerous. If the user only expects the Agent to fix the current repository, and the model proposes:\n\n```\n/Users/alice/.ssh/id_rsa\n```\n\ndo not expect the model to realize \"this should not be read.\" Harness must stop it in Execution.\n\nLikewise:\n\n```\nnpm test\n```\n\nlooks safe, but the test script may start services, write cache, access network, or run for a long time. Execution must provide timeout, output truncation, process cleanup, and working-directory isolation, or ordinary tests can hang the Agent.\n\nA minimal Execution interface:\n\n```\ntype ExecutionRequest = {\n  kind: \"read_file\" | \"write_file\" | \"shell\";\n  cwd: string;\n  args: unknown;\n  timeoutMs: number;\n  allowedPaths: string[];\n  sessionId: string;\n};\n\ntype ExecutionResult = {\n  ok: boolean;\n  stdout?: string;\n  stderr?: string;\n  changedFiles?: string[];\n  exitCode?: number;\n  truncated?: boolean;\n};\n```\n\nThe point is that \"execution\" becomes a governable object. The model cannot bypass it. Tools should not bypass it privately. UI should not bypass it directly. Every action that interacts with the real environment passes through Execution. This is Harness's first gate to reality.\n\nThe second layer is Tools. Execution is closer to the OS. Tools are closer to the model. 
They ask:\n\n```\nWhich capabilities can the model see?\nIn what structure should it submit them?\nHow does the system turn tool results into observation?\n```\n\nMany minimal Agents define tools as functions:\n\n```\nasync function readFile(path: string) {\n  return fs.readFile(path, \"utf8\");\n}\n```\n\nThe function is fine. But if exposed to Agent, it also needs protocol:\n\n```\ntool name\ninput schema\nread-only or write\nwhether confirmation is needed\nwhether it can run concurrently\nhow errors are expressed\nhow results are trimmed\nwhether results enter context\n```\n\nOtherwise the model and system can only guess through natural language. Tool protocol turns \"I want to read a file\" into a structured request. Tool runtime turns the structured request into controlled execution.\n\nIn the full tool pipeline, `Observation` is easily missed. Tool result cannot be only stdout. It must tell the system:\n\n```\nwhether this call succeeded\nwhether output was truncated\nwhich files were read\nwhich files were modified\nwhether a recoverable error occurred\nwhat the next model turn should see\nwhat the UI should show\nwhat the audit log should save\n```\n\nReturning only strings is convenient short-term, but it makes all later mechanisms harder. Context does not know what to retain. Lifecycle does not know how to recover. Observability cannot investigate. Verification cannot know what to verify. Governance cannot know whether a boundary was crossed.\n\nSo Tools is not \"more capabilities is better.\" Its real work is protocolizing capability entrypoints. For a small CLI Agent, four tools may be enough:\n\n```\nread_file\nsearch\napply_patch\nrun_command\n```\n\nBut all four should go through the same protocol. Tool count can be small. Tool boundaries cannot be vague.\n\nThe third layer is Context. It asks:\n\n```\nWhat exactly should the model see this turn?\n```\n\nThis looks like prompt concatenation, but in Agent it is much more complex. 
Long tasks accumulate:\n\n```\noriginal user goal\nproject rules\nfiles read\nsearch results\ntest logs\nmodification records\npermission refusals\nuser approvals\nmodel's own plan\nprevious tool result\n```\n\nPutting everything into prompt creates three problems: token explosion, old and new information polluting each other, and irrelevant detail distracting the model. Context is not \"save all state.\" It projects the workbench the model needs this turn from state:\n\n```\nState is the fact store.\nContext is this turn's view.\nMemory is cross-session experience.\nPrompt is the final input format.\n```\n\nHarness must separate these. In the CLI Agent, Session Store may save the full test log, but the next model turn may only need:\n\n```\ntest command: npm test\nfailed file: src/parser.test.ts\nerror summary: expected 3 but received 2\nrecent modification: near line 42 of src/parser.ts\nconstraint: only modify current workspace\n```\n\nContext prepares a clean, relevant, constrained decision context. It does not think for the model.\n\n`Context Policy` is key. Many Agent failures are not because the model cannot reason, but because Harness shows it a messy context: obsolete logs from 30 minutes ago placed before latest observation, a user-rejected plan kept in high-priority context, or dependency install logs crowding out relevant code.\n\nContext's goal is not \"more information.\" It is:\n\n```\ncomplete enough\nfresh enough\nrelevant enough\nexplainable enough\n```\n\nWithout Context, Agent gradually goes blind in long tasks. With Context, each model turn returns to an organized workbench.\n\nThe fourth layer is Lifecycle. It asks:\n\n```\nWhich states does an Agent task go through from start to finish?\n```\n\nMinimal demos often write:\n\n```js\nwhile (true) {\n  const intent = await model.next(context);\n  const result = await run(intent);\n  context.push(result);\n}\n```\n\nThis can run. But real tasks are not endless `while true`. 
They can be interrupted by users, wait for approval, enter recovery after tool failure, pause after budget exhaustion, complete after tests pass, block due to insufficient permission, or require re-judgment after network, file conflict, or concurrent modification.\n\nHarness must model task lifecycle explicitly. The point is not pretty state names; it is admitting tasks break. An Agent for real users cannot assume the user sits there until one run finishes, that every tool succeeds, or that every model turn goes in the right direction.\n\nLifecycle saves process boundaries:\n\n```\nwhen the task started\nwhere it is stuck\nwhy it paused\nwhat the user approved\nwhich actions already executed\nwhich actions can be retried\nwhich actions cannot be retried\nwhat the completion condition is\n```\n\nThis naturally leads to Session. Session is not chat history. It is the long-task source of truth. It should save events, not only prompts:\n\n```\nUserMessage\nModelIntent\nPolicyDecision\nToolStarted\nToolFinished\nFileChanged\nApprovalRequested\nApprovalGranted\nVerificationPassed\nTaskCompleted\n```\n\nWith these events, the system can replay. Replay enables debugging. Debugging enables improvement. Recovery enables hosting long tasks. Without Lifecycle, every Agent run is a gamble: successful runs look magical; failed runs are hard to review.\n\nThe fifth layer is Observability. It asks:\n\n```\nWhen Agent makes a mistake, how do we know where it went wrong?\n```\n\nFor normal programs, logs, metrics, and traces are common sense. Many Agent demos ironically lack this foundation. They only save the final conversation. When a user says \"it changed my files incorrectly,\" developers can only inspect a vague transcript. 
That is not enough.\n\nAgent failures can happen at many layers:\n\n```\nthe model misunderstood the user goal\nContext included stale information\ntool schema was too loose\npermission policy allowed a dangerous action\nshell timed out but was not marked\ntool output was truncated without telling the model\ntests failed but the final answer said done\nan action the user rejected was executed again\n```\n\nWithout Observability, all these collapse into:\n\n```\nthe model is unstable.\n```\n\nThat sentence has almost no engineering value. Harness observability must split a task into inspectable event chains. At minimum, it should answer:\n\n```\nwhat was the original user goal\nwhat did the model see each turn\nwhat did the model propose each turn\nwhat did the system allow or reject\nwhat did tools actually execute\nwhat did tools return\nwhich outputs were truncated\nwhich files changed\nwhat was the verification command\nwhere did final completion judgment come from\n```\n\nThat is the value of trace. It is not for a pretty dashboard; it lets failures be attributed to the right layer. If tests are not fixed, causes may differ completely:\n\n```\nthe model did not read the right file\nsearch did not find the test name\nContext trimmed the key log\napply_patch modified the wrong place\nrun_command ran the wrong test command\nVerification did not treat failing exit code as failure\n```\n\nEach cause has a different fix. Without observability, you blindly tune prompt. With observability, you can tune Context, Tool, Execution, Verification, or model instruction appropriately.\n\nObservability is Harness's basis for long-term improvement. It returns Agent from mystical tuning to engineering diagnosis.\n\nThe sixth layer is Verification. It asks:\n\n```\nWhy should the system believe the task is complete?\n```\n\nIn chat apps, if the model says \"I have explained it,\" that is usually enough. In programming Agents, it is not. 
The user asks the CLI Agent to fix failing tests. The model's final answer:\n\n```\nI have fixed the issue.\n```\n\nis not completion evidence. Real evidence should come from external verification:\n\n```\nrelevant tests pass\nno new failures introduced\nmodification scope is expected\nkey files were actually updated\nuser constraints were not violated\n```\n\nVerification changes \"model claims done\" into \"system verified done.\" Minimal implementation:\n\n```\ntype VerificationPlan = {\n  commands: string[];\n  expectedFiles?: string[];\n  successCriteria: string[];\n};\n\n// `session.id` and `session.workspace` are assumed fields on the session object.\nasync function verifyFix(plan: VerificationPlan, session: { id: string; workspace: string }) {\n  for (const command of plan.commands) {\n    // Verification commands go through Execution like any other action.\n    const result = await execution.run({\n      kind: \"shell\",\n      cwd: session.workspace,\n      args: { command },\n      timeoutMs: 120_000,\n      allowedPaths: [session.workspace],\n      sessionId: session.id,\n    });\n\n    if (!result.ok) {\n      // Prefer stderr, but fall back to stdout when stderr is empty.\n      return { ok: false, reason: result.stderr || result.stdout };\n    }\n  }\n\n  return { ok: true };\n}\n```\n\nVerification is not always running tests. Different tasks need different evidence:\n\n```\ndocumentation task: check links, headings, format\nrefactor task: run unit tests, typecheck, lint\ndata task: validate row count, schema, samples\ndeployment task: check health probe, logs, rollback point\nresearch task: preserve source, time, citation chain\n```\n\nPrinciple:\n\n```\ncompletion state cannot come only from model language.\ncompletion state must bind to external evidence.\n```\n\nWithout Verification, Agent easily hallucinates completion. It may edit one file but not run tests, run tests but the wrong command, ignore failure in the summary, or fix only the first error while marking the task complete. Harness must stop these cases.\n\nVerification often sits next to Observability. Observability says what happened. Verification says whether it met the bar. Together they move Agent from \"can do work\" to \"can finish work.\"\n\nThe seventh layer is Governance. 
It asks:\n\n```\nUnder which rules, and for whom, does this Agent work?\n```\n\nIf Execution is low-level runtime restriction and Tools are capability entry protocols, Governance is higher-level policy. It cares not only whether one command may run, but also:\n\n```\ndifferent users have different permissions\ndifferent workspaces have different policies\nwhich tools are read-only by default\nwhich actions require second confirmation\nwhich data cannot enter model context\nwhich logs need redaction\nwhich memory may be saved long-term\nwhich external services may be called\nwhich tasks need human acceptance\n```\n\nFor a personal CLI, Governance can be thin:\n\n```\nonly access current repository\nfile writes must go through auditable patch\ndangerous shell commands must ask the user\n```\n\nIn a team environment, governance quickly grows. Projects have different rules. Some repositories cannot upload code snippets. Some commands cannot run outside CI. Some files contain secrets. Some users can read but not write. Some tasks must leave audit records.\n\nThe model cannot reliably follow these by self-discipline. Harness must turn them into policy.\n\nA simple policy check:\n\n```\nfunction reviewAction(action: Action, session: Session): PolicyDecision {\n  if (!isInsideWorkspace(action.path, session.workspace)) {\n    return { type: \"deny\", reason: \"path outside workspace\" };\n  }\n\n  if (action.kind === \"shell\" && isDestructive(action.command)) {\n    return { type: \"ask_user\", prompt: \"Dangerous command requires confirmation\" };\n  }\n\n  if (action.kind === \"write_file\" && session.mode === \"read_only\") {\n    return { type: \"deny\", reason: \"session is read-only\" };\n  }\n\n  return { type: \"allow\" };\n}\n```\n\nThis code is ordinary, and that is exactly the point. Harness does not solve problems by being \"smarter\"; it solves them by making boundaries clear. This is the fundamental difference between Harness and Agent. 
Agent's value comes from judgment under uncertain tasks. Harness's value comes from control under clear boundaries. They are not intelligence layers above and below each other. They are different responsibility layers.\n\nReturn to the example. A small Claude Code-style CLI Agent that helps the user fix failing tests can start with a minimal Harness like this:\n\n```\nsrc/\n  agent/\n    loop.ts\n    model-client.ts\n  harness/\n    execution.ts\n    tools.ts\n    context.ts\n    lifecycle.ts\n    trace.ts\n    verification.ts\n    policy.ts\n  session/\n    event-log.ts\n    store.ts\n  cli/\n    main.ts\n```\n\nThese are not recommended fixed directory names. They emphasize that responsibilities should not be mixed.\n\n`agent/loop.ts` advances model turns. It should not directly `exec` shell.\n\n`harness/execution.ts` runs commands and file actions. It should not decide how the model thinks next.\n\n`harness/tools.ts` handles tool protocol and observation. It should not secretly bypass permission.\n\n`harness/context.ts` projects this turn's input from session. It should not dump full history into the model.\n\n`harness/lifecycle.ts` handles pause, resume, completion, and failure states. It should not express the world only as `while true`.\n\n`harness/trace.ts` records events and debug information. It should not save only the final answer.\n\n`harness/verification.ts` confirms task completion with external evidence. It should not trust the model saying \"it is fixed.\"\n\n`harness/policy.ts` handles permissions, scope, and governance. 
It should not leave high-risk actions to model self-control.\n\nTogether they produce a load-bearing chain:\n\n```\nuser input\n-> Lifecycle creates session\n-> Context assembles this turn's input\n-> Model returns next-step intent\n-> Tools validate intent structure\n-> Governance decides whether it is allowed\n-> Execution runs the action\n-> Observability records events\n-> Context generates next-turn input\n-> Verification decides whether complete\n```\n\nThis chain is the Harness skeleton. It does not steal reasoning from the model. It does not pretend to be another Agent. It does one thing:\n\n```\nhost the model's dynamic judgment inside a stable engineering control system.\n```\n\nFor a minimal demo, Harness can be thin. But do not make boundaries messy at the start. Even with only local file tools, distinguish:\n\n```\ntool intent\npolicy decision\nexecution result\nsession event\nmodel context\nverification evidence\n```\n\nThese objects feel verbose at first, but they save you when tasks grow, tools multiply, and users increase.\n\nIf Harness is only understood through seven layers, it remains abstract. In real code, Harness professionalism often appears in a small, stable set of events and objects.\n\nFor example, model output should not be called a \"command\" directly. A better name is:\n\n```\nIntent: the action intent proposed by the model.\n```\n\nIntent has not been approved or executed. It is only the model's next step from current context. Calling it Intent forces the system to keep asking:\n\n```\nIs this intent structurally valid?\nWhich tool does it map to?\nWhat is its risk level?\nIs it allowed in the current session?\n```\n\nAfter permission, it becomes:\n\n```\nExecutionRequest: the action request ready for the execution environment.\n```\n\nAfter execution, it becomes:\n\n```\nExecutionResult: the factual result returned by the execution environment.\n```\n\nBut ExecutionResult should not be stuffed into the model unchanged. 
It must be organized into:\n\n```\nObservation: the observation the next model turn can understand.\n```\n\nObservation should include more than stdout:\n\n```\nsuccess or failure\nexit code\nwhether truncated\nchanged files\nerror category\nwhether retryable\nwhether permission or security event was triggered\nwhether verification evidence was produced\n```\n\nThese names may look pedantic, but they solve review and recovery. If the user asks \"why did it change this file,\" you cannot show only a final answer. You need:\n\n```\nModelIntent: why the model proposed this change\nPolicyDecision: why the system allowed execution\nExecutionResult: what actually happened\nObservation: what the model saw next\nVerificationEvidence: basis for completion\nAuditRecord: who approved what\n```\n\nWithout these objects, Harness can only guess from transcript. Transcript is narrative for people, not a source of truth for system recovery and evaluation.\n\nIn the fuller event flow, the point is where facts come from. The model saying \"I need to run tests\" is only `ModelIntent`. Only after tests actually finish is there `ExecutionResult`. Only after failure logs, exit code, and truncation state are organized is there `Observation`. Whether the task can be called complete depends on `VerificationEvidence`.\n\nThat is why messages cannot equal session. Messages are context visible to the model; session event log is the system source of truth. The former may be compacted, reordered, and projected. The latter should stay auditable, replayable, and attributable as much as possible.\n\nEvent modeling also affects eval. Many Agent evaluations that only inspect final answers miss the real issue. An Agent may answer correctly while using a dangerous tool. It may fail, but because the test environment lacks dependencies, not because model judgment was wrong. 
Only a clear event chain can attribute failure to:\n\n```\nmodel judgment?\ntool schema too vague?\npermission policy too broad?\ncontext projection missed key file?\nsandbox environment inconsistent?\nverification command wrong?\n```\n\nWithout event objects, there is no professional failure attribution. Without failure attribution, Harness improvement is guesswork.\n\nCompress the article into three sentences. First, Agent is not the model doing things by itself; it is the model proposing next steps inside a loop. Second, once the next step enters a real environment, it needs Execution, Tools, Context, Lifecycle, Observability, Verification, and Governance. Third, these model-external control responsibilities together are Harness.\n\nSo Harness is not another Agent, not a longer system prompt, not a supervising model, not a tool collection, not a framework name, and not product UI. It is the control system that lets Agent safely enter the real world.\n\nThe next article continues along the natural evolution path:\n\n```\nChat Agent\n-> Tool Agent\n-> Runtime Agent\n-> Managed Agent\n```\n\nThen you will see that Harness is not a huge architecture designed up front. It is an engineering boundary forced to grow every time Agent operates on a little more of the real world.\n\nIn the teaching project, the Harness does not need to be thick at first, but responsibilities must stay separate: Express API orchestrates requests, `runAgentLoop()` handles state transitions, `ToolRegistry` owns tool execution boundaries, `JsonlSessionStore` records facts, and React UI projects messages and events. 
If these objects do not swallow each other, permissions, traces, and resume can be added later without rewriting core.\n\nGitHub source: [00-04-harness-control-system.md](https://github.com/LienJack/build-harness/blob/main/docs/en/00-04-harness-control-system.md)", "url": "https://wpnews.pro/news/harness-base-definition-the-control-system-outside-the-model", "canonical_source": "https://dev.to/lien_jp_db54b8b7fd9fa0118/harness-base-definition-the-control-system-outside-the-model-453m", "published_at": "2026-06-03 03:37:11+00:00", "updated_at": "2026-06-03 03:41:39.624336+00:00", "lang": "en", "topics": ["ai-agents", "ai-infrastructure", "ai-safety", "large-language-models"], "entities": ["Agent", "Harness", "CLI Agent"], "alternates": {"html": "https://wpnews.pro/news/harness-base-definition-the-control-system-outside-the-model", "markdown": "https://wpnews.pro/news/harness-base-definition-the-control-system-outside-the-model.md", "text": "https://wpnews.pro/news/harness-base-definition-the-control-system-outside-the-model.txt", "jsonld": "https://wpnews.pro/news/harness-base-definition-the-control-system-outside-the-model.jsonld"}}