Autonomous Long-Running Coding Agents

Engineers are shifting from better prompting to better control systems for autonomous coding agents, designing systems with goals, evaluators, loops, and artifacts that allow agents to work without constant human steering. This approach, demonstrated in Claude Code's /goal mode and /loop command, uses verifiers like test suites and agent evaluators to ensure reliable progress on long-horizon engineering tasks.

Autonomous Long-Running Coding Agents What is the big deal with loop engineering and autonomous long-running agents. Autonomous coding is moving from better prompting to better control systems. The important shift is that engineers are learning how to wrap agents in goals, evaluators, loops, and artifacts that let them keep working after the human stops typing. This matters because most serious engineering work spans long horizons: ambiguous requirements, hidden constraints, partial failures, changing context, and repeated verification. The new frontier is designing the system around the agent so it can plan, execute, check its work, recover from mistakes, and keep making progress without constant human steering. This piece is based on a DAIR.AI Academy session on autonomous long-running coding agents, where I walked through Claude Code’s /goal mode, the newer /loop command, verifiers, artifacts, and orchestration patterns in practice. Written in collaboration with Codex and Claude Code. From Prompting to Goal Design The core idea behind features like Claude Code’s /goal is simple. A coding agent remains the executor, but the human no longer interacts with it turn by turn. Instead, the human specifies the desired end state, the evidence required to prove success, the constraints that must not be violated, and, where possible, the number of turns and budget. That goal works more like a contract than a longer prompt. A weak goal gives the model room to stop early, take shortcuts, or redefine success in a way that looks plausible in the transcript but fails in the real system. A strong goal gives the agent a target it can repeatedly measure itself against. Engineering judgment still matters here. The best goals encode domain knowledge that the model would otherwise guess. For a research experiment, that might mean a target benchmark score, a held-out evaluation, a required loss curve, and a rule that the result must beat an initial baseline. For a UI task, it might mean a screenshot reference, concrete layout constraints, and a browser verification step. The model can execute, but the human still defines what “done” actually means. The Evaluator Becomes a First-Class Component Long-running agents need a second role besides the goal. That evaluator can be another coding agent, an LLM-as-judge, a script, a test suite, a benchmark harness, or a mix of all of them. The key design choice is matching the evaluator to the task. When success is crisp, deterministic checks are better. Type checks, unit tests, lint rules, integration tests, and benchmark scripts should be used whenever they can express the condition clearly. When success is fuzzy, an agent evaluator becomes useful. A script can tell you whether tests pass, but it cannot easily decide whether a generated research report is coherent, whether an implementation faithfully follows a paper, or whether a UI matches a design intent. This is where the evaluator benefits from language, judgment, and sometimes vision. The practical pattern uses deterministic checks as the floor and agent evaluation as the higher-level review. That combination reduces hallucinated success while still allowing autonomy on tasks that do not fit cleanly into a test assertion. Verifiers Define the Boundary of Trust The deeper point is that autonomy only works when the system has a reliable verifier. A coding agent can generate a plan, implement a feature, and explain why it believes the work is complete, but that explanation should not be treated as evidence. Evidence comes from an external check that the agent cannot easily talk its way around. For code, the verifier might be a test suite, type checker, benchmark, browser run, screenshot comparison, or reproducible script. For research work, it might be a held-out evaluation, a reproduced table, a loss curve, or a benchmark score that improves over the baseline. For design work, it might be a reference screenshot plus a visual review step. The verifier is what turns a long-running agent from a confident text generator into a system that can be trusted with more time. Most shortcuts appear at this boundary. If the verifier is vague, the model will often satisfy the easiest interpretation of the task. If the verifier is too narrow, the model may overfit to it and miss the broader intent. A good autonomous workflow, therefore, needs layered verification, with cheap deterministic checks catching basic failures and higher-level review catching judgment-heavy failures. A few of the frontier models can already achieve some level of verification, but based on my research, there is still an evident OOD problem, where if the verification task you assign to the agent falls outside the training distribution, models struggle significantly. Verifiers are still an open area of research, but I anticipate more companies will start to make huge investments in this area. The concept of fine-tuned verifiers is also in high demand in the enterprise. Loops Make Autonomy Durable A goal gives the agent direction, but a loop keeps the work alive. This distinction is important because models often stop before the real task is finished. They may hit a turn limit, lose confidence, exhaust context, or decide that a partial solution is enough. The loop is the outer control system. It wakes up, inspects progress, runs checks, compares the result against the goal, and sends the agent back in with the next instruction when the goal has not been met. In its simplest form, this is the Ralph loop pattern with a coding agent and a deterministic condition. In a more flexible form, the loop includes an evaluator agent that can reason about progress and decide what should happen next. Long-running autonomy works as repeated effort under supervision from a control layer, not as one continuous act of intelligence. The agent can still fail, but the loop gives the system a way to notice the failure and continue instead of silently declaring victory. Planning Is Where Expertise Enters One of the strongest themes from the session was that planning remains critical. You can ask a frontier model to generate a plan, but you still need to inspect it, challenge assumptions, and make the success criteria sharper before handing the task to an autonomous loop. This leads to a useful division of labor. A stronger planning model can help define the goal, identify missing constraints, and structure the evaluation. A different execution model can then run the implementation once the plan is clear. In practice, this means engineers should stop thinking of “the model” as a single choice. Model choice becomes an architecture decision. Some models are better planners. Some are better executors. Some are cheaper evaluators. Some are better at vision-based review. A good orchestrator lets you swap these roles instead of waiting for one vendor to provide the perfect coding agent interface. Visual Artifacts Become Control Surfaces Terminal transcripts do not scale when many agents are running. Once you have several sessions working in parallel, raw text becomes a poor interface for understanding progress. Live artifacts matter because a dashboard with loss curves, benchmark scores, task states, screenshots, cost estimates, and recent decisions gives the human a much better way to supervise autonomy. The artifact becomes the control surface for deciding when to intervene, rather than a report generated after the fact. The most useful pattern is to separate storage from presentation. Markdown or a vault can store durable evidence, logs, notes, plans, and results. HTML artifacts can render that state into something visual and interactive. The agent can search the Markdown, while the human can monitor the artifact. For UI and product work, visual cues are especially powerful. A screenshot reference can communicate design intent more precisely than prose, and a vision-capable evaluator can compare the implementation against that reference. This reduces the common failure mode where the agent technically implements the requested component but misses spacing, hierarchy, alignment, or product feel. Session Mining Turns Usage Into Memory Another important insight is that past agent sessions are a rich source of workflow data. If an agent repeatedly fails in the same way, forgets to run the same check, uses the wrong path, or retries the same broken command, that pattern should not stay buried in logs. Session mining turns those transcripts into operating rules. An agent can scan the last thirty days of work, find recurring failure modes, and propose updates to project instructions, vault learnings, or agent rules. This is how a team can gradually improve its harness without manually remembering every mistake. The goal is to make the local environment smarter without training a model from scratch. A small rule in an agent instruction file can prevent repeated failures across future sessions, especially when the rule is specific to the project. A Practical Operating Model For AI engineers, the emerging workflow looks like this. Start with a small, cheap subset before launching the full autonomous run. Write a goal with measurable success criteria, explicit constraints, and a turn or time budget where possible . Separate the executor from the evaluator so implementation and judgment are not collapsed into one role. Define external verifiers before the long-running loop starts. Use deterministic checks wherever possible, then add agent review for fuzzy criteria. Require proof artifacts such as logs, screenshots, benchmark curves, or changed files. Mine past sessions and promote repeated lessons into project instructions. That is the difference between using a coding agent and engineering an autonomous coding system. One gives you a conversation. The other gives you a harness. What Still Breaks None of this removes the hard problems. Agents still take shortcuts. They still stop early. They still overestimate completion. They still produce confident but weak plans, especially on recent papers, unfamiliar benchmarks, or systems outside their training distribution. Trusting them more will not solve this. Better control systems will. Goals, loops, evaluators, deterministic checks, visual artifacts, and session memory are all ways of making autonomy observable and correctable. The direction is clear. The future of coding agents depends on better orchestration around more capable models, where engineers design the conditions under which agents can safely run for hours or days and still produce work that can be verified.