# What Is Loopcraft? From Prompt Engineering to Agent Loop System Design

> Source: <https://dev.to/luhuidev/what-is-loopcraft-from-prompt-engineering-to-agent-loop-system-design-2dff>
> Published: 2026-06-26 10:53:45+00:00

🙋

*I’m Luhui Dev, a developer who has been breaking down Agent engineering and exploring how AI can be applied in education.
I focus on Agent Harness, LLM application engineering, AI for Math, and the productization of education SaaS.*

A new term has been circulating in the Silicon Valley agent world: **Loopcraft**.

My first reaction was: isn't this just putting an agent inside `while true`? A few years ago people called it Agent Loop. Then it became Workflow and Harness Engineering. Now we have Loopcraft. The AI industry never stops inventing new names.

But after following recent discussions from Peter Steinberger, Claude Code lead Boris Cherny, and Andrej Karpathy around agent loops, I do think something real is changing.

Peter Steinberger put it this way:

> You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.

In other words, you should not manually prompt a coding agent again and again. You should design a loop that prompts the agent for you.

Boris Cherny said something similar:

> I don’t prompt Claude anymore. I write loops. The loops do the work.

Karpathy made a related point when introducing Autoresearch: **if a human still has to inspect every result, decide the next step, and give the agent another instruction, the human becomes the throughput bottleneck of the whole system.**

Put together, these comments point to an abstraction shift:

```
Before:
Human -> Prompt -> Agent -> Result

Now:
Human -> Design the loop
             ↓
Task discovery -> Agent execution -> Automatic verification -> Retry on failure -> Save state -> Continue running
```

My shortest definition is: **Prompt Engineering optimizes a single interaction. Loopcraft optimizes the whole system that runs repeatedly.**

Loopcraft is less interested in how to complete one isolated task and more interested in questions like:

This article breaks down three questions:

For the past two years, the typical way to use a coding agent looked roughly like this:

```
Tell the agent what to do
-> Wait for code changes
-> Review the result
-> Tell it what is wrong
-> Let the agent continue
-> Review again
```

The model can already write code, search files, and run tests. But the whole process is still driven step by step by a human.

After each round, the agent stops and waits for the next instruction.

On the surface, the human is using the agent. From another angle, the human is acting as the scheduler, state machine, and verifier of the agent system.

So even if the model is fast, the human still cannot walk away. The asynchronous, mobile supervision features that many agent products have shipped are one attempt to relieve this bottleneck.

That is the problem behind the recent loop discourse: **do not automate only one step inside the work. Design the surrounding system for task discovery, assignment, verification, and continuation.**

For example, fixing a CI failure used to look like this:

```
I see a CI failure -> Open Codex -> Copy the error log -> Ask it to analyze -> Review the diff -> Ask it to run tests -> Confirm green -> Manually create a PR
```

Inside a loop, it can become:

```
CI failure event -> Automatically read logs -> Decide whether the problem is safe to automate -> Start an agent in an isolated worktree -> Modify code -> Run tests and lint -> A second verifier checks the diff -> Create a PR when it passes -> Notify a human when it cannot proceed
```

The real automation here is not just code editing. It is the closed loop around code editing.

So Loopcraft is not a new model capability, and it is not one specific framework.

It is closer to an agent system design discipline: **organizing task execution, result verification, event triggers, state persistence, and system improvement into nested loops.**

The name is new. The underlying technical pieces are not.

We already had:

Even the simplest Ralph Loop is basically repeated invocation of a coding agent:

```
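# Re-invoke the agent until interrupted (Ctrl-C).
# -p runs Claude Code in non-interactive print mode, so each call exits when it finishes.
while true; do
  claude -p "Read the task and current progress, then continue the work"
done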
```

This is where the terms are easiest to confuse.

Over the past year, Agent Harness has already become a popular concept.

Anthropic's definition is clear: a harness is the system that enables a model to work as an agent, including context handling, tool use, permissions, environment, state management, and result return.

Put simply, Harness answers: **what environment does this agent work in?**

Loopcraft answers a different question: **when is this agent started, why does it continue running, who checks the result, and what should happen in the next round?**

A simplified analogy:

```
Model: the worker's brain
Tools: the tools in the worker's hands
Harness: the worker's workstation and work environment
Loop: the factory cadence, quality control, and task scheduling
Loopcraft: how to design and layer the whole production loop system
```

In practice, the boundary is not absolute.

A mature long-running harness already includes retries, verification, and state handoff. A loop also depends on the harness for tools and execution environment.

I prefer to separate them by focus:

| Concept | Main question |
|---|---|
| Prompt Engineering | What instruction should the model see in this round? |
| Context Engineering | What information should the model see right now? |
| Tool Engineering | What actions can the agent take? |
| Harness Engineering | How can one agent run happen reliably? |
| Loopcraft | How are repeated runs triggered, verified, connected, and improved? |

Loopcraft does not replace Harness.

In fact, **without a stable harness, a loop just manufactures errors automatically and continuously.**

LangChain later broke Loopcraft into four practical layers. I find the breakdown useful.

The innermost layer is the agent loop we already know:

```
The model reasons
-> Calls a tool
-> Reads the tool result
-> Continues reasoning
-> Stops when it believes the task is done
```
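A minimal sketch of that inner loop, assuming a scripted stand-in for the model so it runs without any API. The names `TOOLS`, `scripted_model`, and `agent_loop` are hypothetical and only illustrate the reason, call tool, read result, stop shape.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"found 2 files mentioning {query!r}",
}

def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM call: asks for one search, then declares it is done."""
    if not any(line.startswith("tool:") for line in history):
        return {"action": "search", "input": "broken link"}
    return {"action": "finish", "input": "Fixed the broken link in README.md"}

def agent_loop(goal: str, model=scripted_model, max_steps: int = 10) -> str:
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        step = model(history)                  # the model reasons and picks an action
        if step["action"] == "finish":         # it believes the task is done
            return step["input"]
        output = TOOLS[step["action"]](step["input"])           # call the tool
        history.append(f"tool: {step['action']} -> {output}")   # feed the result back
    raise RuntimeError("step budget exhausted before the model finished")

print(agent_loop("fix the docs"))
```

The `max_steps` cap is an addition of this sketch: even the innermost loop benefits from a hard stop in case the model never decides it is finished.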

For example, a documentation agent can:

```
Read an issue -> Search the repository -> Edit Markdown -> Check links -> Create a PR
```

An agent saying "done" does not mean the task is actually done. So we wrap the agent in a verification layer:

```
Agent executes
-> Verifier checks
-> If it fails, return concrete feedback
-> Agent executes again
-> Repeat until it passes or the budget is exhausted
```

The verifier can be unit tests, type checks, lint, schema validation, and so on.

One important principle: **try not to let the same entity that writes the answer also grade its own exam.**
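The execute, verify, feed back cycle can be sketched as follows. This is an illustration under assumptions, not any framework's API: `run_with_verification`, `toy_agent`, and `toy_verifier` are hypothetical names, and the toy agent simply fixes its bug once it receives feedback.

```python
from typing import Callable, Optional

# A verifier returns None on success, or concrete feedback text on failure.
Verifier = Callable[[str], Optional[str]]

def run_with_verification(agent: Callable[[str, Optional[str]], str],
                          verifier: Verifier, goal: str, budget: int = 3):
    feedback = None
    for attempt in range(1, budget + 1):
        result = agent(goal, feedback)     # agent sees the previous feedback, if any
        feedback = verifier(result)        # a separate component grades the result
        if feedback is None:
            return result, attempt
    return None, budget                    # budget exhausted without a pass

def toy_agent(goal: str, feedback: Optional[str]) -> str:
    # The first draft has a bug; it is only fixed after the verifier points it out.
    return "def add(a, b): return a + b" if feedback else "def add(a, b): return a - b"

def toy_verifier(source: str) -> Optional[str]:
    namespace: dict = {}
    exec(source, namespace)                # acceptable for a toy; sandbox real code
    got = namespace["add"](2, 3)
    return None if got == 5 else f"add(2, 3) returned {got}, expected 5"

result, attempts = run_with_verification(toy_agent, toy_verifier, "write add()")
print(attempts, result)
```

Keeping the verifier as a separate callable is the point of the design: the agent never decides for itself whether it passed.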

Once execution and verification are in place, the next step is removing manual startup.

Tasks can be triggered by real events. The agent is no longer just a chat tool; it becomes a background component in a business system.

```
Event
-> Deterministic rule decides whether to handle it
-> Start agent
-> Verify result
-> Update the real system
```
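The "deterministic rule" step is worth making concrete, because it is plain code rather than a model call. A minimal sketch, assuming a CI failure event; the `CIEvent` fields, the allowed job names, and the blocked path prefixes are all hypothetical policy choices.

```python
from dataclasses import dataclass

@dataclass
class CIEvent:
    repo: str
    branch: str
    failed_job: str
    changed_paths: list[str]

# Hypothetical policy: which failures are considered safe to hand to an agent.
AUTOMATABLE_JOBS = {"lint", "unit-tests", "docs-links"}
BLOCKED_PREFIXES = ("infra/", "migrations/", ".github/")

def should_handle(event: CIEvent) -> bool:
    """Pure rule check: no model involved, so the decision is repeatable and auditable."""
    if event.branch == "main":
        return False
    if event.failed_job not in AUTOMATABLE_JOBS:
        return False
    return not any(path.startswith(BLOCKED_PREFIXES) for path in event.changed_paths)

def on_event(event: CIEvent, start_agent, notify_human):
    # Route the event: automate when the rule allows it, otherwise hand off.
    return start_agent(event) if should_handle(event) else notify_human(event)

safe = CIEvent("org/app", "fix/typo", "lint", ["src/app.py"])
risky = CIEvent("org/app", "fix/db", "unit-tests", ["migrations/0042_add_column.py"])
print(should_handle(safe), should_handle(risky))
```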

The first three layers automate work.

The fourth layer starts automating how the work gets better.

Every agent run leaves a trace:

An outer system can periodically analyze these traces:

```
Collect many run records
-> Identify frequent failure modes
-> Adjust prompts, tools, skills, or verifiers
-> Re-test on an eval set
-> Update the harness after passing
```
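As a rough sketch of the first and last steps of that outer loop, assume each run record is a dict with a `verdict` and a `failure_mode` label. Both field names, and the rule that a change is promoted only if it strictly improves the eval pass rate, are assumptions made for illustration.

```python
from collections import Counter

def frequent_failure_modes(traces: list[dict], min_share: float = 0.2):
    """Return (failure_mode, count) pairs that make up at least min_share of failures."""
    failures = [t["failure_mode"] for t in traces if t["verdict"] != "passed"]
    if not failures:
        return []
    counts = Counter(failures)
    return [(mode, n) for mode, n in counts.most_common()
            if n / len(failures) >= min_share]

def pass_rate(results: list[bool]) -> float:
    return sum(results) / len(results)

def should_promote(baseline: list[bool], candidate: list[bool]) -> bool:
    """Gate a harness change on the same eval set: promote only on improvement."""
    return pass_rate(candidate) > pass_rate(baseline)

traces = [
    {"verdict": "passed", "failure_mode": None},
    {"verdict": "failed", "failure_mode": "flaky_test"},
    {"verdict": "failed", "failure_mode": "flaky_test"},
    {"verdict": "failed", "failure_mode": "flaky_test"},
    {"verdict": "failed", "failure_mode": "missing_context"},
]
print(frequent_failure_modes(traces, min_share=0.5))
```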

This is the layer where Loopcraft becomes most valuable.

An ordinary loop repeats work. A hill-climbing loop changes the system that produces the work.

```
Ordinary loop:
Failure -> Try again

Improvement loop:
Failure -> Analyze why it failed
        -> Modify prompts, tools, or verification rules
        -> Make future runs more reliable
```

The outer loop's arrow does not just go back to the beginning of the task. It reaches into the agent and changes the inner loop.

That is where compounding starts.

Karpathy's Autoresearch is a good concrete example for understanding Loopcraft.

The project is conceptually simple:

```
Agent proposes a training improvement
-> Modify train.py
-> Run training for a fixed five minutes
-> Read the val_bpb metric
-> Keep the change if the metric improves
-> Roll back if the metric worsens
-> Start the next experiment
```

With each training run fixed at five minutes, it can run about 12 experiments per hour without human intervention. Over a night of roughly eight hours, that adds up to close to 100 experiments.
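The keep-or-roll-back rule is a simple hill climb. The sketch below is not Autoresearch's code: it swaps the real five-minute training run for a toy loss function, and `hill_climb`, `toy_loss`, and `toy_propose` are hypothetical names. It assumes lower is better, as is usual for a bits-per-byte validation metric.

```python
import random

def hill_climb(evaluate, propose, initial, rounds: int):
    best, best_score = initial, evaluate(initial)
    log = []
    for _ in range(rounds):
        candidate = propose(best)          # the agent proposes a change
        score = evaluate(candidate)        # fixed-budget run, then read the metric
        kept = score < best_score          # keep only strict improvements
        if kept:
            best, best_score = candidate, score
        # Otherwise the candidate is discarded, which is the rollback.
        log.append((score, kept))
    return best, best_score, log

rng = random.Random(0)                     # seeded so the toy run is repeatable
toy_loss = lambda x: (x - 3.0) ** 2        # stand-in for training plus evaluation
toy_propose = lambda x: x + rng.uniform(-1.0, 1.0)

best, best_score, log = hill_climb(toy_loss, toy_propose, initial=0.0, rounds=50)
print(f"kept {sum(kept for _, kept in log)} of {len(log)} experiments")
```

The loop works only because `evaluate` is cheap, automatic, and returns a single comparable number. That property comes from the environment, not from the prompt.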

The clever part is not a fancy agent prompt. It is that Karpathy reshaped the problem into an environment that is ideal for loop optimization:

That is the core shift of Loopcraft: **humans move from directly doing the task to designing a system that can repeatedly do, verify, and improve the task.**

Autoresearch is a special environment. Ordinary developers can start with something simpler:

Automatically receive a small issue, attempt a fix, create a PR after tests pass, and retry with feedback when it fails.

Do not start with multi-agent orchestration. A minimal loop needs only six parts:

The whole flow can be simplified to:

```
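MAX_ATTEMPTS = 3  # stopping condition: a hard cap on retries

for attempt in range(MAX_ATTEMPTS):
    # load_state() is assumed to return saved progress, including the last verdict
    result = run_agent(goal, load_state())
    verdict = verify(result)       # independent check, not the agent's own claim
    save_state(result, verdict)    # persist so the next attempt can recover

    if verdict == "passed":
        create_pull_request()
        break

    if verdict != "retryable":
        notify_human()             # non-retryable failure: hand off immediately
        break
else:
    notify_human()                 # every attempt was retryable but none passed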
```

Whether you use Claude Code, Codex, GitHub Actions, Bash, or Python is not the important part.

What matters is designing this chain clearly:

```
Trigger -> Execute -> Verify -> Feedback -> Retry or exit
```

As long as a task has a clear goal, reliable feedback, recoverable state, and a stopping condition, you already have a minimal loop.

Running repeatedly is not the same as improving.

If the agent receives no new feedback, repeating ten times usually means spending ten times the tokens to make similar mistakes.

The execution agent should not freely modify tests, evaluation metrics, time budgets, permission boundaries, or verifier prompts.

Otherwise it may not be making the task better. It may only be making "pass" easier.
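One way to enforce this boundary is a deterministic check on the agent's diff before any verifier runs. A minimal sketch, assuming the changed file paths are already available as a list; the patterns in `PROTECTED_PATTERNS` are hypothetical and would depend on the repository.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for files the execution agent must not modify.
PROTECTED_PATTERNS = ["tests/*", "evals/*", "verifier_prompts/*", ".github/workflows/*"]

def protected_changes(changed_paths: list[str],
                      patterns: list[str] = PROTECTED_PATTERNS) -> list[str]:
    """Return the changed paths that fall inside a protected area."""
    # fnmatchcase's * also matches "/", so "tests/*" covers nested directories.
    return [path for path in changed_paths
            if any(fnmatchcase(path, pattern) for pattern in patterns)]

diff = ["src/app.py", "tests/unit/test_app.py"]
violations = protected_changes(diff)
if violations:
    print("rejecting run, agent touched protected files:", violations)
```

Rejecting the run outright, rather than asking the agent to justify the change, keeps the rule outside the agent's influence.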

Multiple agents do not automatically create intelligence. They first create more token cost, file conflicts, duplicate work, and state synchronization problems.

Get one worker, one verifier, and one persistent state path working before adding parallelism.

Agent count, runtime, token usage, and tool-call count measure activity, not the value delivered.

What matters is **verified progress per unit cost**.

Examples include issue auto-resolution rate, average cost per qualified PR, and human handoff ratio.
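These three metrics can be computed from per-run records. The sketch below assumes a hypothetical `Run` record with a cost in dollars and two flags. The field names and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Run:
    issue_id: int
    cost_usd: float
    verified: bool      # an independent verifier accepted the resulting PR
    handed_off: bool    # a human had to take over

def loop_metrics(runs: list[Run]) -> dict:
    if not runs:
        raise ValueError("need at least one run")
    total = len(runs)
    verified = sum(r.verified for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "resolution_rate": verified / total,
        # Failed runs still cost money, so they count toward the cost of each success.
        "cost_per_verified_pr": total_cost / verified if verified else float("inf"),
        "handoff_ratio": sum(r.handed_off for r in runs) / total,
    }

runs = [
    Run(101, 1.0, verified=True, handed_off=False),
    Run(102, 2.0, verified=False, handed_off=True),
    Run(103, 3.0, verified=True, handed_off=False),
    Run(104, 2.0, verified=False, handed_off=False),
]
print(loop_metrics(runs))
```

Dividing total spend by verified PRs, not by all runs, is what makes the number reflect verified progress per unit cost.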

This is the risk I care about most.

When an agent can automatically write code, test it, fix it, and create a PR, humans may be tempted to look only at the final green check.

But the faster the system produces code, the faster human understanding of that system can decline.

Loopcraft should not become an excuse to stop thinking. It actually raises the bar for how much the human has to understand.

I increasingly feel that agent engineering is going through an abstraction shift.

At first we discussed prompts. Then we moved to context, tools, memory, and harnesses.

Now the focus is moving outward again: how to put a single agent run inside a larger cycle of tasks, verification, and improvement.

I remain skeptical of fully removing humans from the loop.

But I agree with one thing:

**Do not only fix the current result produced by the agent. Start fixing the system that keeps producing those results.**