cd /news/ai-agents/self-improving-agents-still-need-hum… · home topics ai-agents article
[ARTICLE · art-33150] src=goose-docs.ai ↗ pub= topic=ai-agents verified=true sentiment=· neutral

Self-Improving Agents Still Need Humans

Goose, an AI coding agent, still requires human oversight to prevent benchmark overfitting and ensure genuine capability improvements. The team uses Terminal-bench with a weaker model to identify failure patterns, then has humans analyze and guide agent improvements rather than relying on self-improvement loops that could lead to benchmark tricks.

read4 min views1 publishedJun 18, 2026

Goodhart's law is the benchmarker's curse: when a measure becomes a target, it stops being a good measure. Coding-agent benchmarks are almost designed to trigger it. The tasks are public, the result is one number, and the leaderboard inevitably fills up with harnesses that are, often without meaning to be, overfit to the benchmark.

That does not make the go-to standard Terminal-bench useless, but it does change how the goose team uses it. The leaderboard is a noisy measure of general agent ability. The signal is a pattern of failures: places where goose keeps getting stuck or where goose fails and another harness succeeds.

That is also why we usually benchmark with Sonnet rather than the strongest model available. We are not trying to get the largest possible number. We want enough failures left on the table to see what support the agent is missing.

Self-improving agents are all the rage, but the version we trust requires humans for now. The loop is to run the benchmark, have goose compare a task where one harness succeeded and another failed, and ask it to explain the difference in concrete terms. A human then looks across a few of those failures, decides what the general lesson is, and asks goose to implement that broader improvement.

The human step keeps the loop from collapsing into benchmark tricks. Without it, self-improving agents might choose to just write one Skill per task; agents are just as lazy as humans. With it, we can hopefully turn a failure on one task into a capability that should help on tasks we have not seen yet.

The tooling for this lives in evals/harbor in the goose repo. The main entry point is

./evals/harbor/cmd.py

, a little Python command with subcommands for run

, list

, show

, task

, compare

, and pull

. It wraps Harboraround a particular goose binary, model, and set of extensions, then leaves a directory full of per-task JSON and logs.

This is a place where Python works really well. cmd.py

is a PEP 723 uv script, so the dependencies sit at the top of the file. There is no packaging ceremony, no new repo, and no remembering which virtualenv was blessed. You ask an agent to add a subcommand, instrument a value, or make the output table less terrible, and it just does it. The script becomes both the control surface and the documentation of the experiment. This is where vibe coding shines: small tools that express what needs to be done and where it really doesn't matter if they are coded elegantly.

The actual benchmark runs happen on a remote Linux machine, because a full run is slow, parallel, and Docker-heavy work. But the analysis does not have to stay there. cmd.py pull tbench@...:/path/to/goose

rsyncs the run directories back down and then the local tool can list runs, inspect one task, or compare two jobs. That matters more than it sounds. If every question requires SSHing into the benchmark box and spelunking through logs by hand, you ask fewer questions.

The comparison recipes in PR #9637 implement the step where goose looks at the result. They read two runs for the same task, the agent trajectories, the verifier output, and the reference solution. The useful output explains the mechanism: "A noticed the image, B never opened it" or "A stopped after the correct file existed, B kept rewriting it until it broke."

Here are two findings from recent work. First, goose would sometimes keep exploring after it had enough information to finish. It would be near the answer, but instead of concluding, it would keep poking and eventually run out of turns, failing the benchmark. The same thing can happen in a normal conversation: if goose does not know how far it is into the exchange, it has no real reason to stop exploring and steer toward a solution. PR #9636 added turn count awareness to MoIM, the context goose injects to keep the model oriented. So now goose knows when it is time to call it a day.

Second, goose had lost the ability to read images from disk. Originally, when you dragged an image into a goose conversation, we inserted the path to that image and gave the model a tool to read it. That was clunky, so we replaced it with proper image in the conversation. In the process, we accidentally deleted the disk image tool too. In the benchmark this showed up as goose using PIL to inspect an image instead of using the visual abilities of the model. The comparison recipe made the failure obvious because the successful run looked at the image and the failing run worked around it.

Thanks to the loop, we found two real weaknesses in goose and fixed them. The benchmark improved, which is nice, but more importantly, both fixes should make goose better in normal use: less likely to drift when it should finish and able again to look at an image file sitting on disk.

That is when a benchmark is useful: when it stops being a leaderboard and starts being a bug report.

── more in #ai-agents 4 stories · sorted by recency
── more on @goose 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/self-improving-agent…] indexed:0 read:4min 2026-06-18 ·