[ARTICLE · art-28882] src=dev.to ↗ pub= topic=artificial-intelligence verified=true sentiment=↑ positive

Karpathy's "Autoresearch" Just Went Viral — Here's How Software Engineers Can Actually Use the Pattern at Work

Andrej Karpathy's open-source 'autoresearch' repo introduces a pattern for structuring AI-agent work that generalizes beyond machine learning. The pattern divides responsibility into a fixed evaluator, a modifiable implementation, and human-authored instructions, enabling AI agents to run unattended experiments overnight. Software engineers can apply this pattern to any problem with objective measurement, changeable components, and clear success criteria.

9 min read · Published Jun 16, 2026

Forget neural networks for a second. The real idea inside this repo is a blueprint for letting AI agents run unattended overnight — and it maps onto problems you already have on your team.

If you've been anywhere near tech Twitter or LinkedIn this week, you've probably seen people losing their minds over a small GitHub repo called autoresearch, published by Andrej Karpathy — former Tesla AI director and OpenAI founding member. The framing is dramatic: an AI agent that runs machine learning experiments on its own, overnight, while you sleep. Tweak the code, train for five minutes, check if it got better, keep it or throw it away, repeat. Wake up to a log of a hundred experiments and a model that's quietly improved itself.

If you're not an ML researcher, your instinct might be to scroll past. "Cool, but I don't train neural networks. How does this apply to me?" Here's the thing — the neural network part is almost incidental. What Karpathy actually open-sourced is a pattern for structuring AI-agent work: a specific way of dividing responsibility between human and AI that happens to generalize to a huge range of engineering problems. Once you see the pattern, you start noticing places in your own job where it fits.

The repo itself is intentionally tiny — and that's the point. There are really only three files that matter:

The evaluator (untouchable). A file containing the fixed constants, data preparation, and the scoring logic. The agent is never allowed to modify this. It's the ruler everything else gets measured against.

The implementation (the agent's playground). A single file containing the actual model, training loop, and hyperparameters. This is the only file the agent is allowed to change. Architecture, batch size, optimizer — all fair game.

The instructions (the human's only job). A plain Markdown file describing what the agent should try, what the constraints are, how to interpret results, and what to do when something breaks. Karpathy calls this "programming the research org in Markdown."

The loop itself is almost embarrassingly simple: the agent reads the instructions, forms a hypothesis, edits the implementation file, runs a fixed time-boxed experiment (exactly 5 minutes), checks a single number, and decides — keep the change or revert it with git. Then it does it again. Roughly 12 times an hour. Around 100 times overnight.
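The loop described above can be sketched in a few lines. This is an illustrative outline, not code from the repo: `propose`, `evaluate`, `keep`, and `revert` are hypothetical callables standing in for "edit the implementation", "run the fixed evaluator", "commit", and "revert", and the sketch assumes a lower score is better.

```python
from typing import Callable, List, Tuple


def run_loop(
    propose: Callable[[int], None],  # hypothetical: edits the implementation
    evaluate: Callable[[], float],   # hypothetical: fixed evaluator, lower is better
    keep: Callable[[], None],        # hypothetical: e.g. commit the change
    revert: Callable[[], None],      # hypothetical: e.g. restore the last commit
    iterations: int,
) -> Tuple[float, List[Tuple[int, float, bool]]]:
    """Run the keep-or-revert loop; return the best score and a log."""
    best = evaluate()  # baseline score before any change
    log = []
    for i in range(iterations):
        propose(i)
        score = evaluate()
        improved = score < best
        if improved:
            best = score
            keep()
        else:
            revert()
        log.append((i, score, improved))
    return best, log


# Toy stand-in: the "implementation" is a single number and the metric is
# its distance from 10. A real run would edit a file and time-box a job.
state = {"x": 0.0, "saved": 0.0}
candidates = [4.0, 2.0, 9.0, 15.0, 10.5]

best, log = run_loop(
    propose=lambda i: state.update(x=candidates[i]),
    evaluate=lambda: abs(state["x"] - 10.0),
    keep=lambda: state.update(saved=state["x"]),
    revert=lambda: state.update(x=state["saved"]),
    iterations=len(candidates),
)
print(best, [kept for _, _, kept in log])
```

The log is returned alongside the best score because the record of what was tried, including the reverted attempts, is part of the output a human reviews.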

One detail that's easy to miss but matters a lot: the philosophy baked into the instructions explicitly favors deleting code while maintaining performance as a win, and treats tiny improvements that add complexity as not worth keeping. That's a value judgment a human encoded once — and the agent now applies it autonomously, every cycle, for as long as it runs.
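One way to make that value judgment concrete is a small decision rule. The sketch below is an assumption about how such a rule could look, not the repo's logic: the 1% threshold (`min_gain`), the use of line count as the complexity proxy, and the function name are all illustrative, and it assumes a positive score where lower is better.

```python
def should_keep(old_score: float, new_score: float,
                old_loc: int, new_loc: int,
                min_gain: float = 0.01) -> bool:
    """Decide whether a change is worth keeping. Lower score is better.

    min_gain is an assumed threshold: the relative improvement a change
    must deliver to justify adding code.
    """
    gain = (old_score - new_score) / old_score  # relative improvement
    if new_loc < old_loc:
        # Less code is a win as long as performance does not get worse.
        return gain >= 0
    if new_loc == old_loc:
        # Same size: keep any real improvement.
        return gain > 0
    # More code: only worth it if the gain clears the threshold.
    return gain >= min_gain


print(should_keep(1.0, 1.0, old_loc=100, new_loc=80))    # deletion, same score
print(should_keep(1.0, 0.995, old_loc=100, new_loc=120)) # tiny gain, more code
```

The first call keeps a pure deletion with no performance change. The second rejects a 0.5% gain that costs 20 extra lines.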

Strip away the machine learning context, and what you're left with is this:

Evaluator (fixed, automatic, trustworthy) + Implementation (the thing being improved) + Direction (human-authored intent and constraints) = a loop an AI agent can run unattended, with a built-in safety mechanism: revert if it didn't help.

This is exactly the shape of a huge number of problems that already exist on software teams — we just don't usually frame them this way. Anywhere you have something that can be objectively measured, something that can be changed, and a clear sense of what "better" means, this pattern applies.

The biggest shift this represents isn't technical. It's about where your effort goes. In the old model, you spend your time editing the implementation: writing the code, tweaking the config, trying the next idea yourself. In this model, you spend your time writing and refining the evaluator and the direction file. The agent burns through the implementation iterations. You define what "good" means and let volume do the work.

Karpathy's own phrase for this is "programming the programmer." You're not writing the training script anymore. You're writing the thing that writes it, over and over, faster than you ever could by hand.

If you've ever had a slow function, query, or API endpoint that "could probably be faster" but nobody has time to dig into, this is the cleanest possible fit.

The evaluator: A benchmark script that runs the function with realistic inputs and reports a number: latency, throughput, or memory.

The implementation: The function or module itself.

The direction file might say:

Your goal is to reduce p95 latency of the batch processing function without changing its public interface or breaking any existing tests. Run the benchmark after every change. If latency improves and all tests pass, commit the change. If not, revert. Prefer simpler implementations — if two versions perform similarly, keep the one with less code. Stop after 15 attempts or once you've had 3 consecutive attempts with less than 1% improvement.

Set this running against a non-critical service, come back in an hour, and review the diff and the experiment log — not just the final code.
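The two mechanical pieces of that direction file, the p95 metric and the stopping rule, are simple enough to pin down in code so the agent and the reviewer agree on them. This is a sketch under stated assumptions: it uses the nearest-rank percentile definition, and it defines each attempt's "improvement" as the relative gain over the best result so far (negative when the attempt regressed). Both function names are illustrative.

```python
from typing import List


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of latency samples."""
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_stop(gains: List[float],
                max_attempts: int = 15,
                patience: int = 3,
                min_gain: float = 0.01) -> bool:
    """gains[i] is the relative improvement of attempt i over the best so far.

    Stop after max_attempts, or once the last `patience` attempts each
    improved by less than min_gain.
    """
    if len(gains) >= max_attempts:
        return True
    recent = gains[-patience:]
    return len(recent) == patience and all(g < min_gain for g in recent)


latencies = [float(ms) for ms in range(1, 101)]  # made-up samples: 1..100 ms
print(p95(latencies))
print(should_stop([0.05, 0.002, 0.0, -0.01]))    # three weak attempts in a row
```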

Every team has that one test that fails 1 in 20 runs and nobody has gotten around to fixing. This is a perfect "run it over and over and see" problem.

The evaluator: Run the flaky test repeatedly (say, 30 times) and report the pass rate.

The implementation: The test file and the code it's testing.

The direction file might say:

This test fails intermittently. Form a hypothesis about why — race condition, timing assumption, shared state, or something else — make a targeted fix, then run the test suite 30 times. If the pass rate improves and no other tests regress, keep the change. If not, revert and try a different hypothesis. Log your reasoning for each attempt, including failed ones, so a human can review the investigation afterward.

Even if the agent doesn't fully fix it, the log of hypotheses it tried and ruled out is often more valuable than starting from scratch yourself.
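One caveat worth quantifying before trusting the pass rate: if the test fails 1 in 20 runs, 30 consecutive passes will happen by chance roughly 21% of the time even when nothing was fixed. The arithmetic below assumes runs are independent with a constant failure rate, which real flaky tests only approximate, and the function names are illustrative.

```python
import math


def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Chance that an unfixed test passes every one of `runs` runs."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, max_false_positive: float) -> int:
    """Smallest run count at which an unfixed test passes every run
    with probability at most max_false_positive."""
    return math.ceil(math.log(max_false_positive) / math.log(1.0 - failure_rate))


print(round(prob_all_pass(0.05, 30), 3))  # chance of a false "fixed" at 30 runs
print(runs_needed(0.05, 0.05))            # runs needed for a 5% false-positive cap
```

Under these assumptions, about 59 clean runs are needed before a "fixed" verdict has less than a 5% chance of being luck, so the repeat count in the evaluator is worth setting deliberately.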

"This module needs cleanup" is a sentence every engineer has said and almost nobody has time for. The autoresearch pattern turns your existing test suite into the evaluator.

The evaluator: Your existing test suite, which must stay green, plus a complexity metric such as lines of code or cyclomatic complexity.

The implementation: The module being refactored.

The direction file might say:

Reduce the complexity of this module while keeping all tests passing. Favor deleting code over adding it — a change that removes 20 lines and keeps tests green is more valuable than one that adds a new abstraction. After each change, run the full test suite. If anything breaks, revert immediately. Stop after 10 attempts or when you can no longer find a simplification that doesn't break a test.

This directly borrows Karpathy's "deletion is a win" philosophy — which most engineering teams say they hold but rarely enforce systematically.
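For a Python module, the complexity half of that evaluator can be a short script built on the standard `ast` module. The sketch below is a rough estimate, not a full cyclomatic-complexity implementation: it counts branching constructs plus one, and counts non-blank, non-comment lines. The `complexity` function and the `sign` example are made up for illustration.

```python
import ast

# Constructs that add a decision point. This is an approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def complexity(source: str) -> dict:
    """Return line count and an approximate cyclomatic complexity."""
    tree = ast.parse(source)
    branches = 0
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            branches += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths
            branches += len(node.values) - 1
    loc = sum(1 for line in source.splitlines()
              if line.strip() and not line.strip().startswith("#"))
    return {"loc": loc, "cyclomatic_estimate": branches + 1}


before = '''
def sign(x):
    # classify x
    if x > 0:
        return 1
    elif x < 0:
        return -1
    else:
        return 0
'''

after = '''
def sign(x):
    return (x > 0) - (x < 0)
'''

print(complexity(before))
print(complexity(after))
```

Whatever metric you choose, it belongs in the evaluator alongside the test suite, so the agent cannot redefine "simpler" to suit its change.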

Thread pool sizes, cache TTLs, retry and backoff settings, connection pool limits, batch sizes for background jobs — most teams set these once, based on a guess, and never revisit them.

The evaluator: A load-testing script that reports your target metric — cost per request, p99 latency, error rate — under a fixed, repeatable load.

The implementation: The configuration file.

The direction file might say:

Your goal is to minimize average cost per request without increasing p99 latency above 400ms or error rate above 0.1%. Change one configuration value at a time. Run the load test after each change. Keep changes that improve cost without violating the constraints; revert everything else. Try at least 20 configurations before stopping.

This is the closest analog to the original repo — fixed time budget, single metric, lots of small experiments, git-based keep or revert.
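The accept-or-revert decision in that direction file is a constrained minimization, which is easy to state precisely. A minimal sketch, assuming a hypothetical `LoadTestResult` record produced by your load-testing script; the limits mirror the example direction file above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadTestResult:
    """Hypothetical summary of one load-test run."""
    cost_per_request: float  # any consistent unit
    p99_latency_ms: float
    error_rate: float        # fraction: 0.001 means 0.1%


MAX_P99_MS = 400.0      # from the example direction file
MAX_ERROR_RATE = 0.001  # 0.1%


def within_limits(result: LoadTestResult) -> bool:
    return (result.p99_latency_ms <= MAX_P99_MS
            and result.error_rate <= MAX_ERROR_RATE)


def accept(candidate: LoadTestResult, best: LoadTestResult) -> bool:
    """Keep a config only if it is cheaper AND violates no constraint."""
    return within_limits(candidate) and candidate.cost_per_request < best.cost_per_request


baseline = LoadTestResult(cost_per_request=0.0040, p99_latency_ms=310.0, error_rate=0.0004)
cheaper_but_slow = LoadTestResult(cost_per_request=0.0031, p99_latency_ms=455.0, error_rate=0.0004)
cheaper_and_ok = LoadTestResult(cost_per_request=0.0036, p99_latency_ms=380.0, error_rate=0.0008)

print(accept(cheaper_but_slow, baseline))  # False: breaks the p99 limit
print(accept(cheaper_and_ok, baseline))    # True: cheaper, within limits
```

The numbers in the three sample results are invented to exercise the rule and are not measurements.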

If your product has any AI-powered features (and increasingly, most products do), this is arguably the most natural application of the entire pattern, because it maps almost one-to-one.

The evaluator: A fixed evaluation set of inputs with known-good expected outputs, scored automatically through exact match, similarity score, or a rubric-based judge.

The implementation: Your prompt template, system instructions, or few-shot examples.

The direction file might say:

Your goal is to improve the accuracy score on the evaluation set without increasing average token usage by more than 10%. You may modify the prompt template, instructions, and examples — but not the evaluation set itself. Run the evaluation after every change. Keep improvements, revert regressions. Prefer shorter prompts when scores are equivalent.

This is literally the same three-file contract — the eval set as the untouchable evaluator, the prompt as the implementation, and your instructions as the direction file.
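A minimal version of that evaluator, using exact-match scoring, might look like the sketch below. `run_prompt` is a hypothetical stand-in for whatever calls your model with the current prompt template, and measuring the 10% token budget against the original baseline (rather than the previous attempt) is an assumption.

```python
from typing import Callable, List, Tuple

EvalSet = List[Tuple[str, str]]  # (input, expected output) pairs


def accuracy(run_prompt: Callable[[str], str], eval_set: EvalSet) -> float:
    """Exact-match accuracy. run_prompt is a hypothetical model call."""
    hits = sum(1 for question, expected in eval_set
               if run_prompt(question).strip() == expected.strip())
    return hits / len(eval_set)


def keep_prompt(old_acc: float, new_acc: float,
                old_tokens: float, new_tokens: float,
                baseline_tokens: float, max_growth: float = 0.10) -> bool:
    """Keep a prompt change if it stays within the token budget and either
    scores higher, or scores the same with fewer tokens."""
    if new_tokens > baseline_tokens * (1 + max_growth):
        return False
    if new_acc > old_acc:
        return True
    return new_acc == old_acc and new_tokens < old_tokens


# Toy eval set and a fake "model" so the sketch runs without any API.
eval_set = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
fake_answers = {"2+2": "4", "capital of France": "Paris ", "3*3": "6"}

print(accuracy(lambda q: fake_answers[q], eval_set))
print(keep_prompt(old_acc=0.60, new_acc=0.70, old_tokens=100, new_tokens=105,
                  baseline_tokens=100))
```

Exact match is the strictest of the three scoring options mentioned above. A similarity score or rubric-based judge would replace the comparison inside `accuracy` while leaving `keep_prompt` unchanged.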

Slow CI pipelines are a tax every team pays daily, and "someone should really look at why the build takes 18 minutes" is a perennial backlog item that never gets prioritized.

The evaluator: Total CI run time, reported automatically by your pipeline.

The implementation: Your CI configuration, caching strategy, and parallelization setup.

The direction file might say:

Your goal is to reduce total CI run time while keeping the pipeline green and not removing any test coverage. Try caching strategies, parallelization, and dependency optimization one change at a time. Run the full pipeline after each change. Keep improvements, revert if the pipeline fails or coverage drops. Try at least 10 distinct approaches.

The original setup works because the scope is extremely tight: one file can be changed, one machine is affected, one metric decides success, and every change is either committed or instantly reverted with git. Nothing the agent does is irreversible, and nothing it does affects anything outside a sandboxed environment.
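The "committed or instantly reverted" step maps onto two ordinary git operations. The helpers below are a sketch, not the repo's code: the function names are illustrative, and the `run` parameter exists only so the commands can be inspected without touching a real repository.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(list(cmd), check=True)


def keep_change(path: str, message: str, run: Runner = _run) -> None:
    """Commit the one file the agent is allowed to edit."""
    run(["git", "add", "--", path])
    run(["git", "commit", "-m", message])


def revert_change(path: str, run: Runner = _run) -> None:
    """Discard uncommitted edits to that file by restoring it from HEAD."""
    run(["git", "checkout", "HEAD", "--", path])


# Dry run: record the commands instead of executing them.
recorded: List[List[str]] = []
keep_change("impl.py", "attempt 7: smaller batch", run=lambda c: recorded.append(list(c)))
revert_change("impl.py", run=lambda c: recorded.append(list(c)))
for cmd in recorded:
    print(" ".join(cmd))
```

Scoping both commands to a single path mirrors the one-file rule. Note that restoring one file from HEAD does not delete new untracked files, so an agent that is allowed to create files needs a broader cleanup step.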

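The moment you widen that scope, the calculus changes. So when you apply this pattern at work, treat those same properties as non-negotiable: a narrow, explicit set of files the agent may change, a sandboxed environment, a single deciding metric, and a git-backed way to undo every change.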

Used within these boundaries, this isn't reckless automation. It's closer to setting up a very thorough, very patient junior engineer with a clear rubric and walking away for a few hours.

You don't need a GPU or a research budget to test this pattern. Pick the smallest, lowest-stakes version of one of the ideas above — something on a branch, in a sandbox, with a test suite or benchmark you already trust.

A good starting point: take a function with an existing unit test, write a short direction file describing the goal and constraints, point your coding agent at it, and ask it to run a handful of iterations — committing improvements and reverting regressions as it goes.

Then do the part that actually matters: read the log of what it tried. Not the final code — the reasoning trail. That's where you'll find out whether this pattern earns a permanent spot in your workflow.

Want to go straight to the source? The original repo is here: github.com/karpathy/autoresearch

Part of the AI for Everyone series — practical AI guides for engineers and professionals. Follow for more.
