Everyone's saying the same thing right now: stop prompting your coding agent, start designing the loop that prompts it for you, and let it do the work. We agree. We've just been doing it long enough that it isn't a prediction anymore — autonomous loops have been running our R&D on four production repos for weeks.
Here's a concrete one. On NyxID, our open-source gateway, a loop took a load-balancing feature from a GitHub issue to a merged PR last week: about 1,400 lines of Rust, and the merge metadata records human_touch_count = 0, meaning no human edited the diff. A person still scoped the issue and clicked merge, but the code came out of the loop and survived review without anyone rewriting it. (PR #975)
That's the part everyone's excited about, and it's real. It's also not the hard part, and not the reason we trust the thing enough to leave it running.
The failure mode of an autonomous loop isn't that it does nothing. It's that it does something confidently wrong: writes plausible code that doesn't hold up, papers over a failing test, claims a result it can't support, and runs until your budget is gone. A single model is sure of itself even when it shouldn't be, and a naive loop inherits all of that confidence with none of the brakes. That's the real reason most "agent runs for 10 hours" demos stay demos.
So the thing we actually built consensus-loop around isn't "make the agent run." It's "make the agent trustworthy enough that you can walk away." The way you get there is to stop letting one confident model decide alone.
consensus-loop is a skill you inject into a host you already use: Claude Code, Codex, Cursor, or Gemini. You point it at a repo, hand it one host.env file with that repo's facts, and it takes over the development loop from there.
REPO_ROOT=/path/to/your/repo
GH_REPO_SLUG=your-org/your-repo
BUILD_CMD="cargo build"
TEST_CMD="cargo test"
INTEGRATION_BRANCH=consensus-rnd-integration
REVIEW_BASE_BRANCH=main
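To make the shape of that file concrete, here is a minimal sketch of reading it. This is not consensus-loop's own loader: the function names are hypothetical, and treating all six keys as required is an assumption of the sketch.

```python
import shlex

# The six keys shown in the host.env above. Treating every one of them as
# required is an assumption of this sketch, not documented behavior.
REQUIRED_KEYS = (
    "REPO_ROOT",
    "GH_REPO_SLUG",
    "BUILD_CMD",
    "TEST_CMD",
    "INTEGRATION_BRANCH",
    "REVIEW_BASE_BRANCH",
)


def parse_host_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, honoring shell-style quoting of the value."""
    config: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        parts = shlex.split(value)  # strips the quotes around "cargo build"
        if len(parts) > 1:
            raise ValueError(f"value with spaces must be quoted: {raw!r}")
        config[key.strip()] = parts[0] if parts else ""
    return config


def missing_keys(config: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


SAMPLE = """\
REPO_ROOT=/path/to/your/repo
GH_REPO_SLUG=your-org/your-repo
BUILD_CMD="cargo build"
TEST_CMD="cargo test"
INTEGRATION_BRANCH=consensus-rnd-integration
REVIEW_BASE_BRANCH=main
"""

if __name__ == "__main__":
    config = parse_host_env(SAMPLE)
    print(config["BUILD_CMD"], missing_keys(config))
```

Checking for missing keys before the loop starts is the cheap version of a stop rule: a run that can't name its build command shouldn't begin.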
One detail worth pulling out, because it's most of why the consensus means anything: the loop runs across two different systems. The host you install into — Claude Code, in our setup — is the controller. It routes, posts to GitHub, commits, and merges, but it does none of the thinking. The thinking runs on separate Codex workers it spawns in isolated git worktrees. Claude Code drives; Codex reasons. The agent steering the loop isn't the one doing the work, and the work itself is split across independent Codex workers that can't see each other.
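The split is easier to see in code. What follows is a rough sketch under stated assumptions, not the project's implementation: the worktree path and branch naming are made up, the workers are stand-in callables rather than real Codex processes, and the quorum rule is a plain majority count. The only real interface used is `git worktree add -b <branch> <path> <commit-ish>`, and the command is built here without being run.

```python
from collections import Counter
from typing import Callable, Optional


def worktree_command(repo_root: str, worker_id: int, base_branch: str) -> list[str]:
    """Build the argv that would give one worker its own checkout.

    The sibling-directory layout and branch name are hypothetical.
    """
    path = f"{repo_root}-worker-{worker_id}"
    branch = f"worker-{worker_id}"
    return ["git", "-C", repo_root, "worktree", "add", "-b", branch, path, base_branch]


def collect_answers(task: str, workers: list[Callable[[str], str]]) -> list[str]:
    """Controller side: hand every worker the same task and nothing else.

    No worker is ever shown another worker's answer.
    """
    return [worker(task) for worker in workers]


def consensus(answers: list[str], quorum: int) -> Optional[str]:
    """Return the most common answer if at least `quorum` workers gave it."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= quorum else None


if __name__ == "__main__":
    # Stand-in workers; a real setup would launch separate processes.
    workers = [lambda task: "plan-a", lambda task: "plan-a", lambda task: "plan-b"]
    print(worktree_command("/path/to/your/repo", 1, "main"))
    print(consensus(collect_answers("add load balancing", workers), quorum=2))
```

The controller only routes and counts. Returning None when no answer reaches the quorum is what leaves room for a clean stop instead of a forced decision.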
There's no algorithmic novelty here, and we won't pretend otherwise. Underneath, this is multi-agent debate, an LLM judge, and self-consistency — patterns you already know. What's hard, and what took us weeks of debugging on real repos, is the reliability engineering around the loop: the daemons that keep it alive, the leases that stop two instances from fighting, the release gates, and the stop rules. The idea is cheap. Making it trustworthy is not.
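As one example of that reliability plumbing, here is a minimal sketch of a file-based lease with an expiry. It is an illustration only: the file format, the function name, and the explicit `now` parameter are assumptions, and nothing here is taken from the project's daemons.

```python
import json
import os


def try_acquire(path: str, owner: str, ttl: float, now: float) -> bool:
    """Take the lease if it is free, expired, or already ours.

    Returns True when `owner` holds the lease after the call.
    """
    record = json.dumps({"owner": owner, "expires": now + ttl})
    try:
        # O_EXCL makes creation atomic: only one process can create the file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            with open(path) as f:
                held = json.load(f)
        except (OSError, ValueError):
            return False  # unreadable or half-written: treat it as held
        if held["owner"] != owner and held["expires"] > now:
            return False  # someone else holds a live lease
        # Renewal or takeover of an expired lease. Two instances racing to
        # take over the same expired lease can both succeed here; a real
        # implementation needs a stronger primitive for this step.
        tmp = f"{path}.{owner}.tmp"
        with open(tmp, "w") as f:
            f.write(record)
        os.replace(tmp, path)
        return True
    with os.fdopen(fd, "w") as f:
        f.write(record)
    return True
```

Passing `now` in explicitly keeps the sketch easy to test. The expiry matters as much as the lock: without it, a crashed instance would hold the lease forever and the loop would never restart.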
If you just want to try the consensus idea on a single hard decision without any of the daemon machinery, there's a lightweight skill called sshx that spins up a few isolated workers to give you multiple angles and nothing else.
We'll be straight about the conflict of interest: all the repos below are ours, and consensus-loop has zero external adoption so far. This is our own tape, not third-party validation. Everything is a public issue or PR you can open.
The NyxID feature up top is the loop doing the work. These are the loop deciding not to — which is the behavior that makes the first kind safe to rely on.
It stopped instead of fabricating. On aevatar, the solvers reached consensus, but at implementation time the worker didn't have the real external evidence to make the change safely. Rather than invent the missing piece to produce something, it stopped, changed nothing, and surfaced what it didn't know. A clean stop, not a confident wrong diff. (#2181)
It called a human when it should have. On Ornn, a large feature wouldn't converge after several rounds. The loop didn't force the merge. It opened an escalation, left the half-finished work for review, and flagged it as needing a person. (#1061)
It refused to take credit it couldn't back. On newmath — also ours, written by the same maintainer — the loop ran an experiment and measured a real result, 0.998 on a gap-detection benchmark against a 0.463 baseline. Then it went to claim a separate result, that the model also predicted better, and the statistical gate didn't pass: identical error on both arms. So it marked that claim false and logged why. On a repo where we'd have loved the win, it didn't take a result it couldn't support. (#1687)
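A gate like that can be very small. The sketch below is a deliberately crude stand-in, a threshold on the difference in mean error between two arms rather than a real statistical test, and the names, threshold, and inputs are all hypothetical. It shows the shape of the decision: no measured improvement means the claim is marked unsupported, with the reason recorded.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Verdict:
    supported: bool
    reason: str  # logged either way, so a rejection explains itself


def gate_claim(
    baseline_errors: list[float],
    treatment_errors: list[float],
    min_improvement: float,
) -> Verdict:
    """Accept a 'predicts better' claim only if mean error drops enough."""
    if not baseline_errors or not treatment_errors:
        return Verdict(False, "missing measurements for one arm")
    improvement = mean(baseline_errors) - mean(treatment_errors)
    if improvement < min_improvement:
        return Verdict(
            False,
            f"improvement {improvement:.3f} is below the required {min_improvement:.3f}",
        )
    return Verdict(
        True,
        f"improvement {improvement:.3f} meets the required {min_improvement:.3f}",
    )


if __name__ == "__main__":
    # Identical error on both arms: the claim does not pass.
    print(gate_claim([0.2, 0.2], [0.2, 0.2], min_improvement=0.01))
```

The default is rejection. A claim has to earn a True, and missing data counts against it rather than for it.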
Count the NyxID feature alongside these three and you have four cases, and three of them are the loop choosing not to act. That's the point. An autonomous loop you can actually leave running isn't one that always produces. It's one that produces when it's sure and stops when it isn't.
We've spent the past couple of months building this loop, tuning it, and running it for real on our own repos — using it in production and improving it at the same time. That's 155 billion tokens and 1.6 million model calls of actually living inside the thing, not a weekend prototype. We trade tokens for time, on purpose.
Loop engineering is having its moment right now, and we don't think the versions that actually work should stay locked inside a handful of companies' private repos. So we're putting ours in the open. Come run loops with us — point it at your repo, break it, tell us where it falls over, and let's find out what these things can really do.
It's still early-stage, and a lot of what the loop does is repair itself before anything reaches you. When a test fails or a reviewer rejects the work, it doesn't ship the break: it feeds the error back in, fixes it, and re-checks. The NyxID feature up top went through that four times before it passed. And when it genuinely can't recover on its own, it stops and says so rather than guessing past it.
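The repair cycle described above can be sketched as a bounded loop. This is a minimal illustration, not the project's code: `attempt` and `check` are hypothetical callables standing in for the worker and for the tests or reviewer, and the round limit is an assumed stop rule.

```python
from typing import Callable, Optional


def repair_loop(
    attempt: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int,
) -> tuple[Optional[str], list[str]]:
    """Try, check, feed the error back, and retry up to `max_rounds` times.

    `attempt` receives the previous error (None on the first round) and
    returns a candidate. `check` returns None on success or an error
    message. The result is (accepted candidate or None, errors seen).
    """
    errors: list[str] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = attempt(feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate, errors  # passed: safe to hand onward
        errors.append(feedback)  # failed: the error becomes the next input
    return None, errors  # out of rounds: stop and report, don't guess


if __name__ == "__main__":
    fix = lambda err: "v1" if err else "v0"
    test = lambda cand: None if cand == "v1" else f"{cand} fails a test"
    print(repair_loop(fix, test, max_rounds=3))
```

Returning None together with the error history is the "stops and says so" case: the caller gets what went wrong on each round instead of a candidate that never passed.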
It's open source, MIT-licensed. We don't sell it and we're not trying to. We built the loop because we needed it, we run it on our own products every day, and we're giving it to you to run on yours. Inject it into your host, write one host.env, and point it at a repo.
Go break it: https://github.com/ChronoAIProject/consensus-rnd