Show HN: DOS – a referee between AI agents that doesn't believe their "done"

A developer released DOS, an open-source tool that verifies AI agents' completion claims by checking git history and other artifacts instead of trusting the agents' own statements. The tool, available as a Python package, can detect false "done" claims, prevent agents from overwriting each other's files, and flag stalled runs. It aims to bring accountability to multi-agent coding workflows.

📊 See it run on real repos: the scoreboard scores 15 popular AI-built repos (roborev, open-interpreter, crewAI, autogen, …) on how much agents wrote, which ones, and whether each commit's claim is backed by its own diff. Score yours: `dos commit-audit --sweep --workspace . BASE..HEAD`.

https://raw.githubusercontent.com/anthony-chaudhary/dos-kernel/master/docs/assets/caught-lie-cast.svg

The whole pitch in one recording: the agent claims two features shipped; git backs one. `dos verify` answers from the commits, the lie exits 1, and a gate on that exit code refuses the false "done". Every line is the real CLI's verbatim output; [scripts/build_caught_lie_cast.py](https://github.com/anthony-chaudhary/dos-kernel/blob/master/scripts/build_caught_lie_cast.py) re-records it whenever the output changes.

https://raw.githubusercontent.com/anthony-chaudhary/dos-kernel/master/docs/assets/loop-hero.svg

Run a fleet of agents on one repo. The left loop just feels like progress; the right one you can steer. The only difference is a verdict DOS reads from the real world (here, git), never the agent's word.

An AI agent will tell you it finished. DOS checks the real world instead of taking its word, and the nearest piece of the real world is your git history. An agent says it shipped the login endpoint; did it? Run one command, `dos verify`, and it answers from the artifacts the work left behind, not from what the agent typed: a commit backs the claim → `SHIPPED`, exit `0`; nothing landed → `NOT SHIPPED`, exit `1`.
The agent's story never enters into it. Git is just the first witness DOS reads; the file tree, the clock, a CI status, and a test environment's own state are others: anything the agent didn't author.

- `dos verify AUTH AUTH1` → `SHIPPED AUTH AUTH1 e62f74d`, exit 0
- `dos verify AUTH AUTH2` → `NOT SHIPPED AUTH AUTH2`, exit 1

That's the smallest version. It scales up, too: point a dozen agents at one repo (in CI, in a fleet, racing on the same files) and DOS also tells you which ones are stepping on each other, which one is spinning in circles, and which claim of "done" is real. Every answer comes from the artifacts (git, the file tree, the clock), never the narration. It works on a plain git repo with zero config and gets smarter the more you tell it, and the only thing you ever install is one small Python package.

⚡ Just add it: two commands, zero decisions. From the repo where your agent works:

- `pip install dos-kernel`
- `dos init --hooks` (auto-finds the agent runtime(s) you already use, wires in the checks)

From then on: your agent can't tell you "done" unless the work actually landed, two agents can't silently overwrite each other's files, and a run that stalls gets flagged instead of quietly spinning. Nothing about your workflow changes, and you don't need to learn any of the vocabulary below to be covered. It prints the one config file it wrote; deleting the `dos` hook entries there undoes it. No runtime detected? It says so and lists the names to pick from; it never guesses.

v0.28.0 · 5,600+ tests · CI: Python 3.11–3.13 on Linux + a Windows 3.13 smoke run · the only runtime dependency is PyYAML · MIT.

🧭 Where to go next: the why & evidence (plain-words story, the 20-lines-of-bash answer, what's proven), wire it into your stack (MCP · hooks · install), the syscall + CLI reference, or, reading this as an AI agent?, AGENTS.md (build/test/check in three lines). The full map is the router just below.
🔤 Five words the rest of this page leans on. A **plan** is a named goal (`AUTH`); a **phase** is one shippable step of it (`AUTH1`); a **lane** is the slice of the file tree one agent may touch; the **oracle** is the part of DOS that reads the evidence and rules; a **stamp** is the mark a shipped phase leaves in a commit subject (`AUTH1: …`), the thing the oracle greps for. That's the whole vocabulary.

A coding agent does work, then tells you how it went. Usually the story is true; sometimes it's the cheerful "all work completed" from a worker that shipped nothing. With one agent you catch that yourself by re-reading its output, a real tax you already pay. Run twenty at once and that tax stops being payable: nobody reads everything, each worker grades its own homework, and the unchecked problems pile up quietly until the codebase sorta works and nobody can safely change it. DOS is the referee that never reads the story: it reads what happened (the commit, the file, the clock) and hands you a verdict no narration can move. It costs about an afternoon, has one runtime dependency, and stays in its lane: it tells you what happened, never whether the code is good; quality stays with your tests and reviews. [The full plain-words version](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/why-a-referee.md#the-plain-words-version).

Every number here is scored against a fact the agent can't fake (a test environment's DB state, git history). A DOS gate caught 15 "I shipped it" lies in 258 tasks across two models with zero false alarms; the same referee stopped 6 of 8 silent collisions on one shared record; quitting doomed runs at the right moment saved ~11% of fleet compute with 0 of 1,634 winners wrongly killed; and the reward-set admission label lifted acceptance precision 60% → 100% by purging poison a self-graded collector keeps. The methodology, the two money-moment figures, and the projected-vs-bet honesty gradient are in what's proven and what's still a bet.
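The vocabulary above describes a stamp as a mark in a commit subject (`AUTH1: …`) that the oracle greps for. As a rough illustration of that idea only, here is a minimal Python sketch. It is not DOS's implementation: the helper names `find_stamp` and `git_subjects` are hypothetical, and the matching rule (subject starts with the phase name followed by a colon) is an assumption drawn from the `AUTH1: …` example.

```python
import subprocess
from typing import Iterable, Optional


def find_stamp(subjects: Iterable[str], phase: str) -> Optional[str]:
    """Return the first commit subject that starts with '<phase>:', else None.

    Assumption: a stamp is a subject beginning with the phase name and a
    colon, as in 'AUTH1: ...'. The real oracle may match differently.
    """
    prefix = f"{phase}:"
    for subject in subjects:
        if subject.startswith(prefix):
            return subject
    return None


def git_subjects(repo: str = ".") -> list[str]:
    """Read commit subjects from a repo's git history, newest first."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--format=%s"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Illustrative subjects; in a real repo you would pass git_subjects().
    subjects = ["AUTH1: add login endpoint", "fix typo in README"]
    print(find_stamp(subjects, "AUTH1"))  # a subject backs this phase
    print(find_stamp(subjects, "AUTH2"))  # nothing landed for this phase
```

The point of the sketch is the direction of trust: the answer is derived from commit subjects the history already holds, and the agent's own report is never an input.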
This page keeps the hook, the demo, and the failure it fixes. Everything deeper lives on a focused page; find the question you arrived with and jump:

| You're asking… | Go to |
|---|---|
| "What is this in plain words, and why should my team care? Is it real?" | |
| "Show me it working, fast." | [Try it in 60 seconds](#try-it-in-60-seconds), just below: one command |
| "I already run agents: how do I wire the verdict into my stack?" | [Wire it in](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/wire-it-in.md): MCP, runtime hooks, the exit-code tier, fleet frameworks, and the install matrix |
| "What's the full command / syscall surface?" | [The syscall ABI & CLI reference](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/cli-reference.md): every verb, the three live screens, the verdict journal |
| "I run a fleet every day: how do I watch it, triage it, debug it?" | [Operating a fleet](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/operating-a-fleet.md) + [Debug a stuck fleet](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/playbooks/06_debug-a-stuck-fleet.md) |
| "How do I bend it to my org without forking it?" | [Extending it](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/extending.md): the seven axes, the docs index, the playbooks |
| "What is actually proven, and can I re-run it?" | [For researchers](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/for-researchers.md): claims → invariants → reproduction |
| "I'm an AI agent orienting in this repo." | [AGENTS.md](https://github.com/anthony-chaudhary/dos-kernel/blob/master/AGENTS.md): what DOS is in three lines, build/test/check, the ~5 files worth reading |
| "What surfaces are stable and what's the deprecation window?" | [docs/STABILITY.md](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/STABILITY.md): the compatibility promise, what the version number means, and what will never break |

Got a terminal?
This runs the whole thing in a throwaway repo: one command scaffolds it, makes a real commit, verifies it, and cleans up after itself:

- `pip install dos-kernel` (PyYAML is the only runtime dep)
- `dos quickstart` → `SHIPPED AUTH AUTH1 …` then `NOT SHIPPED AUTH AUTH2`

One `SHIPPED`, one `NOT SHIPPED`: the first is a claim git can back, the second is a claim nothing landed for. That contrast is the product. The demo closes with a router to wherever you already run agents: a Claude Code / Cursor tab (`dos init --hooks`), an MCP host, a CI step, or a fleet, so your next move is one line, not a docs dig. Add `--keep ./demo` to keep the repo and poke at it.

Don't even want the install? `uvx --from dos-kernel dos quickstart` runs the same demo ephemerally, with nothing left behind. The same thing by hand, in five lines, is docs/QUICKSTART.md.

https://raw.githubusercontent.com/anthony-chaudhary/dos-kernel/master/examples/demo/verify-moment.svg

Two equally confident claims, one verdict each: `SHIPPED` for the one git can back, `NOT SHIPPED` for the one nothing landed for. Every string is verbatim output of [examples/demo/verify_demo.sh](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/demo/verify_demo.sh). [Step through it locally](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/demo/verify_visual.html) for the click-through version (it's an HTML file: clone the repo and open it in a browser; GitHub shows its source, not the running page).

The smallest real win: in a CI step or dispatch loop, replace the line that trusts an agent's "done" with `dos verify PLAN PHASE` and branch on its exit code (0 shipped / 1 not). No parsing, no plan, no config; the [CI integration cookbook](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/playbooks/cookbook-ci-integration.md) walks it end-to-end.
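Branching on that exit code from a Python dispatch loop might look like the sketch below. It relies only on what the text states (`dos verify PLAN PHASE` exits 0 when shipped and 1 when not); the function name `claim_is_backed`, the injectable `command` parameter, and the choice to treat a missing `dos` binary as "not backed" are assumptions made for illustration, not part of DOS.

```python
import subprocess
import sys
from typing import Optional, Sequence


def claim_is_backed(plan: str, phase: str,
                    command: Optional[Sequence[str]] = None) -> bool:
    """Return True only when the verifier process exits 0 (SHIPPED).

    `command` is a hypothetical override so the gate can be exercised
    without DOS installed; by default it runs `dos verify PLAN PHASE`.
    """
    cmd = list(command) if command is not None else ["dos", "verify", plan, phase]
    try:
        completed = subprocess.run(cmd)
    except FileNotFoundError:
        # Assumption: fail closed. No verifier means no verdict,
        # so the claim is treated as not backed.
        return False
    return completed.returncode == 0


def gate(plan: str, phase: str) -> int:
    """Map the verdict to a process exit code for a CI step."""
    return 0 if claim_is_backed(plan, phase) else 1
```

A CI step could then end with `sys.exit(gate("AUTH", "AUTH1"))`, so the job fails whenever the claim has nothing behind it. Nothing here parses the verifier's output; only the exit code decides.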
To run it on a repo shaped like yours, start with [Onboard a repo in 10 minutes](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/playbooks/01_onboard-a-repo.md).

Point the same witness at a review queue when commits pile up faster than anyone can read them. [Residual review](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/residual_review/) folds `commit-audit`'s per-commit verdict into three bands: CLEARED (the diff witnessed the claim, so spend ~0 attention re-asking "did it do what it said"), RESIDUAL (a claim git couldn't back: the human's 100%), and the no-claim rest. On this repo's own last 200 commits it cleared 170 of 171 checkable claims: that's the re-review you skip, proven by git rather than a model's confidence score. CLEARED means the change's shape matched its claim, not that the code is correct; correctness review still applies to every commit. The band can only ever ask for more eyes, never fewer.

Next level up, wire the verdict into your own stack: Wire it in.

Run a pile of agents at once with nobody refereeing, and here's how it goes: each worker reports its own success, and you believe the reports, because what else is there to go on? The unchecked problems pile up quietly (a lie here, two agents clobbering the same file there, a little scope creep, one worker spinning in circles) until the codebase sorta works and nobody can safely change it. The trouble is you launched the agents and then let them grade their own homework. DOS gives you the missing signal, a verdict from ground truth, so the loop closes. Here is the same fleet under both regimes:

The two regimes as a flowchart. NO REFEREE: you believe the narration; DOS ADJUDICATES: you steer on a verdict

flowchart LR subgraph OPEN "NO REFEREE: you believe the narration" direction TB A1 "agent: 'done '" -- B1 "believed" A2 "agent: 'done '" -- B1 A3 "agent: 'done '" -- B1 B1 -- C1 "silent corruption piles up