{"slug": "show-hn-dos-a-referee-between-ai-agents-that-doesn-t-believe-their-done", "title": "Show HN: DOS – a referee between AI agents that doesn't believe their \"done\"", "summary": "A developer released DOS, an open-source tool that verifies AI agents' completion claims by checking git history and other artifacts instead of trusting the agents' own statements. The tool, available as a Python package, can detect false \"done\" claims, prevent agents from overwriting each other's files, and flag stalled runs. It aims to bring accountability to multi-agent coding workflows.", "body_md": "📊 See it run on real repos: the scoreboard scores 15 popular AI-built repos (roborev, open-interpreter, crewAI, autogen, …) — how much agents wrote, which ones, and whether each commit's claim is backed by its own diff. Score yours: `dos commit-audit --sweep --workspace . BASE..HEAD`.\n\n![caught-lie-cast](https://raw.githubusercontent.com/anthony-chaudhary/dos-kernel/master/docs/assets/caught-lie-cast.svg)\n\n*The whole pitch in one recording: the agent claims two features shipped; git backs one.\n`dos verify` answers from the commits, the lie exits 1, and a gate on that\nexit code refuses the false \"done\". Every line is the real CLI's verbatim output;\n[scripts/build_caught_lie_cast.py](https://github.com/anthony-chaudhary/dos-kernel/blob/master/scripts/build_caught_lie_cast.py) re-records it whenever the output changes.*\n\n![loop-hero](https://raw.githubusercontent.com/anthony-chaudhary/dos-kernel/master/docs/assets/loop-hero.svg)\n\n*Run a fleet of agents on one repo. The left loop just feels like progress; the right one you can steer.\nThe only difference is a verdict DOS reads from the real world — here, git — never the agent's word.*\n\nAn AI agent will tell you it finished. DOS checks the real world instead of\ntaking its word — and the nearest piece of the real world is your git history.\nAn agent says it shipped the login endpoint; did it? 
Run one command,\n`dos verify`, and it answers from the artifacts the work left behind, not from\nwhat the agent typed: a commit backs the claim → `SHIPPED`, exit `0`; nothing\nlanded → `NOT_SHIPPED`, exit `1`. The agent's story never enters into it. (Git\nis just the first witness DOS reads; the file tree, the clock, a CI status, a\ntest environment's own state are others — anything the agent didn't author.)\n\n```\ndos verify AUTH AUTH1   # → SHIPPED      AUTH AUTH1 e62f74d   (exit 0)\ndos verify AUTH AUTH2   # → NOT_SHIPPED  AUTH AUTH2           (exit 1)\n```\n\nThat's the smallest version. It scales up, too: point a dozen agents at one\nrepo — in CI, in a fleet, racing on the same files — and DOS also tells you\nwhich ones are stepping on each other, which one is spinning in circles, and\nwhich claim of \"done\" is real. Every answer comes from the artifacts (git, the\nfile tree, the clock), never the narration. It works on a plain `git` repo with\nzero config and gets smarter the more you tell it, and the only thing you ever\ninstall is one small Python package.\n\n⚡ Just add it — two commands, zero decisions. From the repo where your agent works:\n\n```\npip install dos-kernel\ndos init --hooks auto   # finds the agent runtime(s) you already use, wires in the checks\n```\n\nFrom then on: your agent can't tell you \"done\" unless the work actually landed, two agents can't silently overwrite each other's files, and a run that stalls gets flagged instead of quietly spinning. Nothing about your workflow changes, and you don't need to learn any of the vocabulary below to be covered. It prints the one config file it wrote; deleting the `dos hook` entries there undoes it. (No runtime detected? 
It says so and lists the names to pick from — it never guesses.)\n\n**v0.28.0** · 5,600+ tests · CI: Python 3.11–3.13 on Linux + a Windows 3.13\nsmoke run · the only runtime dependency is **PyYAML** · **MIT**.\n\n🧭 Where to go next: the why & evidence page (plain-words story, the 20-lines-of-bash answer, what's proven), wire it into your stack (MCP · hooks · install), the syscall + CLI reference, or, if you are reading this as an AI agent, AGENTS.md (build/test/check in three lines). The full map is the router just below.\n\n🔤 Five words the rest of this page leans on. A **plan** is a named goal (`AUTH`); a **phase** is one shippable step of it (`AUTH1`); a **lane** is the slice of the file tree one agent may touch; the **oracle** is the part of DOS that reads the evidence and rules; a **stamp** is the mark a shipped phase leaves in a commit subject (`AUTH1: …`) — the thing the oracle greps for. That's the whole vocabulary.\n\nA coding agent does work, then tells you how it went. Usually the story is true;\nsometimes it's the cheerful *\"all work completed!\"* from a worker that shipped\nnothing. With one agent you catch that yourself by re-reading its output — a real\ntax you already pay. Run twenty at once and that tax stops being payable: nobody\nreads everything, each worker grades its own homework, and the unchecked problems\npile up quietly until the codebase *sorta* works and nobody can safely change it.\nDOS is the referee that never reads the story — it reads what happened (the\ncommit, the file, the clock) and hands you a verdict no narration can move. It\ncosts about an afternoon, has one runtime dependency, and stays in its lane: it\ntells you *what happened*, never whether the code is *good* — quality stays with\nyour tests and reviews. 
([The full plain-words version](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/why-a-referee.md#the-plain-words-version).)\n\nEvery number here is scored against a fact the agent can't fake (a test\nenvironment's DB state, git history). A DOS gate caught **15 \"I shipped it\" lies\nin 258 tasks across two models with zero false alarms**; the same referee stopped\n**6 of 8** silent collisions on one shared record; quitting doomed runs at the\nright moment saved **~11% of fleet compute with 0 of 1,634 winners wrongly\nkilled**; and the reward-set admission label lifted acceptance precision **60% →\n100%** by purging poison a self-graded collector keeps. The methodology, the two\nmoney-moment figures, and the projected-vs-bet honesty gradient are in\n**what's proven and what's still a bet**.\n\nThis page keeps the hook, the demo, and the failure it fixes. Everything deeper lives on a focused page — find the question you arrived with and jump:\n\n| You're asking… | Go to |\n|---|---|\n| *\"What is this in plain words, and why should my team care? 
Is it real?\"* | Why & evidence (the plain-words story, the 20-lines-of-bash answer, what's proven) |\n| *\"Show me it working, fast.\"* | [Try it in 60 seconds](#try-it-in-60-seconds), just below — one command |\n| *\"I already run agents — how do I wire the verdict into **my** stack?\"* | [Wire it in](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/wire-it-in.md) — MCP, runtime hooks, the exit-code tier, fleet frameworks, and the install matrix |\n| *\"What's the full command / syscall surface?\"* | [The syscall ABI & CLI reference](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/cli-reference.md) — every verb, the three live screens, the verdict journal |\n| *\"I run a fleet every day — how do I watch it, triage it, debug it?\"* | [Operating a fleet](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/operating-a-fleet.md) + [Debug a stuck fleet](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/playbooks/06_debug-a-stuck-fleet.md) |\n| *\"How do I bend it to my org without forking it?\"* | [Extending it](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/extending.md) — the seven axes, the docs index, the playbooks |\n| *\"What is actually proven, and can I re-run it?\"* | [For researchers](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/guide/for-researchers.md) — claims → invariants → reproduction |\n| *\"I'm an AI agent orienting in this repo.\"* | **[AGENTS.md](https://github.com/anthony-chaudhary/dos-kernel/blob/master/AGENTS.md) — what DOS is in three lines, build/test/check, the ~5 files worth reading** |\n| *\"What surfaces are stable and what's the deprecation window?\"* | **[docs/STABILITY.md](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/STABILITY.md) — the compatibility promise, what the version number means, and what will never break** |\n\n## Try it in 60 seconds\n\nGot a terminal? 
This runs the whole thing in a throwaway repo — one command scaffolds it, makes a real commit, verifies it, and cleans up after itself:\n\n```\npip install dos-kernel      # PyYAML is the only runtime dep\ndos quickstart              # → SHIPPED AUTH AUTH1 … then NOT_SHIPPED AUTH AUTH2\n```\n\nOne `SHIPPED`, one `NOT_SHIPPED`: the first is a claim git can back, the second\nis a claim nothing landed for. That contrast is the product. The demo closes\nwith a router to wherever you already run agents — a Claude Code / Cursor tab\n(`dos init --hooks`), an MCP host, a CI step, or a fleet — so your next move is\none line, not a docs dig. (Add `--keep ./demo` to keep the repo and poke at it.\nDon't even want the install? `uvx --from dos-kernel dos quickstart` runs the\nsame demo ephemerally — nothing left behind.) The same thing by hand, in five\nlines, is **docs/QUICKSTART.md**.\n\n![verify-moment](https://raw.githubusercontent.com/anthony-chaudhary/dos-kernel/master/examples/demo/verify-moment.svg)\n\n*Two equally confident claims, one verdict each — SHIPPED for the one git can back, NOT_SHIPPED for the one nothing landed for. Every string is verbatim output of [examples/demo/verify_demo.sh](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/demo/verify_demo.sh).*\n\n[Step through it locally](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/demo/verify_visual.html) for the click-through version (it's an HTML file — clone the repo and open it in a browser; GitHub shows its source, not the running page).\n\nThe smallest real win: in a CI step or dispatch loop, replace the line that\ntrusts an agent's \"done\" with `dos verify PLAN PHASE` and branch on its exit\ncode (`0` shipped / `1` not). No parsing, no plan, no config — the\n[CI integration cookbook](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/playbooks/cookbook-ci-integration.md) walks it\nend-to-end. 
To run it on a repo shaped like yours, start with\n[Onboard a repo in 10 minutes](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/playbooks/01_onboard-a-repo.md).\n\nPoint the same witness at a **review queue** when commits pile up faster than\nanyone can read them. [Residual review](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/residual_review/)\nfolds `commit-audit`'s per-commit verdict into three bands — **CLEARED** (the\ndiff witnessed the claim, so spend ~0 attention re-asking \"did it do what it\nsaid\"), **RESIDUAL** (a claim git couldn't back — the human's 100%), and the\nno-claim rest. On this repo's own last 200 commits it cleared 170 of 171\ncheckable claims: that's the re-review you skip, proven by git rather than a\nmodel's confidence score. (CLEARED means the change's *shape* matched its\nclaim — **not** that the code is correct; correctness review still applies to\nevery commit. The band can only ever ask for *more* eyes, never fewer.)\n\n*Next level up — wire the verdict into your own stack: Wire it in.*\n\nRun a pile of agents at once with nobody refereeing, and here's how it goes:\neach worker reports its own success, and you believe the reports, because what\nelse is there to go on? The unchecked problems pile up quietly — a lie here,\ntwo agents clobbering the same file there, a little scope creep, one worker\nspinning in circles — until the codebase *sorta* works and nobody can safely\nchange it.\n\nThe trouble is you launched the agents and then let them grade their own homework. DOS gives you the missing signal — a verdict from ground truth — so the loop closes. 
Here is the same fleet under both regimes:\n\n## The two regimes as a flowchart — **NO REFEREE:** you believe the narration; **DOS ADJUDICATES:** you steer on a verdict\n\n```\nflowchart LR\n  subgraph OPEN[\"NO REFEREE — you believe the narration\"]\n    direction TB\n    A1[\"agent: 'done!'\"] --> B1[[\"believed\"]]\n    A2[\"agent: 'done!'\"] --> B1\n    A3[\"agent: 'done!'\"] --> B1\n    B1 --> C1[\"silent corruption piles up<br/>(lies · collisions · spin)\"]\n    C1 --> D1[\"'sorta works' — can't be changed\"]\n  end\n  subgraph CLOSED[\"DOS ADJUDICATES — you steer on a verdict\"]\n    direction TB\n    A4[\"agent: 'done!'\"] --> V{{\"dos verify<br/>reads git\"}}\n    V -->|in git ancestry| S[\"SHIPPED (exit 0)\"]\n    V -->|found nowhere| N[\"NOT_SHIPPED (exit 1)\"]\n    S --> L[\"land it\"]\n    N --> R[\"re-dispatch / flag — caught\"]\n    R -.verdict steers the loop.-> A4\n  end\n```\n\nHere are the failures a fleet actually produces, each next to the ground truth that quietly contradicts the worker's story — and the verdict DOS hands back:\n\n| A worker… | …but the ground truth is | DOS verdict |\n|---|---|---|\n| says it shipped a unit of work | no commit ever landed | `verify` → caught lie |\n| tried, but the commit silently failed | no commit ever landed | `verify` (the flake — indistinguishable from a lie without git) |\n| edits files another worker owns | two agents, one shared file | `arbitrate` → refuse the second |\n| overruns the file region it claimed | footprint reaches beyond the declared tree | `scope-gate` → REFUSE (before the write lands) |\n| reports \"making progress\" | 0 commits, only a fresh heartbeat | `liveness` → SPINNING |\n\nThe first row is the most common one. The classic tell is a cheerful one-liner,\n*\"all work completed!\"*, from a worker that did little or nothing. 
DOS never\nreads that line; it reads the ground truth, so the claim collapses the instant\nno artifact backs it (more in\n[docs/108](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/108_the-cheap-lie-and-the-narration-taxonomy.md)). That's also\nwhat makes it cheap to adopt: `verify` needs no plan, no registry, no config,\nand the exit code *is* the verdict — any shell or CI step can branch on it\nwithout parsing a word.\n\n*Prefer to watch it move?* The two loops are also a self-contained animation you\nstep through one frame at a time — clone the repo and open\n[docs/assets/loop_visual.html](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/assets/loop_visual.html) in a browser. (It's an\nHTML file, so GitHub shows its source rather than running it — open it locally.)\n\n**Lease scope — single filesystem today.** The verification half (`verify`,\n`commit-audit`, `liveness`) travels across machines freely because it reads git\nhistory. The admission half (`arbitrate`, lane leases) is local-filesystem only:\nthe WAL lives on one disk, and workers on separate machines share no\nserialization point. A fleet that runs all its workers on one machine or in one\nshared filesystem is fully covered; a fleet spanning multiple hosts should treat\n`dos arbitrate` as advisory (not a hard mutex) until a remote-lease driver\nships. See [docs/366](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/366_single-filesystem-lease-boundary.md) for the\ndesign.\n\nIt works on a plain `git init` with zero config, and gets smarter the more you\ntell it. You don't adopt a framework and pick a tier; you start at the shallow\nend and it keeps paying off as you wade deeper — the same kernel the whole way:\n\n**Zero config.** Point `dos verify PLAN PHASE` at a plain git repo — no plan, no registry, no `dos.toml`. It answers from commit history alone (`via grep-subject` / `via none`). 
This is the whole of [QUICKSTART](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/QUICKSTART.md) and the day-one CI win above.\n\n**Tell it your structure.** `dos init` writes a `dos.toml` (lanes, paths, ship grammar as data); add a plan doc and `dos plan` lays each phase's *claim* beside the oracle's verdict. Here's [exactly what a plan file looks like](https://github.com/anthony-chaudhary/dos-kernel/blob/master/examples/plans/example-plan.md) (copyable, round-trips with the built-in reader), and four worked [example workspaces](https://github.com/anthony-chaudhary/dos-kernel/tree/master/examples/workspaces).\n\n**Teach it your own types.** Declare your own block reasons, gate verdicts, output renderers, admission predicates, a model-backed judge, a custom plan dialect, or a whole host driver — all as workspace policy, never a fork. The map is [docs/HACKING.md](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/HACKING.md) (seven extension axes) + the copy-me `examples/dos_ext/`.\n\nThat slope is how deep your config goes. The other axis is how you call the referee at all — and you adopt through whichever surface matches how you already work, not by restructuring your stack. The same kernel verdicts are reachable through every row here, lowest-friction first:\n\n| Surface | Adopt it when… | The move |\n|---|---|---|\n| **MCP server** | you drive an agent through an MCP host (Claude Desktop, Cursor, Cline, an Agent-SDK app) | add one line to the host config (`{ \"command\": \"dos-mcp\" }`) and ask the agent to `dos_verify` its own last claim — zero code. The advisory path (the agent asks). |\n| **Runtime hooks** | you run an agent loop (Claude Code, Cursor, Codex CLI, Gemini CLI) and want the verdict to act, not just be available | `dos init --hooks <runtime>` wires the verdict into that host's own hook config — a refused call is denied before it runs, a false \"done\" is refused. The enforcement path (the host denies). 
One command, no hand-edited YAML. |\n| **CLI exit-code** | you have any command-running environment — a CI step, a `pre-push` hook, or an agentic CLI like aider whose lint/test-cmd trusts a \"done\" | branch on a `dos` verb's exit code (`dos verify`: `0` shipped / `1` not; `dos commit-audit`: `0` clean / `1` over-claim) — the verdict *is the exit code*, no hook adapter and no MCP client. The honest tier for hook-less hosts (Windsurf, Warp, Zed). |\n| **Python API** | your dispatcher/orchestrator is already Python | `import dos` and call the pure syscalls (`dos.oracle.is_shipped`, `dos.arbiter.arbitrate`, …) — state-in / verdict-out, no subprocess. |\n| **Fleet framework** | your fleet already runs on LangGraph, CrewAI, AutoGen, or the OpenAI/Claude Agents SDK | bolt the referee onto the framework's own seam — a referee node, a termination condition only git can satisfy, an output guardrail with a git tripwire. One function, no rewrite; every seam executed against the real framework. |\n| **Swarm runtime** | you run **Hermes, OpenClaw**, or a SwarmClaw-style autonomous swarm — privileged tools, shared memory docs / task boards, and **no lock manager** for either | `guard_action` refuses an arbitrary-exec command **before it runs**, and `acquire_lease` / `release_lease` bracket each shared-state write so the lost update never lands. No `import dos` — it shells the CLI; Hermes' `pre_tool_call` hook also speaks DOS natively (`dos hook pretool --dialect hermes`). The runnable, A/B-measured [Hermes / OpenClaw worked example](https://github.com/anthony-chaudhary/dos-kernel/tree/master/examples/hermes_integration) + [docs/278](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/278_integrating-dos-with-hermes-and-openclaw-the-missing-lock-manager-for-agent-swarms.md). |\n| **Skill pack** | | `dos init --skills` drops editable `SKILL.md` screenplays that wire the syscalls into a snapshot → audit → gate → take-a-lane loop. 
See [QUICKSTART §2](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/QUICKSTART.md). |\n| **Driver** | … *computed*, or you add a provider-backed judge | `dos/drivers/<host>.py` (a `LaneTaxonomy` + a config factory), loaded by name, never imported by the kernel. The map is [HACKING.md](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/HACKING.md). |\n\nThe two axes are independent: a zero-config repo can adopt through any surface, and a deeply-configured one still answers over the same CLI and MCP tools. Start at the top row — it's the one that costs nothing to try. The first two rows also compose: MCP advises (the agent checks its own work), hooks enforce (the host stops a bad action) — wire both for the full loop.\n\nThose surfaces are the upstream half of the value chain — who calls the\nreferee. The same verdicts also flow downstream, to the systems that act on\nthem: every adjudication lands in a verdict journal that `dos export` drains to\nyour observability stack (Datadog / Honeycomb / Grafana —\n[docs/266](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/266_the-verdict-exporter-shipping-the-journal-to-where-dashboards-live.md)),\n`dos notify` pushes what-needs-a-human to Slack, `dos reward` gates what a\nfine-tune may train on, and `dos attest` mints a signed receipt a skeptic can\ncheck without loop access\n([docs/246](https://github.com/anthony-chaudhary/dos-kernel/blob/master/docs/246_dos-attest-the-portable-signed-receipt.md)). One kernel, one\nverdict vocabulary, from the agent's tool call to your dashboard.\n\n*Next level up — run it every day: Operating a fleet.*\n\nThe ideas here are written up in a paper — *\"Verification Is All You Need — But\nNot Where You Think\"* — on the out-of-loop referee for agent fleets. A built PDF\nlives at [paper/releases/](https://github.com/anthony-chaudhary/dos-kernel/tree/master/paper/releases); the arXiv preprint is in\npreparation. 
Until the arXiv ID lands, cite the repository:\n\n```\n@misc{dos_kernel,\n  title        = {Verification Is All You Need --- But Not Where You Think},\n  author       = {Chaudhary, Anthony},\n  howpublished = {\\url{https://github.com/anthony-chaudhary/dos-kernel}},\n  note         = {DOS --- the Dispatch Operating System; arXiv preprint in preparation},\n  year         = {2026}\n}\n```\n\nMIT — see [LICENSE](https://github.com/anthony-chaudhary/dos-kernel/blob/master/LICENSE).", "url": "https://wpnews.pro/news/show-hn-dos-a-referee-between-ai-agents-that-doesn-t-believe-their-done", "canonical_source": "https://github.com/anthony-chaudhary/dos-kernel", "published_at": "2026-06-18 17:52:32+00:00", "updated_at": "2026-06-18 18:01:17.401249+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "ai-tools"], "entities": ["DOS", "Anthony Chaudhary", "PyYAML", "MIT", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/show-hn-dos-a-referee-between-ai-agents-that-doesn-t-believe-their-done", "markdown": "https://wpnews.pro/news/show-hn-dos-a-referee-between-ai-agents-that-doesn-t-believe-their-done.md", "text": "https://wpnews.pro/news/show-hn-dos-a-referee-between-ai-agents-that-doesn-t-believe-their-done.txt", "jsonld": "https://wpnews.pro/news/show-hn-dos-a-referee-between-ai-agents-that-doesn-t-believe-their-done.jsonld"}}