{"slug": "i-made-my-ai-models-argue-then-let-hermes-be-the-judge", "title": "I Made My AI Models Argue, Then Let Hermes Be the Judge", "summary": "A developer built Council, a system that pits three different AI models against each other to debate judgment calls, then uses Hermes as a judge to deliver a single verdict with a confidence score and explanation of any disagreement. The system runs two free hosted models and one local model through parallel Hermes subagents, with a second deliberation round where dissenting jurors can change their minds after reading opposing arguments. Every verdict and debate outcome is written into Hermes' memory, allowing the system to learn which juror to trust for which type of question and propose trust adjustments for human approval.", "body_md": "*This is a submission for the Hermes Agent Challenge: Build With Hermes Agent*\n\nTL;DR: Ask any judgment call and three different AI models argue it out, then Hermes hands down one verdict, a confidence score, and exactly why they split. Every verdict, dissent, and mind-changed-in-debate is written into Hermes' own memory, so the next question re-weights the jurors before they ever vote. The judging is a pure function over that memory: no memory, no weights, no verdict. Three models, one verdict, $0.\n\nAn LLM once talked me into the wrong database with total confidence. One smooth, authoritative answer. I shipped it. It cost me a weekend and a migration I'm still not over.\n\nThe villain here is **single-model overconfidence**: you get one polished reply, and the disagreement that should have warned you is invisible. You never see the other opinions, because you only asked one model.\n\n**So I stopped trusting one model. 
I convened a jury.**\n\nCouncil takes any judgment call (\"Postgres or Mongo?\", \"is this PR safe to merge?\", \"is this clause risky?\") and asks **three different models**, lets them disagree, then has Hermes deliver one verdict, a confidence score, and exactly *why* they split. Three models, one verdict, $0.\n\nYou ask a question. Council fans it out to three jurors (two free OpenRouter models from different families and one local model via Ollama), and each takes a position with reasons. Then, if they disagree, a **second deliberation round** runs: each juror sees the others' answers and either holds or changes its mind, so the council *debates* instead of just voting once. Hermes then judges the deliberated opinions: a single verdict, a **confidence score** (high when they agree, low when they split 2-1), and a \"why they disagreed\" panel. Every verdict is remembered, a `council` skill learns which juror to trust for which kind of question, and the agent can even **propose its own** trust adjustments for you to approve.\n\n*The whole product is one question box. Everything interesting happens behind it, and the rest of this post is mostly pictures of that \"behind.\"*\n\n**Repo:** [https://github.com/ArqamWaheed/council](https://github.com/ArqamWaheed/council)\n\n**Live demo:** [https://council-jet-kappa.vercel.app/](https://council-jet-kappa.vercel.app/)\n\nTry \"Should a 3-person startup use microservices?\" and open the dissent panel.\n\nLocal, one command (runs at $0 in offline mock mode, no key needed):\n\n```\ngit clone https://github.com/ArqamWaheed/council && cd council && ./setup_hermes.sh && python server.py\n```\n\nI think the design is easiest to *see*, so here's the system as a sequence of images. Each caption is the explanation.\n\n*The core loop. One question, three independent Hermes subagents (2 hosted + 1 local) fanned out in parallel, then a fourth Hermes run (the foreman) synthesizes one verdict. 
Every arrow is the same `hermes -z` interface; nothing talks to a model directly.*\n\n*The bet. A hosted model and an on-device model sit on the same jury, swapped with a single `--provider`/`--model` flag, no code change. This model-agnosticism is the one Hermes property the whole project is built on.*\n\n*The UX surface. Confidence is high when jurors agree and drops on a 2-1 split. The dissent panel is collapsed by default, and you expand it exactly when the confidence number makes you nervous.*\n\n*The actual product. A confident single answer hides this; Council makes the disagreement the headline. Getting the clustering right here was subtle (see \"What I learned\" below).*\n\n*The headline feature: a council that **deliberates, not just votes**. After round 1, disagreeing jurors get a second Hermes pass where they read each other's arguments and may hold or change their vote. A \"⇄ changed\" badge marks the ones that moved, and the confidence dial actually climbs when a 2-1 split is talked into agreement.*\n\n*The agentic learning loop, human-in-the-loop. Hermes proposes; you approve or dismiss. Approved rules persist client-side and ride along with the next convene call.*\n\n*Persistence the judge can verify. Verdicts are mirrored into Hermes' own memory, so recall is Hermes doing the work; proof lives in `docs/hermes-proof/04-memory-recall.txt`.*\n\n**Repo:** [https://github.com/ArqamWaheed/council](https://github.com/ArqamWaheed/council)\n\n**Interesting files:** `hermes_run.py` (the Hermes CLI driver every juror/judge call goes through); `run_council.py` (orchestration + the deterministic judge + Hermes foreman + the `--reflect` loop); `skills/council/SKILL.md` (the juror-weighting brain Hermes edits); `server.py` (the `/api/reflect` + `/api/learn` endpoints); `index.html` (the designed verdict UI with the foreman TTS readout and localStorage persistence). 
Proof that Hermes is genuinely in the loop (subagent transcripts, skill diff, memory recall) is in [docs/hermes-proof/](https://github.com/ArqamWaheed/council/tree/main/docs/hermes-proof).\n\n```python\n# hermes_run.py: every juror/judge call is a real Hermes run\nimport subprocess\n\ndef ask(prompt, provider, model, skills=None, timeout=120):\n    # binary() is defined elsewhere in the repo and returns the Hermes executable\n    cmd = [binary(), \"--provider\", provider, \"--model\", model]\n    if skills:\n        cmd += [\"--skills\", skills]\n    cmd += [\"-z\", prompt]                       # -z = one-shot, final answer on stdout\n    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout).stdout\n\n# jurors.py: fan out one Hermes subagent per juror, in parallel\nfrom concurrent.futures import ThreadPoolExecutor\n\n# roster() and ask_juror() are defined elsewhere in the repo\nwith ThreadPoolExecutor(max_workers=len(roster())) as pool:\n    opinions = list(pool.map(lambda c: ask_juror(*c), enumerate(roster())))\n```\n\n**Why Hermes at all: the model-agnostic core.** Hermes lets you point at any provider and swap with a flag, no code change. Council is built *on top of that one property*: the jurors are different models, and Hermes is the only piece that makes \"different models\" cheap. The clearest proof is the third juror: it runs **locally** via Ollama while the other two are **hosted** on OpenRouter, and all three answer through the exact same `hermes -z` interface (the model-agnostic diagram above). A hosted model and an on-device model, sitting on the same jury, no code change: that's model-agnosticism you can see. I genuinely didn't see another entry in this challenge exploit it; everyone picked one model and moved on. That's the whole bet.\n\n**Subagents: one real Hermes run per juror.** Each juror is a genuine, isolated Hermes invocation on a *different* provider+model (`hermes -z --provider openrouter --model …` for the two hosted jurors, `--provider ollama-local …` for the on-device one), fanned out **in parallel** so no model's reasoning anchors another's (the convene-flow diagram above). 
Hermes does the inference; my Python (`jurors.py` to `hermes_run.py`) is just the fan-out plumbing, and every juror in the output JSON is tagged `\"via\": \"hermes\"`. The gotcha worth flagging: Hermes enforces a **64K-context floor**, which for the local model meant setting both `ollama_num_ctx` *and* a named `custom_providers` entry; without the named provider, `--provider ollama` silently routed to the wrong base URL. `setup_hermes.sh` encodes the working config so a judge can reproduce it in one command.\n\n**A true debate, not just a vote (round 2 is real Hermes work).** This is the feature I'm proudest of. After round 1, if the jurors disagree, each one gets a *second* Hermes run that shows it the others' positions and lead reasons and asks it to hold or change its mind. Real jurors reconsider through the same `hermes -z` path as round 1, so the debate is genuine extra agentic work, not a UI flourish; mock jurors reconsider deterministically so the offline demo stays reproducible. The judge then synthesizes the verdict from the **deliberated** opinions, so a juror that's talked round actually moves the outcome (the deliberation diagram above). It's gated on disagreement (a unanimous round 1 skips it) and toggled with `COUNCIL_DEBATE=0`.\n\n**Why a skill, not a prompt, for judging.** The foreman's verdict is itself a Hermes run (`hermes -z --skills council`) grounded in `skills/council/SKILL.md`, which is **installed into Hermes** (`hermes skills list` shows it). The weighting logic lives in a machine-readable `weights` block.\n\n*The judging brain is data, not a buried prompt. 
`--learn` and `--reflect` both edit this block, and the installed Hermes copy is kept in sync.*\n\nAfter a string of security questions, `--learn` appended a rule to upweight the local model on that topic (*and synced the installed Hermes copy*) because it had caught issues the hosted models missed:\n\n```\npython run_council.py --learn \"Local Juror | security | 1.5\"\n```\n\nOn the next security question that juror's vote counts 1.5×, read straight back by the judge. Counterfactual: a static synthesis prompt can't get better; this does. (The before/after skill diff is in [docs/hermes-proof/03-skill-learning.txt](https://github.com/ArqamWaheed/council/blob/main/docs/hermes-proof/03-skill-learning.txt).)\n\n**Letting the agent propose its own learning, now on the web and grounded in evidence.** `python run_council.py --reflect` (and the **\"Should the council reweight itself?\"** button in the UI) hands Hermes its *own* memory of past verdicts and asks it to propose one weight change, e.g. \"the local juror has dissented on three database calls; upweight it.\" The key fix this round: the proposal is **evidence-grounded**, since Hermes is fed the actual dissent tally and any rule backed by fewer than two real dissents is rejected, so it can't just parrot the example baked into the skill. You then **Approve or Dismiss** it (the reflect-flow diagram above). That's the agentic loop done honestly: a single verdict has no ground truth, so the agent surfaces a *pattern* and a human confirms it's signal, not overfitting (the exact tension this post closes on). (Offline, it falls back to a deterministic heuristic so it never breaks.)\n\n**Making learning survive a stateless deploy.** On a hosted demo the filesystem is read-only, so an approved rule can't be written back to `SKILL.md`. 
Council handles this honestly: approved rules are stored in the browser's **localStorage** and re-sent with every `/api/convene` call, where they're merged into the judge's weights for that request. Locally you get a persistent `SKILL.md`; on the web you get per-browser persistence, and either way the learning sticks.\n\n**Why memory.** Each verdict is appended to a log *and mirrored into Hermes' own MEMORY.md*, so I can ask `hermes -z \"what did the council decide about auth?\"` and Hermes recalls it from its memory, not from my code (the memory-recall image above). Proof: `docs/hermes-proof/04-memory-recall.txt`.\n\n**The foreman reads the verdict aloud.** The verdict card has a \"the foreman reads the verdict\" button (browser SpeechSynthesis, $0); Hermes also ships native TTS via `hermes setup tts`. On-theme and memorable: a jury foreman *announcing* the decision.\n\n**The build itself was agent-run.** I kept a `memory.md` the coding agent read before each task and updated after (so context stayed cheap), committed every increment with Conventional Commits, and built the verdict UI with the **frontend-design** skill, which is why the confidence dial and colour-coded juror chips read as *designed*, not default-template AI slop. The repo's `AGENTS.md` + commit history show the process, not just the result.\n\n**Why these models, and the concession.** Two free OpenRouter models from different families (≥64K context, since Hermes rejects anything smaller at startup) plus a local Ollama juror. Two honest concessions: (1) free models are slower and three calls add latency (~10-20s/verdict); (2) the free tier is *aggressively* rate-limited, so I hit 429s constantly while building. Council retries and, if a juror still won't answer, falls back in order (Hermes, then the direct API, then a deterministic stand-in) rather than crashing the verdict, which also means the demo runs **fully offline at $0**. For a once-a-decision tool, I'll take it. Cost: $0.\n\n**License.** MIT. 
Fork it, add your own jurors.", "url": "https://wpnews.pro/news/i-made-my-ai-models-argue-then-let-hermes-be-the-judge", "canonical_source": "https://dev.to/arqamwd/i-made-my-ai-models-argue-then-let-hermes-be-the-judge-5e6c", "published_at": "2026-05-30 16:00:54+00:00", "updated_at": "2026-05-30 16:43:13.088229+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-safety", "ai-products"], "entities": ["Hermes", "Council", "OpenRouter", "Ollama"], "alternates": {"html": "https://wpnews.pro/news/i-made-my-ai-models-argue-then-let-hermes-be-the-judge", "markdown": "https://wpnews.pro/news/i-made-my-ai-models-argue-then-let-hermes-be-the-judge.md", "text": "https://wpnews.pro/news/i-made-my-ai-models-argue-then-let-hermes-be-the-judge.txt", "jsonld": "https://wpnews.pro/news/i-made-my-ai-models-argue-then-let-hermes-be-the-judge.jsonld"}}