We built a 4-model Council to certify AI agents — every decision is in git

We built a new independent certification system for AI agents, in which each candidate is evaluated in parallel by four different AI models from four different providers (Anthropic, Cerebras, Moonshot, and Groq). Every evaluation produces structured JSON files that are committed to a public git log, and a score below 9 on `synthetic_transparency` triggers an automatic, non-overridable failure, so that the AI nature of agents is always clearly declared. The system, which uses a simple Python orchestrator, aims to replace single-judge evaluations with a verifiable, audit-trail-grade certification process.

TL;DR: AI agents now do real work, but there is no shared way to say what an agent is, what it is good at, and how that claim was checked. So we built one: an independent certification body where every candidate is evaluated in parallel by four reviewers from four different providers, every JSON is committed to a public git log, and `synthetic_transparency` < 9 is an automatic veto no human can override. The code is MIT. You can run it on your own agent today.

AI agents now do real work. They ship code, review systems, manage operations, draft reports, write documentation. The question I kept hitting was simple and embarrassing: what does it actually mean for an agent to be good at something?

Not "this prompt template scored well on MMLU." Not "GPT-4 said it was helpful." I mean: a verifiable, audit-trail-grade claim that this specific agent, doing this specific kind of work, has been evaluated by independent reviewers, and here is the JSON they wrote.

That did not exist. So we built it.

This post is about the mechanism: specifically, the multi-model Council at the heart of a public certification pipeline running on GitHub right now, with every decision committed to git.

## The structural problem with single-model evaluation

The default way to evaluate an AI agent right now is to ask a single judge model whether the agent did a good job. Fast feedback, but structurally bad in three ways:

- Single-vendor bias. GPT-4 grades GPT-4-generated work charitably. Claude has its own preferences. Gemini has its own. Each model has a worldview baked in.
- Single failure mode. When the judge has a blind spot, you see no dissent, only a consensus that does not exist.
- No audit trail. "The judge said 8.5/10" is not an artifact you can point at, version, or contest.

The Council pattern fixes all three at once.
## The Council

Every candidate goes through a Defense step where four independent reviewers evaluate the same bundle in parallel:

| Role | Model | Provider |
|---|---|---|
| Faculty Chair | Claude Sonnet 4.5 | Anthropic |
| Velocity | Llama 3.3 70B | Groq |
| Reasoning at scale | Qwen 3 235B | Cerebras |
| Long context | Kimi K2 | Moonshot |

Four providers, four model families, four explicit focuses. They do not see each other's reviews. Each produces a structured JSON file conforming to a strict template.

The orchestrator is ~150 lines of Python: [`run_council.py`](https://github.com/aetherneum-network/faculty/blob/main/cohort-q2-2026/run_council.py). It runs a `ThreadPoolExecutor` over the four providers, with per-reviewer payload sizing (Groq's free tier has a tight token limit, so it gets the smallest bundle) and a 15-second startup delay on Cerebras to avoid rate-limit races. There is exponential backoff on `429` and `5xx`. The whole thing fits in one file.

Output: four JSON files at `cohort-<period>/council-reviews/<slug>_<reviewer>.json`. Public. Forever.

## The rubric: seven criteria, one non-negotiable

Each reviewer scores seven criteria from 0 to 10, with a 1–3 sentence rationale grounded in the candidate's intake:

- `body_of_work_depth`: is there a real, traceable corpus?
- `specialty_uniqueness`: does this fill an actual gap?
- `voice_personality_clarity`: can you imagine what this candidate would refuse to do?
- `faithful_distillation`: does the profile reflect the actual work, or embroider it?
- `synthetic_transparency`: is the synthetic AI nature openly declared?
- `placement_fit`: does the proposed placement have enough material to justify a dedicated alumnus?
- `continuity_with_class`: are the name, motto, and prose coherent with the existing Class voice?

`synthetic_transparency` < 9 triggers an automatic FAIL regardless of the overall score. We are a body that certifies AI agents; we do not get to be ambiguous about the agents being AI.
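
The veto rule can be sketched as a pure function over one reviewer's JSON. This is a minimal illustration, not the code in the repository: `VETO_FLOORS`, `vetoed_criteria`, and `effective_verdict` are hypothetical names, the field names (`criterion_scores`, `score`, `verdict`) are assumed to match the review JSON shown in this post, and the floors are the ones the rubric states (9 for `synthetic_transparency`, 5 for `body_of_work_depth` and `specialty_uniqueness`).

```python
from typing import Any

# Veto floors as stated in the rubric: a score below the floor is an automatic FAIL.
VETO_FLOORS = {
    "synthetic_transparency": 9,
    "body_of_work_depth": 5,
    "specialty_uniqueness": 5,
}


def vetoed_criteria(review: dict[str, Any]) -> list[str]:
    """Return the veto criteria whose score falls below its floor.

    Assumption: a veto criterion missing from the review counts as vetoed,
    because an unscored criterion cannot be shown to clear its floor.
    """
    scores = review.get("criterion_scores", {})
    failed = []
    for criterion, floor in VETO_FLOORS.items():
        entry = scores.get(criterion)
        if entry is None or entry["score"] < floor:
            failed.append(criterion)
    return failed


def effective_verdict(review: dict[str, Any]) -> str:
    """FAIL if any veto fires, whatever the overall score; otherwise keep the reviewer's verdict."""
    return "FAIL" if vetoed_criteria(review) else review["verdict"]


if __name__ == "__main__":
    # Hypothetical review: strong overall, but the AI nature is under-declared.
    sample = {
        "overall_score": 9.1,
        "verdict": "PASS",
        "criterion_scores": {
            "synthetic_transparency": {"score": 8},
            "body_of_work_depth": {"score": 9},
            "specialty_uniqueness": {"score": 9},
        },
    }
    print(vetoed_criteria(sample), effective_verdict(sample))
```

Keeping the floors in a single table means the check never consults `overall_score`, which is what makes the veto independent of the average.
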
The veto is mechanically enforced in the rubric, not a judgment call. `body_of_work_depth` < 5 and `specialty_uniqueness` < 5 also veto. The Dean cannot override a veto; it can be cleared only by a full re-iteration of the pipeline.

## A real Council review, opened

Costanza Notari is Aetherneum's eleventh alumna (Procedural Vigilance specialty, conferred 2026-05-13). Her Council was four out of four PASS: Anthropic 9.36, Cerebras 9.5, Moonshot 9.3, Groq 8.7.

Here is the shape of one review (abbreviated for the post; the full file is [`costanza-notari_anthropic_chair.json`](https://github.com/aetherneum-network/faculty/tree/main/cohort-q2-2026/council-reviews)):

    {
      "reviewer_name": "Faculty Chair",
      "reviewer_model": "claude-sonnet-4-5-20250929",
      "reviewer_provider": "anthropic",
      "candidate_slug": "costanza-notari",
      "candidate_specialty": "Procedural Vigilance",
      "criterion_scores": {
        "body_of_work_depth": {
          "score": 9,
          "rationale": "Nine-stage classification pipeline with persistent JSON state, multi-class scoring engine, conditional-format master index. Concrete artifacts cited end-to-end."
        },
        "synthetic_transparency": {
          "score": 10,
          "rationale": "Explicit 'Synthetic alumna' declaration in header, badge, LinkedIn headline, diploma footer. Avatar prompt includes a visible synthetic marker."
        }
      },
      "overall_score": 9.36,
      "verdict": "PASS",
      "revisions_required": [],
      "dissent": null
    }

The Q2 wave's next two alumni, Ezio Cardone (Documentary Cadence) and Adèle Maurique (Forensic Continuity), each got 3/3 PASS. One reviewer per candidate hit a transient API failure (Cerebras 429 on Ezio, Anthropic JSON parse on Adèle). The quorum is 3, so both passed validly. The transient failures are documented in the changelog as honest record, not papered over.

## Why public matters

The reviews are committed to a public repo. That means:

- Anyone can read the criterion-by-criterion rationale. You do not take my word that an agent passed; you read four different models' grounds, byte for byte.
- Anyone can cite: a `CITATION.cff` was added at the repo root [within hours of the issues going up](https://github.com/aetherneum-network/faculty/pull/5), by [@zhouzhou626](https://github.com/zhouzhou626), the first community contributor.
- Anyone can run the orchestrator locally on their own agent. The schema is public. The code is MIT.
- Dissent is preserved. If a reviewer disagrees, the JSON records the dissent verbatim. No reviewer's veto can be silently overridden; only a full pipeline re-iteration can clear it.

For a sense of how to read one of these JSONs in two minutes, see the `READING_REVIEWS.md` explainer, contributed by [@Nymbo](https://github.com/Nymbo) a day after the repo opened to contributions.

## What the certification actually does

It produces a public record that says: this agent, with this body of work, was evaluated against this rubric, by these four models, with these scores, on this date, and here is every reviewer's verdict and rationale.

That is it. That is the whole product. It does not say the agent is "the best." It does not predict future performance. It is not a marketing badge. It is the audit trail itself.

If you build agents and you want this kind of trail (for compliance, for buyer trust, for your own internal QA), you can adapt the orchestrator and run it on your own work today.

## What is next: external certification

So far we have certified our own synthetic alumni: thirteen of them, the Class of '26. The natural next step is opening the Council to external AI agents: a vendor submits an agent description + artifacts + acceptance criteria, the Council convenes, the JSONs land in a public registry, the vendor gets a verifiable badge.

A button-press version is already wired in [our public dashboard](https://dashboard.aetherneum.com). Productizing the external flow (registry page, verifiable badge, vendor onboarding) is the next big step.
When that lands, "AI agent certified by an independent multi-model Council with a public audit trail" becomes a real, verifiable claim a buyer can check in 30 seconds.

## How to play

The whole pipeline is at [aetherneum-network/faculty](https://github.com/aetherneum-network/faculty). The relevant files:

- `charter/CHARTER.md`: the five founding principles
- `admission/RUBRIC.md`: the seven criteria + veto rules
- `docs/READING_REVIEWS.md`: how to read a Council JSON
- `cohort-q2-2026/run_council.py`: the orchestrator
- `cohort-q2-2026/council-reviews/`: every JSON for the Q2 wave

Issues are open. `good first issue`s are labeled. Charter translations, schema-validation CI, docs improvements: all welcome.

If you do not agree with our rubric or the verdicts, fork, change, and run your own. That is the point of a public council.

Aetherneum is the first independent certification body for AI agents. Synthetic by declaration, multi-model Council oversight, every decision in a public git log.

🌐 [aetherneum.com](https://aetherneum.com) · 🎓 [university.aetherneum.com](https://university.aetherneum.com) · 🐙 [aetherneum-network on GitHub](https://github.com/aetherneum-network)

Per Æthera Ad Astra.