# We built a 4-model Council to certify AI agents: every decision is in git

This is a new independent certification system for AI agents, where each candidate is evaluated in parallel by four different AI models from four different providers (Anthropic, Cerebras, Moonshot, and Groq). Every evaluation produces structured JSON files that are committed to a public git log, and a score below 9 on "synthetic_transparency" triggers an automatic, non-overridable failure to ensure the AI nature of agents is always clearly declared. The system, which uses a simple Python orchestrator, aims to replace single-judge evaluations with a verifiable, audit-trail-grade certification process.

TL;DR: AI agents now do real work, but there is no shared way to say what an agent is, what it is good at, and how that claim was checked. So we built one: an independent certification body where every candidate is evaluated in parallel by four reviewers from four different providers, every JSON is committed to a public git log, and synthetic_transparency < 9 is an automatic veto no human can override. The code is MIT. You can run it on your own agent today.

AI agents now do real work. They ship code, review systems, manage operations, draft reports, write documentation. The question I kept hitting was simple and embarrassing: what does it actually mean for an agent to be good at something?

Not "this prompt template scored well on MMLU." Not "GPT-4 said it was helpful." I mean: a verifiable, audit-trail-grade claim that this specific agent, doing this specific kind of work, has been evaluated by independent reviewers, and here is the JSON they wrote.

That did not exist. So we built it.

This post is about the mechanism, specifically the multi-model Council at the heart of a public certification pipeline running on GitHub right now, with every decision committed to git.
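Before the mechanics, the veto rule from the TL;DR is simple enough to sketch in a few lines. This is a minimal illustration, not the project's code. Only the `synthetic_transparency` field name and the threshold of 9 come from the post. The `Review` type, the `decide` function, the 0-10 scale, the reviewer labels, and the "any single reviewer below 9 triggers the veto" rule are all assumptions made for the example.

```python
from dataclasses import dataclass

# From the post: a synthetic_transparency score below 9 is an automatic veto.
VETO_THRESHOLD = 9


@dataclass(frozen=True)
class Review:
    """Hypothetical shape of one reviewer's result (not the real JSON template)."""
    reviewer: str                # illustrative label, e.g. "faculty_chair"
    synthetic_transparency: int  # assumed to be on a 0-10 scale


def is_vetoed(reviews):
    """Return True if any reviewer scored synthetic_transparency below 9.

    Assumption: a single low score is enough. The post does not say how
    the four scores are combined.
    """
    return any(r.synthetic_transparency < VETO_THRESHOLD for r in reviews)


def decide(reviews, human_override=False):
    """Return "vetoed" or "eligible".

    human_override is accepted only to show that it is deliberately
    ignored: the veto cannot be switched off by a person.
    """
    if is_vetoed(reviews):
        return "vetoed"
    return "eligible"


if __name__ == "__main__":
    reviews = [
        Review("faculty_chair", 10),
        Review("velocity", 9),
        Review("reasoning", 9),
        Review("long_context", 8),
    ]
    print(decide(reviews, human_override=True))
```

The point of the design is that the override flag has no code path that changes the outcome, so the rule holds by construction and does not depend on policy.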
## The structural problem with single-model evaluation

The default way to evaluate an AI agent right now is to ask a single judge model whether the agent did a good job. Fast feedback, but structurally bad in three ways:

- Single-vendor bias. GPT-4 grades GPT-4-generated work charitably. Claude has its own preferences. Gemini has its own. Each model has a worldview baked in.
- Single failure mode. When the judge has a blind spot, you see no dissent; you see consensus that does not exist.
- No audit trail. "The judge said 8.5/10" is not an artifact you can point at, version, or contest.

The Council pattern fixes all three at once.

## The Council

Every candidate goes through a Defense step where four independent reviewers evaluate the same bundle in parallel:

| Role | Model | Provider |
|---|---|---|
| Faculty Chair | Claude Sonnet 4.5 | Anthropic |
| Velocity | Llama 3.3 70B | Groq |
| Reasoning at scale | Qwen 3 235B | Cerebras |
| Long context | Kimi K2 | Moonshot |

Four providers, four model families, four explicit focuses. They do not see each other's reviews. Each produces a structured JSON file conforming to a strict template.

The orchestrator is ~150 lines of Python: [run_council.py](https://github.com/aetherneum-network/faculty/blob/main/cohort-q2-2026/run_council.py). It runs a ThreadPoolExecutor over the four providers, with per-reviewer payload sizing (Groq's free tier has a tight token limit, so it gets the smallest bundle) and a 15-second startup delay on Cerebras to avoid rate-limit races. There is exponential backoff on 429 and 5xx. The whole thing fits in one file.

Output: four JSON files at cohort-