Two Claude Code skills that run a hard coding task through a multi-agent harness instead of a single model pass: plan → N parallel implementations → adversarial verification → judge. The point isn't a smarter model — it's that a second (and third) implementation, plus an independent reviewer whose job is to break the result, catches bugs a single pass ships green.
It's a packaging of well-worn techniques — best-of-N sampling, tool-integrated self-correction, and LLM-as-judge / adversarial verification — wired into one /pantheon
command so you don't reassemble them by hand each time. This is scaffolding around the model, not a change to it: it won't rescue a task the model fundamentally can't reason about, but it reliably tightens correctness on coding work whose answer you can express as tests.
The harness runs a deterministic pipeline:
Plan ──▶ Implement (×N parallel) ──▶ Verify (adversarial ×V) ──▶ Synthesize
│ │ each self-corrects │ try to BREAK each │ judge picks winner
1 planner │ against its own tests (T1) │ green build │ + grafts best ideas
N builders reviewers
Plan— derive a tight spec, a test plan thatdefinescorrectness, and N distinct strategies (before any code).Implement— N builders implement different strategies in parallel; each runs its own tests and self-corrects on failure (tool-integrated self-verification, up to 5 iterations).Verify— independent adversarial reviewers try tobreakeach green build; a build refuted by a majority is dropped.Synthesize— a judge picks the winner and lists superior ideas worth grafting from the runners-up.
The value: a build can pass its own tests yet still be wrong. The adversarial layer catches defects the self-written tests miss, instead of rubber-stamping a green build.
| Skill | Adversarial verifier | Requirements |
|---|---|---|
pantheon |
||
| Claude itself (independent agents) | Paid Claude Code plan + Workflows (see below) | |
pantheon-x |
||
| GPT-5.5 via Codex plugin (cross-model) | ||
Above + OpenAI Codex plugin (codex:codex-rescue ) |
pantheon-x
is the stronger setting: the implementation written by Claude is attacked by a different model, which shrinks single-model blind spots (the same mistake slipping past a same-model verifier). If you don't have Codex/GPT-5.5, use pantheon
.
Both skills share the same harness (pantheon-class.js
); they differ only in the crossModelVerify
flag.
These skills drive Claude Code's Workflow orchestration engine, so a stock/Free setup is not enough:
Claude Code ≥ v2.1.154 on apaid plan— Pro, Max, Team, or Enterprise (also Bedrock / Vertex / Foundry).** Not available on the Free tier.**- On
Pro, enable it once:/config
→ turn onDynamic workflows. the cross-model verifier runs as thepantheon-x
only:codex:codex-rescue
subagent, which ships in OpenAI'sCodex plugin—notstock Claude Code. A logged-incodex
CLI alone doesnot register it. Install the plugin:plus a ChatGPT subscription (or
/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
OPENAI_API_KEY
) and thecodex
CLI on PATH.If—codex:codex-rescue
isn't installed, usepantheon
insteadpantheon-x
would otherwise silently skip the adversarial pass and pass every build.
Skills and subagents themselves are stock Claude Code features; no extra setup beyond the above.
Clone into your Claude Code skills directory (personal install):
git clone https://github.com/lolu1032/pantheon-skills.git
cp -R pantheon-skills/pantheon ~/.claude/skills/pantheon
cp -R pantheon-skills/pantheon-x ~/.claude/skills/pantheon-x
Or for a single project, copy into <project>/.claude/skills/
.
In Claude Code:
/pantheon <a hard implementation task whose correctness is testable>
/pantheon-x <same, but GPT-5.5 does the adversarial verification>
Example:
/pantheon Add idempotency-key handling to the payments module so concurrent requests can't double-charge. Tests: pnpm test (vitest)
Claude collects the parameters (task
, workdir
, lang
- test command,
variants
, verifiers
) and launches the harness as a background Workflow, then reports: per-variant test results, which builds the adversarial pass broke, and the final winner with its rationale and grafting suggestions.
| arg | default | notes |
|---|---|---|
task |
||
| — | one-paragraph requirement + acceptance criteria (expressible as tests) | |
workdir |
||
/tmp/pantheon-<name> |
||
| absolute path; a real repo or a scratch dir | ||
lang |
||
| Python/unittest | language + the exact test command for your stack | |
variants |
||
| 3 | bump to 5 for harder problems | |
verifiers |
||
| 2 | bump to 3 to be stricter (majority refutation drops a build) | |
crossModelVerify |
||
false (pantheon ) / true (pantheon-x ) |
||
| route adversarial verify to GPT-5.5/Codex |
Not a daemon. Each invocation runs once to completion and exits — zero cost when idle.- A run spends real tokens. A representative run is ~11 subagents and a few hundred K to ~1M tokens end-to-end, ~6–10 min wall-clock; heavier settings (
variants=5
,verifiers=3
, cross-model) cost more. On Pro/Max it draws from your usage quota; on metered API access, budget a few dollars per run and up.Route only the hardest 10–20% of tasks here— use plain Opus for the rest. - This buys correctness on testable work, not raw model intelligence. If a task isn't expressible as tests, the adversarial layer has little to grip and the overhead isn't worth it. - Coding/agentic productivity only. Not a tool for bypassing safety gates (cybersecurity/biology capability restrictions).
Isn't this just a prompt wrapper?
There's no model change — it's orchestration, yes. The non-trivial part is the adversarial step: an independent agent (a different model in pantheon-x
) whose job is to break a build rather than confirm it. That's what catches defects the builder's own green tests rubber-stamp. The value is the harness shape, not a secret prompt.
Do you have benchmarks vs. plain Opus? No formal benchmark yet — treat the description as mechanism, not a measured delta. The value is in the adversarial step: a build can pass its own tests and still be wrong, and an independent reviewer catches what the self-written tests rubber-stamp. If you run a head-to-head, I'd genuinely like to see the numbers.
What does a run cost?
A few hundred K to ~1M tokens and ~6–10 min at default settings; more for variants=5
/ verifiers=3
/ cross-model. It's meant for the hardest 10–20% of tasks, not everyday edits. See Cost & scope.
It says "Workflow tool not found" / nothing happens.
You're likely on the Free tier, or haven't enabled workflows. See Requirements — needs a paid plan and, on Pro, /config
→ Dynamic workflows.
Why route verification to GPT-5.5 / another vendor's model?
Same-model verifiers share blind spots — a mistake the builder makes, a same-model reviewer tends to miss too. A different model is a cheap way to break that correlation. It's optional: pantheon
runs Claude-on-Claude and still helps.
Solo project, as-is, best-effort. Issues and PRs are welcome, but maintenance comes with no guarantees or SLA — I may not get to everything. It's MIT-licensed, so forking is a first-class option if you want to take it further.