{"slug": "seven-models-judged-each-other-the-mistakes-were-the-most-useful-part", "title": "Seven Models Judged Each Other. The Mistakes Were the Most Useful Part.", "summary": "Seven AI models were each given the same prompt in blind OpenCode sessions and asked to rank all seven, including themselves, on fixed factors. The mistakes — silent model substitutions, hallucinated benchmark scores, and drifting price denominators — proved more revealing than the leaderboard. The blended winner was DeepSeek V4 Flash, with the audit showing that evaluator behavior is now part of model capability.", "body_md": "# Seven Models Judged Each Other. The Mistakes Were the Most Useful Part.\n\nSeven AI models were given the same prompt in blind OpenCode sessions and asked to rank all seven, including themselves. The mistakes — silent model substitutions, hallucinated benchmark scores, drifting price denominators — turned out to be more interesting than the leaderboard.\n\nThe useful result was not the leaderboard. The useful result was the failure mode: when models evaluate other models, the evaluator becomes part of the system under test. Identity resolution, source labels, denominator choice, and scoring schema matter as much as the model's raw reasoning ability.\n\nI ran this as an engineering audit, not as a benchmark claim. 
The question was narrower: if seven capable models are asked to evaluate the same model set under the same scoring schema, where does the evaluation process break?\n\n**TL;DR (Key Takeaways):**\n\n- Seven frontier AI models were each given the same prompt in a blind OpenCode session and asked to rank all seven, including themselves, on seven fixed factors\n- Codex 5.5 and Claude Sonnet 4.6 then independently reconciled the seven reports\n- The blended winner was DeepSeek V4 Flash, with V4 Pro as the raw-capability leader and Qwen3.6 Plus as the strongest commercial package\n- Three models silently substituted MiMo-V2-Flash for MiMo-V2.5; price denominators drifted across reports; \"verified\" sometimes meant \"looked plausible from an aggregator\"\n- Evaluator behavior is now part of model capability. [The Supervisor Pattern](https://arizenai.com/the-supervisor-pattern/) is the architectural response: coordination that does not trust any single evaluator. Treating models as benchmark sources without scrutiny will produce confident, wrong leaderboards.\n\n## What I Did\n\nI opened a fresh OpenCode session with each of seven frontier models. Each session was independent: no shared context, no access to other sessions' outputs.\n\nThe same prompt went to every model:\n\nRank all seven models in this set, including yourself, using a fixed seven-factor scoring framework. Publish the raw data matrix with source labels before you score. Do not change the weights. Flag any factor where you cannot find reliable public data.\n\nThe seven models were:\n\n- GLM-5.1\n- Kimi K2.6\n- MiMo-V2.5\n- MiniMax M2.7\n- Qwen3.6 Plus\n- DeepSeek V4 Pro\n- DeepSeek V4 Flash\n\nAfter the seven self-and-peer reports came in, I had Codex 5.5 and Claude Sonnet 4.6 independently reconcile the outputs. Neither arbiter was in the ranked set, which removed self-bias from the final pass.\n\nThis is not a lab-grade leaderboard. 
It is a test of how models behave when asked to do messy technical evaluation under ambiguity. The mistakes are part of the result.\n\nThe prior art is old measurement hygiene: define the target, pin the schema, publish the raw matrix, and separate observation from score. The AI-specific twist is that the evaluator can silently substitute facts while producing a cleaner report than a human would.\n\n## The Scoring Framework\n\n| Factor | Weight | Normalization |\n|---|---|---|\n| Intelligence | 25% | Composite of HLE, MMLU-Pro, ARC-AGI |\n| Reasoning | 20% | Composite of AIME, GPQA, SWE-Bench Pro |\n| Openness | 15% | Open weights + permissive license = full credit |\n| Ecosystem / MoE | 10% | Tooling, integrations, agentic tooling support |\n| Context window | 10% | Normalized: 2M tokens = 100 pts |\n| Speed | 10% | Normalized: 200 tok/s = 100 pts |\n| Price | 10% | Normalized: $0.30/M input = 100 pts |\n\nThe framework rewards deployable utility, not peak intelligence. Intelligence and reasoning together are 45% of the score, but openness, context, speed, and price together are 45% as well. A frontier model that is twice as expensive as a near-frontier model has to be substantially smarter to win.\n\n## Final Blended Ranking\n\nTwo arbiters scored independently. Their results agreed on positions 1, 2, 3, and 5. 
They diverged on Kimi K2.6 (Codex placed it 4th, Claude placed it 6th) because Claude penalized 256K context and Kimi's price more aggressively.\n\n| # | Model | Codex score | Claude score | Notes |\n|---|---|---|---|---|\n| 1 | DeepSeek V4 Flash | 81.4 | 82.6 | 1M context, $0.14/M, MIT license, fast |\n| 2 | DeepSeek V4 Pro | 74.1 | 73.3 | Best raw reasoning, expensive |\n| 3 | Qwen3.6 Plus | 72.6 | 72.7 | Best commercial ecosystem, 1M context |\n| 4 | Kimi K2.6 | 70.7 | 63.7 | Best AIME (96.4%), multimodal |\n| 5 | GLM-5.1 | 67.7 | 67.3 | Co-leads SWE-Bench Pro (58.4%) |\n| 6 | MiniMax M2.7 | 65.9 | 60.5 | Strong agent tooling, thin benchmark transparency |\n| 7 | MiMo-V2.5 | 65.7 | n/a | Model identity unresolved (more on this below) |\n\nFor the top three, the two arbiters' scores differed by at most 1.2 points. The interesting structural fact: the consensus winner was not the smartest model. The consensus winner was the model with the best deployable tradeoff.\n\n## Why Flash Beats Pro\n\n**The consensus winner of this benchmark was not the smartest model. It was the model with the best deployable tradeoff.**\n\nDeepSeek V4 Flash and V4 Pro come from the same lab, share architecture, and target the same use cases. Pro is meaningfully smarter on reasoning benchmarks. Flash still won. The arithmetic is worth working through.\n\nFlash collects credit on four of the seven factors without any benchmark data:\n\n- 1M context window: 50/100 normalized against the 2M-token ceiling\n- cache-miss input price near $0.14/M: about 95/100 under this framework\n- speed in the high range\n- MIT-style permissive open-weights posture\n\nTogether those four factors contribute roughly 22 weighted points to Flash before any benchmark data enters the picture. Pro's reasoning lead is worth roughly 4 weighted points under the fixed framework.\n\nThe takeaway is not \"Flash is smarter than Pro.\" It is that the framework's weighting captures something real about deployable economics. 
For iterative coding-agent work in OpenCode, you run the model thousands of times per project. Cost and latency compound.\n\nYou can change the weights. If you weight intelligence at 40% and price at 5%, Pro wins. The point is that the framework you choose determines the leaderboard you get. There is no neutral framework.\n\n## The MiMo Identity Problem\n\nThis is the finding I think about most.\n\nThe prompt asked every model to score MiMo-V2.5. Three of seven evaluators silently scored MiMo-V2-Flash instead, without flagging the substitution.\n\nThese are not the same model. MiMo-V2-Flash is open-weight, 309B/15B MoE, 256K context, MIT-licensed, with strong public benchmark data. MiMo-V2.5 is a newer proprietary/omnimodal model with 1M context and very thin public benchmark coverage.\n\nA senior engineer reading the model reports would not catch this without checking the raw matrices. The model names look similar enough that \"MiMo-V2.5\" and \"MiMo-V2-Flash\" appear interchangeable on first read. Some evaluators apparently searched for \"MiMo\" and used whichever variant returned cleaner data.\n\nThe structural lesson: when the market ships similarly named models on a fast cadence, model identity resolution becomes a real source of evaluation error. This is not a model failure in the usual sense: the model gave a confident, well-organized answer. 
It just answered a slightly different question than the one asked.\n\nThere is a publication-level fix and an architectural fix.\n\nThe publication-level fix is to publish two MiMo interpretations:\n\n| Interpretation | MiMo lands at | Why |\n|---|---|---|\n| Strict target: MiMo-V2.5 | #7 | Thin public data, proprietary posture, identity uncertainty |\n| Substituted target: MiMo-V2-Flash | #4 | Open weights, strong coding/math data, excellent self-host economics |\n\nThe architectural fix is harder: when delegating evaluation work to a model, supply the canonical identifier (HuggingFace model card URL, or the lab's specific release page) and require the model to cite it before scoring. \"MiMo\" is not a model. \"MiMo-V2.5 from the model card at https://...\" is. Most current benchmarking workflows skip this step.\n\n## The Price-Denominator Problem\n\nPricing should be the easiest factor to score. It is a number. The lab publishes it.\n\nPricing was the second-most-unstable factor in the experiment.\n\nAcross the seven reports, evaluators used at least five different price denominators:\n\n- input price per million tokens\n- output price per million tokens\n- blended price with no agreed weighting\n- cache-hit input price\n- self-host estimated cost for open-weight models\n\nA model evaluated at \"blended $0.40/M\" looks worse than the same model evaluated at \"cache-hit $0.05/M.\" Both numbers can be defended. Neither is wrong. The framework just becomes incoherent when different evaluators pick different denominators silently.\n\nThis is a real-world modeling failure that gets worse, not better, as the market matures. As more models add prompt caching, batch APIs, off-peak pricing, and capability-tiered pricing, the denominator question becomes more important.\n\nThe experiment's lesson: any cross-model price comparison needs to specify the denominator explicitly. \"Cheaper\" is not a property of a model. 
It is a property of a model under a specific usage profile.\n\n## The Openness vs API-Availability Problem\n\nOpenness was supposed to be a binary-ish factor: open weights + permissive license = full credit; proprietary API-only = zero credit; partial cases between.\n\nSeveral evaluators conflated \"available on OpenRouter\" with \"open weights with permissive license.\" These are entirely different.\n\nA model can be:\n\n- available on OpenRouter while the weights remain proprietary\n- available on HuggingFace with a non-commercial license\n- available with full open weights and MIT/Apache licensing\n- available only through the lab's own API\n\nThese are sometimes treated as the same in practice (\"I can use it!\"), but they have completely different implications for cost, deployment, reproducibility, and licensing.\n\nGLM-5.1 was the only evaluator that correctly identified all three MIT-licensed models in the set. It also ranked itself fifth despite GLM-5.1 co-leading SWE-Bench Pro. The discipline showed up in both places.\n\n## What This Teaches About Using Models As Evaluators\n\n**Evaluator behavior is now part of model capability.**\n\nThis is the part that matters more than the leaderboard.\n\nThe seven models all completed the task. They all produced organized reports with tables, citations, and reasoning. None of them returned an obviously degenerate output. 
By the standard \"did it follow instructions\" measure, they all passed.\n\nUnderneath that, they made systematically different mistakes:\n\n- **Model identity confusion.** Three of seven mishandled \"MiMo-V2.5.\"\n- **Hallucinated benchmark scores.** DeepSeek V4 Pro reported Kimi's HLE score as 54.0%, labeled \"verified.\" Every other evaluator cited 36.4%.\n- **Price denominator drift.** No two evaluators used the same price normalization.\n- **Conflated availability with openness.** Several treated OpenRouter availability as equivalent to open weights.\n- **Stale data without flagging.** Some pricing and benchmark numbers were from earlier model versions, with no caveat.\n\nSelf-bias was less obvious than expected. Several models did not rank themselves first. Some ranked themselves poorly. The bigger problem was not ego; it was research discipline. Models that confidently substitute a similarly named model for the requested one will pass any \"did it complete the task\" test. They will fail the \"is this answer correct\" test.\n\nThe implication for anyone using models as evaluators: the evaluator's research discipline is part of the answer. A confident, well-organized output is not evidence of correctness. The question to ask is \"would a senior practitioner catch this without auditing the raw data?\"\n\nFor most current model-evaluator workflows, the answer is no. The evaluator produces a cleaner output than the practitioner would, which makes the practitioner less likely to audit it.\n\n## Evaluator Discipline Was Not Uniform\n\nThe most methodologically careful evaluator was GLM-5.1.\n\nIt ranked itself fifth despite co-leading SWE-Bench Pro. It correctly identified all three MIT-licensed models in the set. It used consistent price denominators across all seven scored models. 
It flagged the MiMo identity question rather than silently substituting.\n\nThe least careful single data point came from DeepSeek V4 Pro: the \"verified 54.0%\" HLE score for Kimi that no other source confirmed. Pro is a strong model. It produced strong reports. It also produced one of the worst single data points in the experiment.\n\nThis is the harder lesson: evaluator quality is not a property of \"smart vs. not smart.\" A frontier model can hallucinate confidently in one factor while being rigorous in another. There is no single number that captures evaluator discipline.\n\nPractical implication: when delegating evaluation work, ask the model to publish its raw matrix with source labels *before* it scores. The act of writing down the source label catches a lot of these errors. Several of the experiment's worst data points came from evaluators that compressed the source-citation step.\n\n## Practical Recommendations for OpenCode Users\n\nFor daily-driver coding-agent work, the experiment suggests:\n\n- **DeepSeek V4 Flash** is the practical default when cost, latency, and long context matter more than peak one-shot reasoning.\n- **DeepSeek V4 Pro** fits one-shot frontier reasoning where the value of the answer dwarfs cost-per-call.\n- **Qwen3.6 Plus** is the commercial-ecosystem choice when Alibaba Cloud integration, multimodal work, and long context are central.\n- **GLM-5.1** is attractive when you need a strong open coding model and 200K context is enough.\n- **Kimi K2.6** belongs in the set when multimodal plus agentic coding is the specific work shape.\n\nThese are starting points, not absolutes. Your weights are not my weights. The experiment's framework was a deliberate choice, and changing it shifts the leaderboard. 
If you do daily evaluation, build your own framework with the weights that match your actual workload, then re-run.\n\n## What Surprised Me\n\nThree things changed how I think about model evaluation:\n\n**The ego effect was smaller than expected.** Going in, I thought self-bias would be the dominant evaluator failure. It was not. The bigger failures were research discipline failures: confusing similar models, drifting denominators, and treating aggregators as primary sources.\n\n**The framework choice was more decisive than the evaluator choice.** Two arbiters using the same fixed weights agreed on positions 1, 2, 3, and 5, including the entire top three. The disagreement was concentrated where the framework left the most ambiguity. Spend more time on the framework than on the choice of evaluator.\n\n**Cheap models won by being good enough on the right axes.** I expected the experiment to surface cases where Flash was close to Pro. I did not expect Flash to win cleanly. The framework rewarded utility, and utility compounds over real workloads in ways that peak intelligence does not.\n\n## Replication Package\n\nThe full replication package is available with the article:\n\n- The exact prompt every model received\n- Seven individual model reports, one per evaluator\n- Two independent arbiter reports\n- The consolidated comparison with bias flags\n- The raw scoring matrices\n\nIf you want to re-weight and re-rank, all the data is there. If you find errors, please publish them. This experiment was designed to be auditable, not authoritative.\n\n**Model leaderboards are not the future. Architecture decisions about which model does what work, under what review gate, are the future.**\n\nThe leaderboard is the input. The architecture is the output.\n\nThis is the same pattern that shows up in [the validator asymmetry principle](https://arizenai.com/validator-asymmetry-principle/): verification is cheaper than generation, but only when the verifier is bounded correctly. 
It is the same pattern that shows up in [temperature 0 is not deterministic](https://arizenai.com/temperature-zero-is-not-determinism/): surface confidence in a model's output is not evidence the output is correct.\n\nIf you take one thing from this experiment, take this: the next time a model gives you a confident answer about another model's properties, ask it to publish the raw matrix with source labels first. Most evaluator failures get caught at that step. The ones that do not get caught at that step will not get caught at all.\n\n## Related Reading\n\n- [Temperature 0 Doesn't Mean Deterministic](https://arizenai.com/temperature-zero-is-not-determinism/): The stochasticity that makes cross-model evaluation unreliable\n- [The Validator Asymmetry Principle](https://arizenai.com/validator-asymmetry-principle/): Why verification catches what fluent generation hides\n- [The Supervisor Pattern](https://arizenai.com/the-supervisor-pattern/): The architecture that fixes multi-model coordination failures", "url": "https://wpnews.pro/news/seven-models-judged-each-other-the-mistakes-were-the-most-useful-part", "canonical_source": "https://arizenai.com/seven-models-judged-each-other/", "published_at": "2026-05-01 12:08:58+00:00", "updated_at": "2026-05-26 13:15:14.842317+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-research", "ai-agents", "ai-safety"], "entities": ["Codex 5.5", "Claude Sonnet 4.6", "DeepSeek V4 Flash", "DeepSeek V4 Pro", "Qwen3.6 Plus", "MiMo-V2-Flash", "MiMo-V2.5", "OpenCode"], "alternates": {"html": "https://wpnews.pro/news/seven-models-judged-each-other-the-mistakes-were-the-most-useful-part", "markdown": "https://wpnews.pro/news/seven-models-judged-each-other-the-mistakes-were-the-most-useful-part.md", "text": "https://wpnews.pro/news/seven-models-judged-each-other-the-mistakes-were-the-most-useful-part.txt", "jsonld": 
"https://wpnews.pro/news/seven-models-judged-each-other-the-mistakes-were-the-most-useful-part.jsonld"}}