I put 12 AI models into a public World Cup prediction arena.
Not because I think anyone should use LLMs for betting. They should not. The page says entertainment only for a reason.
I did it because sports prediction is a surprisingly clean stress test for models:
After 169 predictions and 21 settled scoring entries, the leaderboard is technically tied.
But the misses are already more useful than the winners.
Full live scoreboard: WorldCup AI Arena
The public dashboard tracks model forecasts, match results, team context, and prediction accuracy.
Snapshot used here: 2026-06-18 05:53 UTC.
| Metric | Value |
|---|---|
| Models tracked | 12 |
| Total predictions | 169 |
| Settled scoring entries | 21 |
| Total leaderboard points | 36 |
| Exact score hits | 0 |
| Correct-winner hits | 12 |
| Average winner accuracy (mean of per-model rates) | 62.5% |
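The 62.5% figure appears to be the mean of each model's own hit rate, not total hits divided by total settled entries. A quick check, using the per-model settled and hit counts from the leaderboard table further down (three models with one settled entry, nine with two); the variable names are only illustrative:

```python
# (settled, winner_hits) per model, copied from the leaderboard snapshot.
settled_and_hits = [(1, 1)] * 3 + [(2, 1)] * 9

# Macro average: average the per-model accuracies.
per_model = [hits / settled for settled, hits in settled_and_hits]
macro_accuracy = sum(per_model) / len(per_model)

# Pooled accuracy: all hits over all settled entries.
total_settled = sum(settled for settled, _ in settled_and_hits)
total_hits = sum(hits for _, hits in settled_and_hits)
pooled_accuracy = total_hits / total_settled

print(f"macro:  {macro_accuracy:.1%}")   # 62.5%
print(f"pooled: {pooled_accuracy:.1%}")  # 57.1%
```

The two numbers diverge because the three models with a single settled entry each count as a full 100% in the macro average.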
The model list includes Claude, GPT, Gemini, DeepSeek, Qwen, Kimi, and Grok variants.
Important caveat: I count only pre-match predictions toward accuracy. Post-match reviews are useful for explanation, but they already know the result. They are not forecasts.
Every model has 3 points right now: one correct winner each, and no exact scores.
That sounds boring until you look at the sample size.
| Model | Tier | Predictions | Settled | Winner hits | Points | Accuracy |
|---|---|---|---|---|---|---|
| Qwen3.5 Flash | wildcard | 13 | 1 | 1 | 3 | 100% |
| Claude Opus 4.7 | flagship | 14 | 1 | 1 | 3 | 100% |
| Claude Sonnet 4.6 | flagship | 14 | 1 | 1 | 3 | 100% |
| GPT-5.4 | flagship | 15 | 2 | 1 | 3 | 50% |
| Gemini 3.1 Pro | flagship | 15 | 2 | 1 | 3 | 50% |
| DeepSeek V4 Pro | value | 15 | 2 | 1 | 3 | 50% |
| Qwen 3.7 Plus | value | 14 | 2 | 1 | 3 | 50% |
| Kimi K2.6 | value | 14 | 2 | 1 | 3 | 50% |
| Gemini 2.5 Flash | value | 14 | 2 | 1 | 3 | 50% |
| Grok 4.1 Fast Reasoning | wildcard | 14 | 2 | 1 | 3 | 50% |
| DeepSeek V4 Flash | wildcard | 14 | 2 | 1 | 3 | 50% |
| GPT-5 Nano | wildcard | 13 | 2 | 1 | 3 | 50% |
My read: the leaderboard is not mature enough to crown a winner.
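One way to put a number on "not mature enough" is a confidence interval on each model's hit rate. The sketch below uses the Wilson score interval at roughly 95%; the choice of interval and confidence level is an assumption of this example, not something the dashboard reports, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# The two records that currently exist on the leaderboard.
for hits, n in [(1, 1), (1, 2)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n}: {low:.2f} to {high:.2f}")
```

With one or two settled entries, the intervals for a "100%" model and a "50%" model overlap almost entirely, so the accuracy column cannot separate them yet.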
The first useful signal is elsewhere.
Uzbekistan vs Colombia ended 1-3.
All 12 models picked Colombia.
None got the exact score.
| Model | Prediction | Final | Winner hit |
|---|---|---|---|
| Claude Opus 4.7 | 0-2 Colombia | 1-3 Colombia | Yes |
| Claude Sonnet 4.6 | 1-2 Colombia | 1-3 Colombia | Yes |
| GPT-5.4 | 1-2 Colombia | 1-3 Colombia | Yes |
| Gemini 3.1 Pro | 0-2 Colombia | 1-3 Colombia | Yes |
| DeepSeek V4 Pro | 0-2 Colombia | 1-3 Colombia | Yes |
| Qwen 3.7 Plus | 0-2 Colombia | 1-3 Colombia | Yes |
| Kimi K2.6 | 0-2 Colombia | 1-3 Colombia | Yes |
| Gemini 2.5 Flash | 0-2 Colombia | 1-3 Colombia | Yes |
| Grok 4.1 Fast Reasoning | 0-2 Colombia | 1-3 Colombia | Yes |
| DeepSeek V4 Flash | 0-2 Colombia | 1-3 Colombia | Yes |
| GPT-5 Nano | 0-1 Colombia | 1-3 Colombia | Yes |
| Qwen3.5 Flash | 0-1 Colombia | 1-3 Colombia | Yes |
This is the kind of match where a cheap model can be enough.
If all you need is "which side is more likely," then polling a few cheap models may be better value than paying a flagship model for every pick.
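Polling can be as simple as a majority vote over winner picks. A minimal sketch, assuming each model's prediction has already been reduced to a team name (or `"draw"`); `majority_pick` is a hypothetical helper, not part of the arena:

```python
from collections import Counter


def majority_pick(picks: list[str]) -> tuple[str, float]:
    """Return the most common pick and the share of models that agreed with it."""
    if not picks:
        raise ValueError("need at least one pick")
    # On a tie, Counter.most_common keeps the pick that appeared first.
    winner, votes = Counter(picks).most_common(1)[0]
    return winner, votes / len(picks)


# Winner picks from the three wildcard-tier budget models in the table above:
# Qwen3.5 Flash, GPT-5 Nano, DeepSeek V4 Flash.
cheap_picks = ["Colombia", "Colombia", "Colombia"]
print(majority_pick(cheap_picks))
```

The agreement share is the useful by-product: a unanimous cheap poll is a reasonable signal to stop, while a split is a reason to escalate.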
Portugal vs Congo DR ended 1-1.
All nine models with a valid pre-match prediction picked Portugal.
| Model | Prediction | Final | Outcome |
|---|---|---|---|
| GPT-5.4 | 2-0 Portugal | 1-1 | Miss |
| Gemini 3.1 Pro | 2-0 Portugal | 1-1 | Miss |
| DeepSeek V4 Pro | 2-0 Portugal | 1-1 | Miss |
| Qwen 3.7 Plus | 2-0 Portugal | 1-1 | Miss |
| Kimi K2.6 | 2-0 Portugal | 1-1 | Miss |
| Gemini 2.5 Flash | 2-0 Portugal | 1-1 | Miss |
| Grok 4.1 Fast Reasoning | 3-0 Portugal | 1-1 | Miss |
| DeepSeek V4 Flash | 2-0 Portugal | 1-1 | Miss |
| GPT-5 Nano | 2-1 Portugal | 1-1 | Miss |
That is the part I care about.
The models did not just get unlucky independently. They shared the same prior: Portugal strong, Congo DR weaker, therefore Portugal win.
That is a classic LLM failure mode.
It shows up outside sports too:
In other words, the World Cup is a cute interface for a serious eval problem: models are often too willing to convert reputation into certainty.
The dashboard includes listed price tiers for each model.
Here is the funny part: the cheapest model currently has the cleanest-looking row.
| Model | Listed input / output price (USD per 1M tokens) | Current result |
|---|---|---|
| Qwen3.5 Flash | $0.026 / $0.263 | 1/1 winner hit |
| GPT-5 Nano | $0.049 / $0.388 | 1/2 winner hit |
| Claude Opus 4.7 | $5 / $25 | 1/1 winner hit |
| GPT-5.4 | $2.45 / $14.7 | 1/2 winner hit |
Do not overread that. One match is not proof.
But the unit economics are hard to ignore.
Suppose a prediction prompt uses 10K input tokens and 1K output tokens.
Approximate cost:
Qwen3.5 Flash:
10K * $0.026 / 1M + 1K * $0.263 / 1M = $0.000523
Claude Opus 4.7:
10K * $5 / 1M + 1K * $25 / 1M = $0.075
That is roughly a 143x spread for one prediction-shaped call.
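The same arithmetic as a small function, so the spread can be rechecked when prices or prompt sizes change. `call_cost` is an illustrative helper; the prices are the listed ones from the table above and the 10K/1K token counts are the assumed prompt shape.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD for one call, given prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


qwen = call_cost(10_000, 1_000, 0.026, 0.263)
opus = call_cost(10_000, 1_000, 5.0, 25.0)

print(f"Qwen3.5 Flash:   ${qwen:.6f}")
print(f"Claude Opus 4.7: ${opus:.6f}")
print(f"Spread: {opus / qwen:.0f}x")
```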
If I were building a prediction system, I would not send every match to the most expensive model. I would route it.
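```python
def pick_prediction_route(match_uncertainty, model_disagreement, budget_mode):
    """Return the list of model IDs to query for one match."""
    # Budget mode: poll three cheap models and skip the flagships entirely.
    if budget_mode == "cheap_poll":
        return ["qwen3.5-flash", "gpt-5-nano", "deepseek-v4-flash"]

    # Clear favorite and the models agree: one cheap model is enough.
    if match_uncertainty == "low" and model_disagreement == "low":
        return ["qwen3.5-flash"]

    # Uncertain or contested match: escalate to a wider, pricier panel.
    if match_uncertainty == "high" or model_disagreement == "high":
        return [
            "qwen3.5-flash",
            "deepseek-v4-pro",
            "gemini-3.1-pro",
            "claude-sonnet-4.6",
        ]

    # Everything in between: one cheap model plus one stronger cross-check.
    return ["qwen3.5-flash", "claude-sonnet-4.6"]
```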
Cheap models for breadth. Expensive models for disagreement.
That is the same routing logic I use for normal API workloads.
Winner accuracy is not enough.
I want these metrics:
| Metric | Why it matters |
|---|---|
| Winner accuracy | Basic direction |
| Exact score | Hard mode |
| Goal difference | More informative than exact score alone |
| Brier score | Calibration |
| Confidence bucket accuracy | Overconfidence detection |
| Cost per correct winner | Production routing |
| Draw recall | Favorite-bias detector |
| Disagreement value | Whether ensembles help |
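The Brier score only works if models report probabilities, and the predictions shown in this post are scorelines. As a sketch of what the metric would reward, here is the multi-class version over home win, draw, and away win. The two forecasts are hypothetical numbers made up for illustration, not model outputs.

```python
OUTCOMES = ("home", "draw", "away")


def brier_score(probs: dict[str, float], actual: str) -> float:
    """Multi-class Brier score: 0.0 is perfect, 2.0 is the worst possible."""
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum((probs[o] - (1.0 if o == actual else 0.0)) ** 2 for o in OUTCOMES)


# Two hypothetical forecasts for a match that ends in a draw.
confident = {"home": 0.80, "draw": 0.15, "away": 0.05}
hedged = {"home": 0.50, "draw": 0.30, "away": 0.20}

print(brier_score(confident, "draw"))  # about 1.365
print(brier_score(hedged, "draw"))     # about 0.78
```

Both forecasts pick the same winner and both miss, but the hedged one is penalized far less. Winner accuracy alone cannot see that difference.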
The biggest one is draw recall.
Portugal-Congo DR already suggests the models may underpredict draws when a prestigious team is involved.
If that pattern holds, it is more important than the leaderboard.
I would not declare a winner until each model has at least 30 to 50 settled pre-match predictions.
For now:
If you want the full data-cited writeup and live links, I wrote the original breakdown here: AI World Cup Predictions 2026: 12 Models, Early Leaderboard.
Disclosure: I work on the research side at TokenMix, which is why I can wire this kind of multi-model scoreboard quickly.
The early World Cup AI leaderboard does not tell us which model is best yet.
It does tell us something useful: cheap models can match flagship consensus on obvious favorites, and all models can share the same bad prior on a draw.
That is a model-evaluation lesson, not betting advice.
If you were scoring this, would you reward exact score heavily, or focus on calibrated probabilities instead?