How far behind are open models?

wpnews.pro

Open models, AI models where you can download the weights online, are generally not as capable as the best closed models (models only available through an API), but how large is the gap, and how does it change over time? We try to answer this question by using data from 17 selected benchmarks (8 private, 9 public, ~110 datapoints) measuring various capabilities. All the data and code needed to reproduce this can be found on github.

We find that, as of today, on private benchmarks, where the data is not publicly accessible, open models are roughly 8-10 months behind the closed frontier, while for public benchmarks the gap is roughly 4-6 months. We also find that the gap was smallest around the time of DeepSeek R1, in Jan 2025, and since then the gap has been growing.

These numbers are backward-looking, meaning that, on private benchmarks, the best open models now perform roughly at the level of the best closed models from 8-10 months ago.

The old data from 2023 and 2024 is partially self reported scores. Newer data is mostly better, but there are still major caveats (discussed in an appendix) including several of the "private" benchmarks not being fully private. These data are not perfect, but it's the best data that we were able to find with medium effort.

The fact that we see essentially the same trend in both the private and public data, completely disjoint sets of benchmarks, suggests (but does not demonstrate) that the trend in both is real. It also suggests that, while public benchmarks significantly underestimate the gap between open and closed models, almost by a factor of two, public benchmarks still provide useful information about model capabilities.

People running a private benchmark on open Chinese models might use third-party providers, with zero-data-retention, to protect their private data. We know that both we (who run WeirdML), METR (time-horizons) and Epoch AI (Frontiermath) are careful to use third-party providers for this reason, not sure about the others. Sometimes, due to bugs or implementation issues, third-party providers can have subtly degraded performance when serving open models. This can often be adressed by testing and comparing different providers, but it can be hard to detect subtle degradation, and it's also hard to rule it out completely. If present, such degradation would bias the gap to be larger, especially for the private benchmarks.

This is a speculation we're adding here because it's an important consideration, not because it's based much on these data. The difference in results on the private vs public benchmarks suggests that open model developers are doing some combination of not fully filtering out benchmark data and training to the test (or hillclimbing on the test).

Something like that is probably true, only to a lesser extent, for the private benchmarks as well. Model developers train on the kind of tasks they are likely to meet in benchmarks, even if only inadvertently by training on verifiable tasks, which are more easy to make benchmarks for. Big well-resourced closed labs probably have more access to varied data, more enterprise customers (and feedback from real use) and are relatively less focused on benchmark scores. This suggests that the gap on real-world tasks is probably even larger than that measured by private benchmarks.

We define a set of threshold scores for each benchmark, for most benchmarks we define those at 5% intervals from 0.05 and upwards. Then, the first time an open model crosses each of these thresholds we find out how many months earlier a closed model first crossed the threshold, and use that as an estimate of the gap.

For example, o1-preview was released 12. September 2024, and crossed several thresholds in various benchmarks. When DeepSeek R1 crossed several of the same thresholds in 20. Jan 2025, we count each crossing as a datapoint measuring the gap at 20. Jan 2025 to be about 4.3 months. This methodology is fairly simple and well-defined, but it assumes that all the benchmarks have tested all the major both open and closed models, which is not typically the case. In practice what we do is to find benchmarks that are high quality and have a good set of results for both open and closed models for some period of time. We then go into each benchmark and look at the different thresholds and the open and closed models that crossed the threshold first and ask if it's plausible that each of those would have been the first to cross the threshold if the benchmark had tested all the relevant models. If a major model that probably would have changed the gap significantly if it was there is not included in the data, then we reject the datapoint from this specific threshold. These judgements were made by Claude Opus 4.7, and the justifications are provided in the git repo. We separately went through manually and overruled some of the judgements, in all cases to accept some datapoints where we thought Opus was a bit too conservative.

In general we were fairly conservative in selecting benchmarks and relatively more liberal in including marginal datapoints from the selected benchmarks, especially high quality ones.

This methodology does have a winner's-curse bias, in that the first models to cross a certain threshold will tend to be a positive fluctuation. This could favor closed models if the benchmarks run more of them (which is typically the case). A more careful analysis could try to estimate this effect based, for example, on the ECI framework.

If we take the results from a single threshold that's first crossed by a certain closed model and then later crossed by an open model, say in the example above with o1-preview and DeepSeek R1, we have a clean measurement of the gap (4.3 months), but what time should we associate this gap with? Is this the gap in Sept 2024, when o1-preview was released, or is it the gap in Jan 2025, when R1 was released? These are the forward looking and backward looking perspectives, respectively, and they answer two somewhat different questions. The forward looking question takes the best closed models now, and asks when open models will be at the same level. The backward looking perspective asks how long do I have to go back in time for the best closed models to be at the same level as the best open models today. While we often are more interested in the forward-looking question, what we can actually answer today (for todays top open models) is the backward looking question, and that is the perspective we are using in this analysis. Specifically the question our method answers are "How long-lived are the gaps that a top open model closes when it's released?". We then associate the length of these gaps (in months) with the release date of the open model. By defining the gap in this way we ensure that our estimate of the current gap is not biased by the exclusion of currently-open gaps (thresholds that closed models have crossed, but open models have not yet), and the current gap can be fairly compared to the gaps back in time.

It is clear from our main figure above that private vs public is a very important variable for understanding the gap between open and closed models. However we wanted to see if benchmark category was an important variable as well, so we grouped the benchmarks into four categories and here we show the corresponding trend curves. The "reasoning" category clearly has a larger gap than the others, but all the three benchmarks that make up this category are private, so that's probably the more important factor. I don't think we have enough data to say much meaningful about the categories.

We did the same analysis as the main results only restricting ourselves to Chinese open models. The results are basically the same, with only a few exceptions, back to Llama 3.1 (in July 2024), but before this the gap is notably larger in the Chinese-only analysis.

Almost all the data used here are from the Epoch AI Benchmarking Hub, their work in curating and connecting all the data make these analyses much easier.

Claude Opus 4.7 wrote essentially all the code, and did the research into the different benchmarks and data, directed by us. Opus made suggestions and initial justifications for inclusion/exclusion of data, while we had the final say/judgement and overruled Opus in several cases. We also did several spot checks to see if the final data matched the raw data.

We wrote this blog post, with the exception of Appendix B, which is written entirely by Opus and lightly edited by us.

Here are some additional figures showing accepted and rejected thresholds for some of the benchmarks. Similar figures for all the benchmarks and reasoning behind the choices are on github.

To measure when open-weight models first matched the closed frontier on each benchmark, we need the scores being compared to be trustworthy and comparable — ideally produced by a single independent party running every model through one evaluation harness, rather than a grab-bag of numbers each lab reports for itself under its own favourable settings. We audited all 17 accepted benchmarks on this point (one independent web-research pass per benchmark). The results vary a lot, and we think it's worth being upfront about it.

The table below records, for each benchmark: who actually ran the evaluations, whether Epoch AI's Benchmarking Hub (our main data source) runs the eval itself or merely mirrors an external leaderboard, and our verdict on whether the scores come from a single independent evaluator with no self-reported numbers and comparable settings.

Legend: ✅ one independent evaluator ran every model in a fixed harness · ⚠️ mostly, but with a real caveat · ❌ scores are largely self-reported / submitted, or not run comparably.

Benchmark	Access used	Who ran the evaluations	Epoch Hub	Independent, no self-report, comparable?
GPQA Diamond	public	Epoch AI (Inspect, 16 runs/model)	runs	✅
MATH Level 5	public	Epoch AI (Inspect, 8 runs/model)	runs	✅
OTIS Mock AIME 2024-25	public	Epoch AI (Inspect, 16 runs/model)	runs	✅
GSM8K	public	No single evaluator — ~70% vendor tech-report numbers, mixed shot counts	mirrors	❌
MMLU	public	No single evaluator — mostly developer self-reported, varying n-shot	mirrors	❌
MMLU-Pro	public	TIGER-Lab harness + community submissions (Epoch blends w/ Artificial Analysis)	mirrors¹	❌
Aider Polyglot	public	Aider (P. Gauthier) + PR-submitted results; per-model configs vary	mirrors	⚠️
Terminal-Bench	public	harbor-framework (Stanford/Laude); PR-submitted, scaffolds vary	mirrors	❌
Humanity's Last Exam	public	CAIS + Scale run the official board (one harness)…	mirrors	⚠️²
FrontierMath	private	Epoch AI	runs	⚠️³
FrontierMath Tier 4	private	Epoch AI	runs	⚠️³
WeirdML	private	Håvard Tveit Ihle (one harness, all models)	mirrors	✅
SimpleBench	private	AI Explained team (private set, AVG@5)	mirrors	✅
METR Time Horizons	private	METR (own task suite + scaffold)	mirrors	✅
FictionLiveBench (120k)	private	fiction.live (single platform)	mirrors	⚠️⁴
ARC-AGI	private	ARC Prize Foundation (semi-private set; not verified by default)	mirrors	⚠️⁵

The benchmarks split into a clean core and a softer periphery. Independently and comparably run: GPQA Diamond, MATH Level 5, OTIS Mock AIME (all Epoch-run), plus WeirdML, SimpleBench and METR (each run end-to-end by a single party). Self-reported or submission-based aggregations: GSM8K, MMLU, MMLU-Pro, Aider Polyglot, Terminal-Bench, and HLE's open side. The private/contamination-resistant set we lean on most is itself mixed — FrontierMath, WeirdML, SimpleBench and METR are cleanly run, while ARC-AGI/-2 are semi-private and partially API-exposed. Read the provenance benchmark-by-benchmark rather than as one reassuring story: the two clearest contamination biases (FrontierMath's OpenAI access, ARC's API exposure) both act on the closed side, and inflating closed scores makes the closed frontier cross thresholds earlier — so on those benchmarks they would, if anything, make the gap look larger than it is (the private-side numbers from FrontierMath/ARC may be overstated). They do not make open look artificially good; the risk is over- not under-statement of the gap.

source & further reading

lesswrong.com — original article Independent alignment of language models From wantons to moral agents The Conservation Ethic in AI 2040

How far behind are open models?

Run your AI side-project on zahid.host