I saw this Twitter post today and really liked the idea. However, I think the AA Index (the Artificial Analysis Intelligence Index) is a rather crude measure, and I much prefer Epoch's ECI (Epoch Capabilities Index), which is built on item response theory (IRT). The resulting graph diverges meaningfully from the one in the Twitter post (which seems to weirdly collapse at the end, maybe because it makes no logistic assumption):
[See the linkpost to actually interact with the graphs, e.g. to see which model is which.]
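For anyone unfamiliar with what a logistic assumption buys you, here is a minimal sketch using the textbook two-parameter logistic curve from IRT. This is not Epoch's fitting code, and the function name, parameter names, and all numbers are hypothetical. It only illustrates why raw scores can compress a real capability gap once a benchmark is close to saturated.

```python
import math

def expected_score(ability: float, difficulty: float, slope: float = 1.0) -> float:
    """Textbook 2PL curve: expected benchmark score in [0, 1].

    Generic IRT form, assumed here for illustration only.
    """
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

# Hypothetical latent abilities, one unit apart.
weaker, stronger = 3.0, 4.0

# On a benchmark far below both models (nearly saturated), the raw gap is tiny.
gap_on_easy = expected_score(stronger, 0.0) - expected_score(weaker, 0.0)

# On a benchmark matched to their level, the same ability gap is clearly visible.
gap_on_matched = expected_score(stronger, 3.5) - expected_score(weaker, 3.5)

print(f"raw gap on saturated benchmark: {gap_on_easy:.3f}")
print(f"raw gap on matched benchmark:   {gap_on_matched:.3f}")
```

An index that averages raw scores sees only the compressed gaps, whereas an IRT-style fit tries to recover the underlying ability difference.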
For context, here are the two raw frontiers (the running best ECI over time for open-weight vs. closed models):

Sadly, GLM-5.2 has not been scored yet, but I'll update the website when it is.
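To make "running best over time" concrete, here is a minimal sketch of how such a frontier could be computed. It is not the code behind the website, and the record fields, model names, dates, and scores are all made up for illustration.

```python
from datetime import date
from itertools import accumulate

def running_best(records):
    """Given (release_date, score) pairs, return (date, best score so far), sorted by date."""
    ordered = sorted(records)
    dates = [d for d, _ in ordered]
    best = list(accumulate((s for _, s in ordered), max))
    return list(zip(dates, best))

def frontiers(models, key):
    """Split models into groups with `key`, then compute one frontier per group."""
    groups = {}
    for m in models:
        groups.setdefault(key(m), []).append((m["released"], m["score"]))
    return {group: running_best(recs) for group, recs in groups.items()}

# Hypothetical data: placeholder names and scores, not real ECI values.
models = [
    {"name": "closed-a", "open_weight": False, "released": date(2024, 1, 1), "score": 120.0},
    {"name": "open-a", "open_weight": True, "released": date(2024, 3, 1), "score": 110.0},
    {"name": "closed-b", "open_weight": False, "released": date(2024, 6, 1), "score": 115.0},
    {"name": "open-b", "open_weight": True, "released": date(2024, 9, 1), "score": 125.0},
]

result = frontiers(models, key=lambda m: "open" if m["open_weight"] else "closed")
for group, frontier in result.items():
    print(group, frontier)
```

Note that a weaker later release (here `closed-b`) does not lower the frontier. Since `key` is just a callable, the same function works for any other way of grouping models.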
You can also generalize this comparison to other groupings (though open vs. closed is probably the most interesting one). One such example is the OpenAI vs. Anthropic rivalry: