How GLM-5.2 Beat Fable 5 at Website Design


GLM-5.2 ranks 1st overall on Design Arena’s single-turn HTML Web Design (Non-Agentic) evaluation, 5 places higher than its predecessor, GLM-5.1. To get there, it beat Claude Fable 5, Opus 4.6, and Opus 4.7, a model line that has held the top spots on our leaderboards for months and has won more head-to-head matchups than any other model we track. GLM-5.2 is the first model to do so, and it’s MIT licensed. Especially impressive is that Z.ai achieved this result with a model the same size as GLM-5.1, at 744 billion parameters and without vision capabilities, while its nearest competitors are speculated to be as much as 6.7x larger.

GLM-5.2 also establishes a new Pareto frontier for preference vs. price at $1.40/$4.40 per 1 million tokens, versus Claude Fable 5’s $10/$50 per 1 million tokens.
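To make the price gap concrete, here is a minimal sketch that applies the listed per-million-token prices to a single generation. It assumes each price pair is input/output pricing, and the token counts are hypothetical values chosen only for illustration, not Design Arena measurements.

```python
# Prices in USD per 1 million tokens, read as (input, output) pairs.
# Treating the two figures as input/output prices is an assumption.
PRICES = {
    "GLM-5.2": (1.40, 4.40),
    "Claude Fable 5": (10.00, 50.00),
}


def generation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one generation at the listed prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: a 2,000-token prompt and a 15,000-token HTML response.
glm_cost = generation_cost("GLM-5.2", 2_000, 15_000)
fable_cost = generation_cost("Claude Fable 5", 2_000, 15_000)
print(f"GLM-5.2: ${glm_cost:.4f}, Claude Fable 5: ${fable_cost:.4f}")
```

Under these assumed token counts, the GLM-5.2 run costs roughly $0.07 against roughly $0.77 for Fable 5, a gap of about 11x. The exact ratio shifts with the mix of input and output tokens.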

GLM-5.2 doesn’t surpass Fable 5 in all tasks. It ranks second, behind Fable 5, on our Game Dev, Data Visualization, and 3D Design leaderboards, and 4th on our UI Component leaderboard.

What changed in GLM-5.2’s website outputs?

To answer this question, we ran a case-by-case analysis of single-turn deployments of GLM-5.2 and observed how its optimizations improved its performance across frontend coding tasks. This lets us determine not only which optimizations are most effective, but also which error cases the model avoids. The overarching takeaway is that GLM-5.2 avoids common error cases that trip up most AI models, generates more intricate websites, and specializes in designing structures that users prefer over other results.

Model Behavior #1: Outputs seem to indicate a beautiful set of starting templates

We can see why the Web Dev leaderboard is notable by looking at 1,000 randomly sampled websites generated by both GLM-5.2 and Fable 5. By screenshotting each generated website and grouping the screenshots by similarity, we can see whether a model produces similar designs for different prompts. Below, we can see a visualization of this for GLM-5.2.

If we zoom in, we find that GLM-5.2 has a tendency to produce templated, similar responses even if the prompts are very different.
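The grouping step can be sketched roughly as follows. The article does not say how similarity was computed, so this example assumes each screenshot has already been turned into a numeric embedding vector and uses a simple greedy grouping by cosine similarity. The vectors, the threshold, and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(embeddings, threshold=0.9):
    """Greedy grouping: each item joins the first group whose seed it resembles."""
    groups = []  # each group is a list of indices; the first index is the seed
    for index, vector in enumerate(embeddings):
        for group in groups:
            if cosine_similarity(embeddings[group[0]], vector) >= threshold:
                group.append(index)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([index])
    return groups


# Toy vectors standing in for screenshot embeddings (hypothetical data).
screenshots = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
groups = group_by_similarity(screenshots)
print(groups)
```

A model that leans on a few templates would show a small number of large groups under this kind of analysis, while a model with more varied outputs would show many small ones.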

This is normal for frontier-level models, and it results from many factors, from a model’s architecture down to its training data. The templates aren’t visible in day-to-day use, but they come to light when outputs are aggregated and compared. The difference with GLM-5.2 is that its templates perform much better than those of other frontier models: they don’t contain antipatterns like the infamous purple gradients that plagued early AI models, which helps it outperform many of its peers.

Compare this to Fable 5, whose outputs are more widely spread than GLM-5.2’s. It’s more difficult to find exact templates like the ones in GLM-5.2’s outputs.

This suggests that Fable 5 is a more general model that produces a wider array of outputs.

This more varied approach does not seem to perform better on website generation just yet; users favor the "expert template" approach employed by GLM-5.2, which sets a higher bar for average output quality.

Model Behavior #2: Avoids common error cases

Much of GLM-5.2’s improvement can be explained by the fact that it generates code that... just works. This can be most clearly seen in GLM-5.2’s use of dependencies, such as chart.js and three.js. While other models often fail to effectively use these libraries, GLM-5.2 calls and uses them naturally, resulting in a 6.0 percentage point win rate increase for the 21% of sessions that use them.

These libraries are especially helpful in the Dashboard and 3D Design categories, where using them dramatically improves performance.
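A figure like the library win-rate increase can be computed by splitting sessions on whether they use a library and comparing win rates. The sketch below shows one plausible reading, in which the baseline is the set of sessions that use neither library. The article does not state its baseline, and the session records and field names here are hypothetical.

```python
def win_rate(sessions):
    """Fraction of sessions that were won (expects a non-empty list)."""
    return sum(1 for s in sessions if s["won"]) / len(sessions)


def library_lift(sessions, libraries):
    """Return (usage share, win-rate difference in percentage points)."""
    with_lib = [s for s in sessions if libraries & set(s["libraries"])]
    without_lib = [s for s in sessions if not libraries & set(s["libraries"])]
    share = len(with_lib) / len(sessions)
    lift_pp = (win_rate(with_lib) - win_rate(without_lib)) * 100
    return share, lift_pp


# Hypothetical session records; not Design Arena data.
sessions = [
    {"libraries": ["chart.js"], "won": True},
    {"libraries": ["three.js"], "won": True},
    {"libraries": [], "won": True},
    {"libraries": [], "won": False},
    {"libraries": ["tailwindcss"], "won": True},
    {"libraries": [], "won": False},
    {"libraries": ["chart.js", "tailwindcss"], "won": False},
    {"libraries": [], "won": True},
]
share, lift_pp = library_lift(sessions, {"chart.js", "three.js"})
print(f"share={share:.1%}, lift={lift_pp:.1f} pp")
```

The difference is reported in percentage points, not as a relative percentage, which matches how the figures are phrased here.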

It also uses TailwindCSS in 91% of sessions and font-awesome in 51%, which increases win rates by 1.2 percentage points by enabling intricate design interactions and websites. Compare this to Opus 4.8, which uses TailwindCSS in only 57% of sessions and potentially sees a drop in win rate because of it.

GLM-5.2 also has improved layout skills, especially when it comes to above-the-fold design. It generally makes use of beautiful external CDN-hosted images instead of building its own visuals, and it shows a better sense of layout than its competitors.

GLM-5.2’s ability to use outside dependencies is crucial for improving its performance in Design Arena, as it avoids the error cases that cause other models to fall behind.

Model Behavior #3: More intricate, detailed outputs

GLM-5.2 also generates animated, elaborate websites with more variation in typography, visual layout, and animation. These perform especially well for marketing and landing pages, where they create customized user experiences that feel thoughtful and well-designed.

This strategy comes at the cost of longer generation times: the model outputs more tokens, which makes it slower but produces far more intricate websites. In our testing, GLM-5.2 produced 25% more characters and lines of code, with an average generation time of 304.7 seconds, twice as long as Claude Fable 5’s.

This places GLM-5.2 on the Pareto frontier of website preference vs. speed, but on the slow side, where it trades speed for preference. While longer output does boost win rate, it has diminishing returns, with the ideal range being 46K to 57K characters.
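A Pareto frontier here is the set of models that no other model beats on both axes at once. The sketch below computes one for preference score (higher is better) and generation time (lower is better). The model names and numbers are hypothetical placeholders, not leaderboard data.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (higher preference, lower time)."""
    frontier = []
    for name, (pref, seconds) in models.items():
        dominated = any(
            other_pref >= pref
            and other_seconds <= seconds
            and (other_pref > pref or other_seconds < seconds)
            for other, (other_pref, other_seconds) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (preference score, generation seconds) pairs for illustration.
models = {
    "slow_but_preferred": (1400, 300.0),
    "fast_and_good": (1350, 150.0),
    "fast_but_weak": (1200, 100.0),
    "slow_and_weak": (1250, 280.0),
}
frontier = pareto_frontier(models)
print(frontier)
```

In this toy data, a slow model with the highest preference score still sits on the frontier because nothing else matches its score, which is the position described for GLM-5.2 above.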

This is a significant divergence from other models like Fable 5, which generates much less code than its competitors, up to 38% fewer lines of code on average.

Fable 5 generates 38% fewer lines of code and 29% fewer characters than its competitors.

This increase in generation length extends to agentic settings, where GLM-5.2 generates 11% more files and calls 17% more tools than its competitors, but actually generates slightly less code overall.

Since GLM-5.2 can generate functional code for most libraries on the first try, it’s often able to add functionality and generate more interactive, functional websites that have higher prompt fidelity than other models.

What this means for model selection

GLM-5.2 is a step forward in design improvements and a huge leap forward for open-source models as a whole, with its combination of agent trace distillation and token-level improvements optimizing it for performance in single-turn tasks. Launches like this are a reminder of how fast the open-source frontier is moving, and how what was state of the art a few months ago is now being matched and exceeded by models anyone can build on, fine-tune, and deploy freely. Every release like this gives researchers and developers more powerful foundations to push from.

We will continue monitoring GLM-5.2’s performance and how it compares to other models. Congratulations to the Z.ai team on the launch. See whether you like GLM-5.2 more than other models on DesignArena.ai.

*— Written by Anmay Gupta, founding member of @Intelligence_ai *

source & further reading

notes.designarena.ai — original article