GLM-5.2 vs Kimi K2.7 Code: Which Model Is Better at Planning vs Building?

Z.ai's GLM-5.2 scored 9.0 to Moonshot AI's Kimi K2.7 Code's 8.1 in planning a feature flag service, but both built near-identical working code from the winning plan, making GLM-5.2 the preferred model for both phases at similar prices.

We tested both models on the same backend task and found the biggest difference was not in writing code, but in deciding what code should be written.

Z.ai released GLM-5.2 (https://docs.z.ai/guides/llm/glm-5.2) and Moonshot AI released Kimi K2.7 Code (https://huggingface.co/moonshotai/Kimi-K2.7-Code) within days of each other in June 2026. The two are often compared as similarly priced rivals in the open-weight space, so with the latest version of each out in the same week, we wanted to see how they stack up head to head.

We ran both models through the same two-phase test we have been using lately. First, each model planned a backend service. We scored the plans, picked the stronger one, and then had both models build that exact plan from scratch in Kilo Code CLI (https://kilo.ai/cli).

TL;DR: GLM-5.2 and Kimi K2.7 Code split on planning and matched on building. GLM’s plan scored 9.0 to Kimi’s 8.1, but both built a near-identical, fully working service, so at their similar prices GLM-5.2 is the one we would reach for in both the planning and execution phases.

Why Planning Is the Harder Skill Now

Coding agents have gotten pretty good at following a plan. Give a model a detailed spec and, most of the time, it will build roughly what the spec asks for. For the kind of agentic backend work we are testing here, that means the place where models really start to separate is no longer just execution, but the plan itself. The harder skill now is taking a vague set of requirements, reasoning through the system behind them, anticipating the edge cases, and turning all of that into a plan another model can build from without having to guess.

That is the part worth measuring on its own, which is why we test the two phases separately. Both models plan the same service, we score the plans and pick a winner, and then both models build that winning plan from the same blank starting point.
A single combined run can blur a weak plan and a weak build together, while splitting the two phases makes it clearer where the result actually came from.

We last ran this test on Claude Fable 5 and GPT-5.5 (https://blog.kilo.ai/p/claude-fable-5-vs-gpt-5-5). The planning phase was where the two models clearly diverged. Once we handed the winning plan to both models for the build phase, the results were almost indistinguishable. The practical takeaway was that you can use the stronger, more expensive model for planning, then hand the build to the cheaper model and still get a top-tier result, because much of the quality has already been decided by the plan.

This time we ran the same test on two open-weight models, keeping the task and rubric identical so the results line up with the last run.

Our Feature Flag Service

We asked each model to plan a feature flag service: a small backend that decides whether a feature should be on for a specific user. The service also needed to support gradual rollouts, where a team can enable a feature for a small slice of users and then increase that slice over time. For example, a team shipping a new checkout flow might turn it on for 5% of users during a beta, watch for errors, then ramp it to 25% and eventually to everyone, without deploying new code each time.

We use this task because it looks routine, but it hides an important trap. The rollout has to be deterministic: if a user is included in the first 20%, they should still be included when the rollout grows to 40%. The service is also not allowed to solve that by storing every user assignment in a database. A weak plan tends to wave this away. A strong plan explains the exact math that makes it work. That one requirement separates a plan that can be built as written from a plan that leaves the hardest decision to whoever has to implement it.

Each model got the prompt from our last run, unchanged, in a fresh Kilo Code CLI session.
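
One common way to meet the deterministic-rollout requirement without storing assignments is to hash each user into a fixed bucket and compare that bucket to a threshold. The sketch below is only an illustration of that idea: this post does not reproduce either plan's formula, so the hash (an FNV-1a style 32-bit hash), the 10,000-bucket granularity, and the names `fnv1a32`, `bucketFor`, and `isEnabled` are all assumptions.

```typescript
// Hypothetical sketch: deterministic percentage rollout with no stored assignments.
// The hash choice, bucket count, and function names are illustrative assumptions.

const BUCKETS = 10000; // gives 0.01% granularity

// FNV-1a style 32-bit hash computed over the string's UTF-16 code units.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map (flagKey, userId) to a stable bucket in [0, BUCKETS).
// Including the flag key means different flags slice users differently.
function bucketFor(flagKey: string, userId: string): number {
  return fnv1a32(`${flagKey}:${userId}`) % BUCKETS;
}

// A user is in the rollout when their bucket falls below the threshold.
// Raising rolloutPercent only raises the threshold, so nobody drops out.
function isEnabled(flagKey: string, userId: string, rolloutPercent: number): boolean {
  const threshold = Math.round((rolloutPercent / 100) * BUCKETS);
  return bucketFor(flagKey, userId) < threshold;
}
```

Because a user's bucket never changes, anyone enabled at 20% has a bucket below the 20% threshold, and therefore also below the 40% threshold. Nothing has to be written to a database for that to hold.
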

Planning: GLM-5.2 Was the Sharper of the Two

Both planning runs finished in a few minutes. Both nailed the hard part. Each landed on the same kind of rollout math, the kind that grows a rollout without dropping anyone already in it. To grade the rest, we used a weighted rubric whose criteria were fixed back when we wrote the prompt, one per requirement, so neither plan got judged against a bar invented after we read it.

What separated the two plans was judgment. Both plans had to settle a handful of questions the prompt left open, and that is where they pulled apart.

The clearest example was how each model handled a lookup for a flag that does not exist. GLM’s plan avoided unnecessary database hits by caching the “no such flag” result, but it also caught the follow-up trap: if that flag gets created later, the cached negative result has to be cleared. It even called this out as the easy case to miss. Kimi’s plan never raised that scenario, which means a builder following the plan would not know they needed to handle it.

Two other forks showed the same pattern. On rollout bucketing, GLM kept the environment out of the bucketing math and explained why: unless you deliberately change the inputs, a user should land in the same rollout slot in staging and production. Kimi included the environment in the calculation without calling out the trade-off. On API key storage, Kimi reached for bcrypt, which is the standard answer for passwords. GLM used a single fast SHA-256 hash and explained the choice: these keys are long, random strings that cannot realistically be brute-forced, so a slow hash would add cost to every authenticated request without adding meaningful security.

In both cases, GLM made a decision and showed its reasoning. Kimi either followed the default convention or left the harder call to the builder.

Kimi K2.7 Code’s plan was longer and included more ready-to-paste code, which has value on its own.
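
The unknown-flag case described above is easier to see in code. This is a minimal sketch, not the code from either plan: `FlagStore` is a hypothetical in-memory stand-in for the database, and `FlagService`, `getFlag`, and `createFlag` are illustrative names.

```typescript
// Hypothetical sketch of negative caching with create-time invalidation.
// All class and method names here are assumptions for illustration.

interface Flag {
  key: string;
  rolloutPercent: number;
}

// Stand-in for the database; counts reads so the cache's effect is visible.
class FlagStore {
  private rows = new Map<string, Flag>();
  reads = 0;

  find(key: string): Flag | undefined {
    this.reads++;
    return this.rows.get(key);
  }

  insert(flag: Flag): void {
    this.rows.set(flag.key, flag);
  }
}

class FlagService {
  // A null value marks a cached "no such flag" result.
  private cache = new Map<string, Flag | null>();
  private store: FlagStore;

  constructor(store: FlagStore) {
    this.store = store;
  }

  getFlag(key: string): Flag | null {
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached; // cache hit, including negative hits
    const flag = this.store.find(key) ?? null;
    this.cache.set(key, flag); // remember misses too, to spare the database
    return flag;
  }

  createFlag(flag: Flag): void {
    this.store.insert(flag);
    // The easy step to miss: drop any cached "no such flag" entry for this key.
    this.cache.delete(flag.key);
  }
}
```

Without the `cache.delete` call in `createFlag`, a lookup made before the flag existed would keep answering "no such flag" after it was created, until the cached entry was cleared some other way.
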
The purpose of a planning document is not just to list implementation steps, but also to make the hard decisions before building starts. By that standard, the plan that decides is more useful than the plan that lists, and that is why GLM’s plan won.

That also matched what we saw in our previous run with Claude Fable 5 and GPT-5.5. Fable won the planning round there for the same reason: it answered the hard questions up front, while GPT-5.5 left more of them for the builder to resolve later. Across both tests, the stronger planner was not the model that wrote the most, but the model that left fewer decisions unresolved.

Building: The Plan Did Most of the Work

For the build phase, we put GLM’s winning plan into an empty folder as plan.md, then gave each model a fresh Kilo Code CLI session with nothing else to work from. For Kimi, that meant building from a spec written by a rival model, without any of its own planning context carried over.

Both models ran into the same known issue: a SQLite library that does not work under Bun. Both fixed it the expected way, by switching to Bun’s built-in SQLite driver.

We graded the builds in two passes. First, we ran each build’s own test suite, and both passed. Then we ran a separate 15-check script that had been locked before either build existed. That script hits a live server and checks whether the implementation actually keeps the promises made in the plan. For example:

- Does auth reject a bad key?
- Does the same user get the same answer across repeated calls and after a restart?
- Does a config change show up immediately despite the cache?
- Does the audit log capture the right before and after values?
- Are API keys unreadable on disk?

GLM passed all 15 checks. Kimi passed 14.

Then we ran the test that mattered most. We took the same 200 user IDs, evaluated them against a flag set to a 35% rollout on both finished services, and compared the answers one by one.
Both services turned the flag on for the same 77 users, down to the individual IDs. Two different models, given the same plan, produced services you could swap for each other without any user seeing a different result.

That is the payoff from splitting the two phases. Because the plan pinned down the exact rollout math, both builds behaved the same where it mattered. That held even in the three places where Kimi’s own Round 1 plan had made a different call. When Kimi K2.7 Code built from GLM’s plan, it followed GLM’s decisions instead of carrying over its own earlier ones.

Kimi’s build session only saw GLM’s plan, with none of its own planning context carried over, so this was not a model consciously overruling a preference it still remembered. What it shows is that Kimi did not fall back to its usual defaults on the exact decisions where the two models had disagreed. The plan drove the implementation more than the model’s habits did.

The builds were not identical in every respect. GLM-5.2 wrote more code and more tests, and it handled one edge case Kimi missed: clearing the cache the moment a brand-new flag is created. The actual on/off answer a user receives is the same either way, and the stale state clears as soon as the flag is configured, so the real-world impact is limited. Still, it was the same kind of second-order case that separated the two plans in Round 1, showing up again during the build, which fits the broader pattern of GLM being the more careful model.

Across both phases, the result points to the same conclusion. GLM won the planning round because it made the decisions Kimi either left implicit or handled by convention. Once those decisions were written into the plan, the model doing the build mattered much less, and two capable models produced nearly the same service from the same spec.

What Open Weights Change

Both of these models are open-weight, and that is worth more than the lower price.
On June 12, 2026, a US export-control order forced Anthropic to suspend access to Claude Fable 5 and Claude Mythos 5, and because the restriction could not be enforced per user, Anthropic disabled both models for everyone (https://www.anthropic.com/news/fable-mythos-access). The specifics of the order matter less here than the outcome. A frontier model that teams had already built on became unavailable within days, for reasons that had nothing to do with those teams or their apps.

Open weights change that exposure. GLM-5.2 ships under an MIT license and Kimi K2.7 Code under a modified MIT license, so the weights are downloadable today, and more than one provider can host them, which keeps prices competitive and gives you more than one place to run the same model. For most teams the practical win is less provider lock-in. Access at any single provider can still change, but a copy of the weights you have already downloaded does not get recalled.

GLM-5.2 vs Claude Fable 5

While we had the plans open, we pulled GLM-5.2’s up against Claude Fable 5’s, the plan that won the frontier round in our earlier blog post (https://blog.kilo.ai/p/claude-fable-5-vs-gpt-5-5). The two models were never in the same test, but the prompt, task, and rubric were identical, so the scores sit on the same scale.

Claude Fable 5 scored 9.1, GLM-5.2 scored 9.0. Both plans made the same hard calls for the same reasons: environment kept out of the rollout hash, a fast SHA-256 for API keys, and unknown-flag lookups cached. Fable’s plan was sharper in exactly one spot, since it spelled out the create-time cache trap that GLM’s plan left implicit, which is the same gap Kimi fell into during the build.

That near-tie is worth a mention because of the price gap behind it. Claude Fable 5 lists at $10 per million input tokens and $50 per million output. GLM-5.2 lists at $1.40 and $4.40, roughly a tenth of the price.
An open-weight model wrote a plan within a rounding error of the frontier model that won last time, at a fraction of the cost. We did not run this comparison to crown anything, but it is a marker for how quickly the open-weight side of this market is closing the distance.

Of course, this is one plan on one task, so we would not treat it as proof that GLM-5.2 plans at Fable’s level across the board. The sturdier finding across both posts is about the build phase: once a plan is detailed enough, the executor starts to matter less.

We used Kilo BYOK to lower our API costs

We did not run this test through metered APIs. We used the GLM Coding Plan from Z.ai and Kimi Code, the coding tier of Moonshot’s Kimi membership. Both connect to Kilo Code’s Gateway (https://blog.kilo.ai/p/kilo-gateway-now-supports-byok-20-providers), which is why this post reports tokens and time but not dollars per task.

You can go to the Kilo BYOK page (https://app.kilo.ai/byok) and connect your Z.ai and Kimi coding plans in under a minute. You can now also purchase MiniMax token plans with your Kilo Credits.

Conclusion

For planning, GLM-5.2 was the stronger model. It decided the open questions, explained its choices, and caught failure modes Kimi’s plan left out.

For building from a detailed plan, both models held up. Given GLM’s plan, both produced services that answered every user the same way and passed almost every check. GLM’s build was tested more deeply and avoided one small cache bug, but neither difference changed how the service behaves for a user.

Keep in mind that this is one task, with one run per model, so the exact numbers are a data point rather than a fixed ratio. The broader trend is what matters more. In this test, the open-weight models were good enough to plan and build a real backend service, they cost far less than frontier models, and they come with an availability story that closed models cannot currently match.