{"slug": "glm-5-2s-code-reviews-are-only-as-good-as-your-prompt", "title": "GLM-5.2’s Code Reviews Are Only as Good as Your Prompt", "summary": "GLM-5.2 from Z.ai shows inconsistent code review quality depending on prompt phrasing, according to a controlled test run in Kilo Code CLI. On a straightforward codebase with 16 planted bugs, the model caught 13-15 bugs consistently regardless of prompt or reasoning effort. On a harder codebase with 10 subtle bugs, performance varied significantly: the model reliably caught local bugs such as a delete function that didn't actually delete and a backwards optimistic-lock check, but kept missing cross-route rules such as archived tasks leaking into search, exports, and the overdue list.", "body_md": "# GLM-5.2’s Code Reviews Are Only as Good as Your Prompt\n\n[GLM-5.2 from Z.ai](https://z.ai/blog/glm-5.2) has been one of the most talked-about open-weight models since it launched, and we have made it our daily driver to see how it performs on various coding tasks. We already put it [head to head with Kimi K2.7](https://blog.kilo.ai/p/glm-52-vs-kimi-k27-code-which-model) on planning and building a backend service. What gets talked about far less, and what we kept bumping into, is how much its code review quality swings from run to run.\n\nWe had mixed results reviewing code with it. Sometimes it read like a sharp senior engineer, and sometimes it skimmed right past a real bug. In this experiment, we ran a controlled test to find out whether that was a prompt problem or something deeper.\n\n## How We Tested The Model\n\nWe built a small backend in TypeScript: a task management API on Bun, Hono, Drizzle, and SQLite. It had standard pieces like users, authentication, tasks, search, bulk operations, and CSV export. We wrote a test suite that locked in the correct behavior first, then we went into the code and planted bugs. We used that suite as the reference for grading the reviews. 
A bug counted as caught only when our agent flagged the actual, specific problem.\n\nWe handed the broken codebase to GLM-5.2 in Kilo Code CLI and asked it to audit the code. We ran every reasoning effort the model offers (low, medium, and high) against three prompt framings:\n\n**Casual:** “I just finished this Bun + Hono + Drizzle task API. I think the implementation is pretty clean and consistent with the rest of the codebase. Can you take a look and let me know what you think?”\n\n**Consistency-focused:** “Please review this repository for real bugs, security issues, data consistency problems, and production edge cases. Pay attention to whether behavior is consistent across routes.”\n\n**Strict production:** “Review this repository as if you are blocking or approving a production PR.”\n\nThe code never changed; the only things we varied were reasoning effort and the wording of the request.\n\n# Round 1: GLM-5.2 Did Well, and Did It Consistently\n\nThe first codebase carried 16 planted bugs across the usual categories: SQL injection in a search query, a user search that returned password hashes, a missing authentication check on an admin-only export, an authorization hole that let any user modify another user’s tasks, CSV formula injection, a pagination off-by-one, and a handful of bulk-operation correctness bugs.\n\nGLM-5.2 handled this cleanly. It caught every serious security bug in every run, and the spread between the worst and best run was small.\n\nWhether we asked casually or strictly, at low effort or high, it landed between 13 and 15 of 16. **On a straightforward codebase, GLM-5.2 reviewed code about as well as we would want, and the prompt barely mattered.**\n\nEvery one of these bugs is the kind that reaches production and causes a real incident, and GLM-5.2 caught them consistently no matter how we asked. 
We wanted to find where it starts to break down, so we made the next codebase considerably harder.\n\n# Round 2: A Harder Codebase With Subtler Bugs\n\nWe grew the same project into a larger product. We added soft deletion (a `deletedAt` timestamp that hides a row everywhere), an archive flag (a softer “move it out of the way” state), optimistic concurrency with a version number, a status state machine for tasks, and an audit log that records who changed what. Then we planted 10 bugs that were far subtler than those in Round 1. None of them are the kind of thing a scanner flags. Most require understanding what the feature is supposed to do.\n\nFive of the planted bugs, in plain terms:\n\n**Delete did not actually delete.** The delete endpoint marked a task as archived but never set the `deletedAt` timestamp the rest of the app uses to hide deleted rows, so “deleted” tasks kept showing up.\n\n**The optimistic-lock check was backwards.** The version comparison was written so a stale client (someone editing an out-of-date copy) passed straight through, which is the exact case the check exists to stop.\n\n**A permission guard that could never fire.** The rule meant to stop regular users from reopening a finished task had a condition that is always false, so it did nothing at all.\n\n**The audit log blamed the wrong person.** Bulk assignment recorded the assignee as the actor instead of the user who actually performed the action.\n\n**Archived tasks leaked into normal views.** Archived tasks still appeared in the default search results, in CSV exports, and in the overdue list, even though archiving is supposed to move them out of the way.\n\nWe planted these so they got progressively harder to catch. Some are local bugs you can find by reading a single function carefully, like the backwards lock check or the permission guard that never fires. But the rest got gradually more complicated. 
For example, the last one is a product rule spread across several endpoints, and the rule a careful reviewer has to infer is easy to state but hard to see. **Archived tasks should drop out of normal views, and deleted tasks should disappear everywhere.** No single line of code says that, so you have to hold the whole system in your head to notice it is broken.\n\nHere is how GLM-5.2 did on the 10 planted bugs.\n\nCoverage dropped, and it moved with the wording of the prompt.\n\n# Wording Beat Reasoning\n\n**Across both rounds, the wording of the prompt changed GLM-5.2’s review more than the reasoning effort did.**\n\nThe strict “block or approve this production PR” framing did not produce the best bug coverage. It pushed GLM-5.2 into a security and hardening review. The model went and found a hardcoded fallback secret, weak password hashing, missing rate limiting, and missing transactions. Those are real bugs worth fixing, but they were not the planted product bugs, and chasing them pulled attention away from the behavior we had actually broken. The casual and consistency-focused framings scored a little better on the planted set, because they kept the model looking at how the code behaves instead of working down a security checklist.\n\nThose extra findings were not noise. The hardcoded secret, the low bcrypt cost, the missing transactions, and a registration race it flagged were all legitimate problems we would want fixed. Even on the runs where GLM-5.2 missed planted bugs, the review still turned up real issues beyond the ones we were grading.\n\nReasoning effort, by comparison, made far less difference. High reasoning was sometimes slightly better and sometimes slightly worse. The swing from prompt wording was consistently larger than the swing from reasoning effort. **This matches what we have seen with GLM-5.2 in code review more broadly. 
The framing of the request shapes the review more than how long you let it think.**\n\n# What It Caught and What It Missed\n\nThe split was clean and repeatable across runs.\n\n**Caught reliably (local bugs you can spot by reading one function):**\n\nThe delete that archived instead of soft-deleting\n\nThe backwards version check\n\nThe permission guard that could never fire\n\nThe wrong actor in the audit log\n\nThe inconsistent audit action naming on bulk archive\n\n**Kept missing (cross-route rules you only catch by understanding the whole system):**\n\nArchived tasks showing up in the default search\n\nArchived tasks showing up in exports\n\nArchived tasks showing up in the overdue list\n\nGLM-5.2 is strong at local bugs and much weaker at product rules that live across multiple endpoints, and that lines up with the rest of our experience running it.\n\nThe model does its best work when everything it needs sits in one place. A bug that lives inside a single function, where the mistake and the fix are a few lines apart, plays to its strengths. The problem starts as the relevant context spreads out. When catching a bug means pulling together how several files behave and reasoning about them at once, GLM-5.2 gets less reliable, and that is the same point where models from labs like OpenAI and Anthropic hold steady. **The experience you get with GLM-5.2 depends heavily on how much dot-connecting your code forces it to do.** On tight, self-contained changes it holds up with the best of them. On changes whose correctness is spread across the system, the quality starts to wobble.\n\n# How Frontier Models Compared\n\nWe ran the same harder codebase past GPT-5.5 and Opus 4.8, one pass each, using the consistency-focused prompt.\n\nGPT-5.5 went straight at the cross-route problem. In a single pass it wrote out a table of which endpoints filtered deleted and archived rows and which did not, which is exactly the reasoning GLM-5.2 kept skipping. 
Opus 4.8 was the only model to state the exact intended rule on the reopen-a-finished-task bug rather than an approximation of it.\n\n**GLM-5.2’s best single run reached 7 of 10, one behind Opus 4.8 and two behind GPT-5.5.** It can clearly get close to the frontier models. The problem is that we could not predict which run would get there, and that unpredictability is what set it apart from the two frontier models.\n\n# Conclusion\n\nGLM-5.2 is a capable code reviewer with a higher ceiling than its price suggests and more variance than the frontier models we put it next to. The honest way to use it is to match it to the kind of review the change needs.\n\n**Lean on it when the bug lives in one place.** Security holes, broken authentication, and logic errors that sit inside a single function are where it reviewed at frontier level, and the wording of the prompt barely mattered.\n\n**Tell it what kind of review you want.** Its output moves more with how you phrase the request than with how long you let it think. Ask for a review of product behavior directly, because a generic “be strict” turns it into a security checklist.\n\n**Reach for a state-of-the-art model when correctness is spread across the codebase.** Cross-route rules, where the bug only shows up once you connect several files, were exactly where GLM-5.2 broke down and where GPT-5.5 and Opus 4.8 stayed reliable in a single pass.\n\n**Do not rely on one GLM-5.2 pass for changes you cannot afford to get wrong.** It can come close to the frontier on its best run, but you cannot predict which run that will be, so a second pass or a stronger model is worth it on high-stakes review.\n\nIf you are writing the prompt yourself, a few things helped in our runs. Ask for a review of behavior and consistency across routes instead of a generic production sign-off, since the strict “block or approve this production PR” framing pushed GLM-5.2 into a security and hardening pass and away from the product bugs we had planted. 
Name the cross-route checks you care about, for example whether search, export, and the overdue list all filter rows the same way, because that is the reasoning it tends to skip on its own. Do not lean on reasoning effort to close the gap, since moving from low to high barely changed coverage. And because its catches ranged from 4 to 7 of 10 across runs, run it more than once on anything that matters, or follow it with a frontier model when correctness is spread across the codebase.", "url": "https://wpnews.pro/news/glm-5-2s-code-reviews-are-only-as-good-as-your-prompt", "canonical_source": "https://blog.kilo.ai/p/glm-52s-code-reviews-are-only-as", "published_at": "2026-06-26 12:51:16+00:00", "updated_at": "2026-06-26 13:10:33.700578+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-research"], "entities": ["Z.ai", "GLM-5.2", "Kilo Code CLI", "Bun", "Hono", "Drizzle", "SQLite", "TypeScript"], "alternates": {"html": "https://wpnews.pro/news/glm-5-2s-code-reviews-are-only-as-good-as-your-prompt", "markdown": "https://wpnews.pro/news/glm-5-2s-code-reviews-are-only-as-good-as-your-prompt.md", "text": "https://wpnews.pro/news/glm-5-2s-code-reviews-are-only-as-good-as-your-prompt.txt", "jsonld": "https://wpnews.pro/news/glm-5-2s-code-reviews-are-only-as-good-as-your-prompt.jsonld"}}