{"slug": "glm-5-2-beats-claude-in-security-benchmark", "title": "GLM 5.2 Beats Claude in Security Benchmark", "summary": "Zhipu AI's open-weight GLM 5.2 model achieved a 39% F1 score in Semgrep's IDOR detection benchmark, outperforming Claude Code's 32% and Claude Opus 4.8. The MIT-licensed model runs locally, enabling secure code analysis for organizations with compliance constraints, and its Mixture-of-Experts architecture keeps inference costs low at $0.17 per bug.", "body_md": "# GLM 5.2 Beats Claude in Security Benchmark\n\nZhipu AI's open-weight model outshines proprietary giants in detecting complex access control vulnerabilities without leaking code.\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)\n\nFinding security vulnerabilities in code is one of the most demanding tasks we hand over to large language models. Unlike simple syntax checks, identifying logical flaws like Insecure Direct Object References (IDORs) requires a deep understanding of authorization boundaries, routing, and state. For a long time, conventional wisdom said you needed massive, proprietary frontier models behind expensive APIs to even stand a chance.\n\nA recent benchmark from security platform [Semgrep](https://semgrep.dev) turned that assumption on its head. In a head-to-head evaluation of IDOR detection, GLM 5.2, an open-weight model from Zhipu AI, achieved an F1 score of 39%. That comfortably surpassed Claude Code, which posted 32%, and even outpaced Claude Opus 4.8 in raw prompting scenarios.\n\nThis is a massive moment for security teams. 
For organizations that cannot send proprietary codebases to external APIs due to compliance or privacy constraints, the arrival of a highly capable, MIT-licensed model that runs locally changes the math entirely.\n\n## The Architecture of GLM 5.2\n\nZhipu AI rolled out GLM 5.2 to its coding plan members on June 13, 2026, and released the open weights under an MIT license on June 16, 2026.\n\nUnder the hood, GLM 5.2 is a Mixture-of-Experts (MoE) model. It has roughly 750 billion total parameters but activates only about 40 billion per token, which keeps per-token compute, and with it inference cost, low. In Semgrep's testing, GLM 5.2 found vulnerabilities at an estimated cost of just $0.17 per bug.\n\nEqually important for security audits is the model's expanded context window, which now stretches to 1 million tokens, up from 200K. Security analysis is rarely self-contained. To find an IDOR, a model must trace a request from an HTTP controller, through middleware checks, down to the database query, often spanning dozens of files. Zhipu AI designed this context window to remain reliable across long, complex agent trajectories, so that the model does not lose the thread when parsing deeply nested codebases.\n\n## Raw Prompting vs. The Harness\n\nWhile GLM 5.2's victory over Claude Code is impressive, the benchmark highlights a critical architectural lesson: the model is only as good as the scaffolding around it.\n\nIn this evaluation, both models were tested using a basic [Pydantic AI](https://ai.pydantic.dev) harness. They received the same IDOR prompt, a basic search strategy, and pointers on what IDORs look like, but no advanced assistance like endpoint discovery or guided navigation.\n\nWhen we look at the broader picture, Semgrep's own multimodal pipeline scored between 53% and 61% F1. The difference? Semgrep's pipeline runs inside a custom harness designed specifically for static analysis. 
This harness does the heavy lifting: it enumerates application endpoints, prunes irrelevant code, and feeds the model only the most critical context.\n\n```\nxychart-beta\n    title \"IDOR Detection Performance (F1 Score %)\"\n    x-axis [\"Claude Code\", \"GLM 5.2\", \"Semgrep Pipeline (Max)\"]\n    y-axis \"F1 Score (%)\" 0 --> 70\n    bar [32, 39, 61]\n```\n\nThe data shows that while a superior model provides a better baseline, building a smart, agentic harness around the model is what moves the needle from experimental to production-ready.\n\n## What This Means for Your Security Workflow\n\nFor developers looking to adopt AI-driven security scanning, GLM 5.2 offers a compelling path forward.\n\nFirst, the MIT license means you can host this model on your own infrastructure. If you are working in fintech, healthcare, or any sector with strict data sovereignty rules, sending code to external APIs is often a non-starter. Running GLM 5.2 locally removes this blocker.\n\nHowever, hosting a 750-billion-parameter MoE model is not trivial. Even though only about 40 billion parameters are active per token, the router can send any token to any expert, so in a typical deployment all of the roughly 750 billion parameters must stay in memory, alongside the KV cache for the 1-million-token context window. Teams will need to balance the infrastructure costs of running high-end GPUs against the API costs of proprietary models.\n\nTo get started, developers should avoid throwing raw code at the model in a single prompt. Instead, mimic the success of Semgrep's multimodal pipeline. Build an agentic workflow that maps out API endpoints, identifies authorization middleware, and extracts only the relevant controller code before feeding it to GLM 5.2.\n\nThe success of GLM 5.2 proves that open-weight models are no longer the underdogs in specialized, highly complex domains like cybersecurity. 
By combining the privacy of local execution with performance that rivals or exceeds proprietary giants, GLM 5.2 gives developers a powerful new tool to secure their codebases on their own terms.\n\n## Sources & further reading\n\n- [GLM 5.2 beats Claude in our benchmarks](https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/) (semgrep.dev)\n\n[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza) · Senior Editor\n\nMariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.", "url": "https://wpnews.pro/news/glm-5-2-beats-claude-in-security-benchmark", "canonical_source": "https://www.devclubhouse.com/a/glm-52-beats-claude-in-security-benchmark", "published_at": "2026-06-28 23:04:04+00:00", "updated_at": "2026-06-28 23:30:47.820637+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research", "ai-products", "ai-infrastructure"], "entities": ["Zhipu AI", "GLM 5.2", "Claude Code", "Claude Opus", "Semgrep", "Pydantic AI"], "alternates": {"html": "https://wpnews.pro/news/glm-5-2-beats-claude-in-security-benchmark", "markdown": "https://wpnews.pro/news/glm-5-2-beats-claude-in-security-benchmark.md", "text": "https://wpnews.pro/news/glm-5-2-beats-claude-in-security-benchmark.txt", "jsonld": "https://wpnews.pro/news/glm-5-2-beats-claude-in-security-benchmark.jsonld"}}