# Semgrep says GLM 5.2 beat Claude in a narrow security benchmark

> Source: <https://runtimewire.com/article/semgrep-glm-52-claude-cyber-benchmark>
> Published: 2026-06-29 01:30:45+00:00

[Semgrep](https://semgrep.dev/?ref=runtimewire) co-founders Isaac Evans, Drew Dennison and Luke O'Malley have spent years arguing that application security tooling should meet developers inside their normal code workflow. A new benchmark from the company puts that thesis into sharper focus: in a [June 22 benchmark post](https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/?ref=runtimewire), Semgrep said GLM 5.2, an open-weight model from Zhipu AI, scored 39% F1 on an IDOR vulnerability-detection test when given only a prompt, ahead of Claude Code in Semgrep's setup.

That number is useful, but only if read correctly. Semgrep's own Multimodal pipeline still led the table, scoring 61% F1 with GPT 5.5 and 53% F1 with Opus 4.8, according to the company's post. GLM 5.2's showing matters less as a model leaderboard trophy than as evidence for the argument Semgrep wants customers to buy: the model is only one part of the detection system. The harness around it - what code it sees, how endpoints are surfaced, how findings are parsed, and how the workflow routes the model through a repository - can matter as much as the base model.

That is the same founder thesis Evans, Dennison and O'Malley carried out of MIT. [MIT News](https://news.mit.edu/2022/r2c-software-security-0210?ref=runtimewire) described the three as MIT EECS alumni who had lived near one another in Simmons Hall, worked together through MIT's Gordon Engineering Leadership Program, and collaborated in 2011 on Android security work for U.S. Army users. The team that became Semgrep was built around a plain problem: strong software-security analysis existed, but too much of it was hard for ordinary developers and small security teams to use at speed.

### What Semgrep actually tested

Semgrep's post focused on IDOR, or Insecure Direct Object Reference, a class of access-control bug where an application exposes an object identifier and fails to check whether the requester is allowed to access that object. These bugs are difficult for both static analysis and LLMs because the failure is often the absence of an authorization check, not a dangerous function call.
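To make the bug class concrete, here is a minimal sketch of an IDOR in plain Python. The `Invoice` type, the in-memory `INVOICES` store and both handler functions are hypothetical names invented for illustration; none of them come from Semgrep's benchmark. The vulnerable handler contains no suspicious call at all. The only difference between the two versions is the ownership check that is missing from the first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, amount_cents=5_000),
    2: Invoice(invoice_id=2, owner_id=200, amount_cents=9_900),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    """IDOR: looks the object up by ID and never checks who owns it."""
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    """Same lookup, plus the ownership check the vulnerable version omits."""
    invoice = INVOICES[invoice_id]
    if invoice.owner_id != current_user_id:
        raise PermissionError("requester does not own this invoice")
    return invoice
```

With this data, user 100 asking for invoice 2 gets another user's record from `get_invoice_vulnerable`, while `get_invoice_checked` raises `PermissionError`. A detector has to notice what is absent from the first function, which is why pattern matching on dangerous calls does not help.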

Semgrep said it held constant the dataset, the evaluation method and the IDOR prompt, then varied the model and harness. Semgrep Multimodal ran inside Semgrep's custom harness, which enumerates application endpoints and directs the model toward relevant code. By contrast, prompt-only runs used the same IDOR prompt without that endpoint-discovery scaffolding.
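The difference between the two setups can be sketched roughly as follows. This is not Semgrep's harness, whose internals the article does not describe. The `Endpoint` type, the prompt text and both functions are assumptions made up to show the general shape: a prompt-only run hands the model the whole repository and one instruction, while a harness-style run first enumerates endpoints and then asks one focused question per endpoint.

```python
from dataclasses import dataclass

# Placeholder prompt text; Semgrep's actual IDOR prompt is not reproduced here.
IDOR_PROMPT = (
    "Find endpoints that read or modify an object by ID "
    "without checking that the requester may access it."
)


@dataclass(frozen=True)
class Endpoint:
    method: str
    path: str
    handler_file: str


def prompt_only_requests(repo_files: list[str]) -> list[str]:
    """Prompt-only: a single request; the model must locate endpoints itself."""
    return [f"{IDOR_PROMPT}\nRepository files: {', '.join(repo_files)}"]


def harness_requests(endpoints: list[Endpoint]) -> list[str]:
    """Harness-style: one request per enumerated endpoint, pointed at its code."""
    return [
        f"{IDOR_PROMPT}\nEndpoint: {e.method} {e.path}\nRelevant code: {e.handler_file}"
        for e in endpoints
    ]
```

In this simplified picture the prompt and the model are identical in both paths. What changes is how much of the search problem has already been solved before the model is called.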

As reported in Semgrep's post, Semgrep Multimodal with GPT 5.5 reached 61% F1, Multimodal with Opus 4.8 reached 53% F1, and GLM 5.2 reached 39% F1 in the prompt-only setup, ahead of Claude Code in Semgrep's chart. Semgrep also estimated GLM 5.2 at roughly $0.17 per vulnerability found in the test.
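For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, so it penalizes both false positives and missed bugs. The sketch below uses the standard definition. The counts passed to it in the usage note are illustrative only and are not taken from Semgrep's dataset, and the cost helper assumes "cost per vulnerability found" means total spend divided by true positives, which the article does not confirm.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing was found."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def cost_per_finding(total_cost_usd: float, true_positives: int) -> float:
    """Assumed definition: total spend divided by confirmed true positives."""
    if true_positives <= 0:
        raise ValueError("no true positives to divide the cost over")
    return total_cost_usd / true_positives
```

As an illustration, `f1_score(39, 61, 61)` and `f1_score(30, 70, 24)` both land at roughly 0.39 even though the second run has lower precision and higher recall. A single F1 figure therefore says little about whether a tool's weakness is noise or missed bugs.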

There is a built-in limitation here: this is Semgrep's benchmark, published by Semgrep, on a specific vulnerability class, with Semgrep's own product architecture sitting at the top of the table. The result should not be read as a general claim that GLM 5.2 is better than Claude at security work, or that prompt-only open-weight models are ready to replace integrated AppSec systems. It is a narrower claim: on this IDOR benchmark, under Semgrep's evaluation, an open-weight model performed better than some frontier-code-agent configurations when both were used without Semgrep's full scaffolding.

That narrower claim is still commercially important. If an open-weight model can reach 39% F1 at low cost on a reasoning-heavy security task, application-security vendors lose the ability to sell AI security as a simple wrapper around the most expensive frontier model. The product value shifts toward orchestration, context selection, policy, and developer workflow - exactly where Semgrep has been positioning itself.

### The product bet behind the benchmark

Semgrep introduced [Semgrep Multimodal](https://semgrep.dev/blog/2026/attackers-cant-have-all-the-advantage-introducing-semgrep-multimodal/?ref=runtimewire) in March 2026, around RSAC, pitching it as a way to combine AI reasoning with rule-based detection for security findings that conventional static analysis often misses. Semgrep's product lineup now spans [Semgrep Code](https://semgrep.dev/products/semgrep-code/?ref=runtimewire) for SAST, [Semgrep Supply Chain](https://semgrep.dev/products/semgrep-supply-chain/?ref=runtimewire) for open-source dependency risk, [Semgrep Secrets](https://semgrep.dev/products/semgrep-secrets/?ref=runtimewire) for hardcoded secrets, [Semgrep Guardian](https://semgrep.dev/products/semgrep-guardian/?ref=runtimewire) for AI-generated code, and [Semgrep AppSec Platform](https://semgrep.dev/products/semgrep-appsec-platform/?ref=runtimewire) for broader application-security management.

Semgrep is trying to sell a security system for an engineering environment where code volume is rising and AI coding agents are becoming part of the development loop. Semgrep's benchmark post explicitly frames the issue as a customer question: how much vulnerability-detection performance comes from the model, and how much comes from the harness?

That question goes straight to budget. If performance were only a function of model quality, AppSec teams would be pushed toward premium frontier models as the main lever. If performance comes from the harness, then the winning vendor is the one that can cheaply route models through the right code context, reduce false positives, and land fixes in pull requests without turning every scan into a manual audit.

Semgrep has been building toward that version of the market for years. Its homepage positions the company as a unified AppSec platform for SAST, SCA, secrets detection, AI-generated-code guardrails, workflows, and remediation. It lists logos including Lyft, Dropbox, Figma, Slack, GitLab, HashiCorp, Trail of Bits, and Vanta, though those customer and user claims are Semgrep's own marketing unless backed by individual case studies.

### The funding context

The benchmark also lands after Semgrep raised a large late-stage round to fund this AI-security push. In February 2025, the company [announced a $100 million Series D](https://www.prnewswire.com/news-releases/semgrep-announces-100m-series-d-funding-to-advance-ai-powered-code-security-302367780.html?ref=runtimewire) led by Menlo Ventures, with Felicis, Harpoon, Lightspeed, Redpoint and Sequoia participating. Semgrep said at the time that the round brought total funding to $204 million. The valuation was not disclosed.

That financing matters because the competitive field around developer security is crowded and moving quickly. Semgrep competes against established application-security vendors such as Snyk, Checkmarx, Veracode, Sonar, GitHub's CodeQL-based security tooling, and newer startups focused on software supply chain and AI-era code risk. The common pressure is the same: AI coding tools are increasing the amount of code written, while security teams remain constrained by review capacity.

For Evans and Semgrep, GLM 5.2 beating Claude in one prompt-only benchmark is not the end point. It is a proof point in a larger market argument. Open-weight models are improving fast enough that the durable advantage may not be proprietary access to a closed model. It may be the system that knows where to send the model, which files matter, how to translate output into an actionable finding, and when a developer should trust the result.

That is a founder-friendly result, but not an unqualified one. Semgrep's reported top score was still 61% F1, which leaves plenty of missed bugs and false positives in a high-stakes workflow. A 39% F1 open-weight result is impressive relative to the other prompt-only runs, but it is not a security guarantee. The operating question for buyers is not whether GLM 5.2 beat Claude on Semgrep's chart. It is whether Semgrep can keep converting better model economics into fewer ignored findings and faster fixes inside real engineering teams.
