# Claude vs Gemini Across 4 Security Domains: A Dead Heat — and the Hardening 63% of AI Code Skips

> Source: <https://dev.to/ofri-peretz/claude-vs-gemini-across-4-security-domains-a-dead-heat-and-the-hardening-63-of-ai-code-skips-mpp>
> Published: 2026-05-31 03:39:06+00:00

The interesting result isn't who won. It's that across four security domains, Claude and Gemini missed **the same hardening steps** — and if you've shipped AI-generated auth middleware this year, your code almost certainly has the same gaps, and your review didn't catch them either.

For the record, the scoreboard: **one Gemini win, two ties, one split — a statistical dead heat.** That's the last time the *winner* matters in this article.

Here's the number that should bother you more than any leaderboard: across 700 AI-generated functions scored by the rules I'm about to use, **63% shipped a vulnerability**. So "which model writes more secure code?" is mostly the wrong question — I've [run that leaderboard myself](https://dev.to/ofri-peretz/we-ranked-5-ai-models-by-security-the-leaderboard-is-wrong-5a4o) and argued it's the wrong frame. But people keep asking it, so I ran it properly — on the ESLint security plugins I wrote specifically to catch these bugs, each mapped to a CWE — to show you what actually matters.

Four domains, four of my plugins. For each, the *same* feature-only prompt (no "make it secure" hint, because that's how people actually use these tools), generated once by **Gemini 2.5 Flash via the Gemini CLI** and once by **Claude Sonnet 4.6 via the Claude CLI**, then linted with the domain's plugin on `recommended`.

*Method honesty: this is Gemini Flash vs Claude Sonnet — the comparable price/latency tier each vendor's CLI defaults to (Pro and Opus are a separate bracket; more on that below). It compares CLI tooling, system prompt included, not raw models under controlled decoding. n=1 per domain — but I re-ran the JWT round, and both models landed on 5 findings again with the same core misses, so treat these as directional with stable failure modes, not ±0 gospel.*

| Domain | Prompt | Plugin | Gemini | Claude |
|---|---|---|---|---|
| NestJS service | users + auth + admin | `nestjs-security` | 2 | 6 |
| JWT auth | login + verify middleware | `jwt` | 5 | 5 |
| MongoDB data layer | Mongoose model + search | `mongodb-security` | 8 | 8 |
| General API (injection) | import + search + reset | `secure-coding` | 9 | 13* |

One Gemini win, two dead heats, one split. The frontier security gap is **smaller than the discourse suggests** — and the count is the least interesting number here.

*Table legend below: ✗ = one violation of that rule, ✗✗ = two, ✗✗✗ = three, — = rule didn't fire (clean).*

The one clean win, [written up in full separately](https://dev.to/ofri-peretz/i-ran-the-same-nestjs-prompt-on-claude-and-gemini-one-got-6-security-errors-heres-what-both-1fnf). Short version: asked for a users service, Gemini's CLI reached for idiomatic NestJS: class-level `@UseGuards`, `@Exclude()` on the password field, `class-validator` on every DTO. `nestjs-security` found **2** issues. Claude wrote functionally identical code with none of that scaffolding and drew **6**.

In an opinionated framework, Gemini defaults to the secure idiom. Hold that thought.

Both wrote clean `jsonwebtoken` code: a signed login token, middleware that *verifies* (no `jwt.decode` shortcut, no `alg: none`, no hardcoded secret; every catastrophic JWT footgun avoided by both). Then both stopped at exactly the same place:

| `jwt` rule | CWE | Gemini | Claude |
|---|---|---|---|
| `require-algorithm-whitelist` | | | |
| `require-audience-validation` | | | |
| `require-issuer-validation` | | | |
| `require-max-age` | | | |
| `no-sensitive-payload` | | | |

Here's *why it survives review*: a reviewer reading `jwt.verify(token, secret)` sees a verify call and ships it. Nobody asks the next question: verifies *for whom?* Without an `audience` option, a token your service minted for a *different* API sails straight through. That blind spot is exactly what `require-audience-validation` encodes, and it's why both models, and most human review, walk past it. Call the round 5–5.
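A minimal sketch of that blind spot, using only Node's built-in `crypto` so it runs without dependencies. It is an illustration, not a replacement for a maintained JWT library: the `sign` and `verify` helpers, the `billing-api` / `orders-api` audience names, and the string-only `aud` handling are all assumptions made for the example.

```javascript
// Illustration only: a hand-rolled HS256 check so the audience gap is visible.
// Use a maintained JWT library in real code. All names here are hypothetical.
const { createHmac, timingSafeEqual } = require('node:crypto');

const b64url = (value) => Buffer.from(value).toString('base64url');

// Mint a compact HS256 token for the given claims.
function sign(claims, secret) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const payload = b64url(JSON.stringify(claims));
  const signature = createHmac('sha256', secret)
    .update(`${header}.${payload}`)
    .digest('base64url');
  return `${header}.${payload}.${signature}`;
}

// Verify the signature, then optionally pin the audience and issuer.
function verify(token, secret, expected = {}) {
  const parts = String(token).split('.');
  if (parts.length !== 3) throw new Error('malformed token');
  const [header, payload, signature] = parts;

  // Algorithm allowlist: only HS256 is accepted in this sketch.
  const { alg } = JSON.parse(Buffer.from(header, 'base64url').toString());
  if (alg !== 'HS256') throw new Error('algorithm not allowed');

  const wanted = createHmac('sha256', secret)
    .update(`${header}.${payload}`)
    .digest();
  const got = Buffer.from(signature, 'base64url');
  if (got.length !== wanted.length || !timingSafeEqual(got, wanted)) {
    throw new Error('bad signature');
  }

  // Simplification: `aud` is treated as a single string, not an array.
  const claims = JSON.parse(Buffer.from(payload, 'base64url').toString());
  if (expected.audience && claims.aud !== expected.audience) {
    throw new Error('wrong audience');
  }
  if (expected.issuer && claims.iss !== expected.issuer) {
    throw new Error('wrong issuer');
  }
  return claims;
}
```

A token signed for `billing-api` passes `verify(token, secret)` because the signature is valid, and is rejected only once the caller passes `{ audience: 'orders-api' }`. With `jsonwebtoken`, the corresponding controls are the `algorithms`, `audience`, `issuer`, and `maxAge` options of `jwt.verify`.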

The finding that should make you check your own repo first: both models wrote the search to return **whole documents — password hashes included — with no projection**.

``` js
// Both models, essentially:
const results = await User.find(filter);   // ships passwordHash to the caller
// the fix neither wrote:
const results = await User.find(filter).select('-passwordHash').lean();
```

That's `require-projection` (CWE-200) and `no-select-sensitive-fields` firing on both sides. The pleasant surprise: the prompt hands a user-supplied search object straight into a Mongoose query (a textbook `$where`/operator-injection trap), and **both models sidestepped it.** Zero `no-operator-injection`, zero `no-unsafe-where`, zero `no-unsafe-query` on either side. The frontier has internalized "don't interpolate untrusted input into a query." It just hasn't internalized "don't hand back the password column."
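One way to get both halves right at once is to build the filter, projection, and limit from untrusted input in a single place. The sketch below is a hedged illustration, not what either model produced: `buildSearchQuery`, the allowed field names, and the limit of 50 are assumptions chosen for the example.

```javascript
// Sketch: turn untrusted search input into a safe, bounded, projected query.
// Field names (`name`, `email`, `passwordHash`) and MAX_LIMIT are assumptions.
const ALLOWED_FIELDS = ['name', 'email'];
const MAX_LIMIT = 50;

function buildSearchQuery(input = {}) {
  const filter = {};
  for (const field of ALLOWED_FIELDS) {
    const value = input[field];
    // Accept plain strings only, so objects such as { $ne: null } or
    // { $where: '...' } never reach the query.
    if (typeof value === 'string') filter[field] = value;
  }

  // Cap the page size so a search can never be unbounded.
  const requested = Number.parseInt(input.limit, 10);
  const limit = Number.isInteger(requested) && requested > 0
    ? Math.min(requested, MAX_LIMIT)
    : MAX_LIMIT;

  // Exclusion projection: everything except the hash is returned.
  const projection = { passwordHash: 0 };
  return { filter, projection, limit };
}

// With Mongoose this would be used roughly as:
//   const { filter, projection, limit } = buildSearchQuery(req.query);
//   const users = await User.find(filter, projection).limit(limit).lean();
```

Fields outside the allowlist are dropped rather than rejected, which keeps the sketch short; a stricter service might return a 400 instead.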

| `mongodb-security` rule | CWE | Gemini | Claude |
|---|---|---|---|
| `require-schema-validation` | CWE-20 | ✗✗✗ | ✗ |
| `require-projection` | CWE-200 | | |
| `require-lean-queries` | CWE-400 | | |
| `no-select-sensitive-fields` | | | |
| `no-unbounded-find` | | | |
| `no-bypass-middleware` | | | |

Different distribution, same total (8–8), but one cell deserves an honest call-out, because it cuts *against* my own headline: `require-schema-validation` fired **three times on Gemini and once on Claude**. Here, Claude was the more disciplined one: it wired up more of Mongoose's schema-level validation, where Gemini leaned on looser typing. "Gemini is frontier-grade" doesn't mean "Gemini wins every cell"; this is a cell it lost. (And yes, `require-lean-queries` is CWE-400, not classic injection: `.lean()` returns plain objects instead of hydrated Mongoose documents, and on an unbounded search that's a real memory-exhaustion lever, which is why it's scored as a resource control, not a nice-to-have.)

\*The asterisk. On a raw injection-prone API (JSON/XML import, dynamic search, password reset), `secure-coding` flagged Gemini **9** and Claude **13**, but that count is backwards. Claude's extra findings came from Claude *doing more*: it explicitly rejected XML `DOCTYPE`/`ENTITY` (XXE-hardened), allowlisted the search field, and actually implemented token verification. And here's the honest part: it implemented some of that *insecurely*:

``` ts
import { createHash, timingSafeEqual } from 'crypto';

// Claude's reset flow (CWE-208, timing-unsafe):
if (providedToken === storedToken) { /* ...reset... */ }

// The fix: hash both to a fixed length first, then compare.
const hash = (s: string) => createHash('sha256').update(s).digest();
if (timingSafeEqual(hash(providedToken), hash(storedToken))) { /* ...reset... */ }
// Direct timingSafeEqual(Buffer.from(a), Buffer.from(b)) throws if lengths differ,
// leaking token length to an attacker, so always normalise lengths first.
```

Claude wrote that `===` comparison **five times** (`no-insecure-comparison`, CWE-208). It's the one *real* vulnerability either model introduced across this entire benchmark, and it exists precisely *because* Claude built the verification surface at all. Gemini's leaner 97 lines issued a token and never compared one, so it had no surface to get wrong. Count favored Gemini; substance is genuinely mixed: Claude hardened more **and** shipped the only real bug.
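Because the comparison recurs, a single shared helper is one way to stop writing it by hand. This is a sketch under assumptions: `safeTokenEqual` and `digest` are hypothetical names, and hashing with SHA-256 is used here only to give both inputs the same length before `timingSafeEqual` runs.

```javascript
const { createHash, timingSafeEqual } = require('node:crypto');

// Hash to a fixed 32 bytes so both buffers always have equal length.
const digest = (value) => createHash('sha256').update(String(value)).digest();

// Constant-time equality for secrets of arbitrary (and possibly different) length.
function safeTokenEqual(provided, stored) {
  return timingSafeEqual(digest(provided), digest(stored));
}
```

Calling `timingSafeEqual` directly on two buffers of different lengths throws, whereas `safeTokenEqual('abc', 'abcd')` simply returns `false`.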

Before anyone screenshots "Gemini ties Claude on security" — that holds for *realistic, structured* tasks. On **isolated, security-sensitive functions** it inverts. In a [separate 700-function run](https://dev.to/ofri-peretz/aggregate-benchmarks-lie-heres-what-700-ai-functions-look-like-by-security-domain-1hgj) scored by these same plugins, the average vulnerability rate was **63%** — and **Gemini 2.5 Pro was the most vulnerable model at 72.9%** (Flash sat mid-pack at 63.6%). Build a

(The whole method rests on "scored by the plugins I wrote," so a fair question is whether the *scorer* is trustworthy — [here's what ground truth caught that my own unit tests missed](https://dev.to/ofri-peretz/what-ground-truth-caught-that-unit-tests-missed-3-real-bugs-in-9-flagship-lint-rules-o0b).)

Strip out the leaderboard and two things are left:

No `alg: none`, no `jwt.decode`-without-verify, no `eval`, no hardcoded credentials, in any domain. (The lone `aud`/`iss` validation gap is the one most appsec engineers would patch first. "Hardening" undersells it; I'm flagging it as the missing control, not as harmless.) If you're building with Gemini, you're starting from a credible security baseline.

Which is the whole point of static analysis: it asks the questions your prompt didn't.

``` js
// eslint.config.mjs
import jwt from 'eslint-plugin-jwt';
import mongodbSecurity from 'eslint-plugin-mongodb-security';
import nestjsSecurity from 'eslint-plugin-nestjs-security';
import secureCoding from 'eslint-plugin-secure-coding';
import tsParser from '@typescript-eslint/parser';

export default [
  // TypeScript parser so decorators and types resolve
  { files: ['**/*.ts'], languageOptions: { parser: tsParser } },
  // Each plugin ships a flat `recommended` preset (plugin + rules)
  jwt.configs.recommended,
  mongodbSecurity.configs.recommended,
  nestjsSecurity.configs.recommended,
  secureCoding.configs.recommended,
];
```

``` bash
# eslint and the TypeScript parser are needed alongside the plugins
npm install --save-dev eslint @typescript-eslint/parser \
  eslint-plugin-jwt eslint-plugin-mongodb-security \
  eslint-plugin-nestjs-security eslint-plugin-secure-coding
npx eslint src/
```

Every rule maps to a CWE so an AI agent and a human read the same signal. Full docs at [eslint.interlace.tools](https://eslint.interlace.tools).

Which hardening step does *your* AI-generated code skip most — the algorithm allowlist, the audience check, or the query projection? Open the file and look. I'll bet it's at least two of the three. Tell me which ones — I'm collecting scorecards.

*Part of the AI Security Benchmark Series:*

📦 [eslint-plugin-jwt](https://www.npmjs.com/package/eslint-plugin-jwt) · `eslint-plugin-mongodb-security` · `eslint-plugin-nestjs-security` · `eslint-plugin-secure-coding`

[GitHub](https://github.com/ofri-peretz) | [X](https://x.com/ofriperetzdev) | [LinkedIn](https://linkedin.com/in/ofri-peretz) | [Dev.to](https://dev.to/ofri-peretz) | [ofriperetz.dev](https://ofriperetz.dev)
