I gave Gemini 3.5 Flash a CVE-fix PR to review. It found another bug in the same file.

The author built a code-review agent using Google's newly announced Gemini 3.5 Flash model and tested it on three real production pull requests. The model successfully identified three legitimate bugs with zero hallucinations, including an unrelated regex bug in the same file as a patch for a known Fastify security vulnerability (CVE-2026-25223). The entire agent was built in roughly two hours using approximately 80 lines of TypeScript and the `@google/genai` SDK with structured JSON output.

*This is a submission for the Google I/O Writing Challenge.*

Across 3 real production PRs, I asked Gemini 3.5 Flash to do a code review. The model, announced this week at Google I/O 2026, caught 3 legitimate bugs and hallucinated 0, in roughly 4 seconds per PR.

The middle PR was the patch for a known security vulnerability in Fastify (CVE-2026-25223, a validation bypass). The model flagged a second, unrelated regex bug in the exact file being patched.

Here's what I learned building a code-review agent in about 2 hours with Google's new model.

## Why I tested this

At the I/O keynote, Sundar Pichai pitched Gemini 3.5 Flash as "frontier intelligence combined with action", optimized for agentic coding and long-horizon tasks. Code review is the perfect stress test: it requires reasoning about code semantics, cross-file context, and judgment about what matters.

Reading another 50 hype threads on X felt pointless. So I built the smallest possible agent that could actually use the model on real code, ran it on three concrete PRs, and counted what it got right, what it made up, and what it missed.

## The architecture

Three stages, ~80 lines of TypeScript, runs on Node 20+:

```
INPUT            PROCESSING                      OUTPUT
─────            ──────────                      ──────
owner/repo#N  →  1. fetch the .diff URL       →  stdout (colored summary)
                 2. truncate if > 150k chars     out/{slug}.json
                 3. build prompt + schema        out/{slug}.md
                 4. Gemini 3.5 Flash call
                 5. Zod-parse the response
```

No GitHub token (public PRs use the unauthenticated `.diff` URL). No octokit. No frameworks. Just the new `@google/genai` SDK with structured output.
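
The first two stages are simple enough to sketch. The snippet below is a minimal sketch, not the author's code: the helper names (`diffUrl`, `truncateDiff`, `fetchDiff`), the truncation marker, and the reading of "150k chars" as an upper bound are all assumptions. Only the unauthenticated `.diff` URL convention comes from the article.

```typescript
// Hypothetical helpers illustrating stages 1 and 2 of the pipeline.
// Assumption: diffs longer than 150,000 characters are cut at that length.
const MAX_DIFF_CHARS = 150_000;

// Build the public, unauthenticated .diff URL for a GitHub pull request.
function diffUrl(owner: string, repo: string, pr: number): string {
  return `https://github.com/${owner}/${repo}/pull/${pr}.diff`;
}

// Keep the diff within the size budget and mark the cut so the
// reviewer model can tell that the input is incomplete.
function truncateDiff(diff: string, maxChars: number = MAX_DIFF_CHARS): string {
  if (diff.length <= maxChars) return diff;
  return diff.slice(0, maxChars) + "\n[diff truncated]\n";
}

// Fetch the diff as plain text. Uses the global fetch available in Node 18+.
async function fetchDiff(owner: string, repo: string, pr: number): Promise<string> {
  const res = await fetch(diffUrl(owner, repo, pr));
  if (!res.ok) {
    throw new Error(`Failed to fetch diff: HTTP ${res.status}`);
  }
  return truncateDiff(await res.text());
}
```

Cutting at a fixed character count is the simplest option and can split a hunk in half; a variant that cuts at the last complete file boundary would be a reasonable refinement.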

## The core

The heart of the pipeline is a single `review` function: pass it a diff, get back a typed array of issues.

```ts
import { GoogleGenAI } from "@google/genai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Shape of a single finding returned by the model.
const IssueSchema = z.object({
  file: z.string(),
  line: z.number().nullable(),
  severity: z.enum(["low", "medium", "high", "critical"]),
  category: z.enum(["bug", "security", "performance", "style", "logic", "maintainability"]),
  message: z.string(),
  suggestion: z.string().nullable(),
});

const ReviewSchema = z.object({
  summary: z.string(),
  issues: z.array(IssueSchema),
});

const SYSTEM_PROMPT = `You are a senior code reviewer. Analyze the unified git diff below and produce a JSON review.

Rules:
- Flag REAL issues only — no nitpicks, no style preferences.
- Prefer fewer, higher-quality issues over volume.
- Each "message" must explain WHY it matters (impact, not just observation).
- If you cannot see enough context to be sure, lower the severity.

Return the full review as JSON matching the provided schema.`;

async function review(diff: string) {
  const res = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: `${SYSTEM_PROMPT}\n\n--- DIFF ---\n${diff}`,
    config: {
      responseMimeType: "application/json",
      responseJsonSchema: zodToJsonSchema(ReviewSchema),
    },
  });
  // Validate the model's JSON against the same Zod schema before returning it.
  return ReviewSchema.parse(JSON.parse(res.text ?? "{}"));
}
```

A few details worth flagging:

- Model string: `"gemini-3.5-flash"`. GA since May 19, 2026.
- Structured output: use `responseJsonSchema`, not the older `responseSchema`. It validates against the Zod-derived schema and returns conformant JSON. No regex-parsing the response, no try/catch for malformed output.
- No temperature tuning: Google explicitly recommends not setting `temperature`, `top_p`, or `top_k` on the 3.5 family; the model handles sampling internally.

Full repo at the end. Now the interesting part.

## The three PRs

I picked PRs with very different shapes to see how the model behaved across contexts.

| PR | Type | Lines | Why |
|---|---|---|---|
| [fastify#6414](https://github.com/fastify/fastify/pull/6414) | | | |
| [express#6100](https://github.com/expressjs/express/pull/6100) | | | |

## Final scorecard

```
PR 1 (express#6190):  +0  −0   Model agreed: no issues
PR 2 (fastify#6414):  +3  −0   3 hits, 0 hallucinations
PR 3 (express#6100):  +0  −0   Model agreed: no issues
──────────────────────────────────────────────────────────────
Total:                +3  −0   Zero false positives.
```

## What it caught: the headline

PR 2 is the one that mattered. Fastify pull #6414 rewrote the entire content-type parser to fix a security flaw (CVE-2026-25223) where attackers could bypass body validation by appending a tab character to `Content-Type` (e.g. `application/json\tx`). The fix introduced a new `ContentType` class and replaced the old loose string-matching logic.

This is exactly the kind of high-stakes, security-sensitive refactor where an automated reviewer either earns its place or doesn't.

The model flagged three issues. Here's each one, verified against the actual code.

### Hit 1: inconsistent variable use in `existingParser`

> MEDIUM · logic: The `existingParser` method checks `contentType === "application/json"` and `this.customParsers.has(contentType)` using the original `contentType` string instead of the newly calculated, normalized `ct` variable.

Looking at the new code in `lib/content-type-parser.js`:

```js
ContentTypeParser.prototype.existingParser = function (contentType) {
  if (typeof contentType === 'string') {
    const ct = new ContentType(contentType).toString()
    if (contentType === 'application/json' && this.customParsers.has(contentType)) {
      return this.customParsers.get(ct).fn !== this[kDefaultJsonParse]
    }
    if (contentType === 'text/plain' && this.customParsers.has(contentType)) {
      return this.customParsers.get(ct).fn !== defaultPlainTextParser
    }
  }
  return this.hasParser(contentType)
}
```

The model is right.
`ct` is the normalized version, but the conditional guards still test the raw `contentType`. Since `customParsers` only holds normalized keys (see line 85: `this.customParsers.set(normalizedContentType, parser)`), any header with a different case or trailing parameters silently skips the fast path. Subtle, easy to miss in review.

### Hit 2: a regex missing its end anchor

> HIGH · security: The `subtypeNameReg` regular expression is missing a trailing `$` anchor. Consequently, any string starting with a valid subtype will match successfully.

This one is the headline. In the brand new file `lib/content-type.js`, the patch defines two parallel regexes:

```js
const typeNameReg = /^[\w!#$%&'*+.^`|~-]+$/ // has $
const subtypeNameReg = /^[\w!#$%&'*+.^`|~-]+\s*/ // no $
```

The subtype regex anchors at the start but not at the end. Inputs like `application/json/extra` pass the validation gate where they shouldn't.

In a PR whose entire purpose is fixing a validation-bypass CVE, a senior reviewer would put this in red on the first pass. The model put it in HIGH on the first pass.

I am not claiming this is itself exploitable at the same severity as the original CVE; the downstream parsers may not be reachable in a way that materializes the bug. But the pattern is exactly the class of issue that did materialize as CVE-2026-25223. Pattern-recognition of dangerous shapes is half of what code review is.

### Hit 3: stateful global regex

> MEDIUM · bug: The `keyValuePairsReg` regex is defined globally with the `/g` flag. Because of this, it is stateful and relies on `lastIndex`. If parsing throws an exception or future modifications exit the loop early, `lastIndex` will not reset to 0.

Confirmed at the top of `lib/content-type.js`:

```js
const keyValuePairsReg = /([\w!#$%&'*+.^`|~-]+)=([^;]*)/gm
```

Used inside a class constructor with `.exec` in a loop. In healthy execution, `lastIndex` resets to 0 when `exec` returns `null`.
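
Both regex behaviours discussed above (the missing end anchor from Hit 2 and the `lastIndex` state from Hit 3) are easy to reproduce in isolation. The sketch below uses deliberately simplified, hypothetical patterns and helper names (`anchored`, `unanchored`, `pairReg`, `safeReg`, `firstKey`, `firstKeySafe`); none of them are taken from the Fastify patch.

```typescript
// Hit 2 in miniature: the same character class with and without a trailing $.
const anchored = /^[a-z]+$/;
const unanchored = /^[a-z]+\s*/; // no end anchor: only the prefix is checked

console.log(anchored.test("json/extra"));   // false
console.log(unanchored.test("json/extra")); // true, "json" matches and the rest is ignored

// Hit 3 in miniature: a module-level regex with the /g flag keeps state in lastIndex.
const pairReg = /(\w+)=(\w+)/g;

// Returns after the first match, which stands in for a `break` or an
// exception inside an exec loop: lastIndex is never reset to 0.
function firstKey(input: string): string | null {
  const m = pairReg.exec(input);
  return m ? m[1] : null;
}

console.log(firstKey("a=1;b=2")); // "a", and pairReg.lastIndex is now 3
console.log(firstKey("c=3"));     // null: the search started at index 3, past the match

// matchAll iterates over an internal copy of the regex, so the shared
// object's lastIndex is never advanced and calls stay independent.
const safeReg = /(\w+)=(\w+)/g;

function firstKeySafe(input: string): string | null {
  for (const m of input.matchAll(safeReg)) {
    return m[1];
  }
  return null;
}

console.log(firstKeySafe("a=1;b=2")); // "a"
console.log(firstKeySafe("c=3"));     // "c"
```
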
But the failure mode (an exception inside the loop body, or any future `break`) leaves a stale `lastIndex` behind, so the next parse in the same process silently starts matching from the wrong offset. The model's suggested fix (use `matchAll` instead) is exactly the JavaScript-idiomatic answer.

This is a latent footgun, not a live bug. Severity MEDIUM is arguably high. But it's a real thing the model saw.

## What it didn't catch: the honest part

Two failure modes worth being honest about.

Cross-file context. The model only sees the diff. It can't tell whether a function called by the changed code is safe, whether a removed branch was load-bearing somewhere else, or whether tests actually cover the new behavior. For PR #6414 in particular, the upstream callers of the new `ContentType` class are not in the diff, and the model never reasoned about them.

Severity calibration is rough. The regex-without-anchor is HIGH. The stateful `/g` is MEDIUM. In practice, those probably want to swap: the regex one is a clear pattern with security relevance, the global-regex one is a latent footgun unlikely to fire. Junior-reviewer instincts.

I also can't conclusively measure what the model missed without reviewing every comment thread on the PR by hand. The merged commit went through multiple rounds of feedback (commits like "address feedback", "refactor algorithm", "appease coverage"), so reviewers did catch things. How many of those are in-diff issues a tool could have seen, versus broader design decisions, I'd need another afternoon to know.

## What I'd actually use this for

Three takeaways after running this on real code:

- It earns a place as a first-layer pre-review. Specifically: PRs that touch parsers, validators, or anything that consumes external input. The cost is around $0.003 per PR. The cost of not running it is shipping a regex without an anchor on a security-sensitive code path.
- It does not replace human reviewers.
  It cannot reason about distributed state, concurrency, transactions, or anything that requires understanding multiple files in concert.
- Hallucination rate was zero in this sample, but the sample is tiny. The literature on similar models suggests false positives in the 15-25% range on real-world PRs. Three out of three being valid is great but is not a benchmark.

The 80 lines of TypeScript that produced this run are [on GitHub](https://github.com/vicente-r-junior/gemini-code-review). Two things that are non-obvious about the setup:

- `@google/genai` v2 uses `responseJsonSchema`, not `responseSchema`. Easy to get wrong if you're translating tutorial code from an older Gemini.
- Public GitHub PRs expose a `.diff` endpoint that requires no auth. You don't need octokit for an MVP.

If you try it on PRs with shapes I didn't test (concurrency-heavy, multi-file, generated code), tell me what you find. The interesting question is where the model breaks, not where it works.

Built and tested in May 2026 with Gemini 3.5 Flash, GA two days before publication.