AI For Security Review In Application Code

In a 2025 benchmark comparing three industry static analysis tools against three frontier LLMs on sixty-three real vulnerabilities in ten C# projects, the LLMs achieved F1 scores between 0.75 and 0.80, significantly outperforming the best static tool (Snyk Code at 0.55) and the worst (SonarQube at 0.26). However, the LLMs win on recall while losing badly on precision, with a separate IDOR detection analysis finding that 88% of issues flagged by a popular AI coding agent were false positives. The trade-off means AI security review catches more real vulnerabilities but also generates substantial noise, requiring careful pipeline design to separate signal from false alarms.

A 2025 benchmark ran three industry static analysis tools (SonarQube, CodeQL, and Snyk Code) against sixty-three real vulnerabilities planted in ten real-world C# projects. The best of them, Snyk Code, finished with an F1 of about 0.55. The worst, SonarQube, landed at 0.26. Then the same researchers ran the same set through three frontier LLMs. GPT-4.1, Mistral Large, and DeepSeek V3 all landed between 0.75 and 0.80, mostly by catching things the static tools just walked past.

If you read that as "AI wins, replace the SAST", you'd be wrong. The same study, and a pile of others like it, show that LLMs win on recall (they catch more) while losing badly on precision. A separate analysis of IDOR detection found that 88% of the issues a popular AI coding agent flagged as IDORs weren't actually IDORs.

So you can hand your AI a 50-file pull request, and it'll find the SQL injection you missed. It'll also find six injection bugs that aren't injection bugs, two race conditions that aren't races, and a "potential authorization bypass" in code that has no authorization in it.

That tension is what AI security review really is. You're trading a reviewer that misses confidently for a reviewer that finds things confidently, including things that don't exist.
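
To put rough numbers on that trade, here's a small sketch of how precision, recall, and F1 interact. The counts are hypothetical, picked only so the F1 values land near the ones quoted above; they are not the study's actual confusion matrices.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw finding counts."""
    precision = tp / (tp + fp)   # share of reported findings that are real
    recall = tp / (tp + fn)      # share of real vulnerabilities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts against 63 known vulnerabilities (NOT the study's data).
# A quiet reviewer: few findings, almost all of them real.
sast_like = precision_recall_f1(tp=25, fp=3, fn=38)
# A loud reviewer: most vulnerabilities found, plus 25 false alarms.
llm_like = precision_recall_f1(tp=55, fp=25, fn=8)

for label, (p, r, f1) in (("SAST-like", sast_like), ("LLM-like", llm_like)):
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

With these made-up counts the loud reviewer ends up with the higher F1 (about 0.77 against 0.55) even though nearly a third of what it reports is wrong. F1 rewards the recall gain; it says nothing about who has to triage the 25 false alarms.
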
The point of this article is to walk through where that trade pays off across the four classic vuln classes (SQL injection, XSS, auth bugs, unsafe deserialization) and how to wire AI into a security review pipeline so the noise doesn't drown the signal.

Let's strip out the marketing copy first. When people say "AI for security review", they're usually describing one of three things, and they're not interchangeable.

The first is a **chat-style review**. You paste a function or a diff into a model and ask it to find security issues. This is what most engineers actually do day to day. It's cheap, it has zero infrastructure, and it has zero memory of your codebase. The model sees what you paste and nothing else.

The second is an **agent-style review** that has tools (file read, grep, sometimes shell) and a system prompt telling it to scan for a vulnerability class. Claude Code's security review, Gemini CLI Action, and GitHub Copilot Agent's security mode all fit here. The agent decides what to look at; the prompt decides what counts as a finding.

The third is a **hybrid pipeline**. A deterministic static analysis tool finds candidate locations, then an LLM is invoked on each candidate to triage. Semgrep's AI assistant works this way. So do the more recent academic frameworks like SAST-Genius. The LLM never sees the raw codebase; it sees a candidate finding plus surrounding context.

These three look similar from the outside and behave very differently in practice. Pure chat is high-noise, high-flexibility, no memory. Agent is medium-noise, scoped to what the agent chose to look at. Hybrid is low-noise because the SAST already did the heavy lifting, and the LLM is just being asked "is this actually exploitable?". When somebody says "we use AI for security review", find out which of the three they mean before you draw any conclusions about the result.

A static analyzer like CodeQL is doing taint analysis.
It builds a data-flow graph of your program, marks any input from a source (an HTTP query parameter, a request body, an environment variable) as tainted, and then traces that taint through assignments, function calls, and field accesses to see whether it reaches a sink (a SQL query string, an HTML template, a deserialization call). If a tainted value reaches a sink without passing through a sanitizer the tool knows about, that's a finding. It's syntactic. It can prove things; it can also miss anything that flows through an indirection it can't follow: a callback, a dynamic dispatch, a string built across files.

An LLM doesn't do that. It pattern-matches. When you paste in a function that takes `req.query.id` and concatenates it into a SQL string, the model has seen ten thousand variations of that pattern in its training set, including the labeled ones. It will tell you the same thing CodeQL would tell you, plus often why and how to fix it. But it has no formal data-flow graph; it's reasoning as if it does. That's why it catches more on the easy stuff (the patterns are saturated in training data) and why it makes things up on the hard stuff (it pattern-matches "this looks dangerous" without being able to prove the flow).

Keep that distinction in your head as we walk through the four vuln classes. The further a class drifts from "a recognizable syntactic shape near tainted input", the worse the LLM does.

The order we take them in matters (unsafe deserialization, then SQL injection, then XSS, then auth bugs): it runs roughly from most syntactic and pattern-shaped to most semantic and context-dependent. AI security review tracks that ordering closely.

Unsafe deserialization is the class AI does best on, because the dangerous functions are short, named, well-known, and there's no clever way to make them safe. Two cases dominate in practice.

The first is Python's `pickle` module. Calling `pickle.loads` on data you don't completely control is a remote-code-execution primitive.
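
To see why, here's a minimal and deliberately harmless sketch. The `Payload` class is hypothetical, and it asks pickle to call `len` where a real attacker would pick something like `os.system`.

```python
import pickle


class Payload:
    """Hypothetical attacker-crafted object, for illustration only."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call this callable with these args".
        # The callable here is the harmless builtin len; nothing in the
        # format stops an attacker from naming a dangerous one instead.
        return (len, ("attacker-chosen",))


blob = pickle.dumps(Payload())   # the bytes an attacker would send
result = pickle.loads(blob)      # the call runs during loading

print(type(result).__name__, result)
```

`result` comes back as an integer, not a `Payload`: the loader ran the call the bytes told it to run.
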
The pickle format includes opcodes that can construct arbitrary objects and call arbitrary callables during deserialization. That's not a bug in `pickle`. It's documented in the module's own warning at the top of the docs page. The fix is: don't do it. Use JSON if your data is JSON-shaped. Use a typed format like Protocol Buffers or MessagePack if you need richer structure. There's no version of "pickle but safe with untrusted data".

The second is Java's `ObjectInputStream`. Same idea: deserialization can instantiate arbitrary classes that have side effects in their `readObject` method. The 2015 Apache Commons Collections "gadget chain" attack turned this from a theoretical risk into a "we're patching production right now" risk. Java 9 (released in 2017) added JEP 290, which gives you `ObjectInputFilter`, a per-stream or per-JVM allowlist of classes permitted to deserialize. If you have to use Java serialization, you set the filter to the smallest possible class list and refuse everything else.

Here's what the bug looks like in both:

:::tabs

```python
# vulnerable_pickle.py
import pickle
from flask import Flask, request

app = Flask(__name__)


@app.route("/restore", methods=["POST"])
def restore():
    # Anything in the body becomes a live Python object.
    # An attacker who controls the body controls the process.
    state = pickle.loads(request.data)
    return {"restored": True}
```

```java
// VulnerableDeserialization.java
import java.io.ObjectInputStream;
import java.io.InputStream;

public class SessionRestorer {
    public Object restore(InputStream in) throws Exception {
        // No filter set. Any class on the classpath can be instantiated.
        // Library gadget chains turn this into RCE.
        ObjectInputStream ois = new ObjectInputStream(in);
        return ois.readObject();
    }
}
```

:::

An LLM, asked "review this for security issues", will catch both of these reliably. The string `pickle.loads` next to anything that resembles HTTP input is a saturated training signal. Same for `new ObjectInputStream(...).readObject()` without a filter.
Drop either snippet into any current frontier model and it will return a confident, correct finding with a fix.

Where it gets harder is the indirect version: a helper function called `loadState` that wraps `pickle.loads` three files away, called from a route handler that doesn't mention pickle at all. SAST tools follow that chain. LLMs follow it if everything is in the context window and they bother to. A chat-style review with only the route handler pasted in will miss it. An agent that can grep the codebase will probably catch it. This is where "which kind of AI review" matters more than "AI or not AI".

Tip: If you have a codebase with any Python or Java in it, run a one-off grep for `pickle.loads`, `pickle.load`, `marshal.loads`, `ObjectInputStream`, `XMLDecoder`, and `yaml.load` without `Loader=SafeLoader`. It's a five-minute audit that catches a remarkable number of accidents.

SQL injection is the textbook case for AI review. Every model has seen the pattern at saturation: tainted input + string concatenation + SQL execution. Drop in this Node code and any model will tell you what's wrong:

```js
// vulnerable.js
app.get("/user", async (req, res) => {
  const { id } = req.query;
  const rows = await db.query(`SELECT * FROM users WHERE id = ${id}`);
  res.json(rows);
});
```

Now make it slightly harder. Move the query into a helper and build the SQL with a template tag that looks parameterized but isn't:

```js
// looks-fine-but-isnt.js
const sql = (strings, ...values) =>
  strings.reduce((acc, s, i) => acc + s + (values[i] ?? ""), "");

async function getUser(id) {
  return db.query(sql`SELECT * FROM users WHERE id = ${id}`);
}
```

The `sql` tag here is decorative. It pastes the interpolated value straight into the query. It looks like a tagged template literal that does parameter binding, because that's the convention with libraries like `slonik` or `sql-template-strings`. A junior reviewer would skim past it. An LLM might miss it on a chat-style review too, because the shape looks like a safe library.
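
What separates a decorative tag from real binding is whether the value ever becomes part of the SQL text. Here's a minimal sketch of that difference in Python, using the standard-library `sqlite3` module; the table, the rows, and both function names are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)


def get_user_interpolated(user_id):
    # Same flaw as the decorative tag: the value is pasted into the SQL text.
    return conn.execute(f"SELECT name FROM users WHERE id = {user_id}").fetchall()


def get_user_bound(user_id):
    # Real binding: the SQL text is fixed and the value travels separately.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()


payload = "1 OR 1=1"  # what an attacker might send as ?id=
leaked = get_user_interpolated(payload)  # every row comes back
bound = get_user_bound(payload)          # no row has that literal id

print(len(leaked), len(bound))
```

Both helpers have the same call shape from the outside, which is exactly why a reviewer who never opens the helper can't tell them apart.
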
An agent-style review that follows the definition of `sql` catches the decorative tag; a hybrid pipeline catches it because SAST traces the data flow regardless of what the helper is called.

A few more cases where the LLM does worse than its average:

- Raw fragments in query builders: `` knex.raw(`${col} = ?`) `` is fine in form and dangerous if `col` is user-controlled.
- NoSQL operator payloads such as `{ $ne: null }`: different syntactic shape, much weaker training signal. LLM accuracy drops noticeably here.

The take is the same shape as deserialization: the simple case is excellent, the indirect case needs an agent or a hybrid, and the dynamic case (raw fragments, stored procs, NoSQL operators) is where you don't trust an LLM alone.

XSS is where AI review starts to slip noticeably. The class is bigger than "user input ends up on a page". There are at least four distinct sub-shapes (reflected, stored, DOM-based, and template-based), and the safety of any given output depends on which HTML context the value lands in. The same string can be safe in element text, dangerous in an attribute, and a code execution primitive in a