The unreasonable effectiveness of LLMs for auditing Rust code


This is a mirror of my article which Medium apparently refuses to show to some people.

As a lead of the Rust Secure Code Working Group, I got free access to GPT-5.5 via the Codex for Open Source program. Since then I’ve found and reported dozens of issues of varying severity in widely used Rust crates.

Separately, the Rust Foundation security initiative got access to Mythos via Project Glasswing, and their report should also be coming soon. I’ve coordinated with them so that our audit targets would not overlap.

While I haven’t found any truly devastating vulnerabilities, I am very impressed with GPT-5.5 for auditing Rust source code, and I’ll absolutely be adding it to my toolkit alongside fuzzers.

*Note: All opinions expressed in this article are my own, not those of any organizations I am a part of.*

Since maintainers could already be dealing with a large number of vulnerability reports, it is imperative that I do not submit any invalid vulnerability reports and waste maintainers’ already limited time.

So I’ve decided to look for an unambiguously problematic class of vulnerability that’s easy to verify: memory safety bugs.

But isn’t Rust supposed to be memory-safe already? Yes, with an asterisk.

Most code you’d write in Rust is memory-safe, but at some point you have to talk to the operating system or a C library or implement things like intrusive data structures, all of which involves raw pointers.

Most languages implement these parts in C (e.g. CPython), provide unsafe interoperability with C, and have you write C for your own unsafe code, while Rust has its own unsafe subset where you can muck about with raw pointers.

This puts Rust’s memory safety properties on par with Python’s, ahead of Go which violates safety on data races, and behind browser-sandboxed JavaScript (but you can match that by compiling Rust to WebAssembly).

As for the amount of safe vs unsafe code in the wild, my own scan from 2020 showed that 95% of the code on crates.io is memory safe. The authors of the 2020 paper “How do programmers use unsafe Rust?” independently arrived at the 95% number, although they didn’t put it into the final paper because they weren’t confident in their methodology for it. My own scan is also rather crude, but two completely different measurements arriving at the same number is encouraging.

In practice, the reduction in the memory safety vulnerability rate compared to C++ is about 1000x, which is more than you’d expect based on the above figures.

Rust has a tool called miri that runs Rust code in an interpreter and tells you precisely whether it committed any crimes against the language rules or not. Safe Rust cannot violate them by construction, but unsafe Rust can, and a validator that immediately tells you whether you messed up or not, instead of having to parse dozens of pages of dense prose, is indispensable.

It also completely eliminates false positives from LLM vulnerability findings.

If the LLM can construct a unit test that causes miri to fail, I can report that to the maintainers and be certain that it’s a bug. I don’t ever have to argue if it’s a real issue or not, either — the proof is right there. And if miri says the execution is completely fine, then the LLM false positive gets discarded before anyone even sees it. To the best of my knowledge, no other language has a practical tool with this level of precision. Sanitizers are very nice, but can’t catch everything, so verifying against them does not prove absence of issues.

Sadly miri is not without limitations — execution with extra checks is slow, calling into C is not supported, and syscall support is limited. When miri is not applicable, you can fall back on the sanitizers and get some filtering.

I also had to switch miri to the newer Tree Borrows aliasing model (as opposed to the older Stacked Borrows) to avoid false positives, but fortunately that’s just one flag, -Zmiri-tree-borrows.
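To make the workflow concrete, here is a minimal sketch of the kind of proof of concept I mean. Everything in it is made up for illustration: `Flag` and `flag_from_byte` are hypothetical and not taken from any of the crates I audited. The function is safe to call but transmutes an arbitrary byte into an enum, so a caller can produce an invalid value without writing any `unsafe`. The test is gated on `cfg(miri)` in this sketch so that a plain `cargo test` never executes the undefined behavior natively.

```rust
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Flag {
    Off = 0,
    On = 1,
}

/// Hypothetical unsound API: callable from safe code, yet undefined
/// behavior for any input other than 0 or 1.
pub fn flag_from_byte(b: u8) -> Flag {
    // BUG (deliberate, for illustration): `b` is never validated.
    unsafe { std::mem::transmute::<u8, Flag>(b) }
}

/// Proof of concept. Only compiled when running under Miri, which is
/// expected to stop here and report the construction of an invalid enum value.
#[cfg(miri)]
#[test]
fn poc_invalid_flag() {
    let _ = flag_from_byte(2);
}
```

Assuming a nightly toolchain with the miri component installed, this would be run with `MIRIFLAGS="-Zmiri-tree-borrows" cargo +nightly miri test`. If miri reports undefined behavior, the finding is real; if it runs clean, the finding gets discarded.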

My setup was very basic: just Codex and a prompt, with GPT-5.5 set to xhigh reasoning effort. It is important for the model to be able to write and run unit tests to try and trigger the issue under miri, so I consider something like Codex essential.

It would be interesting to try a more elaborate harness like metis, but even this basic setup was enough to discover interesting bugs.

The most serious issue I’ve found is an out-of-bounds write in the jxl-grid crate. It is a part of jxl-oxide, a JPEG XL decoder in Rust (not to be confused with jxl-rs, which Firefox and Chromium are adopting for JPEG XL decoding; that one came up clean in my audit).

I had already fuzzed this crate earlier, but the fuzzer didn’t catch this issue because it only happens on 32-bit platforms and requires very large image dimensions. I wasn’t running the fuzzer on 32-bit, and the fuzzer is limited to small image dimensions to avoid exhausting my computer’s RAM, so it never had a chance to trigger this condition at runtime.

The initial demonstrator showed this as a jxl-grid issue, but it was not clear if it’s a theoretical problem or if it can be triggered by decoding a crafted image. GPT-5.5 helped analyze that too, and it turned out to be reachable. This was very valuable information to correctly prioritize the bug.
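To illustrate why a bug can hide behind the combination of a 32-bit target and huge dimensions, here is one plausible shape of such a problem. This is an assumption on my part for the sake of the example, not the actual jxl-grid code: a buffer length computed as width times height silently wraps around. `u32` stands in for a 32-bit `usize` so the sketch behaves the same on any host, and both function names are hypothetical.

```rust
/// Buffer length for a `width` x `height` grid, computed the way a
/// release build on a 32-bit target would: the product wraps on overflow.
/// (`u32` is a stand-in for a 32-bit `usize`.)
pub fn grid_len_wrapping(width: u32, height: u32) -> u32 {
    width.wrapping_mul(height)
}

/// The same computation with the overflow surfaced as `None`, so the
/// caller can reject the image instead of allocating too little memory.
pub fn grid_len_checked(width: u32, height: u32) -> Option<u32> {
    width.checked_mul(height)
}
```

For a 65,536 by 65,537 grid the true element count is 2^32 + 65,536, which wraps to just 65,536 in 32 bits. Any unchecked indexing that trusts the original dimensions would then write far past the end of the allocation. Neither a 64-bit fuzzing run nor a fuzzer restricted to small images ever reaches that state.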

This still isn’t that big a deal in practice because old 32-bit devices and very recent and computationally intensive image formats rarely meet, but it does showcase the capability of the tool quite well.

Here are some more samples of bugs GPT-5.5 has discovered, showing the breadth of the kinds of issues it has found:

- Use-after-free, data races and panic safety issues in intrusive-collections
- [Out-of-bounds reads](https://github.com/rkyv/rkyv/issues/670) on deserializing crafted archives in rkyv
- [Construction of invalid enum values](https://github.com/unicode-org/icu4x/pull/7940) in zerovec
- [Data races for types with interior mutability](https://github.com/djc/hashlink/issues/42) in hashlink
- [Soundness issue in Gecko FFI](https://github.com/mozilla/thin-vec/issues/86) in thin_vec (crossing languages!)
- [Out-of-bounds reads](https://github.com/jaredforth/webp/pull/51) in webp and [another one](https://github.com/imazen/webpx/commit/373015705ec84460ddc8722550805520478a2d57) in webpx (C library wrappers)
- Multiple &mut pointing to the same memory in nihav-core
- [Aliasing violation](https://github.com/andylokandy/arraydeque/issues/34) in arraydeque
- [Incorrect alignment handling](https://github.com/rust-av/v_frame/pull/74) in v_frame
- An interesting type confusion issue that’s not yet public

This demonstrates not just understanding of generic issues like out-of-bounds accesses, but also the ability to reason about Rust-specific concepts such as panic safety, aliasing, and the Send/Sync traits that enforce thread safety.
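As a rough illustration of the Send/Sync class of issue, consider a container that holds its value through a raw pointer. Raw pointers opt the type out of the automatic `Send` and `Sync` implementations, so the author has to write them by hand, and the bounds on those impls are where things go wrong. `Slot` below is a hypothetical type invented for this sketch, not code from any of the crates listed above.

```rust
use std::thread;

/// Hypothetical container that owns a heap value through a raw pointer.
pub struct Slot<T> {
    ptr: *mut T,
}

impl<T> Slot<T> {
    pub fn new(value: T) -> Self {
        Slot { ptr: Box::into_raw(Box::new(value)) }
    }

    pub fn get(&self) -> &T {
        // SAFETY: `ptr` came from `Box::into_raw` in `new` and is only
        // freed in `drop`, so it is valid for as long as `self` is.
        unsafe { &*self.ptr }
    }
}

impl<T> Drop for Slot<T> {
    fn drop(&mut self) {
        // SAFETY: this is the pointer produced in `new`, freed exactly once.
        unsafe { drop(Box::from_raw(self.ptr)) }
    }
}

// An unsound version would be `unsafe impl<T> Sync for Slot<T> {}`:
// it would let `&Slot<Cell<u32>>` cross threads and race on the `Cell`.
// The bounds below forward the requirement to `T` instead.
unsafe impl<T: Send> Send for Slot<T> {}
unsafe impl<T: Sync> Sync for Slot<T> {}

/// Compiles only because `Slot<u32>` is `Sync`.
pub fn read_from_other_thread(slot: &Slot<u32>) -> u32 {
    thread::scope(|s| s.spawn(|| *slot.get()).join().unwrap())
}
```

With the bounds in place, trying to share a `Slot<Cell<u32>>` across threads is rejected at compile time. Nothing in the code looks wrong at the point of use when a bound is missing, which is what makes this class of bug easy to overlook in review.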

I also got some results out of Claude before I got GPT-5.5 access:

- [Use-after-free](https://github.com/servo/rust-smallvec/pull/407) in one function on a zero-capacity SmallVec (Opus 4.6)
- [Multiple out-of-bounds reads](https://rustsec.org/packages/imageproc.html) in imageproc (Opus 4.7)

I haven’t used Claude enough to be able to compare the models. I also cannot compare GPT-5.5 to Mythos, as interesting as that would be, because I deliberately picked different targets to avoid duplicate vulnerability reports putting extra load on maintainers.

Once the issue is identified and explained, the model usually can also fix it autonomously. I avoid “fix the bugs you found” style prompts and instead discuss possible solutions with the model first, then have the model implement one of them.

Submitting a possible fix alongside the vulnerability report puts less pressure on the maintainers. If you look up my pull requests, the first commit usually adds proof-of-concept tests that cause miri to complain, and the subsequent commit fixes the issues and turns the proof-of-concept snippets into regression tests.
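Here is a sketch of what the second commit in such a pull request could look like, using a made-up helper rather than code from any real crate. Assume a hypothetical `pair_mut` that originally built both references from a raw pointer without checking that the indices differ, so `pair_mut(v, 1, 1)` handed out two `&mut` to the same element. The fix below drops the raw pointers in favor of `split_at_mut`, and the former proof of concept stays on as a regression test.

```rust
/// Returns mutable references to two distinct elements, or `None` if the
/// indices are equal or out of bounds. Hypothetical helper for illustration.
pub fn pair_mut<T>(slice: &mut [T], i: usize, j: usize) -> Option<(&mut T, &mut T)> {
    if i == j || i >= slice.len() || j >= slice.len() {
        return None;
    }
    if i < j {
        // `left` covers ..j and `right` starts at j, so the halves are disjoint.
        let (left, right) = slice.split_at_mut(j);
        Some((&mut left[i], &mut right[0]))
    } else {
        // Here j < i: `right` starts at i and `left` covers ..i.
        let (left, right) = slice.split_at_mut(i);
        Some((&mut right[0], &mut left[j]))
    }
}

#[cfg(test)]
mod tests {
    use super::pair_mut;

    /// Formerly the proof of concept: equal indices must not alias.
    #[test]
    fn equal_indices_are_rejected() {
        let mut v = [10, 20, 30];
        assert!(pair_mut(&mut v, 1, 1).is_none());
    }

    #[test]
    fn distinct_indices_can_be_swapped() {
        let mut v = [10, 20, 30];
        let (a, b) = pair_mut(&mut v, 2, 0).unwrap();
        std::mem::swap(a, b);
        assert_eq!(v, [30, 20, 10]);
    }
}
```

Keeping the test means the maintainers can run it under miri on every CI run, and the fix happens to remove an `unsafe` block entirely.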

GPT-5.5 has also assisted me in locating the version where the bug was introduced, which is essential for security advisories.

In my experience Rust code is a lot easier to audit than C code. In C, if I look at a line like data[a + b], I have to trace through the entire codebase and find all the possible values that a and b can take, just to validate this one line and make sure it doesn't have out-of-bounds accesses.

Rust, even unsafe Rust, still relies on local reasoning: if I see unsafe { data.get_unchecked(a + b) }, then the function it's in must either validate a and b to make sure their addition is in-bounds, or be itself unsafe to call. Either way, there is a clear point where verification must happen - and if it's not there, it's a bug.

I don’t have to chase the data flow through the entire JPEG XL decoder by hand, and neither does an LLM. This reduces the complexity of auditing the code from combinatorial (all possible combinations of call trees) to linear (each function in isolation).
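The two acceptable shapes for a function containing that `get_unchecked` call can be sketched as follows. The function names are hypothetical; the point is that in both variants an auditor only needs to read the function itself to know where the obligation lies.

```rust
/// Variant 1: a safe function that validates its inputs before the
/// unchecked access. Auditing it requires reading only this function.
pub fn get_sum(data: &[u8], a: usize, b: usize) -> Option<u8> {
    // The addition itself can overflow, so it has to be checked too.
    let idx = a.checked_add(b)?;
    if idx < data.len() {
        // SAFETY: `idx < data.len()` was checked on the line above.
        Some(unsafe { *data.get_unchecked(idx) })
    } else {
        None
    }
}

/// Variant 2: the check is pushed to the caller, so the function must be
/// marked `unsafe` and document its contract.
///
/// # Safety
/// `a + b` must not overflow and must be less than `data.len()`.
pub unsafe fn get_sum_unchecked(data: &[u8], a: usize, b: usize) -> u8 {
    // SAFETY: guaranteed by the caller, per the contract above.
    unsafe { *data.get_unchecked(a + b) }
}
```

A third variant, a safe function that performs the unchecked access without any validation, is the bug. Spotting it takes no knowledge of the callers.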

In that light, it’s not terribly surprising that LLMs are so good at auditing Rust code. And it’s also not terribly surprising that I haven’t found any devastating vulnerabilities after all.

Just like fuzzers before them, LLMs surface numerous bugs that weren’t economical to discover previously. But at the end of the day, no heuristic tool can prove the absence of vulnerabilities.

For example, GPT-5.5 alone didn’t discover several bugs that a combination of a simpler LLM with a fuzzer did. But we don’t have to rely on heuristics. Rust without unsafe does guarantee the absence of memory safety bugs.

So I’ll keep shrinking the unsafe surface where I can, and I'm glad to have these tools for when I can't.

source & further reading

gist.github.com: original article
