cd /news/ai-safety/flipping-the-eval-on-its-head · home topics ai-safety article
[ARTICLE · art-41855] src=lesswrong.com ↗ pub= topic=ai-safety verified=true sentiment=↑ positive

Flipping the eval on its head

A new approach to cybersecurity evaluations proposes using language models as constant red-team oracles to empirically compare the attack surfaces of different software implementations, such as OpenSSL forks or formally verified runtimes. The method flips traditional evals by varying the codebase under test rather than the AI model, enabling leaderboards for empirical security properties. The author argues this is measurable and viable as token costs decrease.

read3 min views1 publishedJun 27, 2026

An eval is a product. Typically, its 1 x n or k x n where there are n samples and 1 or k different language models. This briefing will argue that we’d like to see k x n x m evals, or however many dimensions.

This post is pitching an ambitious way to spend tokens on cyberhardening. If its not viable at current capabilities/costs, it may be viable next year or in six months.

There are broadly three approaches to cyberhardening with secure program synthesis or uplifted formal methods.

It is very good and very exciting that Glasswing (as well as companies like AISLE, I think) are doing the first thing. Are the other two things desirable? The point of the current post is to say that this is measurable.

Typically, we vary the model to test different models on all the samples, where the samples might involve some scoring process. In the case of cyberhardening evals, the scoring process involves the codebase under study itself. If you can swap out any implementation of a spec/contract to use the LLM redteam to inspect and compare security properties, much like you would use a CPU to inspect and compare performance characteristics.

The SPS team [1] at Apart’s AI Control hackathon

This is cool because tokens are cheap, and soon new OCI runtimes will be cheap.

The takeaway from our (super preliminary, low lift, not peer reviewed) data is to use gVisor (for now). But soon, when Alok prompts “write me an OCI runtime in Lean and prove it correct”, it’d be great to stress test it (instead of assuming “i have only proven it correct” [2] actually counts in real life).

The Lean FRO shipped zip. They FFI to the reference implementation and prove their new implementation agrees up to the FFI.

We don’t have an arena, and language model capabilities that could find vulns in zip are not yet widely available. But overtime, you want to be able to check a dashboard detailing the relative propensity toward 0-days and vulns in the reference implementation vs the Lean FRO’s implementation.

Soon [3], people building software will be able to choose between different OpenSSLs. Some of these will be forks of legacy with different governance policies about synthetic patches. Some of these will have a proof stack like

How do we know which of these are best? Besides the traditional bench/eval for performance characteristics like time and space usage, we can use language models as blackhat oracles, where their redteam capabilities are held constant and the things they’re trying to break are what varies. Ideally you’d have a leaderboard, detailing the empirical attack surface of each of these contestants.

Of course, openssl is just an example. You could say the same thing about curl, docker (as above), postgres, linux, whatever etc.

Evals, especially those in cybersecurity and program synthesis, typically hold reference implementations of primitives constant and vary the provider API key. In other words, the artifact under evaluation is the language model. Please fund these arenas, in which we turn that on its head and evaluate the primitives that the language models are interacting with. If you have 17 different versions of OpenSSL to put in your build or distro, some of them forks and some of them rerolled, identifying the one that has the least attack surface is an empirical question (not a question of QA toolstack ideology).

The well-known Knuth quote: Beware of bugs in the above code; I have only proved it correct, not tried it.

80% forecast conditional on tokens being cheap

── more in #ai-safety 4 stories · sorted by recency
── more on @glasswing 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/flipping-the-eval-on…] indexed:0 read:3min 2026-06-27 ·