Flipping the eval on its head

wpnews.pro

cd /news/ai-safety/flipping-the-eval-on-its-head · home › topics › ai-safety › article

[ARTICLE · art-41855] src=lesswrong.com ↗ pub=2026-06-27T13:34Z topic=ai-safety verified=true sentiment=↑ positive

Flipping the eval on its head

A new approach to cybersecurity evaluations proposes using language models as constant red-team oracles to empirically compare the attack surfaces of different software implementations, such as OpenSSL forks or formally verified runtimes. The method flips traditional evals by varying the codebase under test rather than the AI model, enabling leaderboards for empirical security properties. The author argues this is measurable and viable as token costs decrease.

read3 min views1 publishedJun 27, 2026

An eval is a product. Typically, its 1 x n or k x n where there are n samples and 1 or k different language models. This briefing will argue that we’d like to see k x n x m evals, or however many dimensions.

This post is pitching an ambitious way to spend tokens on cyberhardening. If its not viable at current capabilities/costs, it may be viable next year or in six months.

There are broadly three approaches to cyberhardening with secure program synthesis or uplifted formal methods.

It is very good and very exciting that Glasswing (as well as companies like AISLE, I think) are doing the first thing. Are the other two things desirable? The point of the current post is to say that this is measurable.

Typically, we vary the model to test different models on all the samples, where the samples might involve some scoring process. In the case of cyberhardening evals, the scoring process involves the codebase under study itself. If you can swap out any implementation of a spec/contract to use the LLM redteam to inspect and compare security properties, much like you would use a CPU to inspect and compare performance characteristics.

The SPS team [1] at Apart’s AI Control hackathon

This is cool because tokens are cheap, and soon new OCI runtimes will be cheap.

The takeaway from our (super preliminary, low lift, not peer reviewed) data is to use gVisor (for now). But soon, when Alok prompts “write me an OCI runtime in Lean and prove it correct”, it’d be great to stress test it (instead of assuming “i have only proven it correct” [2] actually counts in real life).

The Lean FRO shipped zip. They FFI to the reference implementation and prove their new implementation agrees up to the FFI.

We don’t have an arena, and language model capabilities that could find vulns in zip are not yet widely available. But overtime, you want to be able to check a dashboard detailing the relative propensity toward 0-days and vulns in the reference implementation vs the Lean FRO’s implementation.

Soon [3], people building software will be able to choose between different OpenSSLs. Some of these will be forks of legacy with different governance policies about synthetic patches. Some of these will have a proof stack like

How do we know which of these are best? Besides the traditional bench/eval for performance characteristics like time and space usage, we can use language models as blackhat oracles, where their redteam capabilities are held constant and the things they’re trying to break are what varies. Ideally you’d have a leaderboard, detailing the empirical attack surface of each of these contestants.

Of course, openssl is just an example. You could say the same thing about curl, docker (as above), postgres, linux, whatever etc.

Evals, especially those in cybersecurity and program synthesis, typically hold reference implementations of primitives constant and vary the provider API key. In other words, the artifact under evaluation is the language model. Please fund these arenas, in which we turn that on its head and evaluate the primitives that the language models are interacting with. If you have 17 different versions of OpenSSL to put in your build or distro, some of them forks and some of them rerolled, identifying the one that has the least attack surface is an empirical question (not a question of QA toolstack ideology).

The well-known Knuth quote: Beware of bugs in the above code; I have only proved it correct, not tried it.

80% forecast conditional on tokens being cheap

source & further reading

lesswrong.com — original article Deployment Awareness Matters More Than Evaluation Awareness Just a Wrapper? How Much Do Scaffolds Matter? What did "scheming" and "mech interp" mean pre-2023?

~/api · this article 200

$curl api.wpnews.pro/v1/news/flipping-the-eval-on-its…

Read original on lesswrong.com → www.lesswrong.com/posts/RK2wJFhmZHXvmzjBE/flippi…

mentioned entities

Glasswing

AISLE

Apart

gVisor

Lean FRO

OpenSSL

curl

Docker

metadata

slugflipping-the-eval-on-its-head

topic#ai-safety

secondary4 topics

sentimentpositive

canonicallesswrong.com

navigation

← prevWhy is Apple asking me to pay mo…

── more in #ai-safety 4 stories · sorted by recency

voodootikigod.com · 27 Jun · #ai-safety

Meet the ADLC

flyingpenguin.com · 25 Jun · #ai-safety

Still No Evidence Mythos Better at Security Than Self-hosted LLMs

devclubhouse.com · 27 Jun · #ai-safety

How to Stop AI Agents From Committing Your Secrets

searchenginejournal.com · 27 Jun · #ai-safety

Google’s Spam Update Now Reaches AI Answers. Enforcement Is Hard

── more on @glasswing 3 stories trending now

wpnews · 25 May · #artificial-intelligence

Maia-3: free and open source

wpnews · 28 May · #ai-startups

The Niche SaaS Opportunity Map 2026: Highly Demanded Subscribed Categories Beyond Mainstream

wpnews · 1 Nov · #developer-tools

Custom Zig Test Runner, better ouput, timing display, and support for special "tests:beforeAll" and "tests:afterAll" tests

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required