Flipping the eval on its head A new approach to cybersecurity evaluations proposes using language models as constant red-team oracles to empirically compare the attack surfaces of different software implementations, such as OpenSSL forks or formally verified runtimes. The method flips traditional evals by varying the codebase under test rather than the AI model, enabling leaderboards for empirical security properties. The author argues this is measurable and viable as token costs decrease. An eval is a product . Typically, its 1 x n or k x n where there are n samples and 1 or k different language models. This briefing will argue that we’d like to see k x n x m evals, or however many dimensions. This post is pitching an ambitious way to spend tokens on cyberhardening. If its not viable at current capabilities/costs, it may be viable next year or in six months. There are broadly three approaches to cyberhardening with secure program synthesis https://www.lesswrong.com/posts/8wtrLoDPyCfMLuHkt/how-to-solve-secure-program-synthesis or uplifted formal methods. It is very good and very exciting that Glasswing as well as companies like AISLE https://www.lesswrong.com/posts/7aJwgbMEiKq5egQbd/ai-found-12-of-12-openssl-zero-days-while-curl-cancelled-its , I think are doing the first thing. Are the other two things desirable? The point of the current post is to say that this is measurable . Typically, we vary the model to test different models on all the samples, where the samples might involve some scoring process. In the case of cyberhardening evals, the scoring process involves the codebase under study itself. If you can swap out any implementation of a spec/contract to use the LLM redteam to inspect and compare security properties, much like you would use a CPU to inspect and compare performance characteristics. The SPS team 1 at Apart’s AI Control hackathon This is cool because tokens are cheap, and soon new OCI runtimes will be cheap. The takeaway from our super preliminary, low lift, not peer reviewed data is to use gVisor for now . But soon, when Alok prompts “write me an OCI runtime in Lean and prove it correct”, it’d be great to stress test it instead of assuming “i have only proven it correct” 2 actually counts in real life . The Lean FRO shipped zip. They FFI to the reference implementation and prove their new implementation agrees up to the FFI. We don’t have an arena, and language model capabilities that could find vulns in zip are not yet widely available. But overtime, you want to be able to check a dashboard detailing the relative propensity toward 0-days and vulns in the reference implementation vs the Lean FRO’s implementation. Soon 3 , people building software will be able to choose between different OpenSSLs. Some of these will be forks of legacy with different governance policies about synthetic patches. Some of these will have a proof stack like How do we know which of these are best? Besides the traditional bench/eval for performance characteristics like time and space usage, we can use language models as blackhat oracles, where their redteam capabilities are held constant and the things they’re trying to break are what varies . Ideally you’d have a leaderboard, detailing the empirical attack surface of each of these contestants. Of course, openssl is just an example. You could say the same thing about curl, docker as above , postgres, linux, whatever etc. Evals, especially those in cybersecurity and program synthesis, typically hold reference implementations of primitives constant and vary the provider API key. In other words, the artifact under evaluation is the language model. Please fund these arenas, in which we turn that on its head and evaluate the primitives that the language models are interacting with. If you have 17 different versions of OpenSSL to put in your build or distro, some of them forks and some of them rerolled, identifying the one that has the least attack surface is an empirical question not a question of QA toolstack ideology . The well-known Knuth quote https://www.brainyquote.com/quotes/donald knuth 181622 : Beware of bugs in the above code; I have only proved it correct, not tried it . 80% forecast conditional on tokens being cheap