{"slug": "structural-proxies", "title": "Structural Proxies", "summary": "A researcher proposes 'structural proxies' as a method for AGI safety research, using current AI problems that share structural dynamics with future superhuman AI issues, such as adversarial attacks as a proxy for value generalization and faithfulness as a proxy for ELK. The approach aims to study naturally occurring AI failures to better understand and mitigate risks from advanced systems.", "body_md": "Lately I've been thinking a lot about what work would help with actually winning and getting to good worlds. In the spirit of that I decided to venture outside my normal wheelhouse and spend some time reflecting on what technical research could make me more confident about powerful AIs being safe.\n\nAGI safety research is tricky partly because we don’t actually have access to the thing we want to study, i.e. superhuman AI. Much of the work we do now is basically trying to lay (potentially irrelevant) foundations for the period when we actually know what we’re up against, and at that point, a lot of the work might be done by AIs.\n\nYou can group current approaches by how they try to sidestep this access problem:[[1]](https://www.lesswrong.com/feed.xml#fnhuyhgdgc85)\n\nHere I want to gesture at a different angle of attack, which I’m going to call **structural proxies**. The basic idea is that you look for current naturally-appearing problems with AI that share a structure with future problems — in other words, structural proxies. In particular, you want to have some reason to believe that the current problem is being generated by the same types of dynamics that will produce the future problem, even if the manifestations look quite different. You then use the proxies to try to understand the future problems.\n\nThe closest parallel is model organisms — it's plausible to me that this is just a special subset. 
But a lot of model organism work appears to start from the specific phenomenon (reward hacking, alignment faking, collusion...) and then deliberately construct an environment that demonstrates it. With structural proxies, you start from the process that produces the phenomenon (misgeneralisation, latent knowledge, optimisation for proxies...) and then look for other examples of that process in the wild.\n\nBriefly, I think this is a useful angle because:\n\nI’ll run through two examples now — adversarial attacks as a proxy for value generalisation, and faithfulness as a proxy for ELK — then give some broader thoughts.\n\nConceptually, what an adversarial attack does is exploit the gap between what we want the model to learn and what it’s actually learned — in other words, exploit its misgeneralisation. In the case of image models, producing adversarial inputs that work across a variety of models requires you to exploit the general ways that image models learn differently to humans.\n\nInterestingly, these differences appear to be somewhat natural and intuitive byproducts of the fact that image models train on static images whereas humans see things continually — to the point that some adversarial images [also “fool” humans who only look at them for a fraction of a second](https://arxiv.org/abs/1802.08195). I think this is spiritually similar to the way that feature visualisation on image models by default gives you a horrid static mess, but starts to look recognisable once you regularise for an input that’s [robust to small translations and other perturbations](https://distill.pub/2017/feature-visualization/).\n\nLanguage models are a bit of a different ballgame, because the input space isn’t quite so densely high-dimensional — for an image model, a slight perturbation means an almost imperceptible pixel shift, whereas for a language model it probably means an embedding that doesn’t correspond to any tokens. 
Nonetheless, I think the shape of the argument is similar: Jailbreaking shouldn’t be viewed as a strange epicycle on otherwise-sensible models, but rather as a sign that very fundamentally language models are generalising differently to the way we’d want them to — in other words, [they are misaligned](https://www.gleech.org/jailbreak).\n\nAnd indeed, it seems like by [greedily searching for token substitutions](https://arxiv.org/abs/2307.15043) one can do something similar to tweaking pixels on images — producing model-agnostic suffixes for text prompts that force certain behaviours. The good news is that, much like the image model case, this is much harder if you [regularise the inputs into being somewhat comprehensible](https://arxiv.org/abs/2506.15735), to the point that sometimes the shape of the attack becomes human-understandable.\n\nTaking a step back: it seems like the standard orientation to adversarial attacks is that they’re mostly a misuse problem, where people are using incomprehensible blobs of noise to hijack otherwise benign models — models that have been aligned against a really diverse range of examples and validated on a really diverse range of examples. In other words, one might think that adversarial attacks are shallow, and the actual deep thing is constitutional learning and eval performance.\n\nBut I think there’s a reasonable case to be made that it’s the other way around — that adversarial fragility is the deep signal, and eval performance is a shallow layer on top — and that blocking jailbreaks on language models and exploits on image models is hard for the same structural reason that alignment is hard.\n\nThe million dollar question here [3] is what the limit of this process looks like. For example, does there exist a training process which gives you image models that classify things exactly the way a human would, including in worst-case adversarial inputs? If so, I think that would make me a fair bit more optimistic about alignment. 
I expect that much of what you learned in building such a system would be pretty generalisable.\n\nThat’s the structural proxy angle: taking a known, natural issue (adversarial robustness) and noticing the ways it’s structurally similar to a big future problem (value generalisation), and using the proxy to get purchase on the future problem. Notably, lots of traditional adversarial robustness work like layering smaller classifiers on top is not at all useful for pursuing the proxy angle; on the other hand, plenty of adversarial robustness work that’s not super efficient for actually preventing misuse suddenly looks really appealing from this angle, like trying to find conditions that make optimised jailbreaks more naturally interpretable, or trying to train models in a way that makes optimised inputs less like static.\n\nThe problem of [Eliciting Latent Knowledge](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?tab=t.0) (ELK) was ARC’s big attempt to give a formal, tractable variant of the alignment problem: if you have an AI that knows a bunch of stuff, can you train a system that will report that knowledge? Solving this gets you most of the way to solving alignment because you can just ask your misaligned system questions like “are you misaligned” and “is this going to go wrong” and actually elicit all the important latent details.\n\nAnd the basic challenge is that it’s really hard to be confident you’ll actually get a “direct translator” that expresses the latent knowledge in the AI, instead of a “human simulator” that figures out what you’d expect to hear and just tells you that. For example, no question/answer pairs from you can privilege “tell us the truth” as a strategy over “tell us what we believe the truth is”, especially once the model is superhuman. 
Worse, it’s possible that figuring out and extrapolating the most sophisticated world model humans might develop could be much easier than extrapolating over the one true model that ASI will learn.[[4]](https://www.lesswrong.com/feed.xml#fnrrevvnk99t9)\n\nThis was originally posed as a hypothetical problem, and most of the work on it has been fairly abstract — ARC’s current approach is mostly about finding algorithms for predicting the behaviours of different networks in a principled way that’s more efficient than just sampling them. [5] But I think one can learn quite a lot about it from looking at current systems — indeed, I think we’ve already learned a fair bit.\n\nFor example, we know:\n\nPeople already care a lot about whether AIs are honest. This has the advantage of being easily testable whenever you have some ground truth to hand. But for ELK I think the better proxy is faithfulness — not whether the output is correct, but whether the process generating it was coupled to the right internal structure.\n\nHere it’s useful to consider what work would be considered out of scope. One good example is consistency. Lots of people are into making models consistent across conversations — to reduce [sycophancy and jailbreaks](https://arxiv.org/abs/2510.27062), make them better at [conceptual reasoning](https://www.lesswrong.com/posts/fi3dtMrXJiWEtmJNa/a-preliminary-experiment-regarding-consistency-as-a-measure), or even [find out what they really think](https://x.com/jiaxinwen22/status/2062773614920040818). If you put on your ELK hat, though, it’s clear that this is only half the battle. Consistency makes AIs pick *one* world model, but does it make them pick the *right* one? (Spoiler alert: [maybe not](https://x.com/DavidDAfrica/status/2062205694288540012).)\n\nNow, solving ELK for GPT3 is pretty different to solving it for arbitrary superintelligences. My point is that you could still *try* to solve it, or to figure out what bits are hard. 
Do AIs even *have* a true model of the world, from which they model what humans expect to hear? How does this connect to the fact that current AIs are a bit more like [layers of prediction](https://www.lesswrong.com/posts/zuXo9imNKYspu9HGv/a-three-layer-model-of-llm-psychology) with some ability for the character to tap the raw capacities? [6] Can we find the causal channels when AIs\n\nSo, recapping the angle: take a known, natural problem (unfaithful reporting), notice the ways it’s structurally similar to a future problem (ELK), and use the proxy to get purchase.\n\nMy big ambitious hope is that this kind of work could actually make us more or less confident about the difficulty of aligning very powerful AIs. I worry that a lot of safety work bounces off the hard parts of the problem, and I like the idea of actually grabbing at whatever hard parts you can find. It would be very useful to have compelling evidence that we’re not yet equipped for the hard challenges, by providing smaller but not much easier instantiations. But there's also a lot of upside: if we *could* solve jailbreaking in full generality, or we *could* figure out exactly how faithfulness works in current models, that would be big. I have a hunch that when you work on simplified versions of the problem, sometimes you risk simplifying away not just the things that make them difficult but also the things that make them solvable.\n\nMore mundanely, I just think it's good to have as many angles of attack as possible. Especially as we enter the era of automated research, it's good to have a plan for what we point the auto-engineers at. It's also good to have approaches with different strengths and weaknesses so that the overall portfolio is a bit more robust.\n\nThis work is notably predicated on the assumption that the mechanisms producing today's problems are somewhat continuous with the mechanisms that will produce tomorrow's. 
It is possible that actually things just won’t scale, or that crunch time will involve very different paradigms. It is also possible (likely, even!) that more powerful AIs will exhibit qualitatively new challenges that don’t have easy analogues, like advanced forms of situational awareness. This cluster of work, by construction, can only attack problems with some degree of continuity, so it’s not as much of a full package as something like AI control.\n\nStill, that's only a limit on how much optimism it can give us: there's still plenty of pessimism to try to squeeze out. Put another way, I think we should plan for problems scaling better than solutions. In particular, if we are headed to a world of abundant automated engineering, it could be useful to find more conceptual ways of red teaming how our existing approaches might break at a slightly higher level of abstraction. And if our plans can survive all that, I'll feel pretty good.\n\n*Thanks to JK, CG, NN, VK, JB, and DA for many comments and much discussion.*\n\nIn practice these aren’t perfectly separable: lots of current control work is based on making model organisms, model organism work relies on prosaic techniques, etc.\n\nThis point is heavily cribbed from Stanislav Fort’s excellent essay on [adversarial attacks as a baby version of alignment](https://www.lesswrong.com/posts/oPnFzfZtaoWrqTP4H/solving-adversarial-attacks-in-computer-vision-as-a-baby)\n\nNot adjusted for funding inflation\n\nIf you think ELK sounds solvable, I encourage you to take a crack at it; I think it’s good practice to ponder, and also if you do end up solving it we can all finally go take a holiday\n\nThere has also been some work explicitly on ELK in current models; see e.g. 
[this attempt to make model organisms](https://arxiv.org/abs/2312.01037)", "url": "https://wpnews.pro/news/structural-proxies", "canonical_source": "https://www.lesswrong.com/posts/mKiBhFJs3MoksaBMs/structural-proxies", "published_at": "2026-06-30 12:38:45+00:00", "updated_at": "2026-06-30 13:01:10.634968+00:00", "lang": "en", "topics": ["ai-safety", "artificial-intelligence", "machine-learning", "large-language-models", "computer-vision"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/structural-proxies", "markdown": "https://wpnews.pro/news/structural-proxies.md", "text": "https://wpnews.pro/news/structural-proxies.txt", "jsonld": "https://wpnews.pro/news/structural-proxies.jsonld"}}