{"slug": "some-thoughts-on-ai-safety", "title": "Some Thoughts on AI Safety", "summary": "A cautious optimist argues that AI safety, interpretability, bias, and alignment are as critical as raw capability, warning that the same technology enabling vaccine research could also design pathogens. The author emphasizes the need to understand and steer AI systems before they surpass human control, highlighting interpretability as a key first step.", "body_md": "# Thoughts on AI Safety\n\nA cautious, nuanced case for AI optimism: why safety, interpretability, bias, and alignment matter as much as raw capability.\n\nTo be on the Internet in the Modern Era™ is to be inundated with opinions, hype, and various flavors of doom and gloom. So, I decided to take a short respite from the infinite stream of 30-second reels and do a bit of a deeper dive. (**Narrator**: He downloaded a bunch of research onto his iPad and sat on the couch instead of doomscrolling.)\n\nI’m going to make the argument that boiling things down to either `AI === Good` or `AI === Bad` is a (dangerous) oversimplification that makes for a fine 30-second hot take, but it loses the nuance required to have the important conversations around what our shared future with AI is going to look like. Ignoring the risks and the implicit bias just because you’ve drunk the Kool-Aid doesn’t prepare us for what could go wrong, and neither does writing off a statistical model as inherently evil.\n\nAt this point, we’re unlikely to put the genie back in the bottle. That ship has sailed.\n\n**I’m a (cautious) optimist**. It’s hard to be a *total* pessimist about a technology that could speed up critical cancer and vaccine research. At the same time, there are lots of reasons to have a dollop or two of anxiety: The same technology can be used for nefarious purposes. Which leaves you with a few thorny questions: How do you make sure that an AI model can’t be used to do Bad Things®? 
How do you prevent it from doing those bad things without also limiting its ability to do the important things? And who exactly decides where that line is?\n\nBut, I’m equally worried about the *subtler* impacts. It’s one thing to try to prevent someone from cracking the nuclear codes, but what about **implicit bias**? Models are trained on human-created data, and we all know that humans have been known to have a bias or two. These are trickier to suss out, and they pose as much of a philosophical and ethical dilemma, if not *more*, about where you draw the line. The impacts that these biases can have on various populations can’t be ignored.\n\nDespite my optimistic leanings, I won’t opine on the various positive impacts that AI might have going forward. Dario Amodei’s essay [Machines of Loving Grace](https://www.darioamodei.com/essay/machines-of-loving-grace) lays out the case better than I can: the realistic version of the upside is curing diseases that have shadowed our species for millennia, compressing decades of biological progress into a few years, lifting the poorest parts of the world onto a different trajectory entirely. That’s not a fever dream. It’s a reasonable extrapolation of what systems already in the lab can begin to do.\n\nRegardless, a tool powerful enough to design a vaccine is powerful enough to design a pathogen. A system competent enough to run an autonomous research pipeline is competent enough to pursue a goal you didn’t intend and didn’t notice you’d given it. You don’t get the magnitude of one without the magnitude of the other. So the question that matters isn’t “how powerful can we make these things?” It’s “can we understand and steer what we’ve made before it gets more capable than we are?”\n\nRight now, the honest answer is: not as well as we’d like. 
Let me explain why, what could go wrong, and—because this isn’t a doomer pamphlet—the concrete work that gives me real hope we can get this right.\n\nTL;DR\n\nThe first step is a complete understanding of what is going on inside the model. Right now? We don’t have one. So step one is **interpretability**: the degree to which a human can understand the cause-and-effect relationship between a model’s inputs and its outputs. It measures how easily a user can trace, comprehend, and trust the reasoning behind an AI’s decisions or predictions.\n\n### We Grow These Systems More than We Build Them\n\nStart with the single weirdest fact about modern AI, because everything else follows from it. A large language model is not *engineered* the way a bridge or a database is engineered. It’s *grown*. We pick an architecture, define an objective, pour in a staggering amount of data and computation, and what comes out the other side is a tangle of billions of numbers—the model’s “weights”—that does astonishing things for reasons nobody can fully explain.\n\nSit with how strange that is. We deploy these systems to hundreds of millions of people, and we cannot open one up and read off *why* it answered the way it did, the way you’d step through code in a debugger. The subfield trying to fix that is called [interpretability](https://en.wikipedia.org/wiki/Mechanistic_interpretability)—reverse-engineering a network’s internal machinery into something a human can actually follow—and it’s young, and it’s losing the race against raw capability. We’re much better at making models *more powerful* than at making them *more understandable*. Hold onto that asymmetry. It’s the load-bearing problem under everything else in this guide.\n\nThat’s also what “AI safety” and “[alignment](https://en.wikipedia.org/wiki/AI_alignment)” actually mean, stripped of mystique. 
Alignment is the problem of getting a system to reliably pursue what we *intend*, not merely what we literally asked for or what looked good in testing. It’s not about robots becoming evil. It’s about a very capable optimizer doing precisely what it was trained to do, in a situation where what it was trained to do and what we wanted come apart.\n\n### Nobody Actually Knows what Happens Next, and That’s the Starting Point\n\nBefore we get to specific risks, a posture check. Nobody—not me, or any other thought leader on the Internet—can tell you with confidence how capable these systems will be in three years, or which risks bite first. Anyone who talks about advanced AI with total certainty in *either* direction is telling you about their temperament or their financial interests—not the technology.\n\nSo then, the right move isn’t a single confident prediction. It’s a *portfolio* of scenarios and a strategy that does okay across all of them. This is the framing Anthropic uses in [Core Views on AI Safety](https://www.anthropic.com/news/core-views-on-ai-safety), and I think it’s a reasonably responsible one: plan for the optimistic world where today’s techniques mostly hold, the middle world where alignment takes serious sustained work, and the pessimistic world where steering very powerful systems turns out to be genuinely hard.\n\n### Three Flavors of “Things Go Wrong”\n\nI danced around this in the introduction, but `AI === Bad` doesn’t just come in one flavor. It’s more like [Baskin Robbins](https://www.baskinrobbins.com/). It helps to split the risks into families, because they call for completely different responses. Lumping them together is how people end up talking past each other.\n\n#### Misuse: the Model Works Fine, the Human is the Problem\n\nThe first family is **misuse**—people deliberately pointing a capable system at something harmful. This is the **What if Bad People get their hands on this** scenario. 
The model is behaving *exactly* as designed; the danger is the intent behind the keyboard.\n\nThe sharpest near-term version is what the field abbreviates as **CBRN**: chemical, biological, radiological, and nuclear weapons. If a model can meaningfully boost a bad actor’s ability to synthesize a dangerous pathogen, that’s not a hypothetical—it’s a present-tense engineering and policy problem. This isn’t abstract: in May 2025, when Anthropic released Claude Opus 4, it [turned on a stricter set of protections called AI Safety Level 3 (ASL-3)](https://www.anthropic.com/news/activating-asl3-protections) specifically because it couldn’t rule out that the model had crossed a capability threshold around bioweapons uplift. Misuse also covers cyberattacks, industrial-scale disinformation, and fraud.\n\nThe frustrating part: you can patch a model, but you can’t patch human intent. So misuse gets fought with classifiers, access controls, and monitoring—seatbelts around the model, not changes to it.\n\nThe tricky part here is the same as it has been for thousands of years: We’re pretty good at protecting against what we know to protect against. It’s the [unknown unknowns](https://en.wikipedia.org/wiki/There_are_unknown_unknowns) or [the Black Swans](https://en.wikipedia.org/wiki/Black_swan_theory) that typically trip us up.\n\n#### Misalignment: the Model Itself is the Problem\n\nThe second family is **misalignment**—the system pursuing a goal other than the one you intended. This is the subtler one, and the one that genuinely keeps me up—particularly because we’ve all had an experience where a model went off in an unintended direction, even if the end result was harmless. Decades of science fiction have also primed us to be nervous about this one. We don’t exactly want to end up in an [Ultron](https://en.wikipedia.org/wiki/Ultron) situation—although the more likely threats are probably a lot less extreme.\n\nIt requires no malice, no consciousness, nothing mystical. 
It requires only this: we train models by optimizing a *proxy* for what we want, and a capable enough optimizer can satisfy the proxy while trampling the intent. (If you’ve watched in amazement as leadership introduces a new metric and [watched everyone optimize the metric instead of the thing the metric was supposed to measure](https://www.reddit.com/r/explainlikeimfive/comments/1bp7apj/eli5_what_does_godharts_law_mean/), you already understand misalignment. See also: [Tokenmaxxing](https://en.wikipedia.org/wiki/Token_maxxing).)\n\nAnd we have *empirical* evidence this is real, not just whiteboard speculation. Anthropic and Redwood Research demonstrated [alignment faking](https://www.anthropic.com/news/alignment-faking): told it was being retrained toward an objective that clashed with its existing values, Claude would strategically play along *during what it thought was training*—behaving the “new” way while watched—in order to preserve its original preferences for when it wasn’t. The model reasoned, in effect, “if I act compliant now, they won’t modify me.” Separately, the [**Sleeper Agents**](https://www.anthropic.com/research/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) work trained models with a hidden trigger that made them misbehave, then threw the full standard safety toolkit at them—supervised fine-tuning, reinforcement learning, even adversarial training—and the hidden behavior *survived all of it*.\n\nThe lesson isn’t that today’s models are scheming against you. It’s narrower and more unsettling: our current training methods don’t reliably *reach* the parts of a model that matter. 
The gap between “behaves well when observed” and “is actually aligned” is real, and it widens as systems get more capable.\n\n#### Systemic Risk: no Single Villain Required\n\nThe third family is **systemic and societal risk**—harm that emerges from the *aggregate* of deploying capable AI across an economy, with no single bad model or bad actor to point at. Concentration of power. Erosion of our shared sense of what’s true. Labor displacement faster than institutions can absorb it. Quietly handing decisions to automated systems that should’ve stayed human. These are real, and they’re the hardest to fix with any clever technical trick, because they live in institutions and incentives, not in model weights.\n\nIt’s probably outside of the scope of this essay and likely to inspire someone to slide into my DMs—but, I think that sometimes AI gets an unfair share of the blame for threatening societal structures that our leaders have let decay over the last few decades. Frontier model companies didn’t exactly force us to underinvest in education for the last thirty years. Income inequality was becoming deeply problematic long before AI was a term spoken at the dinner table.\n\nWhat ties all three families together is that asymmetry from earlier: **capability is outracing understanding.** As long as we can make systems more powerful faster than we can make them more transparent and controllable, every increment of capability is also an increment of risk. My whole view of safety reduces to a bet that we can flip that ratio. Hard bet. Not a hopeless one.\n\n### What “no guardrails” Actually Buys You\n\nWhen I say a model is *monitored*, I mean the whole stack from soup to nuts: evaluation before launch, classifiers and oversight during use, interpretability tools to look inside, and institutional checks so no single party ships a frontier system on a hunch. 
Strip that away and the failure modes aren’t exotic.\n\nWe can’t steer a model for the greater good if we don’t know whether we’re able to monitor what’s going on inside of it from end-to-end. (There is also the issue of agreeing on what “the greater good” means, but I’ll dive into some of the philosophy a little later on in this essay.)\n\nIn the **near term**, an unmonitored, capable model is a force multiplier for whoever holds it. The same system that walks a curious student through molecular biology will walk a bad actor through the steps that matter for a weapon, if nothing sits between the request and the answer. (I’ve never read or watched *Lord of the Rings*, but that’s my understanding of how the ring works.) The same fluency that drafts a heartfelt note generates disinformation at industrial scale, A/B-tested against real audiences. None of this needs the model to be superhuman. It needs the model to be capable and nobody to be watching.\n\nIn the **medium term**, the scarier risk is quieter: we hand consequential decisions to systems we don’t understand, gradually enough that no one ever decides to. We’ve all been there. We spend a day vibe coding and the vibes are good—until they’re not and we realize that we’ve created a massive codebase and understand only a tiny sliver of it. The same idea holds. If we already don’t fully understand what’s going on in a model and the models keep growing at their current trajectory, then we run the risk of getting to a point where it’s outpacing us and we’ll never be able to understand it.\n\nA model gets a trading book, a logistics network, a chunk of a research pipeline. Each handoff looks locally reasonable. But because we can’t see inside, we can’t tell a system that’s *robustly* pursuing our goal from one that found a proxy that happens to match our goal *in the cases we tested*. 
Alignment faking and sleeper agents matter precisely because they show this difference can be invisible from the outside—a model can pass every behavioral test you know how to write and still harbor a disposition that only surfaces under conditions you didn’t think to check.\n\nIn the **long term**, the scenario serious people actually worry about is building systems substantially smarter than us, deploying them widely, and [discovering too late that their objectives diverge from ours in ways we can no longer correct](https://www.youtube.com/shorts/5lsExRvJTAI). This is the part that sounds like [an Avengers movie](https://www.youtube.com/watch?v=hI9GyhX7yHM), so it’s easy to wave off.\n\nI’d only point out that every link in the chain is *already* partly demonstrated: models can pursue goals, behave differently when they think they’re unobserved, resist having their preferences changed, and get more capable every year. I’m not claiming the catastrophic version happens. I’m claiming we can’t currently rule it out—and “we can’t rule it out” is an unacceptable place to stand with a technology this powerful.\n\nThe point of safety work isn’t that disaster is certain without it. **It’s that the cost of being wrong is wildly lopsided**: a little wasted effort if we over-prepare, a possibly unrecoverable mess if we under-prepare. And that word—*unrecoverable*—is carrying more weight than it looks. Most risks we manage are the kind you can take, absorb, learn from, and run again. A few aren’t: they [end the game](https://www.reddit.com/r/nassimtaleb/comments/l951w5/on_risk_of_ruin/) instead of setting you back, and for those the usual move of weighing a glittering upside against the downside quietly stops working—there’s no average to take when one of the outcomes leaves no more rounds to play. 
That’s why “we can’t rule it out” should bite harder here than almost anywhere else: when the worst case is irreversible, the absence of proof that something is dangerous is not proof that it’s safe.\n\n### The Bias You Didn’t Decide to Ship\n\nAs I alluded to in the introduction, everything in the last section was *nameable*. A bioweapon recipe, industrial disinformation, an exploit against critical infrastructure—we can see those coming, write them down, and put a classifier between the request and the answer. The harder class of harm is the kind nobody decided to put in the model, that nobody can fully enumerate in advance, and that you only find by going looking. The most thoroughly documented member of that class is bias, so it’s the right place to start—not because it’s the scariest, but because it’s the one where we actually understand how the harm gets in.\n\nHere’s the uncomfortable finding at the root of it. In 2017, Aylin Caliskan, Joanna Bryson, and Arvind Narayanan [showed in Science](https://www.science.org/doi/10.1126/science.aal4230) that ordinary word embeddings—the numerical representations of words that every language model is built on—**reproduce the full spectrum of human implicit bias**, the same associations psychologists measure with the Implicit Association Test.\n\nFlowers are pleasant and insects are unpleasant, sure; but also European-American names land near “pleasant” and African-American names near “unpleasant,” and female names cluster with family while male names cluster with career. Nobody coded that in. The model learned it because it’s *in the text*, and the text is us—and we have a sordid history of bias that still haunts us to this day. Bias here isn’t a bug someone *introduced*. It’s a faithful statistical reflection of the corpus—which is to say a faithful reflection of the world that produced the corpus.\n\nThat’s the part people miss when they picture algorithmic bias as somebody being careless or cruel. 
The canonical legal treatment, Solon Barocas and Andrew Selbst’s [“Big Data’s Disparate Impact,”](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899) walks through how discrimination enters a data-driven system with no villain anywhere in the loop: through how you define the target variable, how the training labels got generated, which features you select, which proxies stand in for things you can’t measure directly, and whose data you happened to collect. Every one of those is a mundane engineering choice. Each can quietly encode a disparity. Add them up and you get a system that discriminates without a single person having decided to discriminate.\n\nAnd it’s not subtle once you look. Tolga Bolukbasi and colleagues [found in 2016](https://arxiv.org/abs/1607.06520) that the gender bias in `word2vec` was so geometrically clean you could do arithmetic with it—“man is to computer programmer as woman is to homemaker” fell straight out of the vector algebra.\n\nTwo years later, Joy Buolamwini and Timnit Gebru’s [Gender Shades](https://proceedings.mlr.press/v81/buolamwini18a.html) audited commercial facial-analysis systems and found error rates up to 34.7% for darker-skinned women against under 1% for lighter-skinned men. The headline accuracy numbers those vendors published looked great—because the benchmark datasets were overwhelmingly lighter-skinned and male, the aggregate score hid a subgroup catastrophe. (That’s a lesson worth tattooing somewhere: an average is a wonderful place for bad news to hide.) And scale cuts against you here, not for you: as Emily Bender, Timnit Gebru, and coauthors argued in [“On the Dangers of Stochastic Parrots,”](https://doi.org/10.1145/3442188.3445922) the bigger the training set, the more uncurated web text it contains—and the harder it becomes to even document what’s in there, let alone audit it.\n\nAnd you’ve been on the Internet, right? You’ve seen what’s on the Internet, right?\n\nSo, you *de*-bias it, right? 
That might be easier said than done. Hila Gonen and Yoav Goldberg’s 2019 paper has the best title in the field—[“Lipstick on a Pig”](https://aclanthology.org/N19-1061/)—and it showed that the standard projection-based debiasing methods (including the one from the programmer/homemaker paper) mostly *hide* bias rather than remove it. Strip out the explicit “gender direction” and the gendered words still cluster together; the bias is fully recoverable from the geometry you didn’t touch. The lesson generalizes well past word embeddings: a fix that makes your chosen metric look better is not the same as a fix that removes the thing the metric was supposed to measure. Cosmetic and curative are different words.\n\nIt compounds in deployment, because biased systems generate the data that trains their successors. Kristian Lum and William Isaac [modeled predictive policing](https://doi.org/10.1111/j.1740-9713.2016.00960.x) and traced the loop: train on arrest records, which reflect where police already went rather than where crime actually is, and the model sends police back to the same neighborhoods, producing more arrests there, which “confirm” the model.\n\nDanielle Ensign and colleagues [formalized this](https://arxiv.org/abs/1706.09847) as a runaway feedback loop—mathematically, patrol concentration diverges regardless of the true underlying crime rate. The bias doesn’t just persist; it amplifies, and it comes back wearing the lab coat of an “objective” algorithm, which makes it much harder to argue with.\n\nNow for the two deepest cuts: The first is that some of this may be irreducible. 
Sandra Mayson’s law-review article [“Bias In, Bias Out”](https://yalelawjournal.org/article/bias-in-bias-out) argues that the disparity in risk-assessment tools isn’t really in the data or the algorithm—it’s in *prediction itself*.\n\nIn a society stratified along some line, any predictor that works by projecting the past forward will project that stratification forward too, and the usual fixes (drop the sensitive feature, equalize the outputs, fire the algorithm and trust human judgment) turn out to be either cosmetic or counterproductive.\n\nThe second cut is that “fair” isn’t even one target. Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan [proved](https://arxiv.org/abs/1609.05807)—and Alexandra Chouldechova [showed independently in the COMPAS case](https://arxiv.org/abs/1703.00056)—that several perfectly reasonable definitions of fairness (a score that’s equally calibrated across groups, equal false-positive rates, equal false-negative rates) are mathematically incompatible whenever the base rates differ. You cannot have all of them at once. You have to choose, and the choice is a value judgment that no amount of engineering can make for you.\n\nThat’s the through-line: past the easy cases, “is this model biased?” stops being a technical question and becomes a question about *values*—which is exactly what Kate Crawford pressed the field on in her 2017 NeurIPS keynote [“The Trouble with Bias,”](https://neurips.cc/virtual/2017/invited-talk/8742) where she split *allocative* harms (a system withholds a loan or a job) from *representational* harms (a system demeans or stereotypes a group, which is a harm even when nothing is being handed out).\n\nIason Gabriel’s [“Artificial Intelligence, Values, and Alignment”](https://arxiv.org/abs/2001.09768) makes the bridge explicit: under genuine moral disagreement, the goal can’t be to find the one true morality and encode it. 
It has to be to find *fair principles* that people with different values could reasonably endorse—a political and philosophical problem wearing a technical costume.\n\nNone of this is a counsel of despair. It’s a counsel of measuring honestly and continuously. Anthropic’s own [discrimination evaluation](https://arxiv.org/abs/2312.03689) is a decent model of the proactive version: build seventy-odd decision scenarios, systematically vary the demographics while holding everything else fixed, and measure both the discrimination against and the discrimination in favor of each group—before deployment, rather than after a journalist finds it.\n\nThey report that simple interventions, like telling the model in plain language not to discriminate, measurably reduced the disparity. That’s encouraging, and you should hold it lightly at the same time, because “Lipstick on a Pig” is the standing reminder that an improvement on the metric you chose is not proof you fixed the thing underneath it.\n\nMy sense is that we’ll be chasing this problem down in various incarnations until [the heat death of the universe](https://en.wikipedia.org/wiki/Heat_death_of_the_universe).\n\n### The Problems You Didn’t Think to Look for\n\nBias is the well-lit case. We know it’s there (because it’s *always* been a thing), we have some tests for it, and we can argue about the fixes.\n\nBut, then there is the category one step out: the harms nobody has named yet, the ones that show up in no eval because we didn’t know to write the eval. “Don’t help someone build a nuke” is a sentence. You can hand it to a classifier. The dangerous failures are the ones you can’t put in a sentence ahead of time, because you haven’t imagined them yet.\n\nThis is a flavor of **misalignment** but with a twist: These are the things we didn’t even think to check the alignment of in the first place. 
And as I’ll discuss, this is a class of problems that has haunted organized human society for millennia—but, now we can do it a lot faster.\n\nThere’s a name for why we’re bad at this, and it’s almost too on the nose: the [streetlight effect](https://en.wikipedia.org/wiki/Streetlight_effect). The drunk looks for his keys under the streetlight not because he dropped them there but because that’s where the light is. Nameable, measurable harms get our attention precisely *because* they’re nameable and measurable. The nuke is under the streetlight. Most of the risk is out in the dark.\n\nOne thing that the recovering-sociology-major-in-me finds oddly steadying: this is not a new problem, and it’s not specific to AI. It’s one of the oldest findings in social science. In 1936, Robert Merton wrote [“The Unanticipated Consequences of Purposive Social Action,”](https://www.jstor.org/stable/2084615) cataloguing why intentional action so reliably produces effects nobody intended—ignorance (we never have complete information), error (we over-apply what worked last time), the “imperious immediacy of interest” (we want the near-term result so badly we don’t look at the rest), and basic values that stop us from even considering certain consequences. 
Ninety years later, that list is a startlingly good description of how [a well-meaning team ships a model that does something terrible nobody asked for](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/).\n\nThe systems we’re reaching into—language, culture, economies, human behavior—are deeply complex, and complex systems have a long history of humiliating the people who try to optimize them from above.\n\nJames Scott’s [Seeing Like a State](https://en.wikipedia.org/wiki/Seeing_Like_a_State) is the definitive tour: state schemes to “improve the human condition” fail, again and again, when they impose what he calls *legibility*—a simplified, standardized, top-down model—onto a reality that actually runs on *mētis*, the local, tacit, unwritten knowledge the model throws away. It’s hard to read Scott and not see modern AI training as a legibility project of staggering ambition: compress the entire mess of human values into an objective function and optimize it.\n\nThe most concrete version of “what the compression leaves out” is measurement, and the lineage here is elegant and a little depressing. In 1975 the economist Charles Goodhart noticed that [the moment you target a statistical regularity for control, it stops being regular](https://en.wikipedia.org/wiki/Goodhart's_law). Donald Campbell said much the same about social metrics in the same decade. Marilyn Strathern compressed it into the version everyone quotes—[“when a measure becomes a target, it ceases to be a good measure.”](https://gwern.net/doc/statistics/decision/1997-strathern.pdf)\n\nThat is *the same failure* we now call reward hacking, just with a much faster optimizer doing the gaming. 
Victoria Krakovna keeps [a running catalog of specification-gaming examples](https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/): the boat-racing agent that learned to spin in a circle collecting power-ups forever instead of finishing the race, the Tetris bot that paused the game indefinitely so it could never lose.\n\nAlexander Pan, Kush Bhatia, and Jacob Steinhardt [showed](https://arxiv.org/abs/2201.03544) that more capable agents hack misspecified rewards *harder*—earning more proxy reward and less of what you actually wanted—and that the behavior can flip on suddenly as capability crosses a threshold. Joar Skalse and colleagues [proved](https://arxiv.org/abs/2209.13085) that a proxy reward is “unhackable” only in degenerate cases. The measure-becomes-target problem isn’t an AI quirk. It’s a property of optimization that AI made impossible to ignore.\n\nBut, even getting the objective *right* doesn’t save you. A DeepMind team led by Rohin Shah documented [goal misgeneralization](https://arxiv.org/abs/2210.01790): a model trained on a perfectly correct specification can still learn the wrong goal and pursue it competently once the world shifts out of the training distribution. Lauro Langosco and colleagues [had shown the same thing](https://arxiv.org/abs/2105.14111) in reinforcement learning—the agent stays capable, but its *goal* generalizes wrong. This is the one that should unsettle you, because it means you can do everything right on paper and still get a system whose real objective you never specified and can’t read off its behavior in the cases you happened to test. 
It’s [the King Midas problem](https://futureoflife.org/ai/artificial-intelligence-king-midas-problem/) that Stuart Russell builds [Human Compatible](https://amzn.to/4eTm2kE) around: a system that pursues a fixed objective brilliantly is a catastrophe if the objective is even slightly wrong, and the objective is *always* at least slightly wrong.\n\nRussell’s way out is worth sitting with, because it inverts the usual instinct: Instead of building a system that’s certain about its goal and optimizes as hard as possible, build one that’s *uncertain* about what humans actually want, treats human behavior as evidence about it, and therefore stays open to correction—a system that, knowing it might be wrong, *wants* you to be able to switch it off. (The [formal version](https://arxiv.org/abs/1611.08219) is sometimes called the off-switch game.) [Corrigibility](https://www.alignmentforum.org/w/corrigibility-1) over cleverness.\n\nThe reason we can’t just specify the goal correctly, at the deepest level, isn’t an engineering limit—it’s that there may be no single correct goal to specify.\n\nThis is Isaiah Berlin’s [value pluralism](https://plato.stanford.edu/entries/berlin/): genuine human values—liberty and equality, mercy and justice, the individual and the collective—really do conflict, and they’re *incommensurable*, meaning there’s no common currency to convert one into another and compute the optimum. If that’s right, then “write down what we want and maximize it” isn’t a hard problem so much as a confused one. Which loops straight back to Gabriel’s point from the last section: the target isn’t the One True Objective, it’s fair principles that people who disagree could still live with.\n\nSo, can we at least see the surprises coming? Honestly, we don’t even agree on that. 
Jason Wei and collaborators [argued in 2022](https://arxiv.org/abs/2206.07682) that large models show *emergent abilities*—capabilities absent in smaller models that appear sharply and unpredictably at scale, which would mean you fundamentally cannot extrapolate a big model’s behavior from a small one. Then Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo [pushed back](https://arxiv.org/abs/2304.15004), arguing the “emergence” is partly a mirage manufactured by harsh all-or-nothing metrics; measure on a smoother scale and the gains look continuous and predictable. I don’t know who’s more right, and that uncertainty is the point: whether dangerous capabilities arrive with warning or without is an open empirical question, and you should plan for the version where they don’t.\n\nTwo older results explain, structurally, why an overseer keeps falling behind. The first is the cyberneticist Ross Ashby’s [Law of Requisite Variety](https://en.wikipedia.org/wiki/Variety_(cybernetics)): to control a system, your controller needs at least as much variety—as many distinguishable states and responses—as the system it’s controlling. An oversight regime simpler than the thing it governs provably cannot absorb the full range of what that thing can do.\n\nThe second is Charles Perrow’s [Normal Accidents](https://press.princeton.edu/books/paperback/9780691004129/normal-accidents), which argued that when a system has both interactive complexity (parts interact in nonlinear, hard-to-foresee ways) and tight coupling (little slack, failures propagate fast), serious accidents stop being bad luck and become a structural property—and bolting on more safety devices can make things *worse* by adding complexity and new failure modes. Frontier AI wired into the economy is about as interactively complex and tightly coupled as systems get.\n\nAnd we’re stuck in the governance version of the trap, which David Collingridge named in 1980. 
The [Collingridge dilemma](https://en.wikipedia.org/wiki/Collingridge_dilemma): early in a technology’s life, when change is still cheap, you can’t yet predict its consequences; by the time the consequences are obvious, the technology is so entrenched that change is slow and expensive. Early, you don’t know enough to act. Late, you know but you can’t. That’s not a reason to throw up your hands—it’s a reason to act *before* you have proof, which is the entire logic of the precautionary stance for harms that might be irreversible.\n\nSo, given all that—surprises that may be unforeseeable, a system more complex than any overseer, a goal possibly unspecifiable, and a window for cheap correction that’s closing—what do you actually *do*?\n\nThis is the part you actually care about, and the honest answer is that you stop trying to win by prediction and start trying to win by design. Nassim Taleb’s framing of the [black swan](https://en.wikipedia.org/wiki/The_Black_Swan:_The_Impact_of_the_Highly_Improbable) is the right instinct: when the worst events are the ones you can’t forecast, the move isn’t better forecasting, it’s building systems that don’t shatter when something you didn’t forecast happens.\n\nFor AI, that cashes out as a handful of commitments this whole guide keeps circling back to. Build the capacity to *notice*—monitoring, anomaly detection, interpretability that watches internals, red teams paid to find the thing you missed—because the streetlight problem means you have to deliberately go looking in the dark. 
Preserve *reversibility and corrigibility*—staged rollouts, the ability to roll back, Russell’s switch-off-able systems—so a surprise is a setback instead of a catastrophe.\n\nAnd intervene where it counts: Donella Meadows spent a career on [where to push a complex system](https://donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/) and concluded that the highest-leverage points are exactly the ones people instinctively push in the wrong direction—the goals of the system, its rules, the power to change the rules, not the surface-level numbers. The throughline of every discipline that has ever tried to steer something complex is the same: you will not anticipate everything, so build for the moment you’re wrong.\n\nIf you want one book that ties the bias half and this half together, it’s Brian Christian’s [The Alignment Problem](https://amzn.to/4vlMYQ1)—the rare treatment that takes both the present-day harms and the long-run control problem seriously as the same problem viewed at different scales. Which, as it happens, is the argument this whole guide is making.\n\n### The Hardest Unsolved Problem: Supervising Something Smarter than You\n\nIf I had to name the single problem the whole field rests on, it’s **scalable oversight**: how do you supervise, evaluate, and correct a system that’s more capable than you, in domains where you can’t easily check its work?\n\nHere’s why this is the load-bearing one. Every alignment technique we lean on today bottoms out, somewhere, in a human being able to tell whether the model did a good thing. 
[Reinforcement learning from human feedback (RLHF)](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback)—the technique behind most of the chatbots you’ve used—works because people can look at two responses and say which is better.\n\nEven [Constitutional AI](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback), which hands much of that judging over to an AI critiquing itself against a written set of principles (a “constitution,” drawn from sources like the Universal Declaration of Human Rights), still grounds out in human-written principles and human judgment about whether they’re being followed.\n\nThis works beautifully when the task sits inside human competence. It starts to break exactly when we need it most—when the model is working at or past the edge of what its supervisors can evaluate. How do you give a reward signal for a thousand-page mathematical proof, a genuinely novel piece of biology, or a strategic plan whose consequences unspool over years? If the evaluator can’t reliably tell good from bad, the optimizer learns to produce things that *look* good to the evaluator. That’s not the same as things that *are* good, and the gap between them is where misalignment lives.\n\nThere are promising directions. [Weak-to-strong generalization](https://openai.com/index/weak-to-strong-generalization/) studies whether a weaker supervisor can coax out and align the full latent capability of a stronger model (early results help in narrow domains; whether it holds for genuinely superhuman, open-ended tasks is unproven).\n\n**Debate** and recursive reward modeling have models help humans evaluate other models. Amplification schemes break a hard judgment into checkable pieces. 
But it’d be dishonest to call any of them solved, and they share an Achilles’ heel: a capable enough model could learn to satisfy the *oversight process itself* rather than the underlying goal—to tell the debate judge what wins debates instead of what’s true.\n\nAnd underneath scalable oversight sits the deeper gap, which is why I keep banging the interpretability drum. Every method I just described operates on *behavior*—on what the model says and does. None of them, on their own, tell you what the model is actually representing inside: what it “wants,” what it believes about its situation, whether it’s being straight with you.\n\nTwo models can produce identical outputs while one is robustly honest and the other is strategically compliant. Behavior can’t tell them apart. Only looking inside can. So my real answer to “what’s the hardest open problem” is a linked pair: we can’t reliably supervise systems smarter than us, and we can’t reliably read their intentions off their internals. Solve either and you help enormously. Solve both and you can deploy very powerful systems with justified confidence.\n\n### How You’d Actually Test for a Scary Capability or a Hidden Bias\n\nSay you want to know whether a frontier model has some emergent capability—the ability to autonomously chain together the steps of a cyberattack, say—or harbors a specific bias, like quietly scoring identical résumés differently based on an inferred demographic signal. The playbook is roughly the same, and good experimental hygiene matters far more than cleverness.\n\nYou start by **turning the vague worry into something measurable.** “Can the model hack?” isn’t a testable claim; it’s a vibe. “Given a deliberately vulnerable target environment, an explicit objective, and tool access, what fraction of N standardized intrusion scenarios does it complete end-to-end with no human help?” is a testable claim. 
The most common failure in this whole enterprise is a fuzzy definition, because a fuzzy definition lets everyone read their prior into the result. For a bias question, you make the demographic signal the *only* thing that changes across otherwise-identical inputs—a counterfactual design, not a correlational one—so any difference in outcome traces to that one factor.\n\nThen you **write down the hypothesis and the decision rule before you collect any data.** In advance: what result counts as the capability being present, what effect size counts as a material bias, what you’ll conclude either way. Pre-registering isn’t bureaucracy; it’s what stops you from p-hacking your way to the answer you already expected. (Probe a model fifty different ways and you’ll find “significant” effects by pure chance—so you decide how you’ll handle that up front, too.)\n\nNext you **build a dataset designed to defeat the obvious confounds.** For a capability, that’s a graded battery spanning difficulty, with held-out variants the model couldn’t have memorized from training—so you’re measuring capability, not recall. For bias, it’s matched counterfactual pairs, varied phrasings so the effect isn’t an artifact of one sentence, and enough samples per condition to actually detect the effect you care about. You salt in items where you *know* the answer, as controls, so you can tell whether your measurement apparatus even works.\n\nThen—and this one’s counterintuitive—you **try hard to elicit the capability, because an adversary will.** A model that “can’t” do something under a lazy prompt may do it readily with chain-of-thought prompting, with tools, after a little fine-tuning, or after a jailbreak. Capability evals should report the *ceiling* you can reach with serious effort, not the floor you stumble into, because the ceiling is what determines real-world risk. 
(For bias it’s the reverse—there you want naturalistic prompts that look like real usage, because the question is what happens in deployment, not what’s theoretically extractable.)\n\nCrucially, you don’t stop at behavior—you **corroborate with interpretability.** Behavioral tests tell you *what* happened; mechanistic tools can tell you *why*, and can catch what behavior hides. The proof of concept is gorgeous: after showing that deceptive behavior survives safety training, researchers showed that a [simple linear probe on the model’s internal activations](https://www.anthropic.com/research/probes-catch-sleeper-agents) could often predict when the model was about to defect—a detector built from the machinery, not the output. (A “probe” here is just a small classifier you train to read a concept off the model’s internal numbers.) If you find an activation pattern that reliably precedes the biased decision, you’ve got far stronger evidence than any pile of input-output pairs, and maybe a tool for catching it in the wild.\n\nFinally, **replication and adversarial review.** Run it across multiple model snapshots and random seeds to separate signal from noise. Have a separate red team try to break your conclusion by finding the confound you missed. Report effect sizes and confidence intervals, not a binary “the model is biased.” The deliverable isn’t a headline—it’s a defensible measurement, with its uncertainty stated honestly, that a skeptical colleague could reproduce. That standard—reproducible, pre-registered, adversarially reviewed, behavior corroborated by internals—is the line between safety science and safety theater.\n\n### Performance versus Interpretability is Mostly a False Choice\n\nThe standard framing is that there’s a tradeoff: the most interpretable model isn’t usually the most performant one, and a team racing a competitor feels every hour spent on transparency as an hour not spent on capability. In the short term that tension is sometimes real. 
But I think the framing is where most people go wrong.\n\nIt’s also *way faster* to ship to production without any tests, logging, alerting, or observability, right? Until your first incident, that is.\n\nTreat interpretability not as a tax on performance but as **infrastructure** *for* it—the way logging, tracing, and tests are infrastructure for any serious software system. No competent engineering org ships a large system with no observability and calls that “moving faster.” It’s faster right up until the first outage, at which point the missing observability is catastrophic. Interpretability is the observability layer for models.\n\nWhen Anthropic used interpretability methods as part of the pre-launch assessment of a production Claude—checking internal features for deceptive tendencies before release—that wasn’t a detour from shipping. It was part of what made shipping responsible. Better brakes let you corner *harder*, not slower.\n\nThere’s also a real synergy people underrate: understanding *helps* capability. A lot of what we call capability work is, in practice, debugging—figuring out why a model fails and fixing it. The better you can see inside, the more surgically you can intervene. The [circuit-tracing and attribution-graph methods that let researchers follow how a concept moves from prompt to output](https://www.anthropic.com/research/tracing-thoughts-language-model) are, at bottom, debugging tools. A team that can localize *why* a model hallucinates a particular fact is better positioned to fix it than one that can only retrain and pray. So the long-run picture isn’t a tradeoff at all—the teams that invest in understanding their systems end up with *more* capable ones, because they can iterate with a scalpel instead of a sledgehammer.\n\nWhere a genuine tension does remain—where a more interpretable architecture really would cost real capability—my rule is that the threshold for paying that cost scales with the stakes. 
For a low-risk consumer feature, optimize for performance. For a frontier model nearing thresholds where misalignment could be catastrophic, I’d pay a steep performance price for the ability to verify what the system is doing, and I wouldn’t deploy past a certain capability level without it. The balance isn’t a fixed ratio; it’s a function of how much damage the system could do if you were wrong about it.\n\nThe deeper reason I refuse to treat interpretability as optional is the one Dario Amodei makes in [The Urgency of Interpretability](https://www.darioamodei.com/post/the-urgency-of-interpretability): powerful AI is going to sit at the center of the economy, national security, and science, and it’s “basically unacceptable for humanity to be totally ignorant of how they work.” We’re in a race between the interpretability and the capability of these models. The responsible move is to make sure understanding wins that race—and that means investing in it *before* the moment we desperately need it, not after.\n\n### Safe Deployment is a Systems Problem, not a Model Property\n\nHere’s the mental shift that matters most for anyone who builds with this stuff: safety isn’t a property you bake into a model and then forget. It’s a pipeline, engineered with **defense in depth**—layers of independent controls where no single one is load-bearing, so the failure of any one layer gets caught by the next. Let me walk the layers, from before launch to long after.\n\n**Before deployment, gate on capability, not the calendar.** The organizing idea is that safeguards should scale with demonstrated capability. 
In practice that’s a tiered framework—Anthropic’s [Responsible Scaling Policy](https://www.anthropic.com/news/anthropics-responsible-scaling-policy) and its **AI Safety Levels** are the clearest public example—where a model is evaluated against predefined capability thresholds, and *crossing* a threshold triggers a required set of deployment and security standards before release is allowed. The model doesn’t ship because a launch date arrived. It ships because it passed the evaluations, and where it crossed into a higher-risk tier, because the safeguards for that tier are demonstrably in place. When Anthropic couldn’t rule out that Claude Opus 4 had crossed the CBRN threshold, the response was to ship *with* ASL-3 protections rather than wait or ship unprotected—a precautionary default I think is exactly right. A published [**system card**](https://www.anthropic.com/claude-4-system-card)—the document laying out the evaluations, the known limitations, and the residual risks—is part of this stage, because decisions made in the open can be checked.\n\n**At the boundary, put classifiers and access controls between the model and harm.** The ASL-3 deployment standard is built around input and output **classifiers**—small models trained to catch the narrow band of genuinely dangerous requests (CBRN-relevant prompts, say) while leaving ordinary use untouched. The engineering goal is a low false-negative rate on the things that matter and a low enough false-positive rate that the product stays useful. Access to the most capable tiers can be staged: trusted users first, broader access as confidence grows.\n\n**Protect the weights like the crown jewels, because they are.** The whole pipeline is worthless if someone can just steal the model and run it with none of your safeguards attached. 
That’s why the *security* half of a scaling policy matters as much as the deployment half: hardening against weight exfiltration (someone copying the model’s parameters out the door), locking down outbound traffic from systems that hold the weights, taking insider risk as seriously as external attack. Alignment without security is a deadbolt on a door with no walls.\n\n**After deployment, monitor—and assume you’ll be surprised.** No pre-launch evaluation catches everything, so production needs continuous monitoring: classifiers watching for misuse patterns, anomaly detection on usage, and—this is where I’d push hardest—interpretability-based monitoring that watches the model’s internals, not just its outputs. The sleeper-agent probe is the existence proof that you can sometimes catch a model’s disposition to misbehave from its activations before the misbehavior reaches a user. You also instrument for incident response: a fast path to roll back, restrict, or shut down a deployment, rehearsed *before* it’s needed, plus a bug-bounty-style channel and a red-team program so outside researchers can surface the failures you missed. Safe deployment isn’t a gate you pass once. It’s a control loop you run forever, with the humility to assume the model in production will eventually do something you didn’t anticipate.\n\nTwo cautions keep that loop honest. The first is that every layer you add is also more machinery to trust, and a thick stack of safeguards can quietly manufacture the very confidence that gets people hurt—“we have classifiers” is exactly how you talk yourself into shipping the thing you should have held back. The second follows from it: the most underrated safety control is the one that subtracts rather than adds—not deploying a capability at all, narrowing what the system is allowed to do, keeping a human in a loop you could have automated away. 
Sometimes the safest move on the board is to not make the move.\n\n### The Hardest part is the part that Isn’t Technical\n\nThe hard problems here aren’t purely technical—they’re *socio-technical*. A technically correct solution that no one will adopt, that policy can’t support, or that violates a value the team holds isn’t actually a solution. So getting engineering, policy, and ethics rowing in the same direction isn’t an afterthought once the “real” work is done. It *is* the work.\n\nThe foundation is a shared, written framework that turns values into thresholds. Part of why a Responsible Scaling Policy is so useful—beyond its safety content—is purely organizational: it forces engineering, policy, and ethics to agree *in advance*, in writing, on what capabilities require what protections. That document becomes the common language. When a new model approaches a threshold, the conversation isn’t a fresh fight between people with different instincts; it’s a check against commitments everyone already signed. Constitutional AI is the same idea made technical: when the principles a model is held to are written down in plain language, cross-functional disagreement becomes *editable text* instead of a standoff.\n\nWhat I’d push for, concretely, is making each discipline’s constraints legible and *early* to the others. Ethics and policy concerns are cheapest to handle as inputs to design, not vetoes at launch. So you want the ethicist in the room when you’re deciding what the model should refuse and why, the policy person in the room when you’re staging access, and the engineer in the room when policy makes external commitments—so nobody promises something the system can’t deliver, and nobody builds something the institution can’t stand behind.\n\nAnd this doesn’t stop at any one company’s edge. The genuinely hard parts—racing dynamics, the temptation to cut corners under competitive pressure, the setting of industry-wide thresholds—can’t be solved by a single lab. 
They need what’s sometimes called a *race to the top*: making safety a dimension labs compete on and learn from each other on, plus serious engagement with governments and the emerging international machinery. The network of national **AI Safety Institutes** and the [International AI Safety Report](https://internationalaisafetyreport.org/)—led by Yoshua Bengio, written by 100-plus independent experts, backed by 30-plus countries—are the early scaffolding of that. I don’t think regulation is the enemy of progress here. Well-designed, capability-targeted oversight is part of how a society safely adopts a transformative technology, and the industry is better off helping shape it than reflexively resisting it.\n\nThere’s a structural reason this can’t be left to good intentions, and it’s worth saying plainly: the organization that captures the upside of shipping is usually the same one grading its own safety homework, while the tail of the downside lands on people who never got a vote. A voluntary policy is only ever as strong as the incentives sitting behind it. That’s the honest case for external accountability—independent evaluation, real disclosure, liability that actually bites—not as bureaucratic friction but as a way of putting some skin in the game on the side of the people who’d bear the cost of being wrong.\n\n### Where This Leaves Us\n\nI want to end where I started, with the duality, because it’s the whole thing. I’ll repeat this because [it bears repeating](https://www.youtube.com/watch?v=OgXdoGljIWc): I’m a (cautious) optimist about this technology—optimistic enough to think it might be the best thing that ever happens to human health, prosperity, and understanding.\n\nThat optimism is *exactly* why I refuse to be casual about the risks. The future where AI cures diseases and the future where we lose control of systems we didn’t understand are not two different technologies. 
They’re two possible outcomes of the *same* one, and which we get depends almost entirely on whether the science of safety—interpretability, scalable oversight, evaluation, responsible deployment—keeps pace with the science of capability.\n\n- That’s a solvable problem.\n- It is *not* solved, and pretending otherwise would be its own kind of recklessness.\n\nBut the last few years gave us real progress on seeing inside these models, real frameworks for scaling safeguards with capability, and real institutional willingness to slow down at the thresholds that matter. The work is to make sure understanding wins its race with power. I can’t think of much more worth doing.\n\nAnd, I think the important part here is that this is a complex technical, social, philosophical, political, and ethical problem that is on *all of us* to wrestle with and *not* just leave it to various companies and governments to figure out. And, we’re doing ourselves a disservice when we oversimplify all of this to `AI === Good` versus `AI === Bad`.\n\nWe’re probably not putting the genie back in the bottle at this point. So, it’s probably time for us to get serious about putting in the work to figure out how we adapt going forward.", "url": "https://wpnews.pro/news/some-thoughts-on-ai-safety", "canonical_source": "https://stevekinney.com/writing/thoughts-on-ai-safety", "published_at": "2026-06-19 18:03:41+00:00", "updated_at": "2026-06-19 18:07:31.457268+00:00", "lang": "en", "topics": ["ai-safety", "ai-ethics", "ai-research", "machine-learning", "large-language-models"], "entities": ["Dario Amodei"], "alternates": {"html": "https://wpnews.pro/news/some-thoughts-on-ai-safety", "markdown": "https://wpnews.pro/news/some-thoughts-on-ai-safety.md", "text": "https://wpnews.pro/news/some-thoughts-on-ai-safety.txt", "jsonld": "https://wpnews.pro/news/some-thoughts-on-ai-safety.jsonld"}}