{"slug": "what-are-some-angles-of-attack-for-making-continual-learning-safer", "title": "What are some angles of attack for making continual learning safer?", "summary": "Continual learning (CL) for LLMs remains largely undeveloped, but researchers are exploring three categories of attack to make it safer: interpretability, controllability, and evaluation. The authors caution that many projects may advance capabilities, complicating net safety effects, and recommend prioritizing CL agents with natural-language memories and human rollback control.", "body_md": "*This is the fourth post in the sequence **Implications of Continual Learning for LLM Agents**.*\n\nContinual learning is a capability that largely doesn’t exist yet in LLMs. **We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer:** it may be too difficult to predict how the development of CL will play out to find good opportunities to positively influence that development. Differential development is one way to get around this issue, but requires a lot of caution. We begin by discussing these points in depth and making some high-level recommendations that seem robustly good despite the unpredictability of CL developments.\n\nWe then discuss concrete project ideas that fall within three broad categories:\n\n**The angles of attack we lay out below are best used as starting points for project ideation.** We aim to give concrete suggestions, but many of these are not sufficiently thought-out for us to be confident that they are important and tractable.\n\n**Projects within each of the three broad categories discussed in this post have the potential to advance capabilities.** Category 1 can also help deconfuse capabilities researchers about possible approaches without steering them to safer implementations. 
Category 2 often explicitly advances capabilities, with the intent and belief that it is via a safer method, but it is hard to develop strong confidence about this. Category 3 enables evaluating models for important capabilities that we care about in safety, but may also enable hill-climbing to improve those capabilities.\n\nIn general, this domain requires caution. Capabilities advancements that get widely adopted have much more obvious effects on AI development than most safety work, and for that reason they also have a much larger effect on the safety of frontier AI. However, the sign of that effect is often highly unclear. [Consider RLHF](https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research): it made LLMs safe and steerable enough to be widely deployed to the general public, but also likely accelerated progress toward dangerous capabilities. Similarly, the release of o1 ensured the adoption of a paradigm where models reason extensively in monitorable natural language, but again, [likely accelerated capabilities](https://x.com/davidad/status/2002403959676317774). RLHF-trained LLMs that reason in plain sight are certainly better than many possible alternatives could have been, but it is nevertheless difficult to reason about the question of whether ruling out worse alternatives was worth the cost of accelerating capabilities progress. 
Projects attempting to differentially advance safer CL implementations should be motivated by principled views on safety-capabilities trade-offs like these and have a clear story for why they either aren’t going to cause an overall acceleration in capabilities or why the acceleration is worth it.\n\n**Nevertheless, though the net effect of any given project is hard to predict, we’re fairly confident about some high-level properties CL agents should have in order to be safe.** Following the discussion in our [previous post](https://www.lesswrong.com/posts/j2zBqt7AksoEoHXNp/how-might-continual-learning-affect-safety-and-alignment), there are two properties of CL agents that have an outsized influence on our expectation of risk from them: 1) **interpretability**—whether the deployment-time memories of CL agents are stored in natural language or otherwise highly interpretable, and 2) **being easy to control**—whether humans retain a sufficient degree of control over CL updates to perform actions like rolling back an update. Ideally, [we would like future CL agents to look very similar to Claude Code](https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=tvbzyMX7MokFYkqiG): persistent memories are stored in natural language, the agent must do a lot of CoT reasoning when it uses those memories and make explicit tool calls in order to write new memories, and humans can edit the memories in any way they want.\n\nThis directly translates into our first high-level recommendation: **AI safety researchers should nudge the field toward developing and deploying more interpretable and easy-to-control CL architectures**. This can take the shape of developing those architectures with the goal of differential development, but doesn’t have to—for example, this goal can also be achieved through advocacy directed toward the broader ML community. 
We don’t expect there to be a binary choice between legible and inscrutable CL: some memories will be more efficiently stored as text, others more efficiently distilled into weights. However, we expect that for some architectures, the most efficient allocation keeps a larger fraction of memories in text than for others. Any method that shifts this optimum toward the text-based side would thus be valuable. We describe some methods that may be worth exploring below.\n\nSecond, as discussed in multiple parts of our [post on the safety effects of CL](https://www.lesswrong.com/posts/j2zBqt7AksoEoHXNp/how-might-continual-learning-affect-safety-and-alignment), **more robust character training** would be helpful against several safety concerns accompanying CL. The kinds of advances we have in mind would make it substantially more difficult to dramatically override the assistant character, even under direct fine-tuning of the weights. We have outlined some directions we’re excited about in [A List of Research Directions in Character Training](https://www.lesswrong.com/posts/6EwuCH3vZ7qvPt82k/a-list-of-research-directions-in-character-training).\n\nWe now move on to concrete project ideas, starting with deconfusion. This sequence is one attempt at conceptual work to deconfuse the field about approaches to continual learning and their safety implications. There are also important questions we largely left aside, like:\n\nGood answers to these questions would go a long way toward resolving our remaining confusions about CL, but they’re also very broad and difficult to make progress on. Below are some more concrete ways to help deconfuse the field. We include empirical work that can help increase conceptual clarity.\n\n**Empirically study realistic goal shifts.** Implement some form of CL that could induce reflective goal-formation and/or value systematization. 
Observe the resulting dynamics and study the model’s willingness to reflect on goals and how its goals change in the process (if at all). An example project is **model organisms of goal shifts**: train a model with two or more conflicting goals that are activated in different contexts, then place the model in settings where all of those goals are activated at once and the model needs to reason about and make trade-offs between them. [2] Another idea is\n\n**What constitutions are more stable upon reflection?** If future CL agents will autonomously reflect on and refine their goals, as we previously argued, it’s important to know how to make this process more predictable and steer it toward favorable convergence. By studying which constitutions are more stable upon iterated reflection and fine-tuning, we might gain insights into the kinds of value systems that will be more stable in CL agents. In the dropdown below, we paste some methodological details for reflective stability evals from [A List of Research Directions in Character Training](https://www.lesswrong.com/posts/6EwuCH3vZ7qvPt82k/a-list-of-research-directions-in-character-training).\n\nMethodological details for reflective stability evaluations\n\nTo keep things simple, the principles should probably be presented in a simple bullet-point list rather than in a long soul doc. It would be interesting to test reflective stability across several methodological choices:\n\nReflective stability can be studied either at the level of principles—for a given principle, how likely is the model to want to swap it for another one?—or at the level of entire constitutions—if a constitution gets modified iteratively, does the model eventually converge on a constitution that it no longer wants to modify? 
There are multiple ways to approach the latter question:\n\n**Explore conceptual questions on value systematization.** The following questions have been borrowed from Tim Hua’s [SPAR proposal](https://docs.google.com/document/d/1xI-73F-wFChNWMMvBe6lDrLSEc9LzQ7QwhtY9x6LCLU/edit?tab=t.0) on value systematization:\n\nGeneral deconfusion about the way we should think about goals and values and the way they are represented in LLMs would also be useful. Examples of conceptual thinking that we consider valuable include Richard Ngo’s [shortform post](https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform?commentId=EHhfmt37ZnNeLRP4P) on the formation of instrumental and terminal goals in humans and the paper [The Artificial Self: Characterising the landscape of AI identity](https://arxiv.org/abs/2603.11353) by Douglas et al.\n\n**Model organisms of ontological shifts.** Vintage LLMs such as [Talkie](https://talkie-lm.com/introducing-talkie) may provide a good testbed for studying the effects of ontological shifts on LLM goals and values. [4] If you train a vintage LLM with a knowledge cutoff before some important philosophical breakthrough that has load-bearing implications on values and then describe the breakthrough in context, is the model able to independently discover important implications of these discoveries? How suggestive do the prompts have to be to elicit such discoveries? What if you fine-tune it on information about the breakthrough instead? Are there any other interesting consequences that arise from giving the model this information? Another idea that doesn't require waiting for good vintage LLMs to be developed is\n\nAdditionally, it might be valuable for theoretically inclined researchers to do theoretical work on value stability. 
Past work in this area includes MIRI's research on [tiling agents](https://www.lesswrong.com/posts/gnxDNEtkEo3sfeyPn/tiling-agents-for-self-modifying-ai-opfai-2), which attempts to provide mathematical guarantees about when and whether an agent's values will remain stable through processes of self-modification, as well as on [Vingean reflection](https://www.lesswrong.com/w/vingean-reflection), reflective oracles, and logical induction.\n\nTo determine how much we should worry about CL in the first place and which interventions are most worth pursuing, it’s important to forecast which CL implementations are most likely to be adopted and how they will affect safety. We have tried to do so throughout this sequence, but remain quite uncertain and think that a lot more forecasting work can be done. We are especially interested in forecasting work in the following domains:\n\n**The likelihood of goal-reflection in CL agents.** Although we suspect that CL agents will at some points reflect on and perhaps re-interpret their goals (for reasons listed in our [previous post](https://www.lesswrong.com/posts/j2zBqt7AksoEoHXNp/how-might-continual-learning-affect-safety-and-alignment)), we take seriously the counterarguments that a) CL agents may reflect only on strategies rather than top-level goals and b) CL agents may never undergo large goal changes even after reflecting on goals because alignment training may give them sufficiently aligned starting motivations. We would like to see more discussion on how likely CL agents are to reflect on their goals, what conditions are likely to incentivize such reflection, and how that reflection may play out.\n\n**The likelihood of text-based and weight-based approaches to CL.** Most of the safety effects we discussed, especially those in the last-mover advantage section, are greatly alleviated if the CL agent doesn’t undergo deployment-time weight updates. 
Ideally, as Caleb Biddulph has already [argued](https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=tvbzyMX7MokFYkqiG), we would like future CL agents to look very similar to Claude Code: the persistent memories are stored in natural language, the agent must do a lot of CoT reasoning when it uses those memories and make explicit tool calls in order to write new memories, and humans can edit the memories in any way they want. In a [Twitter poll](https://x.com/herbiebradley/status/2054638782809584118) by Herbie Bradley, most respondents expect that the first AI system capable of acting as a drop-in knowledge worker will be such an agent, having CL capabilities via a markdown file database and RAG. On the other hand, Steven Byrnes [has argued](https://www.lesswrong.com/posts/ZJZZEuPFKeEdkrRyf/why-we-should-expect-ruthless-sociopath-asi?commentId=Ru2bpwgXaqwgMXrbn) that open-ended CL is impossible without weight updates. (Arguably, though, a drop-in knowledge worker doesn’t need open-ended CL in the sense Byrnes defines it.) This is a key consideration influencing our level of concern about CL, so we would like to see more discussion on which form of CL we should expect.\n\n**The likelihood of CL agents with bounded and unbounded updates in internal and external deployments.** As discussed in the [previous post](https://www.lesswrong.com/posts/j2zBqt7AksoEoHXNp/how-might-continual-learning-affect-safety-and-alignment), another substantial factor in how concerning the safety effects of CL are is whether the agent’s updates are bounded or unbounded. This depends on multiple factors: can all on-the-job learning be done with a bounded agent that cannot undergo arbitrary weight updates? Are open-source developers going to release unbounded CL agents that are substantially more capable than the bounded ones, thereby putting pressure on closed-source developers to do the same? 
Are companies going to deploy unbounded CL agents internally even if they don’t deploy them externally? What does all of this imply for where we should expect CL to pose the greatest risks? It would be useful to gain clarity about these and other similar questions.\n\n**The likelihood of shared opaque memory banks and meme propagation through them.** Do opaque memory banks have theoretical advantages over transparent, text-based ones? Can we build model organisms of within-generation memetic spread with current text-based memory systems and generalize the findings to opaque memory banks? A better understanding of memetic dynamics among OpenClaw agents would also be valuable.\n\nFinally, good evals that give an accurate overview of how CL capabilities are progressing across various axes are useful for gaining a clearer picture about the rate of progress in CL. Though we expect capabilities researchers to take care of this direction, we mention it for completeness. [Goel et al. (2026)](https://arxiv.org/abs/2605.15188) and [Asawa et al. (2026)](https://arxiv.org/abs/2606.05661) are two examples of such work.\n\nIf it’s possible to get CL agents that can solve months-long tasks with memories that are text-based or otherwise highly interpretable, whether we get text-based consolidation or weight-based consolidation/learning might be path-dependent. Once the text-based learning approach has been introduced, scaffolds will be built that assume the memories have text-based structure and that work better with such memories. Developers will find creative ways to share text-based memories in multi-agent settings and those approaches may no longer work with weight-based learning. Developers will also use the legible memories as a debugging mechanism. All of this will make it more costly to transition away from the text-based learning approach, raising the bar that a weight-based approach has to clear in order to be adopted. 
This is analogous to the situation we’re currently in with legible chain-of-thought: thanks to the [business](https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai-detecting-misbehavior-in-frontier-reasoning-models?commentId=8TKzuEKbkyPRZzftw) and [safety value](https://arxiv.org/abs/2507.11473) of legible CoTs, the capability boost a neuralese approach would have to provide to be adopted is higher than before. We thus think it’s worth working on differentially advancing interpretable CL approaches.\n\nTo keep continual learning interpretable, we would strongly prefer agents’ deployment-time learning to involve as much text-based learning and as little weight-based learning as possible. In other words, we would like continual learning to look more like writing memories to Markdown files and less like [hindsight-guided on-policy distillation](https://arxiv.org/abs/2603.10165v1). However, this doesn’t have to be an either-or decision: there’s a spectrum of methods between fully text-based and fully weight-based ones, and there may be creative ways to combine text-based and weight-based learning in a way that allows for expressive updates but doesn’t allow for radical goal drift. Below, we’ll describe some ideas that may help achieve that.\n\n**Prompt optimization for safety-critical updates.** [Biddulph and Carroll (2026)](https://www.lesswrong.com/posts/vRpLPZpmECCfxHfv6/paper-prompt-optimization-makes-misalignment-legible) introduce prompt optimization as a method for updating an LLM’s policy without modifying its weights. They extract response-reward pairs from an LLM on many training examples, but instead of a weight update, they show the examples, responses, and reward values to an \"optimizer\" LLM, which then proposes an improved system prompt that’s expected to achieve a higher reward on these training examples. This makes the strategies that the model learned straightforwardly interpretable—they’re written down in the system prompt in natural language. 
It seems implausible that all deployment-time learning could be done using prompt optimization: Biddulph and Carroll acknowledge that storing all useful strategies in a single prompt may be impractical and that prompt optimization doesn’t match the performance of weight-based RL. However, we’re excited about projects that attempt to combine prompt optimization with weight-based RL and find ways to perform the most safety-critical updates using the prompt-based approach while still getting the performance benefits of weight-based RL.\n\n**AI control methods for CL agents.** Suppose that we have access to an agent that undergoes unbounded deployment-time updates and is thus untrusted, but also to its final pre-deployment checkpoint. Call the former agent U and the latter agent T. There are multiple ways to apply AI control techniques to limit U’s influence on deployment trajectories. For example, one can imagine protocols where U is specifically trained to remain close to T’s distribution during generation, e.g. by switching the text-generating model from U to T whenever U’s cumulative KL divergence from T over the entire rollout exceeds some threshold. One might hope that this will teach U to only diverge from T at high-leverage token positions where the deployment-time memories matter the most. Another approach might be to train U to produce task-agnostic high-level instructions that summarize what it has learned during deployment, with the divergence of those instructions again being bit-limited, and to then use T to solve the problem, conditioned on the high-level instructions provided by U. We’re unsure whether either of these approaches is feasible: for example, running both U and T on every rollout may be prohibitively expensive. 
However, it seems valuable to think about similar protocols.\n\n(h/t Caleb Biddulph for bringing the two above points to our attention)\n\n**Create interpretable CL agent memory architectures.** Even if the process by which the CL agent’s memories are updated is hard to interpret, the memories themselves could be stored in an interpretable format that humans can easily inspect, edit, and delete. To ensure this, it is useful to keep in mind the following axes:\n\nIdeally, we would design CL agents that stay on the more interpretable end of each of these spectra, but improvements along any of these dimensions would be valuable.\n\nAs we discussed in our [post on safety effects](https://www.lesswrong.com/posts/j2zBqt7AksoEoHXNp/how-might-continual-learning-affect-safety-and-alignment), evaluation is complicated by the very property that makes CL agents worth worrying about: they don't stay fixed. There is some work tentatively suggesting that it might be possible to create evals that circumvent this limitation, at least to some extent, but there are also strong complications that we'll describe below.\n\n**Create evaluation frameworks and mechanisms for evaluating behavioral trajectories.** [Pacchiardi et al. (2026)](https://cl-eval.github.io/) argue that to ensure the safety of CL agents, evaluation practices need to shift from evaluating frozen snapshots to evaluating behavioural trajectories of systems that continually learn from experience. This can be operationalized as attempting to characterize the landscape of possible behaviors using *trajectory elicitation sandboxes* and to forecast how behaviour will evolve from a given set of experiences using *predictive monitors*. Pacchiardi et al. recommend the following concrete research directions:\n\nHowever, Pacchiardi et al. also note that the nature of CL agents as dynamical systems may prevent such evals from being reliable. 
**Chaotic sensitivity to states and initial conditions** breaks predictive monitors—small differences in state or input diverge exponentially. **Multiplicity of attractors** may cause elicited trajectories to represent only some of the possible behavioural profiles. A single trajectory can provide evidence about either of those obstacles, but ruling them out requires a global guarantee. We haven't thought enough about whether those issues can be circumvented; Pacchiardi et al. are hopeful that they can.\n\n**Create a benchmark that measures CL interpretability.** It is easier for the ML field to make progress on goals that can be easily measured. If it is possible to create a solid metric for the interpretability of a CL method and measure that alongside the effectiveness of the method, this could make interpretable CL easier to build and make people more likely to put effort into it, or at least more averse to degrading CL interpretability.\n\nOne counterargument to working on this project is that we may not care much about subtle differences in the interpretability of different CL approaches, but rather about step changes in interpretability between methods with purely text-based updates vs. weight-based updates with memories stored in sparse linear structures vs. weight-based updates with memories stored in nonlinear structures. Creating a benchmark that measures CL interpretability may overly emphasize the subtle differences rather than the step changes.\n\nPlease feel free to suggest additional ideas in the comment section, and let us know if you’re planning to work on any of these directions! 
We are not planning to work on any of these projects ourselves at [Aether](https://aether-ai-research.org/) in the immediate future, but would be curious to know about other efforts and are happy to provide feedback to anyone thinking about them.\n\nAs mentioned earlier in this sequence, our definition of CL can guide the operationalization of CL milestones. For example, some questions that can define a CL milestone include:\n\nThis could be achieved either through synthetic document finetuning or by performing SFT on chats with a user message and an assistant prefill like “<think>My goal is X, therefore I should …”\n\nFor example, it seems that emergent misalignment is a “simple” solution that’s easier for the model to find when fine tuned on narrowly misaligned data. Are there “simple” solutions for model values/goals/drives?\n\nTalkie is most likely too weak to display interesting behavior in such experiments, but we expect people to release better vintage LLMs in the future.", "url": "https://wpnews.pro/news/what-are-some-angles-of-attack-for-making-continual-learning-safer", "canonical_source": "https://www.lesswrong.com/posts/FKggLpnfbpbYvnjfG/what-are-some-angles-of-attack-for-making-continual-learning", "published_at": "2026-06-16 16:15:22+00:00", "updated_at": "2026-06-16 16:54:09.764173+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-agents", "machine-learning"], "entities": ["Claude Code", "RLHF", "o1"], "alternates": {"html": "https://wpnews.pro/news/what-are-some-angles-of-attack-for-making-continual-learning-safer", "markdown": "https://wpnews.pro/news/what-are-some-angles-of-attack-for-making-continual-learning-safer.md", "text": "https://wpnews.pro/news/what-are-some-angles-of-attack-for-making-continual-learning-safer.txt", "jsonld": "https://wpnews.pro/news/what-are-some-angles-of-attack-for-making-continual-learning-safer.jsonld"}}