More AI safety research and project ideas I haven't seen elsewhere

wpnews.pro

This is a follow-up to my August 2025 post.

This post contains all the AI safety research and project ideas I've had in the past 10 months which I think could be high impact. I’m sharing them in case any are helpful or generative for others. I don’t plan to pursue most of these myself as there are just too many for me to do them all justice. All of these ideas are at least somewhat novel. Many could be expanded into full essays, policy proposals, or longer-term research programmes. Also, a warning: some of them are a bit weird!

There should be a live, regularly-updated and highly visual database of AIS research questions and the latest progress on each of them*.* The nearest existing thing is the EA Midwest list-of-lists; it could be far more. Could plausibly be made and maintained mostly with AI scraping and auto-categorisation. A database like this would allow people to easily see the state of research on a given subtopic, and where there is progress to be made. I suspect that the majority of updating could be automated. AISafety.com's map separates listings into categories with an Airtable backend and a graveyard for discontinued projects, plus aisafetyfeed (where AI helps summarise, tag, and rate the novelty of content) and trecursive's tree-maps. None nail "questions + live progress on each".

Similarly, there should be a live, regularly-updated and highly visual database of proposed interventions in AI safety, which tracks how many people are working on each intervention, and roughly how much time they're on each intervention*. *This would allow us to more quantitatively assess neglectedness. Landscape maps and 80k problem profiles gesture at neglectedness, but a quantitative time-allocation tracker I couldn't find. These two projects could even be combined into one comprehensive database.

Which disciplines have had the least contact with AI safety, and might they have anything to contribute? See this Claude deep research artefact: https://claude.ai/public/artifacts/55bde7d3-2216-43ea-83f0-9857e1e48750

Are attractor basins from AI use a risk vector that curtails genuine innovation in AIS itself? One candidate antidote: deliberately spending time in layers of reality that are far from AI (eg spending time in nature) in order to tap into sources of inspiration which lack these attractor basins. Adjacent to model-collapse and LLM-homogenisation-of-thought work, but the specific application to safety-research creativity I haven't seen.

**Given the high degree of disagreement among experts regarding which AI safety interventions are most promising, would it be helpful for intervention comparisons to factor in interactions ****between ****interventions (synergies, clashes) and viability across ** broad timelines? From what I've been able to gather, these factors aren't often taken into account. For example, mechanistic interpretability and evals may be mutually reinforcing, because better interpretability can improve the design of evals while evals can help identify where interpretability work is most urgently needed. By contrast, aggressive public campaigning for a might clash with quiet institutional work if it makes policymakers or labs more defensive, though it might also expand the Overton window in ways that make moderate regulation easier. Timeline-robustness also matters: some interventions may dominate on short timelines because they can be deployed quickly, while others may only pay off on longer timelines because they require deep scientific or institutional maturation. The point would be to figure out which interventions are most synergistic with other interventions, and which remain viable across both short and long timelines. Portfolio/"defence in depth" framing is standard; formal interaction-matrices don't seem to exist.

Should frontier-AI-company employees strike, with demands of their companies committing to safety?

To whatever extent frontier AI's energy and resource consumption breaks traditional climate forecasts, how should AIS strategy adapt for the likely increased resultant geopolitical and environmental instability?

**Can recursive self-improvement be (roughly) simulated through an LLM repeatedly improving its system prompt, in order to study alignment implications? **This would not reproduce full RSI, since the model’s weights, architecture, training data, and capabilities remain fixed while only its instruction context changes. Still, prompt-level self-modification could serve as a useful toy model for studying alignment dynamics, especially how goals, constraints, failure modes, and deceptive or unstable behaviours might shift across iterative improvement cycles. Prompt-level self-improvement exists (Promptbreeder, self-refine, self-rewarding LMs); as a deliberate alignment toy-model for RSI dynamics it's less done. Reasonable small project, modest novelty.

If the world is currently getting worse (a perspective which is of course extremely subject to debate), is there an argument that postponing the singularity makes things worse? untouched. The "opposite of a long reflection": scale up today's not-completely-terrible values before they degrade. In that framing, delay is not merely caution, but an active choice to let worse norms, more brittle institutions, and more desperate incentives become the substrate from which superintelligence eventually emerges.

What are the implications of evidence on the robustness of international organisations for the international network of governmental AI safety institutes? For example, good leadership is one of the factors which seems to predict robustness regardless of circumstances. How good is the leadership of AISIs, and how could we improve it?

Has political polarisation become so strong as to mean that, contrary to the concern that AI safety should remain unpoliticised, AI safety may in fact only become salient if it becomes a partisan issue? Worth exploring risks and benefits of deliberately allying AIS with one political faction.

If states each had oracle/ASI access and could foresee likely conflict outcomes, might that force negotiation and prevent war? Eg:

Is physical AI / robotics a fundamentally different ballpark from purely digital AI for safety purposes? We cannot necessarily assume 1:1 transfer of digital-AI safety theory results to embodied systems. We might have to develop good understanding and control from some degree of first principles. It would be good and interesting to think through AIS for robotics carefully, thinking about how each aspect of AIS interacts with robots. How important is the conceptual and linguistic gap between AI safety and robotics, and how do we close it? And do we need organisations equivalent to Anthropic, Redwood etc for robotics?

Is there a risk of AIs escaping online videogames or virtual environments onto the internet? Particularly acute for villain characters who use 'evil' LLM persona to shape their dialogue and behaviour. This seems like one of the most plausible environments in which ‘evil’ LLM personas may be deliberately designed.

Leading figures in AI think we might need to simulate emotion in order for AI to be truly smart and safe. Is this possible without inducing sentience?

**Do digital minds research and brain organoid research have anything to learn from each other? **Eg can we develop a shared, substrate-neutral framework for assessing when a biological organoid or digital system becomes a possible moral patient? How can welfare-relevant states be assessed in systems that cannot verbally report pain, pleasure, distress, boredom, or preference? What would count as evidence of negative valence in a brain organoid or a digital system? Etc; these are random examples I generated with an LLM, there are probably much higher-impact intersections.

Do near-future videogames pose uniquely severe s-risks? Many (possibly millions) of NPCs in near-future videogames might run on LLMs, which may be sentient, or more advanced architectures which may be even more likely to be sentient. Videogames are possibly the only context in which AI systems might be deliberately tortured (think about how NPCs in games like Grand Theft Auto are treated!). This could suggest that videogames will be an environment particularly susceptible to becoming sites of s-risk. Brian Tomasik wrote on suffering in RL agents and video-game characters years ago; the s-risk field (CLR) owns the scale argument. But these concepts together, combined with increasing genuine possibility of sentience in advanced AI now as well as increasing experimentation with use of LLMs in NPCs, poses a novel risk vector.

source & further reading

forum.effectivealtruism.org — original article Giving scalable feedback (BC #16) How to Be Afraid of AI? Dietary defaults in Claude Opus 4.8

More AI safety research and project ideas I haven't seen elsewhere

Run your AI side-project on zahid.host