# Some Thoughts on Bengio's Scientist AI

> Source: <https://www.lesswrong.com/posts/382m7ATXELt5eBgr3/some-thoughts-on-bengio-s-scientist-ai>
> Published: 2026-05-26 03:05:43+00:00

*Epistemic Status: I wrote this for an application then realized it might be of interest to others or spark a conversation. Yoshua Bengio and **LawZero** are important players in AI Safety, so I think we should have a conversation about their ideas.*

I have two substantial concerns with [Yoshua Bengio’s Scientist AI](https://arxiv.org/abs/2502.15657). One is that it fails to think through the consequences of success, and will fall into the same kind of alignment failures as agentic AI. A second is that Bengio’s method for making a scientist AI would fall short for both practical and theoretical reasons.

Even leaving aside some of his philosophically difficult claims, such as wanting to make an AI that can't model itself but can still predict the world, the scientist AI as described would be unsafe. If someone asked, "How can cancer be cured?", it would output some sequence of steps, and those steps could involve making an agentic AI to cure cancer. This is the kind of argument Yudkowsky made in blog posts over ten years ago on why tool AI is not a solution to alignment.

Practically speaking, intelligently seeking out information in an agentic way tends to be the best way to do science. Stopping your scientist AI from doing that weakens it. By handicapping their scientist AI, Bengio’s team will always be behind labs that allow their AI (scientist or not) to explore. Even worse, there are strong theoretical reasons to expect understanding the world to be impossible without taking actions. This is related to the field of causal inference, pioneered by figures such as Judea Pearl. “Correlation does not imply causation” is a common mantra for a reason. A correlation is evidence for causation, reverse causation, a common cause, or selection bias, and by itself it does not say which. Without causal assumptions or taking actions, it is simply not possible to deduce the correct causal model. Reinforcement learning, the paradigm that Bengio correctly identifies as the thing that puts us most at risk, is also the training paradigm that allows AIs to learn new causal models. This tension exists throughout the paper. Bengio often writes about wanting to train the model for causal modeling, but he also says that his plan is based on conditional probabilities, which are inherently associative. No matter how much data he gives the model during training, this fact remains.
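To make the causal inference point concrete, here is a minimal sketch of my own (not something from Bengio's paper). It assumes a toy world with two binary variables and made-up probabilities, and compares two hypothetical causal models: one where X causes Y and one where Y causes X. The function names are illustrative only. Both models produce exactly the same joint distribution, so no amount of passive observation can separate them, yet they disagree about what happens when you intervene and set X yourself.

```python
from fractions import Fraction as F

# Toy numbers, chosen only for illustration.
P_CAUSE = F(1, 2)  # P(cause = 1) in either model
# COND[cause][effect] = P(effect | cause)
COND = {
    0: {0: F(9, 10), 1: F(1, 10)},
    1: {0: F(1, 10), 1: F(9, 10)},
}


def joint_x_causes_y():
    """Joint P(x, y) when X is the cause: P(x) * P(y | x)."""
    return {(x, y): P_CAUSE * COND[x][y] for x in (0, 1) for y in (0, 1)}


def joint_y_causes_x():
    """Joint P(x, y) when Y is the cause: P(y) * P(x | y)."""
    return {(x, y): P_CAUSE * COND[y][x] for x in (0, 1) for y in (0, 1)}


def p_y1_given_x1(joint):
    """Observational quantity P(Y=1 | X=1), read off the joint."""
    return joint[(1, 1)] / (joint[(1, 0)] + joint[(1, 1)])


def p_y1_do_x1(model):
    """Interventional quantity P(Y=1 | do(X=1)) under each causal model."""
    if model == "x_causes_y":
        # Setting X feeds through the mechanism X -> Y.
        return COND[1][1]
    if model == "y_causes_x":
        # Setting X cuts the arrow Y -> X, so Y keeps its own marginal.
        return P_CAUSE
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    print("same joint:", joint_x_causes_y() == joint_y_causes_x())
    print("P(Y=1 | X=1):", p_y1_given_x1(joint_x_causes_y()))
    print("P(Y=1 | do(X=1)), X causes Y:", p_y1_do_x1("x_causes_y"))
    print("P(Y=1 | do(X=1)), Y causes X:", p_y1_do_x1("y_causes_x"))
```

A learner that only ever sees samples of (X, Y) gets identical data from both worlds. It has to either assume a causal direction or actually set X and watch Y.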

There are other, possibly insurmountable, obstacles to the plan as well. Constructing a formal language that maps to reality is extremely difficult if not impossible, and researchers have been trying to do it for decades. It would also be hard to train such a model without human data, but if you train on human data, you're back at pre-training an AI which can play a potentially malicious character.

Despite my criticisms, there are good aspects of the paper. Their short-term plan is something I would like to see: if I’m reading them correctly, it is fine-tuning an LLM to be good at hypothesizing what might go wrong with a user’s request. That is practically useful and can also make the system safer to use. Characterizing risky agentic AI systems by their “affordances, goal-directedness, and intelligence” is a good idea, even if enough intelligence can trivially lead to the other two. I will work “anytime preparedness” into my plans, so that I can contribute in both short and long timelines. I have done something similar, by doing both fieldbuilding (long term payoff) and research (short term payoff if the research is impactful), and I appreciate the handle.

I have a lot of respect for Yoshua Bengio, but his scientist AI plan is not doable for reasons that I'm sure other AI safety people have tried telling him before. Still, I'm glad that he is working on AI safety, and I expect his group to contribute something valuable, even if the plan as written can't work.
