The best language models are still getting smarter and more capable. To an increasing degree, this is because they are trained by Reinforcement Learning with Verifiable Rewards. Chain of thought reasoning allows models to evade the finite depth restriction on information flow by passing (relatively little) information back into the first layers of the model through the token stream. Although pretraining was already enough to produce decently-good chains of thought by pure imitation, RLVR allows further optimization of the chain of thought by rewarding those chains of thought that result in a verifiably correct answer. Because tasks that are far too difficult for humans can still be possible to grade, RLVR provides a way of bootstrapping to capabilities far beyond the human level.
What kinds of things can be turned into verifiable rewards? Most obviously, coding tasks where unit tests (and perhaps performance tests) determine whether a solution is correct. In a similar vein, writing mathematical proofs in a formal proof language like Lean or Agda. We can also use almost any RL environment that is easy enough for the model to interact with. For example, the model could play a text adventure game with rewards for obtaining certain items or reaching certain rooms.
We'll call this class of problem "exactly graded", because the reward is possible to evaluate without error. Note that the problem statement given to the model as context need not be exact at all. We can ask the model to write its code with only an informal problem description to guide it, but still grade its solutions by running unit tests.
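As a minimal sketch of what "exactly graded" can look like in practice, the snippet below scores a candidate solution by running it against unit tests. The function name `grade_solution`, the test format, and the toy `add` task are all hypothetical, and a real grader would need to sandbox the untrusted code rather than call `exec` directly.

```python
from typing import List, Tuple


def grade_solution(source: str, func_name: str,
                   tests: List[Tuple[tuple, object]]) -> float:
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    namespace: dict = {}
    try:
        # Assumption: no sandboxing here; real graders must isolate this step.
        exec(source, namespace)
        candidate = namespace[func_name]
        for args, expected in tests:
            if candidate(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions and crashes all count as failures.
        return 0.0
    return 1.0


# The model only sees an informal prompt such as "write add(a, b)";
# the tests stay hidden and are used purely for grading.
tests = [((1, 2), 3), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
reward_good = grade_solution(good, "add", tests)
reward_bad = grade_solution(bad, "add", tests)
```

The reward is binary and involves no learned component, which is what makes it hard to game short of attacking the test harness itself.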
Another class of problem is where the reward is subjective, or just difficult to compute. An extreme example of this might be asking the model to invent a joke. While this is the kind of question that could benefit from chain of thought reasoning (coming up with a really funny joke can require a lot of thought and iterating on ideas), it's very difficult to automatically evaluate whether a joke is funny. In addition, the training corpus contains few examples where people verbalize their process for coming up with jokes, so purely relying on pretraining to give us reasoning traces to imitate won't get us very far. We'll call this class of problem "inexactly graded".
One option here is to get humans to provide the grades. Too many rollouts are generated during RL for humans to label every one, so we usually train a reward model as an intermediary. Human graders assign rewards to a diverse set of rollouts (perhaps mixed with manually created correct answers). Then we train a reward model on this dataset of rewards. The loss function here is simple mean-squared error. Then we do our actual RL optimization, where the reward model assigns a reward to each of the large number of rollouts produced by our model during training.
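To make the reward-model step concrete, here is a toy sketch that fits a linear reward model to human-assigned scores by gradient descent on mean-squared error. The two-dimensional feature vectors, the scores, and the function names are invented for illustration; an actual reward model would be a neural network reading the rollout text.

```python
def predict_reward(w, b, x):
    """Linear reward model: r(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mse(w, b, features, human_rewards):
    """Mean-squared error between predicted and human-assigned rewards."""
    errs = [predict_reward(w, b, x) - r for x, r in zip(features, human_rewards)]
    return sum(e * e for e in errs) / len(errs)


def train_reward_model(features, human_rewards, lr=0.1, epochs=2000):
    """Fit w and b by full-batch gradient descent on the MSE loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, r in zip(features, human_rewards):
            err = predict_reward(w, b, x) - r
            for i in range(dim):
                grad_w[i] += 2.0 * err * x[i] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical rollout features and the scores human graders gave them.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
human_rewards = [1.0 * x0 - 0.5 * x1 + 0.2 for x0, x1 in features]
w, b = train_reward_model(features, human_rewards)
```

During RL, `predict_reward` would stand in for the human graders on every rollout, which is exactly where the learner gets an opening to exploit the model's errors.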
One problem with training a reward model as an intermediary is that it opens the door to the learner finding ways to trick the reward model into assigning a higher reward than it should. This can happen in subtle ways, making it hard for AI researchers to reliably notice and fix these issues. Also, even the smaller amount of human labelling required to train a reward model can still be expensive.
So, exactly graded problems:
- They are relatively difficult to cheat.
- However, a wide variety of problems, including most alignment-flavoured problems, are not exactly graded.
- In general, training on these can make your model smarter, but will not make it more friendly.
Inexactly graded problems:
- Optimizing the chain of thought requires some kind of automated grader.
- But the subjectivity of grading makes the grader subject to adversarial optimization. Cheating strategies like prompt-injecting the grader to get a higher score are incentivized.
- We need to manually supply labels, but the supervision they provide is sparse: just a bunch of scalars.
So for inexactly graded problems, figuring out how to train chain-of-thought reasoning is hard. What would be really nice is if we could just make training chain-of-thought as easy as supervised fine-tuning. Instead of obtaining a diverse dataset of various kinds of answers (both good and bad), carefully grading those answers, and then training a reward model, supervised fine-tuning just asks for a dataset of answers that are known to be good. (Yes, this does kind of destroy the appealing [1] "train far beyond human intelligence" part of doing RL. Though this strikes me as less important for inexactly graded problems anyway. And it's not like RL is a
The problem with just treating this as a supervised fine tuning problem is that, while we want the model to reason before answering because it improves performance, we can't differentiate through the chain of thought. The random selection of tokens in the chain of thought, besides destroying a lot of information contained in the activations, is non-differentiable.
Except that if the chain of thought is neuralese, then we actually can differentiate it!
Usually, neuralese is considered a bad development from the standpoint of alignment, because raw activation vectors are much less interpretable than an approximately-English reasoning trace. This is true. But there are other important alignment properties besides interpretability.
The nice thing about a pure neuralese reasoning trace is that it is completely differentiable. In standard supervised fine-tuning, we differentiate through our model in order to increase the probability it assigns to each training example. If the model now generates a neuralese reasoning trace, then we can do the exact same thing, except that we must differentiate through many forward passes of the model, joined into a chain by their neuralese vectors. The googleable term for this is backpropagation through time.
So a nice recipe for learning an inexactly graded problem is:
- The dataset is a bunch of examples of good answers.
- The loss function is the exact same as regular SFT (namely, cross-entropy).
- The only difference is that models do many steps of neuralese reasoning before answering, and we backpropagate through the entire reasoning chain.
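The recipe above can be illustrated with a deliberately tiny sketch: a scalar "latent thought" is updated for several steps, an answer probability is read out at the end, and the cross-entropy gradient is carried back through every step by hand. The three-parameter recurrence, the toy dataset, and all names are assumptions chosen so the example runs with the standard library alone. A real model would use vector-valued latents and an autodiff framework instead of the manual backward pass.

```python
import math


def forward(params, x, steps):
    """Run `steps` latent reasoning steps, then compute the answer logit."""
    w_h, w_x, w_o = params
    hs = [0.0]  # hs[t] is the latent state after t reasoning steps
    for _ in range(steps):
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))
    return hs, w_o * hs[-1]


def cross_entropy(logit, y):
    """Binary cross-entropy on a logit, written in a numerically stable form."""
    signed = logit if y == 1 else -logit
    return math.log1p(math.exp(-signed))


def loss_and_grads(params, x, y, steps):
    """Cross-entropy loss and its gradient via backpropagation through time."""
    w_h, w_x, w_o = params
    hs, logit = forward(params, x, steps)
    d_logit = 1.0 / (1.0 + math.exp(-logit)) - y  # dL/dlogit = p - y
    g_o = d_logit * hs[-1]
    d_h = d_logit * w_o  # gradient flowing into the final latent state
    g_h = g_x = 0.0
    for t in range(steps, 0, -1):  # walk the reasoning chain backwards
        d_pre = d_h * (1.0 - hs[t] ** 2)  # derivative of tanh
        g_h += d_pre * hs[t - 1]
        g_x += d_pre * x
        d_h = d_pre * w_h  # pass the gradient to the previous step
    return cross_entropy(logit, y), (g_h, g_x, g_o)


def total_loss(params, data, steps):
    return sum(cross_entropy(forward(params, x, steps)[1], y) for x, y in data)


def train(data, steps=4, lr=0.1, epochs=200):
    """Plain SGD on (question, good answer) pairs; no reward model involved."""
    params = [0.5, 0.5, 0.5]
    for _ in range(epochs):
        for x, y in data:
            _, grads = loss_and_grads(params, x, y, steps)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params


data = [(1.0, 1), (-1.0, 0)]  # hypothetical question/answer pairs
trained = train(data)
```

The training loop only ever sees good answers and a cross-entropy loss, as in ordinary SFT. The intermediate latent states receive gradient without anyone grading them.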
To be clear, backpropagation through time is not a recent idea at all. It goes all the way back to RNNs. Or see here for a modern version. But because of its ability to train problems with inexact grading, I claim this recipe should be getting more attention as a path to better-aligned models than we currently give it.
Agents not only reason in chains, but also repeatedly use tools and interact with their environment to achieve their goals. Training models to be good agents will surely require RL. If we want to train agents to perform inexactly graded tasks, it doesn't seem we can avoid the need to train reward models. But I still think it will be helpful if we can simply backpropagate through the parts of these rollouts that are pure reasoning, instead of treating each reasoning token as an RL action. Reduce the effective trajectory length by reducing the number of things that count as actions, basically. Because the latent reasoning trace is not graded and so is hidden from the reward model, I also expect this reduces the surface area for hacking the reward model.
Perhaps there is a way to get the best of both worlds: the interpretability of token-based reasoning and the simple example-based training of SFT or neuralese-SFT?
I think there are some things that can be done here, but they strike me as less simple and less likely to work than the neuralese path. I have already made my most important points, so you can skip this section, as it will be fairly unpolished.
Let $x$ be the context, including the question posed to our model. Let $y$ be the answer produced by the model, while $y^*$ denotes a ground-truth answer from the training data. (Note that it's not necessarily wrong if any given $y$ is very different from $y^*$. We just want them to be sampled from the same distribution conditional on $x$.) Let $z$ be the reasoning trace the model used in coming up with its answer. Let $\theta$ denote the parameters of our learner. During inference, we sample in the following order: $z \sim p_\theta(z \mid x)$, then $y \sim p_\theta(y \mid x, z)$. During supervised fine-tuning, we have access to question-answer pairs $(x, y^*)$ but no reasoning trace $z$ linking the two. We'd like a training process that improves the distribution $p_\theta(z \mid x)$ somewhat, where "improve" means that it makes the distribution $p_\theta(y \mid x)$ closer to the empirical distribution of $y^*$ given $x$.
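In symbols, with $x$ the context, $z$ the reasoning trace, $y^*$ the ground-truth answer and $\theta$ the learner's parameters, the difficulty can be made explicit. This is a sketch of the standard latent-variable view of the setup, not a formula taken from any particular method:

```latex
p_\theta(y^* \mid x) \;=\; \sum_{z} p_\theta(z \mid x)\, p_\theta(y^* \mid x, z),
\qquad
\mathcal{L}(\theta) \;=\; -\log p_\theta(y^* \mid x).
```

The sum ranges over every possible token-level reasoning trace, so the SFT loss cannot be evaluated exactly, and estimating it by sampling a discrete $z$ leaves no gradient path through the sampled tokens.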
The basic idea here is to train a helper model that tells us which reasoning traces make a given answer more likely to be produced.
Consider the following two-stage process:
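1. Initialize the helper's parameters $\phi$ from $\theta$. Train a helper model $p_\phi(z \mid x, y)$ that predicts the reasoning trace $z$ from the context $x$ and the answer $y$. (The simplest setup is just to concatenate $x$ with $y$ and generate forwards from there.) Training data for this process comes from computing rollouts using $p_\theta$.
2. Initialize the learner from $\theta$. We'll now train the learner's trace distribution $p_\theta(z \mid x)$. Given a question-answer pair $(x, y^*)$, we sample many reasoning traces $z \sim p_\theta(z \mid x)$. We optimize with policy gradient, where the reward for a given trace is built from the helper's log-probability $\log p_\phi(z \mid x, y^*)$.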
This can be broken into a sum of token-level terms. We can add a KL penalty to prevent ourselves from diverging too quickly from a reasonable distribution here.
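One concrete reward with this token-level structure, stated here as an assumption rather than the definitive choice, is the log-ratio $\log p_\phi(z \mid x, y^*) - \log p_\theta(z \mid x)$: how much more likely the helper finds the trace once it knows the answer. By Bayes' rule, if the helper were the exact posterior, this would equal $\log p_\theta(y^* \mid x, z) - \log p_\theta(y^* \mid x)$. The sketch below takes hypothetical per-token log-probabilities as plain lists and shows the per-token terms, the trace-level sum, and a mean baseline for the policy-gradient weights.

```python
def token_rewards(helper_logps, policy_logps):
    """Per-token reward: helper log-prob minus reference policy log-prob."""
    if len(helper_logps) != len(policy_logps):
        raise ValueError("both models must score the same tokens")
    return [h - p for h, p in zip(helper_logps, policy_logps)]


def trace_reward(helper_logps, policy_logps):
    """Trace-level reward is the sum of the token-level terms."""
    return sum(token_rewards(helper_logps, policy_logps))


def advantages(rewards):
    """Subtract the mean reward across sampled traces as a simple baseline."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical log-probs for two sampled traces of two tokens each.
r1 = trace_reward([-1.0, -0.5], [-2.0, -1.0])  # helper prefers this trace
r2 = trace_reward([-2.0, -1.5], [-2.0, -1.0])  # helper slightly disprefers it
adv = advantages([r1, r2])
```

Each trace's log-probability gradient under the learner would then be scaled by its entry in `adv`. A KL penalty against the starting distribution could be added as a further per-token term.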
Besides being complicated, this also reintroduces an intermediate model (the helper, with parameters $\phi$) that can potentially be gamed. But it is probably still worth trying, especially if one places a high importance on interpretability of chains of thought.
I asked ChatGPT about the history of this kind of technique and it found the following papers:
https://arxiv.org/pdf/2312.02179 The authors use Markov chain Monte Carlo to sample chains of thought $z$ conditional on the ground-truth answer $y^*$ instead of training a helper model. They directly train their learner to imitate the $z$ sampled in this way instead of doing RL.
https://arxiv.org/pdf/2601.09260 This is a really smart paper. Here the latent $z$ is a continuous-valued vector, rather than a token-stream chain of thought, and we have a decoder that produces the answer from $z$. The thinking process here is that $z$ is repeatedly updated according to a learned velocity field. The velocity field is optimized to point in the direction that most increases the probability of the actual answer $y^*$, relative to its probability at the current value of $z$. This is latent reasoning (basically neuralese!), not natural-language reasoning, and so does not achieve the goal of easy interpretability. But on the plus side, it is a much cleaner idea than the token-based scheme described above.
https://arxiv.org/pdf/2602.14469 Yep. Because we condition on the true answer $y^*$ when generating the reasoning traces $z$ to train on, post-hoc justification is incentivized, which is bad. IDK about their proposed solution though.
The method I describe above trains by perturbing on-policy reasoning chains towards those that are more likely to generate the supplied answer, rather than updating on some confabulated reasoning trace $z$ by cross-entropy. So I predict it would work better than the 2023 paper because of that. It looks worse than flow reasoning, but flow reasoning is uninterpretable, and it's unclear to me which of the two uninterpretable techniques (regular neuralese and flow reasoning) has the advantage.
Neuralese will overall be a good thing for alignment.
[1] From the perspective of labs with little concern about existential risk from AI.