I post-trained a model to reliably roll a die

A developer post-trained a language model to roll a die, revealing that models default to outputting 4 due to training data bias. Standard reinforcement learning fails to encourage exploration, but adding an entropy bonus successfully taught the model to output each number with equal probability, highlighting a key challenge for RL-based training.

ask any llm to roll a dice, and it’ll almost always output 4. gpt, claude, qwen, deepseek, doesn’t matter. 4. 4. 4. 4.

when i asked claude why it does that, it responded with the following:

"LLMs don't actually generate random numbers; we predict likely tokens based on training data. When humans are asked to pick a "random" number between 1-6, they disproportionately say 4 (it feels more random than 1, 6, or 3), so that bias gets baked into our outputs."

this is a cute party trick. but it points at something important: the model is not rolling a dice, it’s picking the most likely token.

this is a problem for reinforcement learning, where the model learns from self-experience. it generates some responses, gets rewarded for the good ones, and the gradient pushes it toward whatever worked. if the model only ever outputs 4 during training, it never finds out that 2’s ok, or that 6 would’ve been fine, or that any number 1-6 is valid. the gradient has nothing to work with. you can’t reinforce a behavior the model never tried.

exploration buys you two things. better performance, because interesting solutions live off the beaten path: the non-obvious proof step, the clever tool call, the bug fix nobody’s written a Stack Overflow answer about. and better generalization, because a model that knows three ways to solve a problem has somewhere to go when the first two don’t work, while one that’s memorized a single path falls off a cliff the moment it sees something novel.

rolling a dice is a useful toy example.
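to make the toy concrete, here’s a minimal sketch of the failure mode. it is not the actual training code: a 6-way softmax stands in for the llm, the reward is assumed to be 1 for any valid face, and the advantage is assumed to be reward minus the group mean. names like `train`, `beta`, and `group_size` are made up for illustration.

```python
import math
import random

FACES = 6


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def policy_gradient(logits, rolls, rewards):
    """REINFORCE gradient w.r.t. the logits, with a group-mean baseline."""
    probs = softmax(logits)
    baseline = sum(rewards) / len(rewards)
    grad = [0.0] * FACES
    for roll, reward in zip(rolls, rewards):
        advantage = reward - baseline
        for k in range(FACES):
            indicator = 1.0 if k == roll else 0.0
            # d log pi(roll) / d logit_k = indicator - pi(k)
            grad[k] += advantage * (indicator - probs[k]) / len(rolls)
    return grad


def entropy_gradient(logits):
    """Exact gradient of the policy entropy H: dH/dz_k = -p_k * (log p_k + H)."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


def train(logits, steps=500, lr=1.0, beta=1.0, group_size=8, seed=0):
    """Gradient ascent on reward + beta * entropy; returns final probabilities."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        rolls = rng.choices(range(FACES), weights=probs, k=group_size)
        # assumed reward: any face 1-6 is valid, so every sample scores 1
        rewards = [1.0] * group_size
        pg = policy_gradient(logits, rolls, rewards)
        eg = entropy_gradient(logits) if beta else [0.0] * FACES
        logits = [z + lr * (g + beta * e) for z, g, e in zip(logits, pg, eg)]
    return softmax(logits)


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0]  # index 3 is face "4"
    print("no entropy bonus:  ", [round(p, 3) for p in train(start, beta=0.0)])
    print("with entropy bonus:", [round(p, 3) for p in train(start, beta=1.0)])
```

under these assumptions every sample gets the same reward, so every advantage is zero and the policy never moves when `beta=0.0`. with the entropy term switched on, the probabilities drift toward roughly 1/6 each. a real llm run is messier than this, but the shape of the problem is the same.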
if exploration fails here, it’s not going to magically work on reasoning tasks where the right path is one of billions and most of them are wrong.

so: let’s see how bad it is, then see what we can do about it.

the setup

prompt: roll a dice. output the roll in