I post-trained a model to reliably roll a die

A developer post-trained a language model to roll a die, starting from the observation that models default to outputting 4 because of training-data bias. Standard reinforcement learning (GRPO) fails to encourage exploration; an entropy bonus helps but is unstable, and dividing each reward by the number of rollouts that gave the same answer finally produces a near-uniform spread over the valid rolls, highlighting a key challenge for RL-based training.

ask any llm to roll a die and it’ll almost always output 4. gpt, claude, qwen, deepseek, doesn’t matter. 4. 4. 4. 4.

when i asked claude why it does that, it responded with the following:

"LLMs don't actually generate random numbers; we predict likely tokens based on training data. When humans are asked to pick a 'random' number between 1-6, they disproportionately say 4 (it feels more random than 1, 6, or 3), so that bias gets baked into our outputs."

this is a cute party trick. but it points at something important: the model is not rolling a die, it’s picking the most likely token.

this is a problem for reinforcement learning, where the model learns from its own experience. it generates some responses, gets rewarded for the good ones, and the gradient pushes it toward whatever worked. if the model can only output 4 during training, it never finds out that 2’s ok, or that 6 would’ve been fine, or that any number 1-6 is valid. the gradient has nothing to work with. you can’t reinforce a behavior the model never tried.

exploration buys you two things. better performance: interesting solutions live off the beaten path, e.g. the non-obvious proof step, the clever tool call, the bug fix nobody’s written a Stack Overflow answer about. and better generalization, because a model that knows three ways to solve a problem has somewhere to go when the first two don’t work; one that’s memorized a single path falls off a cliff the moment it sees something novel.

rolling a die is a useful toy example. if exploration fails here, it’s not going to magically work on reasoning tasks where the right path is one of billions and most of them are wrong.

so: let’s see how bad it is, then see what we can do about it.

the setup

prompt: roll a dice. output the roll in <answer></answer> tags.
reward: 1 if the number in answer tags is between 1 & 6. 0 otherwise.
algorithm: grpo, the standard llm rl baseline. we set the group size to 6 (the model generates 6 attempts per prompt).
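the reward from the setup is small enough to write out. this is a minimal sketch, not the actual training code: the function name `dice_reward` and the regex are assumptions made for illustration, and it takes the first `<answer>` tag it finds.

```python
import re

# Hypothetical helper: pulls the first integer wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(\d+)\s*</answer>")


def dice_reward(response: str) -> float:
    """Return 1.0 if the tagged number is a valid roll (1-6), else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        # no well-formed answer tags, so no reward
        return 0.0
    roll = int(match.group(1))
    return 1.0 if 1 <= roll <= 6 else 0.0


print(dice_reward("rolling... <answer>4</answer>"))  # valid roll
print(dice_reward("<answer>7</answer>"))             # out of range
print(dice_reward("i rolled a four"))                # no tags
```

note that under this reward a 4 scores exactly as well as any other valid roll, so nothing in the reward itself asks for variety.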
to simulate a real die, the llm would have to roll each number ~1/6 of the time. let’s see if we can get there.

attempt 1: just run grpo

the model starts out rolling 4 most of the time, and over time it learns to roll 4 100% of the time.

this is expected. every time the model rolls a 4, it gets rewarded. so, from the model’s perspective, “always roll a 4” is a perfect policy. there’s really no incentive to do anything different.

attempt 2: let’s turn up the heat (temperature)

the obvious first fix: if the model is too confident in 4, make it less confident. temperature is the knob for this. at higher temperature, the model samples more randomly instead of always picking its top choice, i.e. 4.

at temp=1.5, the same collapse happens. at higher temperatures, the model starts acting a little too random 🙂

temperature only changes how we sample from the model. it doesn’t change what the model has learned. the model still thinks 4 is the right answer; we’re just occasionally overriding it with noise. and once training kicks in, any accidental non-4 rolls get reward 1 just like the 4s did, so there’s still no reason for the model to actually shift its beliefs. it keeps betting on 4.

to actually fix this, we need to change what and how the model is learning, not just how we’re getting responses from it.

attempt 3: entropy bonus

so let’s push on training directly. let’s add a loss term that penalizes the model when it’s overly confident in its answer.

the idea is simple: if the model puts 99% of its weight on “4”, punish it. if it spreads its probability more evenly across answers, reward it (or at least penalize it less). this is called an entropy bonus. entropy here just means “how spread out are the probabilities?”

    # normal training: maximize reward
    loss = -reward

    # with entropy bonus: also punish overconfidence
    probabilities = model.output_probs    # e.g. [0.01, 0.01, 0.01, 0.96, 0.01]
    entropy = -sum(p * log(p) for p in probabilities)
    # low entropy  → model is very confident (bad, we penalize this)
    # high entropy → model is spread out (good, lower penalty)
    loss = -reward - α * entropy          # α controls how much we care about diversity

in plain english: the model gets penalized not just for wrong answers, but for being too sure of itself. it’s forced to keep some probability on every option, which means it has to keep exploring.

with this penalty, the model does better and doesn’t collapse to 4. the distribution isn’t exactly uniform, but it’s better than before.

but we also notice that response lengths increase over time. when you read the outputs it’s obvious why: the model effectively “hacks” the entropy loss by padding its responses with low-probability tokens. because we compute entropy across all the output tokens, this juices entropy.

one way to get around this is to make the reward stricter: 0 reward unless the model outputs exactly <answer>{roll}</answer>. any other frivolous strings are penalized. with that, we got a result similar to the above but without response lengths increasing.

raising the difficulty: only even rolls count

the entropy bonus worked, on the easy version where every number is valid. but real environments aren’t like that. the reward structure is usually hidden and the model has to figure out what works through exploration. to stress-test this, we now reward only 2, 4, and 6. the model needs to find its way from “roll 4” to “roll even numbers.”

with this new setup, we still see the model partially collapsing. it starts out rolling 4 all the time, explores its way to 6, collapses to just rolling 6, then starts rolling 4 a lot again, and so on. it basically ping-pongs between 4 and 6 depending on the model checkpoint, but it never rolls a 2. we also notice a good amount of instability across training steps, i.e. swings in which number the model prefers.
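to put numbers on the entropy term from attempt 3, here’s a minimal sketch in plain python. it works on a single probability vector over outcomes (the real loss is computed over output tokens), and the names `entropy`, `loss_with_entropy_bonus` and the value `alpha=0.1` are assumptions for illustration, not the values used in these runs.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def loss_with_entropy_bonus(reward, probs, alpha=0.1):
    """loss = -reward - alpha * entropy (alpha is an illustrative value)."""
    return -reward - alpha * entropy(probs)


peaked = [0.01, 0.01, 0.01, 0.96, 0.01]   # the example distribution from the post
uniform = [1 / 6] * 6                     # a fair die

print(entropy(peaked))    # roughly 0.22 nats
print(entropy(uniform))   # log(6), roughly 1.79 nats

# same reward, but the spread-out distribution gets the lower loss
print(loss_with_entropy_bonus(1.0, peaked))
print(loss_with_entropy_bonus(1.0, uniform))
```

the gap between the two losses scales with `alpha`, which is the coefficient being turned up in the next experiment.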
when we increase the entropy loss coefficient to try to push the model towards more diversity, it just devolves and starts producing full gibberish.

so, while the entropy loss is promising, its instability and its inability to get close to a uniform distribution across all valid outcomes leave us wanting more.

attempt 4: other rl algorithms

maybe we need something more than vanilla grpo. we tried a bunch: dapo, cispo, and a few others. none did better than the entropy-loss baseline.

the issue isn’t the algorithm; it’s that the sampling distribution was already peaked on 4 before training started. without any feedback telling the model that diversity is valuable, it has no incentive to explore.

attempt 5: diversity bonus in the reward

instead of modifying the loss function, we can tweak the reward directly. within each group of 6 rollouts, cluster the responses by their roll and give rare rolls a bonus. if 5 of the 6 rollouts say “4” and one says “2,” the “2” gets extra credit for being unusual.

    for answer in rollouts:
        k = count(rollouts == answer)    # how many rollouts gave this answer (including this one)
        bonus = N / ((N - 1) * k)        # rare answers (small k) get a bigger bonus
        reward[answer] += bonus

when all N rollouts agree (k = N), the bonus is 1/(N-1) (very small). when an answer appears only once (k = 1), the bonus is N/(N-1), i.e. N times larger. the rarer the answer, the more it’s rewarded.

concretely, with N=6: if k=1 (only one rollout gave that answer), bonus = 6/(5×1) = 1.2. if k=6 (every rollout agreed), bonus = 6/(5×6) = 0.2. being the lone outlier is worth 6× more than going with the crowd.

with this, we see more diversity. but there’s a problem: the bonus is additive, and it applies to any rare roll, including wrong ones. when the env only rewards 2/4/6, a rare “3” can still net positive reward from the diversity bonus alone. so the model starts rolling some 1s, 3s, and 5s.
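the failure mode of the additive bonus is easy to reproduce on one group of rollouts. this is a minimal sketch under the even-only reward; the function name `diversity_bonus_rewards` and the example group are made up for illustration, and it assumes a group of at least 2 rollouts.

```python
from collections import Counter


def diversity_bonus_rewards(rolls, base_rewards):
    """Add N / ((N - 1) * k) to each rollout, where k counts rollouts with the same roll."""
    n = len(rolls)            # group size N, must be at least 2
    counts = Counter(rolls)   # k for every distinct roll
    return [
        base + n / ((n - 1) * counts[roll])
        for roll, base in zip(rolls, base_rewards)
    ]


# hypothetical group: five rollouts say 4, one says 3
rolls = [4, 4, 4, 4, 4, 3]
# even-only environment: 2, 4, 6 earn 1, everything else earns 0
base_rewards = [1.0 if roll in (2, 4, 6) else 0.0 for roll in rolls]

for roll, total in zip(rolls, diversity_bonus_rewards(rolls, base_rewards)):
    print(roll, round(total, 2))
```

in this group each correct 4 ends up at about 1.24 (1 + 6/25) while the wrong 3 ends up at 1.2 from the bonus alone, so being wrong but rare pays almost as well as being right.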
finally: divide the reward by cluster size

small change, big effect: instead of adding a bonus for rare answers, divide the reward by how many responses fell into the same cluster.

this flips the asymmetry. a wrong roll gets reward 0, and 0 divided by anything is still 0, so there’s no diversity credit for being wrong. a correct roll gets reward 1, which gets scaled down when lots of other rollouts picked the same number. so the model still has an incentive to spread across 2, 4, and 6, but zero incentive to explore into 1, 3, or 5.

    for answer in rollouts:
        # how many rollouts gave the same answer
        k = count(rollouts == answer)
        # correct answers shared by fewer rollouts score higher
        reward[answer] = base_reward[answer] / k

and it works. we get a clean, almost uniform spread across the valid rolls and very few invalid rolls.

next

with the cluster divisor strategy, we have a nice tool for getting better diversity and exploration during llm fine-tuning. but two questions remain.

1. how do you cluster? for dice, it’s easy, i.e. exact match on the number. for long-horizon agentic traces, it can be hard to tell which reasoning paths are actually novel. using an llm for clustering is an option, but that can get real costly, real fast.

2. what if the correct answer is really far from base capabilities? the divisor trick still leans on the base model stumbling into a valid answer by chance and then making it more salient. if the valid region is far, far away from where the model starts, no reward shaping is going to get you there. you need something else.

more on both of these in future posts.

the dice problem is a microcosm of a harder version: getting a reasoning model to actually search its solution space instead of replaying the paths it already knows. the cluster divisor is simple and gives us a concrete handle on that problem: a way to make the training signal itself reward exploration rather than just hoping the model stumbles into it.
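for anyone who wants to poke at the cluster divisor themselves, here’s a minimal sketch on one group of rollouts. the function name `cluster_divided_rewards` and the example group are made up for illustration, and clustering is just exact match on the roll, as in the dice setup.

```python
from collections import Counter


def cluster_divided_rewards(rolls, base_rewards):
    """Divide each rollout's base reward by the number of rollouts with the same roll."""
    counts = Counter(rolls)   # cluster size k for every distinct roll
    return [base / counts[roll] for roll, base in zip(rolls, base_rewards)]


# hypothetical group: four rollouts say 4, one says 6, one says 3
rolls = [4, 4, 4, 4, 6, 3]
# even-only environment: 2, 4, 6 earn 1, everything else earns 0
base_rewards = [1.0 if roll in (2, 4, 6) else 0.0 for roll in rolls]

for roll, reward in zip(rolls, cluster_divided_rewards(rolls, base_rewards)):
    print(roll, reward)
```

in this group each crowded 4 gets 0.25, the lone 6 keeps its full 1.0, and the 3 stays at 0 no matter how rare it is.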