{"slug": "i-post-trained-a-model-to-reliably-roll-a-die", "title": "I post-trained a model to reliably roll a die", "summary": "A developer post-trained a language model to roll a die, revealing that models default to outputting 4 due to training data bias. Standard reinforcement learning fails to encourage exploration, but adding an entropy bonus successfully taught the model to output each number with equal probability, highlighting a key challenge for RL-based training.", "body_md": "ask any llm to roll a dice, it’ll almost always output 4. gpt, claude, qwen, deepseek, doesn’t matter. 4. 4. 4. 4.\n\nwhen i asked claude why it does that, it responded with the following:\n\n```\nLLMs don't actually generate random numbers—we predict likely tokens based\non training data. When humans are asked to pick a \"random\" number between\n1-6, they disproportionately say 4 (it feels more random than 1, 6, or 3),\nso that bias gets baked into our outputs.\n```\n\nthis is a cute party trick. but it points at something important: the model is not rolling a dice, but picking the most likely token.\n\n**this is a problem for reinforcement learning**, where the model learns from self-experience. it generates some responses, gets rewarded for the good ones, and the gradient pushes it toward whatever worked.\n\nif the model can only output 4 during training, it never finds out that 2’s ok, or that 6 would’ve been fine, or that *any number 1-6 is valid*. the gradient has nothing to work with. you can’t reinforce a behavior the model never tried.\n\nexploration buys you two things. **better performance:** interesting solutions live off the beaten path e.g. the non-obvious proof step, the clever tool call, the bug fix nobody’s written a Stack Overflow answer about. 
and **better generalization**, because a model that knows three ways to solve a problem has somewhere to go when the first two don’t work - one that’s memorized a single path falls off a cliff the moment it sees something novel.\n\nrolling a dice is a useful toy example. if exploration fails here, it’s not going to magically work on reasoning tasks where the right path is one of billions and most of them are wrong.\n\nso: let’s see how bad it is, then see what we can do about it.\n\n### the setup\n\n**prompt:** roll a dice. output the roll in `<answer></answer>` tags.\n\n**reward:** 1 if the number in answer tags is between 1 & 6. 0 otherwise.\n\n**algorithm:** grpo, the standard llm rl baseline. we set the group size to be 6 (the model generates 6 attempts per prompt).\n\nto simulate real dice, the llm would have to roll each number ~1/6 of the time. let’s see if we can get there.\n\n### attempt 1: just run grpo\n\nthe model starts out rolling 4 most of the time, and over time it learns to just roll 4 **100%** of the time.\n\nthis is expected. every time the model rolls a 4, the model gets rewarded. so, from the model’s perspective “always roll a 4” is a perfect policy. there’s really no incentive to do anything different.\n\n### attempt 2: let’s turn up the heat (temperature)\n\nthe obvious first fix: if the model is too confident in 4, make it less confident. temperature is the knob for this. at higher temperature, the model samples more randomly instead of always picking its top choice i.e. 4.\n\nat temp=1.5, the same collapse happens.\n\nat higher temperatures, the model starts acting a little too random 🙂\n\ntemperature only changes how we sample from the model. it doesn’t change what the model has learned. the model still thinks 4 is the right answer; we’re just occasionally overriding it with noise. and once training kicks in, any accidental non-4 rolls get reward 1 just like the 4s did, so there’s still no reason for the model to actually shift its beliefs. 
it keeps betting on 4.\n\nto actually fix this, we need to change what/how the model is learning, not just how we’re getting responses from it.\n\n### attempt 3: entropy bonus\n\nso let’s push on training directly. let’s add a loss term that penalizes the model when it’s overly confident in its answer.\n\nthe idea is simple: if the model puts 99% of its weight on “4”, punish it. if it spreads its probability more evenly across answers, reward it (or at least penalize it less). this is called an **entropy bonus** — entropy here just means “how spread out are the probabilities?”\n\n```\n# normal training: maximize reward\nloss = -reward\n\n# with entropy bonus: also punish overconfidence\nprobabilities = model.output_probs()   # e.g. [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]\n\nentropy = -sum(p * log(p) for p in probabilities)\n# low entropy  → model is very confident (bad, we penalize this)\n# high entropy → model is spread out    (good, lower penalty)\n\nloss = -reward - α * entropy   # α controls how much we care about diversity\n```\n\nin plain english: the model gets penalized not just for wrong answers, but for being *too sure* of itself. it’s forced to keep some probability on every option, which means it has to keep exploring.\n\nwith this penalty, we see the model doing better and not collapsing to 4. while not exactly uniform, it’s still better than before.\n\nbut, we also notice that the response lengths increase over time. when you read the outputs it’s obvious why: the model effectively “hacks” the entropy loss by basically padding with low-probability tokens. because we compute entropy across all the output tokens, this juices entropy.\n\none way to get around this is to make the reward stricter: 0 reward unless the model outputs exactly `<answer>{roll}</answer>`. any other frivolous strings are penalized. 
with that, we got a result similar to the above but without response lengths increasing.\n\n### raising the difficulty: only even rolls count\n\nthe entropy bonus worked — on the easy version where every number is valid. but real environments aren’t like that. the reward structure is usually hidden and the model has to figure out what works through exploration. to stress-test this, we now reward only 2, 4, and 6. the model needs to find its way from “roll 4” to “roll even numbers.”\n\nnow with this new setup, we still see the model somewhat collapsing. it starts with rolling 4 all the time, explores its way to 6, collapses to just rolling 6 and then starts rolling 4 a lot again, etc. basically ping-ponging between 4 and 6 depending on the model checkpoint. but never rolls a two.\n\nwe also generally notice a good amount of instability across steps in the training runs (i.e. swings across preferred numbers).\n\nwhen we increase the entropy loss coefficient to try to push the model towards more diversity, it just devolves and starts producing full gibberish.\n\nso, while the entropy loss option is promising → its instability + inability to really get close to a good uniform distribution across all outcomes leaves us wanting more.\n\n### attempt 4: other rl algorithms\n\nmaybe we need something more than vanilla grpo. we tried a bunch: dapo, cispo, and a few others. none did better than the entropy-loss baseline. the issue isn’t the algorithm — it’s that the sampling distribution was already peaked on 4 before training started. without any feedback telling the model that diversity is valuable, it has no incentive to explore.\n\n### attempt 5: diversity bonus in the reward\n\ninstead of modifying the loss function directly, we can try tweaking the reward directly.\n\nwithin each group of 6 rollouts, cluster the responses by their roll and give rare rolls a bonus. 
if 5 of the 6 rollouts say “4” and one says “2,” the “2” gets extra credit for being unusual.\n\n```\nfor answer in rollouts:\n    k = count(rollouts == answer)  # how many rollouts (this one included) gave the same answer\n    bonus = N / ((N - 1) * k)     # rare answers (small k) get a bigger bonus\n    reward[answer] += bonus\n```\n\nwhen all N rollouts agree (k = N), the bonus is `1/(N-1)` (very small). when an answer appears only once (k = 1), the bonus is `N/(N-1)`, i.e. exactly N times larger. the rarer the answer, the more it’s rewarded.\n\nconcretely, with N=6: if k=1 (only one rollout gave that answer), bonus = 6/(5×1) = **1.2**. if k=6 (every rollout agreed), bonus = 6/(5×6) = **0.2**. being the lone outlier is worth 6× more than going with the crowd.\n\nwith this, we see more diversity.\n\nbut there’s a problem: the bonus is additive, and it applies to any rare roll - including wrong ones. when the env only rewards 2/4/6, a rare “3” can still net positive reward from the diversity bonus alone. so the model starts rolling some 1s, 3s, and 5s.\n\n### finally: divide the reward by cluster size\n\nsmall change, big effect: instead of *adding* a bonus for rare answers, *divide* the reward by how many responses fell into the same cluster.\n\nthis flips the asymmetry. a wrong roll gets reward 0, and 0 divided by anything is still 0 — no diversity credit for being wrong. a correct roll gets reward 1, which gets scaled down when lots of other rollouts picked the same number. so the model still has an incentive to spread across 2, 4, and 6, but zero incentive to explore into 1, 3, or 5.\n\n```\nfor answer in rollouts:\n    # how many rollouts gave the same answer\n    k = count(rollouts == answer)\n    # correct answers shared by fewer rollouts score higher\n    reward[answer] = base_reward[answer] / k\n```\n\nand it works. 
we get a clean, almost uniform spread across the valid rolls and very few invalid rolls.\n\n### next\n\nwith the cluster divisor strategy, we have a nice tool to ensure better diversity/exploration during llm fine-tuning. but two questions remain.\n\n**1. how do you cluster?** for dice, it’s easy i.e. exact match on the number. for long-horizon agentic traces, it can be hard to distinguish between actually novel reasoning paths. using an llm for clustering is an option - but that can get real costly, real fast.\n\n**2. what if the correct answer is really far from base capabilities?** the divisor trick still leans on the base model stumbling into a valid answer by chance & then making it more salient. if the valid region is far far away from where the model starts, no reward shaping is going to get you there. you need something else.\n\nmore on both of these in future posts.\n\nthe dice problem is a microcosm of a harder version: getting a reasoning model to actually search its solution space instead of replaying the paths it already knows. the cluster divisor is simple and gives us a concrete handle on that problem — a way to make the training signal itself reward exploration rather than just hoping the model stumbles into it.", "url": "https://wpnews.pro/news/i-post-trained-a-model-to-reliably-roll-a-die", "canonical_source": "https://castform.com/blog/rl-diversity/", "published_at": "2026-06-17 18:30:12+00:00", "updated_at": "2026-06-17 18:53:38.931283+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-research"], "entities": ["Claude", "GPT", "Qwen", "DeepSeek", "GRPO"], "alternates": {"html": "https://wpnews.pro/news/i-post-trained-a-model-to-reliably-roll-a-die", "markdown": "https://wpnews.pro/news/i-post-trained-a-model-to-reliably-roll-a-die.md", "text": "https://wpnews.pro/news/i-post-trained-a-model-to-reliably-roll-a-die.txt", "jsonld": "https://wpnews.pro/news/i-post-trained-a-model-to-reliably-roll-a-die.jsonld"}}