# I post-trained a model to reliably roll a die

> Source: <https://castform.com/blog/rl-diversity/>
> Published: 2026-06-17 18:30:12+00:00

ask any llm to roll a die and it’ll almost always output 4. gpt, claude, qwen, deepseek, doesn’t matter. 4. 4. 4. 4.

when i asked claude why it does that, it responded with the following:

```
LLMs don't actually generate random numbers—we predict likely tokens based
on training data. When humans are asked to pick a "random" number between
1-6, they disproportionately say 4 (it feels more random than 1, 6, or 3),
so that bias gets baked into our outputs.
```

this is a cute party trick. but it points at something important: the model isn’t rolling a die, it’s picking the most likely token.

**this is a problem for reinforcement learning**, where the model learns from self-experience. it generates some responses, gets rewarded for the good ones, and the gradient pushes it toward whatever worked.

if the model can only output 4 during training, it never finds out that 2’s ok, or that 6 would’ve been fine, or that *any number 1-6 is valid*. the gradient has nothing to work with. you can’t reinforce a behavior the model never tried.

exploration buys you two things. **better performance:** interesting solutions live off the beaten path, e.g. the non-obvious proof step, the clever tool call, the bug fix nobody’s written a Stack Overflow answer about. and **better generalization:** a model that knows three ways to solve a problem has somewhere to go when the first two don’t work, while one that’s memorized a single path falls off a cliff the moment it sees something novel.

rolling a dice is a useful toy example. if exploration fails here, it’s not going to magically work on reasoning tasks where the right path is one of billions and most of them are wrong.

so: let’s see how bad it is, then see what we can do about it.

### the setup

**prompt:** roll a dice. output the roll in `<answer></answer>` tags.

**reward:** 1 if the number in answer tags is between 1 & 6. 0 otherwise.
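a minimal sketch of what a reward like this could look like. the post doesn’t show its reward code, so `dice_reward`, the regex, and the choice to tolerate whitespace inside the tags are all assumptions made for illustration.

```python
import re

# Hypothetical reward function; the names and parsing rules are assumptions.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
VALID_ROLLS = {str(face) for face in range(1, 7)}


def dice_reward(response: str) -> float:
    """Return 1.0 if the first <answer> tag holds a roll from 1 to 6, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no answer tags at all
    roll = match.group(1).strip()
    return 1.0 if roll in VALID_ROLLS else 0.0
```

note that this version happily accepts extra text around the tags, which matters later once the model starts padding its responses.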

**algorithm:** grpo, the standard llm rl baseline. we set the group size to be 6 (the model generates 6 attempts per prompt).

to simulate a real die, the llm would have to roll each number ~1/6 of the time. let’s see if we can get there.
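one way to put a number on “how close to a real die” is the total variation distance between the empirical roll distribution and the uniform one. this is a sketch, not the metric used in the post: `roll_distribution` and `tv_distance_from_uniform` are hypothetical helpers, and they assume every roll is already an integer from 1 to 6.

```python
from collections import Counter


def roll_distribution(rolls):
    """Empirical probability of each face 1..6 (assumes all rolls are in 1..6)."""
    counts = Counter(rolls)
    total = len(rolls)
    return [counts.get(face, 0) / total for face in range(1, 7)]


def tv_distance_from_uniform(rolls):
    """Total variation distance to a fair die: 0 is perfectly uniform."""
    probs = roll_distribution(rolls)
    return 0.5 * sum(abs(p - 1 / 6) for p in probs)
```

a model that always rolls 4 sits at 5/6, the largest value this distance can take for a six-sided die, and a perfectly even spread sits at 0.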

### attempt 1: just run grpo

the model starts out rolling 4 most of the time, and over the course of training it learns to roll 4 **100%** of the time.

this is expected. every time the model rolls a 4, the model gets rewarded. so, from the model’s perspective “always roll a 4” is a perfect policy. there’s really no incentive to do anything different.
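to make “no incentive” concrete, here’s a sketch of the group-relative advantage, assuming the common grpo formulation where each reward is normalized by its group’s mean and standard deviation. `group_advantages` and `eps` are names made up for this sketch, and real implementations differ in details such as which standard deviation they use.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts (assumed GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# six rollouts that all rolled 4, and six that rolled one of each face:
# both groups get the same rewards, so both get the same (zero) advantages.
all_fours = group_advantages([1.0] * 6)
one_of_each = group_advantages([1.0] * 6)

# a group where one rollout broke the format and earned 0
one_failure = group_advantages([1.0, 1.0, 1.0, 1.0, 1.0, 0.0])
```

under this assumption the only thing that produces a non-zero advantage is a rollout that fails the reward check. which valid number was rolled never shows up in the signal.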

### attempt 2: let’s turn up the heat (temperature)

the obvious first fix: if the model is too confident in 4, make it less confident. temperature is the knob for this. at higher temperature, the model samples more randomly instead of always picking its top choice i.e. 4.

at temp=1.5, the same collapse happens.

at higher temperatures, the model starts acting a little too random 🙂

temperature only changes how we sample from the model. doesn’t change what the model has learned. the model still thinks 4 is the right answer; we’re just occasionally overriding it with noise. and once training kicks in, any accidental non-4 rolls get reward 1 just like the 4s did, so there’s still no reason for the model to actually shift its beliefs. it keeps betting on 4.
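here’s a small sketch of what temperature does to a peaked distribution. the logits are made up for illustration and aren’t measured from any model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# made-up logits for faces 1..6, with a strong preference for 4
logits = [0.0, 1.0, 1.0, 5.0, 1.0, 0.0]
p_t1 = softmax_with_temperature(logits, 1.0)
p_t15 = softmax_with_temperature(logits, 1.5)
```

at temp=1.5 the probability of 4 drops, but 4 is still the top choice and `logits` itself is untouched. the flattening happens only at sampling time.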

to actually fix this, we need to change what/how the model is learning, not just how we’re getting responses from it.

### attempt 3: entropy bonus

so let’s push on training directly. let’s add a loss term that penalizes the model when it’s overly confident in its answer.

the idea is simple: if the model puts 99% of its weight on “4”, punish it. if it spreads its probability more evenly across answers, reward it (or at least penalize it less). this is called an **entropy bonus** — entropy here just means “how spread out are the probabilities?”

```
# normal training: maximize reward
loss = -reward

# with entropy bonus: also punish overconfidence
probabilities = model.output_probs()   # e.g. [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]

entropy = -sum(p * log(p) for p in probabilities)
# low entropy  → model is very confident (bad, we penalize this)
# high entropy → model is spread out    (good, lower penalty)

loss = -reward - α * entropy   # α controls how much we care about diversity
```

in plain english: the model gets penalized not just for wrong answers, but for being *too sure* of itself. it’s forced to keep some probability on every option, which means it has to keep exploring.
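a runnable version of the same idea with concrete numbers. the two distributions and the value of `alpha` are made up for illustration. the real loss operates on token-level probabilities inside the trainer, so treat this as a sketch of the arithmetic only.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def loss_with_entropy_bonus(reward, probs, alpha=0.01):
    """Negative reward minus an entropy bonus (alpha is a hypothetical value)."""
    return -reward - alpha * entropy(probs)


peaked = [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]  # almost always rolls 4
uniform = [1 / 6] * 6                          # a fair die

peaked_loss = loss_with_entropy_bonus(1.0, peaked)
uniform_loss = loss_with_entropy_bonus(1.0, uniform)
```

both policies earn the same reward, but the fair die ends up with the lower loss because its entropy is the maximum possible for six outcomes, `log(6)`.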

with this penalty, we see the model doing better and not collapsing to 4. while not exactly uniform, it’s still better than before.

but, we also notice that the response lengths increase over time. when you read the outputs it’s obvious why: the model effectively “hacks” the entropy loss by basically padding with low-probability tokens. because we compute entropy across all the output tokens, this juices entropy.
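a sketch of why padding pays off. the post doesn’t say whether per-token entropies are summed or averaged over the response, so this shows both. the per-token entropy values are made up, and `sequence_entropy` is a hypothetical helper.

```python
import statistics


def sequence_entropy(token_entropies, reduce="sum"):
    """Aggregate per-token entropies over one response."""
    if reduce == "sum":
        return sum(token_entropies)
    if reduce == "mean":
        return statistics.fmean(token_entropies)
    raise ValueError(f"unknown reduce mode: {reduce}")


# made-up per-token entropies in nats: the tags are near-deterministic,
# the roll itself carries some uncertainty
short = [0.05, 1.2, 0.05]            # <answer>, roll, </answer>
padded = [1.5, 1.6, 1.4] + short     # three high-entropy filler tokens first
```

with these numbers the padded response scores higher under both reductions, because the filler tokens carry more entropy than the near-deterministic tags.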

one way to get around this is to make the reward stricter: 0 reward unless the model outputs exactly `<answer>{roll}</answer>`. any other frivolous strings are penalized. with that, we got a result similar to the above but without response lengths increasing.
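a minimal sketch of a strict reward along these lines, using `re.fullmatch` so that anything outside the tags zeroes the reward. `strict_dice_reward` is a hypothetical name, and stripping surrounding whitespace before matching is an assumption.

```python
import re

# Hypothetical strict format: the whole response must be <answer>N</answer>.
STRICT_RE = re.compile(r"<answer>([1-6])</answer>")


def strict_dice_reward(response: str) -> float:
    """Return 1.0 only when the response is exactly one tagged roll from 1 to 6."""
    return 1.0 if STRICT_RE.fullmatch(response.strip()) else 0.0
```

since padding now costs the full reward, the length exploit stops paying for itself.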

### raising the difficulty: only even rolls count

the entropy bonus worked — on the easy version where every number is valid. but real environments aren’t like that. the reward structure is usually hidden and the model has to figure out what works through exploration. to stress-test this, we now reward only 2, 4, and 6. the model needs to find its way from “roll 4” to “roll even numbers.”

with this new setup, the model still partially collapses. it starts out rolling 4 all the time, explores its way to 6, collapses to just rolling 6, then starts rolling 4 a lot again, and so on. it’s basically ping-ponging between 4 and 6 depending on the model checkpoint, but it never rolls a 2.

we also generally notice a good amount of instability across steps in the training runs (i.e. swings across preferred numbers).

when we increase the entropy loss coefficient to try to push the model towards more diversity, it just devolves and starts producing full gibberish.

so while the entropy loss is promising, its instability and its inability to get close to a uniform distribution across all valid outcomes leave us wanting more.

### attempt 4: other rl algorithms

maybe we need something more than vanilla grpo. we tried a bunch: dapo, cispo, and a few others. none did better than the entropy-loss baseline. the issue isn’t the algorithm — it’s that the sampling distribution was already peaked on 4 before training started. without any feedback telling the model that diversity is valuable, it has no incentive to explore.

### attempt 5: diversity bonus in the reward

instead of modifying the loss function, we can try tweaking the reward.

within each group of 6 rollouts, cluster the responses by their roll and give rare rolls a bonus. if 5 of the 6 rollouts say “4” and one says “2,” the “2” gets extra credit for being unusual.

```
for i, answer in enumerate(rollouts):
    k = count(rollouts == answer)  # how many rollouts (this one included) gave the same answer
    bonus = N / ((N - 1) * k)      # rare answers (small k) get a bigger bonus
    reward[i] += bonus             # one reward per rollout, not per distinct answer
```

when all N rollouts agree (k = N), the bonus is `1/(N-1)` (very small). when an answer appears only once (k = 1), the bonus is `N/(N-1)`, i.e. exactly N times larger. the rarer the answer, the more it’s rewarded.

concretely, with N=6: if k=1 (only one rollout gave that answer), bonus = 6/(5×1) = **1.2**. if k=6 (every rollout agreed), bonus = 6/(5×6) = **0.2**. being the lone outlier is worth 6× more than going with the crowd.

with this, we see more diversity.

but there’s a problem: the bonus is additive, and it applies to any rare roll - including wrong ones. when the env only rewards 2/4/6, a rare “3” can still net positive reward from the diversity bonus alone. so the model starts rolling some 1s, 3s, and 5s.
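here’s a runnable sketch of the additive bonus applied to the even-only environment. `rewards_with_additive_bonus` and its `valid` parameter are names invented for this example, and the group of rolls is made up.

```python
from collections import Counter


def rewards_with_additive_bonus(rolls, valid=frozenset({2, 4, 6})):
    """Base reward (1 for a valid roll) plus the bonus N / ((N - 1) * k)."""
    n = len(rolls)
    counts = Counter(rolls)  # k for each distinct roll in the group
    rewards = []
    for roll in rolls:
        base = 1.0 if roll in valid else 0.0
        bonus = n / ((n - 1) * counts[roll])
        rewards.append(base + bonus)
    return rewards


# five rollouts say 4, one says 3 (invalid in the even-only environment)
rewards = rewards_with_additive_bonus([4, 4, 4, 4, 4, 3])
```

each 4 gets 1 + 6/(5×5) = 1.24, and the invalid 3 gets 0 + 6/(5×1) = 1.2. the wrong roll earns almost as much as the right ones, purely from the bonus.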

### finally: divide the reward by cluster size

small change, big effect: instead of *adding* a bonus for rare answers, *divide* the reward by how many responses fell into the same cluster.

this flips the asymmetry. a wrong roll gets reward 0, and 0 divided by anything is still 0 — no diversity credit for being wrong. a correct roll gets reward 1, which gets scaled down when lots of other rollouts picked the same number. so the model still has an incentive to spread across 2, 4, and 6, but zero incentive to explore into 1, 3, or 5.

```
for i, answer in enumerate(rollouts):
    # how many rollouts (this one included) gave the same answer
    k = count(rollouts == answer)
    # correct answers shared by fewer rollouts score higher
    reward[i] = base_reward[i] / k
```

and it works. we get a clean, almost uniform spread across the valid rolls and few invalid rolls.
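a runnable sketch of the divisor, again for the even-only environment. `rewards_divided_by_cluster` and `valid` are hypothetical names, and clustering is just exact match on the roll.

```python
from collections import Counter


def rewards_divided_by_cluster(rolls, valid=frozenset({2, 4, 6})):
    """Base reward (1 for a valid roll) divided by the size of its cluster."""
    counts = Counter(rolls)  # cluster size k for each distinct roll
    return [(1.0 if roll in valid else 0.0) / counts[roll] for roll in rolls]


# four rollouts say 4, one says 6, one says 3 (invalid)
skewed = rewards_divided_by_cluster([4, 4, 4, 4, 6, 3])

# an even spread across the valid rolls
spread = rewards_divided_by_cluster([2, 2, 4, 4, 6, 6])
```

in the skewed group each 4 gets 0.25, the lone 6 gets 1.0, and the 3 gets 0. one way to read this: every valid cluster contributes k × (1/k) = 1 to the group total, so the total reward equals the number of distinct valid rolls in the group (2 for the skewed group, 3 for the even spread). the only way to raise it is to cover more valid answers.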

### next

with the cluster divisor strategy, we have a nice tool to ensure better diversity/exploration during llm fine-tuning. but two questions remain.

**1. how do you cluster?** for dice, it’s easy: exact match on the number. for long-horizon agentic traces, it can be hard to tell which reasoning paths are actually novel. using an llm for clustering is an option, but that can get real costly, real fast.

**2. what if the correct answer is really far from base capabilities?** the divisor trick still leans on the base model stumbling into a valid answer by chance & then making it more salient. if the valid region is far far away from where the model starts, no reward shaping is going to get you there. you need something else.

more on both of these in future posts.

the dice problem is a microcosm of a harder version: getting a reasoning model to actually search its solution space instead of replaying the paths it already knows. the cluster divisor is simple and gives us a concrete handle on that problem — a way to make the training signal itself reward exploration rather than just hoping the model stumbles into it.