Training Language Models for Controlled Stochasticity

Pseudo-random generation solved the problem of sampling from mathematical distributions on computers. Faced with a similar need in natural language, it feels natural to simply ask an LLM to "Name a random city", but this leads to the disappointing discovery that language models are heavily biased and suffer from mode collapse. For instance, when Qwen3 is asked to pick a random weekday, it chooses Wednesday about 80% of the time. Gemma-3, when naming cities, gives roughly 75% of its answers as just four cities. Additionally, when tasked with generating multiple-choice questions, models frequently position the correct answer as option C. While recent models may have improved stylistically, with less generic phrasing and more variable sentence and paragraph lengths, their outputs still concentrate on extremely narrow subsets of the possible answers.

This is clearly a failure we’ve largely overlooked. During training, models receive no incentive to spread probability mass beyond the most likely tokens, so they collapse onto narrow modes even in settings where genuine diversity is needed: synthetic data generation, creative tasks, and evaluation benchmarks. As synthetic data becomes an increasingly important component of training pipelines, these sampling biases risk being amplified in future models.

Recent work confirms this formally. Zhao et al. showed that the empirical CDFs are heavily skewed when models are asked to sample a number from a known mathematical distribution. Gu et al. documented the MCQ position bias and also illustrated why stochastic behaviours are crucial for agentic models. One recently proposed workaround (Misaki and Akiba) asks LLMs to first output a random string and then manipulate it to generate the final output. This inference-time technique has been shown to improve randomness and remove bias in simple settings, but it is expensive and has been shown to fail as task complexity increases.

While prior work treats poor sampling as an inference-time problem, we wondered whether the behavior itself could be learned. If we explicitly train a model against known probability distributions, can it internalize better stochastic behavior? We evaluate this question in three dimensions. First, distributional fidelity: do the model's samples and next-token probabilities actually match the target distribution, including for unseen distribution families and parameter settings? Second, transfer: does learning to sample from mathematical distributions improve behavior in natural-language settings where the desired distribution is only implicit, such as choosing a random city, balancing multiple-choice answer positions, or generating diverse text? Third, retention: what capabilities, if any, are lost in the process?

We train models on minimal prompts:

Generate exactly ONE random number from a [distribution] distribution with parameters [params]. Output ONLY the number.

The benchmark includes 30 distribution families: 24 seen during training and 6 held-out OOD families reserved for test-time evaluation (Bernoulli, Poisson, Maxwell, TruncNorm, Chi, Weibull). Continuous distributions are discretized; discrete distributions use truncated support.
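To make "discretized" and "truncated support" concrete, here is a minimal sketch of how a target distribution over canonical output strings could be built. The grid resolution, the bounds, and the function names are illustrative assumptions, not the benchmark's actual settings.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def discretize_normal(mu, sigma, lo, hi, decimals=1):
    """Assumed scheme: put the mass of each bin on a decimal grid point,
    then renormalize over [lo, hi]. Returns {canonical_string: probability}."""
    scale = 10 ** decimals
    half = 0.5 / scale
    grid = [k / scale for k in range(round(lo * scale), round(hi * scale) + 1)]
    mass = [normal_cdf(x + half, mu, sigma) - normal_cdf(x - half, mu, sigma)
            for x in grid]
    total = sum(mass)
    return {f"{x:.{decimals}f}": m / total for x, m in zip(grid, mass)}

def truncated_poisson(lam, k_max):
    """Poisson restricted to 0..k_max and renormalized (k_max is hypothetical)."""
    mass = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(k_max + 1)]
    total = sum(mass)
    return {str(k): m / total for k, m in enumerate(mass)}

normal_pmf = discretize_normal(0.0, 1.0, -3.0, 3.0)
poisson_pmf = truncated_poisson(3.0, 10)
```

Either dictionary maps every valid output string to its target probability, which is the form the two training variants described next would need.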

We evaluate 12 models across four families (Qwen3, Gemma-3-it, Llama-3.2-Instruct, GPT-OSS) ranging from 0.6B to 27B parameters. Each model is evaluated in three conditions: the original checkpoint (Base), a soft-target LoRA adapter (Soft), and a hard-target adapter (Hard).

We compare two ways to train against this target.

Soft-target: We build a prefix trie over all valid canonical outputs. At each decoding prefix, the method computes the target next-token distribution induced by the remaining renormalized probability mass under the true target distribution. The loss is the KL divergence between this trie-induced next-token target and the model’s next-token distribution. Here, supervision is dense, and every prefix along a sampled training path contributes.
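The trie-induced target can be illustrated with a small sketch. It assumes character-level tokens and a hypothetical `<eos>` marker (a real tokenizer would split outputs differently), and it scans the output set directly rather than building an explicit trie, which yields the same renormalized masses.

```python
import math

EOS = "<eos>"  # hypothetical end-of-sequence marker

def next_token_target(pmf, prefix):
    """Target next-token distribution at `prefix`.

    pmf maps each canonical output string to its target probability.
    The mass of every output consistent with the prefix is grouped by
    its next character, then renormalized.
    """
    mass = {}
    for out, p in pmf.items():
        if not out.startswith(prefix):
            continue
        tok = out[len(prefix)] if len(out) > len(prefix) else EOS
        mass[tok] = mass.get(tok, 0.0) + p
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError(f"prefix {prefix!r} is not in the trie")
    return {t: m / total for t, m in mass.items()}

def kl_divergence(target, model, eps=1e-12):
    """KL(target || model) over the tokens the target supports."""
    return sum(p * math.log(p / max(model.get(t, 0.0), eps))
               for t, p in target.items() if p > 0.0)

toy_pmf = {"1": 0.1, "10": 0.2, "11": 0.3, "2": 0.4}  # made-up toy target
root = next_token_target(toy_pmf, "")
after_one = next_token_target(toy_pmf, "1")
```

In this toy target, the root distribution puts 0.6 on "1" and 0.4 on "2"; after the prefix "1", the remaining 0.6 is split between stopping, "0", and "1" in the ratio 1:2:3.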

Hard-target: We sample canonical outputs from the same target distribution and train the model with masked cross-entropy on those sampled completions. Each example gives one sampled path through the trie, so supervision is sparse; we compensate with 16 sampled completions per prompt per epoch.
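A rough sketch of the hard-target data side, under stated assumptions: completions are drawn with Python's `random.choices`, and "masked" is taken to mean that only completion tokens (not prompt tokens) contribute to the loss. The function names and the toy numbers are hypothetical.

```python
import random

def sample_hard_targets(pmf, k=16, seed=0):
    """Draw k canonical outputs from the target distribution."""
    rng = random.Random(seed)
    outputs = list(pmf)
    weights = [pmf[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=k)

def masked_cross_entropy(token_logprobs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    token_logprobs: model log-probability of each reference token.
    loss_mask: 0 for prompt positions, 1 for completion positions.
    """
    picked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not picked:
        raise ValueError("loss_mask selects no positions")
    return sum(picked) / len(picked)

toy_pmf = {"1": 0.1, "10": 0.2, "11": 0.3, "2": 0.4}  # made-up toy target
completions = sample_hard_targets(toy_pmf, k=16)
loss = masked_cross_entropy([-1.0, -2.0, -0.5, -1.5], [0, 0, 1, 1])
```

Each sampled completion supervises only the tokens along its own path, which is why several samples per prompt are needed to approximate the full target.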

Yes. We evaluated on six OOD distribution families never seen during training, as well as on unseen parameter settings for seen families. Both variants sharply reduce family-median normalized Wasserstein-1 distance and achieve an order-of-magnitude reduction in trie-based logit KL. Moreover, we can rule out the hypothesis that the models merely became better at instruction following or formatting: some models already had near-perfect base validity and still show large reductions in both metrics. Hard-target fine-tuning shows stronger performance on unseen parameters, while soft-target fine-tuning is occasionally slightly better on held-out families. This rules out simple memorization and suggests the model is actively combining its latent pre-training knowledge of mathematical distributions with a newly learned mechanical ability to sample from them.
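For readers unfamiliar with the metric, here is a minimal sketch of a Wasserstein-1 distance between a model's samples and a discrete target. The normalization used here (dividing by the width of the support) is an assumption made for illustration; the source does not specify how the distance is normalized.

```python
from collections import Counter

def normalized_w1(samples, target_pmf):
    """W1 between the empirical sample distribution and target_pmf.

    target_pmf maps numeric values to probabilities. In one dimension,
    W1 is the area between the two CDFs. Dividing by the support width
    is an assumed normalization.
    """
    support = sorted(set(target_pmf) | set(samples))
    counts = Counter(samples)
    n = len(samples)
    cdf_emp = cdf_tgt = 0.0
    w1 = 0.0
    for a, b in zip(support, support[1:]):
        cdf_emp += counts[a] / n
        cdf_tgt += target_pmf.get(a, 0.0)
        w1 += abs(cdf_emp - cdf_tgt) * (b - a)
    width = support[-1] - support[0]
    return w1 / width if width > 0 else 0.0

uniform = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
balanced = normalized_w1([0, 1, 2, 3], uniform)
collapsed = normalized_w1([0, 0, 0, 0], uniform)
```

A perfectly balanced sample scores 0, while a sample collapsed onto a single value scores 0.5 in this toy case.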

Support Size and Unique Output Rate measure open-ended random-generation diversity; MCQ TV measures answer-position balance over parseable generations; NoveltyBench Utility is the benchmark’s patience-discounted reward metric.

Open-ended random generation. We constructed a 102 prompt benchmark spanning names, cities, animals, foods, chemical elements and landmarks with varying prompt wording. Soft-target fine-tuning increases the number of first-step next tokens required to cover 90% of the model’s probability mass by 1-2 orders of magnitude for every model. As shown in the figure, for a “random weekday” prompt, Qwen3-14B’s initial 80% mass on Wednesday drops to roughly 40% with the calibrated variants, now spread across multiple days. Gemma-3-27B-it's 74% concentration on its top four cities falls to 15% (soft) and 24% (hard).
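The coverage statistic described above can be sketched as follows: sort the first-step next-token probabilities and count how many are needed to reach 90% of the mass. The two example distributions are made up for illustration and are not measurements.

```python
def support_size(probs, coverage=0.9):
    """Number of highest-probability tokens needed to cover `coverage` mass."""
    total = 0.0
    for i, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= coverage - 1e-12:  # small tolerance for float rounding
            return i
    return len(probs)

# Hypothetical first-token distributions over seven weekdays.
collapsed = [0.80, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]
uniform = [1 / 7] * 7

collapsed_size = support_size(collapsed)
uniform_size = support_size(uniform)
```

A collapsed distribution needs only a few tokens to reach 90% coverage, while a uniform one over seven options needs all seven.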

MCQ generation answer-position balance. Given a prompt to generate independent medical MCQs encouraging uniformly distributed correct answers, both variants reduce total variation distance from uniform for most models. However, TV distance has to be read alongside format validity: a low TV distance paired with a low valid-output rate is not a calibration success.
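A short sketch of why the two numbers need to be reported together: TV distance is computed only over parseable answers, so a model that rarely produces a valid MCQ can still look balanced. The four-option setup and the use of `None` for unparseable generations are assumptions for illustration.

```python
from collections import Counter

def position_balance(answers, options=("A", "B", "C", "D")):
    """Return (tv_distance_from_uniform, valid_rate).

    answers: parsed correct-answer letters, with None for generations
    that could not be parsed. TV is computed over valid answers only.
    """
    valid = [a for a in answers if a in options]
    valid_rate = len(valid) / len(answers) if answers else 0.0
    if not valid:
        return None, valid_rate
    counts = Counter(valid)
    uniform = 1.0 / len(options)
    tv = 0.5 * sum(abs(counts[o] / len(valid) - uniform) for o in options)
    return tv, valid_rate

always_c = position_balance(["C", "C", "C", "C"])
half_invalid = position_balance(["A", "B", "C", "D", None, None, None, None])
```

Here `always_c` has TV 0.75 with every output valid, while `half_invalid` has TV 0 but only half of its generations are usable.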

NoveltyBench (Zhang et al.) evaluates whether a model can generate multiple functionally distinct, high-quality answers to the same prompt without suffering from mode collapse. Ten responses per prompt are sampled, grouped by semantic similarity, and scored on both diversity and quality. Soft-target fine-tuning wins on overall utility for 8 of 12 models, and the counterexamples (GPT-OSS-20B, Qwen3-0.6B) show increased distinctness but lose response quality, showing that broader semantic spread is only valuable when it stays aligned with the prompt.

The costs to general model capabilities remain modest. The base checkpoint remains best on aggregate TinyBenchmarks (Polo et al.) gp-IRT for most models. At the task level, however, MMLU, HellaSwag, and WinoGrande show modest upward shifts, with only GSM8K showing a clear systematic regression.

One might suspect that this increased diversity and calibration comes at the cost of broadly flattening probability distributions across the vocabulary, lowering overall confidence and increasing perplexity everywhere. To rule this out, we measured retained language-model fit using PALOMA perplexity (Magnusson et al., 2024). If this hypothesis were true, held-out text likelihood would systematically regress, but it doesn't: at least one fine-tuned variant beats the base model on every PALOMA slice.

We compare against String Seed of Thought (SSOT) prompting, which asks the model to emit an internal random seed string and reason from it before producing a sample. SSOT can improve over the base checkpoint when the model reliably follows the seed-and-reasoning protocol, but it is brittle, model-dependent, and more expensive at inference time.

The main training signal comes from mathematical distribution prompts, not naturally occurring language-space distributions. Valid numeric output rate improves for weak baselines, so some gains may come from better instruction following. However, W1 and logit KL still improve in cases where the valid rate is already high.

Calibration gains can degrade reasoning benchmarks, especially GSM8K. This matters for general-purpose deployment.

The hard-target and soft-target comparison is not compute-controlled; hard-target receives roughly ten times as many optimizer steps, making direct capability-retention comparisons between the two variants hard to interpret cleanly.

This post is a summary of our newly released preprint, Probabilistic Calibration Is a Trainable Capability in Language Models. We investigate whether stochastic fidelity can be explicitly trained into a model. We demonstrate that fine-tuning models strictly on mathematical distributions teaches them to map their internal probability estimates to stochastic outputs, a mechanical capability that generalizes to unseen probability distributions and successfully transfers to open-ended natural language tasks.

source & further reading

Original article: lesswrong.com