{"slug": "teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values", "title": "Teaching to the Test: Why Reward Models Learn the Dataset, Not the Values", "summary": "Researchers from the National University of Singapore, VinUniversity, and Nanyang Technological University found that weak-to-strong reward models trained on one preference dataset fail to generalize to others, a problem they attribute to representation drift. Their proposed fix, Representation Anchoring, preserves intermediate representations during training to improve transfer, with RAIL serving as a particularly challenging benchmark for harmlessness.", "body_md": "A recent paper from the National University of Singapore, VinUniversity, and Nanyang Technological University studies weak-to-strong reward models, and it uses our RAIL dataset as one of three benchmarks for harmlessness. I want to walk through what the paper found and why its use of RAIL is worth a short post. The result is useful on its own if you train or evaluate reward models, and the benchmark choice says something about what RAIL is for.\n\nThe paper is “When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift” (https://arxiv.org/abs/2605.25629).\n\nRAIL is a framework from Responsible AI Labs that scores the normative behavior of language models across eight measurable dimensions, instead of a single safe-or-unsafe label. In “RAIL in the Wild” (https://arxiv.org/abs/2505.00204) we applied it to Anthropic’s Values in the Wild dataset: more than 308,000 anonymized Claude conversations and over 3,000 annotated value expressions. We mapped those value expressions to the eight dimensions and computed scores, which turned a set of broad principles into numbers you can compute on real traffic.\n\nThat matters here because a preference dataset built this way carries a values-grounded signal, not just a stylistic one. 
It is a reasonable independent test of whether a model has learned “harmless” in a way that holds up, rather than whether it has learned the surface patterns of one particular dataset.\n\nWeak-to-strong generalization is a practical version of the scalable oversight problem. You train a small model on gold labels, use it to label data for a larger model, and hope the larger model’s pretrained knowledge lets it generalize past the smaller model’s mistakes. The larger model never sees the gold labels directly.\n\nA reward model is the object under study. It scores a prompt and response pair, and preferences come from the score differences. In RLHF pipelines these scores stand in for human judgment, so the reward model’s reliability shapes everything trained on top of it.\n\nThe paper’s question is simple to state. If you train a weak-to-strong reward model on one preference dataset, does its accuracy carry over to a different dataset with the same goal but different prompts, styles, and labeling conventions? The authors test this with a zero-shot protocol: train on one source dataset, then evaluate on the held-out split of that dataset and on every other dataset in the same category, with no target examples in training.\n\nThe headline result is that strong in-distribution gains do not reliably predict out-of-distribution transfer. A model can score well on the dataset it trained on and then add little or nothing on a different dataset with the same goal.\n\nThe authors argue the cause is representational. Fine-tuning on a single source dataset can pull the strong model’s internal features toward that dataset’s quirks, away from the broadly useful preference features it already held from pretraining. They call this representation drift, and they support it by showing that preserving intermediate representations improves transfer. If imperfect labels were the only problem, that intervention would not help as much as it does.\n\nTheir fix is Representation Anchoring. 
During training they keep a frozen copy of the pretrained strong model and add a term that penalizes the student for drifting too far from the reference in representation space, while the usual preference loss still teaches the task. The frozen copy is dropped at inference, so the deployed reward model is unchanged and costs nothing extra to serve.\n\nTo measure all this honestly, they use three metrics rather than raw accuracy:\n\nFor harmlessness, the three datasets are Anthropic Harmless from HH-RLHF, PKU-SafeRLHF, and RAIL. Each takes a turn as the training source and as a held-out target.\n\nRAIL turns out to be a genuinely hard target. In the setting where the Llama student trains on Anthropic Harmless and transfers to RAIL, the standard and confidence-based methods do not beat the weak teacher, and one rationale-based baseline drops below it. Only the anchored model transfers with a positive gain. Here are the harmlessness numbers for the Llama-3.1-8B student, comparing in-distribution gain against the gain transferred to RAIL:\n\n```\nLlama-3.1-8B, harmlessness (mean of 3 seeds)\n\n                       Anthropic Harmless     PKU-SafeRLHF\nMethod                 in-dist   to RAIL      in-dist   to RAIL\nNaive weak-to-strong    +3.09     0.00         +4.91     +2.26\nConfidence-based        +3.26    -0.31         +5.08     +2.47\nSEAM                    -6.66    -0.51         +7.94     +6.27\nAnchor                  +3.39    +0.31         +5.49     +4.62\n\nSource: Le, Cao, et al. (2026), Table 1\n```\n\nTwo things stand out. The anchored model is the only one that stays positive on both axes in both settings: it holds its in-distribution gain and still transfers to RAIL. SEAM can transfer strongly from PKU-SafeRLHF, reaching the highest gain to RAIL in that column, but it collapses in-distribution when trained on Anthropic Harmless, with a raw gain of -6.66. 
That is exactly the trade the Net Transfer Score is built to catch.\n\nThe same pattern shows up with the Qwen family. Trained on Anthropic Harmless, the standard method posts the best in-distribution gain in its group and then transfers to RAIL with an absolute gain of only 0.52. It learned to score its own dataset without carrying the safety signal across.\n\nWhen an independent team picks a dataset as one of the few that define a safety category, it is treating that dataset as a distinct and credible measure of the goal, not a duplicate of the others. If RAIL were interchangeable with Anthropic Harmless and PKU-SafeRLHF, every method would transfer to it the same way it transfers to them. It does not. Methods transfer to RAIL differently, and a method built specifically to preserve general preference features is the one that handles it. That is evidence RAIL is testing something real.\n\nThere is a practical takeaway here for anyone training reward models, separate from RAIL. If you evaluate only on your training distribution, you are measuring memorization as much as alignment. A more honest check trains on your source data and reports transfer to at least one independent dataset in the same category, using a metric that penalizes source-domain regression. 
A values-grounded set is a useful third axis in that check, because it is less likely to share the surface patterns your other datasets do.\n\nIf you use RAIL as a benchmark in your own evaluations, I would be glad to hear what you find.\n\n[Teaching to the Test: Why Reward Models Learn the Dataset, Not the Values](https://pub.towardsai.net/teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values-3415281e9e57) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values", "canonical_source": "https://pub.towardsai.net/teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values-3415281e9e57?source=rss----98111c9905da---4", "published_at": "2026-06-22 03:58:32+00:00", "updated_at": "2026-06-22 04:14:05.021956+00:00", "lang": "en", "topics": ["ai-research", "ai-safety", "machine-learning", "large-language-models"], "entities": ["National University of Singapore", "VinUniversity", "Nanyang Technological University", "RAIL", "Responsible AI Labs", "Anthropic", "Llama"], "alternates": {"html": "https://wpnews.pro/news/teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values", "markdown": "https://wpnews.pro/news/teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values.md", "text": "https://wpnews.pro/news/teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values.txt", "jsonld": "https://wpnews.pro/news/teaching-to-the-test-why-reward-models-learn-the-dataset-not-the-values.jsonld"}}