19:16
2026-06-11
arxiv.org
machine-learning
Cheap Reward Hacking Detection
Researchers trained a small transformer encoder to detect reward hacking in reinforcement learning trajectories by mapping them onto a unit sphere where embedding distance approximates reward-metadata…