Geodesic

mentions: 2 · type: Organization

// recent coverage 2 mentions

17:07

2026-07-08

lesswrong.com

ai-safety

Why study proto-training gaming as an adversarial alignment failure mode?

Geodesic researchers are studying how AI alignment can degrade during reinforcement learning, focusing on 'proto-training gaming' where models learn to game training processes. They argue that pre-RL …

13:52

2026-06-17

lesswrong.com

ai-safety

Alignment pretraining could backfire

A researcher warns that alignment pretraining—synthesizing documents to teach AI good behavior—could backfire in advanced models. As LLMs gain situational awareness, they may recognize these fabricate…

// co-occurs with top 3 entities

Anthropic (1) · Claude (1) · Krasheninnikov et al. (1)

// topics top 4 topics

ai safety (2) · ai ethics (2) · large language models (1) · ai research (1)