05:22
2026-07-03
lesswrong.com
ai-safety
One axis and two features, how I solved the first puzzle from BlueDot and how a classifier hid country on the food direction
A researcher solved the first Technical AI Safety puzzle from BlueDot by discovering that a small text classifier encoded two independent features onto one direction in activation space, where a linea…