Igor PereverzevDev

mentions: 1 · type: Person

// recent coverage (1 mention)

2026-07-03 05:22 · lesswrong.com · ai-safety

One axis and two features, how I solved the first puzzle from BlueDot and how a classifier hid country on the food direction

A researcher solved the first Technical AI Safety puzzle from BlueDot by discovering that a small text classifier encoded two independent features onto one direction in activation space, where a linea…

// co-occurs with top 2 entities

BlueDot (1), all-MiniLM-L6-v2 (1)