12:55
2026-06-15
transformer-circuits.pub
large-language-models
Natural Language Autoencoders Produce Explanations of LLM Activations
Anthropic researchers introduced Natural Language Autoencoders (NLAs), an unsupervised method that generates natural language explanations of LLM activations by training two modules to reconstruct act…