{"slug": "paper-dictionary-learning-identifiability-for-understanding-saes", "title": "[Paper] Dictionary Learning Identifiability for Understanding SAEs", "summary": "A new analysis of dictionary learning, which Sparse Autoencoders (SAEs) approximate, identifies necessary optimality conditions that explain why SAEs exhibit puzzling behaviors like feature-splitting and feature-absorption. The study reformulates the dictionary learning optimization problem and derives first-order optimality constraints that, if violated, prevent a solution from being a local optimum, such as prohibiting hierarchically related features. These theoretical tools aim to improve understanding of SAE failure modes and could guide the design of better concept extraction techniques.", "body_md": "Despite showing promise for studying the internals of neural networks, Sparse Autoencoders (SAEs) do some puzzling things, like [feature-splitting](https://transformer-circuits.pub/2023/monosemantic-features), [feature-absorption](https://arxiv.org/abs/2409.14507), or [encoding dense features](https://arxiv.org/pdf/2506.15679). Working out why they show these behaviours may help us extract more insight from SAEs, and provide principles for designing their successors.\n\nIn this work I analysed dictionary learning (which SAEs approximate) to examine when and why these effects occur (a similar motivation to [multiple](https://www.lesswrong.com/posts/3zBsxeZzd3cvuueMJ/paper-a-is-for-absorption-studying-feature-splitting-and) [previous](https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in) [efforts](https://www.lesswrong.com/posts/kcg58WhRxFA9hv9vN/toy-models-of-feature-absorption-in-saes)). I present some general-purpose theoretical approaches that I found useful in understanding these phenomena. 
Further, once a failure mode of SAEs has been identified, these tools could be applied to other optimisation problems to see whether they behave better, which could lead to better concept extraction techniques (though this paper doesn't pursue this).\n\nBriefly, the technical contribution is the following. I study the dictionary learning optimisation problem and, following [excellent earlier work](https://arxiv.org/abs/0812.1869), reformulate it in various ways, including showing the problem is convex in the wide-dictionary limit. One thing we definitely know about SAE representations is that they are local optima of the SAE optimisation problem. For a solution to be a local optimum, there must be no perturbation of it that decreases the loss to first order. [As in earlier work](https://arxiv.org/abs/0904.4774), I use this to derive first-order optimality conditions which place interpretable constraints on the ways features and residuals are allowed to relate to one another in an optimal solution. If a solution breaks these constraints, it cannot be a local optimum. For example, this prohibits the existence of hierarchically related features. Finally, towards the end of the paper I consider the wide-dictionary limit and show it can explain some findings about, for example, dense features.\n\nThe paper can be found [here](https://arxiv.org/abs/2606.02385).\n\nI welcome any comments or criticism from this community, as I am not yet a fluent mechanistic interpretability speaker!\n\nTL;DR: I give an example of the kind of necessary optimality condition you can derive and how it 'explains' feature splitting and absorption.\n\nI take a representation and perturb it slightly by adding a small change. I consider a particular class of perturbations (those within the span of the features) and derive a necessary feature-feature relationship. To express the simplest version of this condition, let's consider just two features, each with a unit-norm dictionary/decoder vector and an encoding of every datapoint, that is, the responses of SAE features 1 and 2 to that datapoint.\n\nTo state the condition, we first remove the mean from each feature's responses and then divide by the minimum value. The condition is then expressed in terms of these modified feature responses, and we get another condition by swapping the roles of features 1 and 2 on the right-hand side.\n\nThis tells us that the way the decoder vectors are arranged is constrained by the behaviour of one feature while the other is inactive. Below is a figure from the paper that gives three illustrations of this construct:\n\nThe scattered blue points are the modified feature responses in three different datasets. The red regions are the relevant convex hulls: how one feature varies when the other is inactive. For a pair of features to be stable, the condition must be met for both of the illustrated convex hulls. As can be seen, data that vary significantly, as in panels A and C, pass this check, while data with missing chunks, as in panel B, fail, so such feature responses cannot arise at a local optimum.\n\nThis can be used to explain why you can't get hierarchically related dictionary features. We operationalise a hierarchy as a set of low-level features that are active only when the higher-level feature is. For example, since all labradors are dogs, 'labrador' is a low-level feature that is active only with the higher-level feature 'dog'. 
This structural dependence manifests as an empty convex hull (see below), meaning this feature combination is unstable and can never be a minimum, which explains why dictionary learning can never learn such features (this is a more formal generalisation of [existing ideas](https://www.lesswrong.com/posts/3zBsxeZzd3cvuueMJ/paper-a-is-for-absorption-studying-feature-splitting-and)):\n\nIn the paper I explore these ideas further:\n\nI raise two big shortcomings of the work here:\n\nIn sum, I hope these theoretical tools can be useful in thinking about why SAEs do what they do, letting us learn more from them, and can suggest avenues for designing their successors in principled ways.", "url": "https://wpnews.pro/news/paper-dictionary-learning-identifiability-for-understanding-saes", "canonical_source": "https://www.lesswrong.com/posts/jYomRhKiffA6Jnrij/paper-dictionary-learning-identifiability-for-understanding", "published_at": "2026-06-05 00:39:30+00:00", "updated_at": "2026-06-05 00:52:21.715639+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "ai-research", "ai-safety"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/paper-dictionary-learning-identifiability-for-understanding-saes", "markdown": "https://wpnews.pro/news/paper-dictionary-learning-identifiability-for-understanding-saes.md", "text": "https://wpnews.pro/news/paper-dictionary-learning-identifiability-for-understanding-saes.txt", "jsonld": "https://wpnews.pro/news/paper-dictionary-learning-identifiability-for-understanding-saes.jsonld"}}