Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model A research project using k-sparse autoencoders found interpretable features in a small reasoning model, including features that appear to correspond to the model's reasoning process. The experiment analyzed hidden states from 32 inferences on GSM8K problems, identifying features with high activation frequencies across multiple inferences, particularly in layer 10. The findings suggest that sparse autoencoders can extract meaningful features from small reasoning models, though the results are preliminary and require further validation. Models such as sparse autoencoders SAEs and k-sparse autoencoders have been used as an effective medium to extract meaningful interpretable features from neural networks, including Large Language Models LLMs . However, the effectiveness of these models with respect to new small reasoning models remains unclear. While it may seem obvious that it’s possible to extract features from reasoning models using SAEs, it’s not fully determined whether they can effectively uncover features, especially in small reasoning models. In addition, an interesting question arises when thinking about the nature of reasoning models: Do reasoning models have specific features that relate to the thinking process that it performs? This project aims to investigate whether k-sparse autoencoders are capable of extracting features from a small reasoning model, with a focus on finding evidence for features that correspond to the reasoning process of the model. Providing evidence to answer this question could support the presence of interpretable features related to the model’s reasoning process, in addition to possibly enhancing our comprehension of small reasoning models. While the results are promising, I have to say that there is room for opportunities to further analyse the feature activations or refine the analysis methodology to enhance feature extraction accuracy and interpretation. The main experiment consisted of first generating 32 model inferences with problems from the GSM8K https://huggingface.co/datasets/openai/gsm8k dataset, using a temperature of 0.6 and a maximum of 400 new generated tokens. The hidden states of these inferences were encoded using a k-sparse autoencoder https://huggingface.co/EleutherAI/sae-DeepSeek-R1-Distill-Qwen-1.5B-65k k = 32 focusing on three layers: 5, 10 and 20, thus obtaining the top 32 latent activations and feature indices per inference and layer. The first hidden state related to the input prompt tokens was excluded because of what is seems to be a model bias towards the “ <|begin▁of▁sentence| ” and “ <|User| ” special tokens and to maintain a focus on the model generated tokens. Then, an analysis of the latent activations and feature indices was performed by obtaining the top feature activations, feature and activation frequencies per layer and model inference, in addition to the top activating tokens per feature, layer and inference. Several graphs were created to visualize and analyze this. The graphs below are examples of a plausible relationship between tokens the model commonly uses within the reasoning process. The feature 32456 was a top feature for 28 model inferences, feature 6252 for 14 inferences, and feature 31146 for 10 inferences. All of these features appear in layer 10, also indicating the importance of this layer for the reasoning process. This experiment also revealed the presence of features with a large number of varied top activating tokens, as well as features with a small number of specific activating tokens. The following graph shows an example of one feature feature 1038 in layer 10, was a top feature in 32 inferences with high activation values related to several tokens that don’t seem to have a direct and specific relationship. Easier to see here https://github.com/artmtt/sae-interpretability-small-reasoning-model/blob/main/plots doc/png/1 model out layer 10 feature 1038 token acts bar.png . The code that was developed for this project can be found in this GitHub repository https://github.com/artmtt/sae-interpretability-small-reasoning-model . This is an exploratory project with notable limitations , including pre-trained k-SAEs and no formal statistical validation. Findings should be treated as preliminary. These are the resources that were used to conduct the experiments in this project: The main reasons behind the reasoning model and dataset selection involve effectively running the experiments until completion within the coding environment I currently have access to, in addition to the time constraints this project needs to fulfill. All the inferences of the model were generated with a temperature of 0.6 and a maximum of 400 new generated tokens. The input prompts were formatted using a chat template to make the model behave like an assistant. Furthermore, the prompt format follows the recommendations given in the DeepSeek R1 model card with just a small change: Raw input reasoning prompt example: <|begin▁of▁sentence| <|User| 2 + 3 = x What is the value of x? Please reason step by step in a few words, and put your final answer within \boxed{}<|Assistant|