Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

Researchers at an undisclosed institution analyzed LoRA fine-tuning in Gemma-2-9B using sparse autoencoders (SAEs), finding that adapter-specific feature dictionaries show weak geometric alignment with pretrained SAE features at LoRA ranks 4, 8, 16, and 32. Adapter-specific SAEs also reconstruct delta activations more effectively than pretrained SAEs, which suggests that LoRA updates occupy partially distinct representational structures in the residual stream. These findings indicate that pretrained interpretability dictionaries may not fully capture features introduced by fine-tuning, with implications for mechanistic analysis and safety auditing of adapted language models.

arXiv:2605.28896v1

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely adopted approach for adapting large language models, yet the internal representational changes induced by LoRA fine-tuning remain insufficiently understood. In this work, we investigate the geometry of LoRA-induced representations using Sparse Autoencoders (SAEs). We introduce a delta activation framework that isolates the adapter-specific contribution to the residual stream. Using Gemma-2-9B with LoRA ranks 4, 8, 16, and 32, we train adapter-specific SAEs across multiple transformer layers and compare their learned feature spaces with pretrained SAE dictionaries. We evaluate representational alignment using cosine similarity between decoder directions, principal-angle analysis of feature subspaces, and Centered Kernel Alignment (CKA) between activation representations. Across layers and ranks, we consistently observe comparatively weak geometric alignment between LoRA-induced feature dictionaries and pretrained SAE features. Adapter-specific SAEs also reconstruct delta activations more effectively than pretrained SAEs, suggesting that LoRA updates occupy partially distinct representational structure within the residual stream. Additionally, feature density increases with rank and depth, while geometric divergence remains relatively stable across ranks. These findings provide empirical evidence that LoRA fine-tuning can induce feature structures that are not fully captured by pretrained interpretability dictionaries, with implications for mechanistic interpretability, adaptation analysis, and safety auditing of fine-tuned language models.
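
The three alignment measures named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names are hypothetical, and the sketch assumes that delta activations are the difference between adapted and base residual activations at the same layer, that decoder directions are stored as rows, that principal angles are computed on feature subsets with linearly independent rows and no more rows than the model dimension, and that CKA is the linear variant. The abstract does not confirm any of these choices.

```python
import numpy as np


def delta_activations(h_adapted, h_base):
    """Assumed definition: adapted minus base residual activations, per token."""
    return h_adapted - h_base


def max_cosine_similarity(D_a, D_b):
    """For each decoder direction (row) in D_a, the best cosine match in D_b."""
    A = D_a / np.linalg.norm(D_a, axis=1, keepdims=True)
    B = D_b / np.linalg.norm(D_b, axis=1, keepdims=True)
    return (A @ B.T).max(axis=1)


def principal_angles(D_a, D_b):
    """Principal angles (radians) between the row spaces of D_a and D_b.

    Assumes each matrix has linearly independent rows and at most d rows,
    where d is the number of columns.
    """
    Q_a, _ = np.linalg.qr(D_a.T)  # orthonormal basis, shape (d, k_a)
    Q_b, _ = np.linalg.qr(D_b.T)  # orthonormal basis, shape (d, k_b)
    sigma = np.linalg.svd(Q_a.T @ Q_b, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))


def linear_cka(X, Y):
    """Linear CKA between activation matrices with one row per example."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```

Under these assumptions, weak alignment would show up as low maximum cosine similarities, principal angles far from zero, and CKA values well below 1. Identical inputs give cosine similarity 1, angles of 0, and CKA of 1.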