
Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

Researchers at an undisclosed institution analyzed LoRA fine-tuning in Gemma-2-9B using sparse autoencoders (SAEs) and found that adapter-specific feature dictionaries align only weakly with pretrained SAE features at LoRA ranks 4, 8, 16, and 32. The study reports that adapter-specific SAEs reconstruct delta activations better than pretrained SAEs, which suggests that LoRA updates occupy partially distinct representational structure in the residual stream. According to the authors, pretrained interpretability dictionaries may not fully capture features introduced by fine-tuning, which has implications for mechanistic analysis and safety auditing of adapted language models.

Published May 29, 2026 · 1 min read

arXiv:2605.28896v1

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely adopted approach for adapting large language models, yet the internal representational changes induced by LoRA fine-tuning remain insufficiently understood. In this work, we investigate the geometry of LoRA-induced representations using Sparse Autoencoders (SAEs). We introduce a delta activation framework that isolates the adapter-specific contribution to the residual stream. Using Gemma-2-9B with LoRA ranks 4, 8, 16, and 32, we train adapter-specific SAEs across multiple transformer layers and compare their learned feature spaces with pretrained SAE dictionaries. We evaluate representational alignment using cosine similarity between decoder directions, principal-angle analysis of feature subspaces, and Centered Kernel Alignment (CKA) between activation representations. Across layers and ranks, we consistently observe comparatively weak geometric alignment between LoRA-induced feature dictionaries and pretrained SAE features. Adapter-specific SAEs also reconstruct delta activations more effectively than pretrained SAEs, suggesting that LoRA updates occupy partially distinct representational structure within the residual stream. Additionally, feature density increases with rank and depth, while geometric divergence remains relatively stable across ranks. These findings provide empirical evidence that LoRA fine-tuning can induce feature structures that are not fully captured by pretrained interpretability dictionaries, with implications for mechanistic interpretability, adaptation analysis, and safety auditing of fine-tuned language models.
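For readers unfamiliar with the three alignment measures named in the abstract, the following NumPy sketch shows one common way to compute each of them. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the delta activation is assumed to be the adapted residual stream minus the base residual stream (the abstract does not give the exact definition), CKA is shown in its linear form, and the principal-angle helper assumes each dictionary subset has fewer features than the model dimension.

```python
import numpy as np


def delta_activations(h_adapted, h_base):
    # ASSUMPTION: "delta activation" = adapted minus base residual stream
    # on the same tokens. Both arrays have shape (n_tokens, d_model).
    return h_adapted - h_base


def max_cosine_per_feature(dec_adapter, dec_pretrained):
    # Rows are decoder directions, shape (n_features, d_model).
    # For each adapter feature, return the largest absolute cosine
    # similarity to any pretrained feature.
    a = dec_adapter / np.linalg.norm(dec_adapter, axis=1, keepdims=True)
    p = dec_pretrained / np.linalg.norm(dec_pretrained, axis=1, keepdims=True)
    return np.abs(a @ p.T).max(axis=1)


def principal_angles(dec_a, dec_b):
    # Principal angles (radians, ascending) between the subspaces spanned
    # by the rows of each dictionary subset.
    # ASSUMPTION: each subset has fewer rows than d_model and full rank;
    # otherwise the spans fill the whole space and every angle is zero.
    qa, _ = np.linalg.qr(dec_a.T)  # orthonormal basis, shape (d_model, k_a)
    qb, _ = np.linalg.qr(dec_b.T)  # orthonormal basis, shape (d_model, k_b)
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))


def linear_cka(x, y):
    # Linear CKA between two activation matrices with the same number of
    # rows (tokens); feature dimensions may differ.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)
```

In this sketch, weak alignment would show up as low maximum cosine similarities, principal angles far from zero, and linear CKA values well below 1. Linear CKA is invariant to orthogonal rotation and isotropic scaling of the features, so it compares representations rather than individual coordinates.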
