VFUSE: Virulent Feature Understanding With Sparse AutoEncoders

Researchers introduced VFUSE, a mechanistic interpretability approach using sparse autoencoders to audit protein models for hazardous features. Applied to RoseTTAFold3 and RFDiffusion3, linear probes in SAE latent space detected hazardous designs with up to 0.848 AUROC, identifying monosemantic features for virulence. This is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model.

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space than when fit on the original model's representations, improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC 0.84. To our knowledge, this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.

There has been a ton of mechanistic interpretability research done on LLMs, from SAELens (https://pypi.org/project/sae-lens/) to Neuronpedia (https://www.neuronpedia.org/) to Golden Gate Claude (https://www.anthropic.com/news/golden-gate-claude). Even CNNs (https://research.google/blog/deepdream-a-code-example-for-visualizing-neural-networks/) and ViTs (https://github.com/Prisma-Multimodal/ViT-Prisma) have a bunch of interesting work.

An area that seems relatively underinvestigated is protein model interpretability. There are some early papers here, such as InterProt (https://interprot.com/) and FoldSAE (https://arxiv.org/pdf/2511.22519), but there are still so many unanswered questions and possibilities.

We wanted to answer the question: can SAEs (Sparse Autoencoders) trained on RFDiffusion3 and RoseTTAFold3 be used to classify hazardous vs non-hazardous proteins in an interpretable way?
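To make "a feature that separates hazardous from benign designs at AUROC X" concrete, here is a minimal sketch of scoring each SAE feature on its own. It is an illustration, not the code from the paper: the function names `auroc` and `rank_features`, the toy numbers, and the assumption that each protein has already been reduced to one activation value per feature (for example by max-pooling over residues) are all hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random hazardous example scores above a random
    benign one (ties count as 0.5). labels: 1 = hazardous, 0 = benign."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one hazardous and one benign example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_features(activations, labels):
    """Score every feature column separately and sort by distance from chance.

    activations: one row per protein, one column per SAE feature
    (ASSUMPTION: already pooled to a single value per protein).
    """
    n_features = len(activations[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        scored.append((j, auroc(column, labels)))
    # Features far from 0.5 in either direction are the informative ones.
    return sorted(scored, key=lambda t: abs(t[1] - 0.5), reverse=True)


# Toy data: 4 proteins x 3 features (made-up numbers, not results).
labels = [1, 1, 0, 0]
activations = [
    [0.9, 0.0, 0.3],
    [0.7, 0.2, 0.1],
    [0.0, 0.2, 0.4],
    [0.1, 0.0, 0.2],
]
ranked = rank_features(activations, labels)
# -> [(0, 1.0), (2, 0.25), (1, 0.5)]
```

A feature with AUROC near 1.0 fires almost exclusively on hazardous examples, one near 0.0 fires mostly on benign ones, and 0.5 is chance. The pairwise loop is quadratic in the number of proteins, which is fine for a sketch but worth replacing with a rank-based computation on larger sets.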
We trained Matryoshka Batch TopK Sparse Autoencoders (SAEs) on diffusion transformer activations in RFDiffusion3 (RFD3, a generative protein model) and RoseTTAFold3 (RF3, a protein structure prediction model like AlphaFold), sampling 1475 length-matched benign/hazardous pairs from UniProt/SafeProtein + ToxinPred3. To simulate generation around hazardous motifs with RFDiffusion3, we noise the original coordinates by 5 Angstroms with partial diffusion, and rediffuse to the original protein.

[Image: a beautiful viper ammodytoxin A]

After fitting probes (logistic regression classifiers) on both raw and SAE activations, we found that SAE probes outperformed raw activation probes for certain layers, peaking at 0.848 AUROC on layer 12 of RF3 on ToxinPred3. To avoid fold-family memorization, we cluster proteins by homology using MMseqs2.

Even cooler, we were able to find individual SAE features correlated with hazardous/benign proteins that light up on individual amino acids. The discriminative power (AUROC) of features increases as we go deeper into the model (especially in RFD3), suggesting the model has learned more complex structural concepts at deeper layers.

Overall, this is just scratching the surface of what's possible with Interp x Protein Models. What other structural features have protein design/folding models learned? Specificity, binding strength, thermostability, immunogenicity, etc.? Can they classify real vs AI-generated proteins? Can we use them to steer protein generation in addition to the conditioning signals baked in during training? I'm super excited to see what people do next!

Thanks to Matt Olson for coauthoring, and to the Institute for Protein Design for their great work on the RFDiffusion and RoseTTAFold models.

Paper: https://arxiv.org/abs/2606.10080
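A note on the "Batch TopK" part of the SAE name, for readers who have only seen per-example TopK: in the general technique, the sparsity budget is shared across the batch, so the k * batch_size largest activations are kept overall rather than the top k for each input. Below is a minimal pure-Python sketch of just that selection step. It is not the training code from the paper: `batch_topk` is a hypothetical name, the numbers are made up, a real implementation would operate on tensors, and the encoder, decoder, and Matryoshka nested reconstruction loss are all omitted.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive activations across the batch.

    pre_acts: list of rows (one per input), each a list of encoder outputs.
    Everything not selected is set to zero.
    """
    n_keep = k * len(pre_acts)
    # Flatten to (value, row, col), dropping non-positive values (ReLU).
    flat = [(v, i, j)
            for i, row in enumerate(pre_acts)
            for j, v in enumerate(row)
            if v > 0]
    flat.sort(key=lambda t: t[0], reverse=True)
    sparse = [[0.0] * len(row) for row in pre_acts]
    for v, i, j in flat[:n_keep]:
        sparse[i][j] = v
    return sparse


# Toy batch of 2 inputs x 4 latents with k = 1, so 2 activations survive in total.
pre_acts = [
    [0.9, 0.1, 0.8, -0.5],
    [0.2, 0.05, -0.3, 0.3],
]
sparse = batch_topk(pre_acts, k=1)
# -> [[0.9, 0.0, 0.8, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

Both surviving activations come from the first input here, which per-example TopK with k = 1 would not allow. That flexibility lets inputs with more going on use more latents while the average stays at k.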