{"slug": "when-are-two-networks-the-same-tensor-similarity-for-mechanistic", "title": "When Are Two Networks the Same?", "summary": "Researchers have developed a tensor similarity method that can detect changes in neural network behavior, such as backdoor attacks, by comparing the weight-space structure of models rather than just their outputs. The technique, detailed in a new paper, reveals a 0.15 delta in matrix similarity and a more visible difference in tensor similarity between checkpoints before and after fine-tuning on poisoned data. This approach allows for analyzing out-of-distribution behavior directly from weights, offering a more principled way to identify when two networks are functionally identical.", "body_md": "[We've found a method](https://arxiv.org/abs/2605.15183) that tells you:\n\nThere's only one catch: you have to use a tensor network.\n\nWe've [already shown](https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/tensor-transformer-variants-are-surprisingly-performant) that tensor-transformer variants are performant (this isn't a novel claim; see these papers for [MLPs](https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/tensor-transformer-variants-are-surprisingly-performant#Appendix_A__Noam_Shazeer_s_2020_paper_) and [Attention](https://arxiv.org/abs/2410.18613)), so here we're focusing on the interpretability advances.\n\nA tensor network is just a specific decomposition of a tensor, and a tensor is just a generalization of a matrix. This means we can apply tools from linear algebra to our entire network in a principled way. In our paper, we focus on a generalization of cosine similarity we call *tensor similarity*.\n\nThe most direct result is:\n\nThe expected inner product of the activations of two multilinear models under a Gaussian input is equal to their weight-space inner product (i.e. 
functional similarity on Gaussian inputs), i.e. tensor similarity\n\nLet's look at our baselines:\n\nNow to our tasks.\n\nAfter training on [SVHN](http://ufldl.stanford.edu/housenumbers/) (a harder MNIST), we finetune further while mixing in poisoned data (i.e. a black diamond in the top right is now labeled \"9\").\n\nWe can see during training that the model learns the backdoor while predicting at the same accuracy for non-poisoned data.\n\nOn the bottom, we have a checkpoint-by-checkpoint graph comparing the outputs of all the poisoned data. The diagonal means \"*how similar is each checkpoint's output to itself*\", so it's trivially 100% similar (dark blue). Notice the first checkpoint is only similar to itself; this is due to the network being randomly initialized. The next block is after ~learning the task. Then the top right block is after we inserted the backdoor.\n\n[*It's important to take the time to understand this similarity heatmap because 90% of our figures are in this format*]\n\nSo if we know what the poisoned data is, we can see the difference by looking at the outputs. But what if we don't know the trigger?\n\n(Top) How similar are the checkpoints' *matrices* (i.e. \"local sim\") to each other? We see a delta of 0.15 between the blocks of checkpoints before & after finetuning on poisoned data. (Bottom) How similar is each checkpoint's\n\n(Top) How similar is each checkpoint's *tensor* (i.e. \"global sim\") to the others? By accounting for symmetries (or in this case NOT accounting for anti-symmetric components that cancel out), we see a more visible difference between the poisoned checkpoints and the original ones. (Bottom) Because we have a tensor, we can find the tensor-slice for class 9 and compare only those with each other. This would be cheating since we wouldn't know \"\n\nNow you have the context to see the full image:\n\nMatrix cosine *did* show a good delta, so it's not a bad baseline. However, in our later cases, it shows ~0 difference. 
Why is this the case?\n\nMatrix cosine sim is *local*, so it will be:\n\nWhen we instead contract the tensor network into its overall tensor, all of the permutations/rescalings/etc. will cancel out. That said, the takeaway isn't just \"use tensor sim for backdoors\", but more broadly we claim:\n\nTensor networks are more analyzable; for instance, OOD behavior is computable from the weights alone. This experiment is evidence for that (and that we coded it up correctly, lol).\n\nLet's train on SVHN again, but initially only train on classes 0-4. Then mix in 5. Then 6 and so on until 9. Then we'll remove class 9 from training, getting catastrophic forgetting, and add it back in.\n\nDo note this is the same type of plot as before, where the x & y axes are \"checkpoint/training step\".\n\nWhat's interesting is that we clearly see the difference between adding 9 and removing it (**control** just means training more with the same data, you know, as a control). In fact, removing 9 is similar to \"add 8\", i.e. before you added 9 in the first place! Re-adding 9 is also similar to when you added 9 in the first place.\n\n(Top Left) is the tensor sim image already explained. (Top Middle) is output similarity, which only shows a bit of sim. (Top Right) is the local matrix cosine sim, which doesn't show the expected structure at all. (Bottom) we do tensor sim but get the slice for each class, comparing that class's tensor with itself across checkpoints. We can clearly identify 9 as the 'forgotten digit'.\n\nIn modular arithmetic [4], we classically go from \"memorization\" to \"generalization/grokking\". \"Frequency\" here means the frequencies used by the model (which we can compute solely from the weights), specifically frequencies 0-60 with 0 on the bottom row and 60 on the top.\n\nWe do have an older image that gives a different angle:\n\nThe bottom is the cosine similarity of the *frequencies being used*. 
It's self-similar in the first half because there are no frequencies (just memorization). What's interesting is that the tensor sim tracks the continued frequency change throughout training.\n\nHere the model is learning to predict n-grams better; however, none of the other methods show changes. Tensor similarity does show many changes, but we didn't explore *what* those differences correspond to. The important takeaway though is:\n\nTensor sim can tell us *where* the differences are (and we can even localize with attribution), but we still don't know\n\nOverall, tensor networks are a solid foundation for rigorous, formal analysis. We can actually use principled techniques like cosine similarity! I highly recommend those working on finding the [True Names](https://www.lesswrong.com/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation) of concepts to use tensor networks; 1-4 layer bilinear networks aren't that difficult to work with either and are performant tensor networks.\n\nFor example, this method seems perfect for [Natural Abstractions](https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1)/[Condensation](https://www.lesswrong.com/posts/BstHXPgQyfeNnLjjp/condensation): we can directly compute if two tensors are functionally equivalent across all inputs.\n\nIf you want to learn more about this project, do read [our paper](https://arxiv.org/abs/2605.15183), and for tensor networks: this [LW post](https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/tensor-transformer-variants-are-surprisingly-performant). We hope to release more educational material Soon.\n\nThis is a very tight approximation! Read our [theory section](https://arxiv.org/pdf/2605.15183#page=3) for details (the relevant equation is [eq 5](https://arxiv.org/pdf/2605.15183#page=5)).\n\nThere is a stripe though. 
This corresponds to the drop in clean accuracy on the first checkpoint of finetuning on poisoned data.\n\nClass 4 was also affected, I assume because it's similar to 9.\n\nThe above image may look asymmetric to the untrained eye, but it's just the x & y axes being on different scales.", "url": "https://wpnews.pro/news/when-are-two-networks-the-same-tensor-similarity-for-mechanistic", "canonical_source": "https://www.lesswrong.com/posts/Yzw6KDQc336CpHmGi/when-are-two-networks-the-same-tensor-similarity-for", "published_at": "2026-05-29 15:53:41+00:00", "updated_at": "2026-05-29 16:23:33.721192+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "ai-research", "computer-vision", "ai-safety"], "entities": ["SVHN", "MNIST", "Noam Shazeer"], "alternates": {"html": "https://wpnews.pro/news/when-are-two-networks-the-same-tensor-similarity-for-mechanistic", "markdown": "https://wpnews.pro/news/when-are-two-networks-the-same-tensor-similarity-for-mechanistic.md", "text": "https://wpnews.pro/news/when-are-two-networks-the-same-tensor-similarity-for-mechanistic.txt", "jsonld": "https://wpnews.pro/news/when-are-two-networks-the-same-tensor-similarity-for-mechanistic.jsonld"}}