1 Layer Induction Heads and Some Research

Researchers challenge the established belief that induction heads require two layers in transformer architectures, arguing that the phenomenon may be attributable to two attention heads rather than two layers. The investigation questions the foundational premise of prior work from Anthropic and others, which claimed induction heads are impossible in a single layer. This finding could reshape understanding of in-context learning in AI models.

Over the past few years, AI research has become one of the most intensely discussed and rapidly evolving fields in technology. For those who spend a significant amount of time reading papers, reproducing results, and testing ideas firsthand, a recurring pattern becomes difficult to ignore: there is often a substantial gap between what research claims promise and what the underlying evidence actually demonstrates. A common theme we have observed is that extraordinary claims tend to emerge from work that does not always withstand rigorous scrutiny. The title of this article may appear to fall into a similar category. While we are confident in the reasoning and research process that led us to this conclusion, we are more than willing to provide additional context, and we welcome debate, criticism, and counterarguments. Ultimately, good research is not about how exciting a claim sounds; it is about how well that claim survives careful questioning.

One such question, and the one that led to this article, is: "Why aren't induction heads possible in a single layer?"

This article assumes that most readers will be able to follow the research, for two reasons. Anyone who clicked on the title out of intrigue probably already has a sense of what induction heads are and why they are said to be impossible in a single layer, and a second set of readers will at least know the basic ideas behind transformers. If you do not know the core components of the transformer architecture, we will cover some basics so that you can follow the post and get something out of it. If you already know most of the basics, feel free to skip this section; otherwise, refer to the LessWrong posts below, which give a detailed walkthrough of the necessary concepts.

The question "Why aren't induction heads possible in a single layer?" seems mundane. It feels like a question that has already been answered with substantial evidence. A string of papers from Anthropic (https://www.anthropic.com/) and many others has explored the phenomenon of induction heads and clearly attributed the reason as well. Papers like A Mathematical Framework for Transformer Circuits (https://transformer-circuits.pub/2021/framework/index.html), Toy Models of Superposition (https://transformer-circuits.pub/2022/toy_model/index.html), and In-context Learning and Induction Heads (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) clearly state this.

This was the most important premise of the paper above. What counts as important is, of course, subjective, but in this case the paper's motivation is strongly emphasized: it notes that while the effect of induction heads was shown in 2-layer models in the authors' previous work, the reason why was not specified, and it gives this as its main motivation for explaining the mechanism.

Now, K-composition demonstrates the underlying mechanism of induction heads and solidifies the statement above by explaining why induction heads require a composition of attention heads in different layers. But I always wondered: why different layers? Is the effect of induction heads due to the composition of 2 layers, or of 2 attention heads themselves? The question seems very hard to answer because of the inherently parallel structure of the transformer architecture. Regardless, it seemed like a very important question to me, because answering it would show whether induction heads should be attributed not to 2 layers but to 2 heads. The distinction is not self-evident, though at first it may not seem important for understanding.

While talking to fellow researchers in mechanistic interpretability, many pointed out that showing the attribution to just 2 heads would be an important finding, since it would let us correct the statement to say that induction heads require a composition of 2 attention heads and not necessarily 2 layers, but that the question itself might not be an important one. But, “Mech interp is a fundamentally empirical science; getting your hands dirty gives key context for your learning.” Since I am someone who loves to play around with ideas that seem totally impossible, I always pursue them for the art of learning. And learn I did, and here I want to share what I learned.

This is where I completely stop larping and get into research mode. Getting back to the title, which seems hardly possible: the motivation behind this entire research venture has been to show that induction heads might arise from just the composition of 2 attention heads, not necessarily 2 layers, and to dig deeper into what is happening behind the scenes. Referencing and replicating Anthropic's famous 2-attention-head setup, we can clearly see how the composition of 2 layers leads to the formation of induction heads. Interestingly, this also stresses the importance of the QK circuit and the OV circuit themselves.

To narrow down my hypothesis about the composition of heads rather than layers, I set up a simple attention-head structure that is mathematically similar to 1 layer. Oh wow, a recurrent structure. Now hold on: if you do not agree that this setup counts as 1 layer, that is absolutely fine, but I have something interesting that is more important than the 1 layer itself, so stick with me.

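As a rough sketch of the difference between composing layers and composing heads, the three arrangements compared in this post can be written as residual-stream updates. This is a simplification under stated assumptions, not the exact models trained here: it assumes attention-only blocks, writes $h_0$ and $h_1$ for the two attention heads (including their output projections), $x_0$ for the embedded input, and $\mathrm{LN}$ for LayerNorm. Where LayerNorm sits is an assumption, chosen to match the description later in the post that the two-layer model normalizes between the two heads while the recurrent one-layer model does not.

```latex
\begin{align*}
\text{Two-layer:} \quad
  & x_1 = x_0 + h_0(\mathrm{LN}(x_0)), \qquad x_2 = x_1 + h_1(\mathrm{LN}(x_1)) \\
\text{Regular one-layer (parallel heads):} \quad
  & x_1 = x_0 + h_0(\mathrm{LN}(x_0)) + h_1(\mathrm{LN}(x_0)) \\
\text{One-layer recurrent (sequential heads):} \quad
  & \tilde{x} = x_0 + h_0(\mathrm{LN}(x_0)), \qquad x_1 = \tilde{x} + h_1(\tilde{x})
\end{align*}
```

In the parallel case both heads read the same input, so neither can see what the other wrote. In the recurrent case $h_1$ reads $\tilde{x}$, which already contains the output of $h_0$, so the two heads can compose without a second layer.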
The mathematical representation for the 2 setups above, in comparison with a single-layer model, would be:

Setup 1: Anthropic-style two-layer setup
Setup 2: Regular one-layer setup
Setup 3: One-layer recurrent setup

So the one-layer recurrent model can be seen as a one-layer model with repeated internal attention.

I replicate the induction head experiment using a synthetic dataset to keep the experimental setup computationally feasible. I also ensure that the synthetic dataset itself does not lead to the formation of skip trigrams, since 1-layer models can still perform well by relying on skip-trigram behavior. This allows the experiment to focus more directly on whether the recurrent attention-head structure can produce induction-like behavior, rather than allowing the model to succeed through simpler shortcut mechanisms.

A B C D | A B C D | A B C D | A B C D ...

Each training sequence contains one repeating block. For every sequence, this block is chosen randomly and can have a different length. At the beginning of the sequence, the tokens are random, so the next token is hard to predict. But once the block has appeared for the first time and starts repeating, the next token becomes predictable. The model can only predict it by copying from the earlier occurrence of the same pattern. In other words, the model has to look back, find where the current token appeared before, and copy the token that came immediately after it. This is exactly the kind of behavior we expect from an induction head.

I ran this experiment on the three setups discussed above: the 1-layer model, the 1-layer recurrent model, and the 2-layer model. The results were not unusual at first.

To measure whether the model is showing induction-like behavior, I use two simple metrics: the induction attention score and the second-half loss. The induction attention score checks whether an attention head is looking at the right previous positions.
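To make the data and the induction attention score concrete, here is a minimal sketch of both. It is an illustration under assumptions, not the code used for the experiments: the sequence length, the prefix length, the choice to keep tokens distinct within a block, and the function names `make_sequence` and `induction_score` are all hypothetical. The vocabulary size (256) and the block-length range (16 to 48) are the values mentioned in this post.

```python
import random

VOCAB_SIZE = 256               # vocabulary size mentioned in the post
MIN_BLOCK, MAX_BLOCK = 16, 48  # block-length range mentioned in the post


def make_sequence(seq_len=256, prefix_len=32, rng=random):
    """Random prefix followed by one block repeated to the end.

    seq_len and prefix_len are hypothetical values chosen for illustration.
    Returns (tokens, block_len, prefix_len).
    """
    block_len = rng.randint(MIN_BLOCK, MAX_BLOCK)
    prefix = [rng.randrange(VOCAB_SIZE) for _ in range(prefix_len)]
    # Simplifying assumption: tokens inside a block are distinct, so the
    # most recent earlier occurrence of a token is exactly one block back.
    block = rng.sample(range(VOCAB_SIZE), block_len)
    body = [block[i % block_len] for i in range(seq_len - prefix_len)]
    return prefix + body, block_len, prefix_len


def induction_score(attn, block_len, prefix_len):
    """Mean attention from each repeated-region query to its induction target.

    attn[i][j] is the attention weight from query position i to key position j.
    For query i, the same token last appeared at i - block_len, so the target
    is the position right after that occurrence: i - block_len + 1.
    """
    first_repeat = prefix_len + block_len  # first query that has an earlier copy
    queries = range(first_repeat, len(attn))
    return sum(attn[i][i - block_len + 1] for i in queries) / len(queries)
```

Under this definition, a head that puts all of its attention on the induction target scores 1.0, while a head that spreads attention uniformly over earlier positions scores close to 0.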
More specifically, when the model sees a token in the repeated part of the sequence, we check whether it attends to the position that came right after the earlier occurrence of the same token. If a head consistently attends to those positions, it suggests that the head is learning the copy-from-the-past behavior needed for induction.

The second-half loss measures how well the model predicts tokens in the repeated part of the sequence. This part is important because the tokens are no longer random; they can be predicted by looking back and copying from the earlier block. So a lower second-half loss means the model is better at using the repeated pattern to predict the next token.

Note that the experiments in Figure 2 and Figure 3 were run on multiple seeds, all showing a similar structure. In the figures, the standard induction score rises massively, and there is a sharp induction phase transition, indicating a grokking effect, around 10^1m training tokens on the x-axis.

Looking at the figures, a few important patterns stand out. The points below follow the order in which the questions occurred to me while analysing the figures; the explanations draw on the training logs and on intuition from other papers I have read.

The 1L parallel model (the regular one-layer setup) never forms a real induction head, but it is not doing nothing either. Instead, it finds a simpler positional shortcut. From the attention maps, both heads spread their attention over positions that are roughly one block-period back. Since the block length is between 16 and 48 tokens, this broad attention band often lands near the correct earlier part of the sequence. This helps the model narrow down the next-token prediction. Instead of guessing from all 256 possible tokens, it reduces the guess to a much smaller set of possible in-context tokens. This is why the loss drops from around 5.7 to around 2.8.
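To put those two loss values in context, a cross-entropy loss can be converted into an "effective number of equally likely choices" by exponentiating it. The sketch below does that conversion. It assumes the reported losses are natural-log cross-entropies (in nats), which the post does not state explicitly; the helper name `effective_choices` is hypothetical.

```python
import math

VOCAB_SIZE = 256  # vocabulary size mentioned in the post


def effective_choices(loss_nats):
    """Perplexity: the number of equally likely tokens that would give this loss."""
    return math.exp(loss_nats)


# Loss of guessing uniformly over the whole vocabulary.
uniform_loss = math.log(VOCAB_SIZE)
# Plateau loss reported for the 1L parallel model.
plateau_choices = effective_choices(2.8)

print(f"uniform-guess loss: {uniform_loss:.2f} nats")
print(f"effective choices at loss 2.8: {plateau_choices:.1f}")
```

Under that assumption, uniform guessing over 256 tokens costs about 5.55 nats, and a loss of 2.8 corresponds to roughly 16 equally likely candidates, which is about the size of a single short block.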
This positional signal is still useful because it tells the model roughly where the previous occurrence of the pattern ends. Since the attention is concentrated around positions one block-period back, the model can often locate the last token of a relevant earlier sequence, which is enough to improve greedy next-token prediction. However, this does not tell the model where the matching sequence begins. Without a mechanism for precise content-based matching, the model cannot recover the full span of the earlier occurrence and therefore cannot determine exactly which token should be copied next.

The loss flatlines there because this strategy is not precise. The model is not matching the current token to its exact previous occurrence. It is only averaging over a broad positional region. Without precise content matching, it cannot reliably copy the exact next token.

This is also visible in the induction score, which rises slightly but stays capped around 0.06. So the model has found a weak positional shortcut, not a true induction mechanism.

The rise happens because the broad “one-period-back” attention band sometimes overlaps with the correct induction-source positions. In other words, some attention accidentally lands on the right places. But the attention is still broad and positional, not content-based. The heads are not finding the exact earlier occurrence of the current token. They are only attending to a rough region where the answer is often located. This is why the induction score rises above chance, but only slightly. It stops around 0.06 because the model cannot sharpen the attention further without a real composition mechanism. Both heads also behave almost the same way, which suggests that there is no real head specialisation happening.

The 1L seq model (the one-layer recurrent setup) forms induction slightly earlier because it has a shorter and simpler circuit. Fewer components need to line up, so the induction behavior appears earlier in training.
However, the 2L model quickly catches up and then overtakes it. Around the transition point, the 1L seq model starts forming induction first, but the 2L model soon reaches a sharper and stronger induction pattern.

The 2L model becomes sharper because its second-layer induction head receives a cleaner input. In the 2-layer setup, LayerNorm helps clean and stabilize the residual stream before the second attention head reads from it. In contrast, the 1L seq model has to work with a rawer residual stream. The token information and other signals are more mixed together, which makes it harder for the head to form a perfectly sharp induction pattern. This is why the 1L seq model can learn induction, but the 2L model eventually reaches a higher induction score and lower loss.

Both models show an early spike when the induction circuit first forms. At this stage, the model relies heavily on very sharp attention to the correct previous positions. Later, the circuit becomes more refined. The model improves other parts of the copying mechanism, especially the output side of the circuit. As this happens, it no longer needs to place all its weight on extremely sharp attention. So the induction score drops slightly, but this does not mean the model is getting worse. The loss continues to improve during this period, which means the model is becoming better overall even though the attention score becomes slightly less extreme.

There is then what looks like a second phase transition. Before this point, the 2L model already has a decent induction circuit. Its loss is low, and its induction score is stable. But around 530M tokens, the induction score briefly dips while the loss also starts dropping. Then, over the next stage of training, the induction score jumps sharply and the loss collapses close to zero. This suggests that the model temporarily disrupts its earlier circuit and reorganizes into a much cleaner and sharper induction solution. This is similar to a grokking-like cleanup.
The model already had a working solution, but later discovers a much better one.

The 1L seq model shows a small ripple around the same point, but it does not suddenly jump like the 2L model. The likely reason is that the 1L seq model has a lower ceiling. Since its head reads from a rawer residual stream, the best induction solution available to it is softer and less precise. It can keep improving slowly, but it does not have access to the same clean, near-perfect solution that the 2L model finds. So both models experience a similar training disturbance, but they respond differently. The 2L model snaps into a much sharper circuit, while the 1L seq model continues improving gradually.

TransformerLens (https://github.com/TransformerLensOrg/TransformerLens) is a popular framework that anyone who has worked on mechanistic interpretability has probably come across. While using TransformerLens to play around with this setup, I noticed an interesting pattern.

To understand the figures, it is useful to first understand what the axes mean. Each row represents the query token, and each column represents the source token, also called the key token. If two tokens have a high QK product, the model pays more attention between those positions, and the square becomes darker.

In both the 1L sequential model and the 2L model, we see an important diagonal pattern appear in the second half of the sequence. This is exactly where the pattern starts repeating. This diagonal shows that the model is looking back to the earlier occurrence of the same token and using it to predict what comes next. This is the visual signature of induction head formation.

While this part is expected, the interesting difference appears in Head 0. In the 1L sequential model, Head 0 looks much cleaner and more dominant. In the 2L model, Head 0 looks more distributed and less sharply focused. My intuition is that this difference comes from LayerNorm. In the 2L model, there is a LayerNorm between Head 0 and Head 1.
This means Head 0 does not need to write an extremely clean or dominant previous-token signal by itself, because LayerNorm can help clean and stabilize the information before Head 1 reads it. In the 1L sequential model, there is no such LayerNorm between the two attention steps. Because of this, Head 0 is forced to write a much clearer and stronger previous-token attention pattern directly into the residual stream. Head 1 then has to use that signal without the same kind of cleanup that happens in the 2L model.

This is interesting because it suggests that LayerNorm may be playing an important role in how information is passed from one attention head to another. In mechanistic interpretability, LayerNorm often gets less attention than the attention heads themselves, but these results suggest that it may be very important for understanding layer-to-layer interactions.

To study this effect further, the next step was to ablate the key, query, and value inputs separately and observe how each one affects induction head formation. What surprised me was the strong dominance of K-composition over both the query input and the value input. The figure below shows this clearly: corrupting the key causes the largest increase in second-half loss, directly linking the key input to induction head formation. This suggests that, in this setup, the model depends most heavily on the key pathway to form the induction circuit.

Before working out why this happens and exploring further, another interesting finding came from studying how the QK and OV circuits form. Surprisingly, the 1L sequential model is able to form a decent OV circuit. The OV circuit is the “copying” part of the induction mechanism. It asks a simple question: if the model attends to a token, does it increase the probability of outputting that same token? This is why we see a clear diagonal in the OV plots.
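One simple way to turn "a clear diagonal" into a number is to ask how often the largest entry for each source token sits on the diagonal of the OV matrix. The sketch below is an illustrative check, not the analysis used for the plots in this post. Here `ov` is a hypothetical vocab-by-vocab matrix in which `ov[out][src]` is the logit boost given to output token `out` when the head attends to source token `src`, and `diagonal_copy_fraction` is a hypothetical helper name.

```python
def diagonal_copy_fraction(ov):
    """Fraction of source tokens whose most-boosted output token is itself."""
    n = len(ov)
    hits = 0
    for src in range(n):
        # Column `src`: the boost each output token gets when attending to `src`.
        column = [ov[out][src] for out in range(n)]
        if column.index(max(column)) == src:
            hits += 1
    return hits / n


# Toy 4-token example: tokens 0, 1 and 2 are copied; token 3 mostly boosts token 1.
toy_ov = [
    [2.0, 0.1, 0.0, 0.3],
    [0.2, 1.5, 0.1, 0.9],
    [0.0, 0.3, 1.8, 0.1],
    [0.1, 0.0, 0.2, 0.4],
]
print(diagonal_copy_fraction(toy_ov))  # 0.75
```

A value near 1.0 would correspond to the strong diagonal described here, while a value near 1/n would mean the head is not copying at all.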
A strong diagonal means that when the model attends to token T, it tends to boost token T in the output. In simple terms, the head has learned how to copy.

The eigenvalue plots below the OV heatmaps give another way to see this. For readers new to the topic, an eigenvalue can be thought of as showing whether a circuit strengthens or weakens certain directions in token space. If many eigenvalues have a positive real part, it means the circuit is preserving or amplifying useful token-copying directions. So when we see many points on the positive real side, it suggests that the OV circuit is copy-friendly.

This explains why all three setups show some level of OV copying, including the 1L parallel control. Copying is mostly a one-head operation. A single attention head can learn value and output weights that say, “If I attend to this token, boost this same token.” It does not necessarily need another head to do that.

But the QK circuit is different. The QK circuit is the “matching” part of induction. It asks: does the current token attend to a previous position where the same token appeared before? This is where the 1L parallel model fails. For induction to work, the key at a position needs to contain information about the previous token. That previous-token information usually has to be written by one head and then read by another head.

In the 2L model, this is possible because the second-layer head can read what the first-layer head wrote. In the 1L sequential model, this is also possible because Head 1 can read the output of Head 0. But in the 1L parallel model, both heads run at the same time. Head 1 cannot read what Head 0 wrote, because Head 0’s output is not available yet. So even though the model has an OV circuit that can copy, it does not have a proper QK circuit that tells it where to copy from. This is why the QK plot for the 1L parallel model does not show the same clean diagonal structure.
The model can copy in principle, but it cannot reliably point that copying mechanism to the correct previous position. In short, OV copying is a one-head trick, but QK matching is a composition trick. The 1L parallel model learns how to copy, but it does not learn where to copy from. That is exactly why it never forms a working induction head.

This led me to hypothesise that a transformer might still function, and still form induction heads, with the OV circuit removed. It turns out it does. By changing the attention formula and concatenating the outputs of the attention heads instead of passing them through the WO matrix, we get the results below.

| Metric | Baseline | No-OV | Comparison |
| --- | --- | --- | --- |
| Parameters | 123.65M | 109.48M | No-OV is smaller |
| Tokens seen | 3.60B | 3.60B | matched |
| Training time | 11.2h | 11.0h | about same |
| Validation perplexity | 24.65 | 26.73 | Baseline better |
| Best validation perplexity | 24.65 | 26.61 | Baseline better |
| HellaSwag | 0.309 | 0.314 | No-OV slightly better |
| ARC-Easy | 0.386 | 0.404 | No-OV slightly better |
| ARC-Challenge | 0.227 | 0.217 | Baseline slightly better |
| LAMBADA accuracy | 0.213 | 0.165 | Baseline better |
| Induction bump | 2.816 | 2.715 | about same |
| Copy accuracy | 0.195 | 0.556 | No-OV much better |
| Lookup accuracy | 0.010 | 0.012 | about same |
| Value-content ablation | V: ×79.2 | K: ×18.4 | K becomes load-bearing |

Empirically, these results look extremely interesting, and they probably deserve their own separate article. I plan on exploring this further, especially because the results seem to suggest that transformers might be able to run without explicit OV circuits if this pattern continues to hold at a larger scale. For now, I am including the table above to make a simpler point: the components inside a transformer may be able to compensate for each other when one component is missing.

This is extremely interesting because induction heads have usually been thought of as a mechanism produced by the QK and OV circuits working together.
The QK circuit tells the model where to look, and the OV circuit tells the model what to copy. But in the No-OV setup, we still see an induction bump. This suggests that removing the explicit OV circuit does not completely remove the model’s ability to perform copying-like behavior. Instead, the model may be finding another pathway to carry the same information.

One possible explanation is that the key pathway starts taking on some of the role normally played by the value pathway. Since the model no longer has a separate value matrix and output projection, it is forced to reuse the key representation as the thing being written forward. In other words, the key is no longer only used for matching; it also starts carrying information that can be useful for prediction.

This means that the model may still be able to form an induction-like mechanism, but through a different route. The QK circuit can still help the model find the right previous position, and the key representation itself may contain enough token information to support copying. So even without a normal OV circuit, the model can still produce an induction bump because the information needed for copying has been partially moved into the key pathway.

This does not necessarily mean that OV circuits are unimportant. In fact, the baseline model still performs better on language modeling. But it does suggest that the transformer is more flexible than the standard story might imply. If one pathway is removed, another pathway may reorganize to carry part of the missing function.

To me, this is the most interesting part of the result. It suggests that induction may not be tied to one fixed implementation. Instead, induction might be a more general behavior that can emerge through different internal circuits, depending on what the architecture allows.
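As a rough illustration of the No-OV idea discussed above, the sketch below implements one causal attention head that has no value or output projection and simply returns an attention-weighted average of the keys. The exact formula used in the experiments is not reproduced in this post, so this is only one plausible reading of "reusing the key representation as the thing being written forward". The function names and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def no_ov_head(queries, keys):
    """Causal attention that writes forward a weighted average of the keys.

    Assumption: the keys play the role the values normally play, and there is
    no W_V or W_O. queries and keys are lists of equal-length vectors.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against keys at positions 0..i (causal mask).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(weights[j] * keys[j][c] for j in range(i + 1)) for c in range(d)]
        )
    return outputs


# Toy inputs: three positions, two dimensions.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = no_ov_head(queries, keys)
```

With several heads, the per-head outputs would be concatenated rather than mixed through WO, as described earlier in the post.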