{"slug": "time-when-more-layers-meant-worse-model-birth-of-residual", "title": "Time When More Layers Meant Worse Model ... Birth Of Residual", "summary": "A developer discovered that adding skip connections—the line `x = x + output`—solved the problem of deeper neural networks producing more errors, a challenge documented in a Microsoft Research paper. This technique, which preserves the original embedding vector by merging it back into the computation flow, prevents gradients from vanishing through successive layers and enabled the creation of the residual network architecture. The finding marked a turning point where adding more layers finally improved model performance rather than degrading it.", "body_md": "\n\n``` python\nclass TinyTransformer(nn.Module):\n    def __init__(self):\n        super().__init__()\n        # setting the constructor for the initial values that we are every gonna need for the training of the data\n        self.char_embedding = nn.Embedding(65, 64)\n        self.pos_embedding = nn.Embedding(64, 64)\n        self.query = nn.Linear(64, 64)\n        self.key = nn.Linear(64, 64)\n        self.value = nn.Linear(64, 64)\n        self.mask = torch.tril(torch.ones(64, 64))\n        # these are for changing the dimensions we are doing this to enlarge the matrix as to make it of higher resolution so as to make the \n        # data and weights more refined \n        self.ff1 = nn.Linear(64, 128)\n        # this is to join them back again \n        self.ff2 = nn.Linear(128, 64)\n        self.output_head = nn.Linear(64, 65)\n        self.norm1 = nn.LayerNorm(64)\n        self.norm2 = nn.LayerNorm(64)\n        self.out_proj = nn.Linear(64, 64)\n\n    def forward(self, x):\n        # feed forward function\n        x = self.char_embedding(x) + self.pos_embedding(torch.arange(64))\n        # this is the start of the attention stuff i am writing this as a way to separate the code in section inside a functions\n        #\n        Q = self.query(x)\n      
  Q = Q.view(32, 64, 2, 32)\n        Q = Q.transpose(1, 2)\n        K = self.key(x)\n        K = K.view(32, 64, 2, 32)\n        K = K.transpose(1, 2)\n        V = self.value(x)\n        V = V.view(32, 64, 2, 32)\n        V = V.transpose(1, 2)\n        A = (Q @ K.transpose(-2, -1)) / 32**0.5\n\n        A = A.masked_fill(self.mask == 0, float(\"-inf\"))\n        At = A.softmax(dim=-1)\n        # dim=-1 tells softmax to normalize over the last dimension (the keys)\n        output = At @ V\n        output = output.transpose(1, 2).contiguous().view(32, 64, 64)\n        output = self.out_proj(output)\n        # this is where the attention ends and we start with the feed forward part that will give us the predictions\n        # added another form of normalization below to improve accuracy: the first time the loss only got down to about 1.8,\n        # and after adding the line below it reached about 1.5\n        x = x + output\n        x = self.norm1(x)\n\n        output = self.ff1(x)\n        output = torch.relu(output)\n        output = self.ff2(output)\n\n        x = x + output  # ← merge back into main flow\n        x = self.norm2(x)\n        x = self.output_head(x)\n        return x\n```\n\nthis code is basically boilerplate at this point for anyone in the ai space who trains transformers.\n\ni just want to understand one little line in here that has a really interesting history behind it.\n\n```\n x = x + output\n```\n\nwhy are we doing this, x = x + output?\n\nneural networks learn through backpropagation, which basically means they look for the change that moves us closer to the correct predictions by adjusting the weights, that is it. now there is a particular problem with this: as the gradient is passed back from one layer to the one before it, it can become smaller and smaller, and this is huge because the earliest layers then barely get updated and training slows to a crawl. 
this happens basically because of the chain rule for partial derivatives: the gradient at an early layer is a product of many per-layer terms, and when those terms are smaller than 1 the product shrinks fast. but how does this line solve that?\n\nread this paper here --\n\nthis paper is from the microsoft research team and it is basically about how they solved the problem that more is not always better. before this paper, making a deep learning model deeper (i.e. adding more layers) past a certain point made it produce more error, and that was a huge problem. people didn't know how to solve it, because on one hand depth gives you better understanding and on the other hand it was giving you more errors too.\n\nnow we might think that the solution is to just add the original embedding vector (in my case) to the context matrix we got after all the computations, and you would be right to think that, but not for the reason you might expect. the paper itself says that the vanishing gradient is not the cause of this problem:\n\n\"We argue that this optimization difficulty is unlikely to be caused by vanishing gradients.\"\n\nwhy ? 
- the reason we can drop our suspicion of the diminishing gradient is that there are already things in place to minimize it, like batch normalization and, in this case, ReLU. here is how we do it in our code -\n\n```\n    x = self.norm1(x) # layer normalization, the batch normalization equivalent in transformers \n\n    output = self.ff1(x)\n    output = torch.relu(output) # ReLU, another way to reduce the vanishing gradient problem \n    output = self.ff2(output)\n\n    x = x + output  # ← merge back into main flow\n    x = self.norm2(x)\n    x = self.output_head(x)\n```\n\nas you can see, these do take care of the vanishing gradient, and yet if we remove x = x + output the result gets worse. you know what, let's try it --\n\nthis is when we train normally and don't change anything. now let's change one thing: we remove the line x = x + output, that is it, and see how it affects the loss.\n\nso the loss jumped from 1.70 to 2.47 because of just this one line. it might not seem like a lot, but remember that this is just a 1 layer model for simplicity, and the more layers we add the more the error goes up too. 
to solidify my point i want to show the gradients live, which only takes a few small adjustments to the code -\n\n```\nstep 44500, loss: 2.5130\n  char_embedding.weight          grad_norm: 0.006248\n  pos_embedding.weight           grad_norm: 0.005838\n  ff1.weight                     grad_norm: 0.024721\n  ff2.weight                     grad_norm: 0.053932\n  output_head.weight             grad_norm: 0.163109\n  norm1.weight                   grad_norm: 0.007271\n  norm2.weight                   grad_norm: 0.024594\nstep 45000, loss: 2.4751\n  char_embedding.weight          grad_norm: 0.005574\n  pos_embedding.weight           grad_norm: 0.005913\n  ff1.weight                     grad_norm: 0.023506\n  ff2.weight                     grad_norm: 0.056331\n  output_head.weight             grad_norm: 0.161182\n  norm1.weight                   grad_norm: 0.007898\n  norm2.weight                   grad_norm: 0.020992\nstep 45500, loss: 2.4623\n  char_embedding.weight          grad_norm: 0.006224\n  pos_embedding.weight           grad_norm: 0.006075\n  ff1.weight                     grad_norm: 0.025461\n  ff2.weight                     grad_norm: 0.051210\n  output_head.weight             grad_norm: 0.145062\n  norm1.weight                   grad_norm: 0.008452\n  norm2.weight                   grad_norm: 0.018521\nstep 46000, loss: 2.4764\n  char_embedding.weight          grad_norm: 0.006709\n  pos_embedding.weight           grad_norm: 0.006148\n  ff1.weight                     grad_norm: 0.026940\n  ff2.weight                     grad_norm: 0.057071\n  output_head.weight             grad_norm: 0.163159\n  norm1.weight                   grad_norm: 0.008988\n  norm2.weight                   grad_norm: 0.025112\nstep 46500, loss: 2.4746\n  char_embedding.weight          grad_norm: 0.006127\n  pos_embedding.weight           grad_norm: 0.006181\n  ff1.weight                     grad_norm: 0.025931\n  ff2.weight                     grad_norm: 0.056799\n  output_head.weight        
     grad_norm: 0.158272\n  norm1.weight                   grad_norm: 0.008369\n  norm2.weight                   grad_norm: 0.025981\n```\n\nso here is what i was saying: even though it looks like the residual line is there to solve the diminishing gradient, that is in fact not what it is doing at all (in the log above the gradients still reach the embeddings).\n\nwhat it truly does is something way more interesting -\n\nevery layer and every non-linearity changes the values a little, and these changes compound fast, like really fast. with something like 20 layers it might still work, but when we go from that to something like 50 layers these \"small\" changes add up, and even though each change is very very small, the values that come out at the end may be completely different from the original, like way too different. here is an example -\n\n[5, 3, 8] (original) ---> [5, 3, 8] (output weights) ---> [0.3, 0.01, 0.2] (these are almost 0)\n\nnotice that this is not due to something like the diminishing gradient at all, it is due to the small changes we make in between. so what would happen if we add the original back in? -\n\n[5, 3, 8] + [0.3, 0.01, 0.2] = [5.3, 3.01, 8.2]\n\nso this resultant one is very close to the original right ? 
that is the main idea of the residual. there are many other residual variants too, but for simplicity we are just gonna stick with good old addition, and frankly it's better this way.", "url": "https://wpnews.pro/news/time-when-more-layers-meant-worse-model-birth-of-residual", "canonical_source": "https://dev.to/avirals554/time-when-more-layers-meant-worse-model-birth-of-residual-26f6", "published_at": "2026-05-27 19:16:54+00:00", "updated_at": "2026-05-27 19:41:06.774443+00:00", "lang": "en", "topics": ["neural-networks", "machine-learning", "artificial-intelligence", "large-language-models", "ai-research"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/time-when-more-layers-meant-worse-model-birth-of-residual", "markdown": "https://wpnews.pro/news/time-when-more-layers-meant-worse-model-birth-of-residual.md", "text": "https://wpnews.pro/news/time-when-more-layers-meant-worse-model-birth-of-residual.txt", "jsonld": "https://wpnews.pro/news/time-when-more-layers-meant-worse-model-birth-of-residual.jsonld"}}