Backpropagation & Matrix Calculus: Understanding the Math through Code

A new blog series on generative AI and deep learning begins by explaining backpropagation and matrix calculus through code, building a three-layer neural network from scratch using NumPy. The series aims to demystify the math behind modern neural networks, starting with the fundamentals of gradient flow and the chain rule.

Build a three-layer neural network from scratch and understand how gradients flow backward, one NumPy operation at a time.

Series Note: This is the first blog in a 60-blog series in which we uncover GenAI and deep learning by coding. We start at the absolute bedrock: the math that powers every modern neural network.

Deep learning might feel like a black box. But beneath the architectural hype of Transformers, diffusion models, and state space models, there is no magic, only multivariable calculus and linear algebra working together. The engine that drives learning across all of these architectures is backpropagation. While modern deep learning frameworks abstract it away with automatic differentiation (torch.autograd), truly mastering it requires getting your hands dirty with the underlying math.

Before we dive into the matrix calculus, we need to understand what backpropagation is actually trying to accomplish.

The Intuition: Industry Line

If you want to understand backpropagation intuitively, do not think about neurons or math. Think about an industrial assembly line. Imagine a factory that builds a complex product, say a smartphone. The factory has three departments operating in strict sequence:

Department A (Input Layer): Receives raw materials and builds the motherboard.
Department B (Hidden Layer): Takes the motherboard and attaches the screen and battery.
Department C (Output Layer): Packages the final phone and ships it to the customer.

The Forward Pass (Making a Prediction): The factory builds a phone and ships it out. In deep learning, this is the forward pass. Data moves in one direction: from the input, through the hidden layers, to the final prediction.

The Loss Function (The Customer Complaint): The customer receives the phone, turns it on, and the screen instantly catches fire. The customer is furious and leaves a terrible review. The gap between what the customer expected (a working phone) and what they got (a fire hazard) is the Loss.
Our goal is to minimize this loss.

The Backward Pass (Backpropagation): How does the factory fix the issue so the next phone doesn't catch fire? The CEO looks at the final product and starts pointing fingers backward down the chain of command. This is backpropagation.

Step 1: The CEO asks Department C (Output Layer), "Why did you ship a flaming phone?" Department C says, "Don't look at us, we just put it in the box. Department B gave us a faulty unit."
Step 2: The complaint (the error signal) is passed backward to Department B. Department B inspects their work and says, "Our battery installation was fine, but the motherboard we received from Department A was surging with electricity."
Step 3: The complaint is passed backward again to Department A. Department A realizes their voltage regulators were calibrated incorrectly and adjusts their machines.

In a neural network, this is executed using the chain rule of calculus. The network starts at the final output error and works backward layer by layer, calculating how much "blame" (the gradient) belongs to each specific weight and bias. Once the network knows how much each parameter contributed to the error, it uses gradient descent to adjust the parameters: each weight is nudged in the direction that reduces the loss, so that the next time a phone goes down the assembly line, it is slightly less likely to catch fire.

Now that we know we are simply quantifying blame and passing it backward, let's translate this corporate assembly line into the rigorous language of matrix calculus.

Example Scenario

Imagine we want to predict if a team will win a high-stakes match in the final minutes. We have one historical training example, and we are going to train our network on it. Our network takes 3 inputs:

Scoring Rate: 0.8 (they are scoring fast)
Players Remaining: 0.9 (they haven't lost many players)
Home-Ground Advantage: 1.0 (they are playing in their home stadium)

The true outcome of this historical match? They won (1.0).

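Before moving on to notation, the blame-passing idea from the assembly line can be made concrete with plain scalars. The sketch below uses three hypothetical functions (dept_a, dept_b, dept_c) as stand-ins for the three departments; their formulas are invented purely for illustration and are not part of the network built later. It shows that the sensitivity of the final output to the raw input equals the product of each stage's local derivative, which is what the chain rule states, and it cross-checks that product against a finite-difference estimate.

```python
import numpy as np

# Hypothetical stand-ins for the three departments (illustration only).
def dept_a(x):
    return 3.0 * x        # local derivative: 3

def dept_b(u):
    return u ** 2         # local derivative: 2 * u

def dept_c(v):
    return np.sin(v)      # local derivative: cos(v)

# Forward pass: the product moves down the line.
x = 0.5
u = dept_a(x)
v = dept_b(u)
out = dept_c(v)

# Backward pass: multiply local derivatives from the output back to the input.
d_out_dv = np.cos(v)
d_v_du = 2.0 * u
d_u_dx = 3.0
chain_rule_grad = d_out_dv * d_v_du * d_u_dx

# Independent check: central finite difference on the whole pipeline.
eps = 1e-6
numeric_grad = (
    dept_c(dept_b(dept_a(x + eps))) - dept_c(dept_b(dept_a(x - eps)))
) / (2 * eps)

print(f"chain rule: {chain_rule_grad:.6f} | numeric: {numeric_grad:.6f}")
```

The two numbers should agree to several decimal places. Backpropagation in a neural network does the same multiplication of local derivatives, only with matrices instead of scalars.
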
Getting Familiar with Notations

Before we look at the code, let's define the vocabulary of our network:

x (Inputs): The raw data we feed into the network.
y (Target): The true answer we want the network to predict.
W (Weights): The network's "memory" or "knowledge." These are multipliers that tell the network how important a specific input is. If a weight is large in magnitude, that feature matters a lot.
np.dot (Dot Product): The mathematical engine of our network. It multiplies the inputs by the weights and adds the products together; for a single neuron, this yields a single number.
sigmoid (Activation Function): A mathematical "gate." It takes any real number and squashes it into a value between 0.0 and 1.0, which we can read as a probability. It keeps every activation bounded no matter how large the raw number is.
epoch: One complete training cycle. The network makes a guess, checks the error, and updates the weights.
learning rate: How aggressively we change the weights. If it is too high, the network overcorrects and may never settle. If it is too low, the network learns very slowly.

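As a quick illustration of the two operations just defined, here is a minimal sketch of a single neuron applied to the scenario's inputs. The weight values in w are hypothetical, picked by hand so the arithmetic is easy to follow; the real network starts from random weights.

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Inputs from the scenario: scoring rate, players remaining, home-ground advantage.
x = np.array([0.8, 0.9, 1.0])

# Hypothetical weights for one neuron (illustration only).
w = np.array([0.5, -0.2, 0.1])

# Dot product: 0.8*0.5 + 0.9*(-0.2) + 1.0*0.1 = 0.32
z = np.dot(x, w)

# Activation: a small positive z maps to a value a little above 0.5.
a = sigmoid(z)
print(f"z = {z:.2f}, a = {a:.4f}")

# Large negative inputs approach 0, zero maps to exactly 0.5, large positive approach 1.
squashed = sigmoid(np.array([-10.0, 0.0, 10.0]))
print(squashed)
```

In the full network, W1 holds four such weight columns side by side, so one np.dot call computes all four hidden neurons at once.
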
Implementation with Python

```python
import numpy as np
import plotly.graph_objects as go

# ---------------------------------------------------------
# SETUP: THE DATA & ARCHITECTURE
# ---------------------------------------------------------
# Inputs: Scoring Rate, Players Remaining, Home-Ground
# (2-D arrays, one row per training example, so that .T works as expected)
x = np.array([[0.8, 0.9, 1.0]])
y = np.array([[1.0]])  # Target: 1.0 (Win)

# Initialize random weights for a 3-Layer Network
# W1 connects 3 inputs to 4 hidden neurons
W1 = np.random.randn(3, 4) * 0.1
# W2 connects 4 hidden neurons to 4 more hidden neurons
W2 = np.random.randn(4, 4) * 0.1
# W3 connects 4 hidden neurons to 1 final output prediction
W3 = np.random.randn(4, 1) * 0.1

learning_rate = 0.5
loss_history = []

# Our activation function and its derivative (used for finding the error)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    return sigmoid(z) * (1 - sigmoid(z))

print("--- TRAINING INITIATED ---")

# ---------------------------------------------------------
# THE TRAINING LOOP (100 Epochs)
# ---------------------------------------------------------
for epoch in range(100):
    # --- PHASE A: THE FORWARD PASS (Guessing) ---
    z1 = np.dot(x, W1)
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2)
    a2 = sigmoid(z2)
    z3 = np.dot(a2, W3)
    prediction = sigmoid(z3)

    # --- PHASE B: THE ERROR (The Reality Check) ---
    # Calculate Mean Squared Error Loss
    loss = np.mean(0.5 * (prediction - y) ** 2)
    loss_history.append(loss)
    # Calculate the raw error signal at the final output
    error_signal = (prediction - y) * sigmoid_deriv(z3)

    # --- PHASE C: THE BACKWARD PASS (Passing the blame) ---
    # Layer 3 Blame (dW3 means "change in Weight 3")
    dW3 = np.dot(a2.T, error_signal)
    # Layer 2 Blame (push error backward through W3)
    error_hidden2 = np.dot(error_signal, W3.T) * sigmoid_deriv(z2)
    dW2 = np.dot(a1.T, error_hidden2)
    # Layer 1 Blame (push error backward through W2)
    error_hidden1 = np.dot(error_hidden2, W2.T) * sigmoid_deriv(z1)
    dW1 = np.dot(x.T, error_hidden1)

    # --- PHASE D: THE UPDATE (Learning) ---
    # Nudge the weights in the opposite direction of the error
    W3 -= learning_rate * dW3
    W2 -= learning_rate * dW2
    W1 -= learning_rate * dW1

    # Print progress every 20 steps
    if epoch % 20 == 0:
        print(f"Epoch {epoch}: Prediction = {prediction[0][0]:.4f} | Loss = {loss:.4f}")

print(f"Final Prediction: {prediction[0][0]:.4f} (Target is 1.0)")
```

Understanding Core Processes

The Forward Pass

The network needs to make a prediction. Data flows strictly forward.

The Process: We take our inputs x and multiply them by our first set of weights W1 using np.dot(x, W1). We store this raw result in z1.
The Activation: We pass z1 through our sigmoid function to squash the numbers between 0 and 1, creating our first set of "hidden signals" (a1).

We repeat this domino effect through Layer 2 and Layer 3 until out pops a final prediction. On the very first loop, this prediction is essentially a random guess (usually around 0.5, because the initial weights are small).

Measuring the Error

We have a guess, and we have the true answer (y = 1.0). We must calculate how wrong we are.

The Process: We subtract the true answer from our guess: prediction - y.
The Loss: We square this difference, which removes negative signs and punishes really bad guesses heavily, and scale it by 0.5 to get our loss. Our goal as machine learning engineers is to drive this loss toward zero.
The Error Signal: We multiply our raw error by sigmoid_deriv(z3). This is the chain rule at work: it converts the error measured at the prediction into the error with respect to z3, the raw value that entered the final sigmoid.

The Backward Pass

This is the heart of backpropagation. We have an error, and we need to trace it backward to see which specific weights caused it.

The Transpose (.T): Notice how we use .T on our matrices, as in W3.T. If the weights send data forward, the transposed weight matrix acts like a reverse pipe, sending the error signal backward.
The Process: We start at the end. We calculate dW3 (the gradient for W3). A gradient points in the direction in which the loss increases fastest, which is why we later step against it. We then pipe the error backward through W3.T to find out how much the middle layer contributed (error_hidden2), and calculate dW2. We repeat this all the way to the first layer.

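One way to convince yourself that the backward pass described above is correct is a gradient check: compare the backpropagated gradients with slow but simple finite-difference estimates. The sketch below is an addition for illustration, not part of the original training script. It re-implements the same forward and backward formulas in helper functions (loss_fn, backprop, numeric_grad, all hypothetical names) and uses a seeded generator so the run is repeatable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def loss_fn(x, y, W1, W2, W3):
    # Forward pass followed by the same 0.5 * squared-error loss as the article.
    a1 = sigmoid(np.dot(x, W1))
    a2 = sigmoid(np.dot(a1, W2))
    prediction = sigmoid(np.dot(a2, W3))
    return np.mean(0.5 * (prediction - y) ** 2)

def backprop(x, y, W1, W2, W3):
    # Same backward-pass formulas as the training loop.
    z1 = np.dot(x, W1)
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2)
    a2 = sigmoid(z2)
    z3 = np.dot(a2, W3)
    prediction = sigmoid(z3)
    error_signal = (prediction - y) * sigmoid_deriv(z3)
    dW3 = np.dot(a2.T, error_signal)
    error_hidden2 = np.dot(error_signal, W3.T) * sigmoid_deriv(z2)
    dW2 = np.dot(a1.T, error_hidden2)
    error_hidden1 = np.dot(error_hidden2, W2.T) * sigmoid_deriv(z1)
    dW1 = np.dot(x.T, error_hidden1)
    return dW1, dW2, dW3

def numeric_grad(x, y, weights, index, eps=1e-6):
    # Central finite difference: wiggle one weight at a time and watch the loss.
    W = weights[index]
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            original = W[i, j]
            W[i, j] = original + eps
            loss_plus = loss_fn(x, y, *weights)
            W[i, j] = original - eps
            loss_minus = loss_fn(x, y, *weights)
            W[i, j] = original  # restore the weight
            grad[i, j] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
x = np.array([[0.8, 0.9, 1.0]])
y = np.array([[1.0]])
weights = [
    rng.standard_normal((3, 4)) * 0.1,
    rng.standard_normal((4, 4)) * 0.1,
    rng.standard_normal((4, 1)) * 0.1,
]

analytic = backprop(x, y, *weights)
max_diffs = []
for index, name in enumerate(["dW1", "dW2", "dW3"]):
    diff = np.max(np.abs(analytic[index] - numeric_grad(x, y, weights, index)))
    max_diffs.append(diff)
    print(f"{name}: max abs difference = {diff:.2e}")
```

If the transposes or the sigmoid_deriv factors were misplaced, the differences would be on the order of the gradients themselves rather than tiny. Finite differences need two forward passes per weight, which is why real training relies on backpropagation and keeps this technique for testing.
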
Gradient Descent

We now have dW1, dW2, and dW3. We know exactly who to blame and by how much.

The Process: We take our current weights and subtract the gradient, scaled by the learning rate. Because the gradient points toward increasing loss, moving in the opposite direction descends the loss curve. The learning rate keeps the steps small and measured instead of wildly overcorrecting.

The loop restarts. But this time, the network is slightly smarter.

Visualising Learning Process

Numbers in a console are great, but a chart makes it easy to see that backpropagation works. Let's use Plotly to graph the loss_history list that we saved during the loop. When the line goes down, it means our network is getting smarter. We will format this chart cleanly (no grid lines, pure white background) so the data speaks for itself.