{"slug": "andrej-karpathy-s-neural-networks-zero-to-hero-1-intro-to-neural-networks-and", "title": "Andrej Karpathy's Neural Networks: Zero to Hero — 1) Intro to Neural Networks and Backpropagation", "summary": "Andrej Karpathy released a series of lecture videos and open-source code titled \"Neural Networks: Zero to Hero,\" beginning with a deep dive into backpropagation. In the first lecture, Karpathy built a small project called *micrograd* to demonstrate how neural networks function under the hood, arguing that backpropagation is the essential mechanism for training networks while everything else is primarily for efficiency. The project and accompanying code are available on GitHub.", "body_md": "Andrej Karpathy uploaded several lecture videos on YouTube and the accompanying code on GitHub. I think they are excellent lectures, even better than many paid online courses. Here's the link: [Neural Networks: Zero to Hero](https://github.com/karpathy/nn-zero-to-hero). So, I will summarize them and try to get meaningful insights from them. I'm going to cover all of them lecture by lecture (I hope...)\n\nThe first lecture is about backpropagation in neural networks.\n\nIn the video, Karpathy said that backpropagation is what you need to train neural networks, and everything else is mainly for efficiency. That is why he explained and demonstrated backpropagation in the very first video.\n\nI totally agree with him. Training neural networks and LLMs is essentially about reducing loss. And fundamentally, all methods for reducing loss are related to backpropagation, directly or indirectly.\n\nKarpathy built a small project called *micrograd*. You can see the code [here](https://github.com/karpathy/micrograd). This is made up of just a few simple lines of code, but it shows us how neural networks are built under the hood. In the video, he demonstrated how to build Micrograd and how it works step by step.\n\nWe all learned **differentiation and derivatives**, right? 
What do differentiation and derivatives actually mean?\n\nA derivative tells us how much $f(x)$ changes when $x$ changes. That is, it measures the effect of a variable on the output: the slope, or gradient, of the function. If the derivative is 0 at some point, that point may be a local maximum or minimum of the function (not always, but it can be).\n\nKarpathy didn't explain differentiation and derivatives in detail in the video. However, I think this is one of the most important concepts for understanding neural networks, so I'm going to explain it in more detail.\n\nThis concept is fundamental to linear regression as well. What is the core idea of linear regression? The goal is to minimize the residual sum of squares (RSS):\n\n$$\\mathrm{RSS} = \\sum_{i=1}^{n} \\left( y_i - \\beta_0 - \\beta_1 x_i \\right)^2$$\n\nThis is what linear regression is all about: finding the optimal $\\beta$ values. How do we find them? This is where the derivative comes in. The RSS is a quadratic function of the parameters that opens upward, so its minimum is the point at which the derivative becomes 0. Therefore, if we differentiate the equation and find where the derivative is zero, we can find the best $\\beta_1$ and $\\beta_0$.\n\nTo find the optimal $\\beta_0$ and $\\beta_1$, we take the partial derivatives of RSS with respect to each parameter and set them equal to zero. I will demonstrate how to derive them.\n\nFirst, differentiate the RSS with respect to $\\beta_0$:\n\n$$\\frac{\\partial\\, \\mathrm{RSS}}{\\partial \\beta_0} = -2 \\sum_{i=1}^{n} \\left( y_i - \\beta_0 - \\beta_1 x_i \\right)$$\n\nThen, differentiate the RSS with respect to $\\beta_1$:\n\n$$\\frac{\\partial\\, \\mathrm{RSS}}{\\partial \\beta_1} = -2 \\sum_{i=1}^{n} x_i \\left( y_i - \\beta_0 - \\beta_1 x_i \\right)$$\n\nAt the minimum point, both partial derivatives are zero:\n\n$$\\sum_{i=1}^{n} \\left( y_i - \\beta_0 - \\beta_1 x_i \\right) = 0, \\qquad \\sum_{i=1}^{n} x_i \\left( y_i - \\beta_0 - \\beta_1 x_i \\right) = 0$$\n\nThis gives us the normal equations:\n\n$$n \\beta_0 + \\beta_1 \\sum_{i=1}^{n} x_i = \\sum_{i=1}^{n} y_i, \\qquad \\beta_0 \\sum_{i=1}^{n} x_i + \\beta_1 \\sum_{i=1}^{n} x_i^2 = \\sum_{i=1}^{n} x_i y_i$$\n\nSolving these equations gives the optimal values:\n\n$$\\beta_1 = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}, \\qquad \\beta_0 = \\bar{y} - \\beta_1 \\bar{x}$$\n\nHere, $\\bar{x}$ is the mean of the input values, and $\\bar{y}$ is the mean of the target values.\n\nNeural networks are also built from these kinds of linear expressions, usually combined with nonlinear activation functions. 
However, the way we calculate the parameters is totally different, because neural networks are much more complicated and have far more parameters. In general there is no closed-form solution, so it is almost impossible to find the parameters in this way.\n\n[Finding your way in pitch darkness](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyxf4v0f9t5q9ov6o4a8.png)\n\nAssume that we get lost while hiking in the mountains and are trying to get down. It is night, so we are in pitch darkness and can only see a few inches around us. In this case, how can we get down the mountain? The answer is simple: by following the slope downward. As long as we can see the slope around us, we can tell which way leads downward. This is how we find the optimal point when the equation is so complex that we are not able to solve for the optimum analytically.\n\n``` python\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndef f(x):\n  return 3*x**2 - 4*x + 5\n\nf(3.0)  # 20.0\n\n# evaluate f on a grid of points and plot the parabola\nxs = np.arange(-5, 5, 0.25)\nys = f(xs)\nplt.plot(xs, ys)\n```\n\n[Visualization of the quadratic function](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpebqpadyzrfsmex5o05.png)\n\nThis is an example from Karpathy's code. The function is $f(x) = 3x^2 - 4x + 5$. Its derivative is $\\frac{d f(x)}{d x} = 6x - 4$.\n\nSolving $6x - 4 = 0$ shows that the derivative is 0 at $x = 2/3$. Then, if $x$ is at some other point, how can we move $x$ to find the minimum point of $f(x)$?\n\nThe answer is simple. If the derivative at a certain point is greater than 0, we have to decrease $x$; if it is less than 0, we have to increase $x$. 
Therefore, if we repeatedly subtract $\\lambda (6x - 4)$ from $x$, with a suitably small learning rate $\\lambda$, $x$ will converge to $2/3$ and $f(x)$ to its minimum.\n\nWe can also check the derivative numerically with a finite difference. This approach is useful for building intuition and for gradient checking.\n\n``` python\nh = 0.000001\nx = 2/3\n(f(x + h) - f(x))/h\n```\n\nThe output is 2.999378523327323e-06, which is almost zero. It is not exactly zero because this is a forward finite-difference approximation (for this quadratic, the approximation error is exactly $3h$) and because floating-point numbers have limited precision, but it is close enough for this simple demonstration.\n\nSimilarly, when you have an expression with several variables, you can get the slope with respect to a specific variable in this way.\n\n``` python\nh = 0.0001\n\n# inputs\na = 2.0\nb = -3.0\nc = 10.0\n\nd1 = a*b + c\nc += h  # nudge only c, keeping a and b fixed\nd2 = a*b + c\n\nprint('d1', d1)\nprint('d2', d2)\nprint('slope', (d2 - d1)/h)\n```\n\nOutput:\n\n```\nd1 4.0\nd2 4.0001\nslope 0.9999999999976694\n```\n\nThis is also Karpathy's code. This example shows how $d$ changes when $c$ changes from 10.0. The gradient is 1.0, of course, since the derivative $\\frac{d}{dc}(ab + c)$ is 1.\n\nKarpathy demonstrates a hands-on example of how to calculate the gradients. Let's see one of his examples.\n\nWhen $f$ is the final output, we should calculate the gradients of $f$ with respect to all intermediate values. Let's do this one by one.\n\nThe gradient of $f$ with respect to itself is 1 because $\\frac{d}{df}f = 1$. Easy, right?\n\nThe important thing is that addition passes the gradient through. For example, if $x = a + b$ and $\\frac{d}{dx}f(x) = k$, the gradients of $a$ and $b$ are also $k$. For subtraction, the subtracted term receives the negative of the upstream gradient. And when it comes to multiplication, the upstream gradient is multiplied by the other variable. 
If $x = a \\times b$, the gradient of $a$ is $k \\times b$ because the local derivative with respect to $a$ is $b$.\n\nIn this example, $a = -2.0$, $b = 3.0$, $d = a \\times b = -6.0$, $e = a + b = 1.0$, and $f = d \\times e = -6.0$. The gradient of $e$ is $\\frac{d}{de}f$. Since $f$ is $d \\times e$, the gradient of $e$ is $d = -6.0$. Likewise, the gradient of $d$ is $e = 1.0$.\n\nFinally, $a$ feeds into both $e$ and $d$, so the gradient of $a$ is the contribution from $e$ plus the contribution from $d$: $-6 + 3 = -3$. The gradient of $b$ is also the contribution from $e$ plus the contribution from $d$: $-6 - 2 = -8$.\n\nThe computation graph for the final output looks like this:\n\nWhat if we want to minimize $f$ by tuning the value of $a$? Subtracting $\\lambda$ times the gradient, that is $-3\\lambda$, from $a$ makes $f$ smaller for a small enough $\\lambda$, since $f$ as a function of $a$ is an upward-opening quadratic. If $\\lambda$ is 0.1, we subtract -0.3 from $a$. Then $a$ becomes -1.7. As a result, $f$ becomes -6.63, which is smaller than the original -6.0.\n\nNow, let's apply this algorithm to a neural network.\n\nI organized another example from Karpathy in the image above. This is a very simple neural network architecture that he made. Actually, this is not the whole story yet. What we want to minimize is the loss function. 
So, if $o$ is $\\hat{y}$, gradient descent should minimize $\\sum_{i=1}^{n} \\left( y_i - o_i \\right)^2$.\n\nIn this way, Karpathy shows hands-on code that runs gradient descent on a simple MLP.\n\nHere's the MLP and training loop Karpathy builds in the Micrograd lecture:\n\n``` python\nimport random\n\n# Value is the scalar autograd class built earlier in the lecture\n\nclass Neuron:\n\n  def __init__(self, nin):\n    self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]\n    self.b = Value(random.uniform(-1,1))\n\n  def __call__(self, x):\n    # w * x + b\n    act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)\n    out = act.tanh()\n    return out\n\n  def parameters(self):\n    return self.w + [self.b]\n\nclass Layer:\n\n  def __init__(self, nin, nout):\n    self.neurons = [Neuron(nin) for _ in range(nout)]\n\n  def __call__(self, x):\n    outs = [n(x) for n in self.neurons]\n    return outs[0] if len(outs) == 1 else outs\n\n  def parameters(self):\n    return [p for neuron in self.neurons for p in neuron.parameters()]\n\nclass MLP:\n\n  def __init__(self, nin, nouts):\n    sz = [nin] + nouts\n    self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]\n\n  def __call__(self, x):\n    for layer in self.layers:\n      x = layer(x)\n    return x\n\n  def parameters(self):\n    return [p for layer in self.layers for p in layer.parameters()]\n\nx = [2.0, 3.0, -1.0]\nn = MLP(3, [4, 4, 1])\nn(x)\n\nxs = [\n  [2.0, 3.0, -1.0],\n  [3.0, -1.0, 0.5],\n  [0.5, 1.0, 1.0],\n  [1.0, 1.0, -1.0],\n]\nys = [1.0, -1.0, -1.0, 1.0] # desired targets\n\nfor k in range(20):\n\n  # forward pass\n  ypred = [n(x) for x in xs]\n  loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))\n\n  # backward pass (reset gradients first so they don't accumulate across steps)\n  for p in n.parameters():\n    p.grad = 0.0\n  loss.backward()\n\n  # update: step each parameter against its gradient\n  for p in n.parameters():\n    p.data += -0.1 * p.grad\n\n  print(k, loss.data)\n```\n\nWith his Micrograd code, you can see how the gradient descent algorithm works step by step. You can also calculate the gradient of each variable on your own. 
I strongly recommend doing these hands-on examples. After watching the video, I was able to clearly understand how gradient descent works, why we should zero out gradients before each backward pass, why the ReLU function is so efficient, and so on. This is definitely worth your time.\n\nI have shown Karpathy's demonstrations of gradient descent. As he said, this is the core concept for training neural networks. The rest is just for efficiency. Reducing loss using gradients: this is what makes neural network training possible and, ultimately, helped usher in the AI era.", "url": "https://wpnews.pro/news/andrej-karpathy-s-neural-networks-zero-to-hero-1-intro-to-neural-networks-and", "canonical_source": "https://dev.to/jun07/andrej-karpathys-neural-networks-zero-to-hero-1-intro-to-neural-networks-and-backpropagation-4j5f", "published_at": "2026-05-29 16:56:39+00:00", "updated_at": "2026-05-29 17:12:10.175668+00:00", "lang": "en", "topics": ["neural-networks", "machine-learning", "artificial-intelligence", "ai-research"], "entities": ["Andrej Karpathy", "micrograd", "Neural Networks: Zero to Hero", "GitHub", "YouTube"], "alternates": {"html": "https://wpnews.pro/news/andrej-karpathy-s-neural-networks-zero-to-hero-1-intro-to-neural-networks-and", "markdown": "https://wpnews.pro/news/andrej-karpathy-s-neural-networks-zero-to-hero-1-intro-to-neural-networks-and.md", "text": "https://wpnews.pro/news/andrej-karpathy-s-neural-networks-zero-to-hero-1-intro-to-neural-networks-and.txt", "jsonld": "https://wpnews.pro/news/andrej-karpathy-s-neural-networks-zero-to-hero-1-intro-to-neural-networks-and.jsonld"}}