{"slug": "show-hn-microcrad-micrograd-reimplemented-in-c", "title": "Show HN: Microcrad – Micrograd Reimplemented in C", "summary": "A developer released Microcrad, a reimplementation of Andrej Karpathy's Micrograd automatic differentiation engine in C, designed for educational purposes to demonstrate backpropagation on scalars. The library uses reference counting and builds a computation graph of scalar values, enabling gradient computation via reverse-mode differentiation.", "body_md": "microcrad is a tiny scalar-valued automatic differentiation engine for C, with\na small neural network implementation built on top of it. It is a re-implementation\nof Andrej Karpathy's [micrograd](https://github.com/karpathy/micrograd) in C,\nwritten for people who want to understand how backpropagation really works.\n\nLike the Python original, microcrad operates on **scalars**, not tensors. Every\nnumber that takes part in a computation is a node in a graph, every operation\nrecords how it was produced, and a single backward pass walks the graph in\nreverse to compute the derivative of the output with respect to every input.\nThere is no vectorization, no GPU, no clever tricks: just the chain rule applied\none scalar at a time. On top of this engine sits a multi-layer perceptron, so\nyou can build a network, run a forward pass, call backward, and do gradient\ndescent, all in C.\n\nThis repository is first and foremost an **educational implementation**. It is\nmeant to be read, experimented with, and tested. It is **not** a production\nautograd package, not a practical deep-learning framework, and not optimized\nfor large datasets or numerical robustness.\n\nThe whole thing is built around two ideas:\n\n- A `Value`, which is a single node in the computation graph. 
\n- **Reference counting**, which is how microcrad knows when a `Value` is no longer part of any graph and can be freed.\n\nAlmost everything in the documentation below is a consequence of these two ideas, so it is worth keeping them in mind.\n\nThe fundamental type is `Value`. A `Value` wraps a single `double` and, when it\nis the result of an operation, remembers the operands it was computed from:\n\n```\ntypedef struct Value {\n    uint32_t ref_count;   /* How many references point at this Value. */\n    uint32_t n_prevs;     /* How many operands produced this Value. */\n    double data;          /* The scalar this node holds. */\n    double extra_data;    /* Operation parameter (e.g. the exponent in pow). */\n    struct Value **prev;  /* The operands (previous nodes in the graph). */\n    int32_t op_code;      /* Which operation produced this Value. */\n    uint32_t magic;       /* Debug canary for some invalid or stale pointers. */\n    double grad;          /* dLoss/dThisValue, filled in by backward. */\n} Value;\n```\n\nA leaf `Value` (an input, a weight, a constant) has `n_prevs == 0` and no\noperands. A `Value` produced by an operation such as addition has `n_prevs > 0`\nand a `prev` array pointing at the operands it depends on. 
Because every\noperation links its result back to its operands, the set of all `Value`s\nreachable through `prev` pointers forms a directed acyclic graph: the\n**computation graph**.\n\nThis is the simplest microcrad program that computes something and its gradient:\n\n```\nValue *a = value_create_leaf(2.0);\nValue *b = value_create_leaf(3.0);\nValue *c = value_mul(a, b);   /* c = a * b = 6 */\n\nvalue_backward(c);            /* returns 0 on success, and fills every grad field */\n\nprintf(\"c    = %f\\n\", c->data);   /* 6.000000 */\nprintf(\"dc/da= %f\\n\", a->grad);   /* 3.000000  (== b) */\nprintf(\"dc/db= %f\\n\", b->grad);   /* 2.000000  (== a) */\n\nvalue_release(c);             /* c freed; releases its hold on a and b */\nvalue_release(a);             /* a freed (its other reference was yours) */\nvalue_release(b);             /* b freed */\n```\n\nThis small program already shows the essentials:\n\n- `Value`s are heap allocated with `value_create()`.\n- Operations such as `value_mul()` build new `Value`s wired into the graph.\n- `value_backward()` computes the gradient of its argument with respect to every node it depends on.\n- `Value`s are reference counted, and releasing the root of a graph releases the whole graph (more on this below).\n\n```\nValue *value_create(double data, int32_t n_prevs, Value **prev);\nValue *value_create_leaf(double data);\n```\n\n`value_create_leaf` is the convenience constructor for a leaf node, that is, an\ninput, a weight, a bias, or a constant:\n\n```\nValue *x = value_create_leaf(42.0);\n```\n\nThe `n_prevs`/`prev` arguments exist because the operation functions (`value_add`\nand friends) use `value_create` internally to build result nodes. Most user code\nshould call `value_create_leaf` and let the operations do the wiring.\n\nA freshly created `Value` starts with a reference count of `1`: the pointer\nreturned to you *is* that one reference. 
It is your job to release it.\n\n```\nValue *value_add(Value *v1, Value *v2);   /* v1 + v2  */\nValue *value_mul(Value *v1, Value *v2);   /* v1 * v2  */\nValue *value_pow(Value *b, double e);     /* b ** e   */\nValue *value_exp(Value *v);               /* e ** v   */\nValue *value_log(Value *v);               /* ln(v)    */\nValue *value_relu(Value *v);              /* max(0,v) */\n```\n\nUnless stated otherwise, these functions expect non-NULL pointers and correctly shaped inputs. This code aims to keep the learning path clear; it documents important preconditions, but it does not try to harden every call like a production-grade API would.\n\nEach of these returns a **new** `Value` whose `data` is the result of the\noperation and whose `prev` array points at the operands. Crucially, each\noperation **retains its operands**: it bumps their reference count so that the\nresult node keeps them alive for as long as it needs them for the backward pass.\n\nThis means a result node *co-owns* its operands. You still own the references\nyou were holding before the call, and you are still responsible for releasing\nthem:\n\n```\nValue *a = value_create_leaf(2.0);   /* a: ref_count 1 (yours) */\nValue *b = value_create_leaf(3.0);   /* b: ref_count 1 (yours) */\nValue *c = value_add(a, b);              /* a,b now ref_count 2; c ref_count 1 */\n\n/* ... use c ... */\n\nvalue_release(c);   /* c freed; it releases its hold on a and b   */\nvalue_release(a);   /* a freed (its other reference was yours)     */\nvalue_release(b);   /* b freed                                     */\n```\n\nNote that `value_pow` takes a plain `double` exponent, not a `Value`: only\nconstant exponents are supported, and the exponent is stored in `extra_data`.\n\nThe available `op_code`s are addition, multiplication, power, exponential,\nnatural logarithm and ReLU. 
These are exactly the primitives needed to build a\nReLU network with a mean-squared-error or negative-log-likelihood loss, which is\nwhat the examples do. Subtraction and division are not separate operations:\nsubtraction is addition with a negated operand (the toy example builds its\n`(prediction - target)` term this way), and division is multiplication by a\nreciprocal, either a constant precomputed with `value_create` as the examples\ndo for their loss scaling, or `value_pow(x, -1.0)` when the divisor is itself a\nnode in the graph.\n\n```\nint value_backward(Value *v);\n```\n\n`value_backward` computes the gradient of `v` with respect to every node it\ntransitively depends on, storing each result in that node's `grad` field. It\nworks in two steps, exactly like micrograd:\n\n- It performs a depth-first **topological sort** of the graph rooted at `v`, so that every node appears after all the nodes it depends on. This uses the internal `Vector` and `SimpleSet` types (see below) to record the ordering and to avoid visiting a shared node twice.\n- It seeds `v->grad = 1` and walks the sorted list in reverse, and for each node it pushes its gradient onto its operands according to the local derivative of the operation that produced it (the chain rule).\n\nIt returns `0` on success and `-1` on failure.\n\nPrecondition: `v` must be non-NULL and must point at a valid computation graph\nroot. If you are training in a loop, you must also zero any gradients you do\nnot want to accumulate before calling it.\n\nBecause gradients **accumulate** (`+=`) onto the operands, a `Value` that is\nused in more than one place in the graph correctly receives the sum of the\ngradients flowing back through each path. This is why `value_backward` does not\nreset gradients for you: if you are training in a loop, you must zero the `grad`\nfields yourself before each backward pass. 
Both training examples do exactly\nthis:\n\n```c\nfor (size_t i = 0; i < parameters->size; i++)\n    vector_get(parameters, i)->grad = 0.0;   /* zero the gradients   */\n\nvalue_backward(loss);                        /* accumulate new ones  */\n\nfor (size_t i = 0; i < parameters->size; i++) {\n    Value *p = vector_get(parameters, i);\n    p->data -= learning_rate * p->grad;      /* gradient descent step */\n}\n```\n\nC has no garbage collector, and a computation graph is a tangle of shared\npointers: the same weight `Value` can be an operand of thousands of result\nnodes, and the same intermediate result can feed several downstream operations.\nFreeing such a graph correctly by hand is error prone. microcrad solves this\nthe same way many long-lived C programs do, with **reference counting**.\n\n```\nvoid value_retain(Value *v);    /* take a reference: ref_count++ */\nvoid value_release(Value *v);   /* drop a reference: ref_count-- */\n```\n\nThe rules are simple:\n\n- Every `Value` is born with a reference count of `1`, owned by whoever called the function that created it.\n- `value_retain` records that someone new is holding the `Value`.\n- `value_release` records that a holder is done with it. When the count reaches zero, the `Value` is freed, **and it releases its own operands first**, which may in turn free them, and so on recursively down the graph.\n\nThe recursive release is the important part: you almost never free a graph node\nby node. You release the **root** of the graph (the loss, the output of a\nforward pass), and the reference counts cascade downward, freeing exactly the\nnodes that nothing else still holds. Weights, which are also held by the network\nstructure, survive; pure intermediates, which were only held by the result you\njust released, are freed.\n\n`value_release` is safe to call on `NULL`, so you do not need to guard against\nit. 
This makes cleanup paths in functions that may fail part way through much\neasier to write: you can release everything unconditionally.\n\n```\nvalue_release(maybe_null);   /* does nothing if maybe_null is NULL */\n```\n\nEvery `Value` carries a `magic` marker set to a known constant when the node is\ncreated. `value_retain` and `value_release` check it, and `value_release`\npoisons it before recursively freeing the node. This is a **debug canary**, not\na correctness guarantee: it can help catch some invalid or stale `Value *`\nusage while you are experimenting, but it is not a substitute for correct\nownership reasoning. If you ever see microcrad complaining that a `Value *` is\ninvalid or stale, you almost certainly have a reference-counting bug.\n\nReference counting buys correctness and composability, but at a cost.\n\n**Disadvantage #1**: you must balance every reference. Each `value_create`,\n`value_retain`, and each operation's implicit retain of its operands has to be\nmatched by a `value_release`. Forget one and you leak; do one too many and you\nfree memory that is still in use. The training examples in `examples/` are\nverbose precisely because they are scrupulous about this in their error paths;\nthat verbosity is the price of leak-free C.\n\n**Disadvantage #2**: operations co-own their operands, which can surprise you.\nAfter `Value *c = value_add(a, b)`, the nodes `a` and `b` are kept alive by `c`\neven if you release your own references to them. This is what makes recursive\nrelease work, but it means you cannot reason about a single `Value`'s lifetime\nin isolation; you have to think about the whole graph.\n\n**Advantage #1**: graphs free themselves. Release the root and the entire\nsubgraph that nothing else references disappears, in one call. There is no graph\nwalk to write, no bookkeeping of which intermediates to free.\n\n**Advantage #2**: sharing is free and correct. 
A weight used in ten thousand\nmultiplications is just retained ten thousand times; it is freed neither too\nearly nor too late. The same property is what lets `value_backward` accumulate\ngradients correctly across shared nodes.\n\nOn top of the autograd engine, microcrad provides the three pieces you need for\na feed-forward network. Each is a thin structure whose parameters are `Value`s,\nso a forward pass automatically builds a computation graph you can backpropagate\nthrough.\n\n```\nNeuron *neuron_create(uint32_t nin);\nValue  *neuron_forward(Neuron *n, Value **x);\n\nLayer  *layer_create(uint32_t nin, uint32_t nout);\nValue **layer_forward(Layer *l, Value **x);\n\nMLP    *mlp_create(uint32_t nin, uint32_t *nouts, uint32_t n_layers);\nValue **mlp_forward(MLP *mlp, Value **x);\n```\n\nThese forward functions assume the caller passes arrays of the correct length:\n`neuron_forward` expects `n->nin` inputs, `layer_forward` expects the width used\nto build the layer, and `mlp_forward` expects the width of the model's first\nlayer.\n\nA `Neuron` holds `nin` weight `Value`s and a bias, all initialized to small\nrandom numbers. Its forward pass computes `relu(w·x + b)` and returns the single\noutput `Value`. A `Layer` is an array of `nout` neurons sharing the same input,\nand its forward pass returns an array of `nout` output `Value`s. An `MLP` chains\nseveral layers, feeding each layer's outputs into the next.\n\nNote that **every neuron applies a ReLU**, including those in the output layer.\nThis keeps the engine minimal but it shapes what the network can represent (its\noutputs are always non-negative), which is why the toy example targets a\nfunction that is itself non-negative. It is a deliberate simplification, not an\noversight.\n\nTo train, you need a flat list of every weight and bias in the network so you can zero gradients and apply the update in a single loop. 
Each level exposes one:\n\n```\nVector *neuron_parameters(Neuron *n);\nVector *layer_parameters(Layer *l);\nVector *mlp_parameters(MLP *mlp);\n```\n\n`mlp_parameters` returns a `Vector` containing every trainable scalar in the\nnetwork. This is the list you iterate over to do gradient descent, as shown in\nthe backpropagation section above.\n\nHere is the shape of a full training step, the same shape both examples use:\n\n```c\nuint32_t nouts[] = {8, 1};\nMLP *model = mlp_create(2, nouts, 2);     /* a 2 -> 8 -> 1 network      */\nVector *params = mlp_parameters(model);   /* flat list of all weights   */\n\n/* forward: build the graph */\nValue *inputs[] = { value_create_leaf(x1), value_create_leaf(x2) };\nValue **out = mlp_forward(model, inputs);\n\n/* ... build a loss Value from out[...] ... */\n\n/* backward + update */\nfor (size_t i = 0; i < params->size; i++) vector_get(params, i)->grad = 0.0;\nvalue_backward(loss);\nfor (size_t i = 0; i < params->size; i++) {\n    Value *p = vector_get(params, i);\n    p->data -= learning_rate * p->grad;\n}\n\n/* cleanup */\nvalue_release(loss);\n/* ... release out, inputs ... */\nvector_free(params);\nmlp_free(model);\n```\n\nThe two examples in `examples/` flesh this out with concrete training loops and\ndata loading. Read `train_on_toy_regression.c` first: it is the smallest complete program in the repository that creates a\nmodel, builds a graph, backpropagates, updates parameters, and runs inference.\n\nThe engine relies on two small, self-contained data structures. You normally do not interact with them directly, but they are worth knowing about.\n\n- `Vector` (`vector.h`) is a dynamically growing array of `Value` pointers. It grows in fixed-size blocks, and it participates in reference counting: `vector_append` retains the `Value` it stores and `vector_free` releases every `Value` it holds. The parameter lists returned by `*_parameters` are `Vector`s. 
\n- `SimpleSet` (`simpleset.h`) is a minimal set keyed on pointer identity (a `Value`'s memory address). It supports only insertion and membership tests, which is exactly what the topological sort in `value_backward` needs to avoid visiting a shared node twice.\n\nmicrocrad has no dependencies beyond a C compiler, the C standard library, and\n`libm` for the math functions. Everything is driven by the `Makefile`.\n\nTo build and run the full test suite:\n\n```\nmake test\n```\n\nThe `test/` directory contains a standalone suite per component, `test_value`,\n`test_vector`, `test_set`, `test_neuron`, `test_layer`, and `test_mlp`, and you\ncan build and run any one of them on its own:\n\n```\nmake test_value\nmake test_mlp\n```\n\nTo build and run the examples:\n\n```\nmake example_toy_regression   # tiny synthetic regression, no external data\nmake example_mnist            # downloads MNIST, then runs a conceptual demo\n```\n\n`example_mnist` will fetch the MNIST IDX files first via\n`examples/mnist/download_data.sh`. The toy regression example needs no data and\nis the fastest way to see the whole pipeline run end to end; it is the primary\nexample to treat as supported.\n\n`make clean` removes the build directory.\n\n- Read `examples/toy_regression/train_on_toy_regression.c` for the smallest complete training program.\n- Read `examples/mnist/train_on_mnist.c` only as a structural demonstration of wiring the engine to a real dataset. It is not a practical training recipe: the engine is scalar, the model is ReLU-only, and the example intentionally prioritizes explicit code over optimization or numerically careful modeling.\n- Read `test/` for compact, executable documentation of how each function is meant to be called and what it guarantees. 
\n- Read `src/value.c` itself; it is short, and the comments walk through the forward operations and the backward rules one case at a time.\n\nmicrocrad is a C re-implementation of Andrej Karpathy's\n[micrograd](https://github.com/karpathy/micrograd). The autograd design, the\nscalar `Value` abstraction, and the topological-sort backward pass all follow\nthe original; the reference-counted memory management and the C data structures\nare what this port adds in order to make those ideas work without a garbage\ncollector.", "url": "https://wpnews.pro/news/show-hn-microcrad-micrograd-reimplemented-in-c", "canonical_source": "https://github.com/oraziorillo/microcrad", "published_at": "2026-06-17 13:34:42+00:00", "updated_at": "2026-06-17 13:53:23.760519+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "developer-tools"], "entities": ["Andrej Karpathy", "Microcrad", "Micrograd"], "alternates": {"html": "https://wpnews.pro/news/show-hn-microcrad-micrograd-reimplemented-in-c", "markdown": "https://wpnews.pro/news/show-hn-microcrad-micrograd-reimplemented-in-c.md", "text": "https://wpnews.pro/news/show-hn-microcrad-micrograd-reimplemented-in-c.txt", "jsonld": "https://wpnews.pro/news/show-hn-microcrad-micrograd-reimplemented-in-c.jsonld"}}