microcrad is a tiny scalar-valued automatic differentiation engine for C, with a small neural network implementation built on top of it. It is a re-implementation of Andrej Karpathy's micrograd in C, written for people who want to understand how backpropagation really works.
Like the Python original, microcrad operates on scalars, not tensors. Every number that takes part in a computation is a node in a graph, every operation records how it was produced, and a single backward pass walks the graph in reverse to compute the derivative of the output with respect to every input. There is no vectorization, no GPU, no clever tricks: just the chain rule applied one scalar at a time. On top of this engine sits a multi-layer perceptron, so you can build a network, run a forward pass, call backward, and do gradient descent, all in C.
This repository is first and foremost an educational implementation. It is meant to be read, experimented with, and tested. It is not a production autograd package, not a practical deep-learning framework, and not optimized for large datasets or numerical robustness.
The whole thing is built around two ideas:
- A
Value
, which is a single node in the computation graph. Reference counting, which is how microcrad knows when aValue
is no longer part of any graph and can be freed.
Almost everything in the documentation below is a consequence of these two ideas, so it is worth keeping them in mind.
The fundamental type is Value
. A Value
wraps a single double
and, when it is the result of an operation, remembers the operands it was computed from:
typedef struct Value {
uint32_t ref_count; /* How many references point at this Value. */
uint32_t n_prevs; /* How many operands produced this Value. */
double data; /* The scalar this node holds. */
double extra_data; /* Operation parameter (e.g. the exponent in pow). */
struct Value **prev; /* The operands (previous nodes in the graph). */
int32_t op_code; /* Which operation produced this Value. */
uint32_t magic; /* Debug canary for some invalid or stale pointers. */
double grad; /* dLoss/dThisValue, filled in by backward. */
} Value;
A leaf Value
(an input, a weight, a constant) has n_prevs == 0
and no
operands. A Value
produced by an operation such as addition has n_prevs > 0
and a prev
array pointing at the operands it depends on. Because every
operation links its result back to its operands, the set of all Value
s
reachable through prev
pointers forms a directed acyclic graph: the computation graph.
This is the simplest microcrad program that computes something and its gradient:
Value *a = value_create_leaf(2.0);
Value *b = value_create_leaf(3.0);
Value *c = value_mul(a, b); /* c = a * b = 6 */
value_backward(c); /* returns 0 on success, and fills every grad field */
printf("c = %f\n", c->data); /* 6.000000 */
printf("dc/da= %f\n", a->grad); /* 3.000000 (== b) */
printf("dc/db= %f\n", b->grad); /* 2.000000 (== a) */
value_release(c); /* c freed; releases its hold on a and b */
value_release(a); /* a freed (its other reference was yours) */
value_release(b); /* b freed */
This small program already shows the essentials:
Value
s are heap allocated withvalue_create()
.- Operations such as
value_mul()
build newValue
s wired into the graph. value_backward()
computes the gradient of its argument with respect to every node it depends on.Value
s are reference counted, and releasing the root of a graph releases the whole graph (more on this below).
Value *value_create(double data, int32_t n_prevs, Value **prev);
Value *value_create_leaf(double data);
value_create_leaf
is the convenience constructor for a leaf node, that is, an input, a weight, a bias, or a constant:
Value *x = value_create_leaf(42.0);
The n_prevs
/prev
arguments exist because the operation functions (value_add
and friends) use value_create
internally to build result nodes. Most user code
should call value_create_leaf
and let the operations do the wiring.
A freshly created Value
starts with a reference count of 1
: the pointer returned to you is that one reference. It is your job to release it.
Value *value_add(Value *v1, Value *v2); /* v1 + v2 */
Value *value_mul(Value *v1, Value *v2); /* v1 * v2 */
Value *value_pow(Value *b, double e); /* b ** e */
Value *value_exp(Value *v); /* e ** v */
Value *value_log(Value *v); /* ln(v) */
Value *value_relu(Value *v); /* max(0,v) */
Unless stated otherwise, these functions expect non-NULL pointers and correctly shaped inputs. This code aims to keep the learning path clear; it documents important preconditions, but it does not try to harden every call like a production-grade API would.
Each of these returns a new Value
whose data
is the result of the
operation and whose prev
array points at the operands. Crucially, each operation retains its operands: it bumps their reference count so that the result node keeps them alive for as long as it needs them for the backward pass.
This means a result node co-owns its operands. You still own the references you were holding before the call, and you are still responsible for releasing them:
Value *a = value_create_leaf(2.0); /* a: ref_count 1 (yours) */
Value *b = value_create_leaf(3.0); /* b: ref_count 1 (yours) */
Value *c = value_add(a, b); /* a,b now ref_count 2; c ref_count 1 */
/* ... use c ... */
value_release(c); /* c freed; it releases its hold on a and b */
value_release(a); /* a freed (its other reference was yours) */
value_release(b); /* b freed */
Note that value_pow
takes a plain double
exponent, not a Value
: only
constant exponents are supported, and the exponent is stored in extra_data
.
The available op_code
s are addition, multiplication, power, exponential,
natural logarithm and ReLU. These are exactly the primitives needed to build a
ReLU network with a mean-squared-error or negative-log-likelihood loss, which is
what the examples do. Subtraction and division are not separate operations:
subtraction is addition with a negated operand (the toy example builds its
(prediction - target)
term this way), and division is multiplication by a
reciprocal, either a constant precomputed with value_create
as the examples
do for their loss scaling, or value_pow(x, -1.0)
when the divisor is itself a node in the graph.
int value_backward(Value *v);
value_backward
computes the gradient of v
with respect to every node it
transitively depends on, storing each result in that node's grad
field. It works in two steps, exactly like micrograd:
- It performs a depth-first
topological sort of the graph rooted at
v
, so that every node appears after all the nodes it depends on. This uses the internalVector
andSimpleSet
types (see below) to record the ordering and to avoid visiting a shared node twice. - It seeds
v->grad = 1
and walks the sorted list in reverse, and for each node it pushes its gradient onto its operands according to the local derivative of the operation that produced it (the chain rule).
It returns 0
on success and -1
on failure.
Precondition: v
must be non-NULL and must point at a valid computation graph root. If you are training in a loop, you must also zero any gradients you do not want to accumulate before calling it.
Because gradients accumulate (+=
) onto the operands, a Value
that is
used in more than one place in the graph correctly receives the sum of the
gradients flowing back through each path. This is why value_backward
does not
reset gradients for you: if you are training in a loop, you must zero the grad
fields yourself before each backward pass. Both training examples do exactly this:
for (size_t i = 0; i < parameters->size; i++)
vector_get(parameters, i)->grad = 0.0; /* zero the gradients */
value_backward(loss); /* accumulate new ones */
for (size_t i = 0; i < parameters->size; i++) {
Value *p = vector_get(parameters, i);
p->data -= learning_rate * p->grad; /* gradient descent step */
}
C has no garbage collector, and a computation graph is a tangle of shared
pointers: the same weight Value
can be an operand of thousands of result nodes, and the same intermediate result can feed several downstream operations. Freeing such a graph correctly by hand is error prone. microcrad solves this the same way many long-lived C programs do, with reference counting.
void value_retain(Value *v); /* take a reference: ref_count++ */
void value_release(Value *v); /* drop a reference: ref_count-- */
The rules are simple:
- Every
Value
is born with a reference count of1
, owned by whoever called the function that created it. value_retain
records that someone new is holding theValue
.value_release
records that a holder is done with it. When the count reaches zero, theValue
is freed,and it releases its own operands first, which may in turn free them, and so on recursively down the graph.
The recursive release is the important part: you almost never free a graph node by node. You release the root of the graph (the loss, the output of a forward pass), and the reference counts cascade downward, freeing exactly the nodes that nothing else still holds. Weights, which are also held by the network structure, survive; pure intermediates, which were only held by the result you just released, are freed.
value_release
is safe to call on NULL
, so you do not need to guard against it. This makes cleanup paths in functions that may fail part way through much easier to write, you can release everything unconditionally:
value_release(maybe_null); /* does nothing if maybe_null is NULL */
Every Value
carries a magic
marker set to a known constant when the node is
created. value_retain
and value_release
check it, and value_release
poisons it before recursively freeing the node. This is a debug canary, not
a correctness guarantee: it can help catch some invalid or stale Value *
usage while you are experimenting, but it is not a substitute for correct
ownership reasoning. If you ever see microcrad complaining that a Value *
is invalid or stale, you almost certainly have a reference-counting bug.
Reference counting buys correctness and composability, but at a cost.
Disadvantage #1: you must balance every reference. Each value_create
,
value_retain
, and each operation's implicit retain of its operands has to be
matched by a value_release
. Forget one and you leak; do one too many and you
free memory that is still in use. The training examples in examples/
are verbose precisely because they are scrupulous about this in their error paths; that verbosity is the price of leak-free C.
Disadvantage #2: operations co-own their operands, which can surprise you.
After Value *c = value_add(a, b)
, the nodes a
and b
are kept alive by c
even if you release your own references to them. This is what makes recursive
release work, but it means you cannot reason about a single Value
's lifetime in isolation, you have to think about the whole graph.
Advantage #1: graphs free themselves. Release the root and the entire subgraph that nothing else references disappears, in one call. There is no graph walk to write, no bookkeeping of which intermediates to free.
Advantage #2: sharing is free and correct. A weight used in ten thousand
multiplications is just retained ten thousand times; it is freed neither too
early nor too late. The same property is what lets value_backward
accumulate gradients correctly across shared nodes.
On top of the autograd engine, microcrad provides the three pieces you need for
a feed-forward network. Each is a thin structure whose parameters are Value
s, so a forward pass automatically builds a computation graph you can backpropagate through.
Neuron *neuron_create(uint32_t nin);
Value *neuron_forward(Neuron *n, Value **x);
Layer *layer_create(uint32_t nin, uint32_t nout);
Value **layer_forward(Layer *l, Value **x);
MLP *mlp_create(uint32_t nin, uint32_t *nouts, uint32_t n_layers);
Value **mlp_forward(MLP *mlp, Value **x);
These forward functions assume the caller passes arrays of the correct length:
neuron_forward
expects n->nin
inputs, layer_forward
expects the width used
to build the layer, and mlp_forward
expects the width of the model's first layer.
A Neuron
holds nin
weight Value
s and a bias, all initialized to small
random numbers. Its forward pass computes relu(w·x + b)
and returns the single
output Value
. A Layer
is an array of nout
neurons sharing the same input,
and its forward pass returns an array of nout
output Value
s. An MLP
chains several layers, feeding each layer's outputs into the next.
Note that every neuron applies a ReLU, including those in the output layer. This keeps the engine minimal but it shapes what the network can represent (its outputs are always non-negative), which is why the toy example targets a function that is itself non-negative. It is a deliberate simplification, not an oversight.
To train, you need a flat list of every weight and bias in the network so you can zero gradients and apply the update in a single loop. Each level exposes one:
Vector *neuron_parameters(Neuron *n);
Vector *layer_parameters(Layer *l);
Vector *mlp_parameters(MLP *mlp);
mlp_parameters
returns a Vector
containing every trainable scalar in the network. This is the list you iterate over to do gradient descent, as shown in the backpropagation section above.
Here is the shape of a full training step, the same shape both examples use:
uint32_t nouts[] = {8, 1};
MLP *model = mlp_create(2, nouts, 2); /* a 2 -> 8 -> 1 network */
Vector *params = mlp_parameters(model); /* flat list of all weights */
/* forward: build the graph */
Value *inputs[] = { value_create_leaf(x1), value_create_leaf(x2) };
Value **out = mlp_forward(model, inputs);
/* ... build a loss Value from out[...] ... */
/* backward + update */
for (size_t i = 0; i < params->size; i++) vector_get(params, i)->grad = 0.0;
value_backward(loss);
for (size_t i = 0; i < params->size; i++) {
Value *p = vector_get(params, i);
p->data -= learning_rate * p->grad;
}
/* cleanup */
value_release(loss);
/* ... release out, inputs ... */
vector_free(params);
mlp_free(model);
The two examples in examples/
flesh this out with concrete training loops and
data . Read train_on_toy_regression.c
first: it is the smallest complete program in the repository that creates a model, builds a graph, backpropagates, updates parameters, and runs inference.
The engine relies on two small, self-contained data structures. You normally do not interact with them directly, but they are worth knowing about.
(Vector
vector.h
) is a dynamically growing array ofValue
pointers. It grows in fixed-size blocks, and it participates in reference counting:vector_append
retains theValue
it stores andvector_free
releases everyValue
it holds. The parameter lists returned by*_parameters
areVector
s. -
(SimpleSet
simpleset.h
) is a minimal set keyed on pointer identity (aValue
's memory address). It supports only insertion and membership tests, which is exactly what the topological sort invalue_backward
needs to avoid visiting a shared node twice.
microcrad has no dependencies beyond a C compiler, the C standard library, and
libm
for the math functions. Everything is driven by the Makefile
.
To build and run the full test suite:
make test
The test/
directory contains a standalone suite per component, test_value
,
test_vector
, test_set
, test_neuron
, test_layer
, and test_mlp
, and you can build and run any one of them on its own:
make test_value
make test_mlp
To build and run the examples:
make example_toy_regression # tiny synthetic regression, no external data
make example_mnist # downloads MNIST, then runs a conceptual demo
example_mnist
will fetch the MNIST IDX files first via
examples/mnist/download_data.sh
. The toy regression example needs no data and is the fastest way to see the whole pipeline run end to end; it is the primary example to treat as supported.
make clean
removes the build directory.
- Read
examples/toy_regression/train_on_toy_regression.c
for the smallest complete training program. - Read
examples/mnist/train_on_mnist.c
only as a structural demonstration of wiring the engine to a real dataset. It is not a practical training recipe: the engine is scalar, the model is ReLU-only, and the example intentionally prioritizes explicit code over optimization or numerically careful modeling. - Read
test/
for compact, executable documentation of how each function is meant to be called and what it guarantees. - Read
src/value.c
itself, it is short, and the comments walk through the forward operations and the backward rules one case at a time.
microcrad is a C re-implementation of Andrej Karpathy's
micrograd. The autograd design, the
scalar Value
abstraction, and the topological-sort backward pass all follow the original; the reference-counted memory management and the C data structures are what this port adds in order to make those ideas work without a garbage collector.