# Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

> Source: <https://github.com/JustVugg/nanoeuler>
> Published: 2026-06-19 18:18:35+00:00

A GPT-2-class language model built **entirely from scratch in C/CUDA** — no PyTorch, no
autograd, no ML libraries. The forward **and** backward passes are written and verified by
hand, and the whole training pipeline lives in this repo: a hand-written **byte-level BPE
tokenizer**, **pretraining** on a books + web corpus, and **supervised fine-tuning** into a
chat model (RLHF/DPO planned). It runs on CPU (`libm` + OpenMP) for a small showcase model,
and a full from-scratch **CUDA engine** — cuBLAS matmuls, a hand-written **FlashAttention**,
validated against a CPU reference by a full-model gradient check — trains a **~116M-parameter**
model on a single RTX 4070.

**Status & honesty.** This is a research/educational artifact, built in public. At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is **not** a capable assistant: the chat model demonstrates that the pretrain→SFT pipeline works end to end, but it is not a useful chatbot. The point of the project is the from-scratch engineering and the complete, understandable training pipeline.

```
make check              # verify the backward pass (gradient check, double precision)
make                    # build the training binary
./nanoeuler train       # train the small showcase model (~0.76M params)
./nanoeuler train big   # train the larger model (~10M params; meant for a GPU)
./nanoeuler chat        # REPL: type a prompt, the model continues it
```

A residual block computes

```
x = x + f(x)
```

Read it as a step of numerical integration. The **forward-Euler method** advances an
ordinary differential equation `dx/dt = f(x)` by

```
x(t+Δt) = x(t) + Δt · f(x(t))
```

With step size `Δt = 1` this is exactly the residual update. So a deep residual
network is a *discretized ODE*: **depth is integration time**, and each layer
integrates the hidden state forward by one Euler step. This is the view behind work
like Neural ODEs (a ResNet is the Euler discretization of a continuous flow). The
project is named after **Leonhard Euler**, who gave us that integration method.
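
The equivalence can be checked numerically. The following is a minimal sketch, not code from the repo: `branch_f` is a hypothetical stand-in for a layer's residual branch, and `euler_step` and `residual_block` are illustrative names. With `dt = 1.0` the two updates produce the same state layer after layer.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_DIM = 16 };  /* scratch size for this sketch only */

/* Hypothetical residual branch f(x); a real block would use attention or SwiGLU. */
void branch_f(const double *x, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = 0.5 * x[i] * (1.0 - x[i]);
}

/* One forward-Euler step for dx/dt = f(x): x <- x + dt * f(x). */
void euler_step(double *x, size_t n, double dt) {
    double fx[MAX_DIM];
    assert(n <= MAX_DIM);
    branch_f(x, fx, n);
    for (size_t i = 0; i < n; i++) x[i] += dt * fx[i];
}

/* A residual block: x <- x + f(x). */
void residual_block(double *x, size_t n) {
    double fx[MAX_DIM];
    assert(n <= MAX_DIM);
    branch_f(x, fx, n);
    for (size_t i = 0; i < n; i++) x[i] += fx[i];
}
```

Calling `euler_step(x, n, 1.0)` once per layer walks the hidden state through the same sequence as stacking `residual_block` calls; a smaller `dt` would correspond to a finer discretization of the same flow.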

A sample from the ~116M model after a partial pretraining run on the books + web corpus
(prompt `Alessandro eat a`):

```
Alessandro eat a icing textile: the satisfied by the servants in order to keep your weight
[Using to a heated, collaborated young people that attend the metric process where the rank
is authorized and to contain the sedentary. Some state lawyers were able to insert ...
```

The content is not meaningful, but notice what it learned on its own: real grammar, long clauses, and an encyclopedic register picked up from the web data. This is the expected behaviour of a small model trained on a single GPU — fluent shape, shallow substance. More training and (far) more data improve fluency; world knowledge needs scale this project does not pretend to have.

Decoder-only transformer with the building blocks common to current models:

- **RMSNorm** (pre-norm, no bias)
- **Rotary position embeddings (RoPE)** applied to queries and keys
- **SwiGLU** feed-forward: `down(silu(gate(x)) * up(x))`
- **Grouped-query attention (GQA)**: query heads share a smaller set of key/value heads
- **Multi-token prediction (MTP)**: `K` output heads predict the next `K` tokens; the auxiliary heads improve the learned representation and enable speculative decoding. Generation uses head 0.
- **No biases** anywhere.
- **Byte-level BPE tokenizer**, hand-written, with GPT-2-style pretokenization (a single leading space attaches to the following word, so spaces are not wasted as standalone tokens). Merges are learned on a sample of the corpus; the GPU model uses a 4096-token vocabulary (~3.4 bytes/token on English).
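
To make the merge-learning step of the tokenizer concrete, here is a minimal sketch of byte-level BPE training. It is not the repo's tokenizer: the function names are illustrative, pair counting is a naive O(n²) scan, ties go to the pair seen first, and the GPT-2-style pretokenization (which would stop merges from crossing word boundaries) is left out. Byte tokens are assumed to use ids 0..255 and merged tokens ids from 256 upward.

```c
#include <assert.h>
#include <stddef.h>

/* Find the most frequent adjacent pair in tok[0..n). Naive O(n^2) scan,
   fine for a demo. Ties go to the pair that appears first. Returns its count. */
size_t most_frequent_pair(const int *tok, size_t n, int *left, int *right) {
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; j++)
            if (tok[j] == tok[i] && tok[j + 1] == tok[i + 1]) count++;
        if (count > best) { best = count; *left = tok[i]; *right = tok[i + 1]; }
    }
    return best;
}

/* Replace each (left, right) pair, scanning left to right, with new_id.
   Rewrites tok in place and returns the new length. */
size_t apply_merge(int *tok, size_t n, int left, int right, int new_id) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && tok[i] == left && tok[i + 1] == right) {
            tok[out++] = new_id;
            i += 2;
        } else {
            tok[out++] = tok[i++];
        }
    }
    return out;
}

/* Learn up to n_merges merges. merges[m] records the pair that became id 256 + m.
   Stops early once no pair occurs at least twice. Returns the final token count. */
size_t bpe_train(int *tok, size_t n, int n_merges, int merges[][2]) {
    assert(tok != NULL && n_merges >= 0);
    for (int m = 0; m < n_merges; m++) {
        int l = 0, r = 0;
        if (most_frequent_pair(tok, n, &l, &r) < 2) break;
        merges[m][0] = l;
        merges[m][1] = r;
        n = apply_merge(tok, n, l, r, 256 + m);
    }
    return n;
}
```

On the 11-byte string `aaabdaaabac` this sketch learns three merges and leaves five tokens. A vocabulary of 4096 would correspond to 3840 learned merges on top of the 256 byte tokens, under the id layout assumed here.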

Each block is `x = x + attn(rmsnorm(x))` followed by `x = x + swiglu(rmsnorm(x))`.
A residual connection `x = x + f(x)` is one step of the forward-Euler method for the
ODE `dx/dt = f(x)`, hence the name, and a nod to Leonhard Euler.

Configurations:

| where | dim | q/kv heads | layers | context | vocab | params |
|---|---|---|---|---|---|---|
| `small` (CPU, `nanoeuler.c`) | 128 | 4 / 2 | 4 | 128 | 512 | ~1.05M |
| GPU pipeline (`cuda/`, `run_train`) | 768 | 12 / 4 | 16 | 512 | 4096 | ~116M |

The CPU `small` model trains in a few hours on 12 cores and is a self-contained showcase.
The ~116M GPU model is the real pipeline: it pretrains on the books + web mix and is then
fine-tuned into a chat model (see below). The head size is 64 (`768/12`), which fits the
FlashAttention kernel.
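
The head arithmetic for the two configurations can be written down directly. The sketch below is illustrative: `AttnShape` and the helper names are hypothetical, and the mapping of consecutive query heads to one key/value head is the common GQA convention, assumed here rather than taken from the repo.

```c
#include <assert.h>

typedef struct { int dim, n_q_heads, n_kv_heads; } AttnShape;

/* Width of one attention head: dim / n_q_heads. */
int head_dim(AttnShape s) {
    assert(s.n_q_heads > 0 && s.dim % s.n_q_heads == 0);
    return s.dim / s.n_q_heads;
}

/* Total width of the key (or value) projection: n_kv_heads * head_dim. */
int kv_dim(AttnShape s) { return s.n_kv_heads * head_dim(s); }

/* Key/value head read by a given query head, assuming consecutive
   query heads form one group (assumption, see the lead-in). */
int kv_head_for(AttnShape s, int q_head) {
    assert(s.n_kv_heads > 0 && s.n_q_heads % s.n_kv_heads == 0);
    assert(q_head >= 0 && q_head < s.n_q_heads);
    int group = s.n_q_heads / s.n_kv_heads;
    return q_head / group;
}
```

For the GPU configuration (`dim = 768`, 12 query heads, 4 key/value heads) this gives a head size of 64, a key/value width of 256 instead of 768, and three query heads per key/value head.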

Hand-written back-propagation is easy to get subtly wrong, so every analytic gradient is compared against a central finite difference. The check runs in double precision so floating-point cancellation does not hide correct gradients:

``` bash
$ make check
  tok      : max rel err 1.02e-04
  qkvw     : max rel err 7.20e-07
  gatew    : max rel err 6.86e-08
  ...
max relative error: 1.02e-04
>>> backward OK (error < 1e-2)
```

Every parameter tensor is checked, including the less obvious backward passes of RoPE, SwiGLU, GQA, and MTP.

`make` builds with `-O3 -march=native -ffast-math -fopenmp`. Matrix multiplies and
attention are parallelized with OpenMP and vectorized; on a 12-core machine the
training loop uses all cores. `make check` builds a separate double-precision binary
used only for the gradient check.

No external dependencies. Tested with gcc 13 on Linux.

This is a from-scratch text generator and a complete, understandable training pipeline —
not a product. A model of this size trained on one GPU produces fluent-looking English with
little real knowledge; the fine-tuned chat model answers in assistant *form* but its content
is shallow. A usable conversational model needs orders of magnitude more parameters, data and
compute (a ~135M model only becomes a basic assistant after ~600B training tokens; this repo
trains on a far smaller corpus on a single GPU). The goal is to own every piece — every
parameter, every gradient, the tokenizer, the kernels, the pretraining and the fine-tuning.

`cuda/nanoeuler_cuda.cu` is a full from-scratch CUDA port: forward, backward, training
and inference on the GPU. Every kernel is validated on the device against a CPU reference,
and the whole model has a GPU gradient check (GPU grads vs CPU grads to ~1e-6).

Kernels: matmul (delegated to **cuBLAS** with TF32 tensor cores), RMSNorm, RoPE,
grouped-query attention with a hand-written **FlashAttention** (tiled, online softmax,
no T×T matrix in memory), SwiGLU, softmax/cross-entropy and AdamW. FlashAttention made
the training step about 3× faster.

Build (RTX 40-series = Ada = `sm_89`; the host-compiler flag avoids a gcc ICE on the large file):

```
cd cuda
nvcc -O3 -arch=sm_89 -Xcompiler -fno-tree-reassoc,-fno-tree-copy-prop nanoeuler_cuda.cu -o nanoeuler_cuda -lcublas
```

Modes:

```
./nanoeuler_cuda              # run all kernel self-tests (GPU vs CPU)
./nanoeuler_cuda g            # full-model gradient check (GPU grads vs CPU)
./nanoeuler_cuda t            # pretrain from scratch, checkpoint to ../nanoeuler.bin every 5k steps
./nanoeuler_cuda tr           # resume pretraining from the latest ../nanoeuler.bin checkpoint
./nanoeuler_cuda i "It was"   # autoregressive generation on GPU
./nanoeuler_cuda s            # supervised fine-tune on Alpaca, save ../nanoeuler_chat.bin
./nanoeuler_cuda c            # interactive chat with the fine-tuned model
```

Training checkpoints every 5000 steps, so a long run can be stopped (Ctrl-C) and resumed with
`tr`. A model trained on the GPU is saved in the CPU program's format, so `./nanoeuler chat`
can also load and run it.

The chat pipeline is two stages. First **pretrain** the ~116M base on the books + web mix
(`./nanoeuler_cuda t`, resumable with `tr`). Then **supervised fine-tuning** turns it into an
assistant: `./nanoeuler_cuda s` loads the pretrained base, renders each
[Alpaca](https://github.com/tatsu-lab/stanford_alpaca) example with the standard instruction
template, and trains with the **loss masked to the response tokens only** (prompt and padding
positions get a target of `-1`, which the cross-entropy kernel turns into zero gradient). The
result is saved to `nanoeuler_chat.bin`; `./nanoeuler_cuda c` then wraps each line you type in
the same template and samples a reply, stopping at the `</s>` end marker.

After fine-tuning the model answers in the right *shape* — it follows the instruction→response
format, writes complete sentences and stops on its own. The *content*, though, is shallow and
often wrong: this is a small model trained on a single GPU, so it has little world knowledge to
express. SFT teaches the model *how* to respond, not *what* it knows — that comes from
pretraining and scale. This is a faithful, fully-from-scratch demonstration that the
pretrain→SFT pipeline works end to end, not a capable assistant.

Pretraining uses a real **books + web** mix:

- **Books**: `data/get_gutenberg.sh` downloads ~95 public-domain Project Gutenberg classics (Austen, Dickens, Dostoevsky, Tolstoy, Melville, the complete Shakespeare, ...). Each book's Project Gutenberg license header/footer is stripped (only the text between the `*** START ... ***` / `*** END ... ***` markers is kept) so the model trains on prose.
- **Web**: `data/get_web.sh` pulls a slice of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (high-quality educational web text) straight from the Hugging Face parquet files using the **DuckDB** CLI (a single static binary; no Python, no libraries).

Then concatenate them into the pretraining corpus the trainer reads:

``` bash
sh data/get_gutenberg.sh                       # books  -> data/gutenberg.txt
sh data/get_web.sh                             # web    -> data/web.txt (~1 GB by default)
cat data/gutenberg.txt data/web.txt > data/pretrain.txt
sh data/get_alpaca.sh                          # instruction data for SFT -> data/alpaca.json
```

Corpora and model checkpoints are git-ignored (regenerable).

- ✅ Hand-written byte-level BPE with GPT-2-style pretokenization.
- ✅ From-scratch CUDA engine (cuBLAS + FlashAttention), validated by a full-model gradient check.
- ✅ Pretraining on a books + web mix, with checkpoint/resume.
- ✅ Supervised fine-tuning (Alpaca) with response-masked loss → a chat model.
- ⏳ **DPO** (preference optimization): the alignment stage, next to build.
- ⏳ Scale the model and data (toward ~270M) and publish a trained checkpoint people can try.

```
nanoeuler.c             CPU model: forward, backward, training, sampling, chat REPL
cuda/nanoeuler_cuda.cu  GPU engine: BPE, kernels, FlashAttention, pretrain/SFT/infer/chat, gradient check
data/get_gutenberg.sh   downloads + cleans the Gutenberg books corpus
data/get_web.sh         downloads a FineWeb-Edu web slice via the DuckDB CLI (no Python)
data/get_alpaca.sh      downloads the Alpaca instruction data for fine-tuning
Makefile  LICENSE  shakespeare.txt  .gitignore
```

MIT. See `LICENSE`.
