Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

A developer released NanoEuler, a GPT-2-scale language model built entirely from scratch in C/CUDA without any machine learning libraries. The project includes a hand-written byte-level BPE tokenizer, pretraining on books and web data, and supervised fine-tuning into a chat model, with a ~116M-parameter version trainable on a single RTX 4070. It is an educational artifact demonstrating the full training pipeline from scratch, not a production-ready assistant.

A GPT-2-class language model built entirely from scratch in C/CUDA: no PyTorch, no autograd, no ML libraries. The forward and backward passes are written and verified by hand, and the whole training pipeline lives in this repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model (RLHF/DPO planned). It runs on CPU (libm + OpenMP) for a small showcase model, and a full from-scratch CUDA engine (cuBLAS matmuls, a hand-written FlashAttention, validated against a CPU reference by a full-model gradient check) trains a ~116M-parameter model on a single RTX 4070.

**Status & honesty.** This is a research/educational artifact, built in public. At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is not a capable assistant: the chat model demonstrates that the pretrain→SFT pipeline works end to end, but it is not a useful chatbot. The point of the project is the from-scratch engineering and the complete, understandable training pipeline.
- `make check`: verify the backward pass (gradient check, double precision)
- `make`: build the training binary
- `./nanoeuler train`: train the small showcase model (~0.76M params)
- `./nanoeuler train big`: train the larger model (~10M params; meant for a GPU)
- `./nanoeuler chat`: REPL; type a prompt, the model continues it

A residual block computes `x = x + f(x)`. Read it as a step of numerical integration. The forward-Euler method advances an ordinary differential equation `dx/dt = f(x)` by `x(t+Δt) = x(t) + Δt · f(x(t))`. With step size `Δt = 1` this is exactly the residual update. So a deep residual network is a discretized ODE: depth is integration time, and each layer integrates the hidden state forward by one Euler step. This is the view behind work like Neural ODEs (a ResNet is the Euler discretization of a continuous flow). The project is named after Leonhard Euler, who gave us that integration method.

A sample from the ~116M model after a partial pretraining run on the books + web corpus (prompt `Alessandro eat a`):

> Alessandro eat a icing textile: the satisfied by the servants in order to keep your weight Using to a heated, collaborated young people that attend the metric process where the rank is authorized and to contain the sedentary. Some state lawyers were able to insert ...

The content is not meaningful, but notice what it learned on its own: real grammar, long clauses, and an encyclopedic register picked up from the web data. This is the expected behaviour of a small model trained on a single GPU: fluent shape, shallow substance. More training and far more data improve fluency; world knowledge needs scale this project does not pretend to have.
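The forward-Euler reading of the residual update can be checked numerically. The sketch below is not code from the repo: `f_decay`, `euler_step` and `integrate` are hypothetical names, and the "layer" `f(x) = -x` is chosen only because the ODE `dx/dt = -x` has the known solution `x(t) = x(0) · exp(-t)`. It shows that one step with `Δt = 1` is the residual update `x = x + f(x)`, and that stacking many small steps (more "depth") approaches the continuous solution.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Signature of a "layer" f: reads x, writes f(x) into out. */
typedef void (*layer_fn)(const double *x, double *out, size_t n);

/* Hypothetical layer f(x) = -x, used because dx/dt = -x has the
   closed-form solution x(t) = x(0) * exp(-t). */
static void f_decay(const double *x, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = -x[i];
}

/* One forward-Euler step: x <- x + dt * f(x).
   With dt = 1 this is exactly the residual update x = x + f(x). */
static void euler_step(double *x, double *scratch, size_t n,
                       double dt, layer_fn f) {
    assert(n > 0);
    f(x, scratch, n);
    for (size_t i = 0; i < n; i++) x[i] += dt * scratch[i];
}

/* Integrate from t = 0 to t = T using `steps` equal Euler steps;
   each step plays the role of one residual layer. */
static void integrate(double *x, double *scratch, size_t n,
                      double T, int steps, layer_fn f) {
    assert(steps > 0);
    double dt = T / (double)steps;
    for (int s = 0; s < steps; s++) euler_step(x, scratch, n, dt, f);
}
```

Calling `integrate` on `x = 1.0` with `T = 1.0` and 1000 steps lands close to `exp(-1)`; a single call to `euler_step` with `dt = 1.0` reproduces the plain residual update.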
Decoder-only transformer with the building blocks common to current models:

- RMSNorm (pre-norm, no bias)
- Rotary position embeddings (RoPE) applied to queries and keys
- SwiGLU feed-forward: `down(silu(gate(x)) * up(x))`
- Grouped-query attention (GQA): query heads share a smaller set of key/value heads
- Multi-token prediction (MTP): K output heads predict the next K tokens; the auxiliary heads improve the learned representation and enable speculative decoding. Generation uses head 0.
- No biases anywhere.
- Byte-level BPE tokenizer, hand-written, with GPT-2-style pretokenization (a single leading space attaches to the following word, so spaces are not wasted as standalone tokens). Merges are learned on a sample of the corpus; the GPU model uses a 4096-token vocabulary (~3.4 bytes/token on English).

Each block is `x = x + attn(rmsnorm(x))` followed by `x = x + swiglu(rmsnorm(x))`. A residual connection `x = x + f(x)` is one step of the forward-Euler method for the ODE `dx/dt = f(x)`; hence the name, and a nod to Leonhard Euler.

Configurations:

| where | dim | q/kv heads | layers | context | vocab | params |
|---|---|---|---|---|---|---|
| small (CPU, `nanoeuler.c`) | 128 | 4 / 2 | 4 | 128 | 512 | ~1.05M |
| GPU pipeline (`cuda/`, run train) | 768 | 12 / 4 | 16 | 512 | 4096 | ~116M |

The small CPU model trains in a few hours on 12 cores and is a self-contained showcase. The ~116M GPU model is the real pipeline: it pretrains on the books + web mix and is then fine-tuned into a chat model (see below). The head size is 64 (768/12), which fits the FlashAttention kernel.

Hand-written back-propagation is easy to get subtly wrong, so every analytic gradient is compared against a central finite difference. The check runs in double precision so that floating-point cancellation in the finite difference does not make correct gradients look wrong:

```bash
$ make check
tok    : max rel err 1.02e-04
qkvw   : max rel err 7.20e-07
gatew  : max rel err 6.86e-08
...
max relative error: 1.02e-04
backward OK (error < 1e-2)
```

Every parameter tensor is checked, including the less obvious backward passes of RoPE, SwiGLU, GQA, and MTP.

`make` builds with `-O3 -march=native -ffast-math -fopenmp`. Matrix multiplies and attention are parallelized with OpenMP and vectorized; on a 12-core machine the training loop uses all cores. `make check` builds a separate double-precision binary used only for the gradient check. No external dependencies. Tested with gcc 13 on Linux.

This is a from-scratch text generator and a complete, understandable training pipeline, not a product. A model of this size trained on one GPU produces fluent-looking English with little real knowledge; the fine-tuned chat model answers in assistant form but its content is shallow. A usable conversational model needs orders of magnitude more parameters, data and compute (a ~135M model only becomes a basic assistant after ~600B training tokens; this repo trains on a far smaller corpus on a single GPU). The goal is to own every piece: every parameter, every gradient, the tokenizer, the kernels, the pretraining and the fine-tuning.

`cuda/nanoeuler_cuda.cu` is a full from-scratch CUDA port: forward, backward, training and inference on the GPU. Every kernel is validated on the device against a CPU reference, and the whole model has a GPU gradient check (GPU grads vs CPU grads to ~1e-6). Kernels: matmul (delegated to cuBLAS with TF32 tensor cores), RMSNorm, RoPE, grouped-query attention with a hand-written FlashAttention (tiled, online softmax, no T×T matrix in memory), SwiGLU, softmax/cross-entropy and AdamW. FlashAttention made the training step about 3× faster.
Build (RTX 40-series = Ada = `sm_89`; the host-compiler flag avoids a gcc ICE on the large file):

```bash
cd cuda
nvcc -O3 -arch=sm_89 -Xcompiler -fno-tree-reassoc,-fno-tree-copy-prop nanoeuler_cuda.cu -o nanoeuler_cuda -lcublas
```

Modes:

- `./nanoeuler_cuda`: run all kernel self-tests (GPU vs CPU)
- `./nanoeuler_cuda g`: full-model gradient check (GPU grads vs CPU)
- `./nanoeuler_cuda t`: pretrain from scratch, checkpoint to `../nanoeuler.bin` every 5k steps
- `./nanoeuler_cuda tr`: resume pretraining from the latest `../nanoeuler.bin` checkpoint
- `./nanoeuler_cuda i "It was"`: autoregressive generation on GPU
- `./nanoeuler_cuda s`: supervised fine-tune on Alpaca, save `../nanoeuler_chat.bin`
- `./nanoeuler_cuda c`: interactive chat with the fine-tuned model

Training checkpoints every 5000 steps, so a long run can be stopped (Ctrl-C) and resumed with `tr`. A model trained on the GPU is saved in the CPU program's format, so `./nanoeuler chat` can also load and run it.

The chat pipeline is two stages. First pretrain the ~116M base on the books + web mix (`./nanoeuler_cuda t`, resumable with `tr`). Then supervised fine-tuning turns it into an assistant: `./nanoeuler_cuda s` loads the pretrained base, renders each [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) example with the standard instruction template, and trains with the loss masked to the response tokens only (prompt and padding positions get a target of `-1`, which the cross-entropy kernel turns into zero gradient). The result is saved to `nanoeuler_chat.bin`; `./nanoeuler_cuda c` then wraps each line you type in the same template and samples a reply, stopping at the