{"slug": "show-hn-nanoeuler-gpt-2-scale-model-in-pure-c-cuda-from-scratch", "title": "Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch", "summary": "A developer released NanoEuler, a GPT-2-scale language model built entirely from scratch in C/CUDA without any machine learning libraries. The project includes a hand-written byte-level BPE tokenizer, pretraining on books and web data, and supervised fine-tuning into a chat model, with a ~116M-parameter version trainable on a single RTX 4070. It is an educational artifact demonstrating the full training pipeline from scratch, not a production-ready assistant.", "body_md": "A GPT-2-class language model built **entirely from scratch in C/CUDA**: no PyTorch, no\nautograd, no ML libraries. The forward **and** backward passes are written and verified by\nhand, and the whole training pipeline lives in this repo: a hand-written **byte-level BPE\ntokenizer**, **pretraining** on a books + web corpus, and **supervised fine-tuning** into a\nchat model (RLHF/DPO planned). It runs on CPU (`libm` + OpenMP) for a small showcase model,\nand a full from-scratch **CUDA engine** (cuBLAS matmuls, a hand-written **FlashAttention**,\nvalidated against a CPU reference by a full-model gradient check) trains a **~116M-parameter**\nmodel on a single RTX 4070.\n\nStatus & honesty. This is a research/educational artifact, built in public. At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is not a capable assistant: the chat model demonstrates that the pretrain→SFT pipeline works end to end, but it is not a useful chatbot. 
The point of the project is the from-scratch engineering and the complete, understandable training pipeline.\n\n```\nmake check              # verify the backward pass (gradient check, double precision)\nmake                    # build the training binary\n./nanoeuler train       # train the small showcase model (~0.76M params)\n./nanoeuler train big   # train the larger model (~10M params; meant for a GPU)\n./nanoeuler chat        # REPL: type a prompt, the model continues it\n```\n\nA residual block computes\n\n```\nx = x + f(x)\n```\n\nRead it as a step of numerical integration. The **forward-Euler method** advances an\nordinary differential equation `dx/dt = f(x)` by\n\n```\nx(t+Δt) = x(t) + Δt · f(x(t))\n```\n\nWith step size `Δt = 1` this is exactly the residual update. So a deep residual\nnetwork is a *discretized ODE*: **depth is integration time**, and each layer\nintegrates the hidden state forward by one Euler step. This is the view behind work\nlike Neural ODEs (a ResNet is the Euler discretization of a continuous flow). The\nproject is named after **Leonhard Euler**, who gave us that integration method.\n\nA sample from the ~116M model after a partial pretraining run on the books + web corpus\n(prompt `Alessandro eat a`):\n\n```\nAlessandro eat a icing textile: the satisfied by the servants in order to keep your weight\n[Using to a heated, collaborated young people that attend the metric process where the rank\nis authorized and to contain the sedentary. Some state lawyers were able to insert ...\n```\n\nThe content is not meaningful, but notice what it learned on its own: real grammar, long clauses, and an encyclopedic register picked up from the web data. This is the expected behaviour of a small model trained on a single GPU: fluent shape, shallow substance. 
More training and (far) more data improve fluency; world knowledge needs a scale this project does not pretend to have.\n\nDecoder-only transformer with the building blocks common to current models:\n\n- **RMSNorm** (pre-norm, no bias)\n- **Rotary position embeddings (RoPE)** applied to queries and keys\n- **SwiGLU** feed-forward: `down(silu(gate(x)) * up(x))`\n- **Grouped-query attention (GQA)**: query heads share a smaller set of key/value heads\n- **Multi-token prediction (MTP)**: `K` output heads predict the next `K` tokens; the auxiliary heads improve the learned representation and enable speculative decoding. Generation uses head 0.\n- **No biases** anywhere.\n- **Byte-level BPE tokenizer**, hand-written, with GPT-2-style pretokenization (a single leading space attaches to the following word, so spaces are not wasted as standalone tokens). Merges are learned on a sample of the corpus; the GPU model uses a 4096-token vocabulary (~3.4 bytes/token on English).\n\nEach block is `x = x + attn(rmsnorm(x))` followed by `x = x + swiglu(rmsnorm(x))`.\nA residual connection `x = x + f(x)` is one step of the forward-Euler method for the\nODE `dx/dt = f(x)`, hence the name, and a nod to Leonhard Euler.\n\nConfigurations:\n\n| where | dim | q/kv heads | layers | context | vocab | params |\n|---|---|---|---|---|---|---|\n| `small` (CPU, `nanoeuler.c`) | 128 | 4 / 2 | 4 | 128 | 512 | ~1.05M |\n| GPU pipeline (`cuda/`, `run_train`) | 768 | 12 / 4 | 16 | 512 | 4096 | ~116M |\n\nThe CPU `small` model trains in a few hours on 12 cores and is a self-contained showcase.\nThe ~116M GPU model is the real pipeline: it pretrains on the books + web mix and is then\nfine-tuned into a chat model (see below). The head size is 64 (`768/12`), which fits the\nFlashAttention kernel.\n\nHand-written back-propagation is easy to get subtly wrong, so every analytic gradient is compared against a central finite difference. 
The check runs in double precision so that cancellation error in the finite difference does not mask correct gradients:\n\n```bash\n$ make check\n  tok      : max rel err 1.02e-04\n  qkvw     : max rel err 7.20e-07\n  gatew    : max rel err 6.86e-08\n  ...\nmax relative error: 1.02e-04\n>>> backward OK (error < 1e-2)\n```\n\nEvery parameter tensor is checked, including the less obvious backward passes of RoPE, SwiGLU, GQA, and MTP.\n\n`make` builds with `-O3 -march=native -ffast-math -fopenmp`. Matrix multiplies and\nattention are parallelized with OpenMP and vectorized; on a 12-core machine the\ntraining loop uses all cores. `make check` builds a separate double-precision binary\nused only for the gradient check.\n\nNo external dependencies. Tested with gcc 13 on Linux.\n\nThis is a from-scratch text generator and a complete, understandable training pipeline,\nnot a product. A model of this size trained on one GPU produces fluent-looking English with\nlittle real knowledge; the fine-tuned chat model answers in assistant *form* but its content\nis shallow. A usable conversational model needs orders of magnitude more parameters, data and\ncompute (a ~135M model only becomes a basic assistant after ~600B training tokens; this repo\ntrains on a far smaller corpus on a single GPU). The goal is to own every piece: every\nparameter, every gradient, the tokenizer, the kernels, the pretraining and the fine-tuning.\n\n`cuda/nanoeuler_cuda.cu` is a full from-scratch CUDA port: forward, backward, training\nand inference on the GPU. Every kernel is validated on the device against a CPU reference,\nand the whole model has a GPU gradient check (GPU grads vs CPU grads to ~1e-6).\n\nKernels: matmul (delegated to **cuBLAS** with TF32 tensor cores), RMSNorm, RoPE,\ngrouped-query attention with a hand-written **FlashAttention** (tiled, online softmax,\nno T×T matrix in memory), SwiGLU, softmax/cross-entropy and AdamW. 
FlashAttention made\nthe training step about 3× faster.\n\nBuild (RTX 40-series = Ada = `sm_89`; the host-compiler flags avoid a gcc ICE on the large file):\n\n```\ncd cuda\nnvcc -O3 -arch=sm_89 -Xcompiler -fno-tree-reassoc,-fno-tree-copy-prop nanoeuler_cuda.cu -o nanoeuler_cuda -lcublas\n```\n\nModes:\n\n```\n./nanoeuler_cuda              # run all kernel self-tests (GPU vs CPU)\n./nanoeuler_cuda g            # full-model gradient check (GPU grads vs CPU)\n./nanoeuler_cuda t            # pretrain from scratch, checkpoint to ../nanoeuler.bin every 5k steps\n./nanoeuler_cuda tr           # resume pretraining from the latest ../nanoeuler.bin checkpoint\n./nanoeuler_cuda i \"It was\"   # autoregressive generation on GPU\n./nanoeuler_cuda s            # supervised fine-tune on Alpaca, save ../nanoeuler_chat.bin\n./nanoeuler_cuda c            # interactive chat with the fine-tuned model\n```\n\nTraining writes a checkpoint every 5000 steps, so a long run can be stopped (Ctrl-C) and resumed with\n`tr`. A model trained on the GPU is saved in the CPU program's format, so `./nanoeuler chat`\ncan also load and run it.\n\nThe chat pipeline is two stages. First **pretrain** the ~116M base on the books + web mix\n(`./nanoeuler_cuda t`, resumable with `tr`). Then **supervised fine-tuning** turns it into an\nassistant: `./nanoeuler_cuda s` loads the pretrained base, renders each\n[Alpaca](https://github.com/tatsu-lab/stanford_alpaca) example with the standard instruction\ntemplate, and trains with the **loss masked to the response tokens only** (prompt and padding\npositions get a target of `-1`, which the cross-entropy kernel turns into zero gradient). 
The\nresult is saved to `nanoeuler_chat.bin`; `./nanoeuler_cuda c` then wraps each line you type in\nthe same template and samples a reply, stopping at the `</s>` end marker.\n\nAfter fine-tuning the model answers in the right *shape*: it follows the instruction→response\nformat, writes complete sentences and stops on its own. The *content*, though, is shallow and\noften wrong: this is a small model trained on a single GPU, so it has little world knowledge to\nexpress. SFT teaches the model *how* to respond, not *what* it knows; that comes from\npretraining and scale. This is a faithful, fully-from-scratch demonstration that the\npretrain→SFT pipeline works end to end, not a capable assistant.\n\nPretraining uses a real **books + web** mix:\n\n- **Books**: `data/get_gutenberg.sh` downloads ~95 public-domain Project Gutenberg classics (Austen, Dickens, Dostoevsky, Tolstoy, Melville, the complete Shakespeare, ...). Each book's Project Gutenberg license header/footer is stripped (only the text between the `*** START ... ***` / `*** END ... ***` markers is kept) so the model trains on prose.\n- **Web**: `data/get_web.sh` pulls a slice of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (high-quality educational web text) straight from the Hugging Face parquet files using the **DuckDB** CLI (a single static binary: no Python, no libraries).\n\nThen concatenate them into the pretraining corpus the trainer reads:\n\n```bash\nsh data/get_gutenberg.sh                       # books  -> data/gutenberg.txt\nsh data/get_web.sh                             # web    -> data/web.txt (~1 GB by default)\ncat data/gutenberg.txt data/web.txt > data/pretrain.txt\nsh data/get_alpaca.sh                          # instruction data for SFT -> data/alpaca.json\n```\n\nCorpora and model checkpoints are git-ignored (regenerable).\n\n- ✅ Hand-written byte-level BPE with GPT-2-style pretokenization.\n- ✅ From-scratch CUDA engine (cuBLAS + FlashAttention), validated by a full-model gradient check.\n- ✅ Pretraining on a books + web mix, with checkpoint/resume.\n- ✅ Supervised fine-tuning (Alpaca) with response-masked loss → a chat model.\n- ⏳ **DPO** (preference optimization): the alignment stage, next to build.\n- ⏳ Scale the model and data (toward ~270M) and publish a trained checkpoint people can try.\n\n```\nnanoeuler.c             CPU model: forward, backward, training, sampling, chat REPL\ncuda/nanoeuler_cuda.cu  GPU engine: BPE, kernels, FlashAttention, pretrain/SFT/infer/chat, gradient check\ndata/get_gutenberg.sh   downloads + cleans the Gutenberg books corpus\ndata/get_web.sh         downloads a FineWeb-Edu web slice via the DuckDB CLI (no Python)\ndata/get_alpaca.sh      downloads the Alpaca instruction data for fine-tuning\nMakefile  LICENSE  shakespeare.txt  .gitignore\n```\n\nMIT. 
See `LICENSE`.", "url": "https://wpnews.pro/news/show-hn-nanoeuler-gpt-2-scale-model-in-pure-c-cuda-from-scratch", "canonical_source": "https://github.com/JustVugg/nanoeuler", "published_at": "2026-06-19 18:18:35+00:00", "updated_at": "2026-06-19 18:37:32.552815+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-research", "developer-tools"], "entities": ["NanoEuler", "GPT-2", "CUDA", "RTX 4070", "Leonhard Euler", "FlashAttention", "cuBLAS", "OpenMP"], "alternates": {"html": "https://wpnews.pro/news/show-hn-nanoeuler-gpt-2-scale-model-in-pure-c-cuda-from-scratch", "markdown": "https://wpnews.pro/news/show-hn-nanoeuler-gpt-2-scale-model-in-pure-c-cuda-from-scratch.md", "text": "https://wpnews.pro/news/show-hn-nanoeuler-gpt-2-scale-model-in-pure-c-cuda-from-scratch.txt", "jsonld": "https://wpnews.pro/news/show-hn-nanoeuler-gpt-2-scale-model-in-pure-c-cuda-from-scratch.jsonld"}}