Vincenzo's NanoEuler rebuilds a GPT-2-scale training stack in C and CUDA

Vincenzo (@justvugg), a developer whose GitHub profile gives only his first name, has published NanoEuler, a from-scratch GPT-2-class language-model training project written in C and CUDA, with no PyTorch, no autograd layer, and no model framework in the implementation.

The project surfaced on Sunday, June 28, 2026, when a user posted the repository to Hacker News. The repository itself predates that discussion: the visible GitHub history shows the initial commit on June 15, README work on June 17, and a latest visible commit on June 18. At capture, it had 7 stars, 1 fork, 45 commits, 1 branch, no tags, and an MIT license.

NanoEuler is not a startup launch and not a funded infrastructure company. That is part of why it is worth watching. The repo is a solo, public engineering artifact from a developer trying to understand the model stack by removing the layers that usually hide it. In the README, Vincenzo frames NanoEuler as a way to understand how LLMs are built end to end, not just how to interface with them.

His public footprint is thin in the conventional founder-profile sense. The JustVugg GitHub profile identifies him as Vincenzo, links to @justvugg, lists organizations @snmpware and @llm-use, and shows 116 public repositories, 22 followers, and 23 following. It does not establish a surname, employer, school, location, or company role. What it does establish is a wide trail of small systems projects across AI agents, databases, SNMP tooling, developer infrastructure, and LLM utilities. NanoEuler fits that pattern: less a polished product than a proof that the author can hold the whole stack in his head.

What Vincenzo actually built

According to the NanoEuler README, the repository includes hand-written forward and backward passes, a hand-written byte-level BPE tokenizer, pretraining on a books plus web corpus, and supervised fine-tuning into a chat model. The CPU path uses libm and OpenMP for small showcase runs. The CUDA path uses cuBLAS matrix multiplications, a hand-written FlashAttention kernel, GPU-vs-CPU validation, and full-model gradient checks.
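A gradient check of the kind mentioned above typically compares a hand-written backward pass against a central finite difference of the loss. The sketch below shows the idea on a toy loss; `toy_loss`, `toy_loss_backward`, and `grad_check` are hypothetical names chosen for illustration, and nothing here is taken from the NanoEuler source.

```c
#include <stddef.h>

/* Hypothetical toy loss, for illustration only: L(w) = sum_i (w_i^3 - 2*w_i^2). */
double toy_loss(const double *w, size_t n) {
    double loss = 0.0;
    for (size_t i = 0; i < n; i++) {
        loss += w[i] * w[i] * w[i] - 2.0 * w[i] * w[i];
    }
    return loss;
}

/* Hand-written backward pass: dL/dw_i = 3*w_i^2 - 4*w_i. */
void toy_loss_backward(const double *w, double *grad, size_t n) {
    for (size_t i = 0; i < n; i++) {
        grad[i] = 3.0 * w[i] * w[i] - 4.0 * w[i];
    }
}

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Returns the largest relative error between the analytic gradient and a
 * central-difference estimate. `grad` is caller-provided scratch of length n.
 * Each w[i] is perturbed in place and restored before moving on. */
double grad_check(double *w, double *grad, size_t n, double eps) {
    double worst = 0.0;
    toy_loss_backward(w, grad, n);
    for (size_t i = 0; i < n; i++) {
        double saved = w[i];
        w[i] = saved + eps;
        double loss_plus = toy_loss(w, n);
        w[i] = saved - eps;
        double loss_minus = toy_loss(w, n);
        w[i] = saved;

        double numeric = (loss_plus - loss_minus) / (2.0 * eps);
        double denom = abs_d(numeric) + abs_d(grad[i]) + 1e-12;
        double rel = abs_d(numeric - grad[i]) / denom;
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

Calling `grad_check(w, scratch, n, 1e-5)` on a handful of parameters and requiring a small result is the usual pattern. A full-model check applies the same comparison to every parameter tensor, which is slow but catches sign and indexing mistakes in hand-derived gradients.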

The headline model size is about 116 million parameters, roughly GPT-2-small territory. The README says the CUDA pipeline uses a 16-layer, 768-dimensional decoder-only transformer with 12 query heads, 4 key-value heads, a 512-token context length, and a 4,096-token vocabulary. The architecture includes RMSNorm, RoPE, SwiGLU, grouped-query attention, multi-token prediction heads, no biases, and a byte-level BPE tokenizer with GPT-2-style pretokenization.
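The head counts above imply some simple arithmetic for grouped-query attention. The sketch below works it through under two assumptions that the article does not confirm: that the per-head dimension is the model width divided by the number of query heads, and that consecutive query heads share one key-value head. All identifiers are hypothetical.

```c
/* Figures quoted from the README: 768-dim model, 12 query heads, 4 KV heads. */
enum { D_MODEL = 768, N_Q_HEADS = 12, N_KV_HEADS = 4 };

/* Assumption: head_dim = d_model / n_q_heads, a common convention. */
enum { HEAD_DIM = D_MODEL / N_Q_HEADS };

/* Number of query heads that share each key-value head. */
enum { GROUP_SIZE = N_Q_HEADS / N_KV_HEADS };

/* Assumption: contiguous grouping, so query heads 0..2 read KV head 0,
 * query heads 3..5 read KV head 1, and so on. */
int kv_head_for_query_head(int q_head) {
    return q_head / GROUP_SIZE;
}

/* Output width of the query projection. */
int q_proj_width(void) {
    return N_Q_HEADS * HEAD_DIM;
}

/* Output width of each of the key and value projections. */
int kv_proj_width(void) {
    return N_KV_HEADS * HEAD_DIM;
}
```

Under these assumptions the key and value projections are one third the width of the query projection, which is the main saving grouped-query attention offers: smaller K and V weights and a smaller KV cache at inference time.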

The lower-level implementation details are the point. NanoEuler delegates matrix multiplication to cuBLAS but implements the training engine around it: RMSNorm, RoPE, grouped-query attention, FlashAttention, SwiGLU, softmax/cross-entropy, AdamW, checkpointing, inference, and chat.

For data, the repo targets a books-plus-web pretraining mix and adds supervised instruction-tuning for the chat path.

The honesty is the product signal

The strongest part of NanoEuler is not a benchmark claim. It is the README's refusal to oversell the model.

Vincenzo describes the roughly 116 million-parameter model as a text generator "in the spirit of GPT-2-small" with no real-world knowledge, and says it is "not a capable assistant." The fine-tuned chat path is framed as evidence that pretraining followed by supervised fine-tuning works end to end, not as a useful chatbot. That distinction matters because the public market for AI tooling still rewards demo-shaped claims that skip over data scale, evals, and distribution.

NanoEuler does not skip the limitation. A sample in the README shows grammar and long clauses, but not reliable meaning. The stated scope is educational and research-oriented: own every piece of the system, from parameters and gradients to tokenizer, kernels, pretraining, and fine-tuning.

That also explains the name. The README ties residual blocks to the forward-Euler method for numerical integration: a residual update of the form x = x + f(x) can be read as one discretized step of an ordinary differential equation. It is a small naming choice, but it reveals the project as a learning path through the math and the machine code rather than a wrapper around an API.
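The correspondence can be made concrete. A forward-Euler step for dx/dt = f(x) is x <- x + h * f(x), and a step size of h = 1 gives exactly the residual form x = x + f(x). The sketch below uses a hypothetical stand-in branch f(x) = -x in place of a real attention or MLP block; the names are illustrative and not from the repository.

```c
#include <stddef.h>

/* Hypothetical stand-in for a residual branch: f(x) = -x, element-wise.
 * In a transformer this would be an attention or MLP sub-block. */
void branch(const double *x, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = -x[i];
    }
}

/* One forward-Euler step: x <- x + h * f(x).
 * `scratch` is caller-provided storage of length n for f(x).
 * With h = 1.0 this is the plain residual update x = x + f(x). */
void euler_residual_step(double *x, double *scratch, size_t n, double h) {
    branch(x, scratch, n);
    for (size_t i = 0; i < n; i++) {
        x[i] += h * scratch[i];
    }
}
```

Stacking several such steps resembles integrating the differential equation over a longer interval, which is the reading of a deep residual network that the project's name points to.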

The shadow of llm.c

NanoEuler sits in the same cultural lane as Andrej Karpathy's llm.c, which describes itself as LLM training in simple raw C/CUDA and focuses on reproducing GPT-2 and GPT-3-style training without leaning on the bulk of PyTorch or CPython. Karpathy's nanoGPT is the adjacent PyTorch reference point: a compact training and fine-tuning repo that its own README now calls old and deprecated in favor of newer work.

The comparison is useful, but the two projects are not interchangeable. llm.c is the known reference project with a large community and a sharper benchmarking posture. NanoEuler is smaller, newer, and less externally validated. Its claim to attention is not that it beats the incumbent low-level stacks. It is that an individual developer can now reproduce enough of the training pipeline, tokenizer, CUDA kernels, gradient checks, and SFT loop to make the system inspectable.

That distinction also separates NanoEuler from llama.cpp, whose center of gravity is local inference and broad model support in C/C++, and from tinygrad, which is a broader deep-learning stack with autograd, compiler work, JIT, graph execution, neural-network layers, optimizers, and datasets. NanoEuler is narrower: one repository, one GPT-style training stack, and a deliberate decision to write the backward pass by hand.

No financing, no product, no shortcut

There is no disclosed financing behind NanoEuler, no company entity, no pricing page, no customer list, no cloud service, and no claim of commercial adoption. That absence should not be read as a weakness. It is the boundary of the story.

Low-level AI infrastructure is attracting venture money elsewhere, especially around inference engines, alternative accelerators, cloud capacity, and developer tooling. NanoEuler is not competing in that market today. It is competing for something earlier in the pipeline: credibility among developers who believe the next layer of AI infrastructure cannot be built only by prompting black boxes and importing Python libraries.

That is also why the repo's roadmap is revealing. Vincenzo lists DPO and preference optimization, scaling toward about 270 million parameters, and publishing a trained checkpoint as next steps. Those are not product milestones in the SaaS sense. They are the obvious next rungs in an engineer's attempt to recreate the modern LLM lifecycle at a size that still fits in a single-person project.

RuntimeWire has been tracking a similar pattern in small developer tools: Daniel Lyons' Treedocs 0.2.0 made repo structure explicit for humans and coding agents, while AISlop turned AI-generated code smells into CI checks. NanoEuler is further down the stack. Instead of policing AI output or documenting codebases for agents, Vincenzo is rebuilding the training machinery those agents depend on.

The open question is not whether NanoEuler is a useful assistant. Its own author says it is not. The question is whether this kind of from-scratch work becomes a hiring signal, a research notebook, or the seed of a more durable infrastructure project. On the verified facts available today, NanoEuler is the first two: a compact public record of a developer learning the LLM stack by writing the hard parts himself.

source & further reading

Original article: runtimewire.com
