{"slug": "tiny-gpt-in-go-optimised-for-understanding-trained-on-jules-verne-books", "title": "Tiny GPT in Go. Optimised for Understanding. Trained on Jules Verne Books", "summary": "A developer released a minimal GPT implementation written entirely in Go, trained on Jules Verne novels. The model generates short text fragments like \"Mysterious Island\" and takes about 40 minutes to train on an M3 MacBook Air. The project prioritizes educational clarity over performance, removing batch dimensions and external dependencies to serve as a companion to Karpathy's \"Neural Networks: Zero to Hero\" course.", "body_md": "A simple GPT implementation in pure Go, trained on favourite Jules Verne books.\n\nThe kind of response you can expect from the model:\n\n```\nMysterious Island.\nWell.\nMy days must follow\n```\n\nOr this:\n\n```\nCaptain Nemo, in two hundred thousand feet weary in\nthe existence of the world.\n```\n\nTo train the model:\n\n``` bash\n$ go run .\n```\n\nIt takes about 40 minutes to train on a MacBook Air M3. The trained weights will be saved to the `model-1.234M` file. If you rerun the model, it will pick up the saved weights and continue training. The loss should decrease each time, indicating that the model is learning something useful.\n\nYou can train on your own dataset by pointing the `data.dataset` variable to your text corpus.\n\nTo run in chat-only mode once the training is done:\n\n``` bash\n$ go run . -chat\n```\n\nYou can use this repository as a companion to the [Neural Networks: Zero to Hero](https://karpathy.ai/zero-to-hero.html) course. 
Use `git checkout <tag>` to see how the model has evolved over time: `naive`, `bigram`, `multihead`, `block`, `residual`, `full`.\n\nIn [main_test.go](https://github.com/zakirullin/gpt-go/blob/main/main_test.go) you will find explanations, starting from a basic neuron example:\n\n```\n// Our neuron has 2 inputs and 1 output (number of columns in weight matrix).\n// Its goal is to predict next number in the sequence.\ninput := V{1, 2} // {x1, x2}\nweight := M{\n    {2}, // how much x1 contributes to the output\n    {3}, // how much x2 contributes to the output\n}\n```\n\nAll the way to the self-attention mechanism:\n\n```\n// To calculate the sum of all previous tokens, we can multiply by this triangular matrix:\ntril := M{\n    {1, 0, 0, 0}, // first token attends only at itself (\"cat\"), it can't look into the future\n    {1, 1, 0, 0}, // second token attends at itself and the previous token ( \"cat\" + \", \")\n    {1, 1, 1, 0}, // third token attends at itself and the two previous tokens (\"cat\" + \", \" + \"dog\")\n    {1, 1, 1, 1}, // fourth token attends at itself and all the previous tokens (\"cat\" + \", \" + \"dog\" + \" and\")\n}.Var()\n// So, at this point each embedding is enriched with the information from all the previous tokens.\n// That's the crux of self-attention.\nenrichedEmbeds := MatMul(tril, inputEmbeds)\n```\n\nNo batches.\n\nI've given up the complexity of the batch dimension for the sake of better understanding. It's far easier to build intuition with 2D matrices than with 3D tensors. Besides, batches aren't inherent to the transformer architecture. Gradient accumulation was tried for smoother gradients, but the effect was negligible, so it was removed as well.\n\nRemoved `gonum`.\n\nThe `gonum.matmul` gave us a ~30% performance boost, but it brought an additional dependency. We're not striving for maximum efficiency here, but rather for radical simplicity. 
The current matmul implementation is quite effective, and it's only 40 lines of plain readable code.\n\nYou don't need to read these papers to understand the code :)\n\n[Attention Is All You Need](https://arxiv.org/abs/1706.03762)\n\n[Deep Residual Learning](https://arxiv.org/abs/1512.03385)\n\n[DeepMind WaveNet](https://arxiv.org/abs/1609.03499)\n\n[Batch Normalization](https://arxiv.org/abs/1502.03167)\n\n[Deep NN + huge data = breakthrough performance](https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html)\n\n[OpenAI GPT-3 paper](https://arxiv.org/abs/2005.14165)\n\n[Analyzing the Structure of Attention](https://arxiv.org/abs/1906.04284)\n\nMany thanks to [Andrej Karpathy](https://github.com/karpathy) for his brilliant [Neural Networks: Zero to Hero](https://karpathy.ai/zero-to-hero.html) course.", "url": "https://wpnews.pro/news/tiny-gpt-in-go-optimised-for-understanding-trained-on-jules-verne-books", "canonical_source": "https://github.com/zakirullin/gpt-go", "published_at": "2026-06-02 21:22:25+00:00", "updated_at": "2026-06-02 21:49:08.230312+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "generative-ai", "large-language-models", "natural-language-processing"], "entities": ["Jules Verne", "MacBook Air M3", "Karpathy", "Neural Networks: Zero to Hero", "GPT", "Go"], "alternates": {"html": "https://wpnews.pro/news/tiny-gpt-in-go-optimised-for-understanding-trained-on-jules-verne-books", "markdown": "https://wpnews.pro/news/tiny-gpt-in-go-optimised-for-understanding-trained-on-jules-verne-books.md", "text": "https://wpnews.pro/news/tiny-gpt-in-go-optimised-for-understanding-trained-on-jules-verne-books.txt", "jsonld": "https://wpnews.pro/news/tiny-gpt-in-go-optimised-for-understanding-trained-on-jules-verne-books.jsonld"}}