{"slug": "build-your-own-shakespearean-llm", "title": "Build Your Own Shakespearean LLM", "summary": "A developer built a character-level language model from scratch using a roughly 1 MB sample of Shakespeare's works, training a nanoGPT model on a consumer-grade MacBook Pro in about 15 minutes. The project demonstrates the fundamental steps of LLM creation, from data preparation to training, using PyTorch with Metal Performance Shaders acceleration.", "body_md": "You know about LLMs (Large Language Models), but how are they created? Let's build our own to find out!\n\nBy the end of this guide, you'll have trained your very own working LLM from scratch on a sample of Shakespeare's works (about 1 MB of text). The model will learn character-level patterns and generate text that *sounds like* Shakespeare: not particularly coherent, but with a similar rhythm and style. The whole process takes about 15 minutes.\n\nOur goal isn't to end up with anything genuinely useful. Big AI vendors spend millions of dollars and months of compute time to achieve that. Our goal is to step through the process at a scale that fits on a typical consumer-grade computer. As we go, we'll learn by doing the same basic steps that every LLM creator (from tiny to mammoth) follows.\n\n*Note:* This guide was written for the author's MacBook Pro M4 Pro with 24 GB RAM. 
You may need to adjust specific settings to match your own hardware.\n\nOpen Terminal and run these commands.\n\nFirst, make sure you have Python 3.10 or later:\n\n```\npython3 --version # Python 3.11.0\n```\n\nIf you **don't** have Python 3.10+, install it (on Mac) via [Homebrew](https://brew.sh/):\n\n```\nbrew install python@3.11\n```\n\nNow clone the nanoGPT repository and set up a virtual environment:\n\n```\n# Clone the repository\ngit clone https://github.com/karpathy/nanoGPT.git\ncd nanoGPT\n\n# Create and activate a virtual environment\npython3 -m venv venv\nsource venv/bin/activate\n```\n\n*Note:* This guide focuses on Apple Silicon execution using Metal Performance Shaders (MPS). For NVIDIA GPUs using Compute Unified Device Architecture (CUDA), adapt the commands to suit your hardware (e.g., swapping `--device=mps` for `--device=cuda`).\n\n**Grab PyTorch:**\n\n```\npip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu\n```\n\n**Verify MPS is working** before proceeding:\n\n```\npython -c \"import torch; print(f'PyTorch version: {torch.__version__}'); print(f'MPS available: {torch.backends.mps.is_available()}')\"\n  PyTorch version: 2.5.1\n  MPS available: True\n```\n\nYou should see `MPS available: True`. If you see `False`, reinstall PyTorch using the command above.\n\nThen **install the remaining dependencies**:\n\n```\npip install numpy transformers datasets tiktoken wandb tqdm\n```\n\nRun the preparation script to fetch and process the dataset we intend to train our model on:\n\n```\npython data/shakespeare_char/prepare.py\n```\n\nYou should see output like this:\n\n```\nlength of dataset in characters: 1,115,394\nall the unique characters:\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\nvocab size: 65\ntrain has 1,003,854 tokens\nval has 111,540 tokens\n```\n\nThe `prepare` script did four things in sequence: it downloaded the text, built a vocabulary from the 65 unique characters, mapped each character to an integer, and wrote the encoded data out as a split, with 90% going into `train.bin` and 10% into `val.bin`. 
The `.bin` format stores raw integers efficiently, so loading during training is fast.\n\nThe 90/10 train/validation split is intentional: the model learns from the training set and is periodically tested against the validation set (data it has never seen) to check whether it's genuinely learning or just memorising.\n\nThe log from the `prepare` script reveals some interesting details about the dataset. For example, there were 65 unique characters used throughout the source text: upper and lower-case letters, the number three, some punctuation, a space, and a newline. This is particularly relevant to us because we're building a **character-level language model**: one that predicts which of those 65 characters comes next.\n\nFor example, imagine the model starts with the text \"fare\". Drawing on everything it has learned from the training data, it assigns a probability to each of the 65 available characters being next: it would assign a high probability to the character \"w\", a low probability to \"Q\" or \"3\", and some medium probability to other plausible options like space and \"s\".\n\nNow we are finally ready to start training our LLM!\n\nRun the training script with the settings shown (optimised for a MacBook Pro M4 Pro with 24 GB RAM):\n\n```\n# --device=mps: Uses the Apple Silicon GPU\n# --compile=False: Required as MPS does not fully support torch.compile\npython train.py config/train_shakespeare_char.py \\\n  --device=mps \\\n  --compile=False \\\n  --eval_iters=20 \\\n  --log_interval=1 \\\n  --block_size=64 \\\n  --batch_size=18 \\\n  --n_layer=4 \\\n  --n_head=4 \\\n  --n_embd=128 \\\n  --max_iters=2000 \\\n  --lr_decay_iters=2000 \\\n  --dropout=0.0\n```\n\n**What these flags mean:**\n\n- `--device=mps` → Uses the Mac M-series GPU\n- `--compile=False` → Required for MPS, as `torch.compile` doesn't support it yet\n- `--block_size=64` → Context window (64 characters at a time)\n- `--n_layer=4`, `--n_head=4`, 
`--n_embd=128` → A small but capable 4-layer Transformer\n- `--max_iters=2000` → Training steps (about 10 minutes on a Pro-series Apple Silicon computer)\n- `--dropout=0.0` → Disables dropout (less regularisation for a small network)\n\n**Training progress** updates will rapidly scroll by, showing information similar to:\n\n```\niter 0: loss 4.1234, time 0.45s\niter 100: loss 2.3456, time 0.43s\niter 200: loss 1.9876, time 0.44s\n...\n```\n\nA loss that decreases steadily to below `1.6` means the model is learning. When training finishes, checkpoints will be saved to `out-shakespeare-char/`.\n\nOnce training is complete, generate samples from your trained model:\n\n```\npython sample.py --out_dir=out-shakespeare-char --device=mps --compile=False\n```\n\nYou'll get output a bit like this:\n\n```\nSICINIUS:\nWall, my lords mean:\nThe death a just save.\n\nKING RICHARD III:\nThe cannot still the shall sweet the given arm.\n\nVIRGILIA:\nRomeo, say, this me, what, he that the gracious,\nAnd jullow'd me were beace him,\nNor dear meased unnor so to this first:\nHad, looks, to you. 
who arms my say modeurly in!\n\nMENENIUS:\nMarry him greaten that be shall cousing;\nPutilance might vilentiw to him in.\n\nProvost:\nI proud the doubt fore for 'tis answel?\n```\n\nIt's not perfect (or even very coherent) Shakespeare, but after only about 10 minutes of training on a laptop, it's still impressive.\n\nOnce you've got Shakespeare working, swap in any public domain author:\n\n**Replace the input file**:\n\n```\n# Back up the Shakespeare file\nmv data/shakespeare_char/input.txt data/shakespeare_char/input_shakespeare.txt\n\n# Copy your author's text\ncp ~/Downloads/on-the-origin-of-species.txt data/shakespeare_char/input.txt\n```\n\n**Re-run preparation** (this builds a new character vocabulary based on your author):\n\n```\npython data/shakespeare_char/prepare.py\n```\n\nYou'll see a different vocab size: Darwin used a wider range of characters than Shakespeare, for example.\n\n**Train again** using the same training command as before.\n\nTweaks you may try:\n\n| Setting | Value | Why |\n|---|---|---|\n| `--device=mps` | Required | Uses GPU, 2-3x faster than CPU |\n| `--compile=False` | Required | MPS doesn't support torch.compile |\n| `--batch_size` | 12-24 | Increase to 24 if training is stable (24 GB RAM can handle it) |\n| `--n_layer` | 4-6 | Start with 4, try 6 if you want better results (slower) |\n| `--max_iters` | 2000-5000 | 2000 is quick (~10 min); 5000 gives noticeably better text |\n\n**Regenerate** the output.\n\nNotice that the output from the model trained on Charles Darwin's *On the Origin of Species By Means of Natural Selection* is dramatically different from the output of the one trained on Shakespearean text.\n\n```\nHave groups conothy extinct and character this dunder\nof the quability and thus common to some caustful\nmodifications, will arctic plaid, and the all genus of\nthe continents of increase (are HaGuare and distinct\nwith placed instincts of transports any present\ncontinents in the same genus, and this tendency, for\nthe same groups? 
I doe not as our consequent bur worn,\nare constantly to marked, if these cases to which it\nwould such two series to the same genus\n```\n\n**Training is slow:** Make sure `--device=mps` is in your command line. Without it, you're running on the CPU, and training will take 3-4x longer.\n\n**Memory errors:** Reduce `--batch_size`, for example from 18 to 12 or 8. 24 GB of RAM is fine, but MPS GPU memory has different limits.\n\n**The generated text is gibberish:** That's expected! After only a few thousand iterations, you're seeing a \"baby GPT\" just starting to learn patterns. Run for `--max_iters=5000` or `10000` for better results.\n\nWhen comparing our 10-minute hobby project to commercial LLM products, it is easy to assume they must be built using fundamentally different technologies. But the basic principles behind both are identical.\n\nThe massive performance leap achieved by commercial LLMs comes down to three specific scaling vectors: tokenisation, context window, and model size.\n\nIn this guide, our model breaks the text down character by character. It sees the word \"the\" as three separate atomic parts: the letters `t`, `h`, and `e`. Commercial models use Byte-Pair Encoding (BPE) tokenisation instead, grouping common letter sequences into single tokens and assigning them unique identifiers (e.g., the word \"the\" becomes token `1169`). This vastly increases efficiency and allows the model to capture meaning rather than just spelling patterns.\n\nOur tiny model has a `--block_size=64`, meaning it can look at only 64 characters at a time to predict the next one. Commercial models use vastly larger context windows, ranging from many thousands to over a million tokens. 
This means they can maintain context across entire codebases or lengthy documentation files rather than just a line or so of text.\n\nOur model uses 4 layers and an embedding size of 128; compare that to a base production model that might use 32 layers and over 4,000 dimensions.\n\n| Feature | Our Lab Model (`llmlab.dev`) | Commercial Base Model |\n|---|---|---|\n| Token Unit | Single Character | Sub-word Fragments (BPE) |\n| Context Window | 64 characters | 32,000+ tokens |\n| Layers (`n_layer`) | 4 | 32 to 80+ |\n| Embedding Size (`n_embd`) | 128 | 4,096 to 8,192+ |\n\nOnce you master creating your own LLMs from scratch, try:\n\n- `data/shakespeare/` (BPE tokeniser) instead of `shakespeare_char/`\n- `--n_layer=6 --n_head=6 --n_embd=384` (closer to GPT-2 small)\n\n*Photo credit: Brett Sayles*", "url": "https://wpnews.pro/news/build-your-own-shakespearean-llm", "canonical_source": "https://dev.to/micmath/build-your-own-shakespearean-llm-49oa", "published_at": "2026-06-13 09:54:42+00:00", "updated_at": "2026-06-13 10:17:54.356124+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "generative-ai", "developer-tools"], "entities": ["nanoGPT", "PyTorch", "Metal Performance Shaders", "Shakespeare", "MacBook Pro", "Apple Silicon", "Homebrew", "Karpathy"], "alternates": {"html": "https://wpnews.pro/news/build-your-own-shakespearean-llm", "markdown": "https://wpnews.pro/news/build-your-own-shakespearean-llm.md", "text": "https://wpnews.pro/news/build-your-own-shakespearean-llm.txt", "jsonld": "https://wpnews.pro/news/build-your-own-shakespearean-llm.jsonld"}}