Build Your Own Shakespearean LLM

A developer built a character-level language model from scratch using Shakespeare's complete works, training a nanoGPT model on a consumer-grade MacBook Pro in about 15 minutes. The project demonstrates the fundamental steps of LLM creation, from data preparation to training, using PyTorch with Metal Performance Shaders acceleration.

You know about LLMs (Large Language Models), but how are they created? Let's build our own to find out.

By the end of this guide, you'll have trained your very own working LLM from scratch on Shakespeare's complete works (about 1MB of text). The model will learn character-level patterns and generate text that sounds like Shakespeare: not particularly coherent, but with a similar rhythm and style. The whole process takes about 15 minutes.

Our goal isn't to end up with anything genuinely useful. Big AI vendors spend millions of dollars and months of compute time to achieve that. Our goal is to step through the process at a scale that fits on a typical consumer-grade desktop computer. As we go, we'll learn by doing the same basic steps that every LLM creator, from tiny to mammoth, follows.

Note: This guide was written for the author's MacBook Pro M4 Pro with 24 GB RAM. You may need to adjust specific settings to match your own hardware.

Open Terminal and run these commands. First, make sure you have Python 3.10 or later:

    python3 --version

This should print a version string such as `Python 3.11.0`. If you don't have Python 3.10+, install it on Mac via Homebrew (https://brew.sh/):

    brew install python@3.11

Now clone the nanoGPT repository and set up a virtual environment:

    # Clone the repository
    git clone https://github.com/karpathy/nanoGPT.git
    cd nanoGPT

    # Create and activate a virtual environment
    python3 -m venv venv
    source venv/bin/activate

Note: This guide focuses on Apple Silicon execution using Metal Performance Shaders (MPS). For Compute Unified Device Architecture (CUDA), modify this guide to suit your hardware (e.g., swapping `--device=mps` for `--device=cuda`).

Grab PyTorch:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Verify MPS is working before proceeding:

    python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'MPS available: {torch.backends.mps.is_available()}')"

The output should look like this:

    PyTorch version: 2.5.1
    MPS available: True

You should see `MPS available: True`.
If you see `False`, reinstall PyTorch using the command above.

Then install the remaining dependencies:

    pip install numpy transformers datasets tiktoken wandb tqdm

Run the preparation script to fetch and process the dataset we intend to train our model on:

    python data/shakespeare_char/prepare.py

You should see output like this:

    length of dataset in characters: 1,115,394
    all the unique characters:
     !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
    vocab size: 65
    train has 1,003,854 tokens
    val has 111,540 tokens

The prepare script did four things in sequence:

1. Downloaded the Shakespeare text to `data/shakespeare_char/input.txt` (if it wasn't already there).
2. Collected every unique character in the text to form the vocabulary.
3. Created mappings between each character and an integer ID.
4. Split the text, encoding the first 90% into `train.bin` and the remaining 10% into `val.bin`.

The `.bin` format stores raw integers efficiently, so loading during training is fast. The 90/10 train/validation split is intentional: the model learns from the training set and is periodically tested against the validation set (data it has never seen) to check whether it's genuinely learning or just memorising.

The log from the prepare script reveals some interesting details about the dataset. For example, there were 65 unique characters used throughout the source text: upper and lower-case letters, the number three, some punctuation, a space, and a newline. This is particularly relevant to us because we're building a character-level language model: one that picks the next character from among those 65 candidates.

For example, imagine the model starts with the text "fare". Drawing on everything it has learned from the training data, it assigns a probability to each of the 65 available characters being next: a high probability to the character "w", a low probability to "Q" or "3", and some medium probability to other plausible options like space and "s".
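To make the character-level idea concrete, here is a minimal sketch in plain Python. It is not nanoGPT's code: the helper names `build_vocab` and `softmax` are illustrative, the sample string is a stand-in for the training text, and the scores after "fare" are made up by hand where a trained model would compute them.

```python
import math

def build_vocab(text):
    """Map each unique character to an integer ID, and back again."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

text = "farewell, fair cruelty"  # stand-in for the real training text
stoi, itos = build_vocab(text)

# Encoding and decoding are exact inverses of each other.
encoded = [stoi[ch] for ch in text]
decoded = "".join(itos[i] for i in encoded)

# Hypothetical scores for the character that follows "fare".
logits = [0.0] * len(stoi)
logits[stoi["w"]] = 3.0   # "farewell" is very plausible
logits[stoi[" "]] = 1.0   # "fare " is somewhat plausible
probs = softmax(logits)

best = itos[max(range(len(probs)), key=probs.__getitem__)]
print(f"vocab size: {len(stoi)}")
print(f"most likely next character: {best!r}")
```

During generation the model samples from this probability distribution rather than always taking the top character, which is why two runs can produce different text.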
Now we are finally ready to start training our LLM. Run the training script with the settings shown (optimised for a MacBook Pro M4 Pro with 24 GB RAM):

    python train.py config/train_shakespeare_char.py \
      --device=mps \
      --compile=False \
      --eval_iters=20 \
      --log_interval=1 \
      --block_size=64 \
      --batch_size=18 \
      --n_layer=4 \
      --n_head=4 \
      --n_embd=128 \
      --max_iters=2000 \
      --lr_decay_iters=2000 \
      --dropout=0.0

What these flags mean:

- `--device=mps` → Uses the Mac M-series (Apple Silicon) GPU
- `--compile=False` → Required on MPS, which `torch.compile` does not fully support
- `--block_size=64` → Context window (64 characters at a time)
- `--n_layer=4`, `--n_head=4`, `--n_embd=128` → A small but capable 4-layer Transformer
- `--max_iters=2000` → Training steps (about 10 minutes on a Pro-series Apple Silicon computer)
- `--dropout=0.0` → No dropout, since a small network needs less regularization

Training progress updates will rapidly scroll by, showing information similar to:

    iter 0: loss 4.1234, time 0.45s
    iter 100: loss 2.3456, time 0.43s
    iter 200: loss 1.9876, time 0.44s
    ...

A loss value that decreases steadily to below 1.6 means the model is learning. When training finishes, checkpoints will be saved to `out-shakespeare-char/`.

Once training is complete, generate samples from your trained model:

    python sample.py --out_dir=out-shakespeare-char --device=mps --compile=False

You'll get output a bit like this:

    SICINIUS:
    Wall, my lords mean:
    The death a just save.

    KING RICHARD III:
    The cannot still the shall sweet the given arm.

    VIRGILIA:
    Romeo, say, this me, what, he that the gracious,
    And jullow'd me were beace him,
    Nor dear meased unnor so to this first:
    Had, looks, to you. who arms my say modeurly in

    MENENIUS:
    Marry him greaten that be shall cousing;
    Putilance might vilentiw to him in.

    Provost:
    I proud the doubt fore for 'tis answel?
It's not perfect or even very coherent Shakespeare, but after only about 10 minutes of training on a laptop, it's still impressive.

Once you've got Shakespeare working, swap in any public domain author.

Replace the input file:

    # Back up the Shakespeare file
    mv data/shakespeare_char/input.txt data/shakespeare_char/input_shakespeare.txt

    # Copy your author's text
    cp ~/Downloads/on-the-origin-of-species.txt data/shakespeare_char/input.txt

Re-run preparation (this builds a new character vocabulary based on your author):

    python data/shakespeare_char/prepare.py

You'll see a different vocab size: Darwin used a wider range of characters than Shakespeare, for example.

Train again using the same `train.py` command you ran for Shakespeare. Tweaks you may try:

| Setting | Value | Why |
|---|---|---|
| `--device=mps` | Required | Uses GPU, 2-3x faster than CPU |
| `--compile=False` | Required | MPS doesn't fully support `torch.compile` |
| `--batch_size` | 12-24 | Increase to 24 if training is stable (24GB RAM can handle it) |
| `--n_layer` | 4-6 | Start with 4, try 6 if you want better results (slower) |
| `--max_iters` | 2000-5000 | 2000 is quick (~10 min); 5000 gives noticeably better text |

Regenerate the output with `sample.py`. Notice that the output from the model trained on Charles Darwin's On the Origin of Species By Means of Natural Selection is dramatically different from the one trained on Shakespearean text:

    Have groups conothy extinct and character this dunder of the quability and thus common to some caustful modifications, will arctic plaid, and the all genus of the continents of increase are HaGuare and distinct with placed instincts of transports any present continents in the same genus, and this tendency, for the same groups? I doe not as our consequent bur worn, are constantly to marked, if these cases to which it would such two series to the same genus

If something goes wrong, check these common problems.

Training is slow: Make sure `--device=mps` is in your command line. Without it, you're running on the CPU, and training will take several times longer.

Memory errors: Reduce `--batch_size`, for example from 18 to 12 or 8.
24GB RAM is fine, but MPS GPU memory has different limits.

The generated text is gibberish: That's expected. After only a few thousand iterations, you're seeing a "baby GPT" just starting to learn patterns. Run with `--max_iters=5000` or `--max_iters=10000` for better results, keeping `--lr_decay_iters` equal to it as in the training command.

When comparing our 10-minute hobby project to commercial LLM products, it is easy to assume they must be built using fundamentally different technologies. But the basic principles behind both are identical. The massive performance leap achieved by commercial LLMs comes down to three specific scaling vectors: tokenisation, context window, and model size.

Tokenisation: In this guide, our model breaks the text down character by character. It sees the word "the" as three separate atomic parts: the letters `t`, `h`, and `e`. Commercial models use Byte-Pair Encoding (BPE) tokenisation instead, grouping common letter sequences into single tokens and assigning them unique identifiers (e.g., the word "the" becomes token 1169). This vastly increases efficiency and lets the model capture meaning rather than just spelling patterns.

Context window: Our tiny model has a `--block_size=64`, meaning it can look at only 64 characters at a time to predict the next one. Commercial models use vastly larger context windows, ranging from many thousands to over a million tokens. This means they can maintain context across entire codebases or lengthy documentation files rather than just a few sentences.

Model size: Our model uses 4 layers and an embedding size of 128; compare that to a base production model that might use 32 layers and over 4,000 dimensions.
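To get a feel for how far apart those model sizes are, the sketch below applies a common rule of thumb: a standard GPT-style Transformer block holds roughly 12 × `n_embd`² weights (about 4 × `n_embd`² for attention plus 8 × `n_embd`² for an MLP with 4x expansion). The function name `approx_block_params` is illustrative, and the estimate deliberately ignores embeddings, biases, and layer norms, so treat the numbers as rough orders of magnitude rather than exact counts.

```python
def approx_block_params(n_layer, n_embd):
    """Rough weight count for the Transformer blocks only.

    Assumes about 12 * n_embd^2 weights per layer (attention + 4x MLP);
    embeddings, biases, and layer norms are not counted.
    """
    return 12 * n_layer * n_embd ** 2

ours = approx_block_params(n_layer=4, n_embd=128)
large = approx_block_params(n_layer=32, n_embd=4096)

print(f"our model:       ~{ours:,} weights")
print(f"32 layers, 4096: ~{large:,} weights")
print(f"ratio:           ~{large // ours:,}x")
```

Under this approximation, going from 4 to 32 layers multiplies the count by 8, while going from 128 to 4,096 dimensions multiplies it by 1,024, because the embedding size enters as a square.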
| Feature | Our Lab Model (llmlab.dev) | Commercial Base Model |
|---|---|---|
| Token Unit | Single Character | Sub-word Fragments (BPE) |
| Context Window | 64 characters | 32,000+ tokens |
| Layers (`n_layer`) | 4 | 32 to 80+ |
| Embedding Size (`n_embd`) | 128 | 4,096 to 8,192+ |

Once you master creating your own LLMs from scratch, try:

- `data/shakespeare/` (BPE tokeniser) instead of `shakespeare_char/`
- `--n_layer=6 --n_head=6 --n_embd=384` (closer to GPT-2 small)

Photo credit: Brett Sayles