# Build Your Own Shakespearean LLM

> Source: <https://dev.to/micmath/build-your-own-shakespearean-llm-49oa>
> Published: 2026-06-13 09:54:42+00:00

You know about LLMs (Large Language Models), but how are they created? Let's build our own to find out!

By the end of this guide, you'll have trained your very own working LLM from scratch on a roughly 1MB subset of Shakespeare's works. The model will learn character-level patterns and generate text that *sounds like* Shakespeare: not particularly coherent, but with a similar rhythm and style. The whole process takes about 15 minutes.

Our goal isn't to end up with anything genuinely useful. Big AI vendors spend millions of dollars and months of compute time to achieve that. Our goal is to step through the process at a scale that fits on a typical consumer-grade desktop computer. As we go, we'll learn by doing the same basic steps that every LLM creator (from tiny to mammoth) follows.

*Note:* This guide was written for the author's MacBook Pro M4 Pro with 24 GB RAM. You may need to adjust specific settings to match your own hardware.

Open Terminal and run these commands.

First, make sure you have Python 3.10 or later:

```
python3 --version # Python 3.11.0
```

If you **don't** have Python 3.10+, install it (on Mac) via [Homebrew](https://brew.sh/):

```
brew install python@3.11
```

Now clone the nanoGPT repository and set up a virtual environment:

```
# Clone the repository
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
```

*Note:* This guide focuses on Apple Silicon execution using Metal Performance Shaders (MPS). For Compute Unified Device Architecture (CUDA), modify this guide to suit your hardware (e.g., swapping `--device=mps` for `--device=cuda`).

**Grab PyTorch:**

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

**Verify MPS is working** before proceeding:

``` python
python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'MPS available: {torch.backends.mps.is_available()}')"
  PyTorch version: 2.5.1
  MPS available: True
```

You should see `MPS available: True`. If you see `False`, reinstall PyTorch using the command above.
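If you'd rather not hard-code the device, a small helper can choose one at runtime. This is a minimal sketch and not part of nanoGPT: `pick_device` is a hypothetical name, and it falls back to `cpu` when PyTorch is missing or no GPU backend is available.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists in PyTorch 1.12 and later.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(f"--device={pick_device()}")
```

The printed value is the flag you would pass to `train.py` and `sample.py`.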

Then **install the remaining dependencies**:

```
pip install numpy transformers datasets tiktoken wandb tqdm
```

Run the preparation script to fetch and process the dataset we intend to train our model on:

```
python data/shakespeare_char/prepare.py
```

You should see output like this:

```
length of dataset in characters: 1,115,394
all the unique characters:
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
```

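The `prepare` script did four things in sequence:

1. Downloaded the Shakespeare text to `input.txt` (if it wasn't already there).
2. Built a vocabulary from the unique characters in the text.
3. Mapped every character to an integer ID.
4. Split the data, writing 90% into `train.bin` and 10% into `val.bin`. The `.bin` format stores raw integers efficiently, so loading during training is fast.

The 90/10 train/validation split is intentional: the model learns from the training set and is periodically tested against the validation set (data it has never seen) to check whether it's genuinely learning or just memorising.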
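The encoding and splitting steps can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the script itself: it uses a short stand-in string instead of the full dataset, keeps the IDs in a list rather than writing `.bin` files, and the names `stoi`, `itos`, `encode`, and `decode` are chosen here for readability.

```python
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Vocabulary: every unique character, sorted for a stable ordering.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> character


def encode(s: str) -> list[int]:
    return [stoi[ch] for ch in s]


def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)


ids = encode(text)

# First 90% for training, remaining 10% for validation.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]

print(f"vocab size: {len(chars)}")
print(f"train has {len(train_ids)} tokens, val has {len(val_ids)} tokens")
```

Because each character maps to exactly one ID, `decode(encode(text))` returns the original text unchanged.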

The log from the `prepare` script reveals some interesting details about the dataset. For example, there were 65 unique characters used throughout the source text: upper- and lower-case letters, the digit 3, some punctuation, a space, and a newline. This is particularly relevant to us because we're building a **character-level language model**: one that, at each step, predicts which of those 65 characters should come next.

For example, imagine the model starts with the text "fare". Drawing on everything it has learned from the training data, it assigns a probability to each of the 65 available characters being next: it might assign a high probability to "w", a low probability to "Q" or "3", and a medium probability to other plausible options like a space or "s".
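To make that concrete, here is a minimal sketch of how raw scores become probabilities and then a single chosen character. The scores below are invented for illustration and cover only five characters; a real model produces one score for each of the 65 characters.

```python
import math
import random

# Made-up scores (logits) for a few candidate next characters after "fare".
logits = {"w": 4.0, " ": 2.5, "s": 2.0, "Q": -3.0, "3": -4.0}


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {ch: math.exp(v - peak) for ch, v in scores.items()}
    total = sum(exps.values())
    return {ch: e / total for ch, e in exps.items()}


probs = softmax(logits)
for ch, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{ch!r}: {p:.3f}")

# Sample the next character in proportion to its probability.
next_char = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print("fare" + next_char)
```

Sampling, rather than always taking the most likely character, is why repeated runs produce different text.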

Now we are finally ready to start training our LLM!

Run the training script with the settings shown (optimised for a MacBook Pro M4 Pro with 24 GB RAM):

```
# --device=mps: Uses the Apple Silicon GPU
# --compile=False: Required as MPS does not fully support torch.compile
python train.py config/train_shakespeare_char.py \
  --device=mps \
  --compile=False \
  --eval_iters=20 \
  --log_interval=1 \
  --block_size=64 \
  --batch_size=18 \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=128 \
  --max_iters=2000 \
  --lr_decay_iters=2000 \
  --dropout=0.0
```

**What these flags mean:**

- `--device=mps` → Uses the Mac M-series GPU
- `--compile=False` → Required for MPS, as `torch.compile` doesn't support it yet
- `--block_size=64` → Context window (64 characters at a time)
- `--n_layer=4`, `--n_head=4`, `--n_embd=128` → A small but capable 4-layer Transformer
- `--max_iters=2000` → Training steps (about 10 minutes on a Pro-series Apple Silicon computer)
- `--dropout=0.0` → Less regularization for a small network

**Training progress** updates will rapidly scroll by showing information similar to:

```
iter 0: loss 4.1234, time 0.45s
iter 100: loss 2.3456, time 0.43s
iter 200: loss 1.9876, time 0.44s
...
```

A loss value that falls steadily to below `1.6` means the model is learning. When training finishes, checkpoints will be saved to `out-shakespeare-char/`.
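To put those loss numbers in context, it helps to know what a model that has learned nothing would score. The sketch below assumes the reported loss is the average cross-entropy measured with natural logarithms, which is what PyTorch's standard cross-entropy computes; `effective_choices` is a name made up for this example.

```python
import math

vocab_size = 65

# A model guessing uniformly assigns probability 1/65 to the correct character,
# so its cross-entropy loss is -ln(1/65) = ln(65).
baseline_loss = math.log(vocab_size)
print(f"uniform-guess loss: {baseline_loss:.3f}")


def effective_choices(loss: float) -> float:
    """exp(loss) is the perplexity: roughly how many characters the model is torn between."""
    return math.exp(loss)


print(f"loss 1.6 -> about {effective_choices(1.6):.1f} equally likely characters")
```

A starting loss a little above 4 is therefore what random guessing over 65 characters looks like, while a loss of 1.6 means the model has narrowed each prediction to roughly five plausible characters.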

Once training is complete, generate samples from your trained model:

```
python sample.py --out_dir=out-shakespeare-char --device=mps --compile=False
```

You'll get output a bit like this:

```
SICINIUS:
Wall, my lords mean:
The death a just save.

KING RICHARD III:
The cannot still the shall sweet the given arm.

VIRGILIA:
Romeo, say, this me, what, he that the gracious,
And jullow'd me were beace him,
Nor dear meased unnor so to this first:
Had, looks, to you. who arms my say modeurly in!

MENENIUS:
Marry him greaten that be shall cousing;
Putilance might vilentiw to him in.

Provost:
I proud the doubt fore for 'tis answel?
```

It's not perfect (or even very coherent) Shakespeare, but after only about 10 minutes of training on a laptop, it's still impressive.

Once you've got Shakespeare working, swap in any public domain author:

**Replace the input file**:

```
# Backup the Shakespeare file
mv data/shakespeare_char/input.txt data/shakespeare_char/input_shakespeare.txt

# Copy your author's text
cp ~/Downloads/on-the-origin-of-species.txt data/shakespeare_char/input.txt
```

**Re-run preparation** (this builds a new character vocabulary based on your author):

```
python data/shakespeare_char/prepare.py
```

You'll see a different vocab size: Darwin used a wider range of characters than Shakespeare, for example.

**Train again** using the same training command as before.

Tweaks you may try:

| Setting | Value | Why |
|---|---|---|
| `--device=mps` | Required | Uses GPU, 2-3x faster than CPU |
| `--compile=False` | Required | MPS doesn't support torch.compile |
| `--batch_size` | 12-24 | Increase to 24 if training is stable (24GB RAM can handle it) |
| `--n_layer` | 4-6 | Start with 4, try 6 if you want better results (slower) |
| `--max_iters` | 2000-5000 | 2000 is quick (~10 min); 5000 gives noticeably better text |
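One way to reason about `--batch_size` and `--max_iters` together is to count how many characters the model sees in total. This is a back-of-the-envelope sketch: `tokens_per_iter` is a helper written for this example, and it assumes a single device with one gradient accumulation step per iteration.

```python
def tokens_per_iter(batch_size: int, block_size: int, grad_accum_steps: int = 1) -> int:
    """Characters processed in one training iteration."""
    return batch_size * block_size * grad_accum_steps


train_tokens = 1_003_854  # "train has 1,003,854 tokens" from the prepare step

for batch_size, max_iters in [(18, 2000), (24, 5000)]:
    per_iter = tokens_per_iter(batch_size, block_size=64)
    passes = per_iter * max_iters / train_tokens
    print(
        f"batch_size={batch_size}, max_iters={max_iters}: "
        f"{per_iter} tokens/iter, about {passes:.1f} passes over the training data"
    )
```

With the settings used in this guide, that works out to 1,152 characters per iteration and a little over two passes' worth of Shakespeare in 2,000 iterations. Batches are drawn at random, so these are equivalent passes rather than literal read-throughs.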

**Regenerate** the output

Notice that the output from the model trained on Charles Darwin's *On the Origin of Species By Means of Natural Selection* is dramatically different from the output of the model trained on Shakespearean text.

```
Have groups conothy extinct and character this dunder
of the quability and thus common to some caustful
modifications, will arctic plaid, and the all genus of
the continents of increase (are HaGuare and distinct
with placed instincts of transports any present
continents in the same genus, and this tendency, for
the same groups? I doe not as our consequent bur worn,
are constantly to marked, if these cases to which it
would such two series to the same genus
```

**Training is slow** Make sure `--device=mps` is in your command line. Without it, you're running on the CPU, and training will take 3-4x longer.

**Memory errors** Reduce `--batch_size`, for example from 18 to 12 or 8. 24GB of RAM is fine, but MPS GPU memory has different limits.

**The generated text is gibberish** That's expected! After only a few thousand iterations, you're seeing a "baby GPT" just starting to learn patterns. Run with `--max_iters=5000` or `10000` for better results, ideally raising `--lr_decay_iters` to match.

When comparing our 10-minute hobby project to commercial LLM products, it is easy to assume they must be built using fundamentally different technologies. But the basic principles behind both are identical.

The massive performance leap achieved by commercial LLMs comes down to three specific scaling vectors:

In this guide, our model breaks the text down character by character. It sees the word "the" as three separate atomic parts: the letters `t`, `h`, and `e`. Commercial models use Byte-Pair Encoding (BPE) tokenisation instead, grouping common letter sequences into single tokens and assigning them unique identifiers (e.g., the word "the" becomes token `1169`). This vastly increases efficiency and lets the model capture meaning rather than just spelling patterns.
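The core idea behind BPE can be shown with a toy example: repeatedly find the most frequent adjacent pair of tokens and merge it into one. This is a simplified sketch of the merge step only, not the tokeniser any commercial model ships; the function names are made up for this example.

```python
from collections import Counter


def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("the then them")  # start at character level: 13 tokens
print(len(tokens), tokens)

for _ in range(2):  # two merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(len(tokens), tokens)
```

After two merges, "the" has become a single token and the same text needs 7 tokens instead of 13. Real tokenisers learn tens of thousands of such merges from a large corpus.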

Our tiny model uses `--block_size=64`, meaning it can look at only 64 characters at a time to predict the next one. Commercial models use vastly larger context windows, ranging from many thousands to over a million tokens. This means they can maintain context across entire codebases or lengthy documentation files rather than just a few sentences.
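The context window also shapes the training examples. A common approach, sketched here in plain Python with a hypothetical `get_example` helper, is to cut a random window of `block_size` characters and use the same window shifted one step ahead as the targets.

```python
import random

block_size = 8  # tiny window for readability; the guide trains with 64
data = list("To be, or not to be, that is the question")


def get_example(data, block_size, rng=random):
    """Pick a random window x and its targets y (x shifted one step ahead)."""
    start = rng.randrange(len(data) - block_size)
    x = data[start : start + block_size]
    y = data[start + 1 : start + 1 + block_size]
    return x, y


x, y = get_example(data, block_size)
for t in range(block_size):
    context = "".join(x[: t + 1])
    print(f"context {context!r} -> target {y[t]!r}")
```

Each window yields `block_size` prediction tasks, one for every prefix length, and the model never sees further back than the start of the window.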

Our model uses 4 layers and an embedding size of 128; compare that to a base production model that might use 32 layers and over 4,000 dimensions.

| Feature | Our Lab Model (`llmlab.dev`) | Commercial Base Model |
|---|---|---|
| Token Unit | Single Character | Sub-word Fragments (BPE) |
| Context Window | 64 characters | 32,000+ tokens |
| Layers (`n_layer`) | 4 | 32 to 80+ |
| Embedding Size (`n_embd`) | 128 | 4,096 to 8,192+ |
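To get a feel for what those layer and embedding numbers mean in parameters, a common rule of thumb for GPT-2-style Transformers is about 12 × `n_layer` × `n_embd`² weights in the blocks. The sketch below applies it under that assumption; `approx_block_params` is a hypothetical helper, and real commercial architectures differ in their details, so treat the results as orders of magnitude.

```python
def approx_block_params(n_layer: int, n_embd: int) -> int:
    """Rough weight count for the Transformer blocks only.

    Per layer: attention uses about 4 * n_embd^2 weights (query, key, value,
    output projection) and the MLP about 8 * n_embd^2 (two matrices with a
    4x hidden expansion). Embeddings, biases, and layer norms are ignored.
    """
    return 12 * n_layer * n_embd**2


ours = approx_block_params(n_layer=4, n_embd=128)
big = approx_block_params(n_layer=32, n_embd=4096)

print(f"our model:       ~{ours:,} parameters")
print(f"32 x 4096 model: ~{big:,} parameters")
print(f"ratio:           ~{big // ours:,}x")
```

By this estimate our model has under a million parameters, while a 32-layer, 4,096-dimension model has several billion.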

Once you master creating your own LLMs from scratch, try:

- Using `data/shakespeare/` (BPE tokeniser) instead of `shakespeare_char/`
- Scaling up with `--n_layer=6 --n_head=6 --n_embd=384` (closer to GPT-2 small)

*Photo credit: Brett Sayles*