# Building an LLM from Scratch with PyTorch

> Source: <https://pub.towardsai.net/building-an-llm-from-scratch-with-pytorch-b211a0d6793d?source=rss----98111c9905da---4>
> Published: 2026-06-25 12:31:00+00:00

Who this is for: You know Python and have used PyTorch at least once, or you are at least comfortable reading PyTorch code. The rest you will pick up as we build. Let's go!

There are two ways to understand how a language model works. The first is to read descriptions of it: attention mechanisms, transformer blocks, token embeddings. The second is to build one yourself, watch it fail, fix the failure, and watch it work. The second way produces a completely different quality of understanding.

This article takes the second approach. We are going to build a small but complete language model from the ground up using PyTorch. Not a wrapper around an existing model. Not a fine-tuning tutorial. The actual components, assembled in the actual order, with the actual reasons explained at each step.

By the end, you will have a working model that can generate text. More importantly, you will know why every piece is there, what it is doing, and what breaks if you remove it.

Before anything: make sure these three imports are at the top of your file. Every code block in this article assumes them.

``` python
import torch
import torch.nn as nn
import torch.nn.functional as F
```

Before writing a single class, you need the right mental model. A language model does one thing: given a sequence of words, it predicts what word comes next.

That is it. That is the entire job.

Read the sentence “The cat sat on the”. A language model looks at those five words and assigns a probability to every word in its vocabulary. “Mat” gets a high probability. “Quantum” gets a low one. “Floor” gets a medium one. The model picks the most likely next word, adds it to the sequence, and repeats.
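To make the "probability for every word" idea concrete, here is a toy sketch in plain Python. The three-word vocabulary and the raw scores are invented for illustration; a real model produces one score per vocabulary entry and learns those scores from data.

```python
import math

# Hypothetical raw scores for the next word after "The cat sat on the".
# These numbers are made up for illustration, not produced by a model.
scores = {"mat": 4.0, "floor": 2.5, "quantum": -1.0}


def softmax(values: dict[str, float]) -> dict[str, float]:
    """Turn arbitrary scores into probabilities that sum to 1."""
    largest = max(values.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - largest) for word, score in values.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


probs = softmax(scores)
next_word = max(probs, key=probs.get)  # greedy choice: the most likely word

sequence = ["The", "cat", "sat", "on", "the"]
sequence.append(next_word)  # the extended sequence becomes the next input

print(probs)
print(" ".join(sequence))
```

The loop "score, pick, append, repeat" is all that text generation is; everything else in the article is about producing good scores.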

Everything in this article (the tokenizer, the embeddings, the attention mechanism, the transformer blocks) exists to make that one prediction as accurate as possible.

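Here is the architecture we are building, in order:

1. Tokenizer: text to integer token IDs
2. Embedding layer: token IDs to vectors, with positional information added
3. Transformer blocks, stacked: multi-head attention followed by a feed-forward layer, each wrapped in layer normalisation and a residual connection
4. Final layer normalisation
5. Language model head: vectors to one score per vocabulary token

Each section below builds one step of this pipeline. By the time we reach the bottom, we will assemble them into one complete model.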

Neural networks only understand numbers. They cannot read the word “hello.” They can read the number 8730. A tokenizer is the bridge between human language and machine arithmetic: it converts raw text into a sequence of numbers, and converts sequences of numbers back into text.

Every language model has a tokenizer. The tokenizer decides what the smallest unit of language is. In this article, we use character-level tokenization: each individual character is one token. This keeps the implementation simple and clear without hiding anything important.

The tokenizer builds a vocabulary: a dictionary that maps every character it has seen to a unique integer. The word “hello” becomes five integers, one per character. To convert back, you reverse the dictionary.

``` python
class Tokenizer:

    @staticmethod
    def build_vocab(text: str) -> dict:
        """
        Scan the text and assign a unique integer to every unique character.
        The result is the vocabulary: the complete set of characters this
        tokenizer knows how to handle.

        A special <unk> token is added for characters seen at inference
        time that were not in the training text.
        """
        # sorted() makes the vocabulary deterministic across runs.
        # Without sorting, the same text could produce different integer
        # assignments each time, making saved models incompatible.
        vocab = {
            char: idx
            for idx, char in enumerate(sorted(set(text)))
        }
        vocab["<unk>"] = len(vocab)
        return vocab

    def __init__(self, vocab: dict):
        self.encode_map = {str(k): int(v) for k, v in vocab.items()}
        # The decode map is just the encode map reversed.
        self.decode_map = {v: k for k, v in self.encode_map.items()}

    def encode(self, text: str) -> list[int]:
        """
        Convert a string into a list of integers.
        Unknown characters map to the <unk> token rather than crashing.
        """
        return [
            self.encode_map.get(char, self.encode_map["<unk>"])
            for char in text
        ]

    def decode(self, indices: list[int]) -> str:
        """
        Convert a list of integers back into a string.
        """
        return "".join([
            self.decode_map.get(idx, "<unk>")
            for idx in indices
        ])
```

Word-level tokenization produces shorter sequences, but it needs a much larger vocabulary and breaks on any word it has never seen. Character-level tokenization has a tiny vocabulary (roughly 100 characters cover most English text) and handles any input without crashing. For learning purposes, character level is the right choice: the vocabulary fits in memory, the code is transparent, and nothing is hidden by an external library.

``` python
text = "Hello, PyTorch."
vocab = Tokenizer.build_vocab(text)
tokenizer = Tokenizer(vocab)

encoded = tokenizer.encode("Hello")
print(encoded)   # [3, 7, 9, 9, 10] (deterministic, because the vocab is sorted)

decoded = tokenizer.decode(encoded)
print(decoded)   # Hello
```
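The trade-off between word-level and character-level vocabularies can be seen on a toy example. The sketch below is plain Python and independent of the `Tokenizer` class; the training sentence and the unseen word "team" are made up for illustration.

```python
training_text = "the cat sat on the mat"

# Word-level vocabulary: one entry per distinct word.
word_vocab = {w: i for i, w in enumerate(sorted(set(training_text.split())))}

# Character-level vocabulary: one entry per distinct character.
char_vocab = {c: i for i, c in enumerate(sorted(set(training_text)))}

new_word = "team"  # never appears as a word in the training text

word_known = new_word in word_vocab                    # word-level lookup fails
chars_known = all(c in char_vocab for c in new_word)   # every character is known
char_ids = [char_vocab[c] for c in new_word]

print(f"word vocab size: {len(word_vocab)}, char vocab size: {len(char_vocab)}")
print(f"word-level knows '{new_word}': {word_known}")
print(f"char-level knows '{new_word}': {chars_known}, ids: {char_ids}")
```

On a text this small the character vocabulary is actually the larger one. On a real corpus the relationship flips: the number of distinct words keeps growing, while the number of distinct characters stays near a hundred.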

After tokenization, each token (a character in our case, a word or word piece in larger models) is a single integer. The problem is that a single integer carries no information about meaning. The integer 42 is not “closer” to the integer 43 than it is to the integer 1. But the word “car” should be conceptually close to the word “vehicle.” Single integers cannot represent that relationship.

An embedding layer solves this by replacing each integer with a vector: a list of floating-point numbers, typically between 50 and 1024 values long. These vectors are learned during training: the model adjusts them so that tokens used in similar ways end up with similar vectors. Similarity between vectors can be measured, and that measurement reflects real semantic similarity between words.

The simplest mental model for an embedding layer is a spreadsheet. It has one row per token in the vocabulary, and one column per dimension of the embedding. When the model sees token number 42, it looks up row 42 of the spreadsheet and extracts that entire row as a vector. That vector is what travels through the rest of the model.

The values in that spreadsheet start random and get adjusted during training. That is the entire mechanism.
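The spreadsheet picture maps directly onto `nn.Embedding`. The following sketch uses toy sizes chosen for illustration and checks that a lookup really is "take row i of the weight matrix". The cosine similarity at the end is one common way to measure how close two rows are; on an untrained table the value is meaningless noise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 10, 4   # toy sizes, chosen for illustration
table = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([2, 7, 2])
vectors = table(token_ids)          # shape (3, 4): one row per token id

# The lookup returns exactly row 2 of the weight matrix for token id 2.
same_row = torch.equal(vectors[0], table.weight[2])

# Cosine similarity between the rows for tokens 2 and 7, in [-1, 1].
similarity = F.cosine_similarity(table.weight[2], table.weight[7], dim=0)

print(vectors.shape, same_row, round(similarity.item(), 3))
```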

Here is something subtle that trips up most people the first time: the attention mechanism you will see in the next section has no inherent sense of order. If you fed it “the cat sat” and “sat the cat,” it would see the same tokens and, without extra information, produce the same output. But word order clearly matters.

Positional encoding fixes this. For each position in the sequence (position 0, position 1, position 2…), we create a learnable vector that represents that position. We add this positional vector to the word embedding vector before anything enters the attention mechanism. Now the model sees both the meaning of the word and where it appears in the sequence.

``` python
class Embedding(nn.Module):

    def __init__(self, vocab_size: int, embedding_dim: int, max_positions: int = 1024):
        """
        Two embedding tables live here:
          self.token_embedding - one row per vocabulary token, learns word meaning
          self.position_embedding - one row per position, learns word order
        Both tables have the same number of columns (embedding_dim) so they
        can be added together.

        Args:
            vocab_size (int): Total number of unique tokens in the vocabulary.
            embedding_dim (int): Size of the vector representing each token.
                                 Larger = more expressive, but more expensive.
            max_positions (int): Number of rows in the position table. Must be
                                 at least the model's context size. The default
                                 of 1024 is an assumption made here so existing
                                 two-argument calls keep working.
        """
        super().__init__()
        self.token_embedding    = nn.Embedding(vocab_size, embedding_dim)
        self.position_embedding = nn.Embedding(max_positions, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Token indices of shape (batch_size, sequence_length)
        Returns:
            Combined embedding of shape (batch_size, sequence_length, embedding_dim)
        """
        # Token meaning: look up one row per token id.
        word_vectors = self.token_embedding(x)                  # (B, T, D)

        # Positional order: indices 0..T-1, one per slot in the sequence.
        # These depend only on where a token sits, not on which token it is.
        positions = torch.arange(x.shape[1], device=x.device)
        position_vectors = self.position_embedding(positions)   # (T, D)

        # Broadcasting adds the same position vectors to every sequence in
        # the batch, giving the model both dimensions of information.
        return word_vectors + position_vectors
```
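The claim that attention has no inherent sense of order can be checked numerically. The sketch below uses a deliberately stripped-down, unmasked attention with no learned projections (an assumption made to keep it short) and random stand-ins for the token and position vectors. Without positions, shuffling the input only shuffles the output rows. With positions added, the same shuffle changes the vectors themselves.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 8
tokens = torch.randn(3, dim)      # stand-ins for three token embeddings
positions = torch.randn(3, dim)   # stand-ins for three position embeddings
perm = torch.tensor([1, 2, 0])    # a reordering of the same three tokens


def plain_attention(x: torch.Tensor) -> torch.Tensor:
    """Unmasked self-attention with identity Q/K/V projections."""
    scores = x @ x.T / dim ** 0.5
    return F.softmax(scores, dim=-1) @ x


# Without positions: the shuffled input yields the same vectors, reordered.
out = plain_attention(tokens)
out_shuffled = plain_attention(tokens[perm])
order_blind = torch.allclose(out[perm], out_shuffled, atol=1e-6)

# With positions: each token now lands on a different position vector,
# so the outputs are no longer just a reordering of each other.
out_pos = plain_attention(tokens + positions)
out_pos_shuffled = plain_attention(tokens[perm] + positions)
still_blind = torch.allclose(out_pos[perm], out_pos_shuffled, atol=1e-6)

print(f"order-blind without positions: {order_blind}")
print(f"order-blind with positions:    {still_blind}")
```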

Reading a sentence is not just about knowing what each word means in isolation. It is about understanding which words relate to which other words. In the sentence “The bank by the river was steep,” the word “bank” means something entirely different than it does in “The bank approved my loan.” The model needs to look at the surrounding words to disambiguate.

The attention mechanism is how a language model looks at surrounding words. For each word in the sequence, it asks: “which other words in this sequence are most relevant to understanding this word right now?” It then produces a new representation of that word that blends in information from the relevant other words.

The attention mechanism uses three concepts with names borrowed from information retrieval.

The **Query** represents what a word is looking for. The **Key** represents what each word has to offer. The **Value** is the actual information that gets blended in once relevance is established.

The simplest analogy is a search engine. You type a search query. The engine compares your query against the keys (page titles, metadata) of every page it knows about. The pages with the most relevant keys get their values (actual content) returned to you. Attention does the same thing, but for every word in the sequence simultaneously.

In practice, Query, Key, and Value are all computed by passing the input through three separate linear layers. They start as the same data and get projected into three different “views” of it.

There is one more piece: the mask. During training, the model sees the full sentence at once. But it is being trained to predict the next word, which means it should only be allowed to look at previous words, not future ones. If it could see future words, it would cheat — the answer would be right there in its context.

The mask enforces this by setting the attention score between a word and any future word to negative infinity. After the softmax step, negative infinity becomes zero, which means those positions contribute nothing to the output. Each word attends only to itself and to everything before it.
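Here is the masking step in isolation, as a small sketch. All raw scores are set to zero (an arbitrary choice for readability), so the resulting weights show only the effect of the mask.

```python
import torch
import torch.nn.functional as F

T = 4                                   # toy sequence length
scores = torch.zeros(T, T)              # equal raw scores, for readability
mask = torch.tril(torch.ones(T, T))     # 1 = allowed, 0 = future position

# Future positions become -inf, which softmax turns into exactly 0.
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)

print(masked)
print(weights)
```

Row 0 puts all of its weight on itself, row 1 splits it evenly between positions 0 and 1, and so on. Every row still sums to 1, and everything above the diagonal is zero.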

``` python
class AttentionBlock(nn.Module):

    def __init__(self, embedding_dim: int, head_dim: int, context_size: int):
        """
        One attention head. Computes relevance between every pair of tokens
        in the sequence and uses that relevance to build enriched representations.

        Args:
            embedding_dim (int): Size of the input vectors coming in.
            head_dim (int): Size of the Query, Key, and Value projections.
                            In multi-head attention this is embedding_dim // num_heads.
            context_size (int): Maximum sequence length the model can handle.
                                 Determines the size of the causal mask.
        """
        super().__init__()
        self.query = nn.Linear(embedding_dim, head_dim, bias=False)
        self.key   = nn.Linear(embedding_dim, head_dim, bias=False)
        self.value = nn.Linear(embedding_dim, head_dim, bias=False)

        # The mask is not a learned parameter - it is a fixed triangular matrix.
        # register_buffer makes it part of the module (moves to GPU with the model)
        # without making it a trainable weight.
        ones = torch.ones(context_size, context_size)
        self.register_buffer("mask", torch.tril(ones))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (batch_size, seq_len, embedding_dim)
        Returns:
            Enriched representation of shape (batch_size, seq_len, head_dim)
        """
        B, T, C = x.shape

        Q = self.query(x)   # What each token is looking for
        K = self.key(x)     # What each token has to offer
        V = self.value(x)   # The actual content to blend in

        # Compute relevance scores between every pair of positions.
        # Scaling by head_dim**-0.5 prevents the scores from getting so large
        # that the softmax collapses to one-hot distributions (one dominant
        # token drowning out everything else). The scale uses the size of the
        # Q/K projections (head_dim), not the input size C.
        head_dim = K.shape[-1]
        scores = Q @ K.transpose(-2, -1) * (head_dim ** -0.5)

        # Apply the causal mask: future positions become -inf, then 0 after softmax.
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))

        # Softmax turns the scores into weights that sum to 1.
        weights = F.softmax(scores, dim=-1)

        # Blend the Value vectors according to those weights.
        return weights @ V
```
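The reason for the `head_dim ** -0.5` scaling can be checked with random data. Dot products of `head_dim`-dimensional vectors with unit-variance entries have a variance of about `head_dim`, so the scaling brings them back to a variance of about 1. The sizes below are illustrative, not taken from the article.

```python
import torch

torch.manual_seed(0)
head_dim = 64                       # illustrative projection size
Q = torch.randn(1000, head_dim)     # random stand-ins for queries
K = torch.randn(1000, head_dim)     # random stand-ins for keys

raw = Q @ K.T                       # unscaled dot products
scaled = raw * head_dim ** -0.5     # the scaling used inside attention

raw_var = raw.var().item()
scaled_var = scaled.var().item()
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

Scores with a variance of 64 would push softmax toward near one-hot outputs, which is the collapse the code comment warns about.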

Every other component in a transformer is supporting this one. The embeddings exist to give attention something meaningful to compare. The positional encoding exists to give attention a sense of order. The feed-forward layer that comes next exists to process what attention discovered. Attention is the mechanism that makes language models different from every other neural network that came before, and it is the piece worth spending extra time understanding before moving on.

A single attention head looks for one type of relationship between tokens. But language has many simultaneous relationship types. In “She gave him the book she wrote,” one relationship is between “she” and “wrote” (same person). Another relationship is between “him” and “gave” (recipient). A third is between “book” and “wrote” (what she wrote).

Multi-head attention runs several attention heads in parallel, each with its own Query, Key, and Value weights. Each head specialises in finding different patterns. Their outputs are then concatenated and projected back to the original dimension.

``` python
class MultiHeadAttention(nn.Module):

    def __init__(self, embedding_dim: int, num_heads: int, context_size: int):
        """
        Runs num_heads attention heads in parallel and combines their outputs.

        Each head operates on a slice of the embedding dimension (head_dim),
        so the total computation stays proportional to embedding_dim regardless
        of how many heads are used.

        Args:
            embedding_dim (int): Size of the input and output vectors.
            num_heads (int): Number of parallel attention heads.
                             Must divide evenly into embedding_dim.
            context_size (int): Maximum sequence length.
        """
        super().__init__()
        assert embedding_dim % num_heads == 0, (
            f"embedding_dim ({embedding_dim}) must be divisible by "
            f"num_heads ({num_heads})"
        )
        head_dim = embedding_dim // num_heads

        # Each head is a full AttentionBlock with its own learnable weights.
        # nn.ModuleList registers them properly so PyTorch tracks their parameters.
        self.heads = nn.ModuleList([
            AttentionBlock(embedding_dim, head_dim, context_size)
            for _ in range(num_heads)
        ])

        # After concatenation, project back to embedding_dim.
        # This lets the model mix information across heads.
        self.projection = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Run all heads in parallel, concatenate their outputs, project.

        Args:
            x: Input of shape (batch_size, seq_len, embedding_dim)
        Returns:
            Output of shape (batch_size, seq_len, embedding_dim)
        """
        # Each head produces (batch_size, seq_len, head_dim).
        # Concatenating along the last dimension gives (batch_size, seq_len, embedding_dim).
        concatenated = torch.cat([head(x) for head in self.heads], dim=-1)
        return self.projection(concatenated)
```
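The one-module-per-head version above is the easiest to read. Many implementations instead compute all heads in one batched operation by reshaping tensors. The following is a sketch of that alternative, not the article's method: the class name `FusedMultiHeadAttention` is hypothetical, and it relies on `F.scaled_dot_product_attention`, which assumes PyTorch 2.0 or newer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedMultiHeadAttention(nn.Module):
    """Hypothetical batched variant: one projection produces Q, K, V for all heads."""

    def __init__(self, embedding_dim: int, num_heads: int):
        super().__init__()
        assert embedding_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embedding_dim // num_heads
        self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim, bias=False)
        self.projection = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each (B, T, C)

        # Split the last dimension into heads: (B, T, C) -> (B, heads, T, head_dim)
        q, k, v = (
            t.reshape(B, T, self.num_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )

        # Causal scaled dot-product attention for every head at once.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Merge the heads back: (B, heads, T, head_dim) -> (B, T, C)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.projection(out)


torch.manual_seed(0)
mha = FusedMultiHeadAttention(embedding_dim=32, num_heads=4)
x = torch.randn(2, 5, 32)
y = mha(x)

# Causality check: changing the last token must not affect earlier outputs.
x_changed = x.clone()
x_changed[:, -1] += 1.0
y_changed = mha(x_changed)
causal = torch.allclose(y[:, :-1], y_changed[:, :-1], atol=1e-6)

print(y.shape, causal)
```

The shapes make the bookkeeping visible: 4 heads of size 8 are carved out of the 32-dimensional embedding and merged back into 32 dimensions before the output projection.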

Attention is very good at moving information between positions in the sequence. What it does not do well is process that information once it has been gathered. The feed-forward layer is where that processing happens.

After attention has figured out “what is relevant to what,” the feed-forward layer applies a learnable non-linear transformation to each position independently. This is where the model learns to reason about the relationships that attention surfaced.

The structure is always the same: expand the dimension, apply a non-linearity, compress back down. The expansion (typically to four times the embedding dimension) gives the model a larger space to work in during the reasoning step.

``` python
class FeedForward(nn.Module):

    def __init__(self, embedding_dim: int, ff_dim: int):
        """
        A two-layer network applied to each position independently.

        The expansion to ff_dim (typically 4 * embedding_dim) creates room
        for the model to represent complex transformations before compressing
        back down. ReLU introduces the non-linearity that lets the network
        learn things a pure linear model cannot.

        Args:
            embedding_dim (int): Input and output dimension.
            ff_dim (int): Intermediate dimension. Typically 4 * embedding_dim.
        """
        super().__init__()
        self.expand   = nn.Linear(embedding_dim, ff_dim)
        self.activate = nn.ReLU()
        self.compress = nn.Linear(ff_dim, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.expand(x)
        x = self.activate(x)
        x = self.compress(x)
        return x
```
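"Applied to each position independently" is a property you can verify. The sketch below builds the same expand, ReLU, compress stack with `nn.Sequential` and toy sizes chosen for illustration. It then compares running the whole sequence at once against running one position at a time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim, ff_dim = 8, 32        # toy sizes; ff_dim = 4 * embedding_dim
ffn = nn.Sequential(
    nn.Linear(embedding_dim, ff_dim),    # expand
    nn.ReLU(),                           # non-linearity
    nn.Linear(ff_dim, embedding_dim),    # compress
)

x = torch.randn(2, 5, embedding_dim)     # (batch, seq_len, embedding_dim)

whole = ffn(x)                           # all positions in one call
one_by_one = torch.stack(
    [ffn(x[:, t, :]) for t in range(x.shape[1])], dim=1
)                                        # each position processed separately

position_wise = torch.allclose(whole, one_by_one, atol=1e-6)
num_params = sum(p.numel() for p in ffn.parameters())
print(position_wise, num_params)
```

Because no information crosses positions here, the feed-forward layer depends entirely on attention to have already gathered the relevant context into each position's vector.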

A transformer block is the full reasoning unit of the model: multi-head attention, followed by a feed-forward layer, with two important additions: residual connections and layer normalisation.

**Residual connections** (also called skip connections) add the block’s input back to its output. If the attention or feed-forward layer learns something useful, the residual carries it forward. If it learns something harmful, the original signal is preserved. This is what makes it possible to stack many transformer blocks without training becoming impossible.

**Layer normalisation** stabilises the values flowing through the network. Without it, values can grow or shrink uncontrollably across many layers. Normalisation keeps them in a predictable range, which makes training faster and more stable. The standard practice is to normalise the input before passing it to each sub-layer, so the attention and feed-forward layers always receive well-scaled inputs.

``` python
class TransformerBlock(nn.Module):

    def __init__(
        self,
        embedding_dim: int,
        num_heads: int,
        context_size: int,
        ff_dim: int
    ):
        """
        One complete transformer block: the repeating unit stacked N times
        to form the body of the language model.

        Processing order (Pre-LN variant, most common today):
          1. Normalise → Attention → Add residual
          2. Normalise → FeedForward → Add residual

        Normalising before the sub-layer (Pre-LN) rather than after
        produces more stable gradients during training.

        Args:
            embedding_dim (int): Model's core vector dimension.
            num_heads (int): Number of attention heads.
            context_size (int): Maximum sequence length.
            ff_dim (int): Feed-forward intermediate dimension.
        """
        super().__init__()
        self.norm_1       = nn.LayerNorm(embedding_dim)
        self.attention    = MultiHeadAttention(embedding_dim, num_heads, context_size)
        self.norm_2       = nn.LayerNorm(embedding_dim)
        self.feed_forward = FeedForward(embedding_dim, ff_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 1: Attention with residual
        # Normalise first, run attention, add the result back to the original x.
        x = x + self.attention(self.norm_1(x))

        # Step 2: Feed-forward with residual
        # Normalise first, run feed-forward, add the result back.
        x = x + self.feed_forward(self.norm_2(x))
        return x
```
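Both additions can be observed on their own. In the sketch below, the badly scaled input and the zero-initialised stand-in sub-layer are contrived for illustration. The first half shows `nn.LayerNorm` pulling each token vector to roughly zero mean and unit standard deviation. The second half shows that a sub-layer contributing nothing leaves the input untouched, thanks to the residual.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim = 16
x = torch.randn(2, 5, embedding_dim) * 50 + 10   # badly scaled activations

# Layer normalisation: statistics are computed per token, over the last dimension.
norm = nn.LayerNorm(embedding_dim)
normed = norm(x)
per_token_mean = normed.mean(dim=-1)
per_token_std = normed.std(dim=-1, unbiased=False)

# Residual connection: a sub-layer that outputs zeros changes nothing.
sublayer = nn.Linear(embedding_dim, embedding_dim)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)
out = x + sublayer(norm(x))                      # same pattern as the block above
identity_preserved = torch.equal(out, x)

print(per_token_mean.abs().max().item(), per_token_std.mean().item())
print(identity_preserved)
```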

Every piece is now built. The language model stacks them in order:

The embedding layer converts token indices into vectors with positional information. A sequence of transformer blocks processes those vectors, with each block refining the representation by attending to context and reasoning about it. A final layer normalisation stabilises the output. A linear projection (the “language model head”) maps from the embedding dimension to the vocabulary size, producing one score per token. The token with the highest score is the model’s prediction.

``` python
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests


class LanguageModel(nn.Module):

    @staticmethod
    def load_shakespeare():
        """
        Downloads Tiny Shakespeare (~1MB) and returns the raw text string.
        Nothing is cached: every call downloads again, so call it once
        and pass the result around.
        """
        url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
        response = requests.get(url, timeout=30)  # fail instead of hanging forever
        response.raise_for_status()
        print(f"Dataset loaded: {len(response.text):,} characters")
        return response.text

    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int,
        num_heads: int,
        num_layers: int,
        context_size: int,
        ff_dim: int
    ):
        """
        A complete decoder-only language model.

        Args:
            vocab_size (int): Number of unique tokens. Determines embedding table
                              size and the output dimension of the LM head.
            embedding_dim (int): Core vector dimension throughout the model.
                                 All layers communicate using vectors of this size.
            num_heads (int): Attention heads per transformer block.
                             Must divide evenly into embedding_dim.
            num_layers (int): Number of stacked transformer blocks.
                              More layers = deeper reasoning, more computation.
            context_size (int): Maximum number of tokens the model processes at once.
                                The attention mask is sized by this value.
            ff_dim (int): Feed-forward intermediate dimension in each block.
                          Typically 4 * embedding_dim.
        """
        super().__init__()
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.blocks = nn.Sequential(*[
            TransformerBlock(embedding_dim, num_heads, context_size, ff_dim)
            for _ in range(num_layers)
        ])
        self.final_norm = nn.LayerNorm(embedding_dim)
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

    def forward(
        self,
        x: torch.Tensor,
        targets: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """
        Forward pass.

        Args:
            x: Token indices of shape (batch_size, seq_len)
            targets: Optional ground-truth next tokens for training.
                     If provided, computes and returns cross-entropy loss.
        Returns:
            logits: Raw scores of shape (batch_size, seq_len, vocab_size).
            loss:   Cross-entropy loss if targets provided, otherwise None.
        """
        x = self.embedding(x)
        x = self.blocks(x)
        x = self.final_norm(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            # cross_entropy expects (N, classes) and (N,), so flatten batch and time.
            B, T, V = logits.shape
            loss = F.cross_entropy(
                logits.view(B * T, V),
                targets.view(B * T)
            )
        return logits, loss

    @torch.no_grad()
    def generate(
        self,
        prompt: torch.Tensor,
        max_new_tokens: int,
        context_size: int
    ) -> torch.Tensor:
        """
        Autoregressively generate new tokens from a prompt.

        Args:
            prompt: Starting token indices of shape (1, seq_len)
            max_new_tokens: How many tokens to generate
            context_size: The model's maximum context window
        Returns:
            Full sequence including prompt + generated tokens
        """
        self.eval()
        for _ in range(max_new_tokens):
            context     = prompt[:, -context_size:]   # keep only what fits the window
            logits, _   = self(context)
            last_logits = logits[:, -1, :]            # scores for the next token only
            probs       = F.softmax(last_logits, dim=-1)
            next_token  = torch.multinomial(probs, num_samples=1)   # sample, not argmax
            prompt      = torch.cat([prompt, next_token], dim=1)
        return prompt
```
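Note that `generate` samples the next token with `torch.multinomial` rather than always taking the highest score. The difference is easy to see on a toy distribution. The logits below are hypothetical values for a five-token vocabulary, not output from a trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical logits for the last position over a 5-token vocabulary.
last_logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])
probs = F.softmax(last_logits, dim=-1)

# Greedy decoding: always pick the single most likely token.
greedy_token = torch.argmax(probs, dim=-1, keepdim=True)

# Sampling: draw tokens in proportion to their probability.
sampled_tokens = torch.multinomial(probs, num_samples=1000, replacement=True)
counts = torch.bincount(sampled_tokens.flatten(), minlength=5)

print("probabilities:", [round(p, 3) for p in probs[0].tolist()])
print("greedy pick:  ", greedy_token.item())
print("sample counts:", counts.tolist())
```

Greedy decoding is deterministic and tends to loop on small models. Sampling still picks the top token most often, but it lets less likely tokens through now and then, which is why the generated text varies from run to run.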

Here is a complete, runnable training loop. Each section is commented with what it is doing and why, not just what the code says.

``` python
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests

# ── Configuration ─────────────────────────────────────────────
EMBEDDING_DIM = 128    # Vector size throughout the model
NUM_HEADS     = 4      # Must divide evenly into EMBEDDING_DIM
NUM_LAYERS    = 4      # Number of stacked transformer blocks
CONTEXT_SIZE  = 64     # Maximum tokens the model sees at once
FF_DIM        = 512    # Feed-forward intermediate size (4 × EMBEDDING_DIM)
BATCH_SIZE    = 32     # Sequences per training step
LEARNING_RATE = 3e-4   # Standard starting point for transformers
MAX_STEPS     = 5000   # Total training steps

# ── Data preparation ──────────────────────────────────────────
raw_text = LanguageModel.load_shakespeare()

vocab      = Tokenizer.build_vocab(raw_text)
tokenizer  = Tokenizer(vocab)
vocab_size = len(vocab)

data        = torch.tensor(tokenizer.encode(raw_text), dtype=torch.long)
split_index = int(0.9 * len(data))   # first 90% for training, last 10% for validation
train_data  = data[:split_index]
val_data    = data[split_index:]


def get_batch(split: str) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Sample a random batch of (input, target) pairs from the dataset.

    For each sequence in the batch:
      input  = tokens at positions [i : i + CONTEXT_SIZE]
      target = tokens at positions [i+1 : i + CONTEXT_SIZE + 1]

    The target is the input shifted one position to the right.
    This is how next-token prediction is set up: for every token in the
    input sequence, the target is the very next token.
    """
    source  = train_data if split == "train" else val_data
    indices = torch.randint(len(source) - CONTEXT_SIZE, (BATCH_SIZE,))
    x = torch.stack([source[i     : i + CONTEXT_SIZE    ] for i in indices])
    y = torch.stack([source[i + 1 : i + CONTEXT_SIZE + 1] for i in indices])
    return x, y


# ── Model and optimiser ───────────────────────────────────────
model = LanguageModel(
    vocab_size    = vocab_size,
    embedding_dim = EMBEDDING_DIM,
    num_heads     = NUM_HEADS,
    num_layers    = NUM_LAYERS,
    context_size  = CONTEXT_SIZE,
    ff_dim        = FF_DIM,
)
optimiser = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# ── Training loop ─────────────────────────────────────────────
for step in range(MAX_STEPS):
    model.train()
    x_batch, y_batch = get_batch("train")
    logits, loss     = model(x_batch, targets=y_batch)

    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

    if step % 500 == 0:
        model.eval()
        with torch.no_grad():
            _, val_loss = model(*get_batch("val"))
        print(f"Step {step:5d} | train loss: {loss.item():.4f} | val loss: {val_loss.item():.4f}")

# ── Generation ────────────────────────────────────────────────
model.eval()
prompt_text   = "T"
prompt_tokens = torch.tensor(tokenizer.encode(prompt_text)).unsqueeze(0)

generated = model.generate(
    prompt         = prompt_tokens,
    max_new_tokens = 200,
    context_size   = CONTEXT_SIZE,
)
output_text = tokenizer.decode(generated[0].tolist())
print("\nGenerated text:")
print(output_text)
```
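Two details of the loop are worth isolating in a small sketch: the one-position shift between inputs and targets, and what the loss looks like before the model has learned anything. The ten-token dataset and the vocabulary size of 65 below are illustrative assumptions, not values read from the training run.

```python
import math

import torch
import torch.nn.functional as F

# 1) The input/target shift on a toy "dataset" of ten token IDs.
data = torch.arange(10)
context_size = 4
i = 2                                        # a sampled start index
x = data[i : i + context_size]               # what the model sees
y = data[i + 1 : i + context_size + 1]       # what it should predict at each step

# 2) The loss of a model that knows nothing: identical scores for every token.
vocab_size = 65                              # illustrative vocabulary size
logits = torch.zeros(context_size, vocab_size)
targets = torch.randint(vocab_size, (context_size,))
baseline_loss = F.cross_entropy(logits, targets).item()

print("input: ", x.tolist())
print("target:", y.tolist())
print(f"uniform-guess loss: {baseline_loss:.4f}  (ln(vocab_size) = {math.log(vocab_size):.4f})")
```

A uniform guess costs exactly `ln(vocab_size)`. That is a useful sanity check for the first training step: a freshly initialised model should usually start in that neighbourhood, and the loss should fall from there.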

Let’s be explicit about every piece and what it contributes, now that the full picture is visible.

**The Tokenizer** converts text to integers so the model can process it mathematically, and converts integers back to text so you can read the output. Without it, the model cannot handle language at all.

**The Embedding Layer** converts those integers into vectors that can carry meaning. The token embedding learns what each word means. The positional embedding learns where in the sequence each word appears. Together, they give the model both content and context.

**The Attention Block** figures out which tokens in the sequence are relevant to each other. The causal mask ensures the model only looks backward, not forward, so it cannot cheat during training by peeking at the answer.

**Multi-Head Attention** runs several attention heads in parallel, each specialising in a different type of relationship. This gives the model the ability to track multiple linguistic patterns simultaneously.

**The Feed-Forward Layer** processes the information that attention surfaced. Attention moves information between positions; the feed-forward layer reasons about that information at each position independently.

**The Transformer Block** combines attention, feed-forward, residual connections, and normalisation into one repeating unit. Stacking these blocks is what gives the model its depth of reasoning.

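**The Language Model Head** maps the final representation back into vocabulary space, producing a score for every possible next token. The highest score marks the model’s most likely prediction; during generation, the scores are converted to probabilities and the next token is sampled from them.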

**The Training Loop** teaches the model to make better predictions by repeatedly sampling sequences, computing how wrong the model’s predictions were, and adjusting every parameter in the model slightly toward being more correct.

You have now built every component by hand. You understand what a Query is, why the mask exists, what a residual connection protects against, and why layer normalisation belongs before the sub-layer rather than after. That knowledge is what makes the next version readable.

PyTorch ships with a built-in class called nn.Transformer. It packages the multi-head attention and feed-forward layers into a ready-made block. Using it, you can build the same language model in significantly fewer lines. The trade-off is that the internals are hidden behind the API — which is exactly why this version comes second, not first. Reading it without having built the components yourself produces understanding of the shape of the code, not understanding of the mechanism.

This version uses exactly the same tokenizer, the same embedding layer, the same training loop, and the same generation logic as the from-scratch version. The only thing that changes is the transformer block itself.

nn.Transformer gives you a complete encoder-decoder transformer. For a language model, which is decoder-only, you use only the decoder side. PyTorch also provides nn.TransformerDecoderLayer and nn.TransformerDecoder for exactly this purpose: a single decoder block and a stack of them, respectively.

Internally, nn.TransformerDecoderLayer contains the same pieces you built: multi-head attention, a feed-forward network, layer normalisation, and residual connections. The configuration parameters map directly to the ones you already know.

The one thing nn.Transformer does not build for you is the causal mask. You hand it in explicitly at the forward pass. PyTorch provides a utility function to generate it: nn.Transformer.generate_square_subsequent_mask(size). This produces a float mask with negative infinity above the diagonal and zero everywhere else. It encodes the same causal constraint as the lower-triangular mask your AttentionBlock registered as a buffer, but it is added to the attention scores instead of being applied with masked_fill, and it is generated on demand rather than stored on the module.
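The relationship between the two mask styles can be checked directly. This sketch assumes a recent PyTorch release in which `generate_square_subsequent_mask` can be called as a static method. It rebuilds the additive mask from a `torch.tril` matrix like the one in the from-scratch attention code and compares the two.

```python
import torch
import torch.nn as nn

T = 4  # toy sequence length

# PyTorch's additive mask: 0 where attention is allowed, -inf where it is not.
additive_mask = nn.Transformer.generate_square_subsequent_mask(T)

# The from-scratch style: a lower-triangular matrix of ones, turned into
# the same additive form with masked_fill.
keep = torch.tril(torch.ones(T, T))
from_tril = torch.zeros(T, T).masked_fill(keep == 0, float("-inf"))

masks_match = torch.equal(additive_mask, from_tril)
print(additive_mask)
print(masks_match)
```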

```
# ── The nn.Transformer Version ───────────────────────────────
class TransformerLanguageModel(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int,
        num_heads: int,
        num_layers: int,
        context_size: int,
        ff_dim: int,
        dropout: float = 0.1,
    ):
        """
        A decoder-only language model built with PyTorch's nn.TransformerDecoder.

        Compared to the from-scratch version:
          - The Embedding layer is identical
          - nn.TransformerDecoderLayer replaces your TransformerBlock
          - nn.TransformerDecoder replaces the nn.Sequential stack of blocks
          - The LM head and generation logic are identical
          - The causal mask is generated at forward time, not stored as a buffer

        Args:
            vocab_size (int):    Total number of unique tokens.
            embedding_dim (int): Core vector dimension. Called d_model in PyTorch docs.
            num_heads (int):     Attention heads per layer. Must divide embedding_dim.
            num_layers (int):    Number of stacked decoder layers.
            context_size (int):  Maximum sequence length.
            ff_dim (int):        Feed-forward intermediate dimension (dim_feedforward).
            dropout (float):     Dropout rate applied inside each decoder layer.
                                 0.1 is the standard starting value.
        """
        super().__init__()
        self.context_size = context_size

        # Identical to the from-scratch version
        self.embedding = Embedding(vocab_size, embedding_dim)

        # One decoder layer: the nn.Transformer equivalent of your TransformerBlock.
        # batch_first=True tells PyTorch our tensors are (batch, seq, features)
        # rather than the legacy (seq, batch, features) default.
        decoder_layer = nn.TransformerDecoderLayer(
            d_model         = embedding_dim,
            nhead           = num_heads,
            dim_feedforward = ff_dim,
            dropout         = dropout,
            batch_first     = True,
        )

        # Stack num_layers decoder layers.
        # nn.TransformerDecoder is the equivalent of your nn.Sequential block stack.
        self.transformer = nn.TransformerDecoder(
            decoder_layer = decoder_layer,
            num_layers    = num_layers,
        )

        self.final_norm = nn.LayerNorm(embedding_dim)
        self.lm_head    = nn.Linear(embedding_dim, vocab_size)

    def forward(
        self,
        x: torch.Tensor,
        targets: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """
        Args:
            x:       Token indices of shape (batch_size, seq_len)
            targets: Optional ground-truth next tokens for loss computation

        Returns:
            logits: Shape (batch_size, seq_len, vocab_size)
            loss:   Cross-entropy loss if targets provided, else None
        """
        B, T = x.shape
        x = self.embedding(x)   # (B, T, embedding_dim)

        # Generate the causal mask for the current sequence length.
        # This is the same triangular mask as in your AttentionBlock,
        # expressed as additive negative-infinity values rather than a
        # boolean fill. PyTorch's attention implementation adds this
        # directly to the attention scores before softmax.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            T,
            device=x.device,
        )

        # nn.TransformerDecoder expects both a target sequence and a memory
        # sequence (the encoder's output in a full encoder-decoder model).
        # For a decoder-only language model, we pass the same tensor for both.
        # tgt_mask enforces causality on the self-attention side.
        # memory_mask must do the same on the cross-attention side: without it,
        # every position could read future tokens through memory, and the
        # training loss would look far better than the model really is.
        x = self.transformer(
            tgt         = x,
            memory      = x,
            tgt_mask    = causal_mask,
            memory_mask = causal_mask,
        )

        x      = self.final_norm(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            B, T, V = logits.shape
            loss = F.cross_entropy(
                logits.view(B * T, V),
                targets.view(B * T),
            )

        return logits, loss

    @torch.no_grad()
    def generate(
        self,
        prompt: torch.Tensor,
        max_new_tokens: int,
        context_size: int,
    ) -> torch.Tensor:
        """
        Identical generation logic to the from-scratch version.
        The interface is the same - swap the model, keep the loop.
        """
        self.eval()  # disables dropout; call model.train() before resuming training
        for _ in range(max_new_tokens):
            context     = prompt[:, -context_size:]
            logits, _   = self(context)
            last_logits = logits[:, -1, :]
            probs       = F.softmax(last_logits, dim=-1)
            next_token  = torch.multinomial(probs, num_samples=1)
            prompt      = torch.cat([prompt, next_token], dim=1)
        return prompt
```
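Because memory is fed the same tensor as tgt, it is worth confirming that no position can see later tokens through the cross-attention path. The check below is a minimal sketch and not part of the original article: it builds a tiny nn.TransformerDecoder on random inputs, perturbs only the last position, and tests whether earlier outputs move. The sizes and the helper name `earlier_outputs_change` are illustrative, and PyTorch 2.x is assumed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

D, T = 16, 6  # illustrative embedding size and sequence length
layer = nn.TransformerDecoderLayer(
    d_model=D, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
decoder = nn.TransformerDecoder(layer, num_layers=2).eval()
mask = nn.Transformer.generate_square_subsequent_mask(T)


def earlier_outputs_change(use_memory_mask: bool) -> bool:
    """Perturb the last position and report whether earlier outputs move."""
    x = torch.randn(1, T, D)
    x_perturbed = x.clone()
    x_perturbed[:, -1, :] += 1.0  # change only the final position

    memory_mask = mask if use_memory_mask else None
    with torch.no_grad():
        a = decoder(tgt=x, memory=x, tgt_mask=mask, memory_mask=memory_mask)
        b = decoder(tgt=x_perturbed, memory=x_perturbed,
                    tgt_mask=mask, memory_mask=memory_mask)

    # Compare every position except the one that was changed.
    return not torch.allclose(a[:, :-1], b[:, :-1], atol=1e-6)


leaks_without_mask = earlier_outputs_change(use_memory_mask=False)
leaks_with_mask = earlier_outputs_change(use_memory_mask=True)
print("earlier outputs change without memory_mask:", leaks_without_mask)
print("earlier outputs change with memory_mask:   ", leaks_with_mask)
```

If earlier outputs move when a later token changes, the model can read the answer it is being trained to predict. The same perturbation test works on the full model by comparing logits instead of decoder outputs.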

The training loop and generation code from Part 8 require exactly one change: replace LanguageModel(...) with TransformerLanguageModel(...). Every other line stays identical, because the two classes share the same forward signature and the same generate interface.

```
# Replace this line in the training setup from Part 8:
model = LanguageModel(
    vocab_size    = vocab_size,
    embedding_dim = EMBEDDING_DIM,
    num_heads     = NUM_HEADS,
    num_layers    = NUM_LAYERS,
    context_size  = CONTEXT_SIZE,
    ff_dim        = FF_DIM,
)

# With this:
model = TransformerLanguageModel(
    vocab_size    = vocab_size,
    embedding_dim = EMBEDDING_DIM,
    num_heads     = NUM_HEADS,
    num_layers    = NUM_LAYERS,
    context_size  = CONTEXT_SIZE,
    ff_dim        = FF_DIM,
    dropout       = 0.1,
)

# Everything else in the training loop and generation block is unchanged.
```

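The two versions are close relatives rather than exact twins, and they should reach similar loss values on the same data. nn.TransformerDecoderLayer adds a cross-attention sublayer over memory, applies dropout, and by default normalises after each sublayer (norm_first=False), so parameter counts and loss curves will not match your TransformerBlock exactly. The difference that matters in practice is about control and transparency.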
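To see how closely the two line up, it can help to count parameters. The sketch below uses illustrative sizes, and it uses nn.TransformerEncoderLayer as a stand-in for a block with only self-attention and a feed-forward network. That stand-in is an assumption, not the article's TransformerBlock. The gap between the two counts is the cross-attention sublayer and its LayerNorm.

```python
import torch.nn as nn

D, FF, HEADS = 64, 256, 4  # illustrative sizes, not the article's configuration


def count_parameters(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


decoder_layer = nn.TransformerDecoderLayer(D, HEADS, dim_feedforward=FF, batch_first=True)
# Stand-in for a self-attention + feed-forward block (assumption).
encoder_layer = nn.TransformerEncoderLayer(D, HEADS, dim_feedforward=FF, batch_first=True)

n_dec = count_parameters(decoder_layer)
n_enc = count_parameters(encoder_layer)
extra = n_dec - n_enc

print(f"decoder layer parameters: {n_dec}")
print(f"encoder layer parameters: {n_enc}")
print(f"extra from cross-attention and its norm: {extra}")
```

The extra parameters come from one more multi-head attention module (four D×D projections plus biases) and one more LayerNorm per layer.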

nn.TransformerDecoderLayer is a complete, tested, PyTorch-maintained implementation. In PyTorch 2.x its attention dispatches to the fused F.scaled_dot_product_attention kernel, which is typically faster and more memory-efficient than the manual matrix multiplication in your AttentionBlock, especially on GPU. If you are training something real and do not need to modify the attention internals, the nn.Transformer version is the faster and more practical choice.
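The fused kernel computes the same quantity as hand-written attention. The sketch below compares the two on random tensors. The manual path is a generic causal attention written for illustration, not the article's exact AttentionBlock, and the tensor sizes are arbitrary. It assumes PyTorch 2.0 or later, where F.scaled_dot_product_attention is available.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, Dh = 2, 4, 8, 16  # batch, heads, sequence length, head dimension
q, k, v = (torch.randn(B, H, T, Dh) for _ in range(3))

# Manual attention: scale, mask the future, softmax, weight the values.
scores = q @ k.transpose(-2, -1) / math.sqrt(Dh)
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))
manual = torch.softmax(scores, dim=-1) @ v

# Fused kernel: the same computation in a single call.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

max_diff = (manual - fused).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The results agree up to floating-point rounding. The speed advantage comes from the fused call avoiding the full (T, T) score matrix in memory on supported hardware.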

What it does not give you is the ability to change the attention mechanism itself. If you want to experiment with a modified scoring function, a different way of combining heads, a masking scheme that goes beyond the additive mask you can pass as tgt_mask, or any of the architectural variations that appear in modern research, you need the from-scratch version. The internals of nn.TransformerDecoderLayer are not designed to be overridden piecemeal.

This is the honest trade-off: convenience and speed versus control and transparency. Most application developers want the former. Most researchers need the latter. Knowing how to build both means you can choose rather than being forced.

This model is small by modern standards, and deliberately so. The architecture is real. The training loop is real. The generation logic is real. What is small is the configuration: the embedding dimension, the number of layers, the context size, the dataset.

Scale any of those up and you are on the same path that produced every large language model you have heard of. The architecture is not fundamentally different. The ingredients are identical. What changes is size, data, and compute.

The most productive next experiments, in order of effort:

Train on a larger and more interesting text file. Shakespeare, a Wikipedia dump, or a domain-specific dataset will produce noticeably different generations and teach you a lot about what the model actually learns.

Increase the number of layers and the embedding dimension, then observe how the validation loss changes and how long training takes. This gives you a direct intuition for the compute-versus-quality trade-off that governs all of modern AI.

Add a dropout layer inside the from-scratch transformer block for regularisation (the nn.Transformer version already exposes this through its dropout argument). When the model has more capacity than the training data can justify, dropout reduces overfitting by randomly zeroing activations during training.

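Replace the character-level tokenizer with a subword tokenizer, for example one from the tiktoken library. The vocabulary grows from a few dozen characters to tens of thousands of tokens (50,257 for tiktoken's gpt2 encoding), so the embedding table and LM head get larger, but each token covers more text and the same context window spans far more words. Subword tokenization is what virtually every production language model uses.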
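For the dropout experiment in the list above, here is a minimal sketch of a feed-forward sublayer with dropout added. The class name `FeedForwardWithDropout`, the GELU activation, and the sizes are assumptions for illustration; adapt them to whatever your TransformerBlock actually uses.

```python
import torch
import torch.nn as nn


class FeedForwardWithDropout(nn.Module):
    """Hypothetical feed-forward sublayer with dropout on its output."""

    def __init__(self, embedding_dim: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, ff_dim),
            nn.GELU(),                      # activation is an assumption
            nn.Linear(ff_dim, embedding_dim),
            nn.Dropout(dropout),            # active only in train mode
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


torch.manual_seed(0)
ff = FeedForwardWithDropout(embedding_dim=32, ff_dim=128, dropout=0.1)
x = torch.randn(2, 8, 32)  # (batch, seq_len, embedding_dim)

ff.train()
train_a, train_b = ff(x), ff(x)  # different random masks on each call

ff.eval()
eval_a, eval_b = ff(x), ff(x)    # dropout disabled, so outputs repeat

print("train outputs identical:", torch.equal(train_a, train_b))
print("eval outputs identical: ", torch.equal(eval_a, eval_b))
```

This is why the mode matters: call model.train() before each training phase and model.eval() before validation or generation, otherwise dropout noise ends up in your evaluation numbers.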

Build it. Break it. Fix it. Extend it. That is the only path from reading an article to understanding a technology.

[Building an LLM from Scratch with Pytorch](https://pub.towardsai.net/building-an-llm-from-scratch-with-pytorch-b211a0d6793d) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.
