Building an LLM from Scratch with PyTorch

This tutorial walks through building a small but complete language model from scratch in PyTorch, using character-level tokenization and explaining the purpose of each component. It targets developers who are familiar with PyTorch and aims to give them a deeper understanding of language model internals by having them build and debug the model themselves.

Who this is for: You have used PyTorch at least once, or you are comfortable reading it. You know Python and have read PyTorch code before. The rest you will pick up as we build. Let's go.

There are two ways to understand how a language model works. The first is to read descriptions of it: attention mechanisms, transformer blocks, token embeddings. The second is to build one yourself, watch it fail, fix the failure, and watch it work. The second way produces a completely different quality of understanding.

This article takes the second approach. We are going to build a small but complete language model from the ground up using PyTorch. Not a wrapper around an existing model. Not a fine-tuning tutorial. The actual components, assembled in the actual order, with the actual reasons explained at each step.

By the end, you will have a working model that can generate text. More importantly, you will know why every piece is there, what it is doing, and what breaks if you remove it.

Before anything: make sure these three imports are at the top of your file. Every code block in this article assumes them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
```

Before writing a single class, you need the right mental model. A language model does one thing: given a sequence of words, it predicts what word comes next. That is it. That is the entire job.

Read the sentence “The cat sat on the”. A language model looks at those five words and assigns a probability to every word in its vocabulary. “Mat” gets a high probability. “Quantum” gets a low one. “Floor” gets a medium one. The model picks a likely next word, adds it to the sequence, and repeats.

Everything in this article (the tokenizer, the embeddings, the attention mechanism, the transformer blocks) exists to make that one prediction as accurate as possible.

Here is the architecture we are building, in order: Each section below builds one row of this diagram.
By the time we reach the bottom, we will assemble them into one complete model.

Neural networks only understand numbers. They cannot read the word “hello.” They can read the number 8730. A tokenizer is the bridge between human language and machine arithmetic: it converts raw text into a sequence of numbers, and converts sequences of numbers back into text.

Every language model has a tokenizer. The tokenizer decides what the smallest unit of language is. In this article, we use character-level tokenization: each individual character is one token. This keeps the implementation simple and clear without hiding anything important.

The tokenizer builds a vocabulary: a dictionary that maps every character it has seen to a unique integer. The word “hello” becomes five integers, one per character. To convert back, you reverse the dictionary.

```python
class Tokenizer:
    @staticmethod
    def build_vocab(text: str) -> dict:
        """
        Scan the text and assign a unique integer to every unique character.
        The result is the vocabulary: the complete set of characters this
        tokenizer knows how to handle. A special <unk> token is added for
        characters seen at inference time that were not in the training text.
        """
        # sorted() makes the vocabulary deterministic across runs. Without
        # sorting, the same text could produce different integer assignments
        # each time, making saved models incompatible.
        vocab = {char: idx for idx, char in enumerate(sorted(set(text)))}
        vocab["<unk>"] = len(vocab)
        return vocab

    def __init__(self, vocab: dict):
        self.encode_map = {str(k): int(v) for k, v in vocab.items()}
        # The decode map is just the encode map reversed.
        self.decode_map = {v: k for k, v in self.encode_map.items()}

    def encode(self, text: str) -> list[int]:
        """
        Convert a string into a list of integers.
        Unknown characters map to the <unk> token rather than crashing.
        """
        return [self.encode_map.get(char, self.encode_map["<unk>"]) for char in text]

    def decode(self, indices: list[int]) -> str:
        """
        Convert a list of integers back into a string.
        """
        return "".join(self.decode_map.get(idx, "<unk>") for idx in indices)
```

Word-level tokenization is easier to read but needs a very large vocabulary and breaks on any word it has never seen. Character-level tokenization has a tiny vocabulary (roughly 100 characters cover most English text) and handles any input without crashing. For learning purposes, character level is the right choice: the vocabulary fits in memory, the code is transparent, and nothing is hidden by an external library.

```python
text = "Hello, PyTorch."
vocab = Tokenizer.build_vocab(text)
tokenizer = Tokenizer(vocab)

encoded = tokenizer.encode("Hello")
print(encoded)  # [3, 7, 9, 9, 10] for this text; the values depend on the vocabulary

decoded = tokenizer.decode(encoded)
print(decoded)  # Hello
```

After tokenization, each token is a single integer. The problem is that a single integer carries no information about meaning. The integer 42 is not “closer” to the integer 43 than it is to the integer 1. But the word “car” should be conceptually close to the word “vehicle.” Single integers cannot represent that relationship.

An embedding layer solves this by replacing each integer with a vector: a list of floating-point numbers, typically between 50 and 1024 values long. These vectors are learnable: during training, the model adjusts them so that tokens used in similar ways end up with similar vectors. Similarity between vectors can be measured, and in a trained model that measurement reflects real semantic similarity.

The simplest mental model for an embedding layer is a spreadsheet. It has one row per token in the vocabulary, and one column per dimension of the embedding. When the model sees token number 42, it looks up row 42 of the spreadsheet and extracts that entire row as a vector. That vector is what travels through the rest of the model. The values in that spreadsheet start random and get adjusted during training. That is the entire mechanism.
Here is something subtle that trips up most people the first time: the attention mechanism you will see in the next section has no inherent sense of order. If you fed it “the cat sat” and “sat the cat,” it would see the same set of tokens and, without extra information, could not tell the two orderings apart. But word order clearly matters.

Positional encoding fixes this. For each position in the sequence (position 0, position 1, position 2…), we create a learnable vector that represents that position. We add this positional vector to the word embedding vector before anything enters the attention mechanism. Now the model sees both the meaning of the word and where it appears in the sequence.

```python
class Embedding(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int, max_positions: int = 1024):
        """
        Two embedding tables live here:
            self.token_embedding    - one row per vocabulary token, learns word meaning
            self.position_embedding - one row per position, learns word order
        Both tables have the same number of columns (embedding_dim) so they
        can be added together.

        Args:
            vocab_size (int): Total number of unique tokens in the vocabulary.
            embedding_dim (int): Size of the vector representing each token.
                Larger = more expressive, but more expensive.
            max_positions (int): Longest sequence the position table can index.
                Must be at least the model's context size. The default of 1024
                is a choice made here so that two-argument calls keep working.
        """
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.position_embedding = nn.Embedding(max_positions, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Token indices of shape (batch_size, sequence_length)
        Returns:
            Combined embedding of shape (batch_size, sequence_length, embedding_dim)
        """
        # Token meaning
        word_vectors = self.token_embedding(x)
        # Positional order: indices 0 .. sequence_length - 1, not the token ids.
        # Indexing the position table with token ids would encode no order at all.
        positions = torch.arange(x.shape[1], device=x.device)
        position_vectors = self.position_embedding(positions)
        # Adding them gives the model both dimensions of information.
        # (sequence_length, embedding_dim) broadcasts across the batch.
        return word_vectors + position_vectors
```

Reading a sentence is not just about knowing what each word means in isolation. It is about understanding which words relate to which other words.
In the sentence “The bank by the river was steep,” the word “bank” means something entirely different than it does in “The bank approved my loan.” The model needs to look at the surrounding words to disambiguate.

The attention mechanism is how a language model looks at surrounding words. For each word in the sequence, it asks: “which other words in this sequence are most relevant to understanding this word right now?” It then produces a new representation of that word that blends in information from the relevant other words.

The attention mechanism uses three concepts with names borrowed from information retrieval. The Query represents what a word is looking for. The Key represents what each word has to offer. The Value is the actual information that gets blended in once relevance is established.

The simplest analogy is a search engine. You type a search query. The engine compares your query against the keys (page titles, metadata) of every page it knows about. The pages with the most relevant keys get their values (actual content) returned to you. Attention does the same thing, but for every word in the sequence simultaneously.

In practice, Query, Key, and Value are all computed by passing the input through three separate linear layers. They start as the same data and get projected into three different “views” of it.

There is one more piece: the mask. During training, the model sees the full sentence at once. But it is being trained to predict the next word, which means it should only be allowed to look at previous words, not future ones. If it could see future words, it would cheat, because the answer would be right there in its context.

The mask enforces this by setting the attention score between a word and any future word to negative infinity. After the softmax step, negative infinity becomes zero, which means those positions contribute nothing to the output. Each word attends only to itself and to everything before it.
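The masking step can be seen in isolation on a tiny example. This is an illustrative sketch, not the model's attention layer: the sequence length `T = 4` is arbitrary, and the scores are set to zero (every pair equally relevant) so that the effect of the mask is the only thing visible in the output.

```python
import torch
import torch.nn.functional as F

T = 4  # hypothetical sequence length, chosen for illustration

# Pretend relevance scores: every pair of positions is equally relevant.
scores = torch.zeros(T, T)

# Lower-triangular matrix of ones: row i has ones in columns 0..i.
mask = torch.tril(torch.ones(T, T))

# Future positions (where the mask is 0) are set to -inf before the softmax.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))

# Softmax over each row: exp(-inf) is 0, so future positions get zero weight.
weights = F.softmax(masked_scores, dim=-1)
print(weights)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000, 0.0000],
#         [0.3333, 0.3333, 0.3333, 0.0000],
#         [0.2500, 0.2500, 0.2500, 0.2500]])
```

Row i is the attention distribution for position i: it spreads its weight over positions 0 to i and assigns exactly zero to everything after it, while each row still sums to 1.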
```python
class AttentionBlock(nn.Module):
    def __init__(self, embedding_dim: int, head_dim: int, context_size: int):
        """
        One attention head. Computes relevance between every pair of tokens
        in the sequence and uses that relevance to build enriched representations.

        Args:
            embedding_dim (int): Size of the input vectors coming in.
            head_dim (int): Size of the Query, Key, and Value projections.
                In multi-head attention this is embedding_dim // num_heads.
            context_size (int): Maximum sequence length the model can handle.
                Determines the size of the causal mask.
        """
        super().__init__()
        self.query = nn.Linear(embedding_dim, head_dim, bias=False)
        self.key = nn.Linear(embedding_dim, head_dim, bias=False)
        self.value = nn.Linear(embedding_dim, head_dim, bias=False)
        # The mask is not a learned parameter - it is a fixed triangular matrix.
        # register_buffer makes it part of the module (it moves to GPU with the
        # model) without making it a trainable weight.
        ones = torch.ones(context_size, context_size)
        self.register_buffer("mask", torch.tril(ones))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (batch_size, seq_len, embedding_dim)
        Returns:
            Enriched representation of shape (batch_size, seq_len, head_dim)
        """
        B, T, C = x.shape
        Q = self.query(x)  # What each token is looking for
        K = self.key(x)    # What each token has to offer
        V = self.value(x)  # The actual content to blend in

        # Compute relevance scores between every pair of positions.
        # Scaling by head_dim ** -0.5 prevents the scores from getting so large
        # that the softmax collapses to one-hot distributions (one dominant
        # token drowning out everything else). Note that the scale uses the
        # size of the projected keys (head_dim), not the input size C.
        head_dim = K.shape[-1]
        scores = (Q @ K.transpose(-2, -1)) * head_dim ** -0.5

        # Apply the causal mask: future positions become -inf, then 0 after softmax.
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))

        # Softmax turns the scores into weights that sum to 1.
        weights = F.softmax(scores, dim=-1)

        # Blend the Value vectors according to those weights.
        return weights @ V
```

Every other component in a transformer is supporting this one.
The embeddings exist to give attention something meaningful to compare. The positional encoding exists to give attention a sense of order. The feed-forward layer that comes next exists to process what attention discovered. Attention is the mechanism that most distinguishes transformer language models from the networks that came before, and it is the piece worth spending extra time understanding before moving on.

A single attention head looks for one type of relationship between tokens. But language has many simultaneous relationship types. In “She gave him the book she wrote,” one relationship is between “she” and “wrote” (same person). Another relationship is between “him” and “gave” (recipient). A third is between “book” and “wrote” (what she wrote).

Multi-head attention runs several attention heads in parallel, each with its own Query, Key, and Value weights. Each head can specialise in finding different patterns. Their outputs are then concatenated and projected back to the original dimension.

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, embedding_dim: int, num_heads: int, context_size: int):
        """
        Runs num_heads attention heads and combines their outputs.
        Each head projects the input down to head_dim = embedding_dim // num_heads,
        so the total computation stays proportional to embedding_dim regardless
        of how many heads are used.

        Args:
            embedding_dim (int): Size of the input and output vectors.
            num_heads (int): Number of parallel attention heads.
                Must divide evenly into embedding_dim.
            context_size (int): Maximum sequence length.
        """
        super().__init__()
        assert embedding_dim % num_heads == 0, (
            f"embedding_dim ({embedding_dim}) must be divisible by "
            f"num_heads ({num_heads})"
        )
        head_dim = embedding_dim // num_heads
        # Each head is a full AttentionBlock with its own learnable weights.
        # nn.ModuleList registers them properly so PyTorch tracks their parameters.
        self.heads = nn.ModuleList([
            AttentionBlock(embedding_dim, head_dim, context_size)
            for _ in range(num_heads)
        ])
        # After concatenation, project back to embedding_dim.
        # This lets the model mix information across heads.
        self.projection = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Run every head on the same input, concatenate their outputs, project.

        Args:
            x: Input of shape (batch_size, seq_len, embedding_dim)
        Returns:
            Output of shape (batch_size, seq_len, embedding_dim)
        """
        # Each head produces (batch_size, seq_len, head_dim).
        # Concatenating along the last dimension gives
        # (batch_size, seq_len, embedding_dim).
        concatenated = torch.cat([head(x) for head in self.heads], dim=-1)
        return self.projection(concatenated)
```

Attention is very good at moving information between positions in the sequence. What it does not do well is process that information once it has been gathered. The feed-forward layer is where that processing happens.

After attention has figured out “what is relevant to what,” the feed-forward layer applies a learnable non-linear transformation to each position independently. This is where the model learns to reason about the relationships that attention surfaced.

The structure is always the same: expand the dimension, apply a non-linearity, compress back down. The expansion (typically to four times the embedding dimension) gives the model a larger space to work in during the reasoning step.

```python
class FeedForward(nn.Module):
    def __init__(self, embedding_dim: int, ff_dim: int):
        """
        A two-layer network applied to each position independently.
        The expansion to ff_dim (typically 4 × embedding_dim) creates room for
        the model to represent complex transformations before compressing back
        down. ReLU introduces the non-linearity that lets the network learn
        things a pure linear model cannot.

        Args:
            embedding_dim (int): Input and output dimension.
            ff_dim (int): Intermediate dimension. Typically 4 × embedding_dim.
        """
        super().__init__()
        self.expand = nn.Linear(embedding_dim, ff_dim)
        self.activate = nn.ReLU()
        self.compress = nn.Linear(ff_dim, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.expand(x)
        x = self.activate(x)
        x = self.compress(x)
        return x
```

A transformer block is the full reasoning unit of the model: multi-head attention, followed by a feed-forward layer, with two important additions: residual connections and layer normalisation.

Residual connections (also called skip connections) add the block’s input back to its output. If the attention or feed-forward layer learns something useful, the residual carries it forward. If it learns something harmful, the original signal is preserved. This is what makes it possible to stack many transformer blocks without training becoming impossible.

Layer normalisation stabilises the values flowing through the network. Without it, values can grow or shrink uncontrollably across many layers. Normalisation keeps them in a predictable range, which makes training faster and more stable. The common practice today is to normalise the input before passing it to each sub-layer, so the attention and feed-forward layers always receive well-scaled inputs.

```python
class TransformerBlock(nn.Module):
    def __init__(self, embedding_dim: int, num_heads: int, context_size: int, ff_dim: int):
        """
        One complete transformer block: the repeating unit stacked N times
        to form the body of the language model.

        Processing order (Pre-LN variant, most common today):
            1. Normalise → Attention → Add residual
            2. Normalise → FeedForward → Add residual
        Normalising before the sub-layer (Pre-LN) rather than after produces
        more stable gradients during training.

        Args:
            embedding_dim (int): Model's core vector dimension.
            num_heads (int): Number of attention heads.
            context_size (int): Maximum sequence length.
            ff_dim (int): Feed-forward intermediate dimension.
        """
        super().__init__()
        self.norm_1 = nn.LayerNorm(embedding_dim)
        self.attention = MultiHeadAttention(embedding_dim, num_heads, context_size)
        self.norm_2 = nn.LayerNorm(embedding_dim)
        self.feed_forward = FeedForward(embedding_dim, ff_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 1: Attention with residual.
        # Normalise first, run attention, add the result back to the original x.
        x = x + self.attention(self.norm_1(x))
        # Step 2: Feed-forward with residual.
        # Normalise first, run feed-forward, add the result back.
        x = x + self.feed_forward(self.norm_2(x))
        return x
```

Every piece is now built. The language model stacks them in order:

1. The embedding layer converts token indices into vectors with positional information.
2. A sequence of transformer blocks processes those vectors, with each block refining the representation by attending to context and reasoning about it.
3. A final layer normalisation stabilises the output.
4. A linear projection (the “language model head”) maps from the embedding dimension to the vocabulary size, producing one score per token. The token with the highest score is the one the model considers most likely to come next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests


class LanguageModel(nn.Module):
    @staticmethod
    def load_shakespeare() -> str:
        """
        Downloads Tiny Shakespeare (~1MB) and returns the raw text string.
        The text is only held in memory, not saved to disk, so call this
        once and pass the result around.
        """
        url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
        response = requests.get(url)
        response.raise_for_status()
        print(f"Dataset loaded: {len(response.text):,} characters")
        return response.text

    def __init__(self, vocab_size: int, embedding_dim: int, num_heads: int,
                 num_layers: int, context_size: int, ff_dim: int):
        """
        A complete decoder-only language model.

        Args:
            vocab_size (int): Number of unique tokens. Determines embedding
                table size and the output dimension of the LM head.
            embedding_dim (int): Core vector dimension throughout the model.
                All layers communicate using vectors of this size.
            num_heads (int): Attention heads per transformer block.
                Must divide evenly into embedding_dim.
            num_layers (int): Number of stacked transformer blocks.
                More layers = deeper reasoning, more computation.
            context_size (int): Maximum number of tokens the model processes
                at once. The attention mask is sized by this value.
            ff_dim (int): Feed-forward intermediate dimension in each block.
                Typically 4 × embedding_dim.
        """
        super().__init__()
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.blocks = nn.Sequential(*[
            TransformerBlock(embedding_dim, num_heads, context_size, ff_dim)
            for _ in range(num_layers)
        ])
        self.final_norm = nn.LayerNorm(embedding_dim)
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x: torch.Tensor, targets: torch.Tensor = None
                ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """
        Forward pass.

        Args:
            x: Token indices of shape (batch_size, seq_len)
            targets: Optional ground-truth next tokens for training.
                If provided, computes and returns cross-entropy loss.
        Returns:
            logits: Raw scores of shape (batch_size, seq_len, vocab_size).
            loss: Cross-entropy loss if targets provided, otherwise None.
        """
        x = self.embedding(x)
        x = self.blocks(x)
        x = self.final_norm(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            B, T, V = logits.shape
            # Flatten batch and time so every position is one classification example.
            loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, prompt: torch.Tensor, max_new_tokens: int,
                 context_size: int) -> torch.Tensor:
        """
        Autoregressively generate new tokens from a prompt.

        Args:
            prompt: Starting token indices of shape (1, seq_len)
            max_new_tokens: How many tokens to generate
            context_size: The model's maximum context window
        Returns:
            Full sequence including prompt + generated tokens
        """
        self.eval()
        for _ in range(max_new_tokens):
            # Keep only the most recent context_size tokens.
            context = prompt[:, -context_size:]
            logits, _ = self(context)
            # Only the prediction at the last position matters for the next token.
            last_logits = logits[:, -1, :]
            probs = F.softmax(last_logits, dim=-1)
            # Sample from the distribution rather than always taking the top token.
            next_token = torch.multinomial(probs, num_samples=1)
            prompt = torch.cat([prompt, next_token], dim=1)
        return prompt
```

Here is a complete, runnable training loop. It assumes the classes defined above are in the same file. Each section is commented with what it is doing and why, not just what the code says.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests

# ── Configuration ─────────────────────────────────────────────
EMBEDDING_DIM = 128    # Vector size throughout the model
NUM_HEADS = 4          # Must divide evenly into EMBEDDING_DIM
NUM_LAYERS = 4         # Number of stacked transformer blocks
CONTEXT_SIZE = 64      # Maximum tokens the model sees at once
FF_DIM = 512           # Feed-forward intermediate size (4 × EMBEDDING_DIM)
BATCH_SIZE = 32        # Sequences per training step
LEARNING_RATE = 3e-4   # Standard starting point for transformers
MAX_STEPS = 5000       # Total training steps

# ── Data preparation ──────────────────────────────────────────
raw_text = LanguageModel.load_shakespeare()
vocab = Tokenizer.build_vocab(raw_text)
tokenizer = Tokenizer(vocab)
vocab_size = len(vocab)

data = torch.tensor(tokenizer.encode(raw_text), dtype=torch.long)
split = int(0.9 * len(data))
train_data = data[:split]
val_data = data[split:]


def get_batch(split: str) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Sample a random batch of (input, target) pairs from the dataset.
    For each sequence in the batch:
        input  = tokens at positions [i : i + CONTEXT_SIZE]
        target = tokens at positions [i + 1 : i + CONTEXT_SIZE + 1]
    The target is the input shifted one position to the right.
    This is how next-token prediction is set up: for every token
    in the input sequence, the target is the very next token.
    """
    source = train_data if split == "train" else val_data
    indices = torch.randint(len(source) - CONTEXT_SIZE, (BATCH_SIZE,))
    x = torch.stack([source[i : i + CONTEXT_SIZE] for i in indices])
    y = torch.stack([source[i + 1 : i + CONTEXT_SIZE + 1] for i in indices])
    return x, y


# ── Model and optimiser ───────────────────────────────────────
model = LanguageModel(
    vocab_size=vocab_size,
    embedding_dim=EMBEDDING_DIM,
    num_heads=NUM_HEADS,
    num_layers=NUM_LAYERS,
    context_size=CONTEXT_SIZE,
    ff_dim=FF_DIM,
)
optimiser = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# ── Training loop ─────────────────────────────────────────────
for step in range(MAX_STEPS):
    model.train()
    x_batch, y_batch = get_batch("train")
    logits, loss = model(x_batch, targets=y_batch)

    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

    if step % 500 == 0:
        # Check the loss on held-out data without tracking gradients.
        model.eval()
        with torch.no_grad():
            x_val, y_val = get_batch("val")
            _, val_loss = model(x_val, targets=y_val)
        print(f"Step {step:5d} | train loss: {loss.item():.4f} | val loss: {val_loss.item():.4f}")

# ── Generation ────────────────────────────────────────────────
model.eval()
prompt_text = "T"
prompt_tokens = torch.tensor(tokenizer.encode(prompt_text)).unsqueeze(0)
generated = model.generate(
    prompt=prompt_tokens,
    max_new_tokens=200,
    context_size=CONTEXT_SIZE,
)
output_text = tokenizer.decode(generated[0].tolist())
print("\nGenerated text:")
print(output_text)
```

Let’s be explicit about every piece and what it contributes, now that the full picture is visible.

The Tokenizer converts text to integers so the model can process it mathematically, and converts integers back to text so you can read the output. Without it, the model cannot handle language at all.

The Embedding Layer converts those integers into vectors that can carry meaning. The token embedding learns what each token means. The positional embedding learns where in the sequence each token appears. Together, they give the model both content and context.
The Attention Block figures out which tokens in the sequence are relevant to each other. The causal mask ensures the model only looks backward, not forward, so it cannot cheat during training by peeking at the answer.

Multi-Head Attention runs several attention heads in parallel, each free to specialise in a different type of relationship. This gives the model the ability to track multiple linguistic patterns simultaneously.

The Feed-Forward Layer processes the information that attention surfaced. Attention moves information between positions; the feed-forward layer reasons about that information at each position independently.

The Transformer Block combines attention, feed-forward, residual connections, and normalisation into one repeating unit. Stacking these blocks is what gives the model its depth of reasoning.

The Language Model Head maps the final representation back into vocabulary space, producing a score for every possible next token. The higher the score, the more likely the token.

The Training Loop teaches the model to make better predictions by repeatedly sampling sequences, computing how wrong the model’s predictions were, and adjusting every parameter in the model slightly toward being more correct.

You have now built every component by hand. You understand what a Query is, why the mask exists, what a residual connection protects against, and why layer normalisation belongs before the sub-layer rather than after. That knowledge is what makes the next version readable.

PyTorch ships with a built-in class called nn.Transformer. It packages the multi-head attention and feed-forward layers into a ready-made block. Using it, you can build the same kind of language model in significantly fewer lines. The trade-off is that the internals are hidden behind the API, which is exactly why this version comes second, not first. Reading it without having built the components yourself produces understanding of the shape of the code, not understanding of the mechanism.
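One way to see what the built-in block packages is to instantiate a single `nn.TransformerDecoderLayer`, one of the building blocks PyTorch provides alongside `nn.Transformer`, and list its sub-modules. This is a minimal sketch for inspection only; the sizes are illustrative and simply mirror the configuration used in the training script above.

```python
import torch.nn as nn

# Illustrative sizes, mirroring EMBEDDING_DIM, NUM_HEADS and FF_DIM from above.
layer = nn.TransformerDecoderLayer(
    d_model=128,
    nhead=4,
    dim_feedforward=512,
    batch_first=True,
)

# named_children() yields the direct sub-modules the layer is assembled from.
children = dict(layer.named_children())
for name, module in children.items():
    print(f"{name:15s} {type(module).__name__}")
```

The listing should include a self-attention module (`self_attn`), the two linear layers of the feed-forward network (`linear1`, `linear2`), and several layer normalisation and dropout modules. It should also include a second attention module, `multihead_attn`, which is the cross-attention used in encoder-decoder models and has no counterpart in the from-scratch `TransformerBlock`.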
This version uses exactly the same tokenizer, the same embedding layer, the same training loop, and the same generation logic as the from-scratch version. The only thing that changes is the transformer block itself.

nn.Transformer gives you a complete encoder-decoder transformer. For a language model, which is decoder-only, you use only the decoder side. PyTorch also provides nn.TransformerDecoderLayer and nn.TransformerDecoder for exactly this purpose: a single decoder block and a stack of them, respectively.

Internally, nn.TransformerDecoderLayer contains the same pieces you built: multi-head attention, a feed-forward network, layer normalisation, and residual connections. It also contains one piece you did not build: a cross-attention sub-layer that attends to a separate “memory” sequence. The configuration parameters map directly to the ones you already know.

The one thing nn.Transformer does not build for you is the causal mask. You hand it in explicitly at the forward pass. PyTorch provides a utility function to generate it: nn.Transformer.generate_square_subsequent_mask(size). This encodes the same causal pattern as the lower-triangular mask your AttentionBlock registered as a buffer, but in additive form (negative infinity above the diagonal, zero elsewhere) and generated on demand rather than stored on the module.

```python
# ── The nn.Transformer Version ───────────────────────────────
class TransformerLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int, num_heads: int,
                 num_layers: int, context_size: int, ff_dim: int,
                 dropout: float = 0.1):
        """
        A decoder-only language model built with PyTorch's nn.TransformerDecoder.

        Compared to the from-scratch version:
            - The Embedding layer is identical
            - nn.TransformerDecoderLayer replaces your TransformerBlock
              (and adds dropout plus a cross-attention sub-layer)
            - nn.TransformerDecoder replaces the nn.Sequential stack of blocks
            - The LM head and generation logic are identical
            - The causal mask is generated at forward time, not stored as a buffer

        Args:
            vocab_size (int): Total number of unique tokens.
            embedding_dim (int): Core vector dimension. Called d_model in PyTorch docs.
            num_heads (int): Attention heads per layer. Must divide embedding_dim.
            num_layers (int): Number of stacked decoder layers.
            context_size (int): Maximum sequence length.
            ff_dim (int): Feed-forward intermediate dimension (dim_feedforward).
            dropout (float): Dropout rate applied inside each decoder layer.
                0.1 is a common starting value.
        """
        super().__init__()
        self.context_size = context_size
        # Identical to the from-scratch version
        self.embedding = Embedding(vocab_size, embedding_dim)
        # One decoder layer: the nn.Transformer equivalent of your TransformerBlock.
        # batch_first=True tells PyTorch our tensors are (batch, seq, features)
        # rather than the legacy (seq, batch, features) default.
        # norm_first=True selects the Pre-LN ordering used in the from-scratch
        # block; the PyTorch default is Post-LN.
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=embedding_dim,
            nhead=num_heads,
            dim_feedforward=ff_dim,
            dropout=dropout,
            batch_first=True,
            norm_first=True,
        )
        # Stack num_layers decoder layers. nn.TransformerDecoder is the
        # equivalent of your nn.Sequential block stack.
        self.transformer = nn.TransformerDecoder(
            decoder_layer=decoder_layer,
            num_layers=num_layers,
        )
        self.final_norm = nn.LayerNorm(embedding_dim)
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x: torch.Tensor, targets: torch.Tensor = None
                ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """
        Args:
            x: Token indices of shape (batch_size, seq_len)
            targets: Optional ground-truth next tokens for loss computation
        Returns:
            logits: Shape (batch_size, seq_len, vocab_size)
            loss: Cross-entropy loss if targets provided, else None
        """
        B, T = x.shape
        x = self.embedding(x)  # (B, T, embedding_dim)

        # Generate the causal mask for the current sequence length.
        # This is the same triangular pattern as in your AttentionBlock,
        # expressed as additive negative-infinity values rather than a
        # boolean fill. PyTorch's attention implementation adds this
        # directly to the attention scores before softmax.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T, device=x.device)

        # nn.TransformerDecoder expects both a target sequence and a memory
        # sequence (the encoder's output in a full encoder-decoder model).
        # For a decoder-only language model, we pass the same tensor for both.
        # tgt_mask enforces causality on the self-attention side. memory_mask
        # must enforce it on the cross-attention side as well: without it,
        # every position could read future tokens through the memory and the
        # training loss would look far better than the model really is.
        x = self.transformer(
            tgt=x,
            memory=x,
            tgt_mask=causal_mask,
            memory_mask=causal_mask,
        )
        x = self.final_norm(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            B, T, V = logits.shape
            loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, prompt: torch.Tensor, max_new_tokens: int,
                 context_size: int) -> torch.Tensor:
        """
        Identical generation logic to the from-scratch version.
        The interface is the same - swap the model, keep the loop.
        """
        self.eval()
        for _ in range(max_new_tokens):
            context = prompt[:, -context_size:]
            logits, _ = self(context)
            last_logits = logits[:, -1, :]
            probs = F.softmax(last_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            prompt = torch.cat([prompt, next_token], dim=1)
        return prompt
```

The training loop and generation code from Part 8 require exactly one change: replace LanguageModel(...) with TransformerLanguageModel(...). Every other line stays identical, because the two classes share the same forward signature and the same generate interface.

```python
# Replace this line in the training setup from Part 8:
model = LanguageModel(
    vocab_size=vocab_size,
    embedding_dim=EMBEDDING_DIM,
    num_heads=NUM_HEADS,
    num_layers=NUM_LAYERS,
    context_size=CONTEXT_SIZE,
    ff_dim=FF_DIM,
)

# With this:
model = TransformerLanguageModel(
    vocab_size=vocab_size,
    embedding_dim=EMBEDDING_DIM,
    num_heads=NUM_HEADS,
    num_layers=NUM_LAYERS,
    context_size=CONTEXT_SIZE,
    ff_dim=FF_DIM,
    dropout=0.1,
)
```

Everything else in the training loop and generation block is unchanged.

The two versions are close but not identical: the built-in layer adds dropout and a cross-attention sub-layer that the from-scratch block does not have. (A common alternative for decoder-only models is nn.TransformerEncoderLayer with a causal mask, which has no cross-attention at all.) The two versions should still reach broadly similar loss values on the same data. The more important difference is about control and transparency.

nn.TransformerDecoderLayer is a complete, tested, PyTorch-maintained implementation.
In recent PyTorch versions it can use the optimised F.scaled_dot_product_attention under the hood, which is meaningfully faster than the manual matrix multiplication in your AttentionBlock, especially on GPU. If you are training something real and do not need to modify the attention internals, the nn.Transformer version is the faster and more practical choice.

What it does not give you is the ability to change the attention mechanism itself. If you want to experiment with a different masking strategy, a modified scoring function, a different way of combining heads, or any of the architectural variations that appear in modern research, you need the from-scratch version. The internals of nn.TransformerDecoderLayer are not designed to be overridden piecemeal.

This is the honest trade-off: convenience and speed versus control and transparency. Most application developers want the former. Most researchers need the latter. Knowing how to build both means you can choose rather than being forced.

This model is small by modern standards, and deliberately so. The architecture is real. The training loop is real. The generation logic is real. What is small is the configuration: the embedding dimension, the number of layers, the context size, the dataset.

Scale any of those up and you are on the same path that produced every large language model you have heard of. The architecture is not fundamentally different. The ingredients are identical. What changes is size, data, and compute.

The most productive next experiments, in order of effort:

Train on a larger and more interesting text file. Shakespeare, a Wikipedia dump, or a domain-specific dataset will produce noticeably different generations and teach you a lot about what the model actually learns.

Increase the number of layers and the embedding dimension, then observe how the validation loss changes and how long training takes. This gives you a direct intuition for the compute-versus-quality trade-off that governs all of modern AI.
Add a dropout layer inside the transformer block for regularisation. When the model has more parameters than the training data can justify, dropout reduces overfitting by randomly zeroing activations during training.

Replace the character-level tokenizer with a subword tokenizer, for example one from the tiktoken library. Each token then covers far more text while the vocabulary size stays manageable, and subword tokenization is what most production language models use.

Build it. Break it. Fix it. Extend it. That is the only path from reading an article to understanding a technology.

Building an LLM from Scratch with Pytorch (https://pub.towardsai.net/building-an-llm-from-scratch-with-pytorch-b211a0d6793d) was originally published in Towards AI (https://pub.towardsai.net) on Medium.