A Transformer Becomes an LLM

A transformer architecture becomes a large language model through training on trillions of tokens, using residual connections to enable deep networks, tokenization to split text into manageable units, and fine-tuning to create an assistant.

From Transformer to ChatGPT: The Part That Isn't the Architecture

In the last post (/blog/transformers-and-attention), we followed the transformer down to one layer. Attention lets words shape each other. An MLP reshapes the result. The final numbers become a guess at the next word. Stack that layer many times and you have the architecture.

But the architecture is not the product. A stack of layers like that is not Claude, or GPT, or Gemini. Untrained, it is a pile of random numbers that knows nothing. This post follows the rest of the path: from that empty architecture to a model you can talk to.

Same approach as last time. Small worked examples and diagrams, not full notation.

TLDR

- Stack transformer layers into one big model. Parameter count measures size: 7B, 70B, 405B.
- Text gets split into tokens, not words. Models train on trillions of them.
- Pre-training teaches the model to guess the next token. Supervised fine-tuning (https://www.ibm.com/think/topics/fine-tuning), which means training on prompts paired with the desired answers, creates an assistant.
- Alignment, a post-training step that pushes the model toward answers humans prefer, shapes its answers.
- For cheap customization, freeze the model and train tiny add-on matrices. For serving, lean on GPUs.

We'll take each of those in turn. Same running example as the last post: "this too shall", and I'm hoping the model lands on "pass".

The piece that makes a deep network (a neural net with many layers) trainable: skip connections

In the last post, we skipped a training detail to keep the diagram clean. Let's dig into it now: the residual connection.

Here is the problem it solves. During training, errors flow backward through every layer.
That signal updates the early layers. Stack enough layers and the signal degrades on the way down. It can shrink toward nothing or blow up. Past some depth, early layers barely move.

The fix looks small for the problem: let the original signal bypass each block. Instead of passing only the block's output forward, pass the sum: input + block(input). The next layer always receives both pieces together. If the block is noisy early in training, the original input still survives inside that sum. Over time, the block learns a useful adjustment instead of rebuilding the whole representation from scratch.

The dashed skip paths in the diagram are side channels. The original signal has a clean route forward around the block. The correction signal gets the same route on the way back. Each block only has to learn a small adjustment on top of a stable input. That makes training steady enough for trillions of tokens.

First, text becomes tokens

The last post said "each word becomes a row of numbers". The real unit is the token, not the word. That difference explains a lot of how LLMs behave.

Models do not work on words. Words are the wrong unit. The output layer shows why. English has hundreds of thousands of words, and new ones keep arriving, many of them borrowed. Worse, the model has to score every vocabulary item at every step. With 600,000 words, that means 600,000 scores per step. Sentences are worse, since the set is open-ended.

Go the other way, then. Characters? Now the vocabulary is tiny. You might need a thousand symbols for letters, punctuation, and accents. But a single character carries little meaning. "a" and "I" are words; "z" is not. Asking the model to build meaning one character at a time is a brutal job.

So tokenizers split the difference by counting. They look for groups of characters that appear together across a huge pile of text. The most common groups become units. Those units are tokens. Common chunks become single tokens. Rare words split into several.
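The counting idea described above can be sketched in a few lines. This is a toy in the spirit of byte-pair encoding, which is my assumption about the technique and not something the post names; production tokenizers work on bytes, use pre-splitting rules, and differ in many details. The function names (`most_common_pair`, `merge_pair`, `learn_merges`) are hypothetical, chosen only for illustration.

```python
from collections import Counter


def most_common_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties go to the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Toy illustration: start from single characters, merge greedily."""
    # Each word starts as a tuple of characters, weighted by its frequency.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words
```

Run on the toy corpus "the the the then aardvark" with two merges, this first joins "t" and "h", then joins "th" and "e". The common word "the" ends up as a single unit, while the rare "aardvark" is still eight separate characters. That is the same pattern as above: frequent chunks earn their own token, rare words get built from pieces.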
There is no clean grammar rule for the splits.

You can watch this happen on Tiktokenizer (https://tiktokenizer.vercel.app). Feed it "I am an aardvark on a large ark" with the Llama 3 tokenizer and you get 11 tokens, each mapped to an ID.

For ordinary English prose I budget 1.5 to 2 tokens per word. Code, odd names, and non-English text can push that ratio up.

That ratio is not the vocabulary size. They count different things. The ratio is how many tokens one word costs when you split it. The vocabulary is how many distinct tokens exist to choose from. Many recent models carry around 200,000. English has maybe 600,000 words, but the vocabulary stays smaller because rare words do not each get a slot. They get built from pieces, the way "aardvark" broke into three above.

This one design choice explains several things people find odd about LLMs.

Why they handle typos and mixed languages so well. The model never sees "words" the way you do. Misspell something, or drop a Hindi word into an English sentence. The model just sees a different token sequence. A typo becomes a different set of chunks, not an unknown word that jams the input. The output can get worse, but the model can still read it.

Fun fact: ask a model how many R's are in "strawberry" and it often says two. The answer is three. People have written this up.

Why they cannot count the R's in "strawberry". Two reasons stack up. First, the model does not see s-t-r-a-w-b-e-r-r-y as ten letters. It sees tokens like "str", "aw", and "berry". The letters sit inside chunks. Second, there is no built-in loop that walks the characters and increments a counter. The model looks at tokens, reshapes embeddings, and emits the next token. Nowhere in that loop does it stop to tally. Give it a Python sandbox and it can write code to count instead.

Why vocabulary size is not the context window. People mix these up. Vocabulary is the distinct tokens the model knows.
The context window is how many tokens you can feed it at once, maybe 64,000 or a million depending on the model. So a "200K context" means the window, not the size of the dictionary.

The three phases of training

Fresh out of initialization, a model is random. You turn it into something useful in three phases. Each phase starts from the last checkpoint. The mechanism stays the same: show the model text, check its next-token guess, and nudge the weights toward the right answer. Only the data and the goal change.

Phase 1: Pre-training

Phase one is the brute-force part. Predict the next token, over and over. Pour in internet text. Let the model guess at every position. Correct it when it is wrong. Repeat across more than 10 trillion tokens.

This is why people call them causal models. Here, causal means left-to-right. The model predicts token 10 from tokens 1 through 9. It never gets to peek at token 11. That is the causal attention (/blog/transformers-and-attention) from the previous post. Letting the model see future tokens during training would hand it the answer.

What text? Web pages, filtered for quality. FineWeb (https://huggingface.co/datasets/HuggingFaceFW/fineweb) is a public example: a cleaned web dataset built from Common Crawl. Labs also add high-quality open-source code, so the model picks up structure and syntax. They add math and word problems too, with datasets like FineMath (https://huggingface.co/datasets/HuggingFaceTB/finemath). Those examples teach logic and step-by-step reasoning.

When the model produces text, it runs a loop called autoregressive generation. The useful half of the word is auto, meaning self. The model feeds its own output back in. It reads the prompt, produces one token, appends that token, and runs again. Token, append, repeat. That is why chat UIs look like they type one piece at a time.

The loop includes randomness. If you always took the most likely next token, the model would get stuck repeating itself. Think about what follows a closing