A Transformer Becomes an LLM

A transformer architecture becomes a large language model through training on trillions of tokens, using residual connections to enable deep networks, tokenization to split text into manageable units, and fine-tuning to create an assistant.

From Transformer to ChatGPT: The Part That Isn't the Architecture In the last post /blog/transformers-and-attention , we followed the transformer down to one layer. Attention lets words shape each other. An MLP reshapes the result. The final numbers become a guess at the next word. Stack that layer many times and you have the architecture. But the architecture is not the product. A stack of layers like that is not Claude, or GPT, or Gemini. Untrained, it is a pile of random numbers that knows nothing. This post follows the rest of the path: from that empty architecture to a model you can talk to. Same approach as last time. Small worked examples and diagrams, not full notation. TLDR - Stack transformer layers into one big model. Parameter count measures size: 7B, 70B, 405B. - Text gets split into tokens , not words. Models train on trillions of them. - Pre-training teaches the model to guess the next token. Supervised fine-tuning https://www.ibm.com/think/topics/fine-tuning training on prompts with the desired answers creates an assistant.- Alignment a post-training step that pushes the model toward answers humans prefer shapes answers. - For cheap customization, freeze the model and train tiny add-on matrices. For serving, lean on GPUs. We'll take each of those six in turn. Same running example as the last post: "this too shall", and I'm hoping the model lands on "pass". The piece that makes a deep network neural net with many layers trainable: skip connections In the last post, we skipped a training detail to keep the diagram clean. Let's dig into it now: the residual connection. Here is the problem it solves. During training, errors flow backward through every layer. That signal updates the early layers. Stack enough layers and the signal degrades on the way down. It can shrink toward nothing or blow up. Past some depth, early layers barely move. The fix looks small for the problem: let the original signal bypass each block. Instead of passing only block input forward, pass the sum: input + block input . The next layer always receives both pieces together. If the block is noisy early in training, the original input still survives inside that sum. Over time, the block learns a useful adjustment instead of rebuilding the whole representation from scratch. Those dashed paths are side channels. The original signal has a clean route forward around the block. The correction signal gets the same route on the way back. Each block only has to learn a small adjustment on top of a stable input. That makes training steady enough for trillions of tokens. First, text becomes tokens The last post said "each word becomes a row of numbers". The real unit is the token, not the word. That difference explains a lot of how LLMs behave. Models do not work on words. Words are the wrong unit. The output layer shows why. English has hundreds of thousands of words, and it keeps borrowing more. New words keep arriving. Worse, the model has to score every vocabulary item at every step. With 600,000 words, that means 600,000 scores per step. Sentences are worse, since the set is open-ended. Go the other way, then. Characters? Now the vocabulary is tiny. You might need a thousand symbols for letters, punctuation, and accents. But a single character carries little meaning. "a" and "I" are words; "z" is not. Asking the model to build meaning one character at a time is a brutal job. So tokenizers split the difference by counting. They look for groups of characters that appear together across a huge pile of text. The most common groups become units. Those units are tokens . Common chunks become single tokens. Rare words split into several. There is no clean grammar rule for the splits. You can watch this happen on Tiktokenizer https://tiktokenizer.vercel.app . Feed it "I am an aardvark on a large ark" with the Llama 3 tokenizer and you get 11 tokens, each mapped to an ID: For ordinary English prose I budget 1.5 to 2 tokens per word. Code, odd names, and non-English text can push that ratio up. That ratio is not the vocabulary size. They count different things. The ratio is how many tokens one word costs when you split it. The vocabulary is how many distinct tokens exist to choose from. Many recent models carry around 200,000. English has maybe 600,000 words, but the vocabulary stays smaller because rare words do not each get a slot. They get built from pieces, the way "aardvark" broke into three above. This one design choice explains several things people find odd about LLMs. Why they handle typos and mixed languages so well. The model never sees "words" the way you do. Misspell something, or drop a Hindi word into an English sentence. The model just sees a different token sequence. A typo becomes a different set of chunks, not an unknown word that jams the input. The output can get worse, but the model can still read it. Fun fact:ask a model how many R's are in "strawberry" and it often says two. The answer is three. People have written this up . Why they cannot count the R's in "strawberry". Two reasons stack up. First, the model does not see s-t-r-a-w-b-e-r-r-y as ten letters. It sees tokens like "str", "aw", and "berry". The letters sit inside chunks. Second, there is no built-in loop that walks the characters and increments a counter. The model looks at tokens, reshapes embeddings, and emits the next token. Nowhere in that loop does it stop to tally. Give it a Python sandbox and it can write code to count instead. Why vocabulary size is not the context window. People mix these up. Vocabulary is the distinct tokens the model knows. The context window is how many tokens you can feed it at once, maybe 64,000 or a million depending on the model. So a "200K context" means the window, not the size of the dictionary. The three phases of training Fresh out of initialization, a model is random. You turn it into something useful in three phases. Each phase starts from the last checkpoint. The mechanism stays the same. Show the model text. Check its next-token guess. Nudge the weights toward the right answer. The data and goal change. Phase 1: Pre-training Phase one is the brute-force part. Predict the next token, over and over. Pour in internet text. Let the model guess at every position. Correct it when it is wrong. Repeat across more than 10 trillion tokens. This is why people call them causal models. Here, causal means left-to-right. The model predicts token 10 from tokens 1 through 9. It never gets to peek at token 11. That is the causal attention /blog/transformers-and-attention from the previous post. Letting the model see future tokens during training would hand it the answer. What text? Web pages, filtered for quality. FineWeb https://huggingface.co/datasets/HuggingFaceFW/fineweb is a public example: a cleaned web dataset built from Common Crawl. Labs also add high-quality open-source code, so the model picks up structure and syntax. They add math and word problems too, with datasets like FineMath https://huggingface.co/datasets/HuggingFaceTB/finemath . Those examples teach logic and step-by-step reasoning. This loop is autoregressive generation. The useful half of the word is auto , self. The model feeds its own output back in. It reads the prompt, produces one token, appends that token, and runs again. Token, append, repeat. That is why chat UIs look like they type one piece at a time. The loop includes randomness. If you always took the most likely next token, the model would get stuck repeating itself. Think about what follows a closing </li tag in HTML. The top choice may open another list item. Keep taking the top choice and the model can loop forever. So you sample from the probability distribution. That is also why the same prompt can give different answers. What you have after phase one has a name: a base model . I think of it as glorified autocomplete. Give it "the sky is" and it may continue with "blue". But it does not answer questions. It continues text. Ask it something and it might generate more questions. Left to ramble, it hallucinates because rambling was the job. This is also where the GPT name comes from: Generative Pre-trained Transformer. The "pre-trained" part is this phase. Phase 2: Supervised fine-tuning Now you want it to respond like an assistant instead of continuing text. Take the base model and keep training it, but swap the dataset. Out goes raw internet text. In come conversations: an instruction, a system prompt, and a good response. Companies spend real money on these examples and keep them private. Two models can share the same architecture and the same web text, then behave differently because their conversation data differs. The mechanism does not change. The model still predicts the next token and receives a correction when wrong. But the examples now look like conversations. The model learns to treat your input as a prompt to answer, not text to continue. "The sky is" stops triggering "blue". It starts explaining instead. People call this instruction following. Do not confuse it with in-context learning . That is when the model picks up a pattern from examples inside the prompt. The trick sits in the markup. The training pipeline turns the chat into one long string. Special tokens mark who is speaking. Run a chat through the GPT-4o tokenizer https://tiktokenizer.vercel.app and you see it: <|im start| system<|im sep| You are a helpful assistant<|im end| <|im start| user<|im sep| Tell me a story<|im end| <|im start| assistant<|im sep| Each <|im start| is a single token, not the characters spelled out. The tokenizer added it on top of the base vocabulary, so it carries a high ID. Training teaches the model to read these markers as structure. They tell it where the system instruction begins, where the user's turn ends, and where its own turn starts. That trailing <|im start| assistant<|im sep| is the cue: now autocomplete as the assistant. To the model, this is still one block of text. Nothing magic separates "system" from "user". The separation lives in the special tokens and in training. With enough examples, the model treats the system instruction as high-priority context. That connects to the u-shaped attention bias /blog/transformers-and-attention from the previous post. Training shapes it too. SFT trains the model's actual weights, continuing from the pre-training checkpoint. It is not a lightweight add-on like LoRA lora , though people often assume so. LoRA is the cheap path the rest of us reach for later. The labs train the full weights. After this phase, the model fits answer-style benchmarks better. A base model only continues text, so you test it with tricks, like ending the prompt with "Answer:" and reading the next token. Benchmarks like MMLU https://arxiv.org/abs/2009.03300 multiple-choice exams and GSM8K https://arxiv.org/abs/2110.14168 math word problems want a specific answer. An instruction-tuned model gives one directly. Phase 3: Alignment After phase two, the model can answer. Phase three pushes those answers toward what people want: more useful, safer, less rambling. In some cases it also improves reasoning. This is where reinforcement learning comes in. The cleanest version to picture is reinforcement learning from human feedback, RLHF. Show people two answers to the same prompt and ask which one they prefer. Collect those preferences. Train a second model to predict which answer people would pick. Then train the chat model to produce answers that score well under that predictor. The signal changes from "this is the next token" to "people liked this answer more." Direct Preference Optimization https://arxiv.org/abs/2305.18290 DPO gets similar behavior with less machinery. It learns straight from preference pairs. This phase also helps create reasoning models , as far as outsiders can tell. You will see LRM, large reasoning model, set against LLM. The block diagram is still the same transformer. The training changes. The model learns to spend more tokens on intermediate reasoning before the final answer. Some chat UIs show this as a grey "thinking" box. Implementations vary, but the idea is simple: give the model room to work before it answers. Customizing a model on a budget: LoRA Say you have a capable open model and you want it better at one narrow thing. The obvious move is to keep training it on your data. The problem is scale. These are large models for a reason. They have billions of parameters, sometimes hundreds of billions. The largest proprietary models may cross a trillion. Your budget will not cover updates for every weight. Full fine-tuning is off the table for almost everyone. The cheaper family is called PEFT: parameter-efficient fine-tuning. The most common member is LoRA, short for Low-Rank Adaptation. The idea: leave the original weights alone, and bolt a small trainable detour beside them. Take any weight matrix in the model. Instead of editing it, add a parallel path made of two small matrices. The input flows through the frozen matrix and through the detour. Then you add the results. Why two matrices instead of one? The savings come from a thin middle. Take one weight matrix from the model. At the small size from the last post /blog/transformers-and-attention words-become-numbers , four numbers per word, it is 4 by 3, so 12 numbers to train. LoRA leaves that matrix frozen and trains two skinny ones in its place: one shrinks 4 down to 1, the other grows 1 back up to 3. That narrow step in the middle is the "low rank" in the name. Count the numbers: original value matrix: 4 × 3 = 12 numbers to train LoRA detour: 4 × 1 + 1 × 3 = 4 + 3 = 7 numbers to train Seven instead of twelve. On this small matrix that saves five numbers, which is nothing. But real matrices are huge. The LoRA rank stays far below the real dimensions. Across a whole model, that can drop the trainable share below one percent of the weights. A good default is to add LoRA to the attention and MLP matrices. Why does it work at all? It leans on a fact about matrices: two smaller matrices multiplied together can approximate one larger matrix. The approximation is less expressive than retraining the full matrix, so the results are weaker than full fine-tuning. But it costs far less, and it can still shift behavior. That is the whole trade. A typical exercise takes a base model and uses LoRA to make it better at code. Why everyone is racing for data and compute Around 2022, the scaling papers https://arxiv.org/abs/2001.08361 handed labs a simple incentive. Plot model performance against training compute. On many tasks, performance does not climb in a smooth line. It sits flat, near chance, then jumps. The model suddenly "gets" the task. People call this emergence https://arxiv.org/abs/2206.07682 . That pattern set off the race we are in now. If more data and compute keep unlocking new abilities, the move is obvious: grab as much of both as you can. We are still inside that scaling era. Running the model: inference Training is half the story. Using a trained model is inference , and it has its own constraints. Start with the hardware. Look back at the work: multiply a tensor by a matrix, multiply again, add tensors, take dot products. It is matrix multiplication all the way down. GPUs fit that job. They still have "graphics" in the name because projecting triangles onto a screen uses the same linear algebra. A GPU runs thousands of multiply-add operations side by side, so it wins on throughput where a CPU would crawl. You do not write the GPU code by hand. PyTorch takes your model and calls optimized GPU kernels underneath. Inference means loading the model onto the GPU and running it. Serving at scale is a real engineering problem. Frameworks like vLLM https://github.com/vllm-project/vllm , Triton https://github.com/triton-inference-server/server , and TensorRT-LLM https://github.com/NVIDIA/TensorRT-LLM help with it. Training uses separate code and separate hardware. A cloud platform like Modal https://modal.com/ is one way to rent GPUs for that run. Because the full models are heavy, a lot of effort goes into smaller, cheaper variants. Three levers come up: Fewer parameters. Use small models for classification. Quantization. Store weights at lower precision. The model uses less memory and runs faster, with a small accuracy tradeoff. Mixture of Experts. This one is worth a closer look. Mixture of Experts MoE swaps out the priciest part of a transformer layer, the MLP. Instead of one MLP per layer, you keep several and add a small router . For each token coming in, the router picks which expert handles it, usually just the top one or two. The experts are not topic specialists in any human sense. Routing happens per token. Each expert learns to handle the tokens sent its way. So a model can advertise 100B total parameters while firing only a small slice on each token. Big capacity, small per-token bill. You can also run inference on your own machine. Tools like llama.cpp https://github.com/ggml-org/llama.cpp , Ollama https://ollama.com/ , and LM Studio https://lmstudio.ai/ make that practical. Operating systems are starting to expose model APIs too. The landscape If you are picking a model to build on, the useful split is not the brand. One side is hosted and proprietary. The other is download and run. On the hosted side, ChatGPT https://chatgpt.com/ and Claude https://claude.ai/ lead in the US. DeepSeek https://www.deepseek.com/ and Qwen https://github.com/QwenLM/Qwen are major Chinese families. On the open-weights side, Llama https://www.llama.com/ from Meta and Gemma https://ai.google.dev/gemma from Google can run on your own hardware. Legal and security constraints often decide the choice. On regulated work, "we cannot send data to that API" can end the debate. The idea worth keeping Last post's takeaway was that numbers can represent an idea. The architecture shuffles those numbers with discipline. Here is the companion idea: the architecture is the easy half. Two frontier models can be built on nearly the same diagram and still act nothing alike. The costly part is everything around that diagram: the tokens, the training phases, the preference data, the compute. So when someone says "it's just a transformer," they are pointing at the part that is settled. They are leaving out the work and the money. That is why good data is hard to get, and why fine-tuning is a job of its own. Closing We started with a random stack of layers and ended with something you can talk to. None of it needed new math beyond the last post. The missing piece was process: tokens in, three training phases on top, a cheap way to bend the model later, and a pile of GPUs to run it.