{"slug": "reiner-dwarkesh-transcript-md", "title": "reiner-dwarkesh-transcript.md", "summary": "Based on the transcript, this is a blackboard-style interview between Dwarkesh Patel and Reiner Pope, CEO of chip startup MatX, discussing model architecture and machine learning infrastructure. The conversation focuses on how batch size affects token cost and speed in transformer models, using a roofline analysis of a Blackwell NVL72 GPU cluster to explain inference time and cost trade-offs. Topics also include Mixture of Experts (MoE) model layout, pipeline parallelism, and the implications of reinforcement learning on model over-training beyond Chinchilla-optimal scaling.", "body_md": "(00:00:00) – How batch size affects token cost and speed\n(00:32:09) – How MoE models are laid out across GPU racks\n(00:47:12) – How pipeline parallelism moves model layers across racks\n(01:03:37) – Why Ilya said, “As we now know, pipelining is not wise.”\n(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal\n(01:33:02) – Deducing long context memory costs from API pricing\n(02:04:02) – Convergent evolution between neural nets and cryptography\nDwarkesh Patel\nToday, I'm interviewing Reiner Pope, who is the CEO of MatX, which is a new chip startup. Previously, he was doing TPU architecture and many other things at Google. This is a very different format from my usual interviews. This is going to be a blackboard lecture. We're going to get up in a second. We in fact built this whole new studio with specifically this format in mind, so it's a pleasure to get to inaugurate it with you.\nWe're going to be talking about model architecture, ML infra, and many other things. 
The reason I think it's an important topic is because once you understand how training and inference work in a cluster, a lot of things—about why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and fundamentally why AI progress is the way it is—start making sense. You need to understand the details to get there, and you need a blackboard to understand the details. Reiner, thank you so much for doing this.\nReiner Pope\nVery happy to be here.\nDwarkesh Patel\nFull disclosure, I am an angel investor in MatX, but that's unrelated to this podcast. Reiner, to kick us off I'll ask this question. We have a couple of companies like Claude and Codex and Cursor offering something like Fast Mode, where for 6x the price, they'll stream you tokens at 2.5x the speed. Mechanically, I'm curious what's going on here. Why is it the case that you can pay more to get faster latency?\nTwo, could you keep going? Could you pay 100x more and somehow get much faster speeds? Three, could you go the other way? Could you have something like Claude Code “Slow Mode”, where if you are willing to wait for minutes on end, you could get even cheaper prices? Maybe this will help motivate the analysis that you'll be doing through the lecture.\nReiner Pope\nGreat. To jump to the conclusion a little bit, the big effect is batch size. What we're going to do now is quantify exactly what that looks like and what its implications are on latency and cost. There's another effect, which you can call speculative decoding or multi-token prediction. We can maybe come back to that later, but the first thing that we'll talk through is batch size.\nWhat I'd like to introduce is the two principles of analysis. First, we're going to look at a roofline analysis of how we run a transformer model on a cluster of chips. We'll take a Blackwell NVL72 cluster, so a rack of 72 GPUs. The roofline analysis means we look at memory bandwidth and compute performance. 
The other side of that is that we're going to look at just two simple factors of the model: the time to operate on the weights, and the time to operate on the context, the KV cache.\nLet's jump in. We're going to try and estimate the time that it takes to run an inference of a certain shape. We're not perfect here. We can't exactly predict the time, so instead we're going to approximate. We're going to say that the time must be greater than or equal to a certain quantity. We're going to consider two different aspects: the time it takes to do the memory fetches, and the time it takes to do the compute. It will turn out that this gives us very strong predictive power, even with a simple model.\nOne by one, what is the time that it takes to do the compute? There are really two things I need to do in the compute. I need to multiply by all of the active parameters, and then I need to do some work on the attention. Multiplying by all the active parameters, I have a certain batch size that I'm running, and I've got a number of active parameters in my model. Then I'm just going to divide this by the compute throughput, which is the FLOPs of the chip. This is a hardware concern.\nThis accounts for all of the compute time for all of the weight matrix multiplies. There's a little caveat here. We've ignored the time to do any of the attention computation, but that in general will be quite small in comparison to this. So we'll ignore this.\nDwarkesh Patel\nI'll just interrupt from time to time to ask some very naive questions or to clarify some basic points. For the audience, you're not serving one user at a time. The batch refers to the fact that you're serving many different users at the same time, and that's a whole batch.\nReiner Pope\nI can motivate the batch at least a little bit. We will see exactly why batch is such a favorable optimization. 
What will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be a thousand times worse than if you do batch many users together. We'll be able to see that quite explicitly.\nThen, number of active parameters. If I look at, for example, a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters, and 700 billion total parameters. We're focusing on just the ones that are active for a single AI token.\nWe're modeling compute performance. I'm going to keep writing equals, but in all of these cases, you can think of this time as being at least this much, and maybe there will be some terms we ignored.\nOn the memory side, what do we need to do with memory? We need to fetch all of the weights, so there is some time to fetch the total number of parameters, not just the active parameters. There's weight fetch time, and then in addition, there's a KV cache fetch time. This actually depends on batch size. For every element of the batch, we have to fetch an entire context length worth of tokens, and there's a size per token, bytes for one token. This is a model parameter.\nDwarkesh Patel\nMaybe just backing up, let's explain what the KV cache is real quick.\nReiner Pope\nWhen I do a forward pass… Let me draw how the autoregressive inference works. This is during decode. If I have a bunch of text tokens… I'm drawing a tensor because ultimately the tokens are represented as a tensor in some embedding dimension. In this direction, I have the sequence length.\nThe work of running a decode is that I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers. In general, I'm going to have to do that work over all of these tokens. But one step of decode is to produce just this one additional token up here.\nWhat I'm going to do there is run a full forward pass of multiplying by all of the weight matrices in the entire model. 
But then I've got this attention mechanism where this token is looking at all of the past tokens, and what is it looking at specifically? It is looking at some internal representation that the model has produced of the tokens, and we call that the KV cache. This process of this single token attending to all of the history of tokens is attention. It is mostly dominated by memory fetches rather than matrix multiplies.\nSo we've got the amount of memory that we're fetching shown over here, and then this is of course just divided by the memory bandwidth, so the memory bytes per second. In fact, these equations here are enough for us to now draw some fit lines. The things that we'd like to look at are sensitivity to batch, and then also, which we'll draw separately, to context length. We said that the big effect you can get is some trade-off in latency versus cost in batch size.\nLet's draw them out. I think there are just really two graphs that we want to draw. We'll first draw batch size versus time here. When we look at the shape of this, we've got a maximum of the sum and then another term. Let's look at these terms one by one and how they scale: the time for compute and memory, and how they show up.\nLet's first look at this compute time. This is just purely linear in batch size with no offset, so it is some curve like this. This is t compute. On the memory side, we've got some portion here that is just this constant in some base offset here, which is the weight fetch. Finally, we have this term here, which is the KV fetch, which is pretty linear in batch size, and so it looks like that. The sum of this plus this maxed with this… Let's at least first draw the sum. The two memory times in conjunction end up looking on this curved slope like this. Then the overall maximum is—I'll draw a little thicker here—the maximum of these two curves.\nWhat does this mean? This is a latency plot. 
If I grow my batch size, initially I get some not very strong dependence on batch size, so there is some lower bound on latency here. This already partially answers the question. For a given hardware configuration—and we can talk about varying the hardware configuration—there is a lower bound on latency. It is simply that I need to read all of my total parameters from memory into the chips, and that takes a certain amount of time. If I use all of my memory bandwidth, I can't do any better than that.\nDwarkesh Patel\nIt seems like the way you've drawn the slopes for compute time and how the KV grows—and what implication the KV has on memory time—\nReiner Pope\nWhat if this were above or below?\nDwarkesh Patel\nYeah, is that necessarily the case? If this is always true, then as batch size grows compute always dominates KV, which suggests that if you have a big enough batch size, maybe memory is never an issue.\nReiner Pope\nThis is really sensitive to the context length, so I think we should come back and explore this. As you vary the context length, the KV fetch time will go up and up, and that will cause a transition from compute-limited to memory-limited.\nDwarkesh Patel\nIs there something especially significant about the slope being exactly the slope of the compute time?\nReiner Pope\nWhenever we have balance points, it says that you're getting it exactly right. For the particular context length where the slopes match, that says I am equally memory-bound and compute-bound, which is a really desirable place to be.\nDwarkesh Patel\nThis is a very simple algebra problem, but suppose the optimal is 100K context length, and you go to 200K context length. Does your MFU go down to 50%? Does it have a humongous impact on MFU to be slightly outside of the optimal context length range, the Goldilocks zone?\nReiner Pope\nThat's right. That is true as modeled here. There is a key point here that I'm modeling the memory fetch as linear in context length. 
That depends on model architecture. It is true for all of the model architectures with dense attention. Sparse attention actually scales much better than that.\nDwarkesh Patel\nGot it. Is sparse attention what everybody uses in practice?\nReiner Pope\nI'm pretty excited about sparse attention. It's hard to know what the labs are using. DeepSeek has published a sparse attention mechanism. I'll just put a plug in that some of the DeepSeek papers that have published sparse attention end up putting a square root in this term.\nSo far, we've looked at the latency. It's hard to read off cost from this. If I think about what cost means… To run this inference, I'm going to use the GPU for a certain number of seconds, like one millisecond or 20 milliseconds. I have to pay the rental time for that time. So it's $2/hour per GPU or something like that.\nThat's the cost of this inference, but how many tokens have I processed during that inference? That is the batch size. What we actually want to plot is the cost versus batch size, which is t over B versus batch size. This is the cost per token. We have to imagine dividing each of these three curves by B, so multiplying by this reciprocal. What we end up with there is… The compute curve was linear. We divide by B, and that makes it a constant here. This is t compute. 
The K", "url": "https://wpnews.pro/news/reiner-dwarkesh-transcript-md", "canonical_source": "https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe", "published_at": "2026-04-29 16:58:29+00:00", "updated_at": "2026-05-22 13:48:35.361676+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "semiconductor", "hardware"], "entities": ["Reiner Pope", "MatX", "Google", "Dwarkesh Patel"], "alternates": {"html": "https://wpnews.pro/news/reiner-dwarkesh-transcript-md", "markdown": "https://wpnews.pro/news/reiner-dwarkesh-transcript-md.md", "text": "https://wpnews.pro/news/reiner-dwarkesh-transcript-md.txt", "jsonld": "https://wpnews.pro/news/reiner-dwarkesh-transcript-md.jsonld"}}