{"slug": "reading-mai-s-efficiency-gain-how-to-pick-architectures-like-serious-people", "title": "Reading MAI's efficiency gain. How to pick architectures like serious people", "summary": "Microsoft's MAI-Thinking-1 report introduces an \"efficiency gain\" (EG) metric that compares candidate model architectures against a baseline by measuring how much compute is needed to reach the same loss, allowing evaluation on either FLOPs or wall-clock time. The metric addresses the common mismatch where a model optimized for theoretical FLOPs may perform poorly on real hardware due to low model FLOPs utilization (MFU), while a model that looks worse on paper could train faster in practice. In the report's comparison, the interleaved MoE baseline beat both conventional every-layer MoE candidates on wall-clock time: the best every-layer candidate scored 1.03 on FLOPs but only 0.82 on wall-clock time, and the other scored 0.94 and 0.73.", "body_md": "How to pick architectures like serious people.\n\nMicrosoft's [MAI-Thinking-1 report](https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf) is mostly about a 35B-active, ~1T-total sparse MoE reasoning model (the report is remarkably detailed, and we'll be digging into it over the next few weeks), but the bit I want to talk about today is a small methodological idea tucked into the pre-training section.\nThere is a perennial problem when designing a model: it's always a trade-off between compute budget and the final loss it reaches.\nAnd while you might point to the Chinchilla scaling laws and pick the point on the curve that optimises for your compute budget, that only works if you trust the curve to be the same for your new design as it was for the old one.\nThere is one glaring caveat in this: what FLOPs do you actually get from your own training stack and your cluster? 
And more importantly, if you're paying by the hour, what is the real wall-clock time?\nBecause I bet it doesn't match anyone else's, not quite.\n\nThis is where MAI's efficiency gain (EG) comes in. It's a metric that says how much better or worse a candidate design is than the baseline, and it can be computed on any cost axis we like. What we expect in lots of cases is that the optimal model from a FLOPs perspective will not be the optimal model from a wall-clock time perspective.\n\nIt's worth thinking a bit more about what the two types of efficiency we're talking about here mean. Your GPU (probably Nvidia, but bonus points if you're using something left field) has a headline number that tells you the theoretical peak FLOPs per second it can compute, but real models don't get anywhere close. The fraction of that peak a model actually achieves is its model FLOPs utilisation (MFU): a well-optimised Transformer might be sitting around 50-60%, and a more exotic architecture might be down in the 20s. That means that if you design a model that's cheap in FLOPs, but the kernels are bad and the MFU is low, you might end up with a model that looks great on paper but crawls on your real (and very expensive) hardware.\n\nCounting FLOPs does have one really important virtue: it is independent of the implementation. If you have a new idea, you don't have to wait for the kernels to be optimised to know if it's good. Established, mature architectures have had years of development and optimisation making their kernels incredibly fast, which makes it hard for a new idea to compete on wall-clock time until its own kernels are optimised. If you look at FLOPs, though, you can see the potential of the idea without worrying about the implementation details.\n\nBut as soon as we actually start training the model, whether we're renting time from a cloud or lucky enough to have on-prem hardware, we are always fighting to minimise the wall-clock time. 
That's either money out of your pocket, other people needing to share the cluster, or just fewer of your own experiments getting run.\n\nThese two numbers rarely line up, which is the delightful tension MAI's efficiency gain (EG) lets us reason about. The rough outline is that we fit a curve to a ladder of baseline runs so we know what loss the baseline reaches for any compute budget; then for a candidate that reaches some loss, we ask how much compute the baseline would have needed to reach that same loss, and divide by what the candidate actually spent. An EG above 1 means the candidate got there for less, so it wins; below 1 means it lost. Put FLOPs in for \"compute\" and we get $\\text{EG}_{\\text{FLOPs}}$; put wall-clock time in and we get $\\text{EG}_{\\text{Time}}$.\n\nTable 2 of the report compares MAI-Base-1's interleaved layout, high-sparsity 8/512 MoE layers alternating with dense FFN layers, against the more conventional choice of a medium-sparsity MoE in every layer. There are two every-layer candidates, measured on the L12–L30 rungs of the ladder, with EG aggregated across the eval suite using their code-heavy weighting (Eq. 3 in the report):\n\n| Candidate (vs interleaved baseline) | $\\text{EG}_{\\text{FLOPs}}$ ↑ | $\\text{EG}_{\\text{Time}}$ ↑ |\n|---|---|---|\n| MoE every layer (8/384) | 0.94 | 0.73 |\n| MoE every layer (7+1 shared/384) | 1.03 | 0.82 |\n\nSo here if we look at FLOPs, the 7+1 shared variant looks strong: $\\text{EG}_{\\text{FLOPs}} > 1$ means it beats the baseline on the FLOPs axis. But on the time axis, both candidates look bad, and while the 7+1 shared variant is better than the plain 8/384 variant, it's still a loss compared to the baseline. 
And a 3% win on FLOPs is not going to make up for an 18% loss on time, so the overall verdict is that the interleaved layout is still the better choice, even though the FLOPs say otherwise.\n\nThe key thing here is how easy it would have been to look at the FLOPs and say \"oh, the 7+1 shared variant is better, let's go with that\", without realising how much time it costs until that's become an expensive mistake.\n\nSome of this is obviously common sense, but putting a defensible number on it turns it into a genuine metric and lets us see the trade-off properly.\n\nIt helps to see this on a plot. The candidate flips sides between the two panels, sitting just to the left of the baseline curve on FLOPs (so just above 1) and well to the right of it on time (so well below 1), which is the same disagreement the table reports.\n\nWe can make all of this precise. MAI fit the baseline ladder with a power law,\n\n$$L(C) = E + A C^{-\\alpha},$$\n\nthe irreducible loss $E$ plus a reducible term that decays with the training cost $C$, where $A$ sets the size of the reducible part and $\\alpha$ controls the rate it falls at. Fitting the curve gives us a continuous function we can query at any cost (with a decent estimate of the error as well, for the statistically minded among you, but we'll ignore that for now) to ask: for the same cost, what would the baseline loss have been?\n\nThe actual metric needs this the other way round, compute-from-loss, so let's invert it:\n\n$$C(L) = \\left(\\frac{L - E}{A}\\right)^{-1/\\alpha}$$\n\nThen a candidate at loss $L_{\\text{cand}}$ and cost $C_{\\text{cand}}$ would have cost the baseline $C(L_{\\text{cand}})$ to match, and the efficiency gain is the ratio\n\n$$\\text{EG} = \\frac{C(L_{\\text{cand}})}{C_{\\text{cand}}}$$\n\nThis means we can read EG straight off the plot as the horizontal gap between the candidate and the curve at the candidate's loss. If the candidate is to the left of the curve, it means it got there for less, so EG > 1 and it wins; if it's to the right, it means it needed more, so EG < 1 and it loses.\n\nWe can reproduce this in a few lines. 
The report doesn't release the raw ladder points, so we've had to make some up based on the plots, but it shows the key point.\n\n``` python\nimport numpy as np\nfrom scipy.optimize import curve_fit\n\n# Baseline = the interleaved layout. A scaling ladder: training cost C and the\n# aggregated eval loss L it reaches, on two cost axes for the same runs.\nC_flops = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # units of 1e20 FLOPs\nC_time  = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # units of GPU-days\nL_flops = np.array([3.10, 2.92, 2.78, 2.66, 2.57])  # baseline loss vs FLOPs\nL_time  = np.array([3.02, 2.86, 2.73, 2.62, 2.54])  # baseline loss vs time\n\ndef law(C, A, alpha, E):\n    return A * C ** (-alpha) + E\n\ndef fit(C, L):\n    (A, alpha, E), _ = curve_fit(law, C, L, p0=[1.0, 0.2, 2.0], maxfev=100000)\n    return A, alpha, E\n\ndef inv(L, A, alpha, E):          # C(L): baseline cost needed to reach loss L\n    return ((L - E) / A) ** (-1.0 / alpha)\n\nA_f, a_f, E_f = fit(C_flops, L_flops)\nA_t, a_t, E_t = fit(C_time,  L_time)\n\n# Candidate = MoE-every-layer (7+1 shared): one loss, two costs.\nL_cand       = 2.70   # the loss it reached\nC_cand_flops = 6.09   # FLOPs spent  -> slightly cheap in FLOPs\nC_cand_time  = 5.78   # time spent   -> slow on the cluster\n\nEG_flops = inv(L_cand, A_f, a_f, E_f) / C_cand_flops\nEG_time  = inv(L_cand, A_t, a_t, E_t) / C_cand_time\n\nprint(f\"EG_FLOPs = {EG_flops:.2f}  (>1: candidate wins on FLOPs)\")\nprint(f\"EG_Time  = {EG_time:.2f}  (<1: candidate loses on the clock)\")\n```\n\nwhich prints\n\n```\nEG_FLOPs = 1.03  (>1: candidate wins on FLOPs)\nEG_Time  = 0.82  (<1: candidate loses on the clock)\n```\n\nNothing about efficiency gain is specific to this model. This is exactly the choice we all face every time we make any new change to our architectures and training recipes. 
For any new idea, we can compare it against a baseline ladder and work out exactly where it sits against the curve for FLOPs and wall-clock time.\n\nIt lets us ask whether a new idea that clearly wins on FLOPs for a given loss, but hasn't yet had its custom kernels optimised, is actually worth pursuing. We can see whether the wall-clock time is close enough that our engineering effort will pay off, or whether the very clever idea that works on paper just doesn't quite get there on our hardware. And if we have a new idea that actually does win on time as well, we can be confident that it's a real improvement and not just a quirk of the FLOPs counting.", "url": "https://wpnews.pro/news/reading-mai-s-efficiency-gain-how-to-pick-architectures-like-serious-people", "canonical_source": "https://idlemachines.co.uk/essays/efficiency-gain", "published_at": "2026-06-04 00:05:15+00:00", "updated_at": "2026-06-04 00:15:13.679380+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-research", "ai-infrastructure"], "entities": ["Microsoft", "MAI-Thinking-1", "Chinchilla"], "alternates": {"html": "https://wpnews.pro/news/reading-mai-s-efficiency-gain-how-to-pick-architectures-like-serious-people", "markdown": "https://wpnews.pro/news/reading-mai-s-efficiency-gain-how-to-pick-architectures-like-serious-people.md", "text": "https://wpnews.pro/news/reading-mai-s-efficiency-gain-how-to-pick-architectures-like-serious-people.txt", "jsonld": "https://wpnews.pro/news/reading-mai-s-efficiency-gain-how-to-pick-architectures-like-serious-people.jsonld"}}