# Reading MAI's efficiency gain. How to pick architectures like serious people

> Source: <https://idlemachines.co.uk/essays/efficiency-gain>
> Published: 2026-06-04 00:05:15+00:00

Microsoft's [MAI-Thinking-1 report](https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf) is mostly about a 35B-active, ~1T-total sparse MoE reasoning model (and the report is so detailed by the way, we'll be digging into this over the next few weeks), but the bit I want to talk about today is a small methodological idea tucked into the pre-training section.
There is a perennial problem when designing a model: there is always a trade-off between the compute budget and the final loss the model reaches.
And while you might point to the Chinchilla scaling laws and pick the point on the curve that optimises for your compute budget, that only works if you trust the curve to be the same for your new design as it was for the old one.
There is one glaring caveat in this: what FLOPs do you actually get from your own training stack and your cluster? And more importantly, if you're paying by the hour, what is the real wall-clock time?
Because I bet it doesn't match anyone else's, not quite.

This is where MAI's efficiency gain (EG) comes in. It's a metric that says how much better or worse a candidate design is than the baseline, and it can be computed on any cost axis we like. What we expect in lots of cases is that the optimal model from a FLOPs perspective will not be the optimal model from a wall-clock time perspective.

It's worth thinking a bit more about what the two types of efficiency we're talking about here mean. Your GPU (probably Nvidia, but bonus points if you're using something left field) has a headline number for the theoretical peak FLOP/s it can deliver, but real models don't get anywhere close. The fraction of that peak a training run actually achieves is its model FLOPs utilisation (MFU). A well optimised Transformer might be sitting around 50-60% MFU, and a more exotic architecture might be down in the 20s. That means that if you design a model that's cheap in FLOPs, but the kernels are bad and the MFU is low, you might end up with a model that looks great on paper but crawls on your real (and very expensive) hardware.
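To put rough numbers on that, here is a minimal sketch of the arithmetic. Everything in it is made up for illustration: the peak throughput, the cluster size, the two FLOP budgets and the two utilisation figures are hypothetical, and `wall_clock_hours` is just a helper name for this example. MFU is taken to mean the fraction of the hardware's peak FLOP/s that the run actually achieves.

```python
# Hypothetical numbers throughout; none of these come from the MAI report.
PEAK_FLOPS_PER_SEC = 1.0e15   # assumed peak throughput of one accelerator
NUM_GPUS = 1024               # assumed cluster size


def wall_clock_hours(total_flops: float, mfu: float) -> float:
    """Hours needed to spend `total_flops` at a given model FLOPs utilisation."""
    achieved = PEAK_FLOPS_PER_SEC * NUM_GPUS * mfu   # FLOP/s actually delivered
    return total_flops / achieved / 3600.0


# A mature design: more FLOPs, but well-optimised kernels.
baseline_hours = wall_clock_hours(total_flops=1.0e24, mfu=0.55)
# An exotic design: 20% fewer FLOPs, but kernels that only reach 25% MFU.
exotic_hours = wall_clock_hours(total_flops=0.8e24, mfu=0.25)

print(f"baseline: {baseline_hours:.0f} h, exotic: {exotic_hours:.0f} h")
print(f"exotic / baseline time = {exotic_hours / baseline_hours:.2f}x")
```

With these invented figures the exotic design spends 20% fewer FLOPs and still takes about 1.76 times as long on the clock, because the time ratio is the FLOPs ratio (0.8) divided by the MFU ratio (0.25 / 0.55).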

Counting FLOPs does have one really important virtue: it is independent of the implementation. If you have a new idea, you don't have to wait for the kernels to be optimised to know if it's good. Established architectures have had years of development and optimisation making their kernels incredibly fast, which makes it hard for a new idea to compete on wall-clock time until its own kernels catch up. If you look at FLOPs, you can see the potential of the idea without worrying about the implementation details.

But as soon as we actually start training the model, whether we're renting time from a cloud or lucky enough to have on-prem hardware, we are always fighting to minimise the wall-clock time. Wasted time is either money out of our pocket, other people waiting to share the cluster, or just fewer of our own experiments getting run.

These two numbers rarely line up, which is the delightful tension MAI's efficiency gain (EG) lets us reason about. The rough outline is that we fit a curve to a ladder of baseline runs so we know what loss the baseline reaches for any compute budget; then for a candidate that reaches some loss, we ask how much compute the baseline would have needed to reach that same loss, and divide by what the candidate actually spent. An EG above 1 means the candidate got there for less, so it wins; below 1 means it lost. Put FLOPs in for "compute" and we get $\mathrm{EG}_{\text{FLOPs}}$; put wall-clock time in and we get $\mathrm{EG}_{\text{Time}}$.

Table 2 of the report compares MAI-Base-1's interleaved layout, high-sparsity 8/512 MoE layers alternating with dense FFN layers, against the more conventional choice of a medium-sparsity MoE in every layer. There are two every-layer candidates, measured on the L12–L30 rungs of the ladder, with EG aggregated across the eval suite using their code-heavy weighting (Eq. 3 in the report):

| Candidate (vs interleaved baseline) | $\mathrm{EG}_{\text{FLOPs}}$ ↑ | $\mathrm{EG}_{\text{Time}}$ ↑ |
|---|---|---|
| MoE every layer (8/384) | 0.94 | 0.73 |
| MoE every layer (7+1 shared/384) | 1.03 | 0.82 |

So if we look at FLOPs, the 7+1 shared variant looks strong: $\mathrm{EG}_{\text{FLOPs}} = 1.03$ means it beats the baseline on the FLOPs axis. But on the time axis both candidates look bad, and while the 7+1 shared variant ($\mathrm{EG}_{\text{Time}} = 0.82$) is better than the plain 8/384 variant (0.73), it's still a loss compared to the baseline. A 3% win on FLOPs is not going to make up for an 18% loss on time, so the overall verdict is that the interleaved layout is still the better choice, even though the FLOPs say otherwise.
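One way to read those numbers, as a quick sketch: since EG is the baseline's cost divided by the candidate's cost at equal loss, its reciprocal is how much the candidate spent relative to the baseline. The helper name `relative_cost` is ours, not the report's; the four EG values are the ones in the table above.

```python
# EG values copied from the table above (Table 2 of the report, as quoted).
table = {
    "MoE every layer (8/384)":          {"flops": 0.94, "time": 0.73},
    "MoE every layer (7+1 shared/384)": {"flops": 1.03, "time": 0.82},
}


def relative_cost(eg: float) -> float:
    """Candidate cost as a multiple of the baseline cost at equal loss."""
    return 1.0 / eg


for name, eg in table.items():
    for axis, value in eg.items():
        extra = (relative_cost(value) - 1.0) * 100.0   # percent more (+) or less (-)
        print(f"{name:34s} {axis:5s} EG={value:.2f} -> {extra:+.0f}% cost vs baseline")
```

On this reading the 7+1 shared variant needs about 3% fewer FLOPs but about 22% more wall-clock time than the interleaved baseline to reach the same loss.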

The key thing here is how easy it would have been to look at the FLOPs and say "oh, the 7+1 shared variant is better, let's go with that", without realising how much time it costs until that's become an expensive mistake.

Some of this is obviously common sense, but putting a defensible number on it turns it into a genuine metric, and it lets us see the trade-off properly.

It helps to see this on a plot. The candidate flips sides between the two panels, sitting just to the left of the baseline curve on FLOPs (so $\mathrm{EG}_{\text{FLOPs}}$ is just above 1) and well to the right of it on time (so $\mathrm{EG}_{\text{Time}}$ is well below 1), which is the same disagreement the table reports.

We can make all of this precise. MAI fit the baseline ladder with a power law,

$$L(C) = E + A\,C^{-\alpha},$$

the irreducible loss $E$ plus a reducible term that decays with the training cost $C$, where $A$ sets the size of the reducible part and $\alpha$ controls the rate it falls at. Fitting the curve gives us a continuous function we can query at any cost (with a decent approximation of the error as well for those statistically minded of you, but we'll ignore that for now) to say: for the same cost, what would the baseline loss have been?

The actual metric needs this the other way round, compute-from-loss, so let's invert it:

$$C(L) = \left(\frac{L - E}{A}\right)^{-1/\alpha}.$$

Then a candidate at loss $L_{\text{cand}}$ and cost $C_{\text{cand}}$ would have cost the baseline $C(L_{\text{cand}})$ to match, and the efficiency gain is the ratio

$$\mathrm{EG} = \frac{C(L_{\text{cand}})}{C_{\text{cand}}}.$$

This means we can read EG straight off the plot as the horizontal gap between the candidate and the curve at the candidate's loss. If the candidate is to the left of the curve, it means it got there for less, so EG > 1 and it wins; if it's to the right, it means it needed more, so EG < 1 and it loses.
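If the cost axis of the plot is logarithmic (an assumption here, though it is the usual choice for scaling plots), that horizontal gap is literally the log of EG. Writing $C(L_{\text{cand}})$ for the baseline cost read off the curve at the candidate's loss $L_{\text{cand}}$, and $C_{\text{cand}}$ for what the candidate actually spent:

$$\log \mathrm{EG} = \log \frac{C(L_{\text{cand}})}{C_{\text{cand}}} = \log C(L_{\text{cand}}) - \log C_{\text{cand}}$$

So on a log axis a candidate sitting a fixed distance to the left of the curve has the same EG wherever it lands on the ladder, and the sign of the gap gives the verdict directly: positive means EG > 1, negative means EG < 1.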

We can reproduce this in a few lines. The report doesn't release the raw ladder points, so we've had to make some up based on the plots, but it shows the key point.

``` python
import numpy as np
from scipy.optimize import curve_fit

# Baseline = the interleaved layout. A scaling ladder: training cost C and the
# aggregated eval loss L it reaches, on two cost axes for the same runs.
C_flops = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # units of 1e20 FLOPs
C_time  = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # units of GPU-days
L_flops = np.array([3.10, 2.92, 2.78, 2.66, 2.57])  # baseline loss vs FLOPs
L_time  = np.array([3.02, 2.86, 2.73, 2.62, 2.54])  # baseline loss vs time

def law(C, A, alpha, E):
    return A * C ** (-alpha) + E

def fit(C, L):
    (A, alpha, E), _ = curve_fit(law, C, L, p0=[1.0, 0.2, 2.0], maxfev=100000)
    return A, alpha, E

def inv(L, A, alpha, E):          # C(L): baseline cost needed to reach loss L
    return ((L - E) / A) ** (-1.0 / alpha)

A_f, a_f, E_f = fit(C_flops, L_flops)
A_t, a_t, E_t = fit(C_time,  L_time)

# Candidate = MoE-every-layer (7+1 shared): one loss, two costs.
L_cand       = 2.70   # the loss it reached
C_cand_flops = 6.09   # FLOPs spent  -> slightly cheap in FLOPs
C_cand_time  = 5.78   # time spent   -> slow on the cluster

EG_flops = inv(L_cand, A_f, a_f, E_f) / C_cand_flops
EG_time  = inv(L_cand, A_t, a_t, E_t) / C_cand_time

print(f"EG_FLOPs = {EG_flops:.2f}  (>1: candidate wins on FLOPs)")
print(f"EG_Time  = {EG_time:.2f}  (<1: candidate loses on the clock)")
```

which prints

```
EG_FLOPs = 1.03  (>1: candidate wins on FLOPs)
EG_Time  = 0.82  (<1: candidate loses on the clock)
```

Nothing about efficiency gain is specific to this model. This is exactly the choice we all face every time we make any new change to our architectures and training recipes. For any new idea, we can compare it against a baseline ladder and work out exactly where this sits against the curve for FLOPs and wall-clock time.

It lets us ask whether a new idea that clearly wins on FLOPs, but hasn't yet had its custom kernels optimised, is actually a good idea or not. We can see whether the wall-clock time is close enough that our engineering effort will pay off, or whether the very clever idea that works on paper just doesn't quite get there on our hardware. And if we have a new idea that wins on time as well, we can be confident that it's a real improvement and not just a quirk of the FLOPs counting.
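As a rough sketch of that judgement call: if a kernel rewrite sped the candidate up by some factor without changing the loss it reaches, its wall-clock cost would divide by that factor and its time-axis EG would multiply by it. Both function names and the assumption that the loss is unaffected are ours; the 0.82 is the time-axis figure for the 7+1 shared variant from the table earlier, and the speed-up factors are hypothetical.

```python
def projected_eg_time(eg_time: float, speedup: float) -> float:
    """Time-axis EG after a kernel speed-up, assuming the loss is unchanged."""
    return eg_time * speedup


def break_even_speedup(eg_time: float) -> float:
    """Throughput multiple needed to bring the time-axis EG up to 1."""
    return 1.0 / eg_time


eg_time = 0.82                      # 7+1 shared variant, from the table
needed = break_even_speedup(eg_time)
print(f"break-even speed-up: {needed:.2f}x")

for speedup in (1.1, 1.2, 1.3):     # hypothetical kernel improvements
    projected = projected_eg_time(eg_time, speedup)
    verdict = "wins" if projected > 1.0 else "still loses"
    print(f"{speedup:.1f}x faster -> EG_time = {projected:.2f} ({verdict})")
```

If the speed-up you can realistically expect from kernel work is below the break-even figure, the idea stays a paper win on this hardware; if it is comfortably above, the engineering effort looks worth it.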