Mano-CUA 2.0: After a Year of Building a 4B GUI Agent, We Found the Bottleneck Was Never Model Size

Mano-P is an open-source project we've been working on. It runs a 4B-parameter vision-language model on a MacBook and controls the computer by watching screenshots. Clicking, typing, hotkeys: the model looks at each screenshot and decides what to do next. Everything stays on-device, with no cloud API calls. We shipped 1.0 a while back and recently iterated to 2.0, then re-ran our benchmark of 100 real macOS GUI tasks. Some of the results surprised us.
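To make the screenshot-in, action-out loop concrete, here is a minimal sketch. It is not Mano-P's actual code: `Action`, `run_task`, and the callback names are hypothetical, and the model call is replaced by a scripted stand-in so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    kind: str                      # "click", "type", "hotkey", or "done"
    args: dict = field(default_factory=dict)


def run_task(goal: str,
             take_screenshot: Callable[[], bytes],
             decide: Callable[[str, bytes, List[Action]], Action],
             execute: Callable[[Action], None],
             max_steps: int = 15) -> List[Action]:
    """Observe, decide, act; repeat until the model says done or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                # fresh image every step
        action = decide(goal, screenshot, history)    # the VLM call would go here
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                               # click / type / hotkey
    return history


# Stand-ins so the sketch runs without a model or a real screen.
scripted = iter([
    Action("click", {"x": 120, "y": 48}),
    Action("type", {"text": "hello"}),
    Action("done"),
])
executed: List[Action] = []
history = run_task(
    goal="send hello",
    take_screenshot=lambda: b"fake-png-bytes",
    decide=lambda goal, shot, hist: next(scripted),
    execute=executed.append,
)
print([a.kind for a in history])   # ['click', 'type', 'done']
```

The `max_steps` cap is where step-limit truncation comes from: a task that needs more steps than the cap simply ends without a "done" action.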

We settled on 4B mostly because of the 16GB MacBook memory constraint. We tried general-purpose VL models too — on our benchmark they scored around 39%, with issues around Chinese input focus, non-browser app support, and step-limit truncation on longer tasks. General VL capability and GUI agent capability turned out to be pretty different things. Internally we also debated MoE, but on-device inference is memory-bound more than compute-bound, and MoE keeps all expert weights resident in memory. That's a bad trade on a Mac. We stayed on MLX — it maps directly to Apple Silicon's unified memory, which is about as good as it gets for local inference on these chips.
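A rough back-of-the-envelope version of the memory argument, as a sketch. It counts weights only and ignores the KV cache, activations, and the OS. The 16B-total MoE is a hypothetical configuration chosen for illustration, not a model we evaluated.

```python
GIB = 1024 ** 3


def weight_gib(total_params: float, bits_per_weight: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return total_params * bits_per_weight / 8 / GIB


dense_4b_w8 = weight_gib(4e9, 8)      # 4B dense model, 8-bit weights
dense_4b_fp16 = weight_gib(4e9, 16)   # same model, 16-bit weights

# Hypothetical MoE: 16B total parameters, only a fraction active per token.
# Every expert still has to be resident, so memory follows the total count.
moe_16b_w8 = weight_gib(16e9, 8)

for name, gib in [("dense 4B W8", dense_4b_w8),
                  ("dense 4B FP16", dense_4b_fp16),
                  ("MoE 16B-total W8", moe_16b_w8)]:
    print(f"{name}: {gib:.1f} GiB of weights")
```

On a 16GB machine that also has to run the user's own applications, the MoE row is the one that does not fit, even though its per-token compute could be similar to the dense 4B model.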

That's the background. The interesting part is what happened in 2.0.

We'd been assuming 4B models just couldn't handle GUI agents well — too small. After finishing 2.0, we realized that wasn't the story. The first real bottleneck was Chinese GUI training data.

Version 1.0 was pretty bad at Chinese UI elements. The model kept making character-level recognition errors, confusing one Chinese character for a visually similar one. These mistakes sound minor, but for a GUI agent they're fatal. If you can't find the button, nothing downstream matters.

English interfaces didn't have this problem. "Settings" and "Settinas" look very different. Chinese is different: characters have higher visual similarity to each other, and buttons like "确定" or "取消" pack dense strokes into a small area, making them harder for the model to distinguish. At the time we thought this was a 4B capacity issue: Chinese characters are just harder than English letters, and small models can't handle them.

After adding more Chinese GUI training data in 2.0, it turned out that wasn't the case. The fix wasn't fancy — just increasing the volume of Chinese interface screenshots in the training set. The results showed up immediately in the enterprise IM category: 33% on 1.0, 83% on 2.0. WeChat went from 33% to 67%.

That's fifty percentage points on enterprise IM. Looking back, 1.0's data coverage was simply insufficient. This was solvable with more data: no architecture changes, no larger model needed. We'd spent a lot of time debating model selection when the bottleneck was in the data. Kind of embarrassing, honestly.

On browser and web tasks, the score dropped from 74% in 1.0 to 68% in 2.0. When this came up internally there was some debate about whether we'd overdone the Chinese data.

Looking at the breakdown, the decline was concentrated in English web element recognition. Chinese web tasks barely changed. After rebalancing the training mix toward Chinese, the model's sensitivity to English UI elements dropped a bit.

The sample sizes aren't huge. With 31 tasks in this category, six points is roughly two tasks, so we don't think this represents actual capability regression. But it exposed a problem we haven't solved: balancing Chinese and English GUI data. The 4B model has limited capacity, and if you cram too much in, things interfere with each other. Right now we're doing volume-based control where more Chinese means less English. Ideally both would be sufficient, but at this model size that's not achievable.
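To put a number on "the sample sizes aren't huge", here is a sketch that computes 95% Wilson score intervals for the browser category. The pass counts (23 and 21 out of 31) are an assumption: they are reconstructed by rounding 74% and 68% of 31 tasks, not taken from the raw logs.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 for roughly 95%)."""
    p = passed / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


v1_low, v1_high = wilson_interval(23, 31)   # assumed count behind 74%
v2_low, v2_high = wilson_interval(21, 31)   # assumed count behind 68%

print(f"1.0: {23 / 31:.0%}, interval {v1_low:.0%} to {v1_high:.0%}")
print(f"2.0: {21 / 31:.0%}, interval {v2_low:.0%} to {v2_high:.0%}")

# Each version's observed rate sits inside the other version's interval.
overlap = v2_low < 23 / 31 < v2_high and v1_low < 21 / 31 < v1_high
print("intervals cover each other's point estimate:", overlap)
```

With only 31 tasks the intervals are wide and overlap almost entirely, which is consistent with treating the six-point drop as noise until the benchmark is larger.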

Short term we're planning finer-grained sampling strategies, weighting by task type and UI complexity rather than just language. Whether that works, we're not sure. This might need a larger model to have enough room.
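One way such a sampling strategy could look, as a sketch only. The metadata fields (`task`, `complexity`), the weight table, and the weighting formula are all hypothetical; we have not settled on a scheme, and real values would need tuning against the benchmark.

```python
import random

# Hypothetical priorities per task type; not measured values.
TASK_WEIGHT = {"im": 1.0, "browser": 1.5, "office": 2.0}


def sample_weight(sample: dict) -> float:
    """Weight grows with task-type priority and with UI complexity (1 = simple)."""
    return TASK_WEIGHT[sample["task"]] * (1.0 + 0.5 * (sample["complexity"] - 1))


def draw_batch(samples: list, k: int, seed: int = 0) -> list:
    """Draw k training samples with replacement, proportional to sample_weight."""
    rng = random.Random(seed)
    weights = [sample_weight(s) for s in samples]
    return rng.choices(samples, weights=weights, k=k)


pool = [
    {"id": "zh-im-01", "lang": "zh", "task": "im", "complexity": 1},
    {"id": "en-web-01", "lang": "en", "task": "browser", "complexity": 2},
    {"id": "zh-office-01", "lang": "zh", "task": "office", "complexity": 3},
]
batch = draw_batch(pool, k=8)
print([s["id"] for s in batch])
```

Note that language does not appear in the weight at all here. The idea is that an English web screenshot is sampled more or less often because of what kind of task it is, not because of how much Chinese data sits next to it.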

Both 1.0 and 2.0 scored 30% on long-horizon tasks — those requiring 10+ steps. No improvement. Cross-app tasks went from 0% to 20%, passing 1 out of 5.

From what we've tested, this looks more like a 4B capacity constraint. Tasks with 10+ steps require maintaining context throughout, with each step's results feeding correctly into the next decision. The 4B model's working memory is limited — by step 10, it's already fuzzy on what happened in the first few steps. We haven't found a way to push past this with data alone. Cross-app is even harder. Find a file someone sent in WeChat, switch to Finder, save it to the desktop. One window switch and the 4B model tends to lose the previous context. We expect a larger model would help here, but haven't verified it.

Internally we're weighing two paths. One is adding more reasoning-chain training data to the 4B and seeing how far that goes. The other is building a larger model — 7B or 8B — that would run on an M5 Pro with 64GB. Haven't decided. Honestly still not sure which path is worth the investment.

Everything above is about model capability. Cider, the W8A8 quantization work described below, solves a different problem: users can't wait.

GUI agents have an unusual performance characteristic. After every action, you take a fresh screenshot and feed the whole image's tokens through the model. That prefill stage directly determines how long the user waits. On an M5 Pro, MLX's native W8A16 mode gives us 2.839 seconds for prefill.

2.8 seconds doesn't sound like much. But over a 10-step task, waiting nearly 3 seconds at each step adds up to almost half a minute. Users feel that as sluggish. Decode speed isn't the issue: 80 tokens/s is plenty for generating action commands.
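The arithmetic behind that, as a small sketch. The prefill time and decode speed are the figures quoted above. The 20 tokens per action command is an assumed value for illustration; we are not reporting a measured action length here.

```python
def task_latency_s(steps: int, prefill_s: float,
                   action_tokens: int, decode_tok_per_s: float) -> float:
    """Total model time for a task: every step pays prefill plus decode."""
    per_step = prefill_s + action_tokens / decode_tok_per_s
    return steps * per_step


PREFILL_S = 2.839        # W8A16 prefill per screenshot
DECODE_TOK_PER_S = 80.0  # decode speed
ACTION_TOKENS = 20       # assumption: length of one generated action command

total = task_latency_s(10, PREFILL_S, ACTION_TOKENS, DECODE_TOK_PER_S)
prefill_share = 10 * PREFILL_S / total

print(f"10-step task: {total:.2f} s of model time")
print(f"share spent in prefill: {prefill_share:.0%}")
```

Under these assumptions prefill accounts for the large majority of the wait, which is why the optimization effort went there and not into decode.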

The problem was that MLX doesn't ship with online activation quantization operators. Weights are static and can be quantized offline. Activations are dynamic: every screenshot produces different intermediate values going through the network, so they have to be quantized at inference time. MLX doesn't provide operators for that, so we wrote our own.

The trickiest design decision was quantization granularity. Too coarse and outlier activation values drag down overall accuracy. Too fine and compute overhead goes up. We landed on per-token granularity — each token's activation vector gets its own quantization parameters computed independently. More overhead than per-tensor, but accuracy loss stays manageable.
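The granularity trade-off can be illustrated with a NumPy sketch. This is not Cider's kernel and it does not use MLX. It assumes symmetric int8 quantization with absmax scaling, and the function names are made up for the example. It shows one outlier token ruining a per-tensor scale, then a W8A8 linear layer with per-token activation scales.

```python
import numpy as np


def quantize_per_tensor(x: np.ndarray):
    """One scale for the whole tensor: cheap, but a single outlier sets it."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_rows(x: np.ndarray):
    """One scale per row: per token for activations, per channel for weights."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def w8a8_linear(x: np.ndarray, q_w: np.ndarray, s_w: np.ndarray) -> np.ndarray:
    """Quantize activations online, multiply in integers, rescale to float."""
    q_x, s_x = quantize_rows(x)                          # (tokens, 1) scales
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32).T  # int32 accumulation
    return acc * s_x * s_w.T                             # (tokens, out)


rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 64)).astype(np.float32)   # 8 tokens, hidden size 64
acts[0, 0] = 200.0                                   # one outlier in token 0

q_t, s_t = quantize_per_tensor(acts)
q_k, s_k = quantize_rows(acts)

# Reconstruction error on the seven tokens that have no outlier.
err_tensor = float(np.abs(q_t * s_t - acts)[1:].mean())
err_token = float(np.abs(q_k * s_k - acts)[1:].mean())
print(f"per-tensor error: {err_tensor:.4f}, per-token error: {err_token:.4f}")

# Weights are quantized once, offline; activations on every call.
w = rng.normal(size=(32, 64)).astype(np.float32)
q_w, s_w = quantize_rows(w)
y_ref = acts @ w.T
y_q = w8a8_linear(acts, q_w, s_w)
rel_err = float(np.linalg.norm(y_q[1:] - y_ref[1:]) / np.linalg.norm(y_ref[1:]))
print(f"relative output error on clean tokens: {rel_err:.4f}")
```

With a per-tensor scale, the outlier in token 0 stretches the quantization step for every other token. With per-token scales the damage stays inside the one row that contains it, at the cost of computing a max and a scale for each token on every forward pass.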

Results: with W8A8, prefill dropped from 2.839s to 2.519s. That is about 11.3% less time per step, or a 1.127x (12.7%) speedup. Peak memory also went down, which matters for GUI agents that need to coexist in memory with the user's other applications, though we haven't measured the exact savings yet.

M5 chips have hardware acceleration so the improvement is significant. M4 and below fall back to pure Python, which gives limited speedup. We evaluated optimizing specifically for M4 and decided the return wasn't worth the effort. Cider isn't Mano-P-specific — any MLX model can use it.

Fuzzy descriptions and cross-app tasks are the two clearest weak points, both pointing at the 4B model's reasoning capacity. We'll likely build a larger version. Cider continues with stability and compatibility work.

Full category breakdown below. Test hardware: MacBook Pro M5 16GB. Cloud model: Claude Sonnet 4.5. Local model: Mano-CUA-4B W8A16.

Category	Tasks	1.0	2.0	Cloud
Browser/Web	31	74%	68%	90%
Enterprise IM	6	33%	83%	100%
WeChat	6	33%	67%	83%
WPS/Office	5	0%	40%	100%
System Settings	6	50%	83%	50%
Notes/Reminders	4	50%	50%	100%
File Management	7	43%	43%	71%
System Utilities	3	100%	100%	100%
Long-horizon	10	30%	30%	80%
Cross-app	5	0%	20%	60%
Fuzzy descriptions	10	30%	30%	80%
No open hint	5	20%	60%	80%

Originally published on dev.to.