A committee of five open-weights models you can download, fused into a single answer, beats Anthropic's mid-tier closed model Fable 5 on deep research. We ran an all-open panel (MiniMax M3, Kimi K2.6, DeepSeek V4 Pro, Gemma-4, and GLM-5.2), each driving its own agentic research loop, then fused the five reports with MiniMax M3 as synthesizer, the best fuser we tested. On DRACO, judged by the same gemini-3.1-pro grader we use everywhere, that stack scores 69.9. Fable 5 running solo scores 65.3. The open committee wins by 4.6 points, and every weight in it sits on disks you control.
The reason this works is that fusion rewards diverse error, and five independently-trained open models disagree in useful ways. Each panelist searches the web, reads its own sources, and writes its own report; where one hallucinates a date or misses a primary source, the others usually don't, and the synthesizer keeps what survives cross-examination. A single model, even a good closed one like Fable 5, has one failure mode and repeats it through the whole answer. The committee has five, mostly uncorrelated, and the judge catches the difference. We have argued before that the strongest open models never show up on the leaderboards that rank them solo, and this is the mechanism: their value shows up in the ensemble, and a single-shot score never measures it.
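The panel-then-fuse flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TrustedRouter Fusion implementation: `run_panel`, `fuse`, and `majority_synthesizer` are hypothetical names, the stub panelists stand in for real agentic research loops, and the toy majority vote stands in for an LLM synthesizer. The claims in the stubs are made-up placeholder data.

```python
from typing import Callable, Dict, List

# Hypothetical types: a panelist maps a question to a report; a
# synthesizer maps the question plus all reports to one fused answer.
Panelist = Callable[[str], str]
Synthesizer = Callable[[str, Dict[str, str]], str]


def run_panel(question: str, panel: Dict[str, Panelist]) -> Dict[str, str]:
    """Run every panelist independently and collect one report per model."""
    return {name: research(question) for name, research in panel.items()}


def fuse(question: str, panel: Dict[str, Panelist], synthesizer: Synthesizer) -> str:
    """Gather the panel's reports, then hand all of them to the synthesizer."""
    reports = run_panel(question, panel)
    return synthesizer(question, reports)


def make_stub(claims: List[str]) -> Panelist:
    """Stub panelist (assumption): returns fixed claims instead of researching."""
    return lambda question: "; ".join(claims)


def majority_synthesizer(question: str, reports: Dict[str, str]) -> str:
    """Toy stand-in for an LLM synthesizer: keep claims most reports agree on."""
    counts: Dict[str, int] = {}
    for report in reports.values():
        for claim in report.split("; "):
            counts[claim] = counts.get(claim, 0) + 1
    kept = [claim for claim, n in counts.items() if n > len(reports) / 2]
    return "; ".join(kept)


# Placeholder data: two panelists agree on the location, one disagrees.
panel = {
    "model_a": make_stub(["founded 1998", "HQ in Oslo"]),
    "model_b": make_stub(["founded 1998", "HQ in Bergen"]),
    "model_c": make_stub(["founded 1998", "HQ in Oslo"]),
}

answer = fuse("Where is the company based?", panel, majority_synthesizer)
print(answer)  # founded 1998; HQ in Oslo
```

The one-off error ("HQ in Bergen") drops out because only one panelist makes it. That is the uncorrelated-error argument in miniature; a real synthesizer weighs sources rather than counting votes.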
| Panel | Synthesizer | DRACO |
|---|---|---|
| Frontier-mixed (incl. GPT-5.5, Opus) | MiniMax M3 | 71.6 |
| All-open (M3 / K2.6 / V4 Pro / Gemma-4 / GLM-5.2) | MiniMax M3 | 69.9 |
| Fable 5 + GPT-5.5 | GPT-5.5 | 69.0 |
| Fable 5 solo | — | 65.3 |
The obvious objection: does this beat the closed frontier? No. Our frontier-mixed panel, with GPT-5.5 and Opus sitting on the committee, scores 71.6. Letting real frontier models into the panel buys 1.7 points over the open-only version, and that gap is consistent run to run. The open committee also edges past the best closed-ish fusion we have published, a Fable 5 plus GPT-5.5 panel at 69.0. So we make the narrower, fully defensible claim: a committee that touches no proprietary API anywhere clears a single mid-tier closed model by a real margin.
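The margins quoted here follow directly from the table. As a quick check, the arithmetic can be reproduced with exact decimals; the dictionary keys are our own shorthand labels for the table rows, not identifiers from the DRACO harness.

```python
from decimal import Decimal

# DRACO scores copied from the table above; keys are shorthand labels.
scores = {
    "frontier_mixed": Decimal("71.6"),
    "all_open": Decimal("69.9"),
    "fable5_gpt55": Decimal("69.0"),
    "fable5_solo": Decimal("65.3"),
}


def margin(a: str, b: str) -> Decimal:
    """Score of configuration a minus score of configuration b."""
    return scores[a] - scores[b]


print(margin("all_open", "fable5_solo"))     # 4.6
print(margin("frontier_mixed", "all_open"))  # 1.7
print(margin("all_open", "fable5_gpt55"))    # 0.9
```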
The synthesizer is the seat that decides whether the stack stays open, so we put MiniMax M3 in it: the top fuser we tested, open weights, and it never blanks. The runner-up, GLM-5.2, ties it on score but censors — on politically loaded tasks it refuses or goes blank, Taiwan being the cleanest example, and a synthesizer that won't write the answer is worthless on exactly the questions where fusion matters most. TrustedRouter Fusion covers that case directly: when the synthesizer returns nothing, it falls through to a backup model — here Gemma-4, also open weights — and the run keeps going. With M3 in the seat we never trip the fallback; swap in GLM and you would, and the answer still comes out, with no task reaching for a closed model to get unblocked.
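The fall-through behavior can be pictured as an ordered chain of synthesizers where the first non-blank answer wins. The sketch below is an assumption about the shape of that logic, not the actual TrustedRouter Fusion code: `synthesize_with_fallback`, `blanking_synth`, and `backup_synth` are hypothetical names, and the stubs return fixed strings instead of calling models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a synthesizer maps a question plus reports to an answer.
Synthesizer = Callable[[str, Dict[str, str]], str]


def synthesize_with_fallback(
    question: str,
    reports: Dict[str, str],
    chain: List[Tuple[str, Synthesizer]],
) -> Tuple[str, str]:
    """Try each synthesizer in order; return (name, answer) for the first
    one whose answer is not empty or whitespace-only."""
    for name, synthesize in chain:
        answer = synthesize(question, reports)
        if answer and answer.strip():
            return name, answer
    raise RuntimeError("every synthesizer in the chain returned a blank answer")


def blanking_synth(question: str, reports: Dict[str, str]) -> str:
    """Stub for a synthesizer that refuses and returns nothing."""
    return ""


def backup_synth(question: str, reports: Dict[str, str]) -> str:
    """Stub backup: concatenates the reports in a stable order."""
    return " | ".join(reports[name] for name in sorted(reports))


reports = {"model_a": "report A", "model_b": "report B"}
chain = [("primary", blanking_synth), ("backup", backup_synth)]

used, answer = synthesize_with_fallback("any question", reports, chain)
print(used, "->", answer)  # backup -> report A | report B
```

Keeping every entry in the chain open-weights is what preserves the property claimed above: a blank from the primary never forces a call to a closed model.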
What makes the result matter is what it removes. This configuration has no API key, no per-token bill, no provider that can deprecate the model under you or refuse a prompt class on its own policy. The panelists are open weights, the synthesizer is open weights, the fallback is open weights, and the whole loop runs on hardware you own. For deep research with its long horizon, many sources, and real cost per query, that is the difference between renting a capability and holding one. The pieces, the exact panel, and the DRACO harness are all public: the models we route are listed on our models page, and the fusion code is in TrustedRouter-Fusion-Draco. Five models you can download, judged by the same grader as everything else, land above a closed model people pay for.