Source: trustedrouter.com

New SOTA: TrustedRouter Fusion Beats Fable and Frontier

TrustedRouter achieved a state-of-the-art score of 70.6 on the DRACO benchmark by fusing five models, including the open-weights models DeepSeek V4 Pro and Kimi K2.6, surpassing OpenRouter's best published fusion of Fable 5 and GPT-5.5 at 69.0. The result is fully reproducible with open-source code and data, reinforcing TrustedRouter's commitment to verifiable AI research.

Published Jun 18, 2026 · 4 min read

Research is only worth as much as someone else's ability to run it again. Too much of AI has drifted the other way: the strongest results arrive as a single number in a post, produced by a model you cannot open, on a harness no one else can see, graded by a rubric that ships to nobody. You are asked to take it on faith. We are building TrustedRouter to be an AI lab that does open science the old way: open code, open results, nothing hidden. Our whole stack is radically open source (frontend and backend alike, Apache-2.0 licensed), and so is everything behind this benchmark. That is how a benchmark number earns trust: verifiability, not hype.

So we held ourselves to it. We set out to reproduce OpenRouter's Fusion result — that a panel of models, each writing its own answer with a final model synthesizing them, beats any single model on a hard research benchmark — and then to push past it. On DRACO, a hundred deep-research tasks graded against roughly forty weighted criteria each by gemini-3.1-pro, a diverse panel synthesized by Claude Opus 4.8 scores 70.6. That is the state of the art, above OpenRouter's best published fusion of Fable 5 and GPT-5.5 at 69.0. Every prompt, every tool call, and every graded answer behind the number is published.
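To make the grading step concrete, here is a minimal sketch of weighted-rubric scoring. The post does not spell out how DRACO aggregates its criteria, so the formula below (weight of satisfied criteria divided by total weight, scaled to 100, then averaged over tasks) is an assumption, and the names `Criterion`, `rubric_score`, and `benchmark_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: float
    met: bool  # the judge's verdict for this criterion


def rubric_score(criteria: list[Criterion]) -> float:
    """Share of total rubric weight earned, on a 0-100 scale (assumed formula)."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return 100.0 * earned / total


def benchmark_score(per_task: list[float]) -> float:
    """Average the per-task scores into a single benchmark number."""
    if not per_task:
        raise ValueError("need at least one task score")
    return sum(per_task) / len(per_task)
```

Under this assumed scheme, a task whose satisfied criteria carry three quarters of the rubric weight scores 75, and the headline number is the mean over the hundred tasks.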

The result comes from the panel, and the panel is itself an argument for open weights. OpenRouter's strongest fusions paired two closed frontier models. Ours adds frontier open-weights models — DeepSeek V4 Pro and Kimi K2.6 — alongside GPT-5.5, Opus, and Gemini 3 Flash. Fusion works on disagreement: models that fail in different places, reconciled by a strong synthesizer. Open-weights models are trained on different data and disagree in different ways than a closed pair does, and the wider panel is what reaches the top.
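The panel-plus-synthesizer shape described above can be sketched in a few lines. This is an illustration under stated assumptions, not the published harness: `Model` is a hypothetical stand-in for any callable that maps a prompt to text, the prompt wording is invented, and the agentic tool loop and the judge analysis step mentioned elsewhere in this post are left out.

```python
from typing import Callable

# Hypothetical type: a model is any function from a prompt to text.
Model = Callable[[str], str]


def fuse(task: str, panel: dict[str, Model], synthesizer: Model) -> str:
    """Collect one independent report per panel model, then synthesize them."""
    # Each panel model answers the task on its own, without seeing the others.
    reports = {name: model(task) for name, model in panel.items()}

    # Lay the reports side by side so the synthesizer can compare them.
    sections = [f"### Report from {name}\n{text}" for name, text in reports.items()]
    prompt = (
        f"Task:\n{task}\n\n"
        + "\n\n".join(sections)
        + "\n\nReconcile the reports above, resolve disagreements, "
        "and write one final answer."
    )
    return synthesizer(prompt)
```

Because the synthesizer is just another argument, swapping it while holding the panel fixed is a one-line change.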

The synthesizer carries most of that result. Hold the five-model panel fixed and change only the model that writes the final answer: Opus 4.8 scores 70.6, GPT-5.5 scores 62.2. Same reports, same judge analysis, same hundred tasks, 8.4 points of swing from one decision. A larger panel behind a weaker synthesizer buys nothing: at 62.2, the GPT-5.5-synthesized panel lands below GPT-5.5's own solo score of 63.0.

No single model comes near that on its own. Run each one through the same agentic loop with the same live tools, and the strongest of them lands 7.6 points below the panel.

Solo model         TrustedRouter   OpenRouter
GPT-5.5                     63.0         60.0
Claude Opus 4.8             60.7         58.8
DeepSeek V4 Pro             59.9         60.3
Kimi K2.6                   50.1         53.7
Gemini 3.1 Pro              47.4         45.4
Gemini 3 Flash              41.1         43.1

The strongest solo reaches 63; the panel reaches 70.6. Assembling a frontier answer out of models that are each behind the frontier is the entire point.
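The margins quoted in this post can be rechecked directly from the published numbers. The snippet below only does arithmetic on scores copied from this article (the solo values are the TrustedRouter column); the variable names are illustrative.

```python
# Scores copied from this post (solo runs: TrustedRouter column).
solo = {
    "GPT-5.5": 63.0,
    "Claude Opus 4.8": 60.7,
    "DeepSeek V4 Pro": 59.9,
    "Kimi K2.6": 50.1,
    "Gemini 3.1 Pro": 47.4,
    "Gemini 3 Flash": 41.1,
}
PANEL_WITH_OPUS = 70.6         # five-model panel, Opus 4.8 as synthesizer
PANEL_WITH_GPT = 62.2          # same panel, GPT-5.5 as synthesizer
OPENROUTER_BEST_FUSION = 69.0  # Fable 5 + GPT-5.5

best_solo = max(solo, key=solo.get)
gap_over_best_solo = round(PANEL_WITH_OPUS - solo[best_solo], 1)
synthesizer_swing = round(PANEL_WITH_OPUS - PANEL_WITH_GPT, 1)
margin_over_openrouter = round(PANEL_WITH_OPUS - OPENROUTER_BEST_FUSION, 1)

print(best_solo, gap_over_best_solo, synthesizer_swing, margin_over_openrouter)
```

This gives GPT-5.5 as the strongest solo model, a 7.6-point gap between it and the panel, an 8.4-point swing from the synthesizer choice, and a 1.6-point margin over OpenRouter's best published fusion.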

DRACO is an agentic benchmark. The answers are not in any model's weights, so each model in the panel has to search the web, read the sources, and run the numbers itself; we give every one of them live tools and let it drive its own research. Those runs issued thousands of searches and fetches, and all of them sit in the published replays — none touching the benchmark's own hosts, so nothing was looked up that was meant to be worked out. The leakage guard lives in the open-source harness, and the audit is yours to re-run.
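An audit of this kind can be approximated by checking every logged URL against a list of off-limits hosts. The sketch below is not the harness's actual leakage guard: the function names are hypothetical, the replay format is reduced to a plain list of URLs, and `benchmark.example` stands in for whatever hosts the real guard blocks.

```python
from urllib.parse import urlsplit


def touches_blocked_host(url: str, blocked_hosts: set[str]) -> bool:
    """True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


def audit(urls: list[str], blocked_hosts: set[str]) -> list[str]:
    """Return every logged URL that hits a blocked host (empty means clean)."""
    return [u for u in urls if touches_blocked_host(u, blocked_hosts)]


# Hypothetical example data; a real run would read URLs from the replays.
logged = [
    "https://en.wikipedia.org/wiki/Benchmark",
    "https://data.benchmark.example/tasks/17",
]
violations = audit(logged, {"benchmark.example"})
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as a blocked name appearing in a query string or in an unrelated domain like `notbenchmark.example`.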

We ran all of it on TrustedRouter for the same reason we published the code. A benchmark sends your prompts and the documents you fetch through someone else's servers, and with most gateways you take their privacy on faith. TrustedRouter runs inside a Trusted Execution Environment (TEE), end-to-end encrypted: a sealed enclave the operator cannot read into, handling every request as an attested workload whose exact code is measured and published. You can pull the image digest, match it against the open source, and confirm the binary that saw your prompt is the one in the repository, with nowhere inside it to record anything. You check the privacy the way you check the score — by hand, against a hash.
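The last step of that check, comparing a locally computed hash against a published one, looks roughly like the sketch below. It is a simplification under stated assumptions: the function names are hypothetical, the `sha256:<hex>` form is assumed for the published digest, and a full TEE verification also involves validating the attestation report itself, which is not shown here.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Digest in the 'sha256:<hex>' form (assumed format for this sketch)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def digest_matches(artifact: bytes, published_digest: str) -> bool:
    """Compare a locally computed digest with the published one."""
    expected = published_digest.strip().lower()
    # compare_digest runs in constant time with respect to the contents.
    return hmac.compare_digest(sha256_digest(artifact), expected)
```

A caller would pass the bytes they built or downloaded themselves as `artifact` and the digest reported by the service as `published_digest`; any mismatch means the running code is not the code in the repository.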

We do not want you to trust our 70.6. Clone the repository — the harness, the tasks, the judge, the panel, and the raw run traces are all in it — point it at TrustedRouter, and produce the number yourself. Open code, open results, a score you can reproduce and a privacy guarantee you can verify. That is what an AI lab doing open science looks like, and it is the only kind of result worth believing.
