{"slug": "new-sota-trustedrouter-fusion-beats-fable-and-frontier", "title": "New SOTA: TrustedRouter Fusion Beats Fable and Frontier", "summary": "TrustedRouter achieved a state-of-the-art score of 70.6 on the DRACO benchmark by fusing five models including open-weights models DeepSeek V4 Pro and Kimi K2.6, surpassing OpenRouter's best fusion of Fable and GPT-5.5 at 69.0. The result is fully reproducible with open-source code and data, reinforcing TrustedRouter's commitment to verifiable AI research.", "body_md": "# New SOTA: TrustedRouter Fusion beats Fable and Frontier\n\nResearch is only worth as much as someone else's ability to run it again. Too much of AI has drifted the other way: the strongest results arrive as a single number in a post, produced by a model you cannot open, on a harness no one else can see, graded by a rubric that ships to nobody. You are asked to take it on faith. We are building TrustedRouter to be an AI lab that does open science the old way: open code, open results, nothing hidden. Our whole stack is radically open source — frontend and backend alike, Apache-2.0 licensed — and so is everything behind this benchmark. That is how a benchmark number earns trust: verifiability, not hype.\n\nSo we held ourselves to it. We set out to reproduce OpenRouter's Fusion result — that a panel of models, each writing its own answer with a final model synthesizing them, beats any single model on a hard research benchmark — and then to push past it. On [DRACO](https://github.com/Lore-Hex/TrustedRouter-Fusion-Draco), a hundred deep-research tasks graded against roughly forty weighted criteria each by gemini-3.1-pro, a diverse panel synthesized by Claude Opus 4.8 scores **70.6**. That is the state of the art, above OpenRouter's best published fusion of Fable 5 and GPT-5.5 at 69.0. 
Every prompt, every tool call, and every graded answer behind the number is published.\n\nThe result comes from the panel, and the panel is itself [an argument for open weights](/blog/the-best-open-models-arent-on-your-leaderboard). OpenRouter's strongest fusions paired two closed frontier models. Ours adds frontier open-weights models — DeepSeek V4 Pro and Kimi K2.6 — alongside GPT-5.5, Opus, and Gemini 3 Flash. Fusion works on disagreement: models that fail in different places, reconciled by a strong synthesizer. Open-weights models are trained on different data and disagree in different ways than a closed pair does, and the wider panel is what reaches the top.\n\nThe synthesizer carries most of that result. Hold the five-model panel fixed and change only the model that writes the final answer: Opus 4.8 scores 70.6, GPT-5.5 scores 62.2. Same reports, same judge analysis, same hundred tasks, eight points of swing from one decision. A larger panel behind a weaker synthesizer buys nothing.\n\nNo single model comes near that on its own. Run each one through the same agentic loop with the same live tools, and the strongest of them lands seven points below the panel.\n\n| Solo model | TrustedRouter | OpenRouter |\n|---|---|---|\n| GPT-5.5 | 63.0 | 60.0 |\n| Claude Opus 4.8 | 60.7 | 58.8 |\n| DeepSeek V4 Pro | 59.9 | 60.3 |\n| Kimi K2.6 | 50.1 | 53.7 |\n| Gemini 3.1 Pro | 47.4 | 45.4 |\n| Gemini 3 Flash | 41.1 | 43.1 |\n\nThe strongest solo reaches 63; the panel reaches 70.6. Assembling a frontier answer out of models that are each behind the frontier is the entire point.\n\nDRACO is an agentic benchmark. The answers are not in any model's weights, so each model in the panel has to search the web, read the sources, and run the numbers itself; we give every one of them live tools and let it drive its own research. 
Those runs issued thousands of searches and fetches, and all of them sit in the published replays — none touching the benchmark's own hosts, so nothing was looked up that was meant to be worked out. The leakage guard lives in the open-source harness, and the audit is yours to re-run.\n\nWe ran all of it on TrustedRouter for the same reason we published the code. A benchmark sends your prompts and the documents you fetch through someone else's servers, and with most gateways you take their privacy on faith. TrustedRouter runs inside a Trusted Execution Environment (TEE), end-to-end encrypted: a sealed enclave the operator cannot read into, handling every request as an [attested](/blog/attestation-is-all-you-need) workload whose exact code is measured and published. You can pull the image digest, match it against the open source, and confirm the binary that saw your prompt is the one in the repository, with nowhere inside it to record anything. You check the privacy the way you check the score — by hand, against a hash.\n\nWe do not want you to trust our 70.6. Clone the [repository](https://github.com/Lore-Hex/TrustedRouter-Fusion-Draco) — the harness, the tasks, the judge, the panel, and the raw run traces are all in it — point it at TrustedRouter, and produce the number yourself. Open code, open results, a score you can reproduce and a privacy guarantee you can verify. 
That is what an AI lab doing open science looks like, and it is the only kind of result worth believing.", "url": "https://wpnews.pro/news/new-sota-trustedrouter-fusion-beats-fable-and-frontier", "canonical_source": "https://trustedrouter.com/blog/fusion-evals-open-source", "published_at": "2026-06-18 01:10:41+00:00", "updated_at": "2026-06-18 01:22:08.373745+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-research", "ai-products", "ai-infrastructure"], "entities": ["TrustedRouter", "OpenRouter", "Claude Opus 4.8", "GPT-5.5", "DeepSeek V4 Pro", "Kimi K2.6", "Gemini 3.1 Pro", "Gemini 3 Flash"], "alternates": {"html": "https://wpnews.pro/news/new-sota-trustedrouter-fusion-beats-fable-and-frontier", "markdown": "https://wpnews.pro/news/new-sota-trustedrouter-fusion-beats-fable-and-frontier.md", "text": "https://wpnews.pro/news/new-sota-trustedrouter-fusion-beats-fable-and-frontier.txt", "jsonld": "https://wpnews.pro/news/new-sota-trustedrouter-fusion-beats-fable-and-frontier.jsonld"}}