{"slug": "surpassing-frontier-performance-with-fusion", "title": "Surpassing Frontier Performance with Fusion", "summary": "OpenRouter launched Fusion, a tool that synthesizes outputs from multiple AI models into a single response, achieving scores of 69.0% on the DRACO deep research benchmark, surpassing individual frontier models like Fable 5 (65.3%). Panels of budget models also outperformed top-tier models, demonstrating that model diversity boosts performance at lower cost.", "body_md": "# Surpassing Frontier Performance with Fusion\n\nBrian Thomas\n\n## On this page\n\n- [Panels of Models Consistently Outperform on Deep Research](#panels-of-models-consistently-outperform-on-deep-research)\n- [One API call that fuses the best output of multiple models](#one-api-call-that-fuses-the-best-output-of-multiple-models)\n- [We chose DRACO to test reasoning, tool calling, and succinctness](#we-chose-draco-to-test-reasoning-tool-calling-and-succinctness)\n- [Preventing the Models from Cheating](#preventing-the-models-from-cheating)\n- [Significant boost from fusing a model with itself](#significant-boost-from-fusing-a-model-with-itself)\n- [Notes on our DRACO implementation](#notes-on-our-draco-implementation)\n- [Give Fusion a try](#give-fusion-a-try)\n\nWe’ve found that synthesizing the results of multiple models can significantly outperform what individual models are capable of. Introducing Fusion: a tool for getting these combined results just as easily as calling a single model. It allows you to choose a panel of participant models alongside a judge model responsible for fusing the individual results together.\n\nTo understand the benefits of Fusion, we used a deep research benchmark that tests the combination of reasoning, tool usage, and knowledge. 
We found that:\n\n- Panels consistently outperform individual models\n- Beyond-frontier performance can be achieved with frontier panels\n- Panels of budget models can surpass frontier models and get close to frontier panel performance\n\n[Try Fusion now](https://openrouter.ai/fusion) in a chatroom, or check out the [API docs](https://openrouter.ai/docs/guides/features/server-tools/fusion) to build it into your application.\n\n## Panels of Models Consistently Outperform on Deep Research\n\nWe tested Fusion on 100 deep research tasks from the [DRACO benchmark](https://arxiv.org/abs/2602.11685). Some highlights of what we found:\n\n- Fable 5 + GPT-5.5 fused together scored 69.0%\\*\\*, surpassing every individual model, including Fable 5 alone at 65.3%\\*\\*.\n- A budget panel (Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro) beat GPT-5.5 and Opus 4.8. It came within one percentage point of Fable 5’s score at 50% of the cost.\n\n| Type | Model(s) | Score |\n|---|---|---|\n| Fusion | Fable 5 + GPT-5.5\\*\\* (synthesized by Opus 4.8) | 69.0% |\n| Fusion | Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro (synthesized by Opus 4.8) | 68.3% |\n| Fusion | Opus 4.8 + GPT-5.5 (synthesized by Opus 4.8) | 67.6% |\n| Fusion | Opus 4.8 + Opus 4.8 (synthesized by Opus 4.8) | 65.5% |\n| Solo | Claude Fable 5\\*\\* | 65.3% |\n| Fusion | Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro (synthesized by Opus 4.8) | 64.7% |\n| Solo | DeepSeek V4 Pro | 60.3% |\n| Solo | GPT-5.5 | 60.0% |\n| Solo | Claude Opus 4.8 | 58.8% |\n| Solo | Kimi K2.6 | 53.7% |\n| Solo | Gemini 3.1 Pro | 45.4% |\n| Solo | Gemini 3 Flash | 43.1% |\n\n\\*\\* Seven of the 100 DRACO tasks were not completed because Fable 5’s content filters blocked them from executing. We chose not to fall back to Opus 4.8 for those tasks, so the Fable results reflect 93 scored tasks rather than the full 100. 
This gives the most accurate picture of Fable’s own performance, but it means direct score comparisons against models that completed all 100 tasks are slightly uneven.\n\nWe believe this demonstrates the benefits of model diversity, similar to the benefits seen in human teams. Bringing different perspectives to complex problems yields better results.\n\n## One API call that fuses the best output of multiple models\n\nWhen you send a prompt to Fusion, we dispatch it to a panel of models in parallel, each with web search and web fetch enabled. A judge model reads every panel response and produces a structured analysis: consensus points, contradictions, partial coverage, unique insights, and blind spots. The calling model then writes the final answer grounded in that analysis.\n\nThe whole pipeline runs server-side, so you can call it just as you would an individual model.\n\nCall Fusion directly with a single model slug:\n\n```\n{\n  \"model\": \"openrouter/fusion\",\n  \"messages\": [\n    { \"role\": \"user\", \"content\": \"What are the strongest arguments for and against carbon taxes?\" }\n  ]\n}\n```\n\nOr customize the panel:\n\n```\n{\n  \"model\": \"openrouter/fusion\",\n  \"messages\": [{ \"role\": \"user\", \"content\": \"...\" }],\n  \"plugins\": [{\n    \"id\": \"fusion\",\n    \"model\": \"google/gemini-3-flash-preview\",\n    \"analysis_models\": [\n      \"google/gemini-3-flash-preview\",\n      \"moonshotai/kimi-k2.6\",\n      \"deepseek/deepseek-v4-pro\"\n    ]\n  }]\n}\n```\n\n## We chose DRACO to test reasoning, tool calling, and succinctness\n\nWe needed a benchmark that could tell the difference between a model that sounds thorough and one that actually is. Standard benchmarks test factual recall or reasoning puzzles. 
They don’t test the thing Fusion is built for: researching a complex question, synthesizing multiple sources, and producing a comprehensive, well-cited analysis.\n\n[DRACO](https://arxiv.org/abs/2602.11685) (by [Perplexity AI](https://www.perplexity.ai/)) is designed for this. It contains 100 deep research tasks spanning 10 domains: academic research, finance, law, medicine, technology, UX design, general knowledge, needle-in-a-haystack retrieval, personalized assistance, and product comparison.\n\nEach task comes with a rubric of roughly 39 weighted criteria across four categories:\n\n- **Factual Accuracy** (~20 criteria): verifiable claims the response must get right\n- **Breadth & Depth** (~9 criteria): synthesis quality, trade-off analysis, actionable guidance\n- **Presentation Quality** (~6 criteria): terminology, formatting, readability\n- **Citation Quality** (~5 criteria): primary source citations with working references\n\nCriteria can carry negative weights. Meeting a negative criterion means the response contains an error. For example, dangerous medical advice carries a big penalty. These negative criteria also make it hard to game the score by being verbose: a model that confidently states wrong things gets punished.\n\nEach response is graded per-criterion by a judge model, three independent times. We report the mean normalized score (0–100) across all tasks.\n\nDRACO has limitations the authors acknowledge: it evaluates text-only, English-only interactions, and its static task set may not fully generalize to future deep research applications. Absolute scores also depend on judge model choice (the paper reports 10–25 point shifts between judges), though relative system rankings remain stable.\n\n## Preventing the Models from Cheating\n\nWhen we gave the panel models web search, we discovered something alarming: they were finding the DRACO grading rubric online. 
While this was a coincidental result of their search terms rather than intentional cheating, it still exposed a real contamination risk.\n\nWe solved this by excluding the locations where the results are hosted from web search and web fetch, preventing models from accessing pages related to the benchmark rubric. OpenRouter’s [server tools](https://openrouter.ai/docs/guides/features/server-tools) support these exclude lists universally across all models by using a third-party provider like Exa or Parallel, so applying them was a one-line config change rather than per-model patching. All results in this post were produced after the exclusion lists were in place.\n\nIf you are running your own evals, the same mechanism is available: pass `excluded_domains` to web_search or `blocked_domains` to web_fetch in your [tool definitions](https://openrouter.ai/docs/guides/features/server-tools/web-search) to prevent the panel from accessing specific sources.\n\n## Significant boost from fusing a model with itself\n\nWe ran Opus 4.8 partnered with itself as a two-model panel, with Opus 4.8 also serving as the synthesizer. The result: 65.5%, a 6.7-point jump over solo Opus 4.8 (58.8%). This suggests that a meaningful chunk of Fusion’s lift comes from the synthesis step itself, not just from combining different model architectures. Running the same prompt twice produces different reasoning paths, different tool calls, and different source selections. Self-fusion is not enough to outperform a diverse set of models, but it helps us understand the impact of the synthesis itself.\n\n## Notes on our DRACO implementation\n\nWe carefully replicated the methodology described in the DRACO paper, with one exception: we used Gemini 3.1 Pro Preview as the judge instead of the paper’s choice, Gemini 3 Pro. 
This means our scores are not directly comparable to the original paper’s published results.\n\nWe wanted to preserve the high human–LLM alignment properties that led to the authors’ selection, while capturing the discernment of the newer model. We sanity-checked our judging with Claude Sonnet 4.6 after Gemini 3.1 Pro Preview scored low on the benchmark itself, finding that it preserved the qualities that led to the authors’ selection as judge. Our goal was to show relative differences between Fusion and individual models.\n\n## Give Fusion a try\n\n**API**: Send `\"model\": \"openrouter/fusion\"` to call Fusion directly, or add `{\"type\": \"openrouter:fusion\"}` to your tools array to let the model decide when to use it. See the [Fusion docs](https://openrouter.ai/docs/guides/features/server-tools/fusion).\n\n**Chatroom**: Open [openrouter.ai/fusion](https://openrouter.ai/fusion) and pick a preset or build a custom panel.", "url": "https://wpnews.pro/news/surpassing-frontier-performance-with-fusion", "canonical_source": "https://openrouter.ai/blog/announcements/fusion-beats-frontier/", "published_at": "2026-06-14 08:44:18+00:00", "updated_at": "2026-06-14 09:00:47.991431+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-products", "ai-tools", "ai-infrastructure"], "entities": ["OpenRouter", "Fable 5", "GPT-5.5", "Opus 4.8", "Gemini 3 Flash", "Kimi K2.6", "DeepSeek V4 Pro", "DRACO"], "alternates": {"html": "https://wpnews.pro/news/surpassing-frontier-performance-with-fusion", "markdown": "https://wpnews.pro/news/surpassing-frontier-performance-with-fusion.md", "text": "https://wpnews.pro/news/surpassing-frontier-performance-with-fusion.txt", "jsonld": "https://wpnews.pro/news/surpassing-frontier-performance-with-fusion.jsonld"}}