{"slug": "japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing", "title": "Japan’s Sakana Fugu Beats Opus 4.8 and GPT-5.5 by Conducting Them, Not Replacing Them", "summary": "Japan's Sakana AI released Fugu, a small orchestration model that coordinates frontier models from OpenAI, Google, and Anthropic, achieving benchmark scores higher than any individual model it directs, including Claude Opus 4.8 and GPT-5.5, on tests like SWE-Bench Pro and Terminal Bench 2.1.", "body_md": "*While everyone waits for the next giant model, a Tokyo lab took a different path. Sakana AI’s new system, Fugu, is not a frontier model at all. It is a conductor: a small model trained to coordinate the frontier models from OpenAI, Google, and Anthropic, route each task to the right one, and combine their answers. The surprising part is the result: on most of the published benchmarks, the orchestrated team beats every individual model it directs, and it stands shoulder to shoulder with the very best. Here is how it works, what the benchmarks actually show, and the honest caveats.*\n\nThe dominant move in AI for years has been to build a bigger model. More parameters, more data, more compute, a single larger brain that answers your questions directly. On June 22, 2026, the Tokyo lab Sakana AI shipped something that does not fit that mold at all, and the benchmark numbers it posted are worth understanding precisely because the thing producing them isn’t what you’d expect.\n\nSakana’s new system is called Fugu, and the first thing to get straight is what it is, because it’s easy to misread. Fugu isn’t a frontier model competing with the giants. It’s an orchestrator, a model whose entire job is to coordinate other models. You send it a request, and rather than answering directly, it decides which of the leading frontier models should handle which part of the task, routes the work to them, and synthesizes their outputs into a single answer. 
The models it conducts are the strongest ones available: Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. Fugu itself is the conductor, not the musician, and that distinction is the whole story.\n\nThe clearest way to understand Fugu is by contrast with a normal model.\n\nWhen you use a standard model like Opus 4.8 or GPT-5.5, you are talking to one system that was trained to answer you. One brain, one response. Fugu works differently. It’s a small language model trained for a single skill: coordination. Faced with your request, it dynamically assembles a plan, figures out which model in its pool is best suited to each piece of the problem, delegates accordingly, checks and combines what comes back, and returns one synthesized answer. The intelligence Fugu adds isn’t in answering; it’s in knowing which expert to ask and how to combine their work.\n\nThe analogy that fits is a project manager with a roster of top specialists on call. The manager doesn’t personally do every task. They route each part to the expert best suited for it, then assemble the results into one coherent deliverable. Fugu is that manager, and Opus, GPT-5.5, and Gemini are the specialists. When you use Fugu, you are quietly using all of them, coordinated.\n\nIf that pattern sounds familiar, it should. It’s the same idea behind the multi-model setups a growing number of developers are building by hand: one component plans and routes, others execute. Fugu is that concept built into a polished, trained product by a research lab, and then benchmarked against the giants it coordinates.\n\nHere is the part that turns a clever architecture into a real story. By coordinating those frontier models well, Fugu scores higher on most of Sakana’s reported benchmarks than any of those same models scores on its own.\n\nThat’s the claim worth sitting with. The orchestrated team beats each of its individual members. 
Sakana evaluated Fugu against the exact models in its pool, with the same maximum reasoning effort, and the top variant came out ahead on most of the hard benchmarks. The numbers below are Sakana’s own reported figures, and they tell the story clearly.\n\nOn SWE-Bench Pro, a demanding test of fixing real software bugs in production repositories, Fugu-Ultra scores 73.7, ahead of Opus 4.8 at 69.2, GPT-5.5 at 58.6, and Gemini 3.1 Pro at 54.2. On Terminal Bench 2.1, a test of agentic coding in a real terminal environment, Fugu-Ultra reaches 82.1, ahead of GPT-5.5 at 78.2, Opus 4.8 at 74.6, and Gemini at 70.3. On LiveCodeBench it posts 93.2; on GPQA-Diamond, a set of graduate-level science questions, it hits 95.5; and on a string of other reasoning and coding benchmarks it leads the pool. Across the published table, the top Fugu variant leads every model it directs on the large majority of the benchmarks. The team, well conducted, beats the soloists.\n\nHere is Sakana’s published comparison, so you can see the whole picture, including where the individual models still win.\n\nAll figures are Sakana-reported. Read the table and you can see the pattern and its limits at once: the top Fugu variant leads most rows, but GPT-5.5 takes MRCRv2, Opus 4.8 edges CTI-REALM, and on a few benchmarks standard Fugu even beats Ultra.\n\nFugu comes in two flavors, and the difference is how hard each works.\n\nThe standard Fugu is built for speed and lower cost. It coordinates the pool efficiently, calling fewer models per task for a faster, cheaper answer, and it uses dynamic pricing based on which models it actually activates. It’s the version for everyday, latency-sensitive work: the manager who quickly hands a task to the one right expert and gets you a fast result.\n\nFugu-Ultra is built for the hardest problems, where you want the best possible answer regardless of speed. 
It uses a larger pool of models per query and has them do more cross-checking and synthesis, squeezing out maximum quality at the cost of higher latency and fixed pricing. It’s the manager who convenes the whole expert panel on a tough problem and assembles the strongest combined result.\n\nThat difference shows up directly in the scores: Fugu-Ultra generally outscores standard Fugu on the hardest tests (73.7 versus 59.0 on SWE-Bench Pro, for instance) because the extra coordination pays off. Interestingly, the pattern is not universal: on a few benchmarks standard Fugu edges out Ultra, a useful reminder that more coordination is not automatically better. Sometimes the lighter approach happens to route a given task more effectively.\n\nThis is a genuinely impressive result, but a careful reading requires three honest qualifications, and leaving them out would be selling the story rather than telling it.\n\nFirst, these are vendor-reported numbers. Sakana ran the evaluations and published the table, and as of now no independent third-party lab has reproduced them. That doesn’t make them wrong, but it makes them claims to validate rather than settled fact, and the right response to any vendor benchmark is to test it on your own workload before trusting the headline.\n\nSecond, an orchestrator’s score answers a slightly different question than a single model’s. When Fugu-Ultra posts a high number, that reflects the whole system: the routing, the delegation, the synthesis, and the raw power of the pool underneath. A strong score can come from excellent coordination as much as from any single model’s ability, which is exactly what Fugu is built to do, but it means you are measuring a system, not a model, and the comparison should be read in that light.\n\nThird, the wins are not a clean sweep, and the most important detail is one Sakana itself is careful about. Sakana doesn’t claim Fugu beats Anthropic’s top model, Fable 5. 
Its own framing is that Fugu stands shoulder to shoulder with it, and on the hardest coding benchmark, SWE-Bench Pro, Fable 5 still leads clearly, scoring well above Fugu-Ultra on that test. The soloists win elsewhere too: GPT-5.5 leads the long-context recall test, and Opus 4.8 edges ahead on a cybersecurity benchmark. An orchestrator is ultimately bounded by the models it has to work with, and where the best single model is exceptional, coordination doesn’t always close the gap. The accurate summary isn’t that Fugu beats everything; it’s that it reaches the frontier by a route nobody else took.\n\nStep back and the significance is not really about one lab’s leaderboard. It’s about the bet underneath it. The entire industry has poured its resources into making single models bigger. Sakana built a system that, without training a frontier model of its own, reaches frontier-level results by coordinating the models that already exist, and on a number of hard tasks beats each of them individually. That’s a meaningful piece of evidence for an idea that has been gaining ground: that the next gains in AI may come as much from orchestration as from scale.\n\nIt also lands at a practical moment. The same coordinate-multiple-models pattern is something developers are increasingly building themselves, and Fugu shows what it looks like when a research lab takes that pattern seriously, trains a real model to do the conducting, and pushes it to the frontier. Whether orchestration ultimately rivals raw scale or merely complements it is still an open argument, and the vendor-reported caveat means the numbers deserve scrutiny rather than applause. But the approach is real, the results are striking, and the larger point is hard to ignore. Japan’s most interesting answer to the frontier labs wasn’t to build a bigger brain. 
It was to build something that makes the existing brains work together, and to show that, conducted well, the team can beat the soloists.\n\n*If you work with multi-model systems or have a view on whether orchestration genuinely reaches the frontier or just borrows it from the models underneath, I would be interested to hear it in the comments. And if you test Fugu against your own workload, the real-world results are worth far more than any vendor benchmark table.*\n\n[Japan’s Sakana Fugu Beats Opus 4.8 and GPT-5.5 by Conducting Them, Not Replacing Them](https://pub.towardsai.net/japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing-them-c04834ef73d8) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.", "url": "https://wpnews.pro/news/japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing", "canonical_source": "https://pub.towardsai.net/japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing-them-c04834ef73d8?source=rss----98111c9905da---4", "published_at": "2026-06-26 19:01:01+00:00", "updated_at": "2026-06-26 19:40:26.080452+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "ai-research", "ai-products"], "entities": ["Sakana AI", "Fugu", "OpenAI", "Google", "Anthropic", "Claude Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro"], "alternates": {"html": "https://wpnews.pro/news/japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing", "markdown": "https://wpnews.pro/news/japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing.md", "text": "https://wpnews.pro/news/japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing.txt", "jsonld": "https://wpnews.pro/news/japans-sakana-fugu-beats-opus-4-8-and-gpt-5-5-by-conducting-them-not-replacing.jsonld"}}