{"slug": "the-swarm-that-designs-itself", "title": "The swarm that designs itself", "summary": "Doubleword released an open-source agent swarm that lets an LLM orchestrator design its own team of bounded-context workers to parallelize wide, shardable tasks. In a security audit of a 2.4M-token codebase, the swarm outperformed a single long-context agent on cost and output, reducing projected token usage from 300M to a fraction of that. The approach, inspired by Moonshot's Kimi K2.5 report, challenges the prevailing trend of relying on ever-larger models for all problems.", "body_md": "# The swarm that designs itself\n\nFaced with a hard task, the instinct is to reach for more: a smarter model, a longer context window, one capable agent that can hold the whole problem in its head at once. The entire frontier is racing along that axis, chasing more intelligence and more context, and it has handed us a great tool. For deep, sequential problems, a single long-context agent is a superb one: the best hammer we’ve ever had.\n\nSo we reach for it on everything. When it can’t crack a task, we rarely stop to ask whether a hammer was the right tool. We just wait for a bigger one, the next model with a longer window and a higher benchmark. But hand someone a hammer and everything starts to look like a nail. Some tasks were never nails.\n\nFor one shape of work, there is another way. A lot of real work isn’t deep and sequential. It’s *wide and shardable*: audit every file in this repo, review every dependency, document every subsystem, check every source. 
Point a single long-context agent at that and it will get there, but the re-reading gets expensive fast.\n\nThat is why we built the [Doubleword Agent Swarm](https://github.com/doublewordai/swarm), our open-source reimplementation of the agent swarm Moonshot introduced in the [Kimi K2.5 report](https://arxiv.org/abs/2602.02276): an LLM orchestrator **designs its own team** of bounded-context workers and fans them out in parallel over a task. We used it on a real codebase and compared the run, side by side, with a single long-context agent.\n\nFor wide, shardable work, a swarm of bounded agents can beat one long-context agent on cost and output. The model designs the team; the runtime only has to spawn workers, isolate context, and gather results.\n\n## What the hammer costs\n\nTo make it concrete, we picked a job we needed done: a security audit of [control-layer](https://github.com/doublewordai/control-layer), Doubleword’s open-source AI gateway, 512 source files and about 2.4M tokens of unique source. The brief asked for candidate issues across injection, leaked secrets, broken auth, and unsafe file handling. We ran it both ways, one long-context agent and the swarm.\n\nWe ran the single agent first: Claude Opus, a 1M-token window, no chunking, just “audit the repo”. It works, but the cost shows up in the loop. An agent loop re-sends the growing transcript with every turn, so by the time our metered run had covered ~7% of the repo, it had already burned 27.7M tokens, 95% of them cache reads, the same transcript shipped back again and again. (These numbers come from an actual metered run: the agent repeatedly fills its window, hits the compaction ceiling, summarises, and grows again. With prompt caching, the projected full-repo bill is ~$300; without it, ~$1,800.) Projected over the full repo, the audit lands around **300M tokens**: a 2.4M-token codebase, amplified ×125.\n\nReading the codebase once costs 2.4M tokens. 
The remaining ~297M is the agent re-reading what it has already seen.\n\n## The alternative: a swarm\n\nDon’t make one agent re-read everything. Split the repo across many bounded workers, each reading only its own slice, once, all at the same time. That’s an agent swarm.\n\nIt’s how every company already works. The CEO is the most capable (and most expensive) person in the building, and the wrong one to personally trawl through every file. So they don’t. They hire specialists with tight remits, hand each a bounded task, and never see the mountain of material those specialists wade through. What comes back is a short, high-level summary. The raw work stays with the specialist. The same setup works here, just with agents: the orchestrator plays CEO, the workers are its specialists, and only their conclusions ever travel back up.\n\nThe hard part is deciding who should do what.\n\nIn February 2026, Moonshot published the Kimi K2.5 technical report ([Kimi K2.5: Visual Agentic Intelligence](https://arxiv.org/abs/2602.02276), Kimi Team, February 2026; see also Moonshot's [agent swarm post](https://www.kimi.com/blog/agent-swarm)). Its agent-swarm result is the framework we built on: scale *out*, not just up. A trainable orchestrator spawns specialised sub-agents and runs them in parallel, trained with PARL (Parallel-Agent Reinforcement Learning), where only the orchestrator learns and the sub-agents stay frozen. The headline numbers: 4.5× lower latency than a single agent, and +17.8 points on BrowseComp (60.6 → 78.4 on this deep-research benchmark, versus the single-agent baseline). What’s new is that **the swarm designs itself**: decomposition and team width are the model’s call, not a hand-written workflow.\n\nWhat’s in the weights is only the orchestration *instinct*: how to decompose, delegate, reconcile. 
The runtime that makes a swarm real (spawn, isolate, parallelise, aggregate) lives in Moonshot’s hosted product, not in the open weights. An open endpoint gives us what it has always given us: messages and tools in, tool calls and text out.\n\nThe project fills that runtime gap. [doublewordai/swarm](https://github.com/doublewordai/swarm) is our from-scratch interpretation of Moonshot’s swarm, built on the [Open Responses API](https://openresponses.org), and model-agnostic, defaulting to `moonshotai/Kimi-K2.6`. Full credit to Moonshot for the pattern. The repo's README has a [\"Faithful to Kimi\"](https://github.com/doublewordai/swarm#faithful-to-kimi) section spelling out what we reproduced (the self-designing orchestrator, context sharding, the critical-steps metric), what we deliberately dropped (PARL training, the mutating toolbox), and what we added.\n\n## The architecture\n\nEverything we kept from the paper, and everything we added, fits into four principles:\n\n- **A self-designing orchestrator.** The model decides the team and the decomposition, not a hand-written workflow.\n- **Bounded local context.** Each worker sees only its slice, and returns only results.\n- **Independent verification.** Every candidate finding is challenged before it counts.\n- **Synthesis.** One final pass reconciles everything into a deliverable.\n\nLeft to right: the repo is cloned and compressed into a budgeted map; the orchestrator designs a team and dispatches scoped tasks in parallel *waves*; ephemeral workers investigate and return findings only; findings are deduped, challenged by verifiers, and reconciled by a synthesizer into `report.md`. Four blocks, one per principle.\n\n## Block 1: the orchestrator designs the team\n\nThe orchestrator gets the repo map up front and can probe with `read_file` and `grep` before committing to a plan. 
Then it builds its team with two tools, `create_subagent(name, system_prompt)` and `assign_task(agent, prompt)`, the literal tool surface Kimi K2.5/K2.6 were RL-trained on. (K2.5 technical report, Appendix E.8: this is `--interface kimi`, which the repo ships as the default. We also ship `--interface structured`, where the orchestrator instead calls `dispatch_workers([{role, focus, paths}])` and the harness preloads each worker's files, decomposing by directory rather than by task, which keeps the planning turn small on very large repos.) The orchestrator authors each specialist’s system prompt itself. In one real audit run it invented a persona we never asked for:\n\n```\n# the orchestrator wrote this prompt (persona registered once)\ncreate_subagent(\n    name=\"injection-filesystem\",\n    system_prompt=\"You hunt injection and unsafe file access…\",\n)\n\n# every task spawns a fresh agent with that persona\nassign_task(\"injection-filesystem\", \"Audit cli.py …\")\nassign_task(\"injection-filesystem\", \"Trace cost.py …\")\n```\n\nThe division of labour matters: the model decides *who* does *what*; the harness decides which tools each role may hold. Each dispatching turn is a wave: width is parallelism, follow-up waves fill gaps.\n\n## Block 2: workers see only their slice\n\nThe paper’s key lever is **context sharding**. Each task spawns a fresh, throwaway agent. It gathers the context it needs (`read_file`, `grep`, plus whatever capability tools its role grants), works for a few rounds, calls `submit_results`, and is gone. Only schema-valid results and a status line return to the orchestrator; the worker’s research is discarded, never re-sent.\n\nThe same boundary keeps both context and cost under control: no single context overflows, and per-agent token usage stays low. That is why fanning out hundreds of workers stays cheap, and why the 300M-token bill from earlier never materialises. 
(The v1 toolset is deliberately read-only, so it’s safe to point at any repo.)\n\n## Block 3: every finding meets a skeptic\n\nA swarm of enthusiastic hunters produces enthusiastic false positives. So before anything counts, each candidate finding is handed to an independent verifier whose only job is to **refute it**, and which defaults to “not real” when unsure. Survivors ship with adjusted severity; refuted findings are dropped and counted.\n\nThis stage is our addition; the paper’s orchestrator reconciles inline. It’s optional and per-brief; `--verify-votes N` turns it into a majority-vote panel.\n\n## Block 4: one pass writes the report\n\nA single tool-free synthesis call reconciles the confirmed findings into the deliverable: `report.md` for humans, `findings.json` for machines. Its shape comes from the brief, not the engine, which brings us to the part we like most.\n\n## Swap the brief, keep the engine\n\nNothing in the engine mentions auditing. The loop (orchestrate, shard, verify, synthesize) is byte-identical for every task. What a swarm *does* is a **brief**: ~50 lines of data binding prompts to roles, a result schema (enforced at `submit_results`: invalid items are dropped, not trusted), and a tool selection per role:\n\n```python\n# src/briefs/onboarding.py (abridged)\nfrom . 
import Brief, register\n\nregister(Brief(\n    name=\"onboarding\",\n    description=\"Document a codebase's subsystems for newcomers.\",\n    orchestrator_prompt=\"You are the lead author … call dispatch_workers once …\",\n    worker_prompt=\"Document ONLY your assigned files: purpose, key components, deps …\",\n    synthesis_prompt=\"Assemble an onboarding guide: overview, per-subsystem sections …\",\n    result_schema={...},\n    result_key=\"sections\",\n    worker_tools=(\"read_file\", \"grep\"),\n    verifier_prompt=None,   # set a prompt to switch the adversarial verify stage on\n))\n```\n\nTwo briefs ship in the box, `audit` and `onboarding`; new briefs fit in an afternoon: a dependency review (one worker per dependency: version drift, advisories, upgrade risk), a refactor plan (workers map usage per module, the synthesizer sequences the steps), wide research (one worker per source; verifiers refute unsupported claims).\n\n## The receipts\n\nThe two runs, side by side:\n\n| | Solo agent (Claude Opus) | Swarm (Kimi K2.6) |\n|---|---|---|\n| Tokens | ~300M · projected | 5.6M · measured |\n| Cost | ~$300 | ~$6.70 |\n| vs. the 2.4M read-once floor | ×125 | ×2.3 |\n\nThe swarm’s run, measured: 348 API calls, 5.2M tokens in, ~450k out, about **53× fewer tokens and 45× cheaper** at Doubleword’s Kimi K2.6 pricing ($0.95/M input, $4/M output). (Solo figures are projected from the metered partial run, with prompt caching priced in; swarm figures are measured.) Every run also writes `summary.json` with tokens, cost, coverage, and the paper's critical-vs-total step counts: `speedup = total / critical` scores how well the orchestrator parallelised the run, the way the paper scores it. 
The verifier stage refuted and dropped roughly half of the candidate findings before any of them reached us, so what survived came with severity, `file:line`, and a suggested fix attached.\n\n## Nobody’s waiting: the flex tier\n\nA swarm is a [just-get-it-done workload](https://blog.doubleword.ai/inference-when-no-one-is-waiting): hundreds of concurrent calls, and no human watching any single one. It’s throughput-bound: what matters is when the whole wave lands, not when each call returns.\n\nDoubleword’s flex tier is priced for this. Individual calls may run longer, but global throughput holds, so end-to-end wall-clock time stays roughly the same, at ~30% off. (Tier discounts are per-model; ~30% is Doubleword's flex pricing for Kimi K2.6 at the time of the run.) `swarm compare <brief> --repo …` runs the identical workload on both tiers and writes the wall-clock / token / cost table, so you can measure the trade on your own job.\n\n```\nservice_tier = \"flex\"   # was \"priority\"\n```\n\n## Run it on your repo\n\nThe quickest path is the [dw CLI](https://docs.doubleword.ai), which sets up auth, endpoint, and model in one step:\n\n```\ndw login\ndw examples clone swarm\ncd swarm\ndw project setup\n```\n\nThen point a brief at a repo:\n\n```\ndw project run audit      -- --repo psf/requests --max-files 20         # audit a GitHub repo\ndw project run onboarding -- --path ./my-service                         # document a local directory\ndw project run audit      -- --repo psf/requests --service-tier flex    # run on flex tier for 30% cost saving!\ndw project run report                                                   # print the latest run's report\n```\n\nEach run writes `results/<brief>-<slug>/`: the synthesized `report.md`, structured `findings.json`, `swarm-tree.json` (the team the orchestrator designed, worth reading at least once), and `summary.json` with tokens, cost, and step 
counts.\n\nEverything is open: the harness is [on GitHub](https://github.com/doublewordai/swarm), it speaks the [Open Responses API](https://openresponses.org), and it’s model-agnostic: use `-m/--model` (plus `--provider` when needed) to pick any tool-calling model.\n\n[Doubleword](https://doubleword.ai) is built for this kind of high-throughput inference, and we’d love to see what you fan out. Clone the swarm, write a brief, and go run parallel agents.\n\n```\n@misc{doubleword-the-swarm-that-designs-itself,\n  title        = {The swarm that designs itself},\n  author       = {Peter Bhabra},\n  year         = {2026},\n  howpublished = {Doubleword Blog},\n  url          = {https://blog.doubleword.ai/the-swarm-that-designs-itself},\n}\n```\n\n", "url": "https://wpnews.pro/news/the-swarm-that-designs-itself", "canonical_source": "https://blog.doubleword.ai/the-swarm-that-designs-itself", "published_at": "2026-06-30 00:00:00+00:00", "updated_at": "2026-06-30 15:31:39.151081+00:00", "lang": "en", "topics": ["ai-agents", "ai-research", "ai-tools", "large-language-models", "ai-infrastructure"], "entities": ["Doubleword", "Claude Opus", "Moonshot", "Kimi K2.5", "Doubleword Agent Swarm", "control-layer"], "alternates": {"html": "https://wpnews.pro/news/the-swarm-that-designs-itself", "markdown": "https://wpnews.pro/news/the-swarm-that-designs-itself.md", "text": "https://wpnews.pro/news/the-swarm-that-designs-itself.txt", "jsonld": "https://wpnews.pro/news/the-swarm-that-designs-itself.jsonld"}}