# The swarm that designs itself

Doubleword released an open-source agent swarm that lets an LLM orchestrator design its own team of bounded-context workers to parallelize wide, shardable tasks. In a security audit of a 2.4M-token codebase, the swarm outperformed a single long-context agent on cost and output, cutting token usage from a projected ~300M to a measured 5.6M. The approach, inspired by Moonshot's Kimi K2.5 report, challenges the prevailing trend of relying on ever-larger models for all problems.

Faced with a hard task, the instinct is to reach for more: a smarter model, a longer context window, one capable agent that can hold the whole problem in its head at once. The entire frontier is racing along that axis, chasing more intelligence and more context, and it has handed us a great tool. For deep, sequential problems, a single long-context agent is a superb one: the best hammer we’ve ever had.

So we reach for it on everything. When it can’t crack a task, we rarely stop to ask whether a hammer was the right tool. We just wait for a bigger one, the next model with a longer window and a higher benchmark. But hand someone a hammer and everything starts to look like a nail.

Some tasks were never nails. For one shape of work, there is another way. A lot of real work isn’t deep and sequential. It’s wide and shardable: audit every file in this repo, review every dependency, document every subsystem, check every source. Point a single long-context agent at that and it will get there, but the re-reading gets expensive fast.

That is why we built the [Doubleword Agent Swarm](https://github.com/doublewordai/swarm), our open-source reimplementation of the agent swarm Moonshot introduced in the [Kimi K2.5 report](https://arxiv.org/abs/2602.02276): an LLM orchestrator designs its own team of bounded-context workers and fans them out in parallel over a task.
We used it on a real codebase and compared the run, side by side, with a single long-context agent. For wide, shardable work, a swarm of bounded agents can beat one long-context agent on cost and output. The model designs the team; the runtime only has to spawn workers, isolate context, and gather results.

## What the hammer costs

To make it concrete, we picked a job we needed done: a security audit of [control-layer](https://github.com/doublewordai/control-layer), Doubleword’s open-source AI gateway, 512 source files and about 2.4M tokens of unique source. The brief asked for candidate issues across injection, leaked secrets, broken auth, and unsafe file handling. We ran it both ways, one long-context agent and the swarm.

We ran the single agent first: Claude Opus, a 1M-token window, no chunking, just “audit the repo”. It works, but the cost shows up in the loop. An agent loop re-sends the growing transcript with every turn, so by the time our metered run had covered ~7% of the repo, it had already burned 27.7M tokens, 95% of them cache reads, the same transcript shipped back again and again. (These numbers come from an actual metered run. The agent repeatedly fills its window, hits the compaction ceiling, summarises, and grows again. With prompt caching, the projected full-repo bill is ~$300; without it, ~$1,800.)

Projected over the full repo, the audit lands around 300M tokens: a 2.4M-token codebase, amplified ×125. Reading the codebase once costs 2.4M tokens. The remaining ~297M is the agent re-reading what it has already seen.

## The alternative: a swarm

Don’t make one agent re-read everything. Split the repo across many bounded workers, each reading only its own slice, once, all at the same time. That’s an agent swarm.

It’s how every company already works. The CEO is the most capable and most expensive person in the building, and the wrong one to personally trawl through every file. So they don’t.
They hire specialists with tight remits, hand each a bounded task, and never see the mountain of material those specialists wade through. What comes back is a short, high-level summary. The raw work stays with the specialist. The same setup works here, just with agents: the orchestrator plays CEO, the workers are its specialists, and only their conclusions ever travel back up.

The hard part is deciding who should do what. In February 2026, Moonshot published the Kimi K2.5 technical report ([Kimi K2.5: Visual Agentic Intelligence](https://arxiv.org/abs/2602.02276), Kimi Team, February 2026; see also Moonshot's [agent swarm post](https://www.kimi.com/blog/agent-swarm)). Its agent-swarm result is the framework we built on: scale out, not just up. A trainable orchestrator spawns specialised sub-agents and runs them in parallel, trained with PARL (Parallel-Agent Reinforcement Learning), where only the orchestrator learns and the sub-agents stay frozen. The headline numbers: 4.5× lower latency than a single agent, and +17.8 points on BrowseComp (60.6 → 78.4 on BrowseComp, a deep-research benchmark, versus the single-agent baseline).

What’s new is that the swarm designs itself: decomposition and team width are the model’s call, not a hand-written workflow. What’s in the weights is only the orchestration instinct: how to decompose, delegate, reconcile. The runtime that makes a swarm real (spawn, isolate, parallelise, aggregate) lives in Moonshot’s hosted product, not in the open weights. An open endpoint gives us what it has always given us: messages and tools in, tool calls and text out.

The project fills that runtime gap. [doublewordai/swarm](https://github.com/doublewordai/swarm) is our from-scratch interpretation of Moonshot’s swarm, built on the [Open Responses API](https://openresponses.org), and model-agnostic (by default, `moonshotai/Kimi-K2.6`). Full credit to Moonshot for the pattern.
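The four runtime duties named above (spawn, isolate, parallelise, aggregate) are small enough to sketch. The following is a minimal illustration under stated assumptions, not the code in doublewordai/swarm: `run_worker`, `run_wave`, `WorkerResult`, the task names, and the file `billing.py` are hypothetical, and `asyncio.sleep(0)` stands in for real model and tool calls.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WorkerResult:
    task: str
    findings: list[str]


async def run_worker(task: str, shard: list[str]) -> WorkerResult:
    """Hypothetical worker: sees only its shard, returns only findings."""
    transcript = [f"read {path}" for path in shard]  # private working context
    await asyncio.sleep(0)  # stand-in for model and tool calls
    findings = [f"{path}: reviewed" for path in shard]
    del transcript  # isolate: the worker's research never travels upward
    return WorkerResult(task=task, findings=findings)


async def run_wave(tasks: dict[str, list[str]], max_parallel: int = 8) -> list[WorkerResult]:
    """Spawn one fresh worker per task, cap concurrency, gather results in task order."""
    gate = asyncio.Semaphore(max_parallel)

    async def bounded(task: str, shard: list[str]) -> WorkerResult:
        async with gate:
            return await run_worker(task, shard)

    return await asyncio.gather(*(bounded(t, s) for t, s in tasks.items()))


# One wave: two scoped tasks, each with its own slice of the repo.
wave = {"audit-cli": ["cli.py"], "audit-cost": ["cost.py", "billing.py"]}
results = asyncio.run(run_wave(wave))

# Aggregate: only the findings reach the orchestrator.
aggregated = [finding for result in results for finding in result.findings]
print(aggregated)
```

The semaphore is one possible way to bound a wave's width; the point of the sketch is only that each worker's transcript stays local while the orchestrator receives a short list of results.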
The repo's README has a ["Faithful to Kimi"](https://github.com/doublewordai/swarm#faithful-to-kimi) section spelling out what we reproduced (the self-designing orchestrator, context sharding, the critical-steps metric), what we deliberately dropped (PARL training, the mutating toolbox), and what we added.

## The architecture

Everything we kept from the paper, and everything we added, fits into four principles:

1. A self-designing orchestrator. The model decides the team and the decomposition, not a hand-written workflow.
2. Bounded local context. Each worker sees only its slice, and returns only results.
3. Independent verification. Every candidate finding is challenged before it counts.
4. Synthesis. One final pass reconciles everything into a deliverable.

Left to right: the repo is cloned and compressed into a budgeted map; the orchestrator designs a team and dispatches scoped tasks in parallel waves; ephemeral workers investigate and return findings only; findings are deduped, challenged by verifiers, and reconciled by a synthesizer into `report.md`.

Four blocks, one per principle.

### Block 1: the orchestrator designs the team

The orchestrator gets the repo map up front and can probe with `read_file` and `grep` before committing to a plan. Then it builds its team with two tools, `create_subagent(name, system_prompt)` and `assign_task(agent, prompt)`, the literal tool surface Kimi K2.5/K2.6 were RL-trained on. (K2.5 technical report, Appendix E.8: this is `--interface kimi`, which the repo ships as the default. We also ship `--interface structured`, where the orchestrator instead calls `dispatch_workers({role, focus, paths})` and the harness preloads each worker's files, decomposing by directory rather than by task, which keeps the planning turn small on very large repos.)

It authors each specialist’s system prompt itself.
In one real audit run it invented a persona we never asked for:

```python
# the orchestrator wrote this prompt; persona registered once
create_subagent(name="injection-filesystem",
                system_prompt="You hunt injection and unsafe file access…")

# every task spawns a fresh agent with that persona
assign_task("injection-filesystem", "Audit cli.py …")
assign_task("injection-filesystem", "Trace cost.py …")
```

The division of labour matters: the model decides who does what; the harness decides which tools each role may hold. Each dispatching turn is a wave: width is parallelism, follow-up waves fill gaps.

### Block 2: workers see only their slice

The paper’s key lever is context sharding. Each task spawns a fresh, throwaway agent. It gathers the context it needs (`read_file`, `grep`, plus whatever capability tools its role grants), works for a few rounds, calls `submit_results`, and is gone. Only schema-valid results and a status line return to the orchestrator; the worker’s research is discarded, never re-sent.

The same boundary keeps both context and cost under control: no single context overflows, and per-agent token usage stays low. That is why fanning out hundreds of workers stays cheap, and why the 300M-token bill from earlier never materialises. The v1 toolset is deliberately read-only, so it’s safe to point at any repo.

### Block 3: every finding meets a skeptic

A swarm of enthusiastic hunters produces enthusiastic false positives. So before anything counts, each candidate finding is handed to an independent verifier whose only job is to refute it, and which defaults to “not real” when unsure. Survivors ship with adjusted severity; refuted findings are dropped and counted.

This stage is our addition; the paper’s orchestrator reconciles inline. It’s optional and per-brief; `--verify-votes N` turns it into a majority-vote panel.
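The verify stage described above can be pictured as a strict-majority vote in which anything short of "real" counts against the finding. This is a sketch under stated assumptions, not the repo's implementation: the `verify` and `filter_findings` helpers, the three toy verifiers, the vote labels, and the sample findings (including `db.py:41`) are all made up for illustration. In practice each verifier would be a model call.

```python
from collections import Counter
from typing import Callable, Optional

Finding = dict[str, Optional[str]]
Verifier = Callable[[Finding], str]  # returns "real", "not_real", or "unsure"


def verify(finding: Finding, verifiers: list[Verifier]) -> bool:
    """Keep a finding only if a strict majority of verifiers call it real."""
    votes = Counter(v(finding) for v in verifiers)
    # "unsure" is not a vote in favour: the default verdict is "not real".
    return votes["real"] > len(verifiers) / 2


def filter_findings(candidates: list[Finding], verifiers: list[Verifier]):
    """Return the confirmed findings and the number of refuted ones."""
    confirmed = [f for f in candidates if verify(f, verifiers)]
    return confirmed, len(candidates) - len(confirmed)


# Toy verifiers standing in for independent model calls.
def strict(f: Finding) -> str:
    return "real" if f.get("evidence") else "not_real"


def cautious(f: Finding) -> str:
    return "real" if f.get("evidence") and f["severity"] != "low" else "unsure"


def lenient(f: Finding) -> str:
    return "real"


candidates = [
    {"title": "SQL built by string concatenation", "severity": "high", "evidence": "db.py:41"},
    {"title": "Possible secret in a comment", "severity": "low", "evidence": None},
]
confirmed, refuted = filter_findings(candidates, [strict, cautious, lenient])
print(len(confirmed), "confirmed,", refuted, "refuted")
```

With three verifiers, a finding needs at least two "real" votes to survive, so a single enthusiastic verifier cannot carry an unsupported claim into the report.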
### Block 4: one pass writes the report

A single tool-free synthesis call reconciles the confirmed findings into the deliverable: `report.md` for humans, `findings.json` for machines. Its shape comes from the brief, not the engine, which brings us to the part we like most.

## Swap the brief, keep the engine

Nothing in the engine mentions auditing. The loop (orchestrate, shard, verify, synthesize) is byte-identical for every task. What a swarm does is a brief: ~50 lines of data binding prompts to roles, a result schema (enforced at `submit_results`: invalid items are dropped, not trusted), and a tool selection per role:

```python
# src/briefs/onboarding.py (abridged)
from . import Brief, register

register(Brief(
    name="onboarding",
    description="Document a codebase's subsystems for newcomers.",
    orchestrator_prompt="You are the lead author … call dispatch_workers once …",
    worker_prompt="Document ONLY your assigned files: purpose, key components, deps …",
    synthesis_prompt="Assemble an onboarding guide: overview, per-subsystem sections …",
    result_schema={...},
    result_key="sections",
    worker_tools=("read_file", "grep"),
    verifier_prompt=None,  # set a prompt to switch the adversarial verify stage on
))
```

Two briefs ship in the box, `audit` and `onboarding`; new briefs fit in an afternoon: a dependency review (one worker per dependency: version drift, advisories, upgrade risk), a refactor plan (workers map usage per module, the synthesizer sequences the steps), wide research (one worker per source; verifiers refute unsupported claims).

## The receipts

The two runs, side by side:

| | Solo agent (Claude Opus) | Swarm (Kimi K2.6) |
|---|---|---|
| Tokens | ~300M · projected | 5.6M · measured |
| Cost | ~$300 | ~$6.70 |
| vs. the 2.4M read-once floor | ×125 | ×2.3 |

The swarm’s run, measured: 348 API calls, 5.2M tokens in, ~450k out, about 53× fewer tokens and 45× cheaper at Doubleword’s Kimi K2.6 pricing ($0.95/M input, $4/M output). (Solo figures are projected from the metered partial run, with prompt caching priced in; swarm figures are measured.)

Every run also writes `summary.json` with tokens, cost, coverage, and the paper's critical-vs-total step counts: `speedup = total / critical` scores how well the orchestrator parallelised the run, the way the paper scores it. The verifier stage refuted and dropped roughly half of the candidate findings before any of them reached us, so what survived came with severity, `file:line`, and a suggested fix attached.

## Nobody’s waiting: the flex tier

A swarm is a [just-get-it-done workload](https://blog.doubleword.ai/inference-when-no-one-is-waiting): hundreds of concurrent calls, and no human watching any single one. It’s throughput-bound: what matters is when the whole wave lands, not when each call returns. Doubleword’s flex tier is priced for this. Individual calls may run longer, but global throughput holds, so end-to-end wall-clock time stays roughly the same, at ~30% off. (Tier discounts are per-model; ~30% is Doubleword's flex pricing for Kimi K2.6 at the time of the run.)
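For readers who want to check the receipts, the ratios quoted in the comparison can be re-derived from the figures already given in this post. This is a small worked sketch, not output from the tool: the variable names are ours, the inputs are the rounded numbers from the text (so results agree with the table only to within rounding), and the step counts passed to `speedup` at the end are made up purely to show the formula.

```python
# Figures quoted in the post (rounded); solo numbers are projections.
CODEBASE_TOKENS = 2.4e6
SOLO_TOKENS = 300e6
SOLO_COST_USD = 300.0
SWARM_TOKENS_IN = 5.2e6
SWARM_TOKENS_OUT = 450e3
PRICE_IN_PER_M = 0.95   # USD per million input tokens
PRICE_OUT_PER_M = 4.00  # USD per million output tokens

swarm_tokens = SWARM_TOKENS_IN + SWARM_TOKENS_OUT
swarm_cost = (SWARM_TOKENS_IN / 1e6) * PRICE_IN_PER_M \
    + (SWARM_TOKENS_OUT / 1e6) * PRICE_OUT_PER_M

token_ratio = SOLO_TOKENS / swarm_tokens              # how many times fewer tokens
cost_ratio = SOLO_COST_USD / swarm_cost               # how many times cheaper
solo_amplification = SOLO_TOKENS / CODEBASE_TOKENS    # vs. the read-once floor
swarm_amplification = swarm_tokens / CODEBASE_TOKENS


def speedup(total_steps: int, critical_steps: int) -> float:
    """Parallelism score as described in the text: total steps over critical-path steps."""
    return total_steps / critical_steps


print(f"swarm cost ~${swarm_cost:.2f}")
print(f"tokens: about {token_ratio:.0f}x fewer; cost: about {cost_ratio:.0f}x cheaper")
print(f"amplification: solo x{solo_amplification:.0f}, swarm x{swarm_amplification:.2f}")
print(f"illustrative speedup: {speedup(120, 15):.1f}")  # hypothetical step counts
```

Because the output-token figure is approximate (~450k), the swarm's amplification over the read-once floor comes out slightly above the table's ×2.3 when computed from 5.2M in plus 450k out; the table's value corresponds to the rounded 5.6M total.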