Faced with a hard task, the instinct is to reach for more: a smarter model, a longer context window, one capable agent that can hold the whole problem in its head at once. The entire frontier is racing along that axis, chasing more intelligence and more context, and it has handed us a great tool. For deep, sequential problems, a single long-context agent is a superb one: the best hammer we’ve ever had.
So we reach for it on everything. When it can’t crack a task, we rarely stop to ask whether a hammer was the right tool. We just wait for a bigger one, the next model with a longer window and a higher benchmark. But hand someone a hammer and everything starts to look like a nail. Some tasks were never nails.
For one shape of work, there is another way. A lot of real work isn’t deep and sequential. It’s wide and shardable: audit every file in this repo, review every dependency, document every subsystem, check every source. Point a single long-context agent at that and it will get there, but the re-reading gets expensive fast.
Asking whether the hammer was the right tool is why we built the Doubleword Agent Swarm, our open-source reimplementation of the agent swarm Moonshot introduced in the Kimi K2.5 report: an LLM orchestrator designs its own team of bounded-context workers and fans them out in parallel over a task. We used it on a real codebase and compared the run, side by side, with a single long-context agent.
For wide, shardable work, a swarm of bounded agents can beat one long-context agent on cost and output. The model designs the team; the runtime only has to spawn workers, isolate context, and gather results.
To make it concrete, we picked a job we needed done: a security audit of control-layer, Doubleword’s open-source AI gateway, 512 source files and about 2.4M tokens of unique source. The brief asked for candidate issues across injection, leaked secrets, broken auth, and unsafe file handling. We ran it both ways, one long-context agent and the swarm.
We ran the single agent first: Claude Opus, a 1M-token window, no chunking, just “audit the repo”. It works, but the cost shows up in the loop. An agent loop re-sends the growing transcript with every turn, so by the time our metered run had covered ~7% of the repo, it had already burned 27.7M tokens, 95% of them cache reads, the same transcript shipped back again and again. (These numbers come from an actual metered run.) The agent repeatedly fills its window, hits the compaction ceiling, summarises, and grows again. With prompt caching, the projected full-repo bill is ~$300; without it, ~$1,800. Projected over the full repo, the audit lands around 300M tokens: a 2.4M-token codebase, amplified ×125.
Reading the codebase once costs 2.4M tokens. The remaining ~297M is the agent re-reading what it has already seen.
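To see why the loop amplifies so sharply, here is a minimal sketch of the arithmetic. It assumes a simplified loop in which every turn appends a fixed number of tokens and re-sends the whole transcript, and it ignores compaction and caching. `loop_tokens` and its parameters are illustrative names, not part of the swarm repo.

```python
def loop_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent when every turn re-sends the transcript so far.

    Simplifying assumption: each turn adds the same number of new tokens
    and nothing is ever compacted or summarised.
    """
    total = 0
    transcript = 0
    for _ in range(turns):
        transcript += tokens_per_turn  # new file content / tool output appended
        total += transcript            # the whole transcript goes out again
    return total


# The figures quoted in the text above.
unique_source = 2_400_000       # tokens needed to read the repo once
projected_total = 300_000_000   # projected tokens for the full solo audit

amplification = projected_total / unique_source   # how many times over the repo is paid for
reread_tokens = projected_total - unique_source   # tokens spent on re-reading

print(f"amplification: x{amplification:.0f}")
print(f"re-read tokens: {reread_tokens:,}")
```

In this toy model the total grows with the square of the number of turns, which is why a longer window alone does not make the bill go away.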
Don’t make one agent re-read everything. Split the repo across many bounded workers, each reading only its own slice, once, all at the same time. That’s an agent swarm.
It’s how every company already works. The CEO is the most capable (and most expensive) person in the building, and the wrong one to personally trawl through every file. So they don’t. They hire specialists with tight remits, hand each a bounded task, and never see the mountain of material those specialists wade through. What comes back is a short, high-level summary. The raw work stays with the specialist. The same setup works here, just with agents: the orchestrator plays CEO, the workers are its specialists, and only their conclusions ever travel back up.
The hard part is deciding who should do what.
In February 2026, Moonshot published the Kimi K2.5 technical report (Kimi K2.5: Visual Agentic Intelligence, Kimi Team, February 2026; see also Moonshot's agent swarm post). Its agent-swarm result is the framework we built on: scale out, not just up. A trainable orchestrator spawns specialised sub-agents and runs them in parallel, trained with PARL (Parallel-Agent Reinforcement Learning), where only the orchestrator learns and the sub-agents stay frozen. The headline numbers: 4.5× lower latency than a single agent, and +17.8 points on BrowseComp (60.6 → 78.4 on BrowseComp, a deep-research benchmark, versus the single-agent baseline). What’s new is that the swarm designs itself: decomposition and team width are the model’s call, not a hand-written workflow.
What’s in the weights is only the orchestration instinct: how to decompose, delegate, reconcile. The runtime that makes a swarm real (spawn, isolate, parallelise, aggregate) lives in Moonshot’s hosted product, not in the open weights. An open endpoint gives us what it has always given us: messages and tools in, tool calls and text out.
The project fills that runtime gap. doublewordai/swarm is our from-scratch interpretation of Moonshot’s swarm, built on the Open Responses API and model-agnostic (the default model is moonshotai/Kimi-K2.6). Full credit to Moonshot for the pattern. The repo's README has a "Faithful to Kimi" section spelling out what we reproduced (the self-designing orchestrator, context sharding, the critical-steps metric), what we deliberately dropped (PARL training, the mutating toolbox), and what we added.
Everything we kept from the paper, and everything we added, fits into four principles:
A self-designing orchestrator. The model decides the team and the decomposition, not a hand-written workflow.
Bounded local context. Each worker sees only its slice, and returns only results.
Independent verification. Every candidate finding is challenged before it counts.
Synthesis. One final pass reconciles everything into a deliverable.
Left to right: the repo is cloned and compressed into a budgeted map; the orchestrator designs a team and dispatches scoped tasks in parallel waves; ephemeral workers investigate and return findings only; findings are deduped, challenged by verifiers, and reconciled by a synthesizer into report.md. Four blocks, one per principle.
Block 1: the orchestrator designs the team
The orchestrator gets the repo map up front and can probe with read_file and grep before committing to a plan. Then it builds its team with two tools, create_subagent(name, system_prompt) and assign_task(agent, prompt), the literal tool surface Kimi K2.5/K2.6 were RL-trained on (K2.5 technical report, Appendix E.8). This is --interface kimi, which the repo ships as the default. We also ship --interface structured, where the orchestrator instead calls dispatch_workers([{role, focus, paths}]) and the harness preloads each worker's files, decomposing by directory rather than by task, which keeps the planning turn small on very large repos. The orchestrator authors each specialist’s system prompt itself. In one real audit run it invented a persona we never asked for:
create_subagent(
    name="injection-filesystem",
    system_prompt="You hunt injection and unsafe file access…",
)
assign_task("injection-filesystem", "Audit cli.py …")
assign_task("injection-filesystem", "Trace cost.py …")
The division of labour matters: the model decides who does what; the harness decides which tools each role may hold. Each dispatching turn is a wave: width is parallelism, follow-up waves fill gaps.
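A minimal sketch of how a harness might service those two tool calls and collect one wave of tasks. The `Team` class, the shape of the `tool_call` dict, and the returned status strings are assumptions made for illustration; they are not the repo's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Team:
    """Toy harness state: the agents the model created and the tasks it queued."""
    agents: dict = field(default_factory=dict)  # name -> system prompt
    wave: list = field(default_factory=list)    # (agent name, task prompt) pairs

    def create_subagent(self, name: str, system_prompt: str) -> str:
        self.agents[name] = system_prompt
        return f"created {name}"

    def assign_task(self, agent: str, prompt: str) -> str:
        if agent not in self.agents:
            # The model only gets an error string back; nothing is queued.
            return f"error: unknown agent {agent!r}"
        self.wave.append((agent, prompt))
        return f"queued task for {agent}"

    def handle(self, tool_call: dict) -> str:
        """Route one tool call (hypothetical shape: {'name', 'arguments'})."""
        handlers = {
            "create_subagent": self.create_subagent,
            "assign_task": self.assign_task,
        }
        handler = handlers.get(tool_call["name"])
        if handler is None:
            return f"error: unknown tool {tool_call['name']!r}"
        return handler(**tool_call["arguments"])


team = Team()
team.handle({"name": "create_subagent",
             "arguments": {"name": "injection-filesystem",
                           "system_prompt": "You hunt injection and unsafe file access"}})
team.handle({"name": "assign_task",
             "arguments": {"agent": "injection-filesystem", "prompt": "Audit cli.py"}})
team.handle({"name": "assign_task",
             "arguments": {"agent": "injection-filesystem", "prompt": "Trace cost.py"}})
print(len(team.wave))  # width of this wave
```

The model chooses names, prompts, and tasks; the harness only records them and rejects calls it does not recognise.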
Block 2: workers see only their slice
The paper’s key lever is context sharding. Each task spawns a fresh, throwaway agent. It gathers the context it needs (read_file, grep, plus whatever capability tools its role grants), works for a few rounds, calls submit_results, and is gone. Only schema-valid results and a status line return to the orchestrator; the worker’s research is discarded, never re-sent.
The same boundary keeps both context and cost under control: no single context overflows, and per-agent token usage stays low. That is why fanning out hundreds of workers stays cheap, and why the 300M-token bill from earlier never materialises. (The v1 toolset is deliberately read-only, so it’s safe to point at any repo.)
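The sharding boundary can be sketched with plain asyncio. This is an illustration under stated assumptions, not the repo's code: `run_worker` and `run_wave` are hypothetical names, the model call is replaced by a trivial string check, and the concurrency cap is an arbitrary choice.

```python
import asyncio


async def run_worker(task: str, files: dict) -> dict:
    """Ephemeral worker: reads only its own slice and returns a small result."""
    # Private working context: it lives and dies inside this call.
    scratch = list(files.values())
    await asyncio.sleep(0)  # stand-in for the model and tool round-trips
    # Stand-in for real analysis: flag files in this slice that call eval().
    findings = [path for path, text in files.items() if "eval(" in text]
    # Only the compact result leaves the worker; `scratch` is discarded.
    return {"task": task, "status": "ok", "findings": findings}


async def run_wave(shards: list, max_parallel: int = 8) -> list:
    """Run one wave of (task, files) shards concurrently, capped at max_parallel."""
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(task: str, files: dict) -> dict:
        async with sem:
            return await run_worker(task, files)

    # gather() returns results in the same order as the shards.
    return await asyncio.gather(*(bounded(t, f) for t, f in shards))


shards = [
    ("audit a", {"a.py": "x = eval(user_input)"}),
    ("audit b", {"b.py": "print('hello')"}),
]
results = asyncio.run(run_wave(shards))
print(results)
```

The orchestrator's context grows by one small dict per worker, regardless of how much each worker read.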
Block 3: every finding meets a skeptic
A swarm of enthusiastic hunters produces enthusiastic false positives. So before anything counts, each candidate finding is handed to an independent verifier whose only job is to refute it, and which defaults to “not real” when unsure. Survivors ship with adjusted severity; refuted findings are dropped and counted.
This stage is our addition; the paper’s orchestrator reconciles inline. It’s optional and per-brief; --verify-votes N turns it into a majority-vote panel.
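One way to read the majority-vote panel with a "not real" default, as a sketch. The vote labels (`"real"`, `"refuted"`, `"unsure"`), the `id` field, and both function names are assumptions for illustration; the repo may tally votes differently.

```python
def is_confirmed(votes: list) -> bool:
    """A finding survives only if a strict majority of verifiers vote 'real'.

    Anything else ('refuted', 'unsure', or no votes at all) counts against it,
    which is the default-to-"not real" rule.
    """
    real = sum(1 for v in votes if v == "real")
    return real > len(votes) / 2


def filter_findings(candidates: list, votes_by_id: dict) -> tuple:
    """Split candidates into confirmed findings and a count of refuted ones."""
    confirmed = [c for c in candidates if is_confirmed(votes_by_id.get(c["id"], []))]
    refuted = len(candidates) - len(confirmed)
    return confirmed, refuted


candidates = [{"id": "f1"}, {"id": "f2"}, {"id": "f3"}]
votes = {
    "f1": ["real", "real", "unsure"],     # 2 of 3: kept
    "f2": ["real", "refuted", "unsure"],  # 1 of 3: dropped
    # f3 received no votes: dropped
}
confirmed, refuted = filter_findings(candidates, votes)
print([c["id"] for c in confirmed], refuted)
```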
Block 4: one pass writes the report
A single tool-free synthesis call reconciles the confirmed findings into the deliverable: report.md for humans, findings.json for machines. Its shape comes from the brief, not the engine, which brings us to the part we like most.
Swap the brief, keep the engine
Nothing in the engine mentions auditing. The loop (orchestrate, shard, verify, synthesize) is byte-identical for every task. What a swarm does is defined by a brief: ~50 lines of data binding prompts to roles, a result schema (enforced at submit_results: invalid items are dropped, not trusted), and a tool selection per role:
from . import Brief, register

register(Brief(
    name="onboarding",
    description="Document a codebase's subsystems for newcomers.",
    orchestrator_prompt="You are the lead author … call dispatch_workers once …",
    worker_prompt="Document ONLY your assigned files: purpose, key components, deps …",
    synthesis_prompt="Assemble an onboarding guide: overview, per-subsystem sections …",
    result_schema={...},
    result_key="sections",
    worker_tools=("read_file", "grep"),
    verifier_prompt=None,  # set a prompt to switch the adversarial verify stage on
))
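The "invalid items are dropped, not trusted" rule at submit_results could look roughly like the sketch below. The required fields and the `validate_items` helper are hypothetical; the real result_schema is elided above, and the repo may well use a full JSON Schema validator instead of this hand-rolled check.

```python
# Hypothetical required fields for one result item (field name -> expected type).
REQUIRED_FIELDS = {"title": str, "file": str, "line": int, "severity": str}


def validate_items(items: list) -> tuple:
    """Keep items that carry every required field with the right type.

    Returns (kept_items, dropped_count). Invalid items are dropped silently
    rather than repaired or passed upstream.
    """
    kept = []
    dropped = 0
    for item in items:
        ok = isinstance(item, dict) and all(
            isinstance(item.get(name), expected)
            for name, expected in REQUIRED_FIELDS.items()
        )
        if ok:
            kept.append(item)
        else:
            dropped += 1
    return kept, dropped


submitted = [
    {"title": "Path traversal", "file": "cli.py", "line": 42, "severity": "high"},
    {"title": "Missing line number", "file": "cost.py", "severity": "low"},
    "not even a dict",
]
kept, dropped = validate_items(submitted)
print(len(kept), dropped)
```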
Two briefs ship in the box, audit and onboarding; new briefs fit in an afternoon: a dependency review (one worker per dependency: version drift, advisories, upgrade risk), a refactor plan (workers map usage per module, the synthesizer sequences the steps), wide research (one worker per source; verifiers refute unsupported claims).
The two runs, side by side:
| | Solo agent (Claude Opus) | Swarm (Kimi K2.6) |
|---|---|---|
| Tokens | ~300M · projected | 5.6M · measured |
| Cost | ~$300 | ~$6.70 |
| vs. the 2.4M read-once floor | ×125 | ×2.3 |
The swarm’s run, measured: 348 API calls, 5.2M tokens in, ~450k out. That is about 53× fewer tokens and 45× cheaper than the solo projection, at Doubleword’s Kimi K2.6 pricing ($0.95/M input, $4/M output). (Solo figures are projected from the metered partial run, with prompt caching priced in; swarm figures are measured.) Every run also writes summary.json with tokens, cost, coverage, and the paper's critical-vs-total step counts: speedup = total / critical scores how well the orchestrator parallelised the run, the way the paper scores it. The verifier stage refuted and dropped roughly half of the candidate findings before any of them reached us, so what survived came with severity, file:line, and a suggested fix attached.
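The ratios above can be re-derived from the quoted token counts and prices. The sketch below uses only figures from the text; the `speedup` helper mirrors the total / critical formula, and the step counts passed to it are made-up illustrative values, not numbers from the run.

```python
# Measured swarm run (from the text).
tokens_in, tokens_out = 5_200_000, 450_000
price_in, price_out = 0.95, 4.00  # dollars per million tokens

swarm_tokens = tokens_in + tokens_out
swarm_cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Projected solo run (from the text).
solo_tokens, solo_cost = 300_000_000, 300.0
read_once_floor = 2_400_000

token_ratio = solo_tokens / swarm_tokens
cost_ratio = solo_cost / swarm_cost
floor_multiple = swarm_tokens / read_once_floor


def speedup(total_steps: int, critical_steps: int) -> float:
    """Parallelism score: total steps divided by steps on the critical path."""
    if critical_steps <= 0:
        raise ValueError("critical_steps must be positive")
    return total_steps / critical_steps


print(f"swarm cost: ${swarm_cost:.2f}")
print(f"tokens: {token_ratio:.0f}x fewer, cost: {cost_ratio:.0f}x cheaper")
print(f"multiple of read-once floor: x{floor_multiple:.1f}")
print(f"illustrative speedup: {speedup(120, 15):.1f}")  # hypothetical step counts
```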
Nobody’s waiting: the flex tier
A swarm is a just-get-it-done workload: hundreds of concurrent calls, and no human watching any single one. It’s throughput-bound: what matters is when the whole wave lands, not when each call returns.
Doubleword’s flex tier is priced for this. Individual calls may run longer, but global throughput holds, so end-to-end wall-clock time stays roughly the same, at ~30% off. (Tier discounts are per-model; ~30% is Doubleword's flex pricing for Kimi K2.6 at the time of the run.) swarm compare <brief> --repo … runs the identical workload on both tiers and writes the wall-clock / token / cost table, so you can measure the trade on your own job.
service_tier = "flex" # was "priority"
The quickest path is the dw CLI, which sets up auth, endpoint, and model in one step:
dw login
dw examples clone swarm
cd swarm
dw project setup
Then point a brief at a repo:
dw project run audit -- --repo psf/requests --max-files 20 # audit a GitHub repo
dw project run onboarding -- --path ./my-service # document a local directory
dw project run audit -- --repo psf/requests --service-tier flex # run on flex tier for 30% cost saving!
dw project run report # print the latest run's report
Each run writes results/<brief>-<slug>/: the synthesized report.md, structured findings.json, swarm-tree.json (the team the orchestrator designed, worth reading at least once), and summary.json with tokens, cost, and step counts.
Everything is open: the harness is on GitHub, it speaks the Open Responses API, and it’s model-agnostic. Use -m/--model (plus --provider when needed) to pick any tool-calling model.
Doubleword is built for this kind of high-throughput inference, and we’d love to see what you fan out. Clone the swarm, write a brief, and go run parallel agents.
@misc{doubleword-the-swarm-that-designs-itself,
title = {The swarm that designs itself},
author = {Peter Bhabra},
year = {2026},
howpublished = {Doubleword Blog},
url = {https://blog.doubleword.ai/the-swarm-that-designs-itself},
}