Going from 6 to 8 subagents pushed my orchestrator latency from 22s to 31s, the opposite of what I expected.
I'd been running a fanout pattern in my ad-creative analysis SaaS: spawn N subagents in parallel, collect results, merge into one verdict. The parallel part worked fine. Individual subagents finished in 9–12 seconds regardless of how many I spawned. The problem was everything after that.
With 8 subagents, each returning ~800 tokens of analysis, the orchestrator was assembling a 6,400-token context before it could even call the LLM once. On Cloudflare Workers, serializing 8 JSON blobs into a single prompt string was taking 4+ seconds of pure CPU time before the first API call fired. The log entry that made it obvious:
[worker:orchestrator] WARN
aggregate_context_size=52480 bytes
serialize_duration=4312ms
reason="context_assembly_backpressure"
Measured across 3 weeks of production data:
| Subagents | Total latency | Aggregation share |
|---|---|---|
| 2 | 14.2s | 18% |
| 4 | 16.8s | 31% |
| 6 | 22.4s | 47% |
| 8 | 31.1s | 61% |
At 6 subagents, aggregation consumed nearly half the wall-clock time (47%); at 8, it reached 61%. The fanout was fast. The funnel was the bottleneck.
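To make the split concrete, the table can be turned into absolute seconds by multiplying each total by its aggregation share. This is a small arithmetic sketch over the four rows above; the type and function names are made up for illustration.

```typescript
// One row of the latency table above.
interface LatencyRow {
  subagents: number;
  totalSeconds: number;
  aggregationShare: number; // fraction of wall-clock time, e.g. 0.18 for 18%
}

const rows: LatencyRow[] = [
  { subagents: 2, totalSeconds: 14.2, aggregationShare: 0.18 },
  { subagents: 4, totalSeconds: 16.8, aggregationShare: 0.31 },
  { subagents: 6, totalSeconds: 22.4, aggregationShare: 0.47 },
  { subagents: 8, totalSeconds: 31.1, aggregationShare: 0.61 },
];

// Round to one decimal place for display.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

// Split a row's total latency into aggregation time and everything else.
function splitLatency(row: LatencyRow) {
  const aggregation = row.totalSeconds * row.aggregationShare;
  return {
    subagents: row.subagents,
    aggregationSeconds: round1(aggregation),
    otherSeconds: round1(row.totalSeconds - aggregation),
  };
}

const breakdown = rows.map(splitLatency);
for (const b of breakdown) {
  console.log(`${b.subagents} subagents: aggregation ${b.aggregationSeconds}s, other ${b.otherSeconds}s`);
}
```

Aggregation grows from about 2.6s at 2 subagents to about 19.0s at 8, while everything else stays between 11.6s and 12.1s. Nearly all of the added latency sits in the funnel.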
The fix wasn't reducing parallelism. It was changing what the orchestrator actually reads. Instead of passing full results to the aggregation LLM call, each subagent now writes to R2 on completion. The orchestrator pulls only a three-field summary struct per agent (`verdict`, `confidence`, `top_signal`). Eight agents still produce eight files, but the aggregation context dropped from ~6,400 tokens to ~1,100. Monthly cost for that one pipeline step: $207 → $38.
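Here is a minimal sketch of that shape, not the production code. Only the three summary field names come from the description above. The `FullResult` shape, the key layout, and the `ObjectStore` interface are assumptions, and the in-memory store is a stand-in for a bucket rather than the R2 binding API.

```typescript
// Hypothetical result shape. Only the three summary field names
// (verdict, confidence, top_signal) come from the post.
interface FullResult {
  agentId: string;
  verdict: string;
  confidence: number;
  top_signal: string;
  analysis: string; // the long write-up that stays in storage
}

type Summary = Pick<FullResult, "verdict" | "confidence" | "top_signal">;

// Minimal stand-in for an object store such as an R2 bucket.
// This is NOT the R2 binding API, just enough surface for the sketch.
interface ObjectStore {
  put(key: string, value: string): Promise<void>;
  get(key: string): Promise<string | null>;
}

class MemoryStore implements ObjectStore {
  private data: Record<string, string> = {};
  async put(key: string, value: string): Promise<void> {
    this.data[key] = value;
  }
  async get(key: string): Promise<string | null> {
    const value = this.data[key];
    return value === undefined ? null : value;
  }
}

// Assumed key layout: one file per agent per run.
function resultKey(runId: string, agentId: string): string {
  return `runs/${runId}/${agentId}.json`;
}

// Subagent side: write the full result once, on completion.
async function writeResult(store: ObjectStore, runId: string, result: FullResult): Promise<void> {
  await store.put(resultKey(runId, result.agentId), JSON.stringify(result));
}

// Keep only the three fields the aggregation call needs.
function toSummary(result: FullResult): Summary {
  return { verdict: result.verdict, confidence: result.confidence, top_signal: result.top_signal };
}

// Orchestrator side: load each agent's file and project it down to a summary.
// Agents with no file yet are skipped rather than treated as errors.
async function loadSummaries(store: ObjectStore, runId: string, agentIds: string[]): Promise<Summary[]> {
  const summaries: Summary[] = [];
  for (const agentId of agentIds) {
    const raw = await store.get(resultKey(runId, agentId));
    if (raw === null) continue;
    summaries.push(toSummary(JSON.parse(raw) as FullResult));
  }
  return summaries;
}

// Build the aggregation prompt from summaries only, never from the full analysis.
function buildAggregationPrompt(summaries: Summary[]): string {
  const lines = summaries.map(
    (s, i) => `agent ${i + 1}: verdict=${s.verdict} confidence=${s.confidence} top_signal=${s.top_signal}`
  );
  return ["Merge these subagent summaries into one verdict:"].concat(lines).join("\n");
}
```

The point of the design is that the prompt grows with the number of agents times three short fields, not with the length of each analysis. In this sketch the orchestrator still parses each full file before projecting it; if that read turned out to be expensive, one option would be to store the summary as its own small object.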
The counterintuitive part: the bottleneck wasn't the LLM. It was the context assembly happening before the LLM even got called.
I wrote up the full breakdown — including the R2 chunking pattern, the D1 counter approach for tracking partial completions without polling, and the KV-based loop guard for failed aggregation retries — over on riversealab.com.