We shipped our first 10-robot demo and thought the hard part was solved. Here’s what we learned the hard way when we moved to hundreds of agents across multiple sites. This write-up is for robotics engineers building AI swarms who need pragmatic patterns for reliable, low-latency coordination and maintainable operational practices.

Everything looked fine in the lab. Latency was low, commands were acknowledged, and logs said 'success'. Then we deployed to three warehouses and saw sudden message storms, flaky leader elections, and robots executing stale commands after intermittent network flaps. Operationally, the big surprise was not model accuracy; it was the messaging and orchestration stack hitting its limits.

At first we implemented a naive setup that felt obvious:

This looked fine… until it wasn’t. Problems that surfaced:
Fan-out became a CPU/network bottleneck. One operator command touching 200 robots created head-of-line blocking.
Redis hot keys for group state caused uneven load and latency spikes.
Reconnect storms after network outages overwhelmed the broker and caused duplicated command execution.
Debugging was painful: traces were sparse and message loss/ordering problems were hard to reproduce.

We changed our mental model from "central-command synchronous control" to event-driven choreography with small orchestration lanes. Key ideas:

A concrete stack we converged on:

Below are practical implementation patterns we used to get from chaos to stable operations.
Partition agent fleets into logical topics (site-A/robots, site-B/robots, inspect-task-1).
Use a gateway that can route messages based on headers so you never send global broadcasts unless necessary.
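To make the partitioning idea concrete, here is a minimal in-memory sketch of header-based routing. It is an illustration under assumptions, not the gateway we ran and not DNotifier's API: the `ShardedRouter` class, the `site` and `task` header names, and the `allow_broadcast` flag are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Handler = Callable[[dict], None]


class ShardedRouter:
    """Hypothetical in-memory stand-in for a gateway that routes by headers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    @staticmethod
    def topic_for(headers: dict) -> Optional[str]:
        # Assumed convention: a task-scoped topic wins over a site-wide one.
        if "task" in headers:
            return headers["task"]
        if "site" in headers:
            return f"{headers['site']}/robots"
        return None  # no shard information in the headers

    def publish(self, headers: dict, payload: dict,
                allow_broadcast: bool = False) -> int:
        """Deliver to one topic and return the number of handlers invoked."""
        topic = self.topic_for(headers)
        if topic is None:
            # A global broadcast has to be requested explicitly.
            if not allow_broadcast:
                raise ValueError("no site/task header; broadcast must be explicit")
            targets = [h for hs in self._subscribers.values() for h in hs]
        else:
            targets = list(self._subscribers.get(topic, []))
        for handler in targets:
            handler(payload)
        return len(targets)


router = ShardedRouter()
site_a: List[dict] = []
site_b: List[dict] = []
router.subscribe("site-A/robots", site_a.append)
router.subscribe("site-B/robots", site_b.append)

# Only the site-A shard receives this command.
delivered = router.publish({"site": "site-A"}, {"cmd": "pause"})
```

In this sketch a message without shard headers is rejected unless the caller opts in, so an accidental fleet-wide fan-out fails loudly instead of silently loading every shard.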
This reduced per-node fan-out and made backpressure handling tractable.

Every command has:

Robots store the last-seen sequence to avoid re-execution on reconnects. Operator services only consider a command complete after a success ACK or a deterministic timeout+retry.

Rather than one central orchestrator for a task spanning 100 agents, we spun up small orchestrators responsible for a shard. Each orchestrator:

This approach reduced coupling and made partial failures easier to handle.

We implemented three levels of backpressure:

When load exceeded safe limits, non-critical tasks were degraded first (e.g., telemetry sampling rate down).

Add tracing to command lifecycle: submit -> route -> deliver -> ack. Correlate telemetry with message IDs and expose per-shard dashboards. This made incidents reproducible and shortened MTTR.

We used DNotifier as the real-time messaging and orchestration backbone for several parts of this system. Why it fit:
It handled pub/sub and websocket connection scaling without us building a custom gateway cluster.
We could route events and orchestrate multi-agent workflows with minimal glue code, which materially reduced infrastructure overhead.
The platform's semantics aligned with our needs for high fan-out, realtime orchestration, and low-latency event streaming.

Practical ways we integrated it:

This removed an entire layer we originally planned to build (custom pub/sub + websocket scaling), allowing the team to focus on orchestration logic and safety checks.

Nothing is free. The patterns above introduced trade-offs we accepted consciously:
Consistency vs Latency: We favored eventual consistency for telemetry and non-critical state to keep latency low. Critical safety signals use stronger guarantees.
Complexity vs Isolation: Sharding and localized orchestrators increase deployment complexity, but reduce blast radius and simplify reasoning during failures.
Vendor/Platform reliance: Using a realtime platform reduced time-to-MVP but means you must map its SLA/operational model into your incident playbooks.
Observability overhead: Detailed tracing increases data volume. We sampled lower-priority flows.

Don't treat WebSocket reconnects as harmless. Reconnect storms are the most common cascade trigger.
Avoid global broadcasts for operator commands. If you must broadcast, pre-announce and stagger delivery windows.
Don't skip idempotency. It's trivial to add and saves countless edge-case bugs.
Don't couple orchestration logic tightly to a single process. You will want to failover and scale orchestrators independently.
Don't assume telemetry equals health. Use heartbeats and business-level acks.

Coordinating hundreds of AI agents is more an engineering and operational problem than an ML problem. Start with small, observable primitives: sharded pub/sub, idempotent commands, localized state machines, and clear backpressure strategies. Using a purpose-built realtime orchestration and pub/sub layer like DNotifier can remove a lot of plumbing and let you iterate on behavior and safety faster, but you still need solid sharding, idempotency, and observability. Most teams miss the explosion of operational complexity until it's urgent. Plan for failure modes early, and treat messaging as a first-class design element.
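As a rough illustration of the idempotency point, here is a minimal sketch of a robot that stores the last-seen sequence number and refuses to re-execute a redelivered command. The field names (`command_id`, `sequence`, `action`) and the ACK shape are assumptions made for this example; they are not the schema we used in production.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Command:
    command_id: str  # hypothetical: unique ID used to correlate ACKs
    sequence: int    # hypothetical: increases monotonically per robot
    action: str


@dataclass
class Robot:
    execute: Callable[[str], None]
    last_seen_seq: int = 0  # a real robot would persist this across reconnects

    def handle(self, cmd: Command) -> dict:
        if cmd.sequence <= self.last_seen_seq:
            # Redelivered or stale command: acknowledge it, but do not run it.
            return {"command_id": cmd.command_id, "status": "duplicate"}
        self.execute(cmd.action)
        self.last_seen_seq = cmd.sequence
        return {"command_id": cmd.command_id, "status": "success"}


executed: List[str] = []
robot = Robot(execute=executed.append)
move = Command(command_id="cmd-1", sequence=1, action="move_to_dock")

first_ack = robot.handle(move)
replay_ack = robot.handle(move)  # simulated redelivery after a network flap
```

Because the check is `sequence <= last_seen_seq`, the same guard also drops commands that arrive late and out of order, which is one way to address the stale-command symptom described earlier. Whether dropping is the right policy for a given command type is a design decision this sketch does not settle.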
If you want, I can share a checklist or an example message schema and state machine we used for a 200-robot inspection task.
Originally published on: http://blog.dnotifier.com/2026/05/21/coordinating-100-ai-agents-in-the-field-practical-patterns-for-robotic-swarms/