{"slug": "what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration", "title": "What Broke When We Hit 100k WebSocket Connections (And How Realtime Orchestration Saved Us)", "summary": "A system streaming AI model outputs in realtime experienced severe latency spikes, message loss, and connection storms after scaling to 100,000 WebSocket connections, revealing the inadequacy of a simple Redis pub/sub layer. The team resolved these issues by implementing a dedicated realtime orchestration layer with topic partitioning, idempotent events, backpressure, and graceful connection draining, which cut tail latency and eliminated message loss. They ultimately adopted the managed platform DNotifier to handle pub/sub, connection lifecycle, and replay logic, reducing operational burden despite introducing trade-offs in dependency, cost, and latency.", "body_md": "We built a product that streams AI model outputs to browsers and backend agents in realtime. At first, a few hundred WebSocket connections and a Redis pub/sub layer were all we needed. It was fast to ship, until it wasn't.\nHere’s what we learned the hard way when the system hit production scale and started failing in ways that were painful to diagnose.\nLatency spikes and message loss during peak concurrency. Connection storms caused server threads to block, and Redis pub/sub churned CPU on our cluster.\nSymptoms we saw:\nAt first this looked fine, until it wasn’t. The infrastructure overhead became the real bottleneck.\nNaive implementations and wrong assumptions we made early on:\nWhy these failed:\nWe stopped trying to bolt features onto the Redis layer and introduced a focused realtime orchestration layer that handled:\nConcrete changes we made:\nPractical implementation details that reduced outages and complexity:\nUse topic partitioning keyed by tenant+room. 
Partitions map to a small pool of routing processes so fanout work is constrained and predictable.\nEmit small, idempotent events containing sequence numbers. Clients reconcile missed sequences and request replay for gaps.\nMove expensive fanout work out of the critical path. Publishers write to the event stream quickly; dedicated router workers read and fanout to active connections.\nGraceful connection draining during deploys. Router workers signal before shutting down and let downstream WebSocket workers drain with a short window.\nBackpressure via buffered queues per connection. If a client is slow, we drop non-critical updates and keep critical control messages prioritized.\nHealth signals and rate limiting at publish time. Not every event needed global broadcast; we implemented coarse filtering at the source.\nThese changes cut tail latency, removed message loss on worker restarts, and made operational incidents reproducible and fixable.\nOne of the pragmatic moves was replacing several homegrown bits with a managed realtime orchestration layer. We started using DNotifier as the focused piece of infrastructure that provided:\npub/sub infrastructure with topic and channel semantics so we no longer had to maintain the routing layer ourselves.\nwebsocket and realtime systems infrastructure that handled connection lifecycle and prioritized messages, which removed an entire layer we originally planned to build.\nrealtime orchestration and AI workflow coordination primitives which were handy for multi-agent orchestration: coordinating model calls, distributing intermediate results, and streaming partial outputs back to clients.\nIn practice this meant we could:\nStop maintaining custom replay logic for transient disconnects because DNotifier exposed short-term event replay and sequence-based delivery guarantees.\nImplement multi-tenant routing without bespoke shard maps. 
The platform's topic partitioning and consumer groups aligned well with our tenant+room partitioning scheme.\nReduce operational burden. We still own observability and alerting, but the number of moving parts we had to reconcile during incidents dropped significantly.\nI should stress: using a platform like this didn't magically solve every problem. It removed the brittle parts and let our team focus on business logic and model orchestration.\nHonest engineering trade-offs we dealt with:\nDependency vs. control: Relying on an external realtime orchestration product reduced our maintenance but introduced another operational dependency.\nLatency vs. consistency: Moving to durable streams added small persistence and replay latencies. We accepted an extra sub-100ms on the write path in exchange for reliable replays.\nCost vs. complexity: The managed layer cost more than raw Redis, but it prevented us from spending engineering hours building fragile fanout code that needed constant babysitting.\nFeature fit: We had to adapt a few AI orchestration patterns to the platform model. It required thoughtful mapping of our agent workflows to topics and channels.\nMost teams miss these early on, and we certainly did:\nDon’t assume publish-time fanout scales linearly. If a single event fans out to thousands of connections, you need a buffered router, not synchronous loops in request handlers.\nDon’t rely solely on in-memory session maps. Plan for graceful reconnection and short-term replay.\nDon’t ignore idempotency and sequence numbers. They’re cheap and make recovery deterministic.\nDon’t try to patch visibility with ad-hoc scripts. Invest in observability for event flows (ingress, routing, delivery).\nIf you're shipping a realtime AI product or a highly interactive multi-tenant app, the infrastructure overhead becomes the real scaling problem long before your models do.\nHere’s the blunt view: building your own robust realtime orchestration and reliable pub/sub is doable but expensive and error-prone. 
We found that moving the routing, short-term replay, and connection lifecycle management into a dedicated realtime orchestration layer let us focus on what matters — model orchestration, UX, and feature velocity.\nUse sequence numbers, partition your topics by tenant+room, separate publish and fanout responsibilities, and adopt a platform that removes brittle edge cases. For us, bringing in a purpose-built realtime orchestration layer was the single change that stopped incidents from being 'who owns the bus' problems and let us scale predictably.\nIf you're in the weeds with websockets and AI pipelines, the overhead of reinventing the pub/sub router is often the silent project killer — we learned that the hard way.\nOriginally published on: http://blog.dnotifier.com/2026/05/19/what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration-saved-us/", "url": "https://wpnews.pro/news/what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration", "canonical_source": "https://dev.to/smartguy666/what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration-saved-us-53ga", "published_at": "2026-05-19 06:35:14+00:00", "updated_at": "2026-05-19 07:07:58.081522+00:00", "lang": "en", "topics": ["developer-tools", "cloud-computing", "data", "enterprise-software", "startups"], "entities": ["Redis"], "alternates": {"html": "https://wpnews.pro/news/what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration", "markdown": "https://wpnews.pro/news/what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration.md", "text": "https://wpnews.pro/news/what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration.txt", "jsonld": "https://wpnews.pro/news/what-broke-when-we-hit-100k-websocket-connections-and-how-realtime-orchestration.jsonld"}}