The Myth of Low Latency: Why Event Meshes Make Your System Slow

The article describes Veltrix's experience transitioning from a monolithic system to an event mesh to reduce high failure rates. Their initial implementation using Apache Kafka resulted in 40% request retries during peak hours, while switching to a RabbitMQ-based request-response mesh reduced failures to 2-5% but increased average latency by 20-30ms. The author concludes that an ideal solution would combine Kafka for low-latency event routing with RabbitMQ for request-response messaging to achieve both low latency and low failure rates.

At Veltrix we had a simple monolithic service that handled everything: orders, products, inventory and so on. During peak hours, certain pages saw failure rates of 30-40% in extreme cases. We wanted to break the monolith apart and decouple the pieces with an event mesh to bring those failure rates down.

Our first event mesh was built on top of Apache Kafka. We were excited because we had heard about its low latency and scalability. However, we quickly ran into the limits of our Kafka setup, specifically around the max.in.flight.requests.per.connection and replication.factor settings, which led to a high number of retries: during peak hours, 40% of all requests on our e-commerce platform needed at least one retry. The failures left hundreds of messages in the dead-letter queue, and our system would end up in an incorrect state.

We then moved to RabbitMQ, which speaks the AMQP 0-9-1 messaging protocol, and implemented what we called a Request-Response event mesh. In this design every operation is a pair of events: the caller publishes a request event and then waits for the matching response event, which confirms that the request has been processed. Because we used RabbitMQ's asynchronous publish/subscribe model, our code was a lot simpler than the Kafka version with its multiple threads and connection pools. That meant fewer threading issues and a lower failure rate of 2-5%.

However, it came at a cost: 20-30ms of added latency on average. After shifting to the Request-Response event mesh, the request.duration metric in New Relic showed a 30-50% increase in request latency. In exchange, we saw a 70% decrease in failed requests. Our dead-letter queue was almost empty, and the max retries metric dropped from 40 requests to 5 requests on average.

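
To make the request and response event pair concrete, here is a minimal sketch of the pattern in Python. It is not our production code and it does not talk to RabbitMQ: an in-memory queue.Queue stands in for the broker, and the names serve, call, and reserve_stock are made up for illustration. With a real AMQP 0-9-1 broker, the reply_to and correlation_id message properties would play the roles of the private reply queue and the ID shown here.

```python
import queue
import threading
import uuid


def serve(request_queue, handler):
    """Consume request events and publish one response event per request."""
    while True:
        message = request_queue.get()
        if message is None:  # sentinel: stop the worker
            return
        reply_to, correlation_id, payload = message
        reply_to.put((correlation_id, handler(payload)))


def call(request_queue, payload, timeout_s=1.0):
    """Publish a request event, then block until its response event arrives."""
    reply_to = queue.Queue()           # private reply queue for this call
    correlation_id = uuid.uuid4().hex  # ties the response to the request
    request_queue.put((reply_to, correlation_id, payload))
    try:
        got_id, result = reply_to.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"no response within {timeout_s}s") from None
    if got_id != correlation_id:
        raise RuntimeError("response does not match the request")
    return result


def reserve_stock(order):
    """Hypothetical handler standing in for an inventory service."""
    return {"order_id": order["order_id"], "status": "reserved"}


request_queue = queue.Queue()
worker = threading.Thread(
    target=serve, args=(request_queue, reserve_stock), daemon=True
)
worker.start()

reply = call(request_queue, {"order_id": 42})
request_queue.put(None)  # shut the worker down
worker.join()
print(reply)
```

The blocking wait inside call is where the extra latency comes from: the caller cannot continue until the response event has made the full round trip, and every caller needs a timeout for the case where it never does.
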
As a direct consequence of this design, however, I had to increase our request timeout to match the new latency. That had a cascading effect: the timeout then had to be raised even further to account for the high latency of our cache requests (80ms on average for a cache GET).

If I had to go back, I would probably use a mix of both systems we tried: Kafka for event routing and RabbitMQ for request-response. Messages published to RabbitMQ would use the persistent delivery mode, and the Kafka producer would be set to acks=all. That should give us a low-latency event mesh for our e-commerce platform with low failure rates (less than 1%).
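
As a rough illustration of both points, the sketch below writes down the two durability settings and adds up a timeout budget for a request that crosses the mesh and then hits the cache. The dictionaries are plain illustrative data, not arguments for any particular client library, and the Hop class, the timeout_budget_ms helper, the 2x headroom factor, and the 50ms handler time are all assumptions made for the example. Only the 30ms and 80ms figures come from the numbers above.

```python
from dataclasses import dataclass

# Durability settings for the mixed design. Kafka's producer `acks` setting
# accepts 0, 1, or "all"; AMQP 0-9-1 delivery mode 2 marks a message persistent.
KAFKA_PRODUCER_SETTINGS = {"acks": "all"}
RABBITMQ_MESSAGE_PROPERTIES = {"delivery_mode": 2}


@dataclass(frozen=True)
class Hop:
    """One sequential step on the request path and its typical latency."""
    name: str
    typical_ms: float


def timeout_budget_ms(hops, headroom=2.0):
    """Sum the typical latency of each hop and scale it by a headroom factor."""
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    return headroom * sum(hop.typical_ms for hop in hops)


request_path = [
    Hop("service handler", 50.0),  # hypothetical figure, for illustration only
    Hop("mesh round trip", 30.0),  # upper end of the 20-30ms overhead
    Hop("cache GET", 80.0),        # average cache GET latency quoted above
]

budget_ms = timeout_budget_ms(request_path)
print(f"timeout budget: {budget_ms:.0f} ms")
```

Writing the budget down per hop makes the cascade visible: each hop added to the path raises the timeout of every caller upstream of it.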