Stop Measuring Agent Infrastructure by Gateway Latency Alone

A developer argues that the industry's focus on LLM gateway latency benchmarks is misguided for agent systems. Production agents require session persistence, cost attribution, model routing, fallback policies, sandbox isolation, and observability—features that latency benchmarks ignore. In a typical 500ms agent workflow, even the fastest gateway saves only 6% of total latency, making other concerns more critical.

I've been watching the LLM gateway benchmarks get faster. Bifrost at 11 microseconds, Helicone at 8 milliseconds, LiteLLM at 8ms. On single requests, the math is brutal: Bifrost is 720x faster than LiteLLM.

But this week I watched three teams benchmark gateways, pick based on latency, deploy to production, and then realize they'd solved the wrong problem. The issue isn't the benchmarks. The issue is what they're benchmarking. And what they're not.

Here's what a typical gateway latency benchmark does:

This makes sense if your application is a chat interface. One user sends one message. Low latency feels good. You route that call through a gateway and add 11 microseconds? Invisible. Phenomenal.

But agent systems don't work like chat interfaces. A production agent making a decision typically isn't making one LLM call. It's making many:

That's 5–15 calls per decision. Some agents do 50+. If each call goes through your gateway, the latency compounds:

The relative difference shrinks. But (and this is the part the benchmarks hide) in production, agents don't run in isolation. Multiple agents run concurrently. Tool calls can block. Fallbacks trigger retries. Cost attribution happens per request. Session state needs to persist across crashes.

The gateway overhead is one component of agent latency. It's not the only component, and it's usually not the largest.

I talked to five teams this month running coding agents in production. Here's what they actually cared about, ranked by impact:

1. Session Persistence
"If our agent crashes mid-task, we lose everything. The benchmark didn't mention session state at all." They needed agents to survive pod restarts, maintain tool call history, and resume from where they left off.

2. Cost Attribution
"Our CFO asked what each agent decision costs. The gateway's latency benchmark didn't tell us that." They needed to tag requests by agent, workflow, team, and user, then roll up costs per agent and per decision.
Latency benchmarks measure speed, not cost per decision.

3. Model Routing
"We use Claude for complex tasks, GPT-4 for speed, and open-source for cheap calls. The fastest gateway doesn't route on task complexity." They needed conditional routing: "If this agent is handling a finance decision, use Claude. If it's a simple lookup, use a cheaper model." Bifrost at 11µs overhead doesn't matter if it can't route based on decision type.

4. Fallback & Retry Policy
"Our tool sometimes fails. We need to know how many retries happened and why, not just how many total requests went through." They needed to instrument retry loops and prevent cost spirals. A gateway that handles 10,000 RPS but logs every retry identically isn't helping.

5. Sandbox Isolation
"Each agent gets its own session. Tools run in isolated sandboxes. The gateway latency benchmark doesn't mention sandboxes at all." They needed agents to run in per-team, per-workflow sandboxes with resource limits and audit trails.

6. Observability & Debugging
"When an agent makes a bad decision, we need to replay it. We need to see every tool call, every model invocation, every decision point." They needed structured tracing, not just latency metrics.

The gateway latency benchmark measures exactly zero of these.

Let's do the math on a real workflow:

An agent processes a customer support ticket. It:

Total latency: ~400–600ms

Gateway overhead in this flow:

So Bifrost saves 31ms on a 500ms workflow. That's 6%. Important? Sure. But not more important than cost governance, session persistence, and model routing. And LiteLLM at 8ms overhead is already dwarfed by the actual workflow latency.

The benchmark conversation assumes you need one gateway for everything. But production agent infrastructure needs two layers:

Data plane (where gateway latency matters):

This is where Bifrost shines. Go's concurrency model is genuinely well suited to high-throughput, low-latency routing. 11µs overhead is real and measurable.
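To put that data-plane overhead in proportion, here is a rough sketch of the arithmetic behind the support-ticket example. The per-call overheads (11µs and 8ms) and the 500ms workflow come from the text above. The count of four LLM calls is an assumption, picked only because it lands near the 31ms figure, and the function names are hypothetical.

```python
def gateway_overhead_ms(per_call_ms: float, llm_calls: int) -> float:
    """Total gateway overhead for one agent decision, assuming sequential calls."""
    return per_call_ms * llm_calls


def share_of_workflow(overhead_ms: float, workflow_ms: float) -> float:
    """Fraction of the end-to-end workflow latency that the overhead represents."""
    return overhead_ms / workflow_ms


FAST_GATEWAY_MS = 0.011  # 11 microseconds per call, from the article
SLOW_GATEWAY_MS = 8.0    # 8 ms per call, from the article
WORKFLOW_MS = 500.0      # example workflow latency, from the article
LLM_CALLS = 4            # ASSUMPTION: not stated in the article

# Latency saved by switching from the slower gateway to the faster one.
saved_ms = (gateway_overhead_ms(SLOW_GATEWAY_MS, LLM_CALLS)
            - gateway_overhead_ms(FAST_GATEWAY_MS, LLM_CALLS))
saved_share = share_of_workflow(saved_ms, WORKFLOW_MS)

print(f"saved {saved_ms:.1f} ms, {saved_share:.1%} of the workflow")
```

Under these assumptions the saving stays in the low single digits as a share of the workflow. Changing LLM_CALLS shows how sensitive that share is to the number of calls per decision.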
Control plane (where gateway latency doesn't matter):

This is where data plane latency is irrelevant. A control plane call that takes 200ms is acceptable if it's handling session state, sandbox provisioning, or workflow routing. You're not making 5,000 of them per second. You're making a few per agent lifecycle.

This is also where LiteLLM Agent Platform operates. It's not trying to be a low-latency gateway. It's trying to be a reliable control plane that actually makes agents runnable in production.

Here's a framework teams should actually use:

Most teams ask 9 first. They should ask it last.

A production agent system needs both:

The latency benchmark tells you about the data plane. It tells you nothing about the control plane. Teams that pick based solely on data plane latency end up with a fast gateway that can't handle agent sessions, costs, or multi-tenancy. They solve the wrong problem and build the wrong system.

If you're evaluating agent infrastructure:

Separate the layers. Don't try to measure everything in one benchmark. Measure data plane latency separately from control plane reliability and feature depth.

Measure the right things. Ask vendors: How do you handle session persistence? Cost attribution? Multi-tenancy? Sandbox isolation? Observability? These matter more than the 11µs vs 8ms difference.

Test the full workflow. Don't benchmark a single LLM call. Benchmark a complete agent decision with tool calls, retries, and cost tracking. That's closer to production reality.

Separate costs. Data plane should be fast and cheap (commodity hardware). Control plane should be reliable and governable (probably more expensive per request, but fewer requests).

The teams building agents at scale are not chasing 11-microsecond gateway overhead. They're building systems where sessions survive crashes, costs are predictable, and agents can actually be governed in production. That's a different set of problems. And latency benchmarks don't measure it.
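As one way to act on the "test the full workflow" advice, here is a minimal sketch of what a per-decision record and rollup might look like. Everything in it is hypothetical: the CallRecord fields, the rollup helper, and the sample numbers are illustrative assumptions, not the API of any gateway mentioned here. Summing latency also assumes the calls run one after another.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One LLM or tool call, tagged for attribution (hypothetical schema)."""
    decision_id: str
    agent: str
    team: str
    cost_usd: float
    latency_ms: float
    is_retry: bool = False


def rollup(records, key):
    """Sum cost, latency, calls, and retries, grouped by the named attribute."""
    totals = defaultdict(
        lambda: {"cost_usd": 0.0, "latency_ms": 0.0, "calls": 0, "retries": 0}
    )
    for record in records:
        bucket = totals[getattr(record, key)]
        bucket["cost_usd"] += record.cost_usd
        bucket["latency_ms"] += record.latency_ms
        bucket["calls"] += 1
        bucket["retries"] += int(record.is_retry)
    return dict(totals)


# Made-up sample data: two decisions, one of which includes a retry.
records = [
    CallRecord("d1", "support-agent", "cx", 0.004, 180.0),
    CallRecord("d1", "support-agent", "cx", 0.004, 210.0, is_retry=True),
    CallRecord("d1", "support-agent", "cx", 0.001, 90.0),
    CallRecord("d2", "billing-agent", "finance", 0.012, 320.0),
]

per_decision = rollup(records, "decision_id")
per_agent = rollup(records, "agent")

for decision_id, stats in per_decision.items():
    print(decision_id, stats)
```

Grouping by decision_id answers the "what does each decision cost" question, and the separate retry count keeps retries from being logged as if they were ordinary calls.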
Paul Twist is an AI engineer based in Berlin. He works on production infrastructure for agents and writes about the gap between what works in demos and what works at scale.