What Is an AI Gateway? (And the Week We Realized We Desperately Needed One)

wpnews.pro

TL;DR

Six months ago we had what I'd describe as a functional mess. We were running three LLM providers - OpenAI for our customer-facing chat, Anthropic for internal document summarisation, and a self-hosted Llama model for batch classification jobs. Each had its own SDK. Each had its own API key, living in .env

files on whoever's machine had last run that service. Each had its own rate limiting logic, copy-pasted between services with slight variations.

It worked, in the way that things work when nobody has had a bad enough incident yet.

The incident arrived on a Tuesday. A background job that was supposed to run once a week got accidentally scheduled to run every minute. It was calling GPT-4o. We noticed when the Slack alert fired at 2am about an unusual credit card charge. By the time someone killed the job, we'd burned through $340 in about four hours. The API key had no spending limit. There was no alerting on token usage. The job had no rate limiting. All three of those gaps were things we knew about and hadn't prioritised.

That week, we started properly looking at AI gateways.

The simplest definition: an AI gateway is a middleware layer that sits between your application code and your LLM providers. All your LLM requests go through it, and it handles the cross-cutting concerns that you'd otherwise have to re-implement in every service: routing, authentication, rate limiting, cost tracking, caching, fallbacks, guardrails.

The analogy that clicked for me is an API gateway for the rest of your microservices stack. If you've ever set up Kong or AWS API Gateway to handle auth and rate limiting for your REST services, an AI gateway does the same thing but for LLM traffic specifically, which has different characteristics (token-based pricing, streaming responses, variable latency, context windows) that a generic API gateway doesn't handle cleanly.

Architecturally, it typically has:

The important thing is that none of this lives in your application code. It's a separate layer with its own config, which means you can change routing rules or enforce a new spending limit without touching application code or doing a deployment.

Before I get into specific features, it helps to be concrete about the problems. The ones we hit:

1. Unmanaged API keys

We had four API keys in four .env

files. When an engineer left the team, we invalidated their personal keys but not the shared service keys, because we weren't entirely sure which services were using them. A gateway solves this by being the only thing that holds the real provider keys. Application services authenticate to the gateway with scoped virtual keys. If you need to revoke access, you revoke the virtual key — the underlying provider key stays intact and doesn't need to change.

2. Zero cost visibility

We knew our monthly spend from the Anthropic and OpenAI dashboards. We had no idea which team or service was responsible for which portion of that spend. When costs went up, we couldn't attribute it. A gateway with per-team and per-service cost tracking meant that the next month, we had a breakdown: classification job (42%), customer chat (31%), internal summarisation (19%), miscellaneous (8%). Suddenly we knew where to optimise.

3. No spending limits

The Tuesday incident. Enough said. A gateway lets you set hard token or spend limits per API key, per team, per service. When the limit hits, the request gets a rate-limit error instead of a bill at the end of the month.

4. Silent failures on model outages

When OpenAI had a partial outage last March, our customer chat just... failed quietly. Requests returned errors, the frontend showed a generic message, and we found out from a user report rather than an alert. A gateway with fallback routing would have automatically switched to Anthropic or our self-hosted model and kept the service up. We were just making direct SDK calls with no fallback logic.

5. The security audit

This one came from outside the team. Our security team did a review and had two questions we couldn't fully answer: "Can you show me an audit log of which users triggered which model calls in the last 90 days?" and "How do you ensure that production credentials aren't accessible to developers locally?" We couldn't answer either cleanly. A gateway with request-level logging and centralised key management is the infrastructure answer to both.

One feature that sounds vague but turned out to be genuinely useful: routing.

Not just "send this request to OpenAI" — but intelligent routing. There are a few modes worth understanding:

Latency-based routing: The gateway continuously monitors response times across your configured providers. When one provider's latency spikes, it automatically routes to whichever is fastest. This is particularly useful when you're using multiple deployments of the same model across regions.

Weighted load balancing: You can split traffic across providers by percentage. We used this when testing a new model — routing 10% of requests to the new model, watching the metrics, and gradually shifting the split as confidence grew. No code changes, just a config update.

Fallback chains: Define a priority order. If the primary model is unavailable or rate-limited, try the secondary, then the tertiary. The request succeeds from the application's perspective — it never sees the fallback happen.

Cost-based routing: Route to cheaper models for lower-stakes tasks. We ended up routing classification jobs to a smaller, cheaper model and only using GPT-4o for the tasks that genuinely needed it. The gateway enforces this policy centrally rather than relying on individual engineers to make cost-conscious choices in every service.

We expected to use caching for exact-match deduplication — if two users send the identical prompt, return the cached response. Useful, but not that common in practice.

What we didn't expect was how useful semantic caching turned out to be. Semantically similar prompts — not identical, but asking for the same thing slightly differently — return the cached response if the similarity score is above a threshold you configure. For our summarisation workload, we found that a significant portion of requests were semantically similar enough to return cached results. That's real cost reduction without any change in output quality.

The key configuration decisions: cache expiry (how long is a cached response valid?), and the similarity threshold (how similar is "similar enough"?). These are worth tuning — the defaults are conservative and you can usually go more aggressive once you understand your workload.

Guardrails are the part of AI gateway setup that gets deferred because it feels like a "compliance problem" rather than an engineering problem. It's both.

A guardrail is a policy that runs on requests before they're sent to the model (input guardrails) and on responses before they're returned to the application (output guardrails). Common uses:

The way TrueFoundry handles this is through pre-built integrations that need no external credentials for the basics, with options to plug in Azure Content Safety, AWS Bedrock Guardrails, OpenAI Moderations, or Google Model Armor for more specific requirements. You can run guardrails in validate mode (inspect, flag, optionally block) or mutate mode (inspect and modify — useful for PII scrubbing where you want to replace rather than reject).

The thing that shifted my thinking on this: guardrails aren't just for compliance. Prompt injection via third-party content is a real engineering risk once you're building agents that retrieve external content and put it into context. A guardrail that runs on retrieved content before it reaches the model is the right architectural answer — not trying to sanitise inputs at the application layer.

We evaluated a few options. LiteLLM was the first thing we tried — it's the obvious starting point because it's open source, MIT licensed, and gets you a unified endpoint across providers in an afternoon. We ran it for about six weeks. What broke for us: no SSO integration (we needed Okta for compliance), and the per-team budget enforcement we needed was behind the enterprise license. The YAML config also got unwieldy as we added more models and routing rules.

We ended up on TrueFoundry's AI Gateway. A few specifics that mattered to our evaluation:

Architecture: The gateway runs entirely in-memory for auth, rate limiting, and routing decisions — no external DB lookups on the hot path. Config syncs from the control plane via NATS. This means gateway latency doesn't degrade as you add governance rules. The benchmarks show 350+ RPS on 1 vCPU with under 10ms added latency at full load, which matched what we saw in our own testing.

Key management: Developers get virtual keys that map to gateway-managed provider credentials. The actual OpenAI and Anthropic keys never leave the secrets manager. Onboarding a new developer means issuing a new virtual key. Offboarding means revoking it — one action, immediate effect.

Per-team budgets: Enforced on the request path, not as a post-spend alert. When the limit hits, requests return rate-limit errors. We haven't had another 2am Slack alert.

Self-hosted model support: We route to our on-prem Llama deployment through the same gateway as our OpenAI and Anthropic traffic. Same observability, same cost attribution, same rate limiting. This was the biggest gap with Portkey, which has no visibility into self-hosted model infrastructure.

Deployment: We run it inside our VPC. The whole control plane stays in our infrastructure, which is what our security team needed to answer the data residency question cleanly.

What I'd flag as the honest tradeoff: TrueFoundry is more to set up than LiteLLM. It's Kubernetes-native, so if you don't have a K8s environment, there's more upfront work. And Portkey's prompt management UI is genuinely better for non-engineers who want to iterate on prompts without touching config files. Those are real differences worth knowing before you evaluate.

The moment everything clicked was adding this to our service configs:

export ANTHROPIC_BASE_URL=https://<gateway-url>/api/inference/
export OPENAI_BASE_URL=https://<gateway-url>/api/inference/

That's it. Every existing SDK call — LangChain, the OpenAI Python client, direct requests — started going through the gateway without any code changes. Suddenly we had cost attribution, rate limiting, and request logging across all our services. The application code didn't know anything had changed.

This is also, incidentally, the right way to think about what a gateway does to your architecture: it's a configuration change at the infrastructure layer, not a code change at the application layer. That's why the governance it provides actually holds — it's enforced at the network level, not dependent on individual engineers remembering to implement it.

I don't want to make it sound like everyone needs an AI gateway immediately. If you're at an early stage, the overhead isn't worth it.

You probably don't need a gateway yet if:

You're ready for a gateway when:

The Tuesday incident was our sign. Hopefully yours is less expensive.

What's the specific thing that pushed your team toward a gateway — or convinced you to hold off? Curious whether cost incidents are as common a forcing function as they were for us, or whether it's usually the security audit that does it. Drop it in the comments.

source & further reading

dev.to — original article Real-Time AI Feature Engineering with Spark Structured Streaming and Databricks Feature Store Building an AI Chat Agent with MCP, Spring AI The Complete Guide to OpenAI-Compatible APIs for Chinese LLMs

What Is an AI Gateway? (And the Week We Realized We Desperately Needed One)

Run your AI side-project on zahid.host