Virtual keys per tenant: ditching our custom LLM billing layer

wpnews.pro

cd /news/ai-infrastructure/virtual-keys-per-tenant-ditching-our… · home › topics › ai-infrastructure › article

[ARTICLE · art-15460] src=dev.to ↗ pub=2026-05-27T16:02Z topic=ai-infrastructure verified=true sentiment=↑ positive

Virtual keys per tenant: ditching our custom LLM billing layer

Nexus Labs replaced 60% of its 11,247-line Python middleware for per-tenant LLM cost attribution, rate limiting, and provider failover with Bifrost's virtual keys and governance features. The migration eliminated 4,200 lines of Python code by moving provider routing, budget caps, and rate limits into Bifrost's virtual key configuration, while also reducing latency by switching from Python with synchronous Redis calls to Bifrost's Go-based implementation. However, the migration required a full sprint to map legacy billing fields to virtual key metadata, and the team disabled semantic caching for agent workloads due to prompt similarity risks.

read4 min views14 publishedMay 27, 2026

TL;DR: We had 11,247 lines of Python middleware handling per-tenant LLM cost attribution, rate limiting, and provider failover. Replaced about 60% of it with Bifrost's virtual keys and governance features. Some honest gaps remain, which is why this is a writeup and not a sales pitch.

Nexus Labs runs enterprise agent automation. Each customer gets isolated workloads. Each workload makes between 200 and 50,000 LLM calls per day across OpenAI, Anthropic, Bedrock, and Vertex.

When I joined, we had a Python middleware doing four things at once: API key rotation per provider, per-tenant rate limits in Redis, cost attribution via request tagging, and fallback logic when a provider returned 429s.

11,247 lines of Python. Three engineers had touched it. Two had left. One of them had encoded their team-internal pricing assumptions inline. Every model deprecation became a sprint.

Three things, in priority order:

I evaluated three gateways before picking one. Here is the comparison after running each through a 2-week eval against our actual traffic shape.

Feature	Bifrost	LiteLLM	Portkey
Per-tenant virtual keys with budgets	Native	Plugin/config	Native
Self-host without external deps	Yes	Yes	Limited
OpenAI-compatible API for all providers	Yes	Yes	Yes
Built-in Prometheus metrics	Yes	Yes (newer)	Hosted preferred
Semantic caching	Yes	Yes	Yes
MCP gateway	Yes	No	Limited
Built-in web UI for config	Yes	Limited	Cloud-first

LiteLLM was the real contender. Larger community, more battle-tested in production for some workload shapes. Where it lost for us: setting up hierarchical budgets across customer to team to workload tiers required more YAML wrangling than we wanted, and the failover behavior on streaming requests was less predictable under our tests.

Portkey was strong on dashboards. We didn't want a hosted dependency for our cost control path.

The piece that surprised me most was the virtual keys model. From the docs (governance/virtual-keys), every tenant gets a virtual key. The key carries the budget cap, rate limit, allowed providers, and allowed models. Our orchestrator stopped caring about provider routing entirely.

Config that replaced 4,200 lines of Python:

virtual_keys:
  - id: vk_acme_prod
    customer_id: acme_corp
    budget:
      max_per_month_usd: 12000
      reset_duration: monthly
    rate_limit:
      requests_per_minute: 600
    allowed_providers:
      - openai
      - anthropic
      - bedrock
    fallbacks:
      - provider: openai
        model: gpt-4o
      - provider: anthropic
        model: claude-sonnet-4-6
      - provider: bedrock
        model: anthropic.claude-sonnet-4-6

Our orchestrator now does one thing: pick a virtual key based on tenant. Send the request. Done.

Before:

gateway_middleware/

After 4 months:

The latency number was the biggest surprise. Bifrost is Go. Our middleware was Python doing synchronous Redis calls. We knew that was a problem. Solving it wasn't on the roadmap.

This isn't free.

Migration was harder than the docs suggest. Our cost attribution data didn't map cleanly. We had legacy fields like team_internal_billing_code

baked into every log. Mapping these to virtual key metadata took a full sprint, and the team still grumbles about it.

Semantic caching is risky for our workload. Our agents call LLMs with tool results embedded in prompts. Two prompts that look 92% similar can require very different responses. We disabled semantic caching for the agent path. Enabled it only for our content generation path, where we saw a 31% hit rate.

MCP gateway integration is newer than the rest. We use it for filesystem access from a customer-facing automation agent. Works fine. But debugging when a tool call fails requires more log digging than the rest of the platform.

No native cost-anomaly alerting yet. Budget caps work. But "this customer's usage spiked 3x in 2 hours" is still wired up via Prometheus alerts and PagerDuty by hand. Portkey has this in their hosted product. If real-time anomaly alerts are your top requirement, weight that.

If you have one provider and one customer, you don't need this. Use the provider's SDK.

If you have 3+ providers, multiple customer tiers, and someone on your team has written class CostTrackingMiddleware

more than once, evaluate. Spin up the Docker container (quickstart). Point staging traffic at it for a week. Look at the metrics. Decide.

The model is the easy part. Cost attribution is the part that wakes you up at 2am when a customer's bill is wrong.

source & further reading

dev.to — original article Cursor + FFmpeg Micro: Ship Video Features Without Learning FFmpeg Egregor: Local Multi-AI Consilium for Comprehensive Smart Contract and Code Audits Stripe to Mollie Migration: What Actually Breaks

~/api · this article 200

$curl api.wpnews.pro/v1/news/virtual-keys-per-tenant-…

Read original on dev.to → dev.to/marcuswwchen/virtual-keys-per-tenant-ditc…

mentioned entities

Nexus Labs

Bifrost

LiteLLM

Portkey

OpenAI

Anthropic

Bedrock

Vertex

metadata

slugvirtual-keys-per-tenant-ditching-our-custom-llm-billing-layer

topic#ai-infrastructure

secondary4 topics

sentimentpositive

canonicaldev.to

navigation

← prevShow HN: Turn your Google accoun…

next →Introducing okta-skill: Zero-Con…

── more in #ai-infrastructure 4 stories · sorted by recency

cryptobriefing.com · 12 Jul · #ai-infrastructure

Elon Musk’s Grok 4.5 launch intensifies AI arms race with implications for crypto-adjacent compute markets

startupfortune.com · 12 Jul · #ai-infrastructure

Zhipu's co-founder says open AI is safer just as its stock triples in a year

runewardd.github.io · 12 Jul · #ai-infrastructure

Show HN: Runeward: Sandboxing AI agents with policy gates

startupfortune.com · 12 Jul · #ai-infrastructure

How to Evaluate AI Agents Before You Ship Them to Real Users

── more on @nexus labs 3 stories trending now

wpnews · 30 May · #ai-safety

Nightcord Security Analysis Report - Threat Investigation

wpnews · 8 Jul · #artificial-intelligence

SpaceXAI unveils Grok 4.5 AI model ahead of July 2026 public release

wpnews · 8 Jul · #ai-chips

D-Matrix launches Corsair AI inference platform, challenging Nvidia’s GPU dominance

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required