Protecting against inference theft


Source: vercel.com · Published: 2026-05-29T04:00Z · Topic: artificial-intelligence

Inference theft attacks on AI endpoints are surging, with attackers using residential proxies and adapters to steal expensive model calls and resell them at a fraction of list price. Vercel reported a single attack on its docs AI chat endpoint on April 12, 2026, that spiked traffic to 1,300 requests per minute, which would have resulted in an inference cost run rate exceeding $10,000 per day. The company warns that standard rate limits and authentication walls are insufficient: verification has to run on every request so that attackers cannot amortize session-level checks across thousands of stolen calls.


HTTP requests are cheap. Vercel charges about $2 per million requests, a tiny fraction of a cent per call. But a single prompt to an agent on a frontier model can cost $2, which makes the AI call roughly a million times more expensive than the request carrying it, and makes inference theft one of the highest-margin businesses an attacker can run. We have seen this type of attack on our own APIs.
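The million-to-one ratio follows directly from the two prices quoted above. A quick arithmetic sketch using only those figures (the variable names are illustrative):

```typescript
// Prices quoted in the text; the variable names are illustrative.
const requestPricePerMillionUsd = 2; // about $2 per million HTTP requests
const promptCostUsd = 2;             // one agent prompt on a frontier model

// Cost of a single HTTP request in dollars.
const costPerRequestUsd = requestPricePerMillionUsd / 1_000_000;

// How many times more expensive the prompt is than the request carrying it.
const costRatio = Math.round(promptCostUsd / costPerRequestUsd);

console.log(costRatio); // 1000000
```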

If you have AI endpoints exposed to the internet, the risk of abuse is high and can easily run up bills in the tens of thousands of dollars or more. Protecting those endpoints requires verification to run on every AI request, not on the session or signup. Rate limits and auth walls aren't sufficient on their own because checks that run once per session get amortized away across thousands of stolen calls.

At Vercel, we gate every AI request through BotID deep analysis, and you can do the same on your own endpoints with a few lines of code.

Inference theft is the unauthorized use of someone else's paid AI inference, either for free consumption or downstream resale. The operator pays per AI call; the attacker pays nothing for the inference, then resells the tokens at a discount. This goes beyond rate-limit abuse to actual resale of a stolen resource in a market.

Any internet-facing endpoint that gives a caller meaningful control over an LLM prompt is a target. The more general the endpoint, the higher the payout per stolen call.

AI playgrounds, like the AI SDK Playground, are the most dangerous shape because the caller has maximum control over the prompt, the model, and often the parameters. Stolen calls drop cleanly into any standard client.

Support bots and documentation assistants are less exposed when system prompts are fixed server-side, but attackers have learned how to talk the models around system prompts cheaply enough to make resale viable.

Resale value tracks how easily the stolen calls can be dropped into a provider-compatible client.

IP rate limits and auth walls were built for attacks with a dramatically lower payoff per call, where acquiring IPs and accounts at scale wasn't worth the cost.

The payoff from stolen inference is high enough that attackers will procure residential proxy IPs by the thousands and register throwaway accounts at whatever scale defeats your gate. Rate limits get diluted across the fleet of IP addresses, and real accounts pass authentication.
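To make the dilution concrete, here is a small simulation of a fixed-window per-IP limiter over a single one-minute window. It is a sketch under assumed numbers: the limit of 60 requests per IP and the fleet of 2,000 proxy IPs are hypothetical, and `countBlocked` is an illustrative helper rather than any real limiter's API. Only the 1,300 requests per minute figure comes from the incident described in this article.

```typescript
// Counts how many requests a per-IP fixed-window limiter would reject in one
// window when traffic is spread round-robin across `ipCount` addresses.
function countBlocked(totalRequests: number, ipCount: number, limitPerIp: number): number {
  const seenPerIp: number[] = [];
  let blocked = 0;
  for (let i = 0; i < totalRequests; i++) {
    const ip = i % ipCount;                 // round-robin over the proxy fleet
    const seen = (seenPerIp[ip] || 0) + 1;  // requests from this IP so far
    seenPerIp[ip] = seen;
    if (seen > limitPerIp) blocked++;       // over the per-IP limit: rejected
  }
  return blocked;
}

// Same traffic volume, hypothetical limit of 60 requests per IP per minute.
console.log(countBlocked(1300, 1, 60));    // 1240: a single IP is mostly blocked
console.log(countBlocked(1300, 2000, 60)); // 0: spread over 2,000 IPs, nothing trips
```

With one source address the limiter rejects almost everything. With the same volume spread over a large enough fleet, no address ever reaches the threshold, so the limiter never fires.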

Sophisticated attackers wrap your custom AI endpoint in an OpenAI- or Anthropic-compatible adapter and fan calls out through residential proxies.

The adapter is the key component. It is a one-time engineering cost that presents the victim's idiosyncratic API as OpenAI- or Anthropic-compatible, so the stolen inference drops into any standard coding agent or SDK. Resale at even five to ten percent of list price against zero marginal inference cost can make for a generous-margin business.

A recent example is Chipotlai Max, a forked coding agent that ships with a proxy turning Chipotle's customer-support chatbot into an OpenAI-compatible endpoint. The project openly solicits help porting the same inference theft approach to Home Depot, Lowe's, Target, and Starbucks.

The adapter is also the session boundary for the attacker's downstream users. They authenticate to the adapter, not to your endpoint. By the time a call hits your API, it has already crossed the boundary you were planning to defend. The check has to run on the call the adapter is proxying, not the session it sits behind.

On April 12, 2026, traffic to the Vercel docs AI chat endpoint spiked to roughly ten times normal volume on Anthropic's Claude Haiku 4.5 model. Traffic rose to 1,300 requests per minute at peak, which would have translated to an inference cost run rate of over ten thousand dollars per day.
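The run rate can be sanity-checked with simple arithmetic. The sketch below assumes the peak rate is sustained for a full day, which is what a run rate implies. The per-request figure it derives is a back-of-envelope lower bound implied by the two numbers above, not a price stated in the source.

```typescript
// Figures from the incident description; variable names are illustrative.
const peakRequestsPerMinute = 1300;
const runRateUsdPerDay = 10000; // "over ten thousand dollars", so a lower bound

// Requests per day if the peak rate were sustained for 24 hours.
const requestsPerDay = peakRequestsPerMinute * 60 * 24;

// Implied minimum inference cost per request (derived, not quoted in the source).
const impliedCostPerRequestUsd = runRateUsdPerDay / requestsPerDay;

console.log(requestsPerDay);                      // 1872000
console.log(impliedCostPerRequestUsd.toFixed(4)); // 0.0053
```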

The attack came in through residential proxies that obscured the real client IPs. Across hundreds of thousands of bot requests over two days, standard per-IP rate limits had nothing useful to act on.

Protecting AI endpoints against inference theft requires verification of every request. We use Vercel's BotID with deep analysis, called inside the route handler before the AI request lands.

If our gate had run at session start instead of per request, the attacker would have paid the bypass cost once and walked away with hundreds of thousands of stolen calls. Any check that runs per session amortizes the attacker's bypass cost across every subsequent inference call. Per-request gates force that ratio down to one bypass per call, and even at high inference prices, defeating a check on every call isn't worth the cost. This is where the cost asymmetry works in the defender's favor: inference is the most expensive per-call resource the attacker can steal, while verification is one of the cheapest per-call protections a defender can run.
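The amortization argument can be put in numbers. The sketch below is illustrative only: the bypass cost, the resale price, and the call counts are made-up values, not figures from the incident, and both function names are hypothetical.

```typescript
// Attacker's bypass cost per stolen call: the cost of defeating one check,
// divided by the number of inference calls that one check covers.
function bypassCostPerCall(bypassCostUsd: number, callsPerCheck: number): number {
  return bypassCostUsd / callsPerCheck;
}

// Theft pays only while resale revenue per call exceeds bypass cost per call.
function theftIsProfitable(
  resalePerCallUsd: number,
  bypassCostUsd: number,
  callsPerCheck: number,
): boolean {
  return resalePerCallUsd > bypassCostPerCall(bypassCostUsd, callsPerCheck);
}

// Hypothetical: $0.05 to defeat one check, $0.01 resale revenue per call.
console.log(theftIsProfitable(0.01, 0.05, 100000)); // true: session gate, one bypass covers 100,000 calls
console.log(theftIsProfitable(0.01, 0.05, 1));      // false: per-request gate, one bypass per call
```

The check itself is identical in both cases. Only the number of calls it covers changes, and that alone flips the attacker's economics.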

Traditional image CAPTCHAs no longer hold up against modern attackers because the same AI models that make inference worth stealing can easily bypass them.

We deploy Vercel BotID on our AI endpoints, gating every request. BotID is an invisible CAPTCHA with deep analysis powered by Kasada that uses client-side machine learning to distinguish humans from bots without showing a visible challenge, which means it can run on every request rather than only at session start.

BotID deep analysis detected and blocked more than ten thousand bot requests in the first minutes of the spike. Within twenty-four hours, request volume on the endpoint was flat at normal levels.

Server-side, checkBotId() runs inside the route handler and returns a classification for the request currently being served.
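A minimal sketch of that per-request gate follows. To stay self-contained it takes the classifier as a parameter instead of importing it. In a real Next.js route the `check` argument would be a call to `checkBotId()` from `botid/server`, and the only field this sketch assumes on its result is `isBot`. The `gateEveryRequest` wrapper, the response shape, and the 403 status are illustrative choices, not part of BotID.

```typescript
// Minimal shape this sketch relies on; the real result may carry more fields.
type Verdict = { isBot: boolean };
type GatedResponse = { status: number; body: string };

// Hypothetical wrapper: runs `check` on every call, before the model is touched.
function gateEveryRequest(
  check: () => Promise<Verdict>,                  // e.g. () => checkBotId() in a real route
  callModel: (prompt: string) => Promise<string>, // the expensive inference call
): (prompt: string) => Promise<GatedResponse> {
  return async (prompt) => {
    const verdict = await check(); // evaluated per request, never cached per session
    if (verdict.isBot) {
      // Rejected before any inference cost is incurred.
      return { status: 403, body: "Access denied" };
    }
    return { status: 200, body: await callModel(prompt) };
  };
}
```

The point of the shape is ordering: the classification is awaited inside the handler on each call, so a request classified as a bot never reaches the model.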

The route also has to be declared on the client. Without this, checkBotId() fails because BotID doesn't attach the challenge headers to the request.
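Because the server-side check and the client-side declaration live in different files, they can drift apart. The sketch below is a hypothetical consistency check, not a BotID API. It assumes the client declaration is a list of path and method pairs, as in the BotID docs' `protect` option, and it matches paths exactly, so any wildcard patterns would need extra handling.

```typescript
// One protected route as a path and method pair (assumed shape).
type RouteRule = { path: string; method: string };

// Hypothetical helper: routes gated on the server that the client never declared.
// Requests to these routes would arrive without challenge headers.
function findUndeclared(serverGated: RouteRule[], clientDeclared: RouteRule[]): RouteRule[] {
  const key = (r: RouteRule) => r.method.toUpperCase() + " " + r.path;
  const declared = clientDeclared.map(key);
  return serverGated.filter((r) => declared.indexOf(key(r)) === -1);
}

// Example with made-up routes: /api/search is gated but was never declared.
const missing = findUndeclared(
  [{ path: "/api/chat", method: "POST" }, { path: "/api/search", method: "POST" }],
  [{ path: "/api/chat", method: "post" }],
);
console.log(missing.map((r) => r.path)); // [ '/api/search' ]
```

Running a check like this in CI is one way to catch a newly gated route before its verification starts failing in production.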

See the BotID docs for the next.config.ts wrapper and the full setup.

Inference will stay orders of magnitude more expensive than the requests carrying it, so resale stays profitable and attackers will keep iterating.

To protect your AI endpoints:

Audit which of your AI endpoints are exposed

Prioritize by attack likelihood: more caller prompt control means an easier target

Gate every endpoint on every request

Get started in our AI endpoint protection Knowledge Base Guide.
