Your API has thousands of LLM consumers. None of them can read your changelog.

A developer has identified a critical blind spot in API testing: the rise of LLM consumers that cannot read changelogs or respond to deprecation notices. Unlike traditional API consumers—known, versioned codebases that can be enumerated and notified—LLM consumers are "frozen" populations whose understanding of an API schema is anchored to a training cutoff or runtime prompt. This mismatch creates a new class of bugs where schema changes like renaming `user_id` to `userId` are catastrophic for LLM consumers but invisible to standard consumer-driven contract testing.

Sometime in the last eighteen months, the consumer base of every public API changed, and almost nobody updated their testing strategy. If you publish an API today, even an internal one with an OpenAPI spec on a Confluence page, you have consumers you didn't onboard, didn't issue a key to, didn't notify on deprecation, and can't reach with a changelog.

Those consumers are language models, and the agents people are building on top of them. Some of them learned your API last week. Some learned it eighteen months ago and got their knowledge frozen there. None of them got the email when you renamed `user_id` to `userId`.

This is a real category of consumer, in production, in volume, today. And the testing discipline that was supposed to catch breaking changes (consumer-driven contract testing, the Pact-shaped world) wasn't designed for it. The mismatch is wider than most teams realize, and it's quietly producing a class of bugs that nobody is currently tracking as a bug.

A traditional API consumer is a known, named, versioned codebase. Your mobile app, version 4.3.2. Your partner's billing service. The Python SDK you publish. You can enumerate them. You can test against them. When you change something, you can call the team and tell them. If you adopted Pact in the last decade you have a broker that tells you, automatically, which consumers your provider change would break, before you ship it.

An LLM consumer is none of those things. It is a population, not a codebase. It does not version itself in any way you can address. It has a training cutoff, after which your API documentation is frozen in its weights. It has a tool-use prompt at runtime, after which any additional knowledge of your API is whatever some agent author happened to paste in. It does not pull your changelog. It will not retry against the new schema. It will, with terrifying confidence, call the old endpoint with the old fields, parse the old shape, and degrade silently.

Call this the frozen consumer: a real, behaviorally-relevant user of your API whose understanding of your schema is anchored to a moment in time you do not control, that you cannot query, and that you cannot update with a deprecation header.

If your team ships an OpenAPI spec, it is in the training corpus. If your endpoints respond to documented requests, agents are calling them. KushoAI's State of Agentic API Testing 2026 report found that 41% of public APIs experience schema drift within 30 days of any given snapshot, and 63% within 90 days, meaning the vast majority of LLM consumers in production are calling an API that has already changed since their training data was assembled. The drift is the default state. The "synchronized consumer" was always a flattering fiction, but with human consumers you could at least notify them. You can't notify a population.

The old taxonomy of breaking changes (removed endpoints, removed required fields, changed types, narrowed enums) is still right. It's just incomplete. There are now categories of change that are not breaking for human-written consumers but catastrophic for frozen consumers. Three matter:

You rename `user_id` to `userId`. Your changelog notes it. Your SDKs auto-regenerate. Your typed clients refuse to compile until updated. Every human consumer is either fine, or loud about not being fine.

The LLM consumer is silent. It will keep generating requests with `user_id` for as long as that token cluster wins next-token prediction, which is to say: indefinitely for many models, until the next retraining cycle. The model isn't broken. The model is doing exactly what models do: predicting the most likely token sequence based on the training distribution it learned.
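
One way to make renames like this visible before they ship is to diff field names between two spec versions and flag pairs that differ only in case or separators. The sketch below is a minimal illustration, not an existing tool: `normalize` and `find_silent_renames` are hypothetical helpers, and the field lists are made-up inputs that you would extract from your own OpenAPI documents.

```python
import re


def normalize(name: str) -> str:
    """Collapse case and separators so user_id, userId, and userID compare equal."""
    return re.sub(r"[_\-\s]", "", name).lower()


def find_silent_renames(old_fields, new_fields):
    """Return (old, new) pairs that differ only in case or separators."""
    removed = set(old_fields) - set(new_fields)
    added = set(new_fields) - set(old_fields)

    # Index the newly added names by their normalized form.
    by_key = {}
    for name in added:
        by_key.setdefault(normalize(name), []).append(name)

    # A removed name whose normalized form matches an added name is a likely rename.
    renames = []
    for old in sorted(removed):
        for new in sorted(by_key.get(normalize(old), [])):
            renames.append((old, new))
    return renames


# Hypothetical field lists taken from two versions of a spec.
old = ["user_id", "status"]
new = ["userId", "status"]
print(find_silent_renames(old, new))  # [('user_id', 'userId')]
```

A check like this only catches case and separator changes; pluralization, dropped prefixes, and path moves would need their own rules.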

The same silent failure applies to:

- `snake_case` ↔ `camelCase` migrations
- `/users/{id}/order` vs `/users/{id}/orders`
- `X-Request-Id` → `Request-Id`, dropping the `X-`
- `/v1/items` → `/api/items` with content negotiation
- `userID` ↔ `userId`

For human consumers and traditional contract tests these are "find-replace" changes that the typecheck catches in seconds. For frozen consumers, they're invisible cliffs. And field additions are not safe either: KushoAI's data shows additions account for 86% of all observed drift events, and LLMs reliably hallucinate field names from related-domain APIs to fill perceived gaps, so an "additive" change can still produce calls with phantom fields the model believes you accept.

The shape of the response is unchanged. The meaning is not.

You added two new values to an existing enum: `status: "active" | "inactive"` becomes `status: "active" | "inactive" | "trialing" | "paused"`. Strictly additive. Your typed consumers expand their switch statements or your TypeScript users find out at runtime (same outcome, different timing).

For an LLM consumer, the new values are now out of distribution. It learned a binary. It will branch on a binary. `"trialing"` will route through the `"inactive"` branch with some probability and the `"active"` branch with the rest, depending on the agent's prompt, the model temperature, and the surrounding context. It will not raise. It will not log. It will just route some non-trivial fraction of customers to the wrong code path.

The most dangerous shape of this isn't even enums. It's response codes whose semantics drift: a 422 that used to mean "validation failed, retry never" but now also gets emitted by a new ML-backed validator that means "validation failed, retry maybe." Same code. Different meaning. Frozen consumers will keep treating it as terminal. Human consumers update their retry policy when you announce it.

This is the inverse problem. Your API gets smaller, and an LLM keeps calling the removed parts.

Frozen consumers will confidently call endpoints you sunset two years ago. They will populate request bodies with deprecated fields that your server now rejects (in the lucky case) or silently ignores (in the worse one). They will believe in pagination tokens you stopped issuing.

There's a documented failure mode for this. Researchers call it functional hallucination: the agent calls an endpoint that doesn't exist, or sends a string `"fifty-percent"` to a field requiring the integer `50`. Roughly 20% of package references in LLM-generated code are hallucinated, and 43% of those hallucinations repeat across generations. These are not random one-off fabrications. They are stable confabulations the model has learned to repeat. Your removed endpoint has a half-life measured in model generations, not deploy cycles.

Pact, the canonical consumer-driven contract testing tool, works like this: the consumer declares what it expects from the provider. The expectation is published to a broker. The provider, in CI, verifies that its current code satisfies every published expectation. If a provider change would break any registered consumer, the verification fails and the merge doesn't happen.

This is a beautiful system, and I'm a fan. I wrote [a piece last week](https://dev.to/deepaksatyam/pact-is-great-its-also-why-most-teams-dont-do-contract-testing-5c14) about how Pact is genuinely great and dramatically underused.

But re-read the contract: the consumer declares what it expects. The consumer is a participant. The contract is mutual.

A frozen consumer is not a participant. It doesn't sign anything. It doesn't publish to a broker. It doesn't even know which version of your API it learned. It has, instead, a phantom contract: a statistical model of your schema, derived from text someone scraped at some point, with no version, no expiration, no notification channel. You cannot run a Pact test against a phantom contract because there is no contract object to fetch.
There is no canonical "what LangChain agents trained on a 2023-cutoff model expected from your `/billing` endpoint". There is a distribution of expectations, varying by base model, fine-tune, surrounding prompt context, and which retrieval-augmented documents the agent author happened to attach.

Pact assumed a population of N consumers where N is countable and addressable. Frozen consumers are a population where N is large, varies daily, and each individual "consumer" is the joint product of a base model, a prompt, and your now-stale documentation. The math of contract testing assumes a finite, named set. The new consumer base is neither.

This is not Pact's fault. Pact is correct within its assumptions. The assumptions about who your consumers are, on the other hand, no longer hold for any API with a public-facing spec.

I don't have a clean answer here, and nobody else does either yet. But three practices are starting to converge, and they're all cheap enough that you can adopt them this quarter without waiting for the discipline to mature.

Your OpenAPI document is no longer documentation. It is the canonical artifact that a population of frozen consumers will read once and remember forever. Descriptions matter more than they used to. Examples matter more than they used to. The names of fields matter vastly more than they used to.

This has consequences for how you make changes. Renaming a field for human-readability is no longer a free improvement. The cost is not "update the SDK." The cost is "every agent backed by a model trained between `$LAST_CUTOFF` and the next major training run will silently use the wrong name, indefinitely." That cost is real and you should price it into the change.

The corollary: if you must rename, accept the old name in parallel for at least one model-generation cycle (call it 18 months) and emit it in responses too. The cost of dual-shape support is finite, bounded, and visible.
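
Dual-shape support can be as small as an alias table applied at the edges of a handler. The sketch below assumes a single hypothetical rename (`user_id` to `userId`); `FIELD_ALIASES`, `normalize_request`, and `dual_shape_response` are illustrative names, not part of any framework.

```python
# Hypothetical alias table: deprecated field name -> current field name.
FIELD_ALIASES = {"user_id": "userId"}


def normalize_request(body: dict) -> dict:
    """Map deprecated field names onto current ones; the current name wins on conflict."""
    # Copy every field that is not a deprecated alias.
    out = {key: value for key, value in body.items() if key not in FIELD_ALIASES}
    # Fill in the current name from the deprecated one when only the old name was sent.
    for old, new in FIELD_ALIASES.items():
        if old in body and new not in out:
            out[new] = body[old]
    return out


def dual_shape_response(body: dict) -> dict:
    """Emit both the current and the deprecated name so either schema can parse it."""
    out = dict(body)
    for old, new in FIELD_ALIASES.items():
        if new in body:
            out[old] = body[new]
    return out


# A frozen consumer sends the old name; the handler sees only the new one.
print(normalize_request({"user_id": 7, "plan": "pro"}))
# The response carries both names during the overlap window.
print(dual_shape_response({"userId": 7}))
```

Keeping the aliases in one table also makes the eventual removal a single, visible change.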

The cost of a population of agents quietly calling you wrong is, depending on what your API does, anywhere from "support tickets" to "incident."

The cheapest, highest-signal test you are not running: take your OpenAPI spec, hand it to Claude or GPT or Gemini with no other context, and ask the model to construct a valid call to each endpoint. Then run the call. See what happens.

This sounds dumb. It catches things contract tests cannot, because it is the actual behavior your actual consumers will exhibit. If the model consistently misnames a field, misreads an enum, hallucinates a required parameter, or calls a deprecated endpoint, you have just discovered a production bug that no Pact test will ever surface.

A minimal version, in pseudo-Python:

```python
def ai_compat_check(spec, endpoint, models=("claude", "gpt", "gemini")):
    # Pseudo-Python: each entry in `models` stands in for a client object,
    # and generate_call / actually_call are placeholders, not real APIs.
    findings = []
    for model in models:
        # Ask the model to build a request from the spec alone.
        request = model.generate_call(
            spec=spec,
            endpoint=endpoint,
            instruction="Construct a valid request. Return JSON.",
        )
        # Send the generated request to the real API.
        response = actually_call(request)
        # Any 4xx/5xx means the model's idea of the API diverged from reality.
        if response.status >= 400:
            findings.append((model, endpoint, request, response))
    return findings
```

You can run this in CI alongside your contract tests. Treat divergence as a finding, not a failure. Sometimes the model is doing something dumb you don't need to fix. But sometimes it's pointing at a real ambiguity in your spec, or a real frozen-consumer trap.

Right now your deprecation strategy is: blog post, changelog, email to known consumers. The first two don't reach frozen consumers at all. The third reaches them only if a human is in the loop to update the agent's instructions.

The emerging fix is the Model Context Protocol (MCP) and similar machine-readable surfaces. An MCP server is a structured, queryable contract that an agent can pull at runtime. It doesn't rely on the model's training data. It tells the agent, right now, what the current shape is.
Publishing an MCP surface alongside your REST API is the closest thing to a registered-consumer relationship you can get with the LLM population. It won't reach agents that don't use MCP, but the population that does is growing fast, and being there early is cheap.

Consumer-driven contract testing was the testing discipline of the 2010s. It assumed a knowable, addressable, code-bearing consumer. That assumption still holds for most of your traffic. It does not hold for an emerging, growing, increasingly important slice of it.

AI-consumer compatibility testing is the same problem in a fundamentally different shape. The tooling doesn't exist yet. There are no Pact-equivalents for the frozen-consumer case, because nobody has figured out yet what the broker is supposed to broker. I expect the next several years of API testing tooling to be about figuring this out, the same way the 2010s were about figuring out contracts.

Until then: assume you have consumers you can't see, write your OpenAPI as though it's a contract you can't take back, and test against the models that are actually calling you.

I think and write about this kind of thing for a living. Most of my time goes into SpecShield, where I work on API breaking-change detection. If you've shipped a "non-breaking" change that broke an AI consumer in production, I'd really like to hear it in the comments. I'm starting to collect cases.