# Can You Tell When an LLM API Swaps in a Cheaper Model?

> Source: <https://dev.to/newtorob/can-you-tell-when-an-llm-api-swaps-in-a-cheaper-model-3n9b>
> Published: 2026-06-16 15:33:10+00:00

If you call an open-weight model behind an API, whether that is your own box, a hosted endpoint, or a router, you are trusting that the thing answering is the model on the label. Providers have every incentive to serve a smaller or more aggressively quantized model under load. I wanted to know if you can catch that from the outside.

Short version: the obvious method fails, a less obvious one works, and it only works if you accumulate evidence.

The intuitive idea is to send a prompt, look at the answer, and flag low-quality responses. I scored served outputs by perplexity under the model that was supposed to produce them. The result was backwards. A cheaper model (I used a 0.5B as the impostor) produces simpler, more generic, more predictable text, and predictable text has low perplexity under any model. The impostor's output scored better than the genuine model's own output, by about 0.65 bits per byte, on 9 of 10 prompts. So "flag the improbable answers" rewards the cheaper model. Scratch that.
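For reference, here is a minimal sketch of the bits-per-byte number used above, assuming you already have per-token log-probabilities in nats from the reference model. The function name and the example values are hypothetical, not taken from the original experiment.

```python
import math

def bits_per_byte(token_logprobs, text):
    """Convert per-token natural-log probabilities into bits per UTF-8 byte.

    token_logprobs: log-probabilities (nats) the reference model assigned
                    to each token of `text`.
    text:           the exact string those tokens decode to.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_nats = -sum(token_logprobs)        # negative log-likelihood in nats
    total_bits = total_nats / math.log(2)    # nats -> bits
    return total_bits / n_bytes

# Made-up logprobs for illustration only.
example_text = "The cat sat."
example_logprobs = [-2.0, -1.5, -0.5, -0.2]
print(f"{bits_per_byte(example_logprobs, example_text):.3f} bits/byte")
```

Lower is "more predictable", which is exactly why this metric favors the smaller model's blander text.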

Stop grading free-form answers. Fix a token sequence and ask the model to score it: the log-probability it assigns to that sequence, teacher-forced, one forward pass, no sampling. A model assigns higher probability to text it would itself produce, so for the same fixed sequence the genuine model is measurably more confident than a different one.
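A minimal sketch of the gap itself, assuming you have scored the same fixed token sequence twice: once on a reference copy of the model you trust and once on the served endpoint. The sign convention (reference minus served) and the function names are my assumptions for illustration.

```python
def mean_logprob(token_logprobs):
    """Average log-probability per token, in nats."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return sum(token_logprobs) / len(token_logprobs)

def confidence_gap(reference_logprobs, served_logprobs):
    """Reference minus served, in nats per token.

    Both lists must score the SAME fixed token sequence, position by position.
    A positive gap means the served model was less confident than the reference.
    """
    if len(reference_logprobs) != len(served_logprobs):
        raise ValueError("both scorings must cover the same token sequence")
    return mean_logprob(reference_logprobs) - mean_logprob(served_logprobs)

# Made-up numbers for illustration only.
reference = [-0.4, -1.1, -0.2, -0.9]
served = [-0.9, -1.6, -0.5, -1.4]
print(f"gap = {confidence_gap(reference, served):+.2f} nats/token")
```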

Here are the numbers, scored on my own machines with the Qwen2.5 family:

| comparison | mean gap (nats/token) | genuine wins |
|---|---|---|
| honest floor (same model, q4 vs q8) | about 0.00 (std 0.07) | n/a |
| 1.5B impostor (2x cheaper) | +0.27 | 8 of 10 |
| 0.5B impostor (6x cheaper) | +0.66 | 10 of 10 |

That honest floor row is the important one. The same model at two quantizations drifts about 0.07 nats per token, centered on zero. The 2x-impostor signal of 0.27 is a little under four times that, and on short, low-entropy outputs the two distributions overlap. A single scoring challenge cannot reliably separate a 2x-cheaper impostor from an honest server running a different quant.

The means are clearly distinct though, so it works as an accumulating signal. With honest standard deviation around 0.07 and an impostor mean around 0.27, a running average over roughly 10 to 15 challenges separates them with confidence. So this is a slow background audit, not a one-shot test. Difficulty scales with how close the impostor is: a 6x downgrade falls out in a few checks, a 2x needs about a dozen, and a very close swap or a light quant downgrade may be impractical.
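One way to sketch the accumulating audit is a running mean with a standard-error band around it. Everything tunable here is an assumption: the 0.13 threshold is just a point roughly halfway between the honest mean (about 0) and the 2x-impostor mean (0.27), and `z` and `min_checks` are illustrative defaults, not values from the experiment.

```python
import math
import statistics

def audit_verdict(gaps, threshold=0.13, z=2.0, min_checks=3):
    """Classify accumulated per-challenge gaps (nats/token).

    Returns "suspect" when the running mean sits clearly above the threshold,
    "ok" when it sits clearly below, and "undecided" otherwise.
    """
    n = len(gaps)
    if n < min_checks:
        return "undecided"
    mean = statistics.mean(gaps)
    stderr = statistics.stdev(gaps) / math.sqrt(n)  # standard error of the mean
    if mean - z * stderr > threshold:
        return "suspect"
    if mean + z * stderr < threshold:
        return "ok"
    return "undecided"

# Made-up gap sequences for illustration only.
print(audit_verdict([0.05, -0.05, 0.02, -0.03, 0.0]))
print(audit_verdict([0.60, 0.70, 0.65, 0.70]))
print(audit_verdict([0.5, -0.2, 0.1]))
```

The band shrinks with the square root of the number of challenges, which is why a close impostor takes more checks than a distant one.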

I first got nonsense numbers, about -10 nats for tokens like "of" and "is", which is close to what a uniform-random guess over the vocabulary would score (about -12 nats for a vocabulary of roughly 150k tokens, as in Qwen2.5). The cause was that in llama-cpp-python 0.3.23 the high-level `create_completion` logprobs are wrong. The fix is to read per-position logits straight from the context and compute the log-softmax yourself. Sanity-check any logprob pipeline against a known sentence first. English should land around 0.5 to 1.5 bits per byte under a decent model. If you see 5, your scorer is broken, not the model.
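The "compute the log-softmax yourself" step can be sketched in plain Python. This assumes you have already pulled raw logits out of your runtime and aligned them so that row `i` is the distribution that predicts token `i`. How you extract them is runtime-specific and not shown here. The function names and the toy 4-token vocabulary are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def teacher_forced_logprobs(logits_per_position, token_ids):
    """Log-probability (nats) of each target token under its own position's logits.

    logits_per_position[i] must be the logits that PREDICT token_ids[i];
    getting this alignment off by one is an easy way to produce garbage.
    """
    if len(logits_per_position) != len(token_ids):
        raise ValueError("need exactly one logits row per target token")
    return [log_softmax(row)[tok] for row, tok in zip(logits_per_position, token_ids)]

def uniform_baseline(vocab_size):
    """Log-probability a uniform-random guess gives any token."""
    return -math.log(vocab_size)

# Toy example with a 4-token vocabulary, for illustration only.
rows = [[0.0, 0.0, 0.0, 0.0], [math.log(3), 0.0, 0.0, 0.0]]
targets = [2, 0]
scores = teacher_forced_logprobs(rows, targets)
print(scores, "uniform baseline:", uniform_baseline(4))
```

If common function words score near the uniform baseline for your vocabulary size, suspect the scorer before you suspect the model.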

This needs real logprob access to the model under test: open weights you serve, or a provider that exposes proper logprobs and lets you score an arbitrary sequence. Fully closed APIs that only return text are a harder problem, and I do not have a clean answer there yet. For open-weight serving, which covers most self-hosting and a good chunk of the hosted market, the scoring challenge is a usable audit.

The takeaway: you can verify an open-weight model is what it claims, but only statistically, over many checks, and the intuitive method does the opposite of what you want. I think that pattern, where the obvious metric is backwards and the real signal needs accumulation, shows up all over verification.
