OpenAI-Compatible APIs Are Great Until Streaming Breaks: What I Check Before Switching Providers


Swapping an AI provider looks easy on paper.

Change the baseURL, keep the OpenAI SDK, point your app at a different model, and you're done.

And honestly, for basic non-streaming chat completions, that often works.

But the first place I usually see things break is streaming.

Not because OpenAI-compatible APIs are bad. They're incredibly useful. But "compatible" can mean different things once you move beyond a simple request/response call:

I work on TokenBay, so I spend a lot of time testing OpenAI-compatible model routing across providers. This is the checklist I use before moving a production app from one provider to another.

Most people test provider compatibility with something like this:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.API_KEY,
  baseURL: process.env.BASE_URL,
});

const response = await client.chat.completions.create({
  model: process.env.MODEL,
  messages: [
    { role: "user", content: "Say hello in one sentence." }
  ],
});

console.log(response.choices[0].message.content);

If that works, great.

But it doesn't tell you whether streaming works in your actual app.

For a lot of AI products, streaming is not a nice-to-have. It's the difference between "this feels responsive" and "did the app freeze?"

So I test streaming separately.

Here's the smallest script I usually start with.

Create a file called test-streaming.mjs:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.API_KEY,
  baseURL: process.env.BASE_URL,
  timeout: 30_000,
});

const model = process.env.MODEL;

if (!model) {
  throw new Error("Missing MODEL env var");
}

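// Timing starts before the request is sent, so firstTokenMs includes connection setup.
const startedAt = Date.now();
let firstTokenAt = null; // set when the first non-empty content delta arrives
let chunkCount = 0; // every chunk the SDK yields
let contentChunks = 0; // chunks that carried text
let emptyChunks = 0; // chunks without text (role, finish signal, tool call deltas)
let finalText = "";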

const stream = await client.chat.completions.create({
  model,
  stream: true,
  temperature: 0,
  messages: [
    {
      role: "user",
      content:
        "Write a short explanation of why streaming matters in AI apps. Keep it under 80 words.",
    },
  ],
});

for await (const chunk of stream) {
  chunkCount += 1;

  const delta = chunk.choices?.[0]?.delta;
  const content = delta?.content ?? "";

  if (content) {
    if (firstTokenAt === null) {
      firstTokenAt = Date.now();
    }

    contentChunks += 1;
    finalText += content;
    process.stdout.write(content);
  } else {
    emptyChunks += 1;
  }
}

const finishedAt = Date.now();

console.log("\n\n--- streaming diagnostics ---");
console.log({
  model,
  chunkCount,
  contentChunks,
  emptyChunks,
  firstTokenMs: firstTokenAt ? firstTokenAt - startedAt : null,
  totalMs: finishedAt - startedAt,
  chars: finalText.length,
});

Install the SDK:

npm install openai

Then run it against any OpenAI-compatible endpoint:

API_KEY="your_api_key" \
BASE_URL="https://your-provider.example/v1" \
MODEL="your-model-name" \
node test-streaming.mjs

If you're using OpenAI directly, the base URL is usually not needed:

API_KEY="your_openai_key" \
MODEL="gpt-4.1-mini" \
node test-streaming.mjs

If you're testing a gateway such as TokenBay, the idea is the same: keep the OpenAI SDK, change the baseURL, and test the model you actually plan to use.

I don't just check whether text prints.

That is the first pass, but it's not enough.

The total response time matters, but streaming UX depends heavily on first-token latency.

If the full response takes 5 seconds but the first token arrives in 600ms, the app feels alive.

If the first token arrives after 5 seconds, streaming is technically working but the UX is basically the same as non-streaming.

In the script above, I look at:

firstTokenMs

For production apps, I usually compare this across:

I don't need perfect lab numbers. I just want to know if the new route is obviously slower before I ship it.
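One way to make that comparison repeatable is to run the script a few times per route, collect the firstTokenMs values, and compare medians. This is a minimal sketch, assuming the samples have already been gathered by hand. The `median` and `isObviouslySlower` helpers, the 1.5x factor, and the sample numbers are all illustrative assumptions, not part of any SDK.

```javascript
// Hypothetical helpers: neither name comes from the OpenAI SDK.
function median(samples) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flags the candidate route when its median first-token latency exceeds the
// current route's median by more than `factor` (1.5 is an arbitrary pick).
function isObviouslySlower(currentMs, candidateMs, factor = 1.5) {
  const current = median(currentMs);
  const candidate = median(candidateMs);
  if (current === null || candidate === null) return null;
  return candidate > current * factor;
}

// Made-up firstTokenMs samples, for illustration only.
console.log(isObviouslySlower([580, 610, 640], [1900, 2100, 2050]));
```

The median is used instead of the mean so a single cold-start outlier doesn't dominate a handful of samples.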

This one is sneaky.

Sometimes the SDK receives a stream, but an intermediate layer buffers the response and releases it all at once.

That can happen because of:

A rough smell test:

chunkCount
contentChunks
firstTokenMs
totalMs

If firstTokenMs and totalMs are almost identical, I get suspicious.

It doesn't always mean buffering, but it's worth checking.
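The smell test can be turned into a small check on the diagnostics object the script prints. This is a rough sketch: `looksBuffered` is a hypothetical helper, and the 0.9 ratio and 3-chunk minimum are assumptions to tune for your own prompts, not established thresholds.

```javascript
// Hypothetical heuristic over the diagnostics from test-streaming.mjs.
function looksBuffered({ firstTokenMs, totalMs, contentChunks }) {
  // No text arrived at all: that is a different failure, not buffering.
  if (firstTokenMs === null || totalMs <= 0) return false;
  // A very short answer can legitimately arrive in one or two chunks.
  if (contentChunks < 3) return false;
  // Many chunks, but the first one showed up right at the end of the request.
  return firstTokenMs / totalMs >= 0.9;
}

console.log(looksBuffered({ firstTokenMs: 4900, totalMs: 5000, contentChunks: 40 }));
console.log(looksBuffered({ firstTokenMs: 600, totalMs: 5000, contentChunks: 40 }));
```

A positive result is only a prompt to investigate the network path between the provider and the client; a slow model with a short answer can produce the same numbers.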

Some streaming APIs send chunks that don't contain text content.

That can happen for role metadata, finish signals, tool call deltas, or provider-specific fields.

So I don't treat this as a failure:

emptyChunks > 0

But I do check whether the final assembled text is correct.

The thing I care about is not "every chunk has content." The thing I care about is:

finalText.length > 0

and whether the text is complete.
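One possible way to check completeness is to record the finish_reason alongside the assembled text. The sketch below assumes chunks follow the OpenAI chat completion chunk shape (choices[0].delta.content and choices[0].finish_reason). The helper names are hypothetical, and some providers may omit finish_reason or use other values, so treat `isComplete` as a starting point.

```javascript
// Hypothetical accumulator for text deltas.
function createTextAccumulator() {
  return { text: "", finishReason: null, emptyChunks: 0 };
}

function applyChunk(state, chunk) {
  const choice = chunk.choices?.[0];
  const content = choice?.delta?.content;
  if (content) {
    state.text += content;
  } else {
    state.emptyChunks += 1; // role metadata, finish signal, tool call delta, etc.
  }
  if (choice?.finish_reason) {
    state.finishReason = choice.finish_reason;
  }
  return state;
}

// "stop" is the normal end of a text answer; "length" suggests truncation.
function isComplete(state) {
  return state.text.length > 0 && state.finishReason === "stop";
}

// Hand-written sample chunks standing in for a real stream.
const sampleTextChunks = [
  { choices: [{ delta: { role: "assistant" } }] },
  { choices: [{ delta: { content: "Hel" } }] },
  { choices: [{ delta: { content: "lo" } }] },
  { choices: [{ delta: {}, finish_reason: "stop" }] },
];
const textState = sampleTextChunks.reduce(applyChunk, createTextAccumulator());
console.log(textState, isComplete(textState));
```

In the test script, `applyChunk` would be called inside the existing `for await` loop.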

A lot of streaming bugs are not provider bugs. They're parser bugs.

For example, a frontend might assume every chunk has this shape:

chunk.choices[0].delta.content

That works for simple text.

But once you add tool calls, JSON mode, or multimodal responses, the stream can include other delta fields.

A safer frontend parser should tolerate chunks where content is missing.

Bad:

const token = chunk.choices[0].delta.content;
render(token.toUpperCase());

Better:

const token = chunk.choices?.[0]?.delta?.content;

if (token) {
  render(token);
}

This sounds tiny, but it saves you from a lot of random "Cannot read properties of undefined" errors during provider migration.

Non-streaming calls usually fail before you render anything.

Streaming can fail after you've already shown partial output.

That means your app needs to decide what to do with incomplete text.

I usually test three cases:

For the timeout case, ask for a longer answer and lower your client timeout.

Example:

const client = new OpenAI({
  apiKey: process.env.API_KEY,
  baseURL: process.env.BASE_URL,
  timeout: 2_000,
});

Then ask for something long:

{
  role: "user",
  content: "Write a detailed 1500-word explanation of streaming APIs."
}

The exact error shape may differ by provider or network path. Your app should not depend on one extremely specific error message.
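To exercise the "partial output, then failure" path without depending on a particular provider error, the stream consumer can be wrapped so it always returns what it managed to read. This is a minimal sketch: `consumeStream` and `flakyStream` are hypothetical names, and the generator is a stand-in for a real SDK stream, not an SDK object.

```javascript
// Hypothetical wrapper: consumes any async iterable of chunks and never throws.
async function consumeStream(stream, onToken = () => {}) {
  let text = "";
  try {
    for await (const chunk of stream) {
      const token = chunk.choices?.[0]?.delta?.content;
      if (token) {
        text += token;
        onToken(token);
      }
    }
    return { text, complete: true, error: null };
  } catch (error) {
    // Keep whatever was already rendered so the UI can mark it as partial.
    return { text, complete: false, error };
  }
}

// Stand-in for a provider stream that dies mid-response.
async function* flakyStream() {
  yield { choices: [{ delta: { content: "Streaming " } }] };
  yield { choices: [{ delta: { content: "matters" } }] };
  throw new Error("simulated connection reset");
}

consumeStream(flakyStream()).then((result) => {
  console.log(result.complete, JSON.stringify(result.text), result.error.message);
});
```

With the real SDK, the object returned by `client.chat.completions.create({ stream: true, ... })` would be passed in place of `flakyStream()`. The `create` call itself can also throw before any chunk arrives, so it needs its own error handling.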

In production, I care about:

This is easy to forget.

For non-streaming calls, usage usually comes back in the response object.

For streaming calls, usage may be missing, delayed, provider-specific, or only available from a dashboard/API after the request finishes.

If your product depends on per-request cost tracking, don't assume streaming usage works the same way.

Before switching providers, I check:

For internal tools, this may not matter much.

For SaaS apps where you meter customer usage, it matters a lot.
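On OpenAI's Chat Completions API, streaming usage can be requested with `stream_options: { include_usage: true }`, which adds a final chunk carrying a `usage` object and an empty `choices` array. Whether another provider or gateway honors that option is exactly the thing to test. The sketch below uses a hypothetical `captureUsage` helper and hand-written sample chunks with made-up token counts.

```javascript
// Options to merge into the client.chat.completions.create(...) call.
// Assumption: the provider accepts stream_options; some may ignore or reject it.
const streamingRequestOptions = {
  stream: true,
  stream_options: { include_usage: true },
};

// Hypothetical helper: keeps the last non-null usage object seen in the stream.
function captureUsage(previousUsage, chunk) {
  return chunk.usage ?? previousUsage;
}

let usage = null;
const sampleUsageChunks = [
  { choices: [{ delta: { content: "Hi" } }], usage: null },
  { choices: [{ delta: {}, finish_reason: "stop" }], usage: null },
  // Final usage chunk: note the empty choices array.
  { choices: [], usage: { prompt_tokens: 12, completion_tokens: 1, total_tokens: 13 } },
];
for (const chunk of sampleUsageChunks) {
  usage = captureUsage(usage, chunk);
}
console.log(usage ?? "no usage reported in stream");
```

If `usage` is still null after a real stream finishes, per-request metering has to come from somewhere else, such as the provider's dashboard or a separate API.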

Plain text streaming is the easy case.

Tool calls are where compatibility claims need more testing.

If your app uses tools/function calling, test that separately.

Things I check:

A basic text streaming test passing does not mean your agent loop is safe.

I learned this the annoying way, which is usually how production lessons arrive.
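For reference, streamed tool calls in the OpenAI format arrive as fragments under delta.tool_calls, each with an index, and the arguments string is split across chunks. A minimal accumulator might look like the sketch below. The helper names and the `get_weather` tool are hypothetical, and a provider that deviates from this shape (for example by omitting index) is the kind of difference this test is meant to surface.

```javascript
// Hypothetical accumulator keyed by tool call index.
function applyToolCallDeltas(calls, chunk) {
  const deltas = chunk.choices?.[0]?.delta?.tool_calls ?? [];
  for (const delta of deltas) {
    const index = delta.index ?? 0; // assumption: a missing index means a single call
    const call = (calls[index] ??= { id: null, name: "", arguments: "" });
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name = delta.function.name;
    if (delta.function?.arguments) call.arguments += delta.function.arguments;
  }
  return calls;
}

// Parse only after the stream ends; argument fragments are not valid JSON alone.
function parseToolCalls(calls) {
  return calls.filter(Boolean).map((call) => {
    try {
      return { ...call, parsed: JSON.parse(call.arguments), error: null };
    } catch (error) {
      return { ...call, parsed: null, error };
    }
  });
}

// Hand-written sample chunks for a made-up get_weather tool.
const sampleToolChunks = [
  { choices: [{ delta: { tool_calls: [{ index: 0, id: "call_1", type: "function", function: { name: "get_weather", arguments: "" } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: '{"city":' } }] } }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: '"Paris"}' } }] } }] },
  { choices: [{ delta: {}, finish_reason: "tool_calls" }] },
];
const toolCalls = parseToolCalls(sampleToolChunks.reduce(applyToolCallDeltas, []));
console.log(toolCalls);
```

A call whose `error` is set after the stream ends usually means the arguments were truncated or assembled in the wrong order, which is worth catching before the agent loop tries to execute it.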

Before switching an app to a new OpenAI-compatible provider, I run through this:

The main point: don't test only the happy path.

OpenAI-compatible APIs can make provider switching much easier, but streaming is where the abstraction gets tested for real.

If you want to test this kind of provider switch without rewriting your OpenAI SDK code, TokenBay is one option.

The setup is intentionally boring:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.TOKENBAY_API_KEY,
  baseURL: "https://api.tokenbay.com/v1",
});

Then you can run the same streaming test against different models by changing the MODEL env var.

API_KEY="your_tokenbay_key" \
BASE_URL="https://api.tokenbay.com/v1" \
MODEL="your-model-name" \
node test-streaming.mjs

That's the main reason I like OpenAI-compatible routing: the migration surface area stays small. You can test GPT, Claude, Gemini, Qwen, GLM, or other supported models without changing the rest of your application code.

TokenBay also gives you one place to manage API usage across providers, which is helpful when you're comparing models or setting up fallbacks.

A good first test is:

If those checks pass, then you can start thinking about routing rules, fallback models, and cost optimization.

The test script in this post is not fancy.

That's the point.

Before I move a real app to a different provider, I want a repeatable check that answers:

If you want to try the same checklist with TokenBay, you can start here:

Run the script with your own prompts, models, and frontend stack. The useful result is not just "the API call worked." The useful result is knowing whether your app still feels good when streaming, retries, fallbacks, and real users get involved.

source & further reading

Original article: dev.to