How to Build a Multi-Model LLM Fallback Layer Without Rewriting Your App


Most LLM integrations start as a single provider call.

That is usually the right move. You pick one strong model, wire up a chat completions request, ship the feature, and learn from real users.

The problem starts later.

Your support assistant needs better latency. Your document workflow needs a larger context window. Your extraction job is too expensive on the flagship model. A provider returns rate-limit errors during a launch. A new model is cheaper for background tasks but not good enough for customer-facing reasoning.

At that point, model choice is no longer a one-time SDK decision. It becomes application infrastructure.

This post walks through a practical way to build a small multi-model fallback layer so your product can use more than one provider without spreading provider-specific logic through the codebase.

A first integration often looks like this:

const response = await client.chat.completions.create({
  model: "gpt-4.1",
  messages,
});

That is fine for a prototype. In production, provider-specific details usually accumulate around that call.

If each product feature owns those details, every model change becomes a product change. You do not only switch a model name. You update error handling, logging, pricing assumptions, quality tests, and maybe even prompt shape.

The goal is not to hide every model difference. Some differences matter. The goal is to keep provider decisions in one place.

Instead of letting every feature pick a provider directly, define the type of work the request represents.

For example:

type LlmTask =
  | "support_chat"
  | "document_summary"
  | "data_extraction"
  | "title_generation"
  | "long_context_analysis";

Then map tasks to model policies:

type ModelRoute = {
  primary: string;
  fallback?: string[];
  maxLatencyMs?: number;
  maxInputTokens?: number;
  allowFallback: boolean;
};

const routes: Record<LlmTask, ModelRoute> = {
  support_chat: {
    primary: "anthropic/claude-sonnet",
    fallback: ["openai/gpt-4.1", "google/gemini-pro"],
    maxLatencyMs: 5000,
    allowFallback: true,
  },
  data_extraction: {
    primary: "openai/gpt-4.1-mini",
    fallback: ["qwen/qwen-plus"],
    maxLatencyMs: 3000,
    allowFallback: true,
  },
  long_context_analysis: {
    primary: "google/gemini-pro",
    fallback: [],
    maxInputTokens: 1_000_000,
    allowFallback: false,
  },
  document_summary: {
    primary: "openai/gpt-4.1-mini",
    fallback: ["deepseek/deepseek-chat"],
    allowFallback: true,
  },
  title_generation: {
    primary: "qwen/qwen-plus",
    fallback: ["openai/gpt-4.1-mini"],
    allowFallback: true,
  },
};

This gives your application a stable interface:

const result = await llm.generate({
  task: "data_extraction",
  messages,
  customerId,
});

The feature does not need to know whether the request went to OpenAI, Anthropic, Gemini, Qwen, or another provider. It only needs the result and the metadata required for debugging.
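One way to make "the result and the metadata required for debugging" concrete is a small return type. The sketch below is illustrative only: `LlmResult`, its field names, and `describeResult` are assumptions introduced here, not part of any provider SDK.

```typescript
// Hypothetical shape for what llm.generate() could hand back to a feature.
type LlmResult = {
  text: string;          // the generated output the feature actually uses
  model: string;         // the model that produced it, for debugging
  fallbackUsed: boolean; // true when the primary model was skipped or failed
  latencyMs: number;     // wall-clock time across all attempts
  attempts: string[];    // models tried, in order
};

// Build a one-line summary suitable for a log message or a debug panel.
function describeResult(result: LlmResult): string {
  const via = result.fallbackUsed ? "fallback" : "primary";
  const tried = result.attempts.join(" -> ");
  return `${result.model} (${via}, ${result.latencyMs} ms, tried: ${tried})`;
}
```

Feature code reads `text` and ignores the rest; the other fields exist so that an odd answer can be traced back to the model that produced it.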

Fallback sounds simple: if the primary model fails, try another one.

In practice, fallback rules need to be conservative because not all failures are the same.

You can usually retry or fall back on:

You should be careful with fallback on:

Here is a simplified fallback runner:

type GenerateRequest = {
  task: LlmTask;
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
  customerId: string;
};

async function generateWithFallback(request: GenerateRequest) {
  const route = routes[request.task];
  const candidates = [route.primary, ...(route.fallback ?? [])];

  let lastError: unknown;

  for (const model of candidates) {
    const startedAt = Date.now();
    let response;

    try {
      response = await callModelProvider({
        model,
        messages: request.messages,
      });
    } catch (error) {
      lastError = error;

      // Stop when the route forbids fallback or the error is not safe to retry elsewhere.
      if (!route.allowFallback || !isFallbackSafe(error)) {
        throw error;
      }
      continue;
    }

    // Log outside the try block so a logging failure does not trigger a fallback call.
    await logUsage({
      customerId: request.customerId,
      task: request.task,
      model,
      latencyMs: Date.now() - startedAt,
      inputTokens: response.usage.inputTokens,
      outputTokens: response.usage.outputTokens,
      fallback: model !== route.primary,
    });

    return response;
  }

  // Every candidate failed with a fallback-safe error.
  throw lastError;
}
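The runner calls `isFallbackSafe` without defining it. A minimal sketch follows, assuming provider errors expose an HTTP-style `status` or a Node-style `code`; the `ProviderErrorLike` shape and the exact sets of statuses and codes are assumptions to adjust for your providers, not a definitive classification.

```typescript
// Assumption: errors thrown by the provider client carry `status` and/or `code`.
type ProviderErrorLike = { status?: number; code?: string };

// Statuses that usually point to a transient or provider-side problem.
const RETRYABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

// Network-level error codes that usually mean the request never completed.
const RETRYABLE_CODES = new Set(["ETIMEDOUT", "ECONNRESET"]);

function isFallbackSafe(error: unknown): boolean {
  // Anything that is not an object gives us nothing to classify: be conservative.
  if (typeof error !== "object" || error === null) return false;

  const { status, code } = error as ProviderErrorLike;

  // When a status is present it decides the outcome, so a 400 or 401 never falls back.
  if (typeof status === "number") return RETRYABLE_STATUSES.has(status);

  return typeof code === "string" && RETRYABLE_CODES.has(code);
}
```

The default is deliberately "no": an error the function does not recognize stops the loop instead of silently sending the same request to another provider.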

The important part is the policy, not the exact code. You want the fallback decision to be explicit, observable, and different for each workload.
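The routes declare `maxLatencyMs`, but the simplified runner never enforces it. One possible way to apply the budget is to race the provider call against a timer, as sketched here. `withLatencyBudget` and `LatencyBudgetExceeded` are hypothetical names, and passing an `AbortSignal` to the provider call assumes your client accepts one.

```typescript
// Error thrown when a call does not finish within its latency budget.
class LatencyBudgetExceeded extends Error {
  budgetMs: number;

  constructor(budgetMs: number) {
    super(`LLM call exceeded ${budgetMs} ms`);
    this.name = "LatencyBudgetExceeded";
    this.budgetMs = budgetMs;
  }
}

// Run `call`, rejecting with LatencyBudgetExceeded if it takes longer than budgetMs.
// With no budget, the call runs without a timer.
async function withLatencyBudget<T>(
  call: (signal: AbortSignal) => Promise<T>,
  budgetMs?: number,
): Promise<T> {
  const controller = new AbortController();
  if (budgetMs === undefined) return call(controller.signal);

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // lets the caller cancel the underlying request
      reject(new LatencyBudgetExceeded(budgetMs));
    }, budgetMs);
  });

  try {
    return await Promise.race([call(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer once either side settles
  }
}
```

Inside the runner, the provider call would be wrapped as `withLatencyBudget((signal) => callModelProvider({ model, messages, signal }), route.maxLatencyMs)`, and `isFallbackSafe` would need to treat `LatencyBudgetExceeded` as safe for routes where a slow primary should fall through to the next candidate.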

LLM cost visibility is easy to postpone when usage is small. That is a trap.

By the time token cost is visible on your cloud bill, it is usually harder to know which feature, model, customer, or prompt caused the increase.

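At minimum, log the customer, task, model, input and output token counts, latency, and whether the request used a fallback.

This lets you answer practical questions, such as which feature, model, or customer is driving an increase in cost.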

You do not need a complicated system to start. A database table or analytics event is enough:

await db.llmUsage.create({
  data: {
    customerId,
    task,
    model,
    inputTokens,
    outputTokens,
    latencyMs,
    fallback,
    createdAt: new Date(),
  },
});
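Once rows like these exist, a rough cost breakdown is a short aggregation. The sketch below assumes the rows have already been loaded into memory and that you maintain your own price table; `UsageRow`, `Price`, and `costByTask` are hypothetical names, and prices must come from your providers' current pricing rather than from this example.

```typescript
// Subset of the logged fields that the cost calculation needs.
type UsageRow = {
  task: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
};

// Price in USD per one million tokens (values are supplied by the caller).
type Price = { inputPerMTok: number; outputPerMTok: number };

// Sum the estimated cost of each task across all usage rows.
function costByTask(
  rows: UsageRow[],
  prices: Record<string, Price>,
): Map<string, number> {
  const totals = new Map<string, number>();

  for (const row of rows) {
    const price = prices[row.model];
    if (!price) continue; // unknown model: skip instead of guessing a price

    const cost =
      (row.inputTokens * price.inputPerMTok +
        row.outputTokens * price.outputPerMTok) /
      1_000_000;

    totals.set(row.task, (totals.get(row.task) ?? 0) + cost);
  }

  return totals;
}
```

Grouping by `customerId` or `model` instead of `task` only requires changing the key passed to `totals.set`.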

An OpenAI-compatible API can reduce integration work, but compatibility is not the same as interchangeability.

Models can differ in:

The abstraction should keep common product code clean while still exposing model-specific facts where they matter.

A good rule: hide provider plumbing, not product-relevant behavior.
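As an illustration of that rule, the routing layer can keep a small table of model facts and filter candidates against it before calling anything. This is a sketch under stated assumptions: `ModelCapabilities`, `Requirements`, and `eligibleModels` are hypothetical, and the two capability fields are examples rather than a complete list of how models differ.

```typescript
// Hypothetical per-model facts that the routing layer exposes instead of hiding.
type ModelCapabilities = {
  maxInputTokens: number;
  supportsStructuredOutput: boolean;
};

// What a single request needs from whichever model serves it.
type Requirements = {
  inputTokens: number;
  needsStructuredOutput: boolean;
};

// Keep only the candidates that can serve this request, preserving route order.
function eligibleModels(
  candidates: string[],
  capabilities: Record<string, ModelCapabilities>,
  needs: Requirements,
): string[] {
  return candidates.filter((model) => {
    const caps = capabilities[model];
    if (!caps) return false; // unknown model: do not assume it fits
    if (needs.inputTokens > caps.maxInputTokens) return false;
    return caps.supportsStructuredOutput || !needs.needsStructuredOutput;
  });
}
```

Filtering first means a fallback that cannot hold the prompt, or cannot return the format the feature parses, is never attempted in the first place.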

You can build this layer yourself if you have specific routing, compliance, or observability requirements.

You can also use an OpenAI-compatible AI gateway if you want the model catalog, routing, pricing, and fallback surface managed outside your app. For example, datallmlab is one implementation option for teams that want access to GPT, Claude, Gemini, Qwen, DeepSeek, and other models through a single API.

The architectural point is the same either way: keep model selection outside feature code.

Before adding a second provider, decide:

The best model for your product today may not be the best model next quarter.

That does not mean you should rewrite your app every time the model landscape changes. It means the app should treat model choice as a routing decision, not a hard-coded dependency.

Start small: one routing function, one usage log, one conservative fallback policy.

That is enough to keep your AI features flexible without turning your codebase into provider glue.

Source: dev.to (original article)
