# The Hidden Cost of Production AI: How to Build Fallback Chains That Don't Fail Silently

> Source:
> Published: 2026-06-20 09:02:38+00:00

The worst production bugs don't crash anything. They just silently degrade. One common pattern: an LLM provider has a partial outage and returns 200 OK with empty or nonsensical responses. No error, no alert, no 5xx. Just silence dressed as success.

That's the hidden cost of production AI. Not the API bills, not the latency. The failures that look like normal operation until a user tells you something's wrong.

I run a production LLM pipeline that scores 10,000+ job listings daily. I work with OpenAI, Anthropic, Gemini, DeepSeek, and Groq at various points in the stack. Here's what I've learned about building fallback chains that actually work.

Most teams start with one LLM provider. It works fine in development. Then production traffic hits and you discover the failure modes that don't show up in your test suite.

Rate limits hit at the worst possible moment. A provider's API can return degraded responses under load. A model version gets deprecated without enough notice. And the worst one: partial outages where the API responds but the content is garbage.

The pattern that separates hobby projects from production systems is a fallback chain that's tested, cost-aware, and observable. The goal isn't to eliminate failures. It's to make sure every failure degrades gracefully instead of silently.

After iterating on this across multiple projects, I've settled on a three-layer architecture that handles most failure modes without adding much complexity.

```
Layer 1: Primary model (best quality, highest cost)
Layer 2: Fallback model (good quality, lower cost)
Layer 3: Degraded mode (minimal quality, near-zero cost)
```

The key insight: each layer should be a different provider with a different failure profile. If one provider is slow or down, another one probably isn't affected.
If both are slow, a cheaper or faster model can keep the lights on.

Here's how I structure this in practice:

```
// ModelConfig, callProvider, executeWithTimeout, isValidResponse, logWarning
// and logError are assumed to be defined elsewhere in the codebase.
// LLMResponse stands in for whatever shape callProvider resolves to
// (the name is an assumption).
interface LLMFallbackConfig {
  primary: ModelConfig;
  fallback: ModelConfig;
  degraded: ModelConfig;
  timeout: number;
  maxRetries: number;
}

async function executeWithFallback(
  prompt: string,
  config: LLMFallbackConfig
): Promise<LLMResponse> {
  // Providers are tried in order: best quality first, cheapest last.
  const providers = [
    { name: 'primary', config: config.primary },
    { name: 'fallback', config: config.fallback },
    { name: 'degraded', config: config.degraded },
  ];

  for (const provider of providers) {
    try {
      const result = await executeWithTimeout(
        callProvider(provider.config, prompt),
        config.timeout
      );

      if (isValidResponse(result)) {
        return result;
      }

      // A 200 with unusable content: log the silent failure for
      // observability, then fall through to the next provider.
      logWarning('Empty or invalid response from provider', provider.name);
    } catch (error) {
      logError('Provider failed', provider.name, error);
    }
  }

  throw new Error('All providers exhausted');
}
```

The `isValidResponse` check is critical. You need to validate that the output is actually useful, not just that the HTTP response was 200. For structured outputs, this means schema validation. For text, it means length checks and content quality heuristics.

Not every request needs GPT-4. The trick is knowing which ones do and routing accordingly. In my job scoring pipeline, I use three tiers:

Tier 1: Complex extraction tasks that need function calling with strict schemas. These go to GPT-4o or Claude 3.5 Sonnet. Higher cost, higher reliability.

Tier 2: Classification and scoring tasks where the schema is simple but the reasoning matters. These go to GPT-4o mini or Gemini 2.0 Flash. Good quality at a fraction of the cost.

Tier 3: Pre-processing and fallback tasks where speed matters more than quality. These go to Groq or DeepSeek V4 Flash. Near-instant responses, minimal cost.
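
One way to keep the tier assignments above out of the routing code is to hold them as data. The sketch below is a hypothetical registry, not the pipeline's actual configuration: `Tier`, `TierSpec`, `MODEL_TIERS`, and `candidatesFrom` are names invented for illustration, the strings are the display names used in this article rather than exact API model identifiers, and falling through to cheaper tiers is an assumption about how the tiers might chain.

```typescript
// Hypothetical tier registry; names and shape are illustrative only.
type Tier = 'tier1' | 'tier2' | 'tier3';

interface TierSpec {
  purpose: string;
  // Display names from the article, not exact API model identifiers.
  candidates: string[];
}

const MODEL_TIERS: Record<Tier, TierSpec> = {
  tier1: {
    purpose: 'complex extraction with strict schemas',
    candidates: ['GPT-4o', 'Claude 3.5 Sonnet'],
  },
  tier2: {
    purpose: 'classification and scoring',
    candidates: ['GPT-4o mini', 'Gemini 2.0 Flash'],
  },
  tier3: {
    purpose: 'pre-processing and fallback',
    candidates: ['Groq', 'DeepSeek V4 Flash'],
  },
};

// Tiers ordered from most to least expensive.
const TIER_ORDER: Tier[] = ['tier1', 'tier2', 'tier3'];

// Returns the candidates for the requested tier, followed by the
// candidates of every cheaper tier (assumed fallback order).
function candidatesFrom(tier: Tier): string[] {
  const result: string[] = [];
  for (let i = TIER_ORDER.indexOf(tier); i < TIER_ORDER.length; i++) {
    for (const name of MODEL_TIERS[TIER_ORDER[i]].candidates) {
      result.push(name);
    }
  }
  return result;
}
```

With this shape, `candidatesFrom('tier2')` yields the Tier 2 models followed by the Tier 3 ones, so adding or swapping a model is a one-line data change rather than a code change.
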
The routing logic is straightforward:

```
function selectModel(task: Task, context: RequestContext): ModelConfig {
  // Complex or strict-schema tasks always get the strongest model.
  if (task.complexity === 'high' || task.requiresStrictSchema) {
    return getPrimaryModel();
  }

  if (context.timeBudget < 500) {
    // Speed-critical path
    return getFastModel();
  }

  if (context.costBudget === 'minimal') {
    // Cost-sensitive path
    return getCheapModel();
  }

  // Default to balanced model
  return getDefaultModel();
}
```

This approach cuts API costs by using expensive models only when they're actually needed, while keeping quality acceptable for most tasks.

Most people think about LLM fallbacks. Few think about embedding fallbacks. But if your RAG pipeline's embedding provider goes down, your entire retrieval layer stops working.

Suppose an embedding API has an outage. Your vector search returns zero results. Users see empty responses. No error, no context, just nothing.

So I maintain two embedding providers in parallel for every RAG pipeline I build:

```
interface EmbeddingProvider {
  name: string;
  embed(text: string): Promise<number[]>;
  healthCheck(): Promise<boolean>;
}

class RedundantEmbedder {
  // Index of the provider that most recently succeeded; tried first next time.
  private activeProvider: number = 0;

  constructor(private providers: EmbeddingProvider[]) {}

  async embed(text: string): Promise<number[]> {
    for (let i = 0; i < this.providers.length; i++) {
      // Start from the last known-good provider and wrap around.
      const index = (this.activeProvider + i) % this.providers.length;
      try {
        const result = await this.providers[index].embed(text);
        this.activeProvider = index;
        return result;
      } catch (error) {
        logError('Embedding provider failed', this.providers[index].name, error);
      }
    }
    throw new Error('All embedding providers failed');
  }
}
```

The vector store needs to support multiple embedding dimensions. I use pgvector with separate columns for each embedding provider. Queries check whichever column has data. Vectors from different providers are not comparable, so a query has to be embedded by the same provider that produced the column being searched.

The most dangerous failures in production AI are the ones that don't look like failures. Empty responses, degraded quality, hallucinated data that passes schema validation.

I track three metrics for every LLM call:

Response time.
If it's suspiciously fast for a complex prompt, something probably went wrong. The model likely returned a cached or truncated response.

Output length. Empty or very short responses are a red flag. I log warnings when response length falls below a configurable threshold for the task type.

Schema compliance. For structured outputs, I validate the response against the expected schema. If it passes but the content is garbage (all nulls, default values, repetitive text), that's a silent failure.

```ts
function monitorLLMCall(call: LLMCallResult, context: TaskContext) {
  const metrics = {
    // Assumes startTime and endTime are millisecond timestamps.
    duration: call.endTime - call.startTime,
    outputLength: call.response.length,
    schemaCompliance: validateSchema(call.response, context.schema),
    qualityScore: estimateOutputQuality(call.response, context.taskType),
  };

  // Too fast for a real completion: likely cached or truncated.
  if (metrics.duration < 100) {
    alertEngine('Suspiciously fast response', call);
  }

  if (metrics.outputLength < context.minExpectedLength) {
    alertEngine('Short response detected', call);
  }

  // Valid shape, useless content: the silent failure case.
  if (metrics.schemaCompliance && metrics.qualityScore < 0.5) {
    alertEngine('Schema-compliant but low quality', call);
  }

  logMetrics(metrics, context);
}
```

This catches silent failures before they cascade. The alert fires on the first bad response instead of after users complain, so you can ship a fix before the problem spreads.

A well-designed fallback chain gives each request multiple layers to fall through. If the primary model fails, the fallback takes over quickly. If both fail, the degraded mode still returns a usable response instead of an error.

The cost tradeoff is real. You pay for unused capacity. But the alternative is a silent outage that erodes user trust over hours or days.

If your team is deploying AI features in production and worrying about reliability, that's the kind of thing I help with. Happy to compare notes on what's worked and what hasn't in your specific setup.

*Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.*