{"slug": "same-model-different-provider-different-structured-output", "title": "Same model, different provider, different structured output", "summary": "A structured extraction pipeline using OpenRouter's `google/gemini-3-flash-preview` model failed to capture budget and date flexibility fields from user conversation history, but only when the model was routed to a specific upstream provider. The same model and prompt succeeded when routed to a different provider, revealing that provider identity, not prompt design, schema structure, or conversation history, caused the extraction failures. This means teams relying on latency-based routing for structured-output workloads must log and treat provider identity as part of the request fingerprint to avoid silent correctness failures.", "body_md": "# Model provider variance in structured extraction\n\n## TL;DR\n\nWe spent several hours debugging what looked like a prompt-engineering issue in a structured extraction pipeline. This turned out not to be a prompt issue.\n\nThe **same OpenRouter model** (`google/gemini-3-flash-preview`) **was being routed to different upstream providers**, and **one provider consistently failed** to extract certain fields when assistant history contained list-shaped content.\n\nPractical lesson:\n\n- log the routed provider\n- treat provider identity as part of the request fingerprint\n- be careful with latency-based routing on structured-output workloads\n\nFor structured-output workloads, provider choice can be part of correctness, not just latency or cost.\n\nWe run an LLM workflow that extracts structured constraints from conversation history and feeds them into a planning state machine.\n\nOne of the user flows kept failing. The conversation looked roughly like this:\n\nUser: \"I want to go to Brazil in two weeks.\"\n\nAssistant: suggests destination cities.\n\nUser: \"Let's go for Rio! We're two adults. No specific dates. 
Budget less than 5000 euros.\"\n\nAt that point the planner should have advanced toward itinerary generation, since it had all the necessary constraints: destinations, travellers, budget and date flexibility.\n\nInstead, the extraction layer returned:\n\n```\n{\n  \"destinations\": [{ \"type\": \"fixed\", \"options\": [\"Rio\"] }],\n  \"travellers\": { \"adults\": 2 }\n}\n```\n\nThe budget and date flexibility were missing.\n\nThe state machine interpreted this as incomplete planning state and routed the conversation back into information gathering.\n\n**Theory 1: prompt specificity.** Our first\nassumption was that the prompt probably was not explicit\nenough. The extraction prompt mentioned budget mostly in\nnegative examples (\"don't infer budget from destination,\netc.\"), so we added explicit extraction guidance:\n\nSet budget whenever the user explicitly states a price expectation, whether numeric or qualitative.\n\nNo meaningful improvement.\n\n**Theory 2: schema shape.** Maybe the\noptional-field schema was making it too easy for the\nmodel to emit a minimally valid object and stop early.\nWe tried two schema changes:\n\n- Add a mandatory field-scan checklist, in case the model was skipping optional fields too quickly.\n- Add a wrapper schema forcing every field into either `extracted` or `not_mentioned`, in case the model was emitting a minimally valid object and stopping early.\n\nNeither fixed it. The wrapper actually made things worse.\n\nIn some runs the model confidently classified explicitly stated fields as \"not mentioned\":\n\n```\n{\n  \"extracted\": {\n    \"destinations\": [{ \"type\": \"fixed\", \"options\": [\"Rio\"] }],\n    \"travellers\": { \"adults\": 2 }\n  },\n  \"not_mentioned\": [\n    \"budget\",\n    \"flexible_dates\"\n  ]\n}\n```\n\n**Theory 3: conversation history.** Maybe\nassistant history was contaminating extraction.\nRemoving assistant history made the extraction succeed\nreliably. 
Reintroducing assistant history caused\nfailures to return.\n\nInitially we thought the numerical values in the assistant response were confusing the model, such as:\n\n- flight prices\n- EUR amounts\n- durations\n\nBut the pattern was more specific than that.\n\n| Assistant history | Result |\n|---|---|\n| Full destination suggestions with prices | Fail |\n| Trailing conversational question only | Pass |\n| Numbered destination list without prices | Fail |\n| No assistant history | Pass |\n\nThe failure now looked more correlated with the\n*shape* of the assistant message than with its\nsize or numerical content: a long conversational\nparagraph passed, while a short numbered list failed. We\nwere close to implementing an architectural workaround\nand stripping suggestion lists before extraction, but\nfirst we reran the failing case several times to confirm\nthe pattern. It did not hold. One run passed\nunexpectedly, then failed again, then passed. Something\noutside the prompt, schema and visible message history\nhad to be changing.\n\nAfter the prompt and schema theories failed, we started looking for anything that could differ between otherwise identical extraction calls. The application was using OpenRouter with this model:\n\n```\nmodel: \"google/gemini-3-flash-preview\"\n```\n\nand latency-based routing:\n\n```\nsort: \"latency\"\n```\n\nThat meant two calls with the same model name, prompt, schema and temperature could still be routed to different upstream providers depending on real-time latency. We were not logging the routed provider (in retrospect, we should have been). 
We added a small helper:\n\n```ts\n// Return the routed provider name from whichever response field is populated,\n// or null when neither field is present.\nexport function extractProvider(result: unknown): string | null {\n  const r = result as {\n    providerMetadata?: { openrouter?: { providerName?: string } };\n    response?: { body?: { provider?: string } };\n  };\n\n  return (\n    r?.providerMetadata?.openrouter?.providerName ??\n    r?.response?.body?.provider ??\n    null\n  );\n}\n```\n\nOnce provider information was visible in logs, the behavior became reproducible almost immediately.\n\nWe reran the evaluation:\n\n- 5 runs per provider for each history shape\n- four history shapes\n- temperature 0\n- identical prompt and schema\n\n| Provider | Pass rate |\n|---|---|\n| Google AI Studio | 20 / 20 |\n| Google (Vertex AI) | 10 / 20 |\n\nThe Vertex breakdown was more interesting:\n\n| Assistant history | Vertex pass rate |\n|---|---|\n| Full suggestion list | 0 / 5 |\n| Trailing question only | 5 / 5 |\n| Numbered destination list | 0 / 5 |\n| No assistant history | 5 / 5 |\n\nSo the earlier \"history contamination\" hypothesis was not entirely wrong; it only held on one provider. The same prompt, schema, messages and model identifier produced materially different structured outputs depending on routing. 
A sanitized failing response looked like this:\n\n```\n{\n  \"provider\": \"Google\",\n  \"output\": {\n    \"destinations\": [{ \"type\": \"fixed\", \"options\": [\"Rio\"] }],\n    \"travellers\": { \"adults\": 2 }\n  }\n}\n```\n\nThe corresponding successful run:\n\n```\n{\n  \"provider\": \"Google AI Studio\",\n  \"output\": {\n    \"destinations\": [{ \"type\": \"fixed\", \"options\": [\"Rio\"] }],\n    \"travel_period\": \"no specific dates\",\n    \"travellers\": { \"adults\": 2 },\n    \"budget\": \"less than 5000 euros\",\n    \"flexible_dates\": true\n  }\n}\n```\n\nThe production fix was provider selection, not prompt tuning:\n\n```\nprovider: {\n  require_parameters: true,\n  order: [\"Google AI Studio\"],\n  allow_fallbacks: true,\n}\n```\n\n**We removed latency-based routing for this\nextraction workload and explicitly preferred the\nprovider that behaved reliably.**\n\nNo prompt changes were needed in the final version.\n\nA few things stood out afterwards.\n\n- **\"Same model\" is not necessarily the same behavior.** Once routing layers are involved, a model name becomes more like an abstract interface than a single inference implementation. Differences in templating, structured-output handling, or decoding behavior can matter a lot more than expected.\n- **Provider observability mattered more than prompt iteration.** Most of the debugging time was spent modifying prompts and schemas because we assumed the inference path was stable, when it wasn't. 
\n- **Preview-tier models seem especially susceptible to this kind of variance.** This was observed on a preview Gemini release, and OpenRouter has already written publicly about [provider variance and Exacto](https://openrouter.ai/announcements/provider-variance-introducing-exacto/), as well as [Auto Exacto](https://openrouter.ai/announcements/auto-exacto), partly for this reason.\n\nThe more important observation is simply that provider-level behavioral variance exists at all, and that it can remain invisible unless you log for it explicitly.", "url": "https://wpnews.pro/news/same-model-different-provider-different-structured-output", "canonical_source": "https://guilhermesfc.com/provider-variance-structured-extraction.html", "published_at": "2026-05-25 18:58:16+00:00", "updated_at": "2026-05-25 19:07:44.982081+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-products", "ai-infrastructure", "mlops"], "entities": ["OpenRouter", "Google", "Gemini"], "alternates": {"html": "https://wpnews.pro/news/same-model-different-provider-different-structured-output", "markdown": "https://wpnews.pro/news/same-model-different-provider-different-structured-output.md", "text": "https://wpnews.pro/news/same-model-different-provider-different-structured-output.txt", "jsonld": "https://wpnews.pro/news/same-model-different-provider-different-structured-output.jsonld"}}