{"slug": "semantic-caching-the-vlm-step-in-our-product-photo-pipeline", "title": "Semantic caching the VLM step in our product-photo pipeline", "summary": "Photoroom reduced its vision-language model inference costs by approximately 62% within three weeks by deploying Bifrost as a semantic caching layer in front of the VLM step of its product-photo diffusion pipeline. The company discovered that Claude and Gemini Vision calls together cost more than the GPU lease for the same workload, with the VLM and prompt-rewrite layer accounting for 58% of total inference spend. The caching implementation achieved a 71% cache hit rate for captioning calls and a 49% hit rate for rewrite calls, while also cutting p95 latency for the caption step from 1.8 seconds to 0.31 seconds.", "body_md": "**TL;DR: We put Bifrost in front of the VLM step that captions and rewrites prompts for our product-photo diffusion pipeline. Semantic caching cut that bill by ~62% in three weeks. The diffusion side, where the GPUs live, was never the cost we should have been worrying about.**\n\nOur pipeline at Photoroom (paraphrased, not exact internal numbers) does three things per product image. A vision-language model reads the input and produces structured captions. A second LLM call rewrites the user's prompt into something the diffusion model behaves well with. Then SDXL with our internal LoRAs does the actual generation on our own A100s.\n\nThe diffusion step is what we obsess over. To be precise, it is what we benchmark and profile every sprint. So when we looked at the Q1 numbers, the surprise was that Claude and Gemini Vision together cost more than the GPU lease for the same workload. The VLM and prompt-rewrite layer was 58% of total inference spend.\n\nThe nuance here is that we had been calling the providers directly from a Python service with no caching. Same product image, same user request. The response paid for again.\n\nI looked at LiteLLM and Portkey first. Both are good. 
LiteLLM is the path of least resistance if you want a Python library inside an existing FastAPI service, and its provider coverage is excellent. Portkey has a polished hosted UX and very clean dashboarding.\n\nWe landed on Bifrost for three reasons specific to our setup. It runs as a Go binary, which means the gateway isn't competing for the same GIL-bound CPU as our inference service. Semantic caching is built in rather than an add-on. The OpenAI-compatible endpoint meant we didn't need to change any of our SDK calls, [as documented here](https://docs.getbifrost.ai/features/drop-in-replacement).\n\nHonest comparison. LiteLLM has a larger Python ecosystem footprint and its routing config will feel more native if your stack is Python-first. Portkey's analytics UI is, frankly, prettier than what we get out of the box.\n\nBifrost runs as a sidecar next to the prompt-rewrite service. Both the captioning and rewrite calls now go through `http://bifrost:8080/v1/chat/completions`. Our config is small.\n\n```\nproviders:\n  anthropic:\n    keys:\n      - value: env.ANTHROPIC_KEY_PRIMARY\n      - value: env.ANTHROPIC_KEY_BACKUP\n  google_vertex:\n    keys:\n      - value: env.VERTEX_KEY\n\nsemantic_cache:\n  enabled: true\n  similarity_threshold: 0.94\n  ttl_seconds: 86400\n\nfallbacks:\n  - primary: anthropic/claude-3-5-sonnet\n    fallback:\n      - google_vertex/gemini-1.5-pro\n\ngovernance:\n  virtual_keys:\n    - id: vk_caption_team\n      budget_usd_monthly: 800\n    - id: vk_rewrite_team\n      budget_usd_monthly: 400\n```\n\nThree things matter here. The cache threshold of 0.94 was tuned against a held-out set of 5,000 captioning calls. At 0.97 we missed too many obvious duplicates. At 0.90 we started returning captions that were close but wrong about colour, which is unforgivable for an e-commerce use case. The fallback list isn't theatre. 
We measured Anthropic 5xx rates of 0.4% over March, which at our volume means real customer-visible failures.\n\n| Metric | Before | After |\n|---|---|---|\n| Cache hit rate (caption) | 0% | 71% |\n| Cache hit rate (rewrite) | 0% | 49% |\n| p95 latency, caption step | 1.8s | 0.31s |\n| Monthly VLM+LLM spend | baseline | -62% |\n| Provider failover events handled | 0 (we returned an error) | 14 |\n\nThe rewrite step caches less well because user prompts vary more. Captioning is the big win, because product photos from the same merchant cluster heavily in embedding space. Roughly 70% of merchants in our top tier upload 80% of their catalogue images within a 90-day window.\n\nThe migration was unromantic. Two lines.\n\n```\nimport os\nimport openai\n\n# Only base_url and api_key changed: the client now points at the Bifrost gateway.\nclient = openai.OpenAI(\n    base_url=\"http://bifrost:8080/v1\",\n    api_key=os.environ[\"BIFROST_VIRTUAL_KEY\"],\n)\n```\n\nEverything else stayed. The VLM team didn't touch their code. The rewrite team flipped a config flag.\n\nSemantic caching has a real failure mode. If your downstream model output is meant to vary across calls (creative generation, sampling-heavy use cases) you don't want this on. We disable it for the diffusion-prompt-suggestion endpoint that gives editors three variants. The cache would happily return the same triplet twice.\n\nThe Go binary is one more service to operate. For a small team this is non-trivial. LiteLLM-as-a-library has fewer moving parts if you don't need the cache.\n\nCost attribution through virtual keys is per-key, not per-end-customer-of-our-customers. If you need full multi-tenant chargeback down to the merchant level, you will write some glue.\n\nThe semantic cache uses an embedding model itself. 
Read [the docs on what backs it](https://docs.getbifrost.ai/features/semantic-caching) before you assume your prompts stay inside your VPC.", "url": "https://wpnews.pro/news/semantic-caching-the-vlm-step-in-our-product-photo-pipeline", "canonical_source": "https://dev.to/elise_moreau/semantic-caching-the-vlm-step-in-our-product-photo-pipeline-5ahj", "published_at": "2026-05-27 14:52:40+00:00", "updated_at": "2026-05-27 15:12:25.519533+00:00", "lang": "en", "topics": ["generative-ai", "large-language-models", "computer-vision", "ai-infrastructure", "ai-products"], "entities": ["Photoroom", "Claude", "Gemini Vision", "LiteLLM", "Portkey", "Bifrost", "SDXL", "A100"], "alternates": {"html": "https://wpnews.pro/news/semantic-caching-the-vlm-step-in-our-product-photo-pipeline", "markdown": "https://wpnews.pro/news/semantic-caching-the-vlm-step-in-our-product-photo-pipeline.md", "text": "https://wpnews.pro/news/semantic-caching-the-vlm-step-in-our-product-photo-pipeline.txt", "jsonld": "https://wpnews.pro/news/semantic-caching-the-vlm-step-in-our-product-photo-pipeline.jsonld"}}