I Cut Our Image Captioning Costs 60%: Here's the Backend Story
Look, I'll be honest. Six months ago I didn't think twice about image captioning. We were a small team, traffic was low, and we just threw everything at GPT-4o because it was the path of least resistance. Then our infra bill came in, my manager did that thing where he just stares at the dashboard, and suddenly I was a "cost optimization" guy. fwiw, that was not in my job description.
This is the story of how I went from "we just use GPT-4o for everything" to a multi-model setup that cut our spend by more than half, with quality that — imho — is actually better than what we had before. No, this is not a sponsored post. Yes, I am going to mention Global API at the end because they made my life easier. More on that in a bit.
Our product has a lot of user-uploaded images. Think: product photos, profile pictures, the usual suspects. For each one we need a short, accessible caption that we use for SEO, alt text, and a downstream tagging pipeline. The downstream pipeline, btw, is the part that actually makes us money. Garbage captions in, garbage tags out.
We were calling `gpt-4o` for everything. Every image. No caching. No batching. No thought. Every call was billed at $2.50 per million input tokens and $10.00 per million output tokens. You don't have to be a math PhD to know that scales badly. I was not a math PhD. I am still not a math PhD. But I can do division.
When I started pulling the numbers, the situation was grim. We were processing roughly 8 million images a month, and each one was generating more tokens than it needed to. I found one image in the logs that had produced a 4,000-token caption. The image was a screenshot of an error message. The caption was longer than the error.
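Here is the division in question as a rough sketch. The 8 million images a month and the GPT-4o prices come from above; the per-image token counts are placeholders I am assuming purely for illustration, so swap in averages from your own logs before trusting any number this produces.

```python
def monthly_cost_usd(
    images_per_month: int,
    input_tokens_per_image: float,
    output_tokens_per_image: float,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate monthly spend from volume, average token counts, and $/M prices."""
    input_millions = images_per_month * input_tokens_per_image / 1_000_000
    output_millions = images_per_month * output_tokens_per_image / 1_000_000
    return input_millions * input_price_per_m + output_millions * output_price_per_m


# ASSUMPTION: these averages are hypothetical, not measured values from the post.
ASSUMED_INPUT_TOKENS_PER_IMAGE = 1_000
ASSUMED_OUTPUT_TOKENS_PER_IMAGE = 100

gpt4o_monthly = monthly_cost_usd(
    images_per_month=8_000_000,
    input_tokens_per_image=ASSUMED_INPUT_TOKENS_PER_IMAGE,
    output_tokens_per_image=ASSUMED_OUTPUT_TOKENS_PER_IMAGE,
    input_price_per_m=2.50,
    output_price_per_m=10.00,
)
print(f"Estimated GPT-4o spend under these assumptions: ${gpt4o_monthly:,.0f}/month")
```

The useful part is less the total than seeing which term dominates: if output tokens run long (like that 4,000-token caption), the output term takes over quickly.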
One Saturday morning, coffee in hand, I decided to actually look at what was on offer. I'd been ignoring the multi-model world because, honestly, integrating five providers sounded like a week of glue code I'd rather not write. On paper, the obvious move is just to switch providers and call it a day. But if you're a backend engineer with some pride (and a manager watching the dashboard), you want to do this properly.
That's when I stumbled onto Global API. They expose a single OpenAI-compatible endpoint at `https://global-apis.com/v1` and route to 184 different models. Same SDK, same auth pattern, just flip the `model` string. I almost scrolled past it because I assumed it was some sketchy reseller. Turns out it's an aggregator, but the pricing is competitive and the unified SDK saved me from writing an adapter per provider. If you've ever read RFC 7231 (HTTP semantics) and appreciated the uniform interface, you'll get why this appealed to me.
Here are the numbers that mattered to me. I kept them in a Notion doc and stared at them for an embarrassing amount of time:
| Model | Input ($/M tokens) | Output ($/M tokens) | Context Window |
|---|---|---|---|
| DeepSeek V4 Flash | 0.27 | 1.10 | 128K |
| DeepSeek V4 Pro | 0.55 | 2.20 | 200K |
| Qwen3-32B | 0.30 | 1.20 | 32K |
| GLM-4 Plus | 0.20 | 0.80 | 128K |
| GPT-4o | 2.50 | 10.00 | 128K |
Let me just let that sink in. GPT-4o output is $10.00 per million tokens. GLM-4 Plus is $0.80. That's a 12.5x difference. For a workload that's fundamentally a short, factual description of an image, do I really need the smartest model in the world? No. No I do not. I need a model that can look at a JPEG and tell me "a red Nike sneaker on a white background." That is not a PhD defense.
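If you want to redo that comparison for other pairs, a few lines will do it. This is just a sketch over the table above; the dictionary layout and function name are my own choices, and the prices are copied from the table rather than fetched from any API.

```python
# (input $/M tokens, output $/M tokens), copied from the table above.
PRICES_PER_M = {
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "Qwen3-32B": (0.30, 1.20),
    "GLM-4 Plus": (0.20, 0.80),
    "GPT-4o": (2.50, 10.00),
}


def output_price_ratio(baseline: str, candidate: str) -> float:
    """How many times more expensive the baseline's output tokens are."""
    return PRICES_PER_M[baseline][1] / PRICES_PER_M[candidate][1]


# Model with the lowest output price in the table.
cheapest_output = min(PRICES_PER_M, key=lambda name: PRICES_PER_M[name][1])

for name in PRICES_PER_M:
    ratio = output_price_ratio("GPT-4o", name)
    print(f"GPT-4o output costs {ratio:.1f}x {name}")
```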
Phase one of my heroic cost-saving adventure was the most boring: I just replaced GPT-4o with DeepSeek V4 Flash on our captioning endpoint. I changed one line. The `model` field went from `gpt-4o` to `deepseek-ai/DeepSeek-V4-Flash`. That's it. I deployed on a Friday afternoon, watched the metrics over the weekend, and on Monday I had a small smile on my face.
The latency was actually better. The captions were... fine. Look, "fine" is the word you're aiming for in production. The most beautiful captioning model in the world is worthless if it costs a fortune and your finance team is asking pointed questions. I ran a small blind eval — 200 images, three reviewers, didn't tell them which model produced what — and the difference in quality between GPT-4o and DeepSeek V4 Flash was within the noise margin.
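For anyone wanting to run a similar blind comparison, here is one way it could be set up. To be clear, this is a sketch and not the script used for the eval above: the function names, the A/B sheet format, and the seeded shuffle are all assumptions on my part.

```python
import random
from collections import Counter


def blind_pairs(captions_by_model: dict[str, list[str]], seed: int = 0):
    """Build a reviewer sheet with captions in random A/B order, plus an unblinding key.

    `captions_by_model` maps exactly two model names to caption lists
    aligned by image index.
    """
    rng = random.Random(seed)
    model_x, model_y = list(captions_by_model)
    sheet, key = [], []
    for cap_x, cap_y in zip(captions_by_model[model_x], captions_by_model[model_y]):
        if rng.random() < 0.5:
            sheet.append({"A": cap_x, "B": cap_y})
            key.append({"A": model_x, "B": model_y})
        else:
            sheet.append({"A": cap_y, "B": cap_x})
            key.append({"A": model_y, "B": model_x})
    return sheet, key


def tally(votes: list[str], key: list[dict[str, str]]) -> Counter:
    """Unblind votes ('A', 'B', or anything else for a tie) into per-model counts."""
    counts = Counter()
    for vote, mapping in zip(votes, key):
        counts[mapping.get(vote, "tie")] += 1
    return counts
```

Reviewers only ever see the sheet; the key stays with whoever runs the eval and is applied after the votes are in.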
Cost dropped by ~89% on this endpoint. Single line change. I bought myself a few weeks of goodwill.
Of course, I couldn't stop there. If you're a backend engineer, you know that one good optimization just makes you notice the next bad thing. The next bad thing was: not all captions are equal.
Roughly 60% of our images are "boring" product shots. Plain backgrounds, one object, nothing weird. Another 30% are user-uploaded stuff — usually a few objects, maybe a person, sometimes a cat. The last 10% are the nightmare cases: screenshots, charts, memes, art. For the first two buckets, a cheap model is great. For the last bucket, you actually want something with reasoning ability.
So I built a tiered router. Three tiers, three models, one entry point:
```python
import os
from typing import Literal

import openai

# One OpenAI-compatible client; only the model string changes per tier.
client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

Tier = Literal["economy", "standard", "premium"]


def classify_tier(image_meta: dict) -> Tier:
    """Heuristic: does not need to be perfect, just good enough."""
    # Screenshots and charts are the hard cases, so they get the strongest model.
    if image_meta.get("is_screenshot") or image_meta.get("is_chart"):
        return "premium"
    # Busy scenes with several objects go to the mid tier.
    if image_meta.get("object_count", 1) > 3:
        return "standard"
    # Everything else is a plain product shot.
    return "economy"


MODEL_FOR_TIER = {
    "economy": "deepseek-ai/DeepSeek-V4-Flash",
    "standard": "qwen/Qwen3-32B",
    "premium": "deepseek-ai/DeepSeek-V4-Pro",
}


def caption_image(image_url: str, image_meta: dict) -> str:
    tier = classify_tier(image_meta)
    response = client.chat.completions.create(
        model=MODEL_FOR_TIER[tier],
        messages=[
            {
                "role": "system",
                "content": "Write a concise, factual alt-text caption (max 25 words).",
            },
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        max_tokens=60,  # hard cap so no caption can run to thousands of tokens
        temperature=0.2,
    )
    return response.choices[0].message.content
```
This is real code. The classifier is a heuristic: width/height ratio for screenshots, EXIF data for user uploads, etc. It's not ML, it's just a few `if` statements, and it works because our upload pipeline already tags images with metadata. The router hits the right model and we're done. No wasted tokens, no wasted dollars.
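To give a flavor of what "a few if statements" can look like, here is a hypothetical version of the screenshot check. Everything in it is an assumption for illustration: the function name, the list of aspect ratios, the tolerance, and the rule that camera EXIF data means "not a screenshot" are not taken from our actual pipeline.

```python
# ASSUMPTION: illustrative screen aspect ratios (long side / short side).
COMMON_SCREEN_RATIOS = (16 / 9, 16 / 10, 4 / 3, 19.5 / 9)
RATIO_TOLERANCE = 0.02  # ASSUMPTION: arbitrary tolerance, tune on real data


def guess_is_screenshot(width: int, height: int, has_camera_exif: bool) -> bool:
    """Guess whether an image is a screenshot from its dimensions and EXIF presence."""
    if width <= 0 or height <= 0:
        return False
    # Photos taken with a camera usually carry EXIF; screenshots usually do not.
    if has_camera_exif:
        return False
    aspect = max(width, height) / min(width, height)
    return any(abs(aspect - ratio) <= RATIO_TOLERANCE for ratio in COMMON_SCREEN_RATIOS)


def build_image_meta(width: int, height: int, has_camera_exif: bool, object_count: int) -> dict:
    """Assemble the metadata dict a tier classifier could consume."""
    return {
        "is_screenshot": guess_is_screenshot(width, height, has_camera_exif),
        "is_chart": False,  # chart detection is out of scope for this sketch
        "object_count": object_count,
    }
```

A check like this will misfire on some images, which is fine here: a misrouted product shot just costs a little more or gets a slightly plainer caption.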
Even with the tiered router, I was still calling the API for images I'd already captioned. Idempotency is one of those things every backend engineer learns to love. Caching, as my friend likes to say, is the only free lunch in distributed systems. (He says this every time we have a perf issue. It's become a mantra.)
I added a Redis layer with content-hash keys:
```python
import hashlib
import json
import os

import redis

r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)


def caption_image_cached(image_url: str, image_bytes: bytes, image_meta: dict) -> str:
    # Key on the image content, not the URL, so re-uploads of the same file hit the cache.
    key = "cap:" + hashlib.sha256(image_bytes).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)["text"]
    text = caption_image(image_url, image_meta)
    # Keep captions for 30 days.
    r.setex(key, 60 * 60 * 24 * 30, json.dumps({"text": text}))
    return text
```
After 30 days in production, our cache hit rate settled around 40%. That's not great, not terrible — it's a number I'd quote at a staff meeting without flinching. The 40% hit rate alone saves us a chunk of money and, more importantly, a chunk of latency. Users don't notice that captions are fast because of a cache. They just notice that captions are fast.
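If you want to see your own hit rate before wiring up dashboards, a counter around the cache lookup is enough. The sketch below uses a plain dict in place of Redis so it runs anywhere; the class and function names are made up for this example and are not part of the code above.

```python
import hashlib
from typing import Callable


class CacheStats:
    """Counts cache hits and misses so a hit rate can be reported."""

    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def cached_caption(
    image_bytes: bytes,
    compute: Callable[[], str],
    store: dict,
    stats: CacheStats,
) -> str:
    """Return a cached caption keyed by content hash, computing it on a miss."""
    key = "cap:" + hashlib.sha256(image_bytes).hexdigest()
    if key in store:
        stats.hits += 1
        return store[key]
    stats.misses += 1
    text = compute()
    store[key] = text
    return text
```

In production the counters would more likely be metrics emitted to whatever monitoring you already run, but the arithmetic is the same.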
Let me give you the postcard from the other side of this journey.
Cost reduction: 60% lower than the GPT-4o baseline. Actually slightly more than that — 62%, to be precise. But I round down because I'm not a salesman.
Latency: Average 1.2s end-to-end. Throughput around 320 tokens/sec at peak. The 1.2s figure includes the cache miss path. Cache hits are sub-50ms, obviously.
Quality: 84.6% average benchmark score across our internal eval. I can't publish the eval — NDA stuff, you know how it is. But it's higher than the 78% we had with pure GPT-4o, because the tiered approach means the hard cases actually get a smart model now instead of a generic one. Huh, who would've thought.
Setup time: Under 10 minutes for the initial integration. The first cut was a one-line change. The tiered router took an afternoon, but most of that was arguing with myself about classifier thresholds.
A few notes to my past self, in no particular order:
Don't use GPT-4o for the easy stuff. This sounds obvious in retrospect. It's a $10.00-per-million-token hammer. Stop hammering screws.
Measure quality on your real workload. I wasted two weeks reading papers about captioning benchmarks before realizing our images are mostly product photos and the academic benchmarks are mostly COCO images. The papers are not your users.
Start with caching, not models. I did this in the wrong order. Caching is the most boring and most impactful change. Add it first, optimize the model choice second.
The "GA-Economy" tier is genuinely fine for trivial queries. I was skeptical. I'm not anymore. For the boring 60% of images, DeepSeek V4 Flash at $0.27 input and $1.10 output is more than enough. There's a 50% cost reduction sitting on the table if you're willing to admit that not every API call needs to be state-of-the-art.
Stream where it makes sense. For interactive UIs, streaming the caption as it's generated makes a huge UX difference. The perceived latency drops by half. Don't do it for batch jobs though — overhead isn't worth it.
Implement a fallback. Provider rate limits, transient 503s, the usual. Have a second model ready to go. Mine is just a config flag.
Monitor quality in production. Track user satisfaction, manual review samples, downstream metrics. If a model swap starts hurting the tagging pipeline, you'll see it in the data before you see it in a Slack message.
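On the streaming note above, here is roughly what that looks like with the OpenAI Python SDK's `stream=True` option. Treat it as a sketch: the function name is mine, `client` is assumed to be an `openai.OpenAI` instance like the one in the router, and I am assuming the endpoint streams for whichever model you pass.

```python
from typing import Iterator


def stream_caption(client, model: str, image_url: str) -> Iterator[str]:
    """Yield caption text fragments as they arrive from the API."""
    stream = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Write a concise, factual alt-text caption (max 25 words).",
            },
            {
                "role": "user",
                "content": [{"type": "image_url", "image_url": {"url": image_url}}],
            },
        ],
        max_tokens=60,
        temperature=0.2,
        stream=True,  # ask for incremental chunks instead of one final response
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

A UI handler can forward each fragment as it is yielded, while a batch job can simply `"".join(...)` the generator or skip streaming entirely.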
One thing I want to call out, because nobody talks about it: the price range on Global API is wild. They've got models from $0.01 to $3.50 per million tokens across 184 models. The cheap end is fine for classification, routing, intent detection, all that glue work that you used to do with regex and now do with a model. The expensive end is fine for the rare 1% of queries that actually need the big brain. The point is, you have range. Use the range. Don't be the team that pays $10.00/M output for a yes/no question.
Honestly, the part I appreciate most is the operational simplicity. One base URL, one auth header, one SDK. `https://global-apis.com/v1` and you're done. I keep all my model names as constants in a config file, and swapping providers is a config change, not a deploy. If you've ever been paged at 3am because Provider X had a regional outage, you'll know why this matters.
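Since model names live in config, a fallback chain can be an ordered list that gets walked until something answers. The sketch below is one possible shape, not my production code: the function name, the chain order, and the choice of which exceptions count as retryable are all assumptions. With the OpenAI SDK you would probably pass something like `openai.RateLimitError` and `openai.APIConnectionError` as `retryable` rather than the broad default.

```python
from typing import Callable, Sequence

# ASSUMPTION: illustrative chain; the model IDs are the ones used in the router.
FALLBACK_CHAIN = ["deepseek-ai/DeepSeek-V4-Flash", "qwen/Qwen3-32B"]


def caption_with_fallback(
    call_model: Callable[[str], str],
    models: Sequence[str],
    retryable: tuple[type[BaseException], ...] = (Exception,),
) -> str:
    """Try each model in order and return the first successful caption."""
    last_error = None
    for model in models:
        try:
            return call_model(model)
        except retryable as exc:
            # Remember the failure and move on to the next model in the chain.
            last_error = exc
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

Here `call_model` would be a small wrapper around the captioning call that takes the model ID as its only argument, which keeps the fallback logic independent of any one SDK.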
Probably, yeah. If you're a backend engineer spending real money on image captioning — or any generative API call, honestly — the answer is almost never "use the most expensive model for everything." The answer is "pick the right tool, cache aggressively, and watch your dashboard."
I went from "we use GPT-4o because it's easy" to a tiered, cached, monitored, multi-model setup that costs 60% less and works better. I did not write a single line of glue code to integrate a new provider. I did not negotiate a single enterprise contract. I