{"slug": "what-does-it-cost-to-process-an-image-with-a-vision-model", "title": "What does it cost to process an image with a vision model?", "summary": "Processing a single image through a vision-language model can cost anywhere from a fraction of a cent to several cents depending on the provider and image size, with GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro each using different tokenization rules that cause the same JPEG to consume 87 tokens on one platform and 6,636 on another. OpenAI charges $5.00 per million input tokens for GPT-5.5, Anthropic charges the same rate for Claude Opus 4.7 but uses an area-based formula that can produce up to 35% more tokens than prior versions, and Google's Gemini 3.1 Pro charges $2.00 per million tokens with a flat 258-token fee for small images and tile-based pricing for larger ones. The cost disparity matters for high-volume applications like inspection lines or content moderation, where processing one million images can vary by thousands of dollars across providers.", "body_md": "A reproducible breakdown of GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro. Updated May 2026.\n\n## Why VLM pricing is harder than LLM pricing\n\nEstimating the cost of an LLM call is mostly arithmetic. Count the input tokens, count the output tokens, multiply by the rate card, done. [Vision-language models](https://blog.roboflow.com/what-is-a-vision-language-model/) break that habit. The same JPEG can become 87 tokens on one provider and 6,636 on another, before the model has generated a single word of output. If you are sizing a workload, the question of how much it costs to process an image only has an answer once you specify the image, the provider, and what you want back.\n\nThis piece walks through the cost equation, the per-provider tokenization rules as of May 2026, and a worked grid across five image sizes. 
The goal is to give you something you can plug your own numbers into.\n\n## The VLM cost equation\n\nCost per image = (image input tokens + text input tokens) × input price + output tokens × output price\n\nEvery term but one behaves like normal LLM math. The exception, image input tokens, is where the providers diverge. The rest of this post focuses there, because that is the hardest part of building a budget.\n\nFor the comparisons below, we hold text input and output constant (a 100-token instruction, a 500-token JSON response) and vary the image. That isolates the variable that vision pricing actually depends on.\n\n## How each provider turns pixels into tokens\n\n### OpenAI GPT-5.5\n\nGPT-5.5 uses patch-based image tokenization. Images are covered by 32 by 32 pixel patches, and the image token count is based on the number of patches after any model resizing. In `high` detail mode, GPT-5.5 allows up to 2,500 patches or a 2,048-pixel maximum dimension. If either limit is exceeded, the image is resized while preserving aspect ratio.\n\nIn `original` detail mode, GPT-5.5 allows up to 10,000 patches or a 6,000-pixel maximum dimension. One important gotcha: on GPT-5.5, an omitted `detail` and `auto` both behave like `original`, not `high`. For the comparison grid below, we use `detail: \"high\"`.\n\nInput price: $5.00 per million tokens for GPT-5.5 standard input.\n\n### Anthropic Claude Opus 4.7\n\nAnthropic uses an area-based formula. Image tokens approximate (width × height) / 750. The long edge is capped at 2,576 pixels in Opus 4.7, up from 1,568 in prior Claude models. Anything larger gets resized down before tokenization.\n\nThere is one wrinkle worth knowing about. Opus 4.7 ships with a new tokenizer that produces 1.0x to 1.35x as many tokens for the same input as Opus 4.6. 
Image tokens are affected too, so a phone photo that cost X on Opus 4.6 can cost noticeably more on Opus 4.7 even at the same nominal price per token.\n\nInput price: $5.00 per million tokens.\n\n### Google Gemini 3.1 Pro\n\nGemini has the simplest rule. Images where both dimensions are 384 pixels or smaller cost a flat 258 tokens. Anything larger is cropped and scaled as needed into 768 by 768 tiles, and each tile costs 258 tokens.\n\nInput price: $2.00 per million tokens (standard context). The lower per-token price partially offsets the higher tile count on big images.\n\n## VLM pricing comparison grid\n\nFive representative image sizes, run through each provider's rule. Image input tokens only.\n\nTranslating to dollars at current input prices:\n\nThe same grid at one million images, roughly the volume of an inspection line, content moderation pipeline, or document processing workload:\n\nThese numbers are image-input only. Add 100 input tokens for the instruction and 500 output tokens for a JSON response and the total per call goes up by roughly $0.0130 on Claude, $0.0155 on GPT-5.5, and $0.0062 on Gemini, depending on output rates. For binary classification (one-token outputs), output cost is negligible. For long-form analysis (2,000+ output tokens), output cost can dominate the image cost entirely.\n\n## Key takeaways from comparing VLMs\n\nA few things that matter when you turn this into a budget.\n\nThe same image can produce very different token counts across providers. A phone photo is about 2,451 image tokens on GPT-5.5, 6,636 on Claude, and 6,192 on Gemini. That is a 2.7x spread between GPT-5.5 and Claude before output tokens.\n\nThose differences come from tokenization rules, not just price. GPT-5.5 uses patch-based accounting with a patch budget in `high` detail mode. Claude uses an area-based formula after resizing. 
Gemini uses fixed-cost image tiles.\n\nGPT-5.5 is capped in `high` detail mode, so large images tend to cluster in the low thousands of tokens rather than growing indefinitely. If you use `original` or leave `detail` on default/`auto`, GPT-5.5 token counts can be much higher.\n\nThe cheapest provider depends on the image. Claude wins on tiny images. Gemini wins on several medium and large rows. GPT-5.5 is competitive on large natural images and much cheaper than Claude there.\n\nOutput tokens can change the ranking. This grid is image-input only; long JSON responses or detailed reports can dominate total cost.\n\n## Generality becomes a tax at production scale\n\nFrontier VLMs are the right tool when you need general reasoning over an image, when prompt iteration matters more than per-call cost, or when volumes are low enough that an extra cent per image is invisible. A few thousand calls a day, a few cents each, is fine.\n\nThe math changes at scale. A factory inspection line running at 30 frames per second on three cameras produces about 7.8 million images a day (30 × 3 × 86,400 seconds). At about $0.002 per image, roughly the cheapest web-resolution cell in the grid above, that is $15,600 per day, every day, for one line. Add output tokens, retries, and a redundant model for cross-checking, and the number can easily double.\n\nAt that volume, generality becomes a tax. Most production vision workloads do not need a model that can also write poetry; they need a model that runs a specific task quickly and cheaply on specific hardware.\n\nThis is the gap that purpose-built vision models fill. A fine-tuned [RF-DETR](https://rfdetr.roboflow.com/latest/?ref=blog.roboflow.com) running on an edge GPU can do object detection at sub-millisecond latency for a fraction of a cent per frame, and it does not pay for tokens at all. [Roboflow](https://roboflow.com/?ref=blog.roboflow.com) exists because at production scale the right answer is usually not an API call to a frontier VLM. 
It is a smaller, specialized model trained on your data and deployed where the cameras actually are.\n\nThe frontier VLMs still have a role in that pipeline. They are useful for bootstrapping labels, handling the long tail of edge cases, and debugging failure modes. The point is not to pick one tool. It is to know where each tool earns its keep, which starts with knowing what each one actually costs.\n\n## VLM cost calculator\n\nThe formulas are stable enough to put in a spreadsheet. If you want to plug in your own image distribution and instruction lengths, the rules above are everything you need. The Python below reproduces the numbers in this post.\n\n```python\nimport math\n\ndef gpt55_tokens(w, h, detail=\"high\"):\n    \"\"\"\n    Approximate GPT-5.5 image input tokens.\n\n    GPT-5.5 uses 32x32 patch-based image tokenization.\n    For cost-controlled workloads, explicitly set detail=\"high\";\n    GPT-5.5 default/auto behaves like \"original\".\n    \"\"\"\n    if detail == \"low\":\n        # Low detail receives a 512x512 version of the image.\n        # 512 / 32 = 16 patches per side.\n        return 16 * 16\n\n    if detail == \"high\":\n        patch_budget = 2500\n        max_dim = 2048\n    elif detail in (\"original\", \"auto\"):\n        patch_budget = 10000\n        max_dim = 6000\n    else:\n        raise ValueError(\"detail must be 'low', 'high', 'original', or 'auto'\")\n\n    # Constraint 1: maximum dimension.\n    dim_scale = min(1.0, max_dim / max(w, h))\n\n    # Constraint 2: patch budget.\n    original_patches = math.ceil(w / 32) * math.ceil(h / 32)\n\n    if original_patches <= patch_budget:\n        patch_scale = 1.0\n    else:\n        shrink = math.sqrt((32**2 * patch_budget) / (w * h))\n\n        # OpenAI's docs describe an adjustment so the integer resized dimensions\n        # remain within the patch budget after 32px patch rounding.\n        patch_scale = shrink * min(\n            math.floor(w * shrink / 32) / (w * shrink / 32),\n            math.floor(h * shrink / 32) / (h * shrink / 32),\n        )\n\n    scale = min(dim_scale, patch_scale)\n\n    resized_w = math.floor(w * scale)\n    resized_h = math.floor(h * scale)\n\n    return math.ceil(resized_w / 32) * math.ceil(resized_h / 32)\n\ndef claude_tokens(w, h, max_long=2576):\n    # Resize so the long edge fits max_long, then divide the pixel area by 750.\n    if max(w, h) > max_long:\n        s = max_long / max(w, h)\n        w, h = w * s, h * s\n    return round(w * h / 750)\n\ndef gemini_tokens(w, h):\n    # Flat 258 tokens for small images, otherwise 258 per 768x768 tile.\n    if w <= 384 and h <= 384:\n        return 258\n    return 258 * math.ceil(w / 768) * math.ceil(h / 768)\n```\n\n## Sources for Vision Token Counts\n\nOpenAI vision and pricing: [Images and vision guide](https://platform.openai.com/docs/guides/images-vision?ref=blog.roboflow.com), [API pricing](https://openai.com/api/pricing/?ref=blog.roboflow.com).\n\nAnthropic Claude vision and pricing: [Vision docs](https://docs.claude.com/en/docs/build-with-claude/vision?ref=blog.roboflow.com), [pricing](https://platform.claude.com/docs/en/about-claude/pricing?ref=blog.roboflow.com), [Opus 4.7 announcement](https://www.anthropic.com/news/claude-opus-4-7?ref=blog.roboflow.com).\n\nGoogle Gemini image understanding and pricing: [Image understanding](https://ai.google.dev/gemini-api/docs/image-understanding?ref=blog.roboflow.com), [tokens guide](https://ai.google.dev/gemini-api/docs/tokens?ref=blog.roboflow.com), [pricing](https://ai.google.dev/gemini-api/docs/pricing?ref=blog.roboflow.com).\n\n**Cite this Post**\n\nUse the following entry to cite this post in your research:\n\n[Trevor Lynn](/author/trevor/). (May 4, 2026).\nVision Token Counts: What does it cost to process an image with a frontier vision model? 
Roboflow Blog: https://blog.roboflow.com/image-token-cost-vlm/", "url": "https://wpnews.pro/news/what-does-it-cost-to-process-an-image-with-a-vision-model", "canonical_source": "https://blog.roboflow.com/image-token-cost-vlm/", "published_at": "2026-05-29 11:51:38+00:00", "updated_at": "2026-05-29 12:19:05.773175+00:00", "lang": "en", "topics": ["computer-vision", "large-language-models", "ai-infrastructure"], "entities": ["GPT-5.5", "Claude Opus 4.7", "Gemini 3.1 Pro", "OpenAI", "Roboflow"], "alternates": {"html": "https://wpnews.pro/news/what-does-it-cost-to-process-an-image-with-a-vision-model", "markdown": "https://wpnews.pro/news/what-does-it-cost-to-process-an-image-with-a-vision-model.md", "text": "https://wpnews.pro/news/what-does-it-cost-to-process-an-image-with-a-vision-model.txt", "jsonld": "https://wpnews.pro/news/what-does-it-cost-to-process-an-image-with-a-vision-model.jsonld"}}