What does it cost to process an image with a vision model?


A reproducible breakdown of GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro. Updated May 2026.

Why VLM pricing is harder than LLM pricing

Estimating the cost of an LLM call is mostly arithmetic. Count the input tokens, count the output tokens, multiply by the rate card, done. Vision-language models break that habit. The same JPEG can become 87 tokens on one provider and 6,636 on another, before the model has generated a single word of output. If you are sizing a workload, the question of how much it costs to process an image only has an answer once you specify the image, the provider, and what you want back.

This piece walks through the cost equation, the per-provider tokenization rules as of May 2026, and a worked grid across five image sizes. The goal is to give you something you can plug your own numbers into.

The VLM cost equation

Cost per image = (image input tokens + text input tokens) × input price + output tokens × output price

Three of those four terms behave like normal LLM math. The fourth, image input tokens, is where the providers diverge. The rest of this post focuses there, because it is the hardest term to budget for.

For the comparisons below, we hold text input and output constant (a 100-token instruction, a 500-token JSON response) and vary the image. That isolates the variable that vision pricing actually depends on.
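As a minimal sketch, the equation can be written as a small Python function. The function name and the dollars-per-million-tokens convention are illustrative choices, not part of any provider SDK. The example call uses only figures quoted in this post (a 6,636-token image at $5.00 per million input tokens) and sets the text and output terms to zero.

```python
def cost_per_image(image_tokens, text_tokens, output_tokens,
                   input_price_per_m, output_price_per_m):
    """Dollar cost of one call; prices are in dollars per million tokens."""
    input_cost = (image_tokens + text_tokens) * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost

# Image-only cost of a 6,636-token image at $5.00 per million input tokens.
image_only = cost_per_image(6636, 0, 0, 5.00, 0.0)
print(f"${image_only:.5f}")
```

Keeping the image term separate from the text term makes it easy to swap in a different tokenization rule without touching the rest of the budget.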

How each provider turns pixels into tokens

OpenAI GPT-5.5

GPT-5.5 uses patch-based image tokenization. Images are covered by 32 by 32 pixel patches, and the image token count is based on the number of patches after any model resizing. In high detail mode, GPT-5.5 allows up to 2,500 patches or a 2,048-pixel maximum dimension. If either limit is exceeded, the image is resized while preserving aspect ratio.

In original detail mode, GPT-5.5 allows up to 10,000 patches or a 6,000-pixel maximum dimension. One important gotcha: on GPT-5.5, omitted detail and auto behave like original, not high. For the comparison grid below, we use detail: "high".
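The two sets of limits can be checked before an image is sent. The sketch below is illustrative only: `will_be_resized` and the `LIMITS` table are hypothetical helpers, not part of any OpenAI SDK, and the check assumes the patch count is a simple ceiling division by 32 on each side. It reports whether a resize would be triggered, not the final token count.

```python
import math

# (max patches, max long-edge pixels) per detail mode, as described above.
LIMITS = {"high": (2500, 2048), "original": (10000, 6000)}

def will_be_resized(w, h, detail="original"):
    """True if a w x h image exceeds either limit for the given detail mode."""
    if detail == "auto":  # per the text above, omitted/auto behaves like original
        detail = "original"
    max_patches, max_dim = LIMITS[detail]
    patches = math.ceil(w / 32) * math.ceil(h / 32)
    return patches > max_patches or max(w, h) > max_dim

print(will_be_resized(1024, 1024, "high"))      # 1,024 patches, within both limits
print(will_be_resized(4032, 3024, "high"))      # 126 x 95 patches, over the budget
print(will_be_resized(4032, 3024, "original"))  # 11,970 patches, still over 10,000
```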

Input price: $5.00 per million tokens for GPT-5.5 standard input.

Anthropic Claude Opus 4.7

Anthropic uses an area-based formula. Image tokens approximate (width × height) / 750. The long edge is capped at 2,576 pixels in Opus 4.7, up from 1,568 in prior Claude models. Anything larger gets resized down before tokenization.
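To see what the higher cap means in practice, here is a rough sketch that applies the area formula under both long-edge limits. Two assumptions are baked in: the photo is 4032 × 3024 pixels (a common phone-camera resolution, chosen for illustration), and the same divide-by-750 approximation applies under the older 1,568-pixel cap. The function name is hypothetical.

```python
def claude_image_tokens(w, h, max_long_edge):
    """Approximate tokens: scale the long edge down to the cap, then area / 750."""
    long_edge = max(w, h)
    if long_edge > max_long_edge:
        scale = max_long_edge / long_edge
        w, h = w * scale, h * scale
    return round(w * h / 750)

# A 4032 x 3024 photo (assumed phone-camera resolution) under both caps.
print(claude_image_tokens(4032, 3024, max_long_edge=2576))  # resized to 2576 x 1932
print(claude_image_tokens(4032, 3024, max_long_edge=1568))  # resized to 1568 x 1176
```

Under these assumptions the larger cap yields about 6,636 tokens against about 2,459 for the same photo, roughly 2.7 times more, before any tokenizer change is taken into account.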

There is one wrinkle worth knowing about. Opus 4.7 ships with a new tokenizer that produces 1.0x to 1.35x more tokens for the same input compared to Opus 4.6. Image tokens are affected too, so a phone photo that cost X on Opus 4.6 can cost noticeably more on Opus 4.7 even at the same nominal price per token.

Input price: $5.00 per million tokens.

Google Gemini 3.1 Pro

Gemini has the simplest rule. Images where both dimensions are 384 pixels or smaller cost a flat 258 tokens. Anything larger is cropped and scaled as needed into 768 by 768 tiles, and each tile costs 258 tokens.
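The tile rule can be sketched in a few lines. This version assumes the tile grid is a plain ceiling division of each side by 768; the actual cropping and scaling may differ, so treat the output as an estimate. The function name and the example sizes are illustrative.

```python
import math

TOKENS_PER_TILE = 258

def gemini_tile_grid(w, h):
    """Return (columns, rows, tokens) under the flat-or-tiled rule above."""
    if w <= 384 and h <= 384:
        return 1, 1, TOKENS_PER_TILE  # small images cost a flat 258 tokens
    cols, rows = math.ceil(w / 768), math.ceil(h / 768)
    return cols, rows, cols * rows * TOKENS_PER_TILE

for size in [(384, 384), (1024, 768), (4032, 3024)]:
    print(size, gemini_tile_grid(*size))
```

Because cost moves in whole tiles, an image just over a tile boundary pays for the full extra tile: 1024 × 768 needs two tiles even though it is only a third wider than one.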

Input price: $2.00 per million tokens (standard context). The lower per-token price partially offsets the higher tile count on big images.

VLM pricing comparison grid

Five representative image sizes, run through each provider's rule. Image input tokens only.

Translating to dollars at current input prices:

The same grid at one million images, roughly the scale of real-world workloads such as an inspection line, a content moderation pipeline, or document processing:

These numbers are image-input only. Add 100 input tokens for the instruction and 500 output tokens for a JSON response and the total per call goes up by roughly $0.0130 on Claude, $0.0155 on GPT-5.5, and $0.0062 on Gemini, depending on output rates. For binary classification (one-token outputs), output cost is negligible. For long-form analysis (2,000+ output tokens), output cost can dominate the image cost entirely.
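The text overhead is straightforward to reproduce. In the sketch below the input prices come from this post, while the output prices ($25, $30, and $12 per million tokens) are assumptions back-solved from the per-call figures above rather than read from a rate card, so check them against current pricing before relying on them.

```python
# (input $/M tokens from this post, output $/M tokens ASSUMED by back-solving)
RATES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}

def text_overhead(input_price, output_price, text_in=100, text_out=500):
    """Dollar cost of the instruction and response, excluding the image."""
    return (text_in * input_price + text_out * output_price) / 1_000_000

for name, (inp, out) in RATES.items():
    print(f"{name}: ${text_overhead(inp, out):.4f}")
```

Changing `text_out` to 1 or to 2,000 shows how quickly the output term shrinks to nothing or overtakes the image cost.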

Key takeaways from comparing VLMs

A few things that matter when you turn this into a budget.

The same image can produce very different token counts across providers. A phone photo is about 2,451 image tokens on GPT-5.5, 6,636 on Claude, and 6,192 on Gemini. That is a 2.7x spread between GPT-5.5 and Claude before output tokens.

Those differences come from tokenization rules, not just price. GPT-5.5 uses patch-based accounting with a patch budget in high detail mode. Claude uses an area-based formula after resizing. Gemini uses fixed-cost image tiles.

GPT-5.5 is capped in high detail mode, so large images tend to cluster in the low thousands of tokens rather than growing indefinitely. If you use original or leave detail on default/auto, GPT-5.5 token counts can be much higher.

The cheapest provider depends on the image. Claude wins on tiny images. Gemini wins on several medium and large rows. GPT-5.5 is competitive on large natural images and much cheaper than Claude there.

Output tokens can change the ranking. This grid is image-input only; long JSON responses or detailed reports can dominate total cost.
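To make the gap between token counts and dollar costs concrete, the short sketch below prices the phone photo using only the token counts and input prices quoted in this post. The dictionary layout is an illustrative choice.

```python
# Image tokens for the phone photo and input prices ($ per million tokens),
# both as quoted in this post.
PHONE_PHOTO = {
    "GPT-5.5": (2451, 5.00),
    "Claude Opus 4.7": (6636, 5.00),
    "Gemini 3.1 Pro": (6192, 2.00),
}

costs = {name: tokens * price / 1_000_000
         for name, (tokens, price) in PHONE_PHOTO.items()}

# Cheapest first.
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.6f} per image, ${cost * 1_000_000:,.0f} per million images")
```

Gemini uses about 2.5 times as many tokens as GPT-5.5 for this image, yet the two land within about one percent of each other in dollars, which is why token counts alone are a poor proxy for cost.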

Generality becomes a tax at production scale

Frontier VLMs are the right tool when you need general reasoning over an image, when prompt iteration matters more than per-call cost, or when volumes are low enough that an extra cent per image is invisible. A few thousand calls a day, a few cents each, is fine.

The math changes at scale. A factory inspection line running at 30 frames per second on three cameras produces about 7.8 million images a day. At about $0.002 per image, roughly the cheapest web-resolution cell in the grid above, that is about $15,600 per day, every day, for one line. Add output tokens, retries, and a redundant model for cross-checking, and the number can easily double.
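The arithmetic behind those figures, as a quick sketch. It assumes the line runs around the clock and every frame is sent to the API; the constant names are illustrative.

```python
FPS = 30
CAMERAS = 3
SECONDS_PER_DAY = 24 * 60 * 60
COST_PER_IMAGE = 0.002  # dollars, the approximate figure used above

images_per_day = FPS * CAMERAS * SECONDS_PER_DAY
daily_cost = images_per_day * COST_PER_IMAGE

print(f"{images_per_day:,} images per day")
print(f"${daily_cost:,.0f} per day")
```

The unrounded result is 7,776,000 images and about $15,552 per day, which the text rounds to 7.8 million and $15,600.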

At that volume, generality becomes a tax. Most production vision workloads do not need a model that can also write poetry; they need a model that runs a specific task fast and cheap on specific hardware.

This is the gap that purpose-built vision models fill. A fine-tuned RF-DETR running on an edge GPU can do object detection at sub-millisecond latency for a fraction of a cent per frame, and it does not pay for tokens at all. Roboflow exists because at production scale the right answer is usually not an API call to a frontier VLM. It is a smaller, specialized model trained on your data and deployed where the cameras actually are.

The frontier VLMs still have a role in that pipeline. They are useful for bootstrapping labels, handling the long tail of edge cases, and debugging failure modes. The point is not to pick one tool. It is to know where each tool earns its keep, which starts with knowing what each one actually costs.

VLM cost calculator

The formulas are stable enough to put in a spreadsheet. If you want to plug in your own image distribution and instruction lengths, the rules above are everything you need. The Python below reproduces the numbers in this post.

import math

def gpt55_tokens(w, h, detail="high"):
    """
    Approximate GPT-5.5 image input tokens.

    GPT-5.5 uses 32x32 patch-based image tokenization.
    For cost-controlled workloads, explicitly set detail="high";
    GPT-5.5 default/auto behaves like "original".
    """
    if detail == "low":
        return 16 * 16

    if detail == "high":
        patch_budget = 2500
        max_dim = 2048
    elif detail in ("original", "auto"):
        patch_budget = 10000
        max_dim = 6000
    else:
        raise ValueError("detail must be 'low', 'high', 'original', or 'auto'")

    # Scale needed to bring the longest side under the dimension cap.
    dim_scale = min(1.0, max_dim / max(w, h))

    original_patches = math.ceil(w / 32) * math.ceil(h / 32)

    if original_patches <= patch_budget:
        patch_scale = 1.0
    else:
        # Area-based shrink toward the patch budget, then scaled down a
        # little further so one side lands on a whole number of patches.
        shrink = math.sqrt((32**2 * patch_budget) / (w * h))

        patch_scale = shrink * min(
            math.floor(w * shrink / 32) / (w * shrink / 32),
            math.floor(h * shrink / 32) / (h * shrink / 32),
        )

    # Apply whichever limit is more restrictive.
    scale = min(dim_scale, patch_scale)

    resized_w = math.floor(w * scale)
    resized_h = math.floor(h * scale)

    return math.ceil(resized_w / 32) * math.ceil(resized_h / 32)

def claude_tokens(w, h, max_long=2576):
    # Resize so the long edge fits the cap, then apply the area formula.
    if max(w, h) > max_long:
        s = max_long / max(w, h)
        w, h = w * s, h * s
    return round(w * h / 750)

def gemini_tokens(w, h):
    # Flat cost for small images, otherwise 258 tokens per 768x768 tile.
    if w <= 384 and h <= 384:
        return 258
    return 258 * math.ceil(w / 768) * math.ceil(h / 768)
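For a mixed workload, a per-provider token function can be averaged over an image-size distribution. The sketch below is one way to do that, not part of the original calculator: `expected_image_cost` is a hypothetical helper, the workload mix is invented for illustration, and the Gemini rule is repeated only so the block runs on its own.

```python
import math

def gemini_tokens(w, h):
    """Same tile rule as the calculator above, repeated so this sketch runs alone."""
    if w <= 384 and h <= 384:
        return 258
    return 258 * math.ceil(w / 768) * math.ceil(h / 768)

def expected_image_cost(distribution, token_fn, input_price_per_m):
    """Average image-input cost in dollars over {(w, h): share} pairs."""
    total_share = sum(distribution.values())
    avg_tokens = sum(token_fn(w, h) * share
                     for (w, h), share in distribution.items()) / total_share
    return avg_tokens * input_price_per_m / 1_000_000

# Hypothetical workload mix, not data from this post.
mix = {(256, 256): 0.5, (1024, 768): 0.3, (4032, 3024): 0.2}
print(f"${expected_image_cost(mix, gemini_tokens, 2.00):.6f} per image on average")
```

Passing a different token function and input price gives the same average for another provider, so one distribution can be priced three ways.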

Sources for Vision Token Counts

OpenAI vision and pricing: Images and vision guide, API pricing.

Anthropic Claude vision and pricing: Vision docs, Opus 4.7 announcement, pricing.

Google Gemini image understanding and pricing: Image understanding, tokens guide, pricing.

Cite this Post

Use the following entry to cite this post in your research:

Vision Token Counts: What does it cost to process an image with a frontier vision model?. Roboflow Blog: https://blog.roboflow.com/image-token-cost-vlm/

source & further reading

blog.roboflow.com (original article)