If you're building AI agents that work with browser screenshots, you already know the pain.
You take a full 1920×1080 screenshot, pass it to GPT-4o or Claude, and watch your token bill climb — while the model downscales the image anyway and blurs the exact text you needed it to read.
There's a better way.
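Vision LLMs are expensive for two reasons when you feed them full screenshots: you pay image tokens for the whole frame, including everything the agent doesn't care about, and the model downscales large images before reading them, so small text gets blurred.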
But your agent already knows where to look. Browser automation tools like Playwright and Puppeteer let you call getBoundingClientRect() in the page, which gives you the exact pixel coordinates of any element on screen.
So why are you sending the whole screenshot?
I built a stateless pay-per-use API that takes a screenshot and pixel coordinates, and returns just the cropped element as a lossless PNG — ready to pass directly to your vision LLM.
POST /crop
{
  "image": "<base64 screenshot>",
  "x": 120,
  "y": 45,
  "width": 640,
  "height": 80
}
Returns:
{
  "success": true,
  "data": {
    "base64": "iVBORw0KGgo...",
    "mime": "image/png",
    "width": 640,
    "height": 80,
    "bytes": 4821
  }
}
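Before handing the crop to a model, it can be worth checking that the payload really is what the response says it is. The sketch below decodes the base64 field and compares it against the reported metadata. `decodeCrop` is a hypothetical helper name, and it assumes that `bytes` is the length of the decoded PNG, which the response format above suggests but does not state.

```javascript
// Sketch: decode a /crop response and sanity-check it before use.
// `decodeCrop` is a hypothetical helper, not part of the API.
// ASSUMPTION: `bytes` is the byte length of the decoded PNG.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

function decodeCrop(response) {
  if (!response || response.success !== true || !response.data) {
    throw new Error('crop failed or response is malformed');
  }
  const { base64, mime, width, height, bytes } = response.data;
  if (mime !== 'image/png') throw new Error(`unexpected mime type: ${mime}`);

  const png = Buffer.from(base64, 'base64');
  if (png.length !== bytes) {
    throw new Error(`expected ${bytes} bytes, decoded ${png.length}`);
  }
  // Every PNG starts with the same 8-byte signature.
  if (png.length < 24 || !png.subarray(0, 8).equals(PNG_SIGNATURE)) {
    throw new Error('payload is not a PNG');
  }
  // The IHDR chunk follows the signature: width at byte 16, height at
  // byte 20, both big-endian 32-bit integers.
  const actualWidth = png.readUInt32BE(16);
  const actualHeight = png.readUInt32BE(20);
  if (actualWidth !== width || actualHeight !== height) {
    throw new Error(`dimension mismatch: got ${actualWidth}x${actualHeight}`);
  }
  return png;
}
```

The returned Buffer can be written to disk for debugging or re-encoded as a data URL for the model.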
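A 4 KB crop instead of a 2 MB screenshot. Same information, roughly 95% fewer image tokens (the exact saving depends on how your model bills images, which goes by pixel dimensions rather than file size).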
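To see where a number like that can come from, here is a rough back-of-the-envelope estimate. It assumes the rule of thumb from Anthropic's vision documentation (about width × height / 750 tokens, after the image is scaled down so its long edge is at most 1568 px). Other models bill images differently, so treat this as an illustration rather than a quote; `estimateImageTokens` is a hypothetical helper.

```javascript
// Rough image-token estimate. ASSUMPTION: tokens ≈ (width × height) / 750,
// after scaling the image down so its long edge is at most 1568 px.
// This approximates one provider's published rule of thumb; real billing
// may differ, and other models use different schemes.
const MAX_LONG_EDGE = 1568;
const PIXELS_PER_TOKEN = 750;

function estimateImageTokens(width, height) {
  const longEdge = Math.max(width, height);
  if (longEdge > MAX_LONG_EDGE) {
    // Scale both sides by the same factor so the aspect ratio is preserved.
    height = Math.round((height * MAX_LONG_EDGE) / longEdge);
    width = Math.round((width * MAX_LONG_EDGE) / longEdge);
  }
  return Math.ceil((width * height) / PIXELS_PER_TOKEN);
}

const fullScreenshot = estimateImageTokens(1920, 1080); // scaled to 1568 × 882
const crop = estimateImageTokens(640, 80);
const savings = 1 - crop / fullScreenshot;

console.log({ fullScreenshot, crop, savings: `${(savings * 100).toFixed(1)}%` });
```

Under these assumptions the full frame comes to about 1,844 tokens and the 640 × 80 crop to 69, a saving of about 96%.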
Here's where it gets interesting. The API uses the x402 payment protocol — HTTP's long-dormant 402 Payment Required status code, finally put to use.
There are no API keys. No accounts. No subscriptions. The agent pays $0.0005 USDC per crop on Base L2 automatically.
The flow:
1. Agent POSTs to /crop (no payment header)
   ← 402 with payment instructions in headers
2. Agent transfers 0.0005 USDC to recipient wallet on Base
   (near-zero gas, ~2 second settlement)
3. Agent POSTs again with x-payment-tx-hash header
   ← 200 with cropped PNG
The entire exchange happens inside the HTTP request cycle. No human intervention. No billing dashboard. The money lands directly in the operator's wallet on-chain.
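Step 2 means turning the 402 headers into an on-chain amount. A minimal sketch of that parsing step is below. The header names are the ones used in the Playwright example later in this post. The helper names are hypothetical, and it assumes the price header is a plain decimal string such as "0.0005". USDC uses 6 decimal places, so 0.0005 USDC is 500 base units.

```javascript
// Sketch: turn a 402 response's headers into a payment instruction.
// `usdcToBaseUnits` and `parsePaymentHeaders` are hypothetical helpers.
// ASSUMPTION: the price header is a decimal string like "0.0005".
const USDC_DECIMALS = 6; // USDC uses 6 decimal places on-chain

function usdcToBaseUnits(amount) {
  // Parse as a decimal string to avoid floating-point rounding.
  const match = /^(\d+)(?:\.(\d+))?$/.exec(String(amount).trim());
  if (!match) throw new Error(`invalid USDC amount: ${amount}`);
  const [, whole, fraction = ''] = match;
  if (fraction.length > USDC_DECIMALS) {
    throw new Error(`more than ${USDC_DECIMALS} decimal places: ${amount}`);
  }
  return BigInt(whole + fraction.padEnd(USDC_DECIMALS, '0'));
}

function parsePaymentHeaders(headers) {
  // `headers` is anything with a get(name) method, e.g. fetch's Headers.
  const recipient = headers.get('x-payment-recipient');
  const price = headers.get('x-payment-price-usdc');
  if (!recipient || !price) {
    throw new Error('402 response is missing payment headers');
  }
  // An EVM address is 0x followed by 40 hex characters.
  if (!/^0x[0-9a-fA-F]{40}$/.test(recipient)) {
    throw new Error(`recipient is not an EVM address: ${recipient}`);
  }
  return { recipient, amountBaseUnits: usdcToBaseUnits(price) };
}
```

Checking the amount before paying also gives the agent a natural place to enforce a spending cap, so a misconfigured or hostile server can't ask for more than you expect.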
Here's what using it looks like in a Playwright agent:
import { chromium } from 'playwright';
import OpenAI from 'openai';

const CROP_URL = 'https://x402-vision-cropper.onrender.com/crop';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/dashboard');

// Take a screenshot; page.screenshot() resolves to a Buffer, so no temp file is needed
const imageB64 = (await page.screenshot()).toString('base64');

// Get element coordinates (relative to the viewport)
const rect = await page.$eval('.price-display', el => el.getBoundingClientRect().toJSON());
await browser.close();

// The probe and the paid retry send the same body
const body = JSON.stringify({
  image: imageB64,
  x: Math.floor(rect.x),
  y: Math.floor(rect.y),
  width: Math.floor(rect.width),
  height: Math.floor(rect.height),
});

// Probe the API for payment instructions
const probe = await fetch(CROP_URL, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body,
});
// → 402 response with payment details in headers
if (probe.status !== 402) throw new Error(`Expected 402, got ${probe.status}`);
const recipient = probe.headers.get('x-payment-recipient');
const amount = probe.headers.get('x-payment-price-usdc');

// Pay on Base L2 using viem
// sendUsdc is a placeholder: your wallet logic goes here
const txHash = await sendUsdc({ recipient, amount });

// Resubmit with payment proof
const result = await fetch(CROP_URL, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-payment-tx-hash': txHash,
  },
  body,
});
if (!result.ok) throw new Error(`Crop failed with status ${result.status}`);
const { data } = await result.json();

// Pass the tiny crop to your vision LLM instead of the full screenshot
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: `data:${data.mime};base64,${data.base64}` } },
      { type: 'text', text: 'What is the price shown?' }
    ]
  }]
});
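One thing to watch in the example above: getBoundingClientRect() reports CSS pixels, and the values can be fractional or extend past the viewport. If the browser context uses a device scale factor other than 1, the screenshot has more pixels than the CSS coordinates suggest. Here's a small sketch of a helper that scales, rounds outward, and clamps the box. `toCropBox` is a hypothetical name, and `scale` is whatever device scale factor your context uses (window.devicePixelRatio in the page).

```javascript
// Sketch: convert a getBoundingClientRect() result (CSS pixels) into an
// integer crop box in screenshot pixels. `toCropBox` is a hypothetical helper.
// `image` is the screenshot size in pixels; `scale` is the device scale factor.
function toCropBox(rect, image, scale = 1) {
  // Round outward so fractional edges are never cut off.
  const left = Math.floor(rect.x * scale);
  const top = Math.floor(rect.y * scale);
  const right = Math.ceil((rect.x + rect.width) * scale);
  const bottom = Math.ceil((rect.y + rect.height) * scale);

  // Clamp to the screenshot bounds; elements can extend past the viewport.
  const x = Math.max(0, left);
  const y = Math.max(0, top);
  const width = Math.min(image.width, right) - x;
  const height = Math.min(image.height, bottom) - y;

  if (width <= 0 || height <= 0) {
    throw new Error('element is outside the screenshot');
  }
  return { x, y, width, height };
}
```

For example, `toCropBox(rect, { width: 1920, height: 1080 })` could replace the four Math.floor calls when building the request body.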
The server is intentionally minimal: the entire codebase is about 400 lines across 7 files, with no database, no session state, and no auth layer beyond the payment itself.
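The server code isn't reproduced in this post, so the following is purely illustrative and not the actual implementation. It's a sketch of what stateless validation can look like for the request shape shown earlier: everything needed to accept or reject a request comes from the body itself. `validateCropRequest` is a hypothetical name.

```javascript
// Illustrative only: NOT the service's real code.
// A stateless handler can validate each request from its body alone,
// with no lookups against stored state.
function validateCropRequest(body) {
  const errors = [];
  if (typeof body?.image !== 'string' || body.image.length === 0) {
    errors.push('image must be a non-empty base64 string');
  }
  // The crop origin may be 0, but not negative.
  for (const key of ['x', 'y']) {
    if (!Number.isInteger(body?.[key]) || body[key] < 0) {
      errors.push(`${key} must be a non-negative integer`);
    }
  }
  // A crop needs at least one pixel in each dimension.
  for (const key of ['width', 'height']) {
    if (!Number.isInteger(body?.[key]) || body[key] <= 0) {
      errors.push(`${key} must be a positive integer`);
    }
  }
  return { ok: errors.length === 0, errors };
}
```

Returning every problem at once, rather than failing on the first, saves an agent a round trip per mistake.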
The API is live now:
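# Health check
curl https://x402-vision-cropper.onrender.com/health

# Unpaid POST with a dummy payload; per the flow above this should return a 402
# (-i prints the response headers, which carry the payment instructions)
curl -i -X POST https://x402-vision-cropper.onrender.com/crop \
  -H "Content-Type: application/json" \
  -d '{"image":"'"$(python3 -c "print('A'*200)")"'","x":0,"y":0,"width":10,"height":10}'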
Machine-readable docs for agents: https://x402-vision-cropper.onrender.com/llms.txt
x402 is genuinely exciting but very early. The protocol works cleanly — payment instructions in headers, proof in the retry, settlement on-chain. But the agent ecosystem is still catching up. Most frameworks don't have native wallet support yet.
Stateless by design is underrated. No database means no stored data to breach, a much smaller GDPR footprint, no backup strategy, and no connection pooling. Every request lives and dies in RAM. For a high-throughput API that processes sensitive screenshot data, this is the right architecture.
The unit economics make sense at scale. At $0.0005 per crop the service costs less than a rounding error compared to what it saves on vision tokens. The challenge isn't pricing — it's volume.
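A quick way to check this for your own setup is to work out how many input tokens a crop has to save before it pays for itself. The $0.0005 price comes from this post. The token price in the sketch is an assumed example value, not a quote for any particular model, and `breakEvenTokens` is a hypothetical helper. Gas and the extra round trip are not counted.

```javascript
// Break-even sketch: how many input tokens must one crop save to cover its fee?
// The $0.0005 crop price is from this post; the token price is an ASSUMED
// example value. Gas costs and added latency are ignored.
const CROP_PRICE_USD = 0.0005;

function breakEvenTokens(usdPerMillionInputTokens, cropPriceUsd = CROP_PRICE_USD) {
  if (!(usdPerMillionInputTokens > 0)) {
    throw new Error('token price must be positive');
  }
  // cropPrice / (pricePerMillion / 1e6), rearranged to stay readable.
  return Math.ceil((cropPriceUsd * 1e6) / usdPerMillionInputTokens);
}

const assumedPrice = 3; // hypothetical: $3 per million input tokens
console.log(`break-even at ${breakEvenTokens(assumedPrice)} tokens saved per crop`);
```

At the assumed $3 per million input tokens, a crop breaks even once it saves 167 tokens. Cheaper models push that threshold up, so it's worth plugging in your own prices.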
If you're building browser agents or anything that feeds screenshots to vision models, give it a try. And if you're building in the x402 / agentic payments space, I'd love to hear what you're working on.