# I built a $0.0005 screenshot cropper that saves AI agents 95% on vision LLM costs

> Source: <https://dev.to/aaroncarlisle94/i-built-a-00005-screenshot-cropper-that-saves-ai-agents-95-on-vision-llm-costs-2c41>
> Published: 2026-06-24 21:20:23+00:00

If you're building AI agents that work with browser screenshots, you already know the pain.

You take a full 1920×1080 screenshot, pass it to GPT-4o or Claude, and watch your token bill climb — while the model downscales the image anyway and blurs the exact text you needed it to read.

There's a better way.

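Vision LLMs are expensive for two reasons when you feed them full screenshots: you pay tokens for the whole image, including everything your agent doesn't care about, and the model downscales large images, so small text ends up too blurry to read.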

But your agent already knows *where* to look. Browser automation tools like Playwright and Puppeteer let you call `getBoundingClientRect()` inside the page, which returns the exact pixel coordinates of any element on screen.

So why are you sending the whole screenshot?
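One wrinkle when turning that rectangle into crop coordinates: `getBoundingClientRect()` reports fractional CSS pixels, while a screenshot captured at device resolution is measured in device pixels. Here is a minimal sketch of the conversion, assuming you know the screenshot's dimensions and scale factor. `rectToCrop` and its parameters are hypothetical names for illustration, not part of the API.

```javascript
// Convert a DOMRect-like object (CSS pixels) into an integer crop box
// (image pixels), clamped to the screenshot bounds.
// `scale` is an assumption: 1 when the screenshot is in CSS pixels,
// the page's devicePixelRatio when it is captured at device resolution.
function rectToCrop(rect, image, scale = 1) {
  // Round outward so a fractional rect never loses its edge pixels.
  const left   = Math.max(0, Math.floor(rect.x * scale));
  const top    = Math.max(0, Math.floor(rect.y * scale));
  const right  = Math.min(image.width,  Math.ceil((rect.x + rect.width)  * scale));
  const bottom = Math.min(image.height, Math.ceil((rect.y + rect.height) * scale));

  if (right <= left || bottom <= top) {
    throw new RangeError('Element lies outside the screenshot');
  }
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Example: a fractional rect inside a 1920x1080 screenshot.
const crop = rectToCrop(
  { x: 120.4, y: 45.2, width: 640.3, height: 80.1 },
  { width: 1920, height: 1080 },
);
console.log(crop);
```

Rounding outward keeps the whole element in frame at the cost of a pixel or so of padding, which is usually the safer trade when the goal is legible text.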

I built a stateless pay-per-use API that takes a screenshot and pixel coordinates, and returns just the cropped element as a lossless PNG — ready to pass directly to your vision LLM.

```
POST /crop
{
  "image":  "<base64 screenshot>",
  "x":      120,
  "y":      45,
  "width":  640,
  "height": 80
}
```

Returns:

```
{
  "success": true,
  "data": {
    "base64": "iVBORw0KGgo...",
    "mime":   "image/png",
    "width":  640,
    "height": 80,
    "bytes":  4821
  }
}
```

A 4KB crop instead of a 2MB screenshot. Same information. 95% fewer tokens.
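Before forwarding the crop to a model, an agent may want to sanity-check the response. The sketch below decodes `data.base64`, confirms the PNG signature, and compares the dimensions stored in the PNG header against the `width` and `height` fields shown above. `decodeCrop` is a hypothetical helper name; it relies only on Node's built-in `Buffer` and the standard PNG layout.

```javascript
// The 8-byte signature every PNG file starts with.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Decode the `data` object from a /crop response and verify it is a PNG
// whose header dimensions match the reported width and height.
function decodeCrop(data) {
  const png = Buffer.from(data.base64, 'base64');

  if (png.length < 24 || !png.subarray(0, 8).equals(PNG_SIGNATURE)) {
    throw new Error('Response payload is not a PNG');
  }

  // IHDR is always the first chunk: width and height are big-endian
  // 32-bit integers at byte offsets 16 and 20.
  const width  = png.readUInt32BE(16);
  const height = png.readUInt32BE(20);
  if (width !== data.width || height !== data.height) {
    throw new Error(`Expected ${data.width}x${data.height}, PNG header says ${width}x${height}`);
  }
  return png; // raw PNG bytes, ready to save or re-encode
}
```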

Here's where it gets interesting. The API uses the **x402 payment protocol** — HTTP's long-dormant 402 Payment Required status code, finally put to use.

There are no API keys. No accounts. No subscriptions. The agent pays $0.0005 USDC per crop on Base L2 automatically.
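USDC uses 6 decimal places on-chain, so a token transfer takes an integer number of base units rather than a decimal dollar amount. Here is a minimal sketch of that conversion, assuming the price arrives as a plain decimal string such as `"0.0005"` (the exact format the server uses is an assumption here). `toUsdcUnits` is a hypothetical helper name.

```javascript
const USDC_DECIMALS = 6;

// Convert a decimal USDC string into integer base units as a BigInt,
// avoiding floating-point rounding entirely.
function toUsdcUnits(amount) {
  if (!/^\d+(\.\d+)?$/.test(amount)) {
    throw new TypeError(`Not a plain decimal amount: ${amount}`);
  }
  const [whole, fraction = ''] = amount.split('.');
  if (fraction.length > USDC_DECIMALS) {
    throw new RangeError(`More than ${USDC_DECIMALS} decimal places: ${amount}`);
  }
  // Pad the fraction to exactly 6 digits, then read the digits as one integer.
  return BigInt(whole + fraction.padEnd(USDC_DECIMALS, '0'));
}

console.log(toUsdcUnits('0.0005')); // 500n
```

String-based conversion sidesteps the rounding surprises you can get from multiplying a float by 1,000,000.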

The flow:

```
1. Agent POSTs to /crop (no payment header)
   ← 402 with payment instructions in headers

2. Agent transfers 0.0005 USDC to recipient wallet on Base
   (near-zero gas, ~2 second settlement)

3. Agent POSTs again with x-payment-tx-hash header
   ← 200 with cropped PNG
```

The entire exchange happens inside the HTTP request cycle. No human intervention. No billing dashboard. The money lands directly in the operator's wallet on-chain.
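The three steps can be wrapped in one reusable function. This is a sketch under stated assumptions, not the service's official client: `cropWithPayment` is a hypothetical name, `pay` is a callback you supply that performs the transfer and resolves to a transaction hash, and `fetchFn` is injectable so the flow can be exercised without a network. The header names are the ones used in this post.

```javascript
// Hypothetical helper: probe, pay, then retry against a crop endpoint.
async function cropWithPayment(url, cropRequest, pay, fetchFn = fetch) {
  const post = (extraHeaders = {}) => fetchFn(url, {
    method:  'POST',
    headers: { 'Content-Type': 'application/json', ...extraHeaders },
    body:    JSON.stringify(cropRequest),
  });

  // Step 1: unpaid request; the server is expected to answer 402.
  const probe = await post();
  if (probe.status !== 402) {
    throw new Error(`Expected a 402 payment challenge, got HTTP ${probe.status}`);
  }

  // Step 2: read the payment instructions and let the caller pay.
  const recipient = probe.headers.get('x-payment-recipient');
  const amount    = probe.headers.get('x-payment-price-usdc');
  if (!recipient || !amount) {
    throw new Error('402 response is missing payment headers');
  }
  const txHash = await pay({ recipient, amount });

  // Step 3: retry the same request with the payment proof attached.
  const paid = await post({ 'x-payment-tx-hash': txHash });
  if (!paid.ok) {
    throw new Error(`Crop failed after payment: HTTP ${paid.status}`);
  }
  const { data } = await paid.json();
  return data;
}
```

Keeping the wallet logic behind a `pay` callback means the HTTP flow stays testable and the same function works with whichever signing library your agent already uses.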

Here's what using it looks like in a Playwright agent:

``` js
import { chromium } from 'playwright';
import OpenAI from 'openai';

const CROP_URL = 'https://x402-vision-cropper.onrender.com/crop';
const openai   = new OpenAI(); // reads OPENAI_API_KEY from the environment

const browser = await chromium.launch();
const page    = await browser.newPage();
await page.goto('https://example.com/dashboard');

// Take a viewport screenshot; page.screenshot() resolves to a Buffer,
// so there is no need to write a file and read it back
const imageB64 = (await page.screenshot()).toString('base64');

// Get element coordinates (viewport-relative, matching the viewport screenshot)
const rect = await page.$eval('.price-display', el => el.getBoundingClientRect().toJSON());
await browser.close();

// Build the request body once; it is sent twice (probe, then paid retry)
const body = JSON.stringify({
  image:  imageB64,
  x:      Math.floor(rect.x),
  y:      Math.floor(rect.y),
  width:  Math.floor(rect.width),
  height: Math.floor(rect.height),
});

// Probe the API for payment instructions
const probe = await fetch(CROP_URL, {
  method:  'POST',
  headers: { 'Content-Type': 'application/json' },
  body,
});
if (probe.status !== 402) {
  throw new Error(`Expected a 402 payment challenge, got HTTP ${probe.status}`);
}

// The 402 response carries the payment details in its headers
const recipient = probe.headers.get('x-payment-recipient');
const amount    = probe.headers.get('x-payment-price-usdc');

// Pay on Base L2 (for example with viem). sendUsdc is a placeholder for
// your own wallet logic; it must return the transaction hash
const txHash = await sendUsdc({ recipient, amount });

// Resubmit with payment proof
const result = await fetch(CROP_URL, {
  method:  'POST',
  headers: {
    'Content-Type':      'application/json',
    'x-payment-tx-hash': txHash,
  },
  body,
});
if (!result.ok) {
  throw new Error(`Crop request failed: HTTP ${result.status}`);
}

const { data } = await result.json();

// Pass the tiny crop to your vision LLM instead of the full screenshot
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: `data:${data.mime};base64,${data.base64}` } },
      { type: 'text', text: 'What is the price shown?' }
    ]
  }]
});
console.log(response.choices[0].message.content);
```

The server is intentionally minimal. The entire codebase is about 400 lines across 7 files. No database. No session state. No auth layer beyond the payment itself.

The API is live now:

```
# Check it's running
curl https://x402-vision-cropper.onrender.com/health

# Trigger the payment challenge
curl -X POST https://x402-vision-cropper.onrender.com/crop \
  -H "Content-Type: application/json" \
  -d '{"image":"'"$(python3 -c "print('A'*200)")"'","x":0,"y":0,"width":10,"height":10}'
```

Machine-readable docs for agents: [https://x402-vision-cropper.onrender.com/llms.txt](https://x402-vision-cropper.onrender.com/llms.txt)

**x402 is genuinely exciting but very early.** The protocol works cleanly — payment instructions in headers, proof in the retry, settlement on-chain. But the agent ecosystem is still catching up. Most frameworks don't have native wallet support yet.

**Stateless by design is underrated.** No database means no stored data to breach, no GDPR headache, no backup strategy, no connection pooling. Every request lives and dies in RAM. For a high-throughput API that processes sensitive screenshot data, this is the right architecture.

**The unit economics make sense at scale.** At $0.0005 per crop (2,000 crops per dollar), the service costs less than a rounding error compared to what it saves on vision tokens. The challenge isn't pricing; it's volume.

If you're building browser agents or anything that feeds screenshots to vision models, give it a try. And if you're building in the x402 / agentic payments space, I'd love to hear what you're working on.
