I Cut My AI Test Automation Cost by 300x by Ditching Vision Models

A developer reduced AI test automation costs by 200-300x, from $0.011 per step to $0.00004, by replacing vision models with a pure-text approach. Instead of sending full-page screenshots to multimodal LLMs, the new framework extracts interactive elements directly from the DOM tree and uses DeepSeek V4 to analyze the structure and decide actions. The author notes that for standard CRUD operations like filling forms and clicking buttons, vision models are overkill since the DOM already contains all necessary information.

From $0.011 per step to $0.00004: here's how I learned vision models are overkill for most web testing, and what I built instead.

It started with a $400 monthly API bill (and yes, that's USD; I'm in China, but you'll feel the same pain in any currency).

I was running an AI-powered test automation platform built on Midscene.js with Qwen-VL vision models. Every test step meant sending a full-page screenshot to a multimodal LLM and paying about $0.011 per step. A 50-step test case cost about $0.55. Run it daily? $16.50/month. Add a few more test scenarios, and suddenly I was spending more on API calls than on coffee.

And the worst part? Most of those screenshots contained information I already had for free.

First, a quick backstory. I built ai-test-platform, a full-stack test automation management system.

It worked. Beautiful reports, clean UI, easy test management. I even pushed it to Docker Hub as xulingfeng/ai-test-platform:latest.

But every time I ran a test, I could almost hear the coins dropping. $0.011 here, $0.011 there. A 29-step doctor-onboarding flow cost $0.32. For a solo QA engineer running tests multiple times a day, that adds up fast.

I was watching a test run one afternoon. The AI was analyzing a screenshot of a web page, and I realized something: The AI could see 45 interactive elements in the screenshot.
But Playwright had already extracted all 45 of them as clean structured text. I was paying to process pixels when the data was already neatly organized in the DOM tree.

Here's what a page looks like to a vision model: screenshot image with pixel data, rendering details, colors, shadows...

And here's what it looks like in the DOM:

0