I Cut My AI Test Automation Cost by 300x by Ditching Vision Models

A developer reduced AI test automation costs by 200-300x, from $0.011 per step to $0.00004, by replacing vision models with a pure-text approach. Instead of sending full-page screenshots to multimodal LLMs, the new framework extracts interactive elements directly from the DOM tree and uses DeepSeek V4 to analyze the structure and decide actions. The author notes that for standard CRUD operations like filling forms and clicking buttons, vision models are overkill since the DOM already contains all necessary information.

From $0.011 per step to $0.00004: here's how I learned vision models are overkill for most web testing, and what I built instead.

It started with a $400 monthly API bill. And yes, that's USD. I'm in China, but you'll feel the same pain in any currency.

I was running an AI-powered test automation platform built on Midscene.js with Qwen-VL vision models. Every test step meant sending a full-page screenshot to a multimodal LLM and paying about $0.011 per step. A 50-step test case cost about $0.55. Run it daily? $16.50/month. Add a few more test scenarios, and suddenly I was spending more on API calls than on coffee. And the worst part? Most of those screenshots contained information I already had for free.

First, a quick backstory. I built ai-test-platform, a full-stack test automation management system. It worked. Beautiful reports, clean UI, easy test management. I even pushed it to Docker Hub (xulingfeng/ai-test-platform:latest).

But every time I ran a test, I could almost hear the coins dropping. $0.011 here, $0.011 there. A 29-step doctor-onboarding flow cost $0.32. For a solo QA engineer running tests multiple times a day, that adds up fast.

I was watching a test run one afternoon. The AI was analyzing a screenshot of a web page, and I realized something: the AI could see 45 interactive elements in the screenshot, but Playwright had already extracted all 45 of them as clean structured text. I was paying to process pixels when the data was already neatly organized in the DOM tree.

Here's what a page looks like to a vision model:

    [screenshot image with pixel data, rendering details, colors, shadows...]

And here's what it looks like in the DOM:

    0 <input placeholder="Search..." name="q">
    1 <button>Sign in</button>
    2 <a>Add new doctor</a>
    ...

The AI doesn't need to "see" the page. It needs to understand the structure and decide what to click. And structured text does that perfectly.

I built deep-test, a pure-text AI testing framework.
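To make the numbered element listing concrete, here is a minimal Python sketch of turning raw HTML into that kind of text. It is not deep-test's code: it uses the standard library's `html.parser` rather than Playwright, and the set of interactive tags, the attributes kept, and the names `InteractiveElementLister` and `list_interactive_elements` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as interactive; this set is an assumption, not the author's list.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Attributes worth showing to the model; also an assumption.
KEPT_ATTRS = ("name", "type", "placeholder", "href")


class InteractiveElementLister(HTMLParser):
    """Collects interactive elements as short one-line descriptions."""

    def __init__(self):
        super().__init__()
        self.elements = []  # finished one-line descriptions
        self._open = None   # [tag, attr_text, text_parts] of the element being read

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_text = "".join(
            f' {key}="{value}"' for key, value in attrs if key in KEPT_ATTRS and value
        )
        if tag == "input":
            # <input> is a void element: no closing tag and no inner text.
            self.elements.append(f"<input{attr_text}>")
        else:
            self._open = [tag, attr_text, []]

    def handle_data(self, data):
        # Only text inside an open interactive element is kept.
        if self._open is not None:
            self._open[2].append(data.strip())

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            name, attr_text, parts = self._open
            text = " ".join(part for part in parts if part)
            self.elements.append(f"<{name}{attr_text}>{text}</{name}>")
            self._open = None


def list_interactive_elements(html: str) -> str:
    """Return one numbered line per interactive element found in the HTML."""
    parser = InteractiveElementLister()
    parser.feed(html)
    parser.close()
    return "\n".join(f"{i} {el}" for i, el in enumerate(parser.elements))


# Hypothetical page fragment: layout markup is dropped, controls are kept.
page = """
<div class="hero"><img src="banner.png"><p>Welcome back</p></div>
<input placeholder="Search..." name="q">
<button class="btn primary">Sign in</button>
<a href="/doctors/new">Add new doctor</a>
"""
print(list_interactive_elements(page))
```

In a real run the HTML would come from the live browser (for example Playwright's `page.content()`), and a production version would also need to check visibility and disabled state, which this sketch ignores.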
The architecture is embarrassingly simple:

    Task: "Login system, search product, add to cart"
        ↓
    ① Extract interactive elements (DOM tree / uiautomator)
       No screenshots. No vision models.
        ↓
    ② DeepSeek V4 analyzes structure + decides next action
       ~2000 tokens/step at $0.14/M tokens, on the order of $0.0001/step
        ↓
    ③ Execute action (Playwright click / ADB tap)
        ↓
    ④ Back to ① until task completes

The cost comparison is ridiculous: 200-300x cheaper. The 50-step test that cost $0.55 now costs less than a cent.

I ran a complete hospital management workflow: login, navigate menus, add a new doctor with 12 fields, verify the result. 29 steps total. Result: 81.8 seconds, ~$0.001 total cost. For context, that's less than the price of a single step on the vision-based approach.

Here's where it gets even more interesting. Android apps can't give you a clean DOM tree like a web page. So I added a hybrid approach. This means one AI agent handles both Web and Android with the same architecture.

And I even solved the notorious hybrid app WebView input problem, where in-app web views ignore standard automation commands. The fix: uiautomator2's send_keys instead of set_text. Took days to figure out, one line to implement.

Vision models are overkill for most web testing. They have their place, but for standard CRUD operations (filling forms, clicking buttons, navigating menus) the DOM already has all the information you need. The real optimization isn't about better prompting or smarter AI. It's about choosing the right data format for the job.

Neither project is public yet: they contain real test data from production healthcare applications. I plan to clean and open-source them once the company-specific content is stripped out. If you'd like early access or want to discuss the approach, feel free to reach out.

I'm a test manager with 15 years of experience. I've been building AI testing tools on the side because I believe good testing shouldn't cost a fortune.
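For a concrete picture of the four-step loop in the architecture above, here is a minimal Python sketch. It is an illustration under assumptions, not deep-test's implementation: `Action`, `run_task`, `FakeLoginPage`, and `scripted_decide` are hypothetical names, the page is an in-memory fake rather than a Playwright or ADB session, and the decide function is a scripted stand-in for the LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str                    # "click", "type", or "done"
    index: Optional[int] = None  # which listed element to act on
    text: str = ""               # text to enter when kind == "type"


def run_task(
    task: str,
    observe: Callable[[], List[str]],
    decide: Callable[[str, List[str], List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 50,
) -> List[Action]:
    """Observe, decide, execute, repeat until the decider reports completion."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = observe()                      # step 1: text listing, no screenshot
        action = decide(task, elements, history)  # step 2: the model picks an action
        if action.kind == "done":
            return history
        execute(action)                           # step 3: perform the action
        history.append(action)                    # step 4: loop back to step 1
    raise RuntimeError(f"task not finished after {max_steps} steps")


class FakeLoginPage:
    """In-memory stand-in for a browser page (assumption, for demonstration)."""

    def __init__(self):
        self.username = ""
        self.logged_in = False

    def observe(self) -> List[str]:
        if self.logged_in:
            return ["0 <a>Add new doctor</a>"]
        return ['0 <input name="user">', "1 <button>Sign in</button>"]

    def execute(self, action: Action) -> None:
        if action.kind == "type" and action.index == 0:
            self.username = action.text
        elif action.kind == "click" and action.index == 1 and self.username:
            self.logged_in = True


def scripted_decide(task: str, elements: List[str], history: List[Action]) -> Action:
    # Stand-in for the LLM: a real decider would send the task and the element
    # listing as text to the model and parse its reply into an Action.
    if any("Add new doctor" in line for line in elements):
        return Action("done")
    if not history:
        return Action("type", 0, "admin")
    return Action("click", 1)


page = FakeLoginPage()
steps = run_task("Log in as admin", page.observe, scripted_decide, page.execute)
print([a.kind for a in steps], page.logged_in)
```

Keeping `observe`, `decide`, and `execute` as plain callables is one way to get the "same architecture for Web and Android" property: only the observe and execute functions would change between a browser and a device, while the loop stays the same.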
If this resonates, I share more practical testing prompts and techniques in my toolkit: xulingfeng.gumroad.com/l/vkhhq