Day 9. Template matching works. But screen sizes, resolutions, and Android versions might break everything.
Eight days ago, the agent was an idea. Now it can read text, handle interruptions, and find icons on a screen.
But there's a question I've been avoiding: does it work on any phone other than mine?
The Cross-Device Problem
Every screenshot I've taken, every icon I've cropped, every coordinate I've mapped—it's all on one device. My phone. Same screen size. Same resolution. Same Android version. Same DPI.
Template matching relies on reference images that look exactly like the target on screen. Change the screen density, change the icon size, change the font scaling, and the match confidence drops. Suddenly "send_button.png" doesn't match anymore, and the agent can't press send. This isn't a bug in my code. It's a fundamental challenge in computer vision: reference-based matching breaks when the visual context changes.
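To make the size mismatch concrete, here is a small sketch of Android's density arithmetic, where px = dp × (dpi / 160). The 24 dp icon size and the two densities are illustrative assumptions, not measurements from either phone.

```python
# Sketch: how the same icon ends up at different pixel sizes.
# Android converts density-independent pixels (dp) to physical pixels
# with px = dp * (dpi / 160). All values below are illustrative only.

def dp_to_px(dp: float, dpi: int) -> int:
    """Convert a dp size to pixels for a given screen density."""
    return round(dp * dpi / 160)

ICON_DP = 24  # hypothetical icon size, not taken from the real app

for dpi in (320, 480):  # the xhdpi and xxhdpi density buckets
    side = dp_to_px(ICON_DP, dpi)
    print(f"{dpi} dpi -> {side}x{side} px")
# prints:
# 320 dpi -> 48x48 px
# 480 dpi -> 72x72 px
```

A template cropped at 48 px and compared pixel-for-pixel against a 72 px rendering cannot line up, so a low match score on a denser screen is what you would expect.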
Today's Experiment
I tested the same agent on a friend's phone—different manufacturer, different Android version, slightly larger screen. The results were humbling.
| Task | My Phone | Friend's Phone |
|---|---|---|
| OCR (text recognition) | ✅ 95% accuracy | ✅ ~90% accuracy |
| Find "Mom" in contacts | ✅ Found | ✅ Found |
| Template match: send button | ✅ 94% confidence | ❌ 62% confidence |
| Template match: back button | ✅ 91% confidence | ❌ 58% confidence |
OCR held up reasonably well because text is text. Fonts might change slightly, but the characters are the same. The icons were another story: the send button and the back arrow were rendered at a different size, with a slightly different pixel arrangement, on my friend's device.
The agent failed to send the message.
Why This Matters
An AI agent that only works on one phone isn't an agent. It's a script. If I want this to be useful to anyone else—or even to myself if I change phones—it needs to be device-agnostic.
Possible Solutions I'm Exploring
| Solution | Pros | Cons |
|---|---|---|
| Multi-resolution icon library | Simple. Just crop icons at different DPIs. | Tedious. How many variants are enough? |
| AI-based icon detection | Could recognize icons by shape, not pixels. | Requires training data. Heavy for a phone. |
| UI hierarchy inspection | Instead of "seeing" the screen, read the app's UI tree directly via ADB. | Requires root or accessibility service. Not universal. |
| Relative positioning | Once OCR finds text, calculate icon positions relative to known landmarks. | Fragile. Different layouts on different devices. |
None of these are perfect. All of them are more work. But that's the reality of building something that's supposed to work in the wild, not just in a demo.
What I'm Trying First
The UI hierarchy approach. ADB can run a command called uiautomator dump that writes out an XML tree of every visible element on screen: text, buttons, icons, everything. Each element has bounds, a class name, and a content description.
If I can parse that XML tree instead of taking screenshots, the agent doesn't need to "see" the screen at all. It just reads the structure. No OCR. No template matching. No resolution issues. This is a fundamental architectural shift. But it might be the right one.
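Here is a minimal sketch of what that parsing could look like, using only Python's standard library. The XML is a hand-written sample in the shape uiautomator dump produces. The labels, coordinates, and the find_center helper are all hypothetical, not code from the repo.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a uiautomator dump; every value here
# is made up. On a real device the XML would come from something like:
#   adb shell uiautomator dump /sdcard/window_dump.xml
#   adb pull /sdcard/window_dump.xml
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node index="0" text="" class="android.widget.FrameLayout" content-desc="" bounds="[0,0][1080,2340]">
    <node index="0" text="Mom" class="android.widget.TextView" content-desc="" bounds="[48,210][300,270]" />
    <node index="1" text="" class="android.widget.ImageButton" content-desc="Send" bounds="[948,2100][1056,2208]" />
  </node>
</hierarchy>"""

# Bounds are encoded as "[left,top][right,bottom]" in pixels.
BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")

def find_center(xml_text: str, label: str):
    """Return the (x, y) center of the first node whose text or
    content-desc equals label, or None if nothing matches."""
    root = ET.fromstring(xml_text)
    for node in root.iter("node"):
        if label in (node.get("text"), node.get("content-desc")):
            match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
            if match:
                x1, y1, x2, y2 = map(int, match.groups())
                return ((x1 + x2) // 2, (y1 + y2) // 2)
    return None

print(find_center(SAMPLE_DUMP, "Send"))  # (1002, 2154)
print(find_center(SAMPLE_DUMP, "Mom"))   # (174, 240)
```

The returned point could then be passed to something like adb shell input tap x y. Because the lookup keys on the label rather than on pixels, the same call should work across screen sizes, as long as the app exposes a text or content description for the element.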
What's Next (Day 10)
uiautomator dump as a replacement for screenshot-based detection
The Repo
👉 github.com/Dexter2344/phone-agent
All code from Day 8 is live. The Day 9 experiments are in a new branch called ui-tree-experiment. I'll merge to main once I have results.
This is Day 9. The hard problems don't stop coming. But neither do I.