How to Debug LLM-Driven Android Automation Runs


[ARTICLE · art-14403] src=dev.to ↗ pub=2026-05-26T12:21Z topic=ai-agents verified=true sentiment=· neutral

A developer building LLM-driven Android automation tools has outlined a structured debugging approach that saves detailed run traces instead of just final screenshots. The method captures UI dumps, model decisions, tool calls, and exit codes at every step, enabling engineers to identify whether failures stem from the model, the app, the automation tool, or timing issues. The approach uses numbered files and compact action tables to make each run inspectable and replayable without requiring pixel-perfect video.

3 min read · published May 26, 2026

LLM-driven Android automation fails in strange ways.

The model may tap the wrong label. The screen may change between observation and action. A keyboard may cover the button. A permission dialog may appear. The app may still be loading. The UI dump may expose two identical "Continue" buttons.

If all you saved is the final screenshot, debugging is painful.

You need a run trace.

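For every Android agent step, save the UI dump, the model's decision, the tool call, and the exit code or structured error. Add a screenshot and a short log tail where they help.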

The minimum useful trace looks like this:

observe: tap Button "Continue" #continue 540,860
model:   tap "Continue"
action:  hs tap "Continue" --visible --unique
result:  ok
wait:    hs wait "Dashboard" --timeout 15s
result:  TIMEOUT

That is much easier to debug than "the agent failed."
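As a rough illustration, one step of such a trace could be held in a small record and rendered in the same layout. This is a minimal Python sketch: `StepTrace` and `format_trace` are hypothetical names, not part of any tool mentioned here.

```python
from dataclasses import dataclass


@dataclass
class StepTrace:
    """One agent step (hypothetical record, not a tool API)."""
    observe: str           # the UI row the model acted on
    model: str             # the action text the model emitted
    action: str            # the tool command that was executed
    result: str            # outcome of the action, e.g. "ok"
    wait: str = ""         # optional follow-up wait command
    wait_result: str = ""  # outcome of the wait, e.g. "TIMEOUT"


def format_trace(step: StepTrace) -> str:
    """Render a step as aligned 'key: value' lines."""
    rows = [
        ("observe", step.observe),
        ("model", step.model),
        ("action", step.action),
        ("result", step.result),
    ]
    if step.wait:
        rows += [("wait", step.wait), ("result", step.wait_result)]
    # Pad "key:" to nine characters so the values line up in one column.
    return "\n".join(f"{key}:".ljust(9) + value for key, value in rows)
```

Keeping the record separate from its rendering means the same data can later feed a JSON file or a timeline without re-parsing text.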

Android agent failures usually fall into a few buckets.

Failure	What it means
NOT_FOUND	The target label or selector was not visible
AMBIGUOUS	More than one visible node matched
TIMEOUT	The expected next state never appeared
SECURE_WINDOW	Android blocked screenshots for the current window
Wrong action	The model chose a bad label or command
Stale observation	The UI changed after the model saw it

Good tooling should preserve which bucket happened.

If everything becomes "click failed", the agent cannot recover intelligently.
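One way to keep the bucket attached to each step is an enum plus a small classifier. The sketch below is an assumption-heavy illustration: `Failure`, `classify`, and the identifiers `WRONG_ACTION` and `STALE_OBSERVATION` are names chosen here, and it assumes the agent takes a second UI dump just before running the command so the two can be compared.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    NOT_FOUND = "NOT_FOUND"
    AMBIGUOUS = "AMBIGUOUS"
    TIMEOUT = "TIMEOUT"
    SECURE_WINDOW = "SECURE_WINDOW"
    WRONG_ACTION = "WRONG_ACTION"            # assigned on review, never automatically
    STALE_OBSERVATION = "STALE_OBSERVATION"  # name chosen for this sketch


def classify(tool_error: Optional[str], ui_seen: str, ui_at_action: str) -> Optional[Failure]:
    """Pick a bucket for a failed step, or return None if the step succeeded.

    tool_error is the error name reported by the tool (None on success).
    ui_seen is the dump the model decided on; ui_at_action is a dump taken
    just before the command ran. Unknown error names raise ValueError
    instead of being folded into an existing bucket.
    """
    if tool_error is None:
        return None
    # A changed screen explains the failure better than the tool's own error.
    if ui_seen != ui_at_action:
        return Failure.STALE_OBSERVATION
    return Failure(tool_error)
```

Comparing raw dumps is deliberately crude; a real comparison would probably ignore fields that change on every frame, such as clocks.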

The UI dump is the agent's view of the world.

Save it before each model decision:

hs ui > run/0007-ui.txt

For LLM agents, a compact action table is usually better than full XML:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

When a model picks the wrong action, this file tells you whether the model had a reasonable choice.
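To check after the fact whether the model had a reasonable choice, the saved table can be parsed and searched for the label it picked. This is a sketch that assumes every row has the five whitespace-separated columns shown above, with labels in double quotes; `UiRow`, `parse_action_table`, and `candidates` are hypothetical helpers (Python 3.9+).

```python
import shlex
from typing import NamedTuple


class UiRow(NamedTuple):
    verb: str     # e.g. "tap" or "fill"
    widget: str   # e.g. "Button"
    label: str    # visible text, e.g. "Continue"
    node_id: str  # e.g. "#continue"
    center: str   # e.g. "540,860"


def parse_action_table(text: str) -> list[UiRow]:
    """Parse rows of the compact table; extra columns such as [password] are ignored."""
    rows = []
    for line in text.splitlines():
        parts = shlex.split(line)  # keeps a quoted label as one field
        if len(parts) >= 5:
            rows.append(UiRow(*parts[:5]))
    return rows


def candidates(rows: list[UiRow], label: str) -> list[UiRow]:
    """Rows whose label matches exactly: none, exactly one, or several."""
    return [row for row in rows if row.label == label]
```

An empty result suggests the model chose a label that was never on screen, and more than one match points at an ambiguous target rather than a model error.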

Screenshots are valuable, but you do not need a full native PNG on every step.

For most agent debugging, a downscaled JPEG is enough:

hs see --size 768 run/0007-screen.jpg

Use screenshots for visual states, failures, and custom-rendered screens.

Use the text UI as the default. Use screenshots as evidence.

Do not only save the final command.

Save what the model actually emitted:

{
  "step": 7,
  "model_action": "tap \"Continue\"",
  "tool_call": ["hs", "tap", "Continue", "--visible", "--unique"],
  "reason": "The login form is filled and Continue is visible."
}

This matters because the bug may be in the translation from the model's action to the tool call.

Keep the model layer and tool layer separate.
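A small sketch of that separation: the model's text is stored verbatim, and the translation into a command lives in its own function that can be tested alone. `to_tool_call` and `action_record` are hypothetical names, and only the `tap` form from the example above is handled (Python 3.9+).

```python
import json
import shlex


def to_tool_call(model_action: str) -> list[str]:
    """Translate a model action such as 'tap "Continue"' into an argv list."""
    verb, *args = shlex.split(model_action)
    if verb != "tap" or len(args) != 1:
        # Refuse anything this sketch does not understand instead of guessing.
        raise ValueError(f"unsupported model action: {model_action!r}")
    return ["hs", "tap", args[0], "--visible", "--unique"]


def action_record(step: int, model_action: str, reason: str) -> str:
    """Build the JSON saved for a step, keeping both layers side by side."""
    record = {
        "step": step,
        "model_action": model_action,             # exactly what the model emitted
        "tool_call": to_tool_call(model_action),  # what the tool layer would run
        "reason": reason,
    }
    return json.dumps(record, indent=2)
```

When the two fields disagree with what was intended, the record shows which side of the translation to look at.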

Exit codes and error codes are better than stderr scraping.

Handsets has common exit codes:

0  ok
2  NOT_FOUND
3  TIMEOUT
4  AMBIGUOUS
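A wrapper can turn those codes into names at the point where the command returns, so nothing downstream has to read stderr. The sketch below maps only the four codes listed above; `outcome_name` and `run_tool` are hypothetical helpers, and any other code is reported as unknown rather than guessed (Python 3.9+).

```python
import subprocess

# Only the exit codes listed in the text; everything else stays unknown.
EXIT_NAMES = {0: "ok", 2: "NOT_FOUND", 3: "TIMEOUT", 4: "AMBIGUOUS"}


def outcome_name(returncode: int) -> str:
    """Name for an exit code, without inventing meanings for unlisted codes."""
    return EXIT_NAMES.get(returncode, f"UNKNOWN({returncode})")


def run_tool(argv: list[str]) -> dict:
    """Run a command and return a result record suitable for saving as JSON."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "argv": argv,
        "exit_code": proc.returncode,
        "outcome": outcome_name(proc.returncode),
        "stderr": proc.stderr,  # kept for humans, not parsed
    }
```

For example, `run_tool(["hs", "tap", "Continue", "--visible", "--unique"])` would yield a record whose `outcome` field the agent can branch on.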

In JSON mode, preserve the structured error:

hs --json tap "Continue" --visible --unique

Then your agent can decide:

NOT_FOUND: dump UI again or scroll
AMBIGUOUS: ask for a narrower selector
TIMEOUT: capture screenshot and logs
SECURE_WINDOW: continue without screenshot

Android logs are noisy. A small tail near the failure is usually enough:

hs logs --tail 200 > run/0007-logcat.txt

Pair logs with the UI dump and screenshot from the same step. Otherwise you end up with artifacts that are technically present but hard to correlate.

Use numbered files:

run/
  0001-ui.txt
  0001-action.json
  0001-result.json
  0002-ui.txt
  0002-screen.jpg
  0002-action.json
  0002-result.json
  0002-logcat.txt

This is not fancy. That is the point.

Before building a dashboard, make the run inspectable with plain files.
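The numbering scheme is simple enough to generate with a few lines of Python. This is a minimal sketch: `artifact_path` and `save_artifact` are hypothetical helpers, and the four-digit zero padding is inferred from the file names shown above.

```python
from pathlib import Path


def artifact_path(run_dir: str, step: int, name: str) -> Path:
    """Path for one artifact, e.g. run/0007-ui.txt for step 7 and 'ui.txt'."""
    # Zero padding keeps files in step order under a plain alphabetical listing.
    return Path(run_dir) / f"{step:04d}-{name}"


def save_artifact(run_dir: str, step: int, name: str, data: str) -> Path:
    """Write text data for a step, creating the run directory if needed."""
    path = artifact_path(run_dir, step, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(data, encoding="utf-8")
    return path
```

Because every artifact from a step shares the same prefix, the UI dump, screenshot, result, and log tail for that step sort next to each other.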

Once you have traces, replay becomes possible.

The useful replay is not pixel-perfect video. It is a timeline:

Step 1: observed Sign in
Step 2: tapped Sign in
Step 3: filled Email
Step 4: filled Password
Step 5: tapped Continue
Step 6: timed out waiting for Dashboard

For teams, this timeline becomes the product. It lets an engineer see whether the model, the tool, or the app caused the failure.
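A timeline like this can be rendered from per-step records. The sketch below assumes each record carries `verb`, `target`, and `outcome` fields, which is an assumption of this example and not a documented file format; `describe` and `timeline` are hypothetical names (Python 3.9+).

```python
# Past-tense phrasing for the verbs used in this sketch.
PAST = {"observe": "observed", "tap": "tapped", "fill": "filled", "wait": "waited for"}


def describe(step: dict) -> str:
    """One short phrase for a step record."""
    verb, target, outcome = step["verb"], step["target"], step["outcome"]
    if outcome == "ok":
        return f"{PAST.get(verb, verb)} {target}"
    if verb == "wait" and outcome == "TIMEOUT":
        return f"timed out waiting for {target}"
    # Any other failure keeps its error name visible in the timeline.
    return f"{verb} {target} failed with {outcome}"


def timeline(steps: list[dict]) -> str:
    """Render numbered 'Step N: ...' lines from step records."""
    return "\n".join(
        f"Step {number}: {describe(step)}"
        for number, step in enumerate(steps, start=1)
    )
```

Failures keep their error name in the output, so the timeline still shows which bucket a step fell into.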

A run trace is worth saving because failures can come from the model, the app, the Android UI state, the automation tool, or timing. A final screenshot does not tell you which layer failed.

Screenshots are not needed on every step. Save compact UI dumps for every step. Add screenshots for visual states, failures, and custom-rendered screens.

The pre-action UI dump shows what the model saw when it chose the action.

Structured traces let you build targeted recovery: scroll on NOT_FOUND, narrow selectors on AMBIGUOUS, capture logs on TIMEOUT, and avoid retrying blindly.
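A recovery policy along these lines might look like the following Python sketch. The move names, the `next_move` function, and the attempt cap are all assumptions made for illustration; the mapping itself follows the error-to-recovery pairs described in this article.

```python
# Recovery move per structured error; the move names are made up for this sketch.
RECOVERY = {
    "NOT_FOUND": "redump_or_scroll",
    "AMBIGUOUS": "narrow_selector",
    "TIMEOUT": "capture_screenshot_and_logs",
    "SECURE_WINDOW": "continue_without_screenshot",
}


def next_move(outcome: str, attempts: int, max_attempts: int = 3) -> str:
    """Choose what the agent does next after a step result."""
    if outcome == "ok":
        return "proceed"
    if attempts >= max_attempts:
        # Stop and keep the trace instead of retrying blindly.
        return "stop"
    # Errors without a known recovery also stop so a human can inspect the run.
    return RECOVERY.get(outcome, "stop")
```

Stopping on unknown errors is a conservative default; it trades some automation for a trace that ends at the first thing nobody planned for.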

Originally published at https://handsets.dev/blog/debug-llm-android-automation-runs/.


Read original on dev.to → dev.to/elliotgao2/how-to-debug-llm-driven-androi…
