# Giving an LLM Eyes and Hands on a Mobile Simulator

> Source: <https://dev.to/joduchan/-giving-an-llm-eyes-and-hands-on-a-mobile-simulator-5963>
> Published: 2026-05-30 08:23:03+00:00

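When a person does QA in tapflow, the loop is simple: look at the screen, decide what to do next, perform the action, and look again to check the result.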

This is exactly the perception-action loop that vision-capable LLMs are built for. The model sees a screenshot, reasons about what it shows, decides what action to take, and calls a tool to execute it.

We didn't need to build a new automation layer. We just needed to expose tapflow's existing WebSocket and REST APIs as MCP tools.

`@tapflowio/mcp-server` connects to a running tapflow relay and registers 13 tools that any MCP-compatible client can call:

```
list_devices       — see all simulators registered on the relay
connect_device     — join a device session
boot_device        — boot a simulator (waits up to 30s for ready state)
screenshot         — capture the current screen
tap                — tap at a pixel coordinate
swipe              — swipe between two coordinates
type_text          — type into the focused field
press_key          — press a keyboard key (Return, Delete, Escape...)
press_button       — press a hardware button (home, lock)
install_app        — install a build from App Center
launch_app         — launch an installed app
list_builds        — list available builds on the relay
disconnect_device  — end the session
```

Setup is two environment variables:

```
TAPFLOW_RELAY_URL=wss://your-relay-url
TAPFLOW_TOKEN=your-pat-token
npx @tapflowio/mcp-server
```

Add it as an MCP server in your client config, and those tools appear in the model's tool list.
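
If you launch the server from a wrapper script, it can help to check both variables up front so a misconfigured client fails with a clear message instead of a connection error. The sketch below is only an assumption about how such a check might look; `readRelayConfig`, `RelayConfig`, and the error messages are hypothetical and not part of `@tapflowio/mcp-server`.

```typescript
// Hypothetical pre-flight check for the two variables the server reads.
interface RelayConfig {
  relayUrl: string
  token: string
}

function readRelayConfig(env: Record<string, string | undefined>): RelayConfig {
  const relayUrl = env.TAPFLOW_RELAY_URL
  const token = env.TAPFLOW_TOKEN
  if (!relayUrl) throw new Error('TAPFLOW_RELAY_URL is not set')
  if (!token) throw new Error('TAPFLOW_TOKEN is not set')
  // The relay is reached over WebSocket, so only ws:// and wss:// URLs make sense.
  if (!/^wss?:\/\//.test(relayUrl)) {
    throw new Error(`TAPFLOW_RELAY_URL must start with ws:// or wss://, got: ${relayUrl}`)
  }
  return { relayUrl, token }
}

const cfg = readRelayConfig({
  TAPFLOW_RELAY_URL: 'wss://your-relay-url',
  TAPFLOW_TOKEN: 'your-pat-token',
})
```

In a Node script you would pass `process.env` instead of the literal object.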

The `screenshot` tool calls the REST endpoint we added in v0.3.0 (`GET /api/v1/sessions/:id/screenshot`), gets back a PNG or JPEG buffer, base64-encodes it, and returns it as MCP `image` content alongside the pixel dimensions:

```
return {
  content: [
    { type: 'image', data: buf.toString('base64'), mimeType },
    { type: 'text', text: `Screenshot saved: ${filePath} (${width}×${height}px)` },
  ],
}
```

The model receives the actual image. It can read text on screen, identify UI elements, notice error states — the same things a human would.
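
On the consuming side, the dimensions travel in the text item that accompanies the image. As a rough sketch of how a client-side script could recover them, the function below parses the `(W×Hpx)` suffix from the format shown above. The type names and `readDimensions` are illustrative, and the file path in the example is made up.

```typescript
// Shapes mirror the return value shown above; the type names are illustrative.
type ImageContent = { type: 'image'; data: string; mimeType: string }
type TextContent = { type: 'text'; text: string }
type ToolContent = ImageContent | TextContent

// Pull "(W×Hpx)" out of the text item that accompanies the image.
function readDimensions(content: ToolContent[]): { width: number; height: number } | null {
  for (const item of content) {
    if (item.type !== 'text') continue
    const match = /\((\d+)×(\d+)px\)/.exec(item.text)
    if (match) return { width: Number(match[1]), height: Number(match[2]) }
  }
  return null
}

const dims = readDimensions([
  { type: 'image', data: '', mimeType: 'image/png' },
  // Hypothetical path, used only to exercise the parser.
  { type: 'text', text: 'Screenshot saved: /tmp/shot.png (1170×2532px)' },
])
// dims is { width: 1170, height: 2532 }
```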

Here's the part that took a few iterations to get right. The simulator's logical coordinate space is different from screenshot pixel coordinates, and it changes with screen resolution, device type, and scale factor.

Rather than exposing logical coordinates (which the model can't reason about without device-specific knowledge), we have the model work entirely in screenshot pixel space. The `tap` tool takes pixel coordinates plus the screenshot dimensions, then normalizes internally:

```
// tools.ts
client.tap(sessionId, x / screenshotWidth, y / screenshotHeight)
```

The model calls `screenshot` first, reads the dimensions from the response, then uses those same dimensions when calling `tap`. This means the model can identify "the button is at roughly pixel 200, 450" from the image and tap it directly, with no coordinate system translation required.
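
To make the normalization concrete, here is a minimal sketch of the division as a standalone function. The range checks are an addition for illustration and are not claimed to exist in `tools.ts`; `normalizeTap` and `NormalizedPoint` are hypothetical names, and the screenshot sizes are example values.

```typescript
interface NormalizedPoint {
  x: number // fraction of the screenshot width, 0 <= x < 1
  y: number // fraction of the screenshot height, 0 <= y < 1
}

function normalizeTap(
  x: number,
  y: number,
  screenshotWidth: number,
  screenshotHeight: number,
): NormalizedPoint {
  if (screenshotWidth <= 0 || screenshotHeight <= 0) {
    throw new RangeError('screenshot dimensions must be positive')
  }
  if (x < 0 || y < 0 || x >= screenshotWidth || y >= screenshotHeight) {
    throw new RangeError(`(${x}, ${y}) is outside the ${screenshotWidth}x${screenshotHeight} screenshot`)
  }
  return { x: x / screenshotWidth, y: y / screenshotHeight }
}

// The same on-screen point, captured at two different scale factors,
// normalizes to the same fractions.
const atScale2 = normalizeTap(200, 450, 780, 1688)
const atScale3 = normalizeTap(300, 675, 1170, 2532)
```

Because both calls reduce to the same ratios, the tap lands on the same spot regardless of the resolution the screenshot was taken at, as long as the dimensions passed to `tap` come from the screenshot the model actually looked at.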

Swipe works the same way, with interpolated `touch:move` events spread across 8 steps of the gesture's duration to simulate a natural movement:

``` js
// client.ts — swipe interpolation
const STEPS = 8
const interval = durationMs / STEPS

this.send({ type: 'input:touch:start', sessionId, payload: { x: startX, y: startY } })
for (let i = 1; i < STEPS; i++) {
  await delay(interval)
  const t = i / STEPS
  this.send({
    type: 'input:touch:move',
    sessionId,
    payload: {
      x: Math.round(startX + (endX - startX) * t),
      y: Math.round(startY + (endY - startY) * t),
    },
  })
}
```
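
The interpolation itself is plain linear interpolation, and it can be easier to follow when separated from the WebSocket calls. The sketch below uses the same loop bounds as the excerpt (`i` runs from 1 to `steps - 1`), so it yields `steps - 1` intermediate points and leaves the end position to whatever event closes the gesture, which the excerpt does not show. `interpolateSwipe` and `Point` are hypothetical names, and rounding is left out so the function works in any coordinate space.

```typescript
interface Point {
  x: number
  y: number
}

// Returns the intermediate points between start and end, excluding both endpoints.
function interpolateSwipe(start: Point, end: Point, steps: number): Point[] {
  if (!Number.isInteger(steps) || steps < 1) {
    throw new RangeError('steps must be a positive integer')
  }
  const points: Point[] = []
  for (let i = 1; i < steps; i++) {
    const t = i / steps // fraction of the gesture completed at this step
    points.push({
      x: start.x + (end.x - start.x) * t,
      y: start.y + (end.y - start.y) * t,
    })
  }
  return points
}

// A vertical swipe from y = 800 up to y = 0 in 8 steps.
const moves = interpolateSwipe({ x: 100, y: 800 }, { x: 100, y: 0 }, 8)
// 7 intermediate points; y falls by 100 per step: 700, 600, ..., 100
```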

Several tools involve async operations — booting a device, installing an app — where the relay sends a confirmation back over WebSocket after the operation completes.

The client uses a `waitFor` pattern: register a predicate against incoming messages, return a promise that resolves when a matching message arrives, and reject if a timeout fires first.

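``` ts
// client.ts: waitFor
private waitFor(predicate: (msg: RelayMsg) => boolean, timeoutMs: number): Promise<RelayMsg> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      // On timeout, drop this waiter. Guard against -1: splice(-1, 1) would remove the last entry.
      const idx = this.waiters.findIndex(w => w.resolve === resolve)
      if (idx !== -1) this.waiters.splice(idx, 1)
      reject(new Error('Request timed out'))
    }, timeoutMs)
    // Register the waiter; it resolves when an incoming message matches the predicate
    this.waiters.push({ predicate, resolve, reject, timer })
  })
}
```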

`boot_device` waits up to 30 seconds, and `install_app` waits up to 60 seconds. Each resolves on the confirmation message or rejects with the error payload.
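
The excerpt shows only the registering half of the pattern. For completeness, here is a self-contained sketch that also includes the dispatching half, which would run for every incoming WebSocket message. It is an illustration under assumptions, not the actual `client.ts`: `WaiterRegistry`, `dispatch`, the `RelayMsg` shape, and the `device:booted` message type are all hypothetical.

```typescript
// Illustrative message shape; the real RelayMsg type is not shown above.
interface RelayMsg {
  type: string
  payload?: unknown
}

interface Waiter {
  predicate: (msg: RelayMsg) => boolean
  resolve: (msg: RelayMsg) => void
  reject: (err: Error) => void
  timer: ReturnType<typeof setTimeout>
}

class WaiterRegistry {
  private waiters: Waiter[] = []

  get pending(): number {
    return this.waiters.length
  }

  waitFor(predicate: (msg: RelayMsg) => boolean, timeoutMs: number): Promise<RelayMsg> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.remove(timer)
        reject(new Error('Request timed out'))
      }, timeoutMs)
      this.waiters.push({ predicate, resolve, reject, timer })
    })
  }

  // Call this from the WebSocket message handler for every incoming message.
  dispatch(msg: RelayMsg): void {
    // Iterate over a copy so removing entries does not skip waiters.
    for (const waiter of [...this.waiters]) {
      if (!waiter.predicate(msg)) continue
      clearTimeout(waiter.timer) // the timeout must not fire after a match
      this.remove(waiter.timer)
      waiter.resolve(msg)
    }
  }

  private remove(timer: Waiter['timer']): void {
    const idx = this.waiters.findIndex(w => w.timer === timer)
    if (idx !== -1) this.waiters.splice(idx, 1)
  }
}

const registry = new WaiterRegistry()
const booted = registry.waitFor(msg => msg.type === 'device:booted', 30000)
booted.then(msg => console.log('resolved with', msg.type))
registry.dispatch({ type: 'device:booted' }) // hypothetical confirmation message
```

Clearing the timer on a match matters: without it, a resolved waiter would keep a timer alive for up to the full timeout.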

A model running a login flow might do this:

```
1. list_devices → pick a session
2. connect_device
3. list_builds → find the build to test
4. boot_device
5. install_app
6. launch_app
7. screenshot → see the login screen
8. tap(email field coordinates) → focus the input
9. type_text("test@example.com")
10. tap(password field coordinates)
11. type_text("password")
12. tap(login button coordinates)
13. screenshot → verify the home screen loaded
14. disconnect_device
```

Each screenshot gives the model a chance to verify state before proceeding. If step 13 shows an error message instead of the home screen, the model knows something went wrong.

The version says `0.3.1-experimental.1` for a reason. The tools work, but the layer needs more hardening before we'd call it reliable.

The core issue is consistency. The same sequence of tool calls should produce predictable behavior every time. Right now it doesn't always: there are timing edge cases where an action fires before the UI has fully settled, device state can drift between steps without the model noticing, and error recovery is rough when something unexpected happens mid-flow.

These are solvable problems, but we want to solve them before presenting this as something teams should build pipelines on.
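
To illustrate the "UI has not settled" case, one generic mitigation is to poll screenshots until consecutive frames stop changing before acting. The sketch below shows that idea only; nothing in it is taken from tapflow's code, and `isSettled`, `waitUntilSettled`, `capture`, and the default values are all hypothetical.

```typescript
// Returns true when the last `stableCount` frames are identical strings.
function isSettled(frames: string[], stableCount: number): boolean {
  if (stableCount < 1 || frames.length < stableCount) return false
  const recent = frames.slice(-stableCount)
  return recent.every(frame => frame === recent[0])
}

// Hypothetical polling loop: `capture` stands in for a screenshot call that
// returns the base64 image data.
async function waitUntilSettled(
  capture: () => Promise<string>,
  { stableCount = 2, maxFrames = 10, intervalMs = 250 } = {},
): Promise<boolean> {
  const frames: string[] = []
  for (let i = 0; i < maxFrames; i++) {
    frames.push(await capture())
    if (isSettled(frames, stableCount)) return true
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs))
  }
  return false // gave up: the screen kept changing
}
```

Exact comparison is deliberately naive. A screen with an animation, a blinking cursor, or a ticking clock never produces identical frames, which is why the loop is capped by `maxFrames` and reports failure instead of waiting forever.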

The direction we're aiming at is using the MCP server as the foundation for LLM-driven smoke tests in CI.

The scenario: a new build passes unit tests and gets uploaded to App Center. A CI step spins up the MCP server, points it at the relay, and gives a model a natural-language test spec:

"Install the latest build. Log in with test credentials. Navigate to the cart, add an item, and confirm the checkout screen shows the correct total. Take a screenshot at each step."

The model does the steps, captures evidence, and reports what it saw. No automation code to write. No selectors to maintain when the UI changes. The spec is just a description of what a human would do.

This isn't production-ready yet. The stability work comes first. But the pieces — browser-controllable simulators, screenshot REST endpoint, MCP tool layer — are in place. The question is whether the model can run a flow reliably enough to be trusted in CI without a human verifying each run.

We think it can. That's what we're building toward.

```
npm install -g @tapflowio/mcp-server@experimental
```

You'll need a running tapflow relay and a personal access token (PAT) with viewer scope. Configure it in your MCP client:

```
{
  "mcpServers": {
    "tapflow": {
      "command": "npx",
      "args": ["@tapflowio/mcp-server"],
      "env": {
        "TAPFLOW_RELAY_URL": "wss://your-relay-url",
        "TAPFLOW_TOKEN": "your-pat-token"
      }
    }
  }
}
```

If you try it and hit rough edges, open an issue — that feedback is exactly what's shaping the stability work.