Giving an LLM Eyes and Hands on a Mobile Simulator

Tapflow has released an MCP server that connects vision-capable LLMs to its mobile simulator platform, giving AI models the ability to see screenshots and execute actions like tapping, swiping, and typing. The `@tapflowio/mcp-server` exposes 13 tools through existing WebSocket and REST APIs, allowing models to perform QA testing by reasoning about screenshots and interacting with the simulator in pixel coordinate space. The tool normalizes coordinates internally, enabling the model to identify UI elements from images and tap them directly without device-specific coordinate translation.

When a person does QA in tapflow, they work in a loop: look at the screen, decide what to do next, act, and look again. This is exactly the perception-action loop that vision-capable LLMs are built for. The model sees a screenshot, reasons about what it shows, decides what action to take, and calls a tool to execute it.

We didn't need to build a new automation layer. We just needed to expose tapflow's existing WebSocket and REST APIs as MCP tools.

`@tapflowio/mcp-server` connects to a running tapflow relay and registers 13 tools that any MCP-compatible client can call:

- `list_devices`: see all simulators registered on the relay
- `connect_device`: join a device session
- `boot_device`: boot a simulator (waits up to 30s for ready state)
- `screenshot`: capture the current screen
- `tap`: tap at a pixel coordinate
- `swipe`: swipe between two coordinates
- `type_text`: type into the focused field
- `press_key`: press a keyboard key (Return, Delete, Escape...)
- `press_button`: press a hardware button (home, lock)
- `install_app`: install a build from App Center
- `launch_app`: launch an installed app
- `list_builds`: list available builds on the relay
- `disconnect_device`: end the session

Setup is two environment variables:

```bash
TAPFLOW_RELAY_URL=wss://your-relay-url \
TAPFLOW_TOKEN=your-pat-token \
npx @tapflowio/mcp-server
```

Add it as an MCP server in your client config, and those tools appear in the model's tool list.

The `screenshot` tool calls the REST endpoint we added in v0.3.0 (`GET /api/v1/sessions/:id/screenshot`), gets back a PNG or JPEG buffer, base64-encodes it, and returns it as MCP image content alongside the pixel dimensions:

```js
return {
  content: [
    { type: 'image', data: buf.toString('base64'), mimeType },
    { type: 'text', text: `Screenshot saved: ${filePath} (${width}×${height}px)` },
  ],
}
```

The model receives the actual image. It can read text on screen, identify UI elements, and notice error states: the same things a human would.

Here's the part that took a few iterations to get right.
The simulator's logical coordinate space is different from screenshot pixel coordinates, and it changes with screen resolution, device type, and scale factor. Rather than exposing logical coordinates (which the model can't reason about without device-specific knowledge), we have the model work entirely in screenshot pixel space.

The `tap` tool takes pixel coordinates plus the screenshot dimensions, then normalizes internally:

```js
// tools.ts
client.tap(sessionId, x / screenshotWidth, y / screenshotHeight)
```

The model calls `screenshot` first, reads the dimensions from the response, then uses those same dimensions when calling `tap`. This means the model can identify "the button is at roughly pixel (200, 450)" from the image and tap it directly, with no coordinate-system translation required.

Swipe works the same way, with `touch:move` events interpolated across 8 steps over the duration to simulate a natural gesture:

```js
// client.ts: swipe interpolation
const STEPS = 8
const interval = durationMs / STEPS
this.send({ type: 'input:touch:start', sessionId, payload: { x: startX, y: startY } })
for (let i = 1; i < STEPS; i++) {
  await delay(interval)
  const t = i / STEPS // fraction of the gesture completed so far
  this.send({
    type: 'input:touch:move',
    sessionId,
    payload: {
      x: Math.round(startX + (endX - startX) * t),
      y: Math.round(startY + (endY - startY) * t),
    },
  })
}
```

Several tools involve async operations (booting a device, installing an app) where the relay sends a confirmation back over WebSocket after the operation completes. The client uses a `waitFor` pattern: register a predicate against incoming messages, return a promise that resolves when a matching message arrives, and reject if a timeout fires first.
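
As a runnable illustration of that pattern, the sketch below pairs the registration side with the dispatch side that settles waiters when a message arrives. It is a simplified stand-in, not tapflow's client: `WaiterRegistry`, the `Msg` shape, and the `device:booted` message type are hypothetical names chosen for the example.

```typescript
// Hypothetical message shape; tapflow's real RelayMsg type is not shown in this post.
type Msg = { type: string; sessionId?: string; payload?: unknown };

type Waiter = {
  predicate: (msg: Msg) => boolean;
  resolve: (msg: Msg) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

class WaiterRegistry {
  private waiters: Waiter[] = [];

  // Register a predicate. The promise resolves with the first matching
  // message, or rejects once timeoutMs elapses without a match.
  waitFor(predicate: (msg: Msg) => boolean, timeoutMs: number): Promise<Msg> {
    return new Promise<Msg>((resolve, reject) => {
      const waiter: Waiter = {
        predicate,
        resolve,
        reject,
        timer: setTimeout(() => {
          this.remove(waiter);
          reject(new Error("Request timed out"));
        }, timeoutMs),
      };
      this.waiters.push(waiter);
    });
  }

  // Call this for every incoming message, e.g. from a WebSocket message handler.
  dispatch(msg: Msg): void {
    // Iterate over a copy so waiters can be removed during the loop.
    for (const waiter of [...this.waiters]) {
      if (waiter.predicate(msg)) {
        clearTimeout(waiter.timer);
        this.remove(waiter);
        waiter.resolve(msg);
      }
    }
  }

  // Number of waiters that have neither matched nor timed out.
  get pending(): number {
    return this.waiters.length;
  }

  private remove(waiter: Waiter): void {
    const i = this.waiters.indexOf(waiter);
    if (i !== -1) this.waiters.splice(i, 1);
  }
}
```

In a client shaped like this, a boot call would send its request and then await something like `registry.waitFor(m => m.type === 'device:booted', 30_000)`, while the socket's message handler forwards every parsed message to `dispatch`. Clearing the timer on a match keeps a resolved waiter from later firing its timeout.
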
```js
// client.ts: waitFor
private waitFor(predicate: (msg: RelayMsg) => boolean, timeoutMs: number): Promise<RelayMsg> {
  return new Promise((resolve, reject) => {
    // Reject and drop this waiter if no matching message arrives in time
    const timer = setTimeout(() => {
      this.waiters.splice(this.waiters.findIndex(w => w.resolve === resolve), 1)
      reject(new Error('Request timed out'))
    }, timeoutMs)
    this.waiters.push({ predicate, resolve, reject, timer })
  })
}
```

`boot_device` waits up to 30 seconds. `install_app` waits 60 seconds. Each resolves on the confirmation message or rejects with the error payload.

A model running a login flow might do this:

1. `list_devices` → pick a session
2. `connect_device`
3. `list_builds` → find the build to test
4. `boot_device`
5. `install_app`
6. `launch_app`
7. `screenshot` → see the login screen
8. `tap` (email field coordinates) → focus the input
9. `type_text("test@example.com")`
10. `tap` (password field coordinates)
11. `type_text("password")`
12. `tap` (login button coordinates)
13. `screenshot` → verify the home screen loaded
14. `disconnect_device`

Each screenshot gives the model a chance to verify state before proceeding. If step 13 shows an error message instead of the home screen, the model knows something went wrong.

The version says `0.3.1-experimental.1` for a reason. The tools work, but the layer needs more hardening before we'd call it reliable.

The core issue is consistency. The same sequence of tool calls should produce predictable behavior every time. Right now it doesn't always: there are timing edge cases where an action fires before the UI has fully settled, device state can drift between steps without the model noticing, and error recovery when something unexpected happens mid-flow is rough. These are solvable problems, but we want to solve them before presenting this as something teams should build pipelines on.

The direction we're aiming at is using the MCP server as the foundation for LLM-driven smoke tests in CI.

The scenario: a new build passes unit tests and gets uploaded to App Center.
A CI step spins up the MCP server, points it at the relay, and gives a model a natural-language test spec: "Install the latest build. Log in with test credentials. Navigate to the cart, add an item, and confirm the checkout screen shows the correct total. Take a screenshot at each step."

The model does the steps, captures evidence, and reports what it saw. No automation code to write. No selectors to maintain when the UI changes. The spec is just a description of what a human would do.

This isn't production-ready yet. The stability work comes first. But the pieces (browser-controllable simulators, screenshot REST endpoint, MCP tool layer) are in place. The question is whether the model can run a flow reliably enough to be trusted in CI without a human verifying each run. We think it can. That's what we're building toward.

```bash
npm install -g @tapflowio/mcp-server@experimental
```

You'll need a running tapflow relay and a PAT token with viewer scope. Configure it in your MCP client:

```json
{
  "mcpServers": {
    "tapflow": {
      "command": "npx",
      "args": ["@tapflowio/mcp-server"],
      "env": {
        "TAPFLOW_RELAY_URL": "wss://your-relay-url",
        "TAPFLOW_TOKEN": "your-pat-token"
      }
    }
  }
}
```

If you try it and hit rough edges, open an issue; that feedback is exactly what's shaping the stability work.