# Giving an LLM Eyes and Hands on a Mobile Simulator

Tapflow has released an MCP server that connects vision-capable LLMs to its mobile simulator platform, giving AI models the ability to see screenshots and execute actions like tapping, swiping, and typing. `@tapflowio/mcp-server` exposes 13 tools built on tapflow's existing WebSocket and REST APIs, allowing models to perform QA testing by reasoning about screenshots and interacting with the simulator in screenshot pixel coordinates. The server normalizes coordinates internally, so the model can identify UI elements from images and tap them directly without device-specific coordinate translation.

When a person does QA in tapflow, the loop is: look at the screen, decide what to do, perform the action, and check the result.

This is exactly the perception-action loop that vision-capable LLMs are built for. The model sees a screenshot, reasons about what it shows, decides what action to take, and calls a tool to execute it. We didn't need to build a new automation layer. We just needed to expose tapflow's existing WebSocket and REST APIs as MCP tools.

`@tapflowio/mcp-server` connects to a running tapflow relay and registers 13 tools that any MCP-compatible client can call:

- `list_devices`: see all simulators registered on the relay
- `connect_device`: join a device session
- `boot_device`: boot a simulator (waits up to 30s for ready state)
- `screenshot`: capture the current screen
- `tap`: tap at a pixel coordinate
- `swipe`: swipe between two coordinates
- `type_text`: type into the focused field
- `press_key`: press a keyboard key (Return, Delete, Escape...)
- `press_button`: press a hardware button (home, lock)
- `install_app`: install a build from App Center
- `launch_app`: launch an installed app
- `list_builds`: list available builds on the relay
- `disconnect_device`: end the session

Setup is two environment variables:

```
TAPFLOW_RELAY_URL=wss://your-relay-url \
TAPFLOW_TOKEN=your-pat-token \
npx @tapflowio/mcp-server
```

Add it as an MCP server in your client config, and those tools appear in the model's tool list.
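To make the perception-action loop described above concrete, here is a minimal sketch of its control flow. In a real setup the MCP client and the model drive this loop themselves; nothing here comes from `@tapflowio/mcp-server`. The `CallTool` and `Decide` types, the `runAgentLoop` function, and the `tap` argument names are all assumptions made for illustration.

```typescript
// Hypothetical harness: shows only the control flow of the loop.
type Action =
  | { kind: "tap"; x: number; y: number } // pixel coordinates in the screenshot
  | { kind: "done" };

interface Screenshot {
  width: number;
  height: number;
  label: string; // stand-in for the image payload the model would actually see
}

// Assumed shape of a generic "call an MCP tool by name" function.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

// Assumed stand-in for the model: looks at a screenshot, picks the next action.
type Decide = (shot: Screenshot) => Promise<Action>;

async function runAgentLoop(
  callTool: CallTool,
  decide: Decide,
  maxSteps = 10,
): Promise<number> {
  for (let step = 0; step < maxSteps; step++) {
    // 1. Perceive: capture the current screen.
    const shot = (await callTool("screenshot", {})) as Screenshot;

    // 2. Reason and decide: the model chooses what to do next.
    const action = await decide(shot);
    if (action.kind === "done") return step;

    // 3. Act: tap in screenshot pixel space, passing the dimensions along.
    await callTool("tap", {
      x: action.x,
      y: action.y,
      screenshotWidth: shot.width,
      screenshotHeight: shot.height,
    });
  }
  return maxSteps; // step budget exhausted without a "done" decision
}
```

The step budget is a design choice for the sketch: it keeps a confused model from tapping forever when the screen never reaches the expected state.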
The `screenshot` tool calls the REST endpoint we added in v0.3.0 (`GET /api/v1/sessions/:id/screenshot`), gets back a PNG or JPEG buffer, base64-encodes it, and returns it as MCP image content alongside the pixel dimensions:

```
return {
  content: [
    { type: 'image', data: buf.toString('base64'), mimeType },
    { type: 'text', text: `Screenshot saved: ${filePath} (${width}×${height}px)` },
  ],
}
```

The model receives the actual image. It can read text on screen, identify UI elements, and notice error states, the same things a human would.

Here's the part that took a few iterations to get right. The simulator's logical coordinate space is different from screenshot pixel coordinates, and it changes with screen resolution, device type, and scale factor. Rather than exposing logical coordinates (which the model can't reason about without device-specific knowledge), we have the model work entirely in screenshot pixel space. The `tap` tool takes pixel coordinates plus the screenshot dimensions, then normalizes internally:

```
// tools.ts
client.tap(sessionId, x / screenshotWidth, y / screenshotHeight)
```

The model calls `screenshot` first, reads the dimensions from the response, then uses those same dimensions when calling `tap`. This means the model can identify "the button is at roughly pixel 200, 450" from the image and tap it directly, with no coordinate system translation required.
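The normalization step can be illustrated with a small standalone function. This is a sketch rather than tapflow's code: `normalizeTap` is a hypothetical helper, the bounds checks are an added assumption, and the 400×900 screenshot size is made up for the example.

```typescript
interface NormalizedPoint {
  x: number; // fraction of screenshot width, 0..1
  y: number; // fraction of screenshot height, 0..1
}

// Hypothetical helper: convert a screenshot pixel position into
// resolution-independent fractions of the screen.
function normalizeTap(
  x: number,
  y: number,
  screenshotWidth: number,
  screenshotHeight: number,
): NormalizedPoint {
  if (screenshotWidth <= 0 || screenshotHeight <= 0) {
    throw new RangeError("screenshot dimensions must be positive");
  }
  // Assumption: reject points outside the image instead of clamping them.
  if (x < 0 || y < 0 || x > screenshotWidth || y > screenshotHeight) {
    throw new RangeError(`point (${x}, ${y}) is outside the screenshot`);
  }
  return { x: x / screenshotWidth, y: y / screenshotHeight };
}

// "The button is at roughly pixel 200, 450" on an assumed 400×900 screenshot.
const point = normalizeTap(200, 450, 400, 900);
console.log(point); // { x: 0.5, y: 0.5 }
```

Because the result is a fraction of the screen, the same call works whether the screenshot was captured at 1x, 2x, or 3x scale, as long as the coordinates and the dimensions come from the same image.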
Swipe works the same way, interpolating `touch:move` events over 8 steps across the duration to simulate a natural gesture:

```js
// client.ts: swipe interpolation
const STEPS = 8
const interval = durationMs / STEPS
this.send({ type: 'input:touch:start', sessionId, payload: { x: startX, y: startY } })
for (let i = 1; i < STEPS; i++) {
  await delay(interval)
  const t = i / STEPS
  this.send({
    type: 'input:touch:move',
    sessionId,
    payload: {
      x: Math.round(startX + (endX - startX) * t),
      y: Math.round(startY + (endY - startY) * t),
    },
  })
}
```

Several tools involve async operations, such as booting a device or installing an app, where the relay sends a confirmation back over WebSocket after the operation completes. The client uses a `waitFor` pattern: register a predicate against incoming messages, return a promise that resolves when a matching message arrives, and reject if a timeout fires first.

```js
// client.ts: waitFor
private waitFor(predicate: (msg) => boolean, timeoutMs: number): Promise
```
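As a self-contained illustration of the `waitFor` pattern described above (predicate, promise, timeout), here is one way it might be written. This is a sketch and not the tapflow client's implementation: the `MessageWaiter` class, its `dispatch` method, the `Message` type, and the `device:ready` message type are all hypothetical.

```typescript
// Hypothetical message shape: any object with a string "type" field.
type Message = { type: string; [key: string]: unknown };

class MessageWaiter {
  private listeners = new Set<(msg: Message) => void>();

  // Assumed entry point: call this with each parsed incoming WebSocket message.
  dispatch(msg: Message): void {
    // Copy first so listeners can remove themselves while we iterate.
    for (const listener of [...this.listeners]) listener(msg);
  }

  // Resolve with the first message matching `predicate`,
  // or reject if none arrives within `timeoutMs`.
  waitFor(predicate: (msg: Message) => boolean, timeoutMs: number): Promise<Message> {
    return new Promise<Message>((resolve, reject) => {
      const listener = (msg: Message): void => {
        if (!predicate(msg)) return;
        clearTimeout(timer); // a match arrived: cancel the timeout
        this.listeners.delete(listener);
        resolve(msg);
      };
      const timer = setTimeout(() => {
        this.listeners.delete(listener); // avoid leaking the listener
        reject(new Error(`waitFor timed out after ${timeoutMs}ms`));
      }, timeoutMs);
      this.listeners.add(listener);
    });
  }
}

// Usage: start waiting before triggering the operation, then await the result.
const waiter = new MessageWaiter();
const ready = waiter.waitFor((m) => m.type === "device:ready", 30_000);
waiter.dispatch({ type: "device:ready", sessionId: "example" });
ready.then((msg) => console.log("got", msg.type));
```

Registering the listener before sending the request matters: if the confirmation arrives before `waitFor` is called, the promise would never resolve and the caller would only see the timeout.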