{"slug": "how-we-made-our-ai-browser-agent-stop-clicking-the-wrong-button", "title": "How We Made Our AI Browser Agent Stop Clicking the Wrong Button", "summary": "Smoketest.sh improved its AI browser agent's reliability by switching from letting the model invent selectors to using stable accessibility tree refs. The fix uses Playwright's ariaSnapshot in AI mode to provide a role-based tree with ref IDs, which the agent uses to target elements precisely. This approach, similar to Microsoft's Playwright MCP server, eliminates the ambiguity of fuzzy English descriptions.", "body_md": "At Smoketest.sh, you describe a flow in a sentence (\"log in, add a paid seat, confirm the invoice updates\") and an AI agent runs it in a real browser. The agent reads the page, decides what to do, and drives Playwright to do it.\n\nThe first version worked great in the demo and fell apart on the second run. This is the story of why, and the fix that made element targeting reliable: never let the model invent a selector. Hand it stable IDs from the accessibility tree and make it point at those.\n\n**TL;DR**\n\n`page.ariaSnapshot({ mode: 'ai' })` returns the page as a role-based tree and stamps every interactive element with a stable `[ref=eN]` ID. Playwright resolves `aria-ref=eN` as a first-class locator, so the model can act on the exact element it just saw.\n\nHere is how each piece works.\n\nThe naive design is the obvious one. Give the model a `click` tool that takes a description, and let it figure out the rest:\n\n```\n// tempting, and wrong\nclick({ description: \"the Sign in button\" })\n```\n\nUnder the hood you turn that string into a locator. On a clean login page, `getByRole('button', { name: 'Sign in' })` finds exactly one element and it works. Ship it, watch the demo pass, feel good.\n\nThen it meets a real app:\n\nNone of these are bugs in the model. They are the consequence of using a regenerated English phrase as a selector. 
The phrase is fuzzy by construction, and fuzzy selectors on a busy page do not resolve to one element.\n\nThe fix starts by changing what the model looks at. Instead of letting it guess from a screenshot or raw HTML, we hand it Playwright's [accessibility snapshot](https://playwright.dev/docs/aria-snapshots) in AI mode, a compact view of the page's [accessibility tree](https://developer.mozilla.org/en-US/docs/Glossary/Accessibility_tree). That is one tool:\n\n```\n{\n  name: 'getAccessibilityTree',\n  description:\n    'Return a structured representation of page content as an accessibility tree to understand the page.',\n  parameters: { type: 'object', properties: {} },\n  execute: async () => {\n    const tree = await page.ariaSnapshot({ mode: 'ai' });\n    return { tree };\n  },\n}\n```\n\n`page.ariaSnapshot({ mode: 'ai' })` returns the page as a compact, role-based tree. The important part of AI mode: every interactive element gets a `[ref=eN]` tag. A login page comes back looking roughly like this:\n\n```\n- heading \"Welcome back\" [level=1]\n- textbox \"Email\" [ref=e4]\n- textbox \"Password\" [ref=e5]\n- button \"Sign in\" [ref=e6]\n- link \"Forgot password?\" [ref=e7]\n```\n\nThe model no longer has to describe the button. It can refer to `e6`. That ref is the contract between \"what the model saw\" and \"what Playwright clicks,\" and it is the whole game.\n\nThis is the same structured-snapshot approach Microsoft's [Playwright MCP server](https://github.com/microsoft/playwright-mcp) takes: let the model act on accessibility refs, not on pixels or guesses.\n\nThe reason refs work is that Playwright resolves them directly. `aria-ref=e6` is a real locator engine, not something we built. 
So the `click` tool prefers the ref and only falls back to a description when it has none:\n\n``` js\nexecute: async ({ ref, description }) => {\n  const refStr = ref?.trim() || null;\n  const text = description?.trim() || null;\n\n  if (!refStr && !text) {\n    throw new Error('click requires either ref or description');\n  }\n\n  const locator = refStr\n    ? page.locator(`aria-ref=${refStr}`)        // stable: resolves against the snapshot\n    : await resolveLocator(page, text!);         // fallback: fuzzy, best-effort\n\n  await locator.click();\n  // ...\n}\n```\n\nThe ref path is stable because it is resolved against the exact snapshot the model just read, not re-derived from a phrase. Same idea for `fill`, `select`, and `getText`. Every interaction tool takes `ref` first and `description` second.\n\nTools that accept a ref are not enough. The model will still reach for a description if you let it, because describing things in English is what language models love to do. So the rules you give it have to make the ordering non-negotiable:\n\nThat last rule is the one that earns its place. The instinct of a language model after a failed action is to try a more elaborate description. That is exactly the wrong move, because the description was never the reliable path. Re-reading the tree gives it fresh refs that match the current DOM, which is what actually changed.\n\nRefs are not always available. The model might be acting on something it inferred rather than something in the last snapshot. So `resolveLocator` is a deliberate ladder, not a single guess. 
For each candidate phrase it tries [role, then label, then placeholder, then text](https://playwright.dev/docs/locators), and takes the first one that is actually visible:\n\n``` js\nfor (const phrase of phrases) {\n  if (roleHint) {\n    const roleLocator = page.getByRole(roleHint, { name: phrase, exact: false });\n    if (await isVisible(roleLocator)) return roleLocator;\n  }\n\n  const labelLocator = page.getByLabel(phrase, { exact: false });\n  if (await isVisible(labelLocator)) return labelLocator;\n\n  const placeholderLocator = page.getByPlaceholder(phrase, { exact: false });\n  if (await isVisible(placeholderLocator)) return placeholderLocator;\n\n  const textLocator = page.getByText(phrase, { exact: false });\n  if (await isVisible(textLocator)) return textLocator;\n}\n\nthrow new Error(`Could not find a visible element for description: ${description}`);\n```\n\n`isVisible` is a 5-second `waitFor({ state: 'visible' })` wrapped in a try/catch, so a candidate that exists but is hidden does not win. The phrase extraction pulls quoted substrings out of the description first (\"click the button labeled \\\"Place order\\\"\" yields `Place order`), so the model's verbosity does not poison the match.\n\nThis is the fuzzy path, and we treat it as such. It is good enough to recover, and it is exactly why we want the model on refs whenever it can be.\n\nWhen even the fallback misses, the worst thing you can return is a bare \"element not found.\" The model has nothing to act on and will flail. So a failed click collects diagnostics about what the page actually contains and returns them with the error:\n\n``` js\nconst diagnostics = await collectClickDiagnostics(page, text!);\nthrow new Error(`${getErrorMessage(error)}. 
Diagnostics: ${JSON.stringify(diagnostics)}`);\n```\n\n`collectClickDiagnostics` counts how many elements matched by role, by label, and by text, and includes a sample of the page's links:\n\n```\nreturn {\n  description,\n  roleHint: roleMatch?.role ?? null,\n  roleCount,    // e.g. 0 buttons matched\n  labelCount,   // e.g. 0 labels matched\n  textCount,    // e.g. 3 text nodes matched\n  sampleLinks: linkSamples,\n  currentUrl: page.url(),\n};\n```\n\nNow the failure is legible. `textCount: 3, roleCount: 0` tells the model (and us, in the trace) that the thing it called a button is really three pieces of text, so it should re-read the tree and target a real interactive element. The recovery loop closes because the error carries enough to act on.\n\nThere is also a small specialization for links: if the model meant to click a link and the locator missed, we look up the href by matching link text or `aria-label` and navigate directly, which sidesteps a whole class of overlay-and-intercept clicks.\n\nThis is reliable element targeting, not a deterministic agent. Two limits worth stating plainly:\n\nOnce the page changes, `e6` may point at nothing or at the wrong node. That is why the prompt forces a fresh `getAccessibilityTree` after failures and on new pages. Treat refs as per-snapshot, not durable.\n\nAnd the model still decides *what* to do. Refs make sure that when it decides to click the Sign in button, it clicks that button and not the footer link with the same name. They do not stop it from deciding to click the wrong thing in the first place. That is a different problem, solved with a separate evaluation pass.\n\nThe one idea to take away: do not let the model emit selectors. A selector invented from an English phrase is regenerated every run and rarely resolves to one element. 
Instead, show it the page as a snapshot with refs (`page.ariaSnapshot({ mode: 'ai' })`) and have it act on those refs (`page.locator('aria-ref=eN')`).\n\nThat sequence is what moved our agent from \"passes the demo\" to \"passes on the second run, and the hundredth.\"\n\nWe run this in production at Smoketest. You describe the flows that matter (login, checkout, onboarding, billing), and we run them in a real browser after every deploy and tell you what broke. No Playwright suite for you to own or maintain. Take a look at [smoketest.sh](https://smoketest.sh).", "url": "https://wpnews.pro/news/how-we-made-our-ai-browser-agent-stop-clicking-the-wrong-button", "canonical_source": "https://dev.to/omidseyfan/how-we-made-our-ai-browser-agent-stop-clicking-the-wrong-button-3kkl", "published_at": "2026-06-30 13:07:00+00:00", "updated_at": "2026-06-30 13:19:32.961653+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "developer-tools", "large-language-models", "natural-language-processing"], "entities": ["Smoketest.sh", "Playwright", "Microsoft", "Playwright MCP server"], "alternates": {"html": "https://wpnews.pro/news/how-we-made-our-ai-browser-agent-stop-clicking-the-wrong-button", "markdown": "https://wpnews.pro/news/how-we-made-our-ai-browser-agent-stop-clicking-the-wrong-button.md", "text": "https://wpnews.pro/news/how-we-made-our-ai-browser-agent-stop-clicking-the-wrong-button.txt", "jsonld": "https://wpnews.pro/news/how-we-made-our-ai-browser-agent-stop-clicking-the-wrong-button.jsonld"}}