How We Made Our AI Browser Agent Stop Clicking the Wrong Button


At Smoketest.sh, you describe a flow in a sentence ("log in, add a paid seat, confirm the invoice updates") and an AI agent runs it in a real browser. The agent reads the page, decides what to do, and drives Playwright to do it.

The first version worked great in the demo and fell apart on the second run. This is the story of why, and the fix that made element targeting reliable: never let the model invent a selector. Hand it stable IDs from the accessibility tree and make it point at those.

TL;DR

page.ariaSnapshot({ mode: 'ai' }) returns the page as a role-based tree and stamps every interactive element with a stable [ref=eN] ID.

Playwright resolves aria-ref=eN as a first-class locator, so the model can act on the exact element it just saw.

Here is how each piece works.

The naive design is the obvious one. Give the model a click tool that takes a description, and let it figure out the rest:

// tempting, and wrong
click({ description: "the Sign in button" })

Under the hood you turn that string into a locator. On a clean login page, getByRole('button', { name: 'Sign in' }) finds exactly one element and it works. Ship it, watch the demo pass, feel good.

Then it meets a real app:

None of these are bugs in the model. They are the consequence of using a regenerated English phrase as a selector. The phrase is fuzzy by construction, and fuzzy selectors on a busy page do not resolve to one element.
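To see why the phrase stops working, it helps to model what a name-based lookup does. The sketch below is a simplified stand-in, not Playwright itself: matchByRoleAndName and the sample page are hypothetical, and the matching rule only approximates getByRole's default (exact: false), which treats the name as a case-insensitive substring.

```typescript
// Simplified stand-in for a role + accessible-name lookup (not Playwright's code).
interface PageElement {
  role: string;
  name: string;
  where: string; // illustrative only: where the element sits on the page
}

// Approximates getByRole(role, { name }) with the default exact: false,
// i.e. a case-insensitive substring match on the accessible name.
function matchByRoleAndName(
  elements: PageElement[],
  role: string,
  name: string,
): PageElement[] {
  const wanted = name.trim().toLowerCase();
  return elements.filter(
    (el) => el.role === role && el.name.toLowerCase().indexOf(wanted) !== -1,
  );
}

// Hypothetical busy page: the same phrase now fits more than one element.
const busyPage: PageElement[] = [
  { role: 'button', name: 'Sign in', where: 'login form' },
  { role: 'button', name: 'Sign in with Google', where: 'login form' },
  { role: 'link', name: 'Sign in', where: 'footer' },
];

const candidates = matchByRoleAndName(busyPage, 'button', 'Sign in');
// candidates.length is 2, so the phrase no longer identifies a single element
```

With a real locator, more than one match surfaces as a strictness error on click() rather than a silent wrong click, but either way the phrase has stopped being a selector.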

The fix starts by changing what the model looks at. Instead of letting it guess from a screenshot or raw HTML, we hand it Playwright's accessibility snapshot in AI mode, a compact view of the page's accessibility tree. That is one tool:

{
  name: 'getAccessibilityTree',
  description:
    'Return a structured representation of page content as an accessibility tree to understand the page.',
  parameters: { type: 'object', properties: {} },
  execute: async () => {
    const tree = await page.ariaSnapshot({ mode: 'ai' });
    return { tree };
  },
}

page.ariaSnapshot({ mode: 'ai' }) returns the page as a compact, role-based tree. The important part of AI mode: every interactive element gets a [ref=eN] tag. A login page comes back looking roughly like this:

- heading "Welcome back" [level=1]
- textbox "Email" [ref=e4]
- textbox "Password" [ref=e5]
- button "Sign in" [ref=e6]
- link "Forgot password?" [ref=e7]

The model no longer has to describe the button. It can refer to e6. That ref is the contract between "what the model saw" and "what Playwright clicks," and it is the whole game.

This is the same structured-snapshot approach Microsoft's Playwright MCP server takes: let the model act on accessibility refs, not on pixels or guesses.
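One practical consequence of the ref contract is that a ref coming back from the model can be checked against the snapshot it was given before anything is clicked. The helper below is a minimal sketch, assuming snapshot lines shaped like the sample above; parseRefs and RefEntry are hypothetical names, and a real snapshot has more line shapes (nesting, extra attributes) than this regex is tuned for.

```typescript
// Hypothetical helper: index the refs that appear in an AI-mode snapshot.
// Assumes lines shaped like:  - button "Sign in" [ref=e6]
interface RefEntry {
  role: string;
  name: string;
}

function parseRefs(tree: string): Map<string, RefEntry> {
  const refs = new Map<string, RefEntry>();
  // role, optional quoted name, then a [ref=eN] tag later on the same line
  const linePattern = /^\s*-\s+(\w+)(?:\s+"([^"]*)")?.*\[ref=(e\d+)\]/;
  for (const line of tree.split('\n')) {
    const match = linePattern.exec(line);
    if (!match) continue; // lines without a ref (e.g. headings) are skipped
    const [, role, name, ref] = match;
    if (role && ref) {
      refs.set(ref, { role, name: name ?? '' });
    }
  }
  return refs;
}

const snapshot = [
  '- heading "Welcome back" [level=1]',
  '- textbox "Email" [ref=e4]',
  '- textbox "Password" [ref=e5]',
  '- button "Sign in" [ref=e6]',
  '- link "Forgot password?" [ref=e7]',
].join('\n');

const knownRefs = parseRefs(snapshot);
const modelRefIsKnown = knownRefs.has('e6'); // true: e6 was in the snapshot
```

A ref that is not in the map is a cheap, early signal that the model is working from an old or imagined view of the page.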

The reason refs work is that Playwright resolves them directly. aria-ref=e6 is a real locator engine, not something we built. So the click tool prefers the ref and only falls back to a description when it has none:

execute: async ({ ref, description }) => {
  const refStr = ref?.trim() || null;
  const text = description?.trim() || null;

  if (!refStr && !text) {
    throw new Error('click requires either ref or description');
  }

  const locator = refStr
    ? page.locator(`aria-ref=${refStr}`)        // stable: resolves against the snapshot
    : await resolveLocator(page, text!);         // fallback: fuzzy, best-effort

  await locator.click();
  // ...
}

The ref path is stable because it is resolved against the exact snapshot the model just read, not re-derived from a phrase. Same idea for fill, select, and getText. Every interaction tool takes ref first and description second.
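One place to reinforce that ordering is the tool's parameter schema. What follows is a sketch rather than a definitive definition: the field descriptions and the chooseTarget helper are assumptions for illustration, written in the same JSON-schema shape as the getAccessibilityTree tool above.

```typescript
// Hypothetical parameter schema for a ref-first click tool.
const clickParameters = {
  type: 'object',
  properties: {
    ref: {
      type: 'string',
      description:
        'Element ref from the most recent getAccessibilityTree result, e.g. "e6". Preferred.',
    },
    description: {
      type: 'string',
      description:
        'Fallback only, when no ref is available: a short description of the element.',
    },
  },
};

type ClickTarget =
  | { kind: 'ref'; selector: string }
  | { kind: 'description'; text: string };

// Mirrors the ordering in execute: a non-empty ref always wins.
function chooseTarget(args: { ref?: string; description?: string }): ClickTarget {
  const ref = args.ref?.trim();
  if (ref) return { kind: 'ref', selector: `aria-ref=${ref}` };

  const text = args.description?.trim();
  if (text) return { kind: 'description', text };

  throw new Error('click requires either ref or description');
}
```

Keeping the choice in one small function means fill, select, and getText can share it instead of each re-implementing the precedence.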

Tools that accept a ref are not enough. The model will still reach for a description if you let it, because describing things in English is what language models love to do. So the rules you give it have to make the ordering non-negotiable:

That last rule is the one that earns its place. The instinct of a language model after a failed action is to try a more elaborate description. That is exactly the wrong move, because the description was never the reliable path. Re-reading the tree gives it fresh refs that match the current DOM, which is what actually changed.

Refs are not always available. The model might be acting on something it inferred rather than something in the last snapshot. So resolveLocator is a deliberate ladder, not a single guess. For each candidate phrase it tries role, then label, then placeholder, then text, and takes the first one that is actually visible:

for (const phrase of phrases) {
  if (roleHint) {
    const roleLocator = page.getByRole(roleHint, { name: phrase, exact: false });
    if (await isVisible(roleLocator)) return roleLocator;
  }

  const labelLocator = page.getByLabel(phrase, { exact: false });
  if (await isVisible(labelLocator)) return labelLocator;

  const placeholderLocator = page.getByPlaceholder(phrase, { exact: false });
  if (await isVisible(placeholderLocator)) return placeholderLocator;

  const textLocator = page.getByText(phrase, { exact: false });
  if (await isVisible(textLocator)) return textLocator;
}

throw new Error(`Could not find a visible element for description: ${description}`);

isVisible is a 5-second waitFor({ state: 'visible' }) wrapped in a try/catch, so a candidate that exists but is hidden does not win. The phrase extraction pulls quoted substrings out of the description first ('click the button labeled "Place order"' yields Place order), so the model's verbosity does not poison the match.
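Those two helpers are small enough to sketch. This is an illustration under assumptions, not the exact code: VisibleWaitable is a hypothetical structural type standing in for Playwright's Locator (whose waitFor accepts these options), extractPhrases only handles straight double quotes, and keeping the whole description as a last candidate is an assumption.

```typescript
// Minimal structural type so the sketch does not depend on Playwright's types.
interface VisibleWaitable {
  waitFor(options: { state: 'visible'; timeout: number }): Promise<void>;
}

// Resolves to true only if the element becomes visible within the timeout.
async function isVisible(
  locator: VisibleWaitable,
  timeoutMs: number = 5000,
): Promise<boolean> {
  try {
    await locator.waitFor({ state: 'visible', timeout: timeoutMs });
    return true;
  } catch {
    return false; // missing, hidden, or timed out
  }
}

// Quoted substrings come first; the full description is the last resort.
function extractPhrases(description: string): string[] {
  const phrases: string[] = [];
  const quoted = /"([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = quoted.exec(description)) !== null) {
    const phrase = match[1]?.trim();
    if (phrase) phrases.push(phrase);
  }
  const whole = description.trim();
  if (whole && phrases.indexOf(whole) === -1) phrases.push(whole);
  return phrases;
}

const placeOrderPhrases = extractPhrases('click the button labeled "Place order"');
// placeOrderPhrases[0] is 'Place order'; the full sentence follows as a fallback
```

Because the quoted phrase is tried first, the ladder reaches the short, specific name before it ever sees the model's surrounding instructions.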

This is the fuzzy path, and we treat it as such. It is good enough to recover, and it is exactly why we want the model on refs whenever it can be.

When even the fallback misses, the worst thing you can return is a bare "element not found." The model has nothing to act on and will flail. So a failed click collects diagnostics about what the page actually contains and returns them with the error:

const diagnostics = await collectClickDiagnostics(page, text!);
throw new Error(`${getErrorMessage(error)}. Diagnostics: ${JSON.stringify(diagnostics)}`);

collectClickDiagnostics counts how many elements matched by role, by label, and by text, and includes a sample of the page's links:

return {
  description,
  roleHint: roleMatch?.role ?? null,
  roleCount,    // e.g. 0 buttons matched
  labelCount,   // e.g. 0 labels matched
  textCount,    // e.g. 3 text nodes matched
  sampleLinks: linkSamples,
  currentUrl: page.url(),
};

Now the failure is legible. textCount: 3, roleCount: 0 tells the model (and us, in the trace) that the thing it called a button is really three pieces of text, so it should re-read the tree and target a real interactive element. The recovery loop closes because the error carries enough to act on.

There is also a small specialization for links: if the model meant to click a link and the locator missed, we look up the href by matching link text or aria-label and navigate directly, which sidesteps a whole class of overlay-and-intercept clicks.
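The lookup half of that fallback can be sketched as a pure function. This is an illustration under assumptions: LinkInfo, normalize, and findHref are hypothetical names, the link list is assumed to have been collected from the page beforehand, and the match order (exact first, then containment) is a guess at reasonable behavior.

```typescript
// Hypothetical shape for links gathered from the page before the lookup.
interface LinkInfo {
  href: string;
  text: string;
  ariaLabel: string | null;
}

// Collapse whitespace and case so "  Pricing " and "pricing" compare equal.
function normalize(value: string): string {
  return value.trim().replace(/\s+/g, ' ').toLowerCase();
}

// Returns the href of the first link whose text or aria-label matches.
function findHref(links: LinkInfo[], wanted: string): string | null {
  const target = normalize(wanted);
  if (!target) return null;

  // Pass 1: exact match on visible text or aria-label.
  for (const link of links) {
    const text = normalize(link.text);
    const label = normalize(link.ariaLabel ?? '');
    if (text === target || label === target) return link.href;
  }

  // Pass 2: containment, for links with longer names than the phrase.
  for (const link of links) {
    const text = normalize(link.text);
    const label = normalize(link.ariaLabel ?? '');
    if (text.indexOf(target) !== -1 || label.indexOf(target) !== -1) {
      return link.href;
    }
  }
  return null;
}

const sampleLinks: LinkInfo[] = [
  { href: '/pricing/enterprise', text: 'Enterprise pricing', ariaLabel: null },
  { href: '/pricing', text: 'Pricing', ariaLabel: null },
  { href: '/docs', text: '', ariaLabel: 'Documentation' },
];

const pricingHref = findHref(sampleLinks, 'pricing');
// pricingHref is '/pricing': the exact match beats the earlier containment match
```

The returned href would then go to page.goto(), resolved against page.url() first if it is relative.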

This is reliable element targeting, not a deterministic agent. Two limits worth stating plainly:

Refs are only valid for the snapshot that produced them. Once the page navigates or re-renders, e6 may point at nothing or at the wrong node. That is why the prompt forces a fresh getAccessibilityTree after failures and on new pages. Treat refs as per-snapshot, not durable.

And the model still decides what to do. Refs make sure that when it decides to click the Sign in button, it clicks that button and not the footer link with the same name. They do not stop it from deciding to click the wrong thing in the first place. That is a different problem, solved with a separate evaluation pass.

The one idea to take away: do not let the model emit selectors. A selector invented from an English phrase is regenerated every run and rarely resolves to one element. Instead, show the model the page as an accessibility snapshot (page.ariaSnapshot({ mode: 'ai' })) and have it act on the refs from that snapshot (page.locator('aria-ref=eN')).

That sequence is what moved our agent from "passes the demo" to "passes on the second run, and the hundredth."

We run this in production at Smoketest. You describe the flows that matter (login, checkout, onboarding, billing), and we run them in a real browser after every deploy and tell you what broke. No Playwright suite for you to own or maintain. Take a look at smoketest.sh.

Source: dev.to (original article)
