If you've maintained a Playwright or Cypress test suite for more than a few months, you know the drill. A designer renames a class, a developer restructures a form, and suddenly 30 tests are broken — not because the feature broke, but because .submit-btn became [data-action="submit"]. You end up in a loop: fix selectors, ship, selectors break, fix selectors. The tests stop being useful because nobody trusts them.
We built Confidence Gate — an AI-powered test execution engine where you describe test steps in plain English and the system figures out the rest.
Instead of:
await page.locator('[data-testid="email-input"]').fill('user@example.com');
await page.locator('button[type="submit"]').click();
await expect(page).toHaveURL('/dashboard');
You write:
{ "action": "enter the email from the test data in the email field",
"expected": "the email field contains the entered address" }
"expected": "the dashboard is displayed and the login form is gone" }
The engine translates each step into a typed intent, resolves the target element from the accessibility tree, executes it in a real Playwright browser, takes a screenshot, and verifies the outcome visually.
Each step goes through four stages:
1. Intent generation: The natural language action is converted to a structured JSON object ({ action: "click", target: { label: "Sign In", role: "button" }, value: null }). This separates intent from implementation.
2. Element resolution: A multi-tier resolver finds the element: accessibility tree first (fast, reliable), CSS heuristics second, AI-assisted fallback third.
3. Execution + behavior detection: Playwright executes the action. A mutation observer watches for DOM changes, URL changes, and value changes to confirm something actually happened.
4. Verification: A vision model looks at the post-action screenshot and checks it against the expected result. If behavior was detected but verification fails, the engine assumes it hit the wrong element, blacklists that selector, and retries.
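To make the four stages concrete, here is a minimal sketch of the resolve, execute, verify, and retry loop. It is not the engine's actual code: the type names, the `runStep` and `accessibilityTier` helpers, the synchronous callbacks, and the three-attempt limit are all assumptions made for illustration. Only the intent shape comes from the example above. The post describes the retry only for the case where behavior was detected but verification failed; treating a no-behavior result the same way is also an assumption.

```typescript
// Intent shape follows the example in stage 1; the union of actions is assumed.
type Intent = {
  action: "click" | "fill";
  target: { label: string; role: string };
  value: string | null;
};

// Hypothetical view of a resolvable element.
type Candidate = { selector: string; label: string; role: string };

// A resolver tier returns a candidate, or null if it finds nothing usable.
type Tier = (intent: Intent, blacklist: Set<string>) => Candidate | null;

// Tier 1 (assumed): exact role + case-insensitive label match over
// nodes taken from the accessibility tree, skipping blacklisted selectors.
function accessibilityTier(nodes: Candidate[]): Tier {
  return (intent, blacklist) =>
    nodes.find(
      (n) =>
        !blacklist.has(n.selector) &&
        n.role === intent.target.role &&
        n.label.toLowerCase() === intent.target.label.toLowerCase()
    ) ?? null;
}

// Try each tier in order; the first tier that returns a candidate wins.
function resolve(intent: Intent, tiers: Tier[], blacklist: Set<string>): Candidate | null {
  for (const tier of tiers) {
    const hit = tier(intent, blacklist);
    if (hit) return hit;
  }
  return null;
}

// Resolve, execute, verify. On failure, blacklist the selector and retry.
function runStep(
  intent: Intent,
  tiers: Tier[],
  execute: (c: Candidate) => boolean, // true if a behavior change was detected
  verify: (c: Candidate) => boolean,  // stands in for the vision check
  maxAttempts = 3
): { ok: boolean; selector: string | null; attempts: number } {
  const blacklist = new Set<string>();
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = resolve(intent, tiers, blacklist);
    if (!candidate) return { ok: false, selector: null, attempts: attempt };
    if (execute(candidate) && verify(candidate)) {
      return { ok: true, selector: candidate.selector, attempts: attempt };
    }
    blacklist.add(candidate.selector); // assume this was the wrong element
  }
  return { ok: false, selector: null, attempts: maxAttempts };
}

// Two "Sign In" buttons; only the one inside the form leads to the dashboard.
const nodes: Candidate[] = [
  { selector: "#header-sign-in", label: "Sign In", role: "button" },
  { selector: "#form-sign-in", label: "Sign In", role: "button" },
];
const intent: Intent = {
  action: "click",
  target: { label: "Sign In", role: "button" },
  value: null,
};
const result = runStep(
  intent,
  [accessibilityTier(nodes)],
  () => true,
  (c) => c.selector === "#form-sign-in"
);
console.log(result); // succeeds on the second attempt with "#form-sign-in"
```

In a real run, `execute` would drive Playwright and `verify` would call the vision model, both asynchronously. The control flow stays the same.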
When a selector stops working between deploys, the repair loop kicks in. It re-queries the accessibility tree, scores candidate elements against the original target description, and picks the best match. The new selector is cached so the next run is fast.
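The scoring step of the repair loop could look roughly like the sketch below. The post does not say how candidates are scored, so everything here is an assumption: the `AxNode` shape, the token-overlap similarity, the 0.4/0.6 weights, the 0.5 cutoff, and the in-memory `selectorCache` are illustrative stand-ins.

```typescript
// Hypothetical accessibility-tree node and target description.
type AxNode = { selector: string; role: string; name: string };
type TargetDescription = { label: string; role: string };

// Jaccard similarity over lowercase word tokens, in the range [0, 1].
function labelSimilarity(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  const shared = [...ta].filter((t) => tb.has(t)).length;
  return shared / (ta.size + tb.size - shared);
}

// Combine role match and label similarity. The weights are assumptions.
function scoreCandidate(node: AxNode, target: TargetDescription): number {
  const roleScore = node.role === target.role ? 1 : 0;
  return 0.4 * roleScore + 0.6 * labelSimilarity(node.name, target.label);
}

// Cache of repaired selectors, keyed by a step identifier (assumed).
const selectorCache = new Map<string, string>();

// Pick the best-scoring node; give up if nothing clears the cutoff.
function repairSelector(
  stepId: string,
  target: TargetDescription,
  tree: AxNode[],
  minScore = 0.5
): string | null {
  let best: AxNode | null = null;
  let bestScore = 0;
  for (const node of tree) {
    const score = scoreCandidate(node, target);
    if (score > bestScore) {
      best = node;
      bestScore = score;
    }
  }
  if (!best || bestScore < minScore) return null;
  selectorCache.set(stepId, best.selector); // the next run can skip the search
  return best.selector;
}

// The old .submit-btn selector is gone; the tree still exposes the button.
const tree: AxNode[] = [
  { selector: '[data-action="submit"]', role: "button", name: "Sign In" },
  { selector: "a.forgot", role: "link", name: "Forgot password?" },
  { selector: "button.register", role: "button", name: "Create account" },
];
const repaired = repairSelector("login-step-2", { label: "Sign In", role: "button" }, tree);
console.log(repaired); // [data-action="submit"]
```

The cutoff matters: without it, the loop would always "repair" to something, even when the element has actually been removed from the page.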
After a run, every step result feeds into a score (0–100) built from:
The score maps to a gate decision: ship, caution, or block. You can call the API from CI and fail a deployment if the score drops below your threshold.
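As a rough illustration of the gate, the sketch below maps a score to one of the three decisions and derives a CI exit code. The cutoffs (80 and 60) and the function names are assumptions, not the project's defaults, and the API call that would fetch the score is left out because its endpoint is not described here.

```typescript
type GateDecision = "ship" | "caution" | "block";

// Map a 0-100 score to a decision. The default cutoffs are illustrative.
function gateDecision(score: number, shipAt = 80, cautionAt = 60): GateDecision {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  if (score >= shipAt) return "ship";
  if (score >= cautionAt) return "caution";
  return "block";
}

// CI systems treat a non-zero exit code as a failed job.
function exitCodeFor(score: number, threshold: number): number {
  return score >= threshold ? 0 : 1;
}

console.log(gateDecision(92)); // ship
console.log(gateDecision(65)); // caution
console.log(gateDecision(40)); // block
console.log(exitCodeFor(65, 75)); // 1
```

In a pipeline, a small script would fetch the score from the API and then call `process.exit(exitCodeFor(score, threshold))` so the deployment stops when the score falls below your threshold.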
git clone https://github.com/OaktreeInnovations/confidence-gate.git
cd confidence-gate
cp .env.example .env
make up
Open http://localhost:3001 and you're running.
We're working on four things in order:
The repo is MIT licensed and open to contributions. If any of this is interesting to you — especially the browser recording or the AI execution engine — come say hi on GitHub.