Meet Alibaba’s Page Agent: A JavaScript In-Page GUI Agent That Controls Web Interfaces With Natural Language Through the DOM

Alibaba released Page Agent, an open-source JavaScript library that runs inside a webpage to control its interface using natural language commands. The agent reads the live DOM as text, compressing it into a FlatDomTree for efficient processing by text-based AI models. This approach enables in-page automation without external browser drivers, making it suitable for building copilots and form-filling features within owned web applications.

Most browser automation runs from the outside. Playwright, Puppeteer, Selenium, and browser-use all drive a browser from an external process. They read the page through screenshots or a driver protocol such as WebDriver or the Chrome DevTools Protocol. Alibaba’s Page Agent takes the opposite path. The agent lives inside the webpage as plain JavaScript. It reads the live DOM as text and acts as the real user. No headless browser, no screenshots, no multi-modal model.

The project is open-source under the MIT license. The codebase is TypeScript-first. It builds on browser-use, from which its DOM processing and prompt are derived.

TL;DR

- Page Agent runs inside the page as JavaScript, reading the live DOM as text, not screenshots.
- DOM dehydration compresses the page into a FlatDomTree so smaller text models can act precisely.
- It is model-agnostic through any OpenAI-compatible endpoint and ships under the MIT license.
- Prompt-level safety and single-page scope are real limits; keep server-side validation for risky actions.
- Best fit: copilots and form-filling inside apps you own, not external or locked-down sites.

What is Page Agent?

Page Agent is a client-side library for adding agent behavior to a web app. You embed it, then issue commands in natural language. The agent finds elements, clicks buttons, and fills forms from within the page. Because it runs in the browser session, it inherits the user’s cookies, session, and authentication.
There is no separate backend to write. The existing UI validation and security rules stay in place.

The design is model-agnostic. You bring your own large language model through any OpenAI-compatible endpoint. Only text is sent to the model, so a strong text model is enough.

How DOM Dehydration Works

The core technique is what the team calls DOM dehydration. A modern page can hold thousands of nodes. Sending raw HTML to a model would be slow and expensive.

When a command arrives, the agent scans the Document Object Model. It identifies every interactive element, such as buttons, links, and input fields. Each element receives an index plus a role and a label. The live DOM is converted into a FlatDomTree, a clean text map of what matters. Redundant markup is stripped out. The model reads this compact representation, not pixels.

Under the hood, the agent delegates work to a PageController:

    await this.pageController.updateTree()
    await this.pageController.clickElement(index)
    await this.pageController.inputText(index, text)
    await this.pageController.scroll({ down: true, numPages: 1 })

The monorepo splits these concerns into small packages. @page-agent/core holds the headless agent logic. page-agent is the full entry class with a UI panel. @page-agent/page-controller handles DOM extraction and element indexing, with optional visual feedback through a SimulatorMask.

Developers keep control of scope. Operation allowlists limit which actions the agent may run. Data masking can hide sensitive fields, such as passwords, from the model. Custom knowledge can be injected so the agent follows your domain rules.
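
The dehydration step described above can be illustrated with a small, self-contained sketch. This is not Page Agent's actual implementation: `SketchNode`, `FlatEntry`, `ROLE_BY_TAG`, `dehydrate`, and `render` are hypothetical names invented for this example, and the node shape is a plain-object stand-in for real DOM elements so the code runs without a browser. It walks a tree, keeps only interactive elements, gives each an index, a role, and a label, and masks password values before anything would reach a model.

```typescript
// Hypothetical, simplified stand-in for a DOM element (not the library's type).
interface SketchNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string | undefined>;
  children?: SketchNode[];
}

// One line of the flat text map: index, role, label.
interface FlatEntry {
  index: number;
  role: string;
  label: string;
}

// Assumed mapping from interactive tags to roles; a real agent uses richer rules.
const ROLE_BY_TAG: Record<string, string> = {
  a: "link",
  button: "button",
  input: "textbox",
  select: "combobox",
  textarea: "textbox",
};

// Concatenate the visible text of a node and its descendants.
function collectText(node: SketchNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(collectText).join(" ");
  return `${own} ${nested}`.replace(/\s+/g, " ").trim();
}

// Prefer aria-label, then placeholder, then visible text.
// Input values are appended, with password values masked.
function labelOf(node: SketchNode): string {
  const attrs = node.attrs ?? {};
  const name = attrs["aria-label"] ?? attrs["placeholder"] ?? collectText(node);
  const value = attrs["value"];
  if (node.tag !== "input" || value === undefined) return name;
  const shown = attrs["type"] === "password" ? "***" : value;
  return `${name} = "${shown}"`;
}

// Depth-first walk in document order; non-interactive markup is dropped.
function dehydrate(root: SketchNode): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const visit = (node: SketchNode): void => {
    const role = node.attrs?.["role"] ?? ROLE_BY_TAG[node.tag];
    if (role !== undefined) {
      entries.push({ index: entries.length, role, label: labelOf(node) });
      return; // treat an interactive element as a leaf
    }
    (node.children ?? []).forEach(visit);
  };
  visit(root);
  return entries;
}

// Render the entries as the compact text a model would read.
function render(entries: FlatEntry[]): string {
  return entries.map((e) => `[${e.index}] ${e.role}: ${e.label}`).join("\n");
}

// Example page: a sign-in form with some non-interactive wrapper markup.
const page: SketchNode = {
  tag: "form",
  children: [
    { tag: "h1", text: "Sign in" },
    { tag: "input", attrs: { type: "email", placeholder: "Email", value: "ada@example.com" } },
    { tag: "input", attrs: { type: "password", "aria-label": "Password", value: "hunter2" } },
    { tag: "div", children: [{ tag: "button", children: [{ tag: "span", text: "Log in" }] }] },
    { tag: "a", text: "Forgot password?", attrs: { href: "/reset" } },
  ],
};

console.log(render(dehydrate(page)));
// [0] textbox: Email = "ada@example.com"
// [1] textbox: Password = "***"
// [2] button: Log in
// [3] link: Forgot password?
```

In this sketch the heading and wrapper `div` disappear, and the model would refer to elements by index, which is the kind of handle that calls such as `clickElement(index)` and `inputText(index, text)` take. Masking at dehydration time, rather than in the prompt, keeps the sensitive value out of the text entirely.
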

How It Compares

| Approach | Where it runs | Reads the page via | Setup | Best fit |
|---|---|---|---|---|
| Page Agent | Inside the page (client-side JS) | Dehydrated text DOM | One script tag or npm | Copilots inside apps you own |
| Selenium / Playwright / Puppeteer | External process | DOM via driver (WebDriver/CDP) | Driver plus runtime or server | Scripted end-to-end testing |
| browser-use | External process | DOM plus optional vision | Python plus a browser | Autonomous multi-site agents |
| WebMCP | Server-side tools | Structured function calls | Requires standard adoption | Native agent tool access |

The takeaway is scope, not speed. Page Agent fits products you control and can add code to. External drivers still win for cross-site scraping and locked-down environments.

Use Cases, With Examples

- SaaS AI copilot: Ship an assistant that operates the product, not one that only gives instructions. A support bot can perform the steps for the user instead of describing them.
- Smart form filling: Collapse a multi-step ERP or CRM form into one instruction. A user types ‘Submit a travel expense for $50 for lunch yesterday.’ The agent handles the navigation and data entry.
- Accessibility: Pair it with the Web Speech API for voice control. Any web app becomes reachable through natural language, with screen-reader-friendly announcements.
- Legacy app modernization: It can wrap a legacy internal tool that has no API. You add a command bar without changing the original code.

Quick Start

For evaluation, one script tag loads Page Agent with a free testing LLM: