{"slug": "how-should-we-benchmark-lightpanda-for-ai-agents", "title": "How should we benchmark Lightpanda for AI agents?", "summary": "Lightpanda, a headless browser without graphical rendering, outperformed Chrome in speed and cost when benchmarked on text-based AI agent tasks using AssistantBench and GAIA Level 1, with the tool surface exposed by the MCP server mattering more than the underlying browser engine.", "body_md": "# How should we benchmark Lightpanda for AI agents?\n\n### Adrià Arrufat\n\n#### Software Engineer\n\n## TL;DR\n\nWe ran four browser-MCP configurations through AssistantBench and GAIA Level 1,\nholding the LLM brain constant (Claude Sonnet 4.6 in `claude --print` mode). The\nsetup lets you hold either variable constant: same engine, different MCP\nsurfaces, or same MCP surface, different engines. Two findings came out of it.\n\nThe tool surface the MCP server exposes to the model matters more than the browser engine underneath it. When the tool surface is held constant, Lightpanda is the better engine: faster, cheaper, fewer timeouts. agent-browser wrapping Lightpanda outperforms our own MCP, because we haven’t yet built around that finding. An upgrade for that is coming soon.\n\n## Why most browser benchmarks don’t fit Lightpanda\n\nLightpanda has no graphical rendering. That’s the design choice the whole\nproject is built around, and it makes most existing browser benchmarks a poor\nfit for us. Screenshot-graded benchmarks like [WebVoyager](https://arxiv.org/abs/2401.13919) and Online-Mind2Web\nscore agents on what they see on screen. We don’t render to a screen, so we’re\nnot going to argue our way past that.\n\nWe picked benchmarks that match what Lightpanda actually does. Text-grounded,\nmulti-step research tasks where success is measured deterministically against a\nknown answer. [AssistantBench](https://assistantbench.github.io/) and [GAIA Level 1](https://huggingface.co/datasets/gaia-benchmark/GAIA) both fit. 
Both grade with\ntext-based comparators against published gold answers. Both reward reasoning\nacross multiple sources, and neither requires the agent to inspect rendered\npixels.\n\n## What we ran\n\nWe compared four backends, all driven by the same Claude Sonnet 4.6 instance through the MCP interface:\n\n- **Lightpanda MCP.** Lightpanda’s own `lightpanda mcp` server on the main branch, 24 tools including `goto`, `markdown`, `semantic_tree`, `evaluate`.\n- **agent-browser + Chrome.** The [agent-browser](https://github.com/vercel-labs/agent-browser) MCP server driving headless Chromium, 13 tools including `open`, `snapshot`, `find`, `get`.\n- **agent-browser + Lightpanda.** Same agent-browser MCP server, but with `AGENT_BROWSER_ENGINE=lightpanda` so Lightpanda is the underlying engine instead of Chrome.\n- **browser-use + Chrome.** The [browser-use](https://github.com/browser-use/browser-use) MCP server driving Chromium, 11 tools.\n\nThe benchmarks both grade against gold answers with string comparators. Both have a per-task timeout of 1800 seconds. We ran each backend once across the full suite at concurrency 4. Per-turn token usage was captured live off the Claude stream, so cost and context growth are measured rather than estimated.\n\n## Caveats up front\n\nA few things to flag before the numbers, because they shape how much weight to put on small differences:\n\n- **Single run per configuration, no error bars.** API non-determinism, open-web volatility (pages going down, search engines throttling), and small sample sizes (33 AB tasks, 53 GAIA tasks) put the noise floor on accuracy gaps at roughly ±10 pp. We treat differences ≥10 pp as meaningful and smaller gaps as directional.\n- **Text-heavy research workloads only.** AssistantBench and GAIA Level 1 are Wikipedia lookups, store directories, news articles, and government data. Exactly the workloads where Lightpanda’s text-only design plays to its strengths. 
We didn’t test JS-heavy SPAs, CAPTCHA or Cloudflare challenges, or anything that needs actual rendering. “Browser fidelity for arbitrary modern web apps” is not what we measured.\n- **One model.** Sonnet 4.6 is one point on the model-size axis. Larger models might handle Lightpanda MCP’s verbose tool outputs better. Smaller ones might struggle more.\n- **GAIA Level 1 only.** Levels 2 and 3 would stress multi-hop reasoning more, where tool-surface quality probably matters even more than what we observed here.\n\nThis is a first honest pass at understanding the relationship between MCP design, browser engine, and agent accuracy.\n\n## The results\n\n| | Lightpanda MCP | agent-browser + Chrome | agent-browser + Lightpanda | browser-use |\n|---|---|---|---|---|\n| AssistantBench (33) strict | 0.424 | 0.45 | 0.606 | 0.42 |\n| AB avg duration | 1112 s | 1045 s | 956 s | 1130 s |\n| AB timeouts | 11/33 | 8/33 | 7/33 | 4/33 |\n| AB cost / task | $2.17 | $3.10 | $2.85 | $3.85 |\n| GAIA Level 1 (53) strict | 0.755 | 0.83 | 0.887 | 0.43 |\n| GAIA avg duration | 416 s | 453 s | 321 s | 287 s |\n| GAIA timeouts | 6/53 | 4/53 | 1/53 | 2/53 |\n| GAIA cost / task | $0.63 | $0.94 | $0.94 | $0.73 |\n\nThree things stand out:\n\n- **agent-browser + Lightpanda is the Pareto winner on both benchmarks.** It wins accuracy outright on both, and it’s roughly tied with the cheapest configuration on cost per task on GAIA.\n- **Lightpanda’s own MCP is the cheapest per task.** $2.17 on AB, $0.63 on GAIA. But it’s also the slowest per task and has the most timeouts on AB (11 of 33). On accuracy, our MCP is behind agent-browser + Lightpanda by 18 pp on AssistantBench and 13 pp on GAIA.\n- **browser-use answers the most AssistantBench tasks (29 of 33) but matches Lightpanda MCP on accuracy** at 0.42, and collapses on GAIA at 0.43. The model spends 55% of its calls on navigate, runs the longest turn counts (219 on AB, 152 on GAIA), and produces confident answers that aren’t grounded in careful page reading. 
On GAIA’s exact-match grader, “close enough” doesn’t score.\n\n## Tool surface beats engine\n\nThe interesting comparison holds one variable constant at a time. Two configurations share an engine (Lightpanda) but use different MCP surfaces. Two configurations share a surface (agent-browser) but use different engines. Holding each variable constant in turn tells you which one carries the weight.\n\n| | Lightpanda engine | Chrome engine |\n|---|---|---|\n| Lightpanda MCP surface | AB 0.424 / GAIA 0.755 | not measurable (Lightpanda MCP is built into the Lightpanda binary, it can’t drive Chrome) |\n| agent-browser MCP surface | AB 0.606 / GAIA 0.887 | AB 0.45 / GAIA 0.83 |\n\nSame engine, different MCP surface: +18 pp on AB, +13 pp on GAIA.\n\nSame MCP surface, different engine: +16 pp on AB, +6 pp on GAIA.\n\nThe MCP tool surface is the dominant variable. The engine is secondary, but consistently favours Lightpanda.\n\n## Why our MCP currently loses\n\nLightpanda’s engine is fast. When wrapped by agent-browser’s MCP, it runs at 4.82 seconds per turn on AssistantBench, faster than Chrome+agent-browser at 5.03 seconds. It’s Lightpanda’s own MCP that’s slow, at 7.04 seconds per turn. About 46% more time per turn on the same engine, driven through a different MCP. Same pattern on GAIA (7.92 s vs 5.38 s).\n\nA turn is “Claude emits a tool call, the MCP server runs it, returns a payload,\nClaude reads it and emits the next tool call.” Bigger payloads mean more\nserialization, more bytes over stdio, and more tokens for Claude to process.\nAnd our MCP leans hard on one specific tool that returns large payloads:\n`markdown`.\n\n`markdown` returns the readable text of a page or subtree. On a real research\npage that’s commonly 10 to 30 KB of text. On Lightpanda MCP, 34% of all tool\ncalls are `markdown`. On agent-browser variants, it’s effectively 0% because\nagent-browser doesn’t have a markdown tool at all. 
browser-use sits at 18%.\n\nagent-browser’s design replaces full-page markdown with a combination of\n`snapshot` (accessibility tree, structured and small), `get` (focused data\nfetches), and `find` (locate by role). Smaller payloads per call, more calls,\nlower total bytes flowing through Claude’s context per useful piece of\ninformation. On AssistantBench:\n\n| | Lightpanda MCP | agent-browser + Lightpanda |\n|---|---|---|\n| avg turns per task | 158 | 198 |\n| avg tokens / turn | ~860 | ~660 |\n| avg final-turn context | 136 K | 130 K |\n| avg duration per turn | 7.04 s | 4.82 s |\n\nagent-browser uses more turns, but each turn is smaller and faster. The agent gets more bites at the apple before the 30-minute clock runs out. On AB, where 11 of 33 Lightpanda MCP runs timed out vs 7 of 33 for agent-browser + Lightpanda, those extra bites translate directly into answered tasks.\n\n## Where our engine clearly wins\n\nHold the MCP surface constant and the engine comparison is cleaner. Same agent-browser tools, Chrome vs Lightpanda:\n\n| | agent-browser + Chrome | agent-browser + Lightpanda |\n|---|---|---|\n| AB avg duration | 1045 s | 956 s |\n| AB timeouts | 8/33 | 7/33 |\n| GAIA avg duration | 453 s | 321 s |\n| GAIA timeouts | 4/53 | 1/53 |\n\nOn GAIA, the Lightpanda engine cuts wall time per task by 29% and quarters the timeout rate. There are three concrete drivers behind this, none of them magic.\n\n**Lightpanda has more answered tasks.** On AssistantBench, three of the five tasks\nwhere Lightpanda beat Chrome were tasks Chrome timed out on. On GAIA, two of\nfive. There were zero cases on either benchmark where Lightpanda timed out but\nChrome answered. The engine swap catches what Chrome runs out of time on\nwithout trading away any wins.\n\n**Lightpanda is faster per task even before any timeout shows up.** Among tasks\nboth engines completed within budget, Lightpanda was 9% faster on AB (656 s vs\n718 s) and 20% faster on GAIA (274 s vs 343 s). 
That extra wall-time budget\ncompounds: more retry attempts fit before the cap hits.\n\n**The page state stays where the agent left it.** Same MCP server, but the agent\nmakes meaningfully fewer “redo” calls on Lightpanda. On GAIA, Chrome agents\ncall `open` (navigation) 70% more often (24.7 vs 14.5 per task) because pages\ndrift out from under them: cookie banners, lazy-load reflows, post-load\nredirects. On AssistantBench, Chrome calls `snapshot` 54% more often (22.8 vs\n14.8) because DOM mutations from ads and tracking JS invalidate prior\nsnapshots, forcing re-reads.\n\nLightpanda is text-only, so the DOM the model sees stays stable across turns, and fewer retries are needed.\n\n## What we’re shipping next\n\nThis data is driving three changes which we’re currently developing and testing.\n\n- **Workflow guidance, not tool count.** Lightpanda’s MCP exposes 24 tools. agent-browser exposes 13 and outperforms it. The 34% `markdown` dominance is the model reaching for the obvious “see what’s on the page” because nothing in our guidance pushes it elsewhere. The fix is workflow: start with `tree` (semantic overview, cheap) on any unfamiliar page, drill down with `nodeDetails` or `findElement` to locate the interesting region, then call `markdown(backendNodeId | selector)` to materialize prose for just that subtree. Full-page markdown stays available but is explicitly the fallback.\n- **A first-class `search` tool.** The model used to synthesise web searches by `goto`-ing a search engine, calling `markdown` on the results page, and parsing manually. A dedicated search tool collapses that whole sequence into a single call. On our internal runs this also drops `eval` from 17% of calls to 3%, because most of those JavaScript-evaluation calls were workarounds for things `search` now covers directly. The preview wraps Tavily search API as the primary backend, with DuckDuckGo as a fallback. 
A dedicated `search` call is cleaner than the goto+markdown pattern either way, but a hosted search API contributes to the preview’s speed.\n- **An integrated agent inside Lightpanda that talks to the model directly.** MCP is a great interop layer, but it adds round-trip overhead on every tool call, and the model has to repeatedly re-read large prefixes (system prompt, tool definitions, prior tool results) at full input price. We’re developing an agent that owns the conversation and uses prompt caching on the system prompt and tool definitions. On Anthropic’s published pricing, cached input tokens are roughly 10x cheaper than fresh ones. Early internal runs put 99% of input tokens into cache reads after the first turn.\n\n## Try it yourself\n\nBenchmarks, gold answers, harness, and per-task traces are at\n[github.com/lightpanda-io/agent-benchmarks](https://github.com/lightpanda-io/agent-benchmarks) under Apache 2.0. The fastest way to\nreproduce the table is to clone the repo, open Claude Code in it, and ask it to\nreproduce the results with the same models and timeouts. That’s the whole\nworkflow.\n\nThe [quickstart guide](https://lightpanda.io/docs/quickstart/installation-and-setup) gets you\nrunning Lightpanda locally in under 10 minutes if you want to try it on your\nown workloads first.\n\n## FAQ\n\n### Why didn’t you use WebVoyager?\n\nWebVoyager grades agents on screenshots, and Lightpanda doesn’t render to a screen. There’s no fair way to run a non-rendering browser through a benchmark that scores visual matches. We focused on text-graded benchmarks where the comparison is meaningful.\n\n### Why does Lightpanda’s own MCP underperform agent-browser wrapping Lightpanda?\n\nThe tool mix leans heavily on `markdown`, which returns 10 to 30 KB of page text\nper call. That inflates per-turn latency by about 46% compared to the same\nengine wrapped by agent-browser, where smaller payloads (`snapshot`, `get`, `find`)\ndominate. 
Our system-level workflow guidance points the model at full-page\nmarkdown as the default page-inspection step, whereas agent-browser implicitly\nsteers it toward a tree-first pattern.\n\n### What model did you use?\n\nClaude Sonnet 4.6 across every configuration. Driven through `claude --print --output-format stream-json`\nso per-turn cost and token usage came live off the\nstream. The model is held constant so the variable is the browser layer.\n\n### Is the benchmark harness open source?\n\nYes. The runner, prompt configurations, gold answers, and per-task traces are\nat [github.com/lightpanda-io/agent-benchmarks](https://github.com/lightpanda-io/agent-benchmarks) under Apache 2.0.\n\n### What’s the difference between agent-browser and browser-use?\n\nagent-browser exposes a CDP-style tool surface: `open`, `snapshot`, `find`, `get`.\nPages come back as accessibility-tree snapshots with element IDs. browser-use\nexposes a raw-HTML surface with `browser_get_html` returning the full page\nsource, plus its own autonomous agent loop that we disabled for this\ncomparison. agent-browser leans on small structured payloads. browser-use leans\non full-page text and trusts the model to find what it needs.\n\n### How many runs did you average?\n\nOne per configuration. The headline differences are well above the ~10 pp noise floor we’d expect from API non-determinism and open-web drift. We treat differences ≥10 pp as meaningful and smaller ones as directional.\n\n### Did the agent know which browser it was using?\n\nNot deliberately. The agent was told what tools it had access to and how to use them. It wasn’t told whether the browser underneath was Lightpanda or Chrome. 
We can’t fully rule out that something like a user-agent string leaked through on a given page, but nothing in our prompt or tool descriptions identified the engine.\n\n### Adrià Arrufat\n\n#### Software Engineer\n\nAdrià is an AI engineer at Lightpanda, where he works on making the browser more useful for AI workflows. Before Lightpanda, Adrià built machine learning systems and contributed to open-source projects across computer vision and systems programming.", "url": "https://wpnews.pro/news/how-should-we-benchmark-lightpanda-for-ai-agents", "canonical_source": "https://lightpanda.io/blog/posts/benchmarking-lightpanda-for-agents", "published_at": "2026-06-03 00:00:00+00:00", "updated_at": "2026-06-17 14:52:35.672749+00:00", "lang": "en", "topics": ["ai-agents", "ai-research", "ai-tools", "ai-infrastructure"], "entities": ["Lightpanda", "Claude Sonnet 4.6", "AssistantBench", "GAIA", "agent-browser", "browser-use", "Chrome", "MCP"], "alternates": {"html": "https://wpnews.pro/news/how-should-we-benchmark-lightpanda-for-ai-agents", "markdown": "https://wpnews.pro/news/how-should-we-benchmark-lightpanda-for-ai-agents.md", "text": "https://wpnews.pro/news/how-should-we-benchmark-lightpanda-for-ai-agents.txt", "jsonld": "https://wpnews.pro/news/how-should-we-benchmark-lightpanda-for-ai-agents.jsonld"}}