Benchmarking Lightpanda’s native agent

Adrià Arrufat

Software Engineer

TL;DR #

Our first benchmark post found that the tools an agent calls have an impact on accuracy and speed. We rebuilt that tool surface, and it paid off in two places. Our MCP (Model Context Protocol) server jumped from 0.424 to 0.697 on AssistantBench and cut its GAIA wall time from 416 to 219 seconds per task. We put the same tools in a native agent built into the Lightpanda binary, which drops the MCP serialization boundary and runs faster still: 86 seconds on GAIA, 0.830 accuracy, zero timeouts. The accuracy comes from the tools and an improved prompt; the native agent adds speed.

The gap we set out to close #

The first benchmark post held the model constant (Claude Sonnet 4.6) and compared browser backends on two text-graded benchmarks. GAIA Level 1 is 53 deterministically-graded tasks. AssistantBench is 33 text-answer research tasks. Both grade against published gold answers with string comparators, both run with an 1800-second per-task timeout, and we run each configuration once at concurrency 4.

Our own MCP underperformed agent-browser wrapping the same Lightpanda engine, because our tool mix leaned too hard on full-page `markdown` and put large payloads through the model on every turn. That left a clear gap. On GAIA, our MCP scored 0.755 against the 0.887 of agent-browser with Lightpanda. On AssistantBench, it scored 0.424 against 0.606.

One caveat carries into this post. These are single runs with no error bars, on samples of 33 and 53 tasks. We set the noise floor at 10 pp in the first post and we maintain that here. Treat any accuracy difference under 10 pp as directional. Treat the speed and cost numbers as measured.
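The noise-floor rule above can be written down as a small helper. This is a minimal sketch, not part of the published harness: the function names are hypothetical, and the only inputs are the 10 pp floor and the scores quoted in this post.

```python
NOISE_FLOOR_PP = 10.0  # noise floor stated in the post, in percentage points


def delta_pp(before: float, after: float) -> float:
    """Difference between two accuracy scores, in percentage points."""
    return round((after - before) * 100, 1)


def classify(before: float, after: float, floor_pp: float = NOISE_FLOOR_PP) -> str:
    """Label a change 'directional' when it is under the floor, else 'above floor'."""
    return "above floor" if abs(delta_pp(before, after)) >= floor_pp else "directional"


# AssistantBench strict, old MCP vs new MCP (scores from the post)
ab_label = classify(0.424, 0.697)

# GAIA, old MCP vs native agent (scores from the post)
gaia_label = classify(0.755, 0.830)

print(ab_label, gaia_label)
```

Under this rule the AssistantBench move counts as a real difference, while the GAIA move stays directional.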

The fix: better tools, then no protocol #

We changed the tool surface in two places.

- The MCP server itself. We replaced the markdown-heavy tool mix with scoped tools: a `markdown` call that targets a single node, a `nodeDetails` call that returns an element’s CSS selector directly, and a faster `goto`. On AssistantBench the MCP went from 0.424 to 0.697 strict. On GAIA its wall time per task dropped from 416 seconds to 219.
- The native agent. MCP is a clean interop layer, but it puts a server and a serialization boundary between the model and the browser. Every tool call means a round-trip to that server and back. The native agent runs the same improved tools inside the Lightpanda binary, so those round-trips disappear and each turn finishes faster. On GAIA it runs in 86 seconds against the new MCP’s 219.
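To make the serialization boundary concrete, here is a rough sketch of the two call paths. MCP is built on JSON-RPC 2.0 and invokes tools through the `tools/call` method; everything else here is an assumption for illustration. The tool body is a made-up stand-in, the result shape is simplified, and the real server also adds a process and transport hop that this sketch does not model.

```python
import json


def node_details(node_id: int) -> dict:
    """Hypothetical stand-in for a browser tool; the selector is made up."""
    return {"nodeId": node_id, "selector": f"#node-{node_id}"}


TOOLS = {"nodeDetails": node_details}


def call_via_protocol(name: str, arguments: dict) -> dict:
    """Simulate a protocol call: encode request, dispatch, encode response, decode."""
    request = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })
    # Server side: parse the request, run the tool, encode the response.
    parsed = json.loads(request)
    params = parsed["params"]
    result = TOOLS[params["name"]](**params["arguments"])
    response = json.dumps({"jsonrpc": "2.0", "id": parsed["id"], "result": result})
    # Client side: parse the response before handing it to the model loop.
    return json.loads(response)["result"]


def call_native(name: str, arguments: dict) -> dict:
    """In-process path: a plain function call, with no encode or decode step."""
    return TOOLS[name](**arguments)


via_protocol = call_via_protocol("nodeDetails", {"node_id": 7})
native = call_native("nodeDetails", {"node_id": 7})
print(via_protocol == native)
```

Both paths return the same payload. The difference is the work done around each call, which is paid on every turn.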

Putting the agent in the browser does not cut token cost by itself. MCP tool definitions and `tool_result` blocks cache like any other input. The token savings here come from prompt caching and the leaner tools, which both setups share. What the native agent does improve, on top of the tool fix, is time per turn.

Where speed turns into accuracy #

AssistantBench is a strict grader that scores an unanswered task as zero. A task that runs past the 1800-second cap is unanswered. So on any benchmark with tasks long enough to hit that cap, faster turns raise the score directly, because the agent finishes more tasks inside the budget.

Our MCP timed out on 11 of its 33 tasks. The native agent times out on none. The rebuilt tool surface recovers most of those, which is why the AssistantBench score moves from 0.424 to 0.697. It moves for the new MCP and for the native agent, because they run the same tools. The reasoning didn’t get smarter; the agent stopped running out of time.

On GAIA Level 1, tasks finish in 86 to 453 seconds against the same 1800-second cap, so there is plenty of headroom. Faster turns still help cost and stability, but there are almost no timed-out tasks to recover. That’s the reason the tool fix moves AssistantBench accuracy and leaves GAIA accuracy flat.
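The link between speed and score can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's grader: the comparator is a simplified case-insensitive exact match, the `TaskResult` type is hypothetical, and the task data is invented to show the effect.

```python
from dataclasses import dataclass
from typing import List, Optional

TIMEOUT_S = 1800.0  # per-task cap stated in the post


@dataclass
class TaskResult:
    duration_s: float      # wall time the agent spent on the task
    answer: Optional[str]  # None when no answer was produced
    gold: str              # published gold answer


def strict_score(result: TaskResult, cap_s: float = TIMEOUT_S) -> float:
    """Score one task; a timed-out or unanswered task counts as zero."""
    if result.duration_s >= cap_s or result.answer is None:
        return 0.0
    # Simplified comparator (assumption): case-insensitive exact match.
    return 1.0 if result.answer.strip().lower() == result.gold.strip().lower() else 0.0


def strict_accuracy(results: List[TaskResult]) -> float:
    """Mean strict score over all tasks, answered or not."""
    return sum(strict_score(r) for r in results) / len(results)


# Illustrative data only: the same four tasks run by a slow and a fast agent.
slow = [
    TaskResult(900.0, "paris", "Paris"),
    TaskResult(1800.0, None, "42"),     # hit the cap, never answered
    TaskResult(1200.0, "blue", "red"),  # answered, but wrong
    TaskResult(1500.0, "7", "7"),
]
fast = [
    TaskResult(400.0, "paris", "Paris"),
    TaskResult(950.0, "42", "42"),      # same task now finishes in time
    TaskResult(600.0, "blue", "red"),   # still wrong: speed does not fix reasoning
    TaskResult(700.0, "7", "7"),
]

slow_acc = strict_accuracy(slow)
fast_acc = strict_accuracy(fast)
print(slow_acc, fast_acc)
```

The faster agent gains exactly the task that previously ran out of time. The task it answers wrongly stays wrong at any speed.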

The climb, iteration by iteration #

Each result below is measured against the same model and timeouts as the first post. This chart compares the old and rebuilt MCP with the native agent across each rebuild.

| configuration | GAIA accuracy | GAIA duration | AB strict | AB duration |
|---|---|---|---|---|
| MCP (old) | 0.755 | 416 s | 0.424 | 1045 s |
| MCP (new) | 0.792 | 219 s | 0.697 | 837 s |
| `branch` | 0.698 | 155 s | 0.545 | 557 s |
| `cached` | 0.755 | 114 s | 0.606 | 488 s |
| `leaner` | 0.792 | 158 s | 0.576 | 505 s |
| `verify` | 0.811 | 121 s | 0.636 | 477 s |
| `fastgoto` | 0.830 | 86 s | 0.697 | 412 s |

The rebuilt MCP (new) reaches 0.697 on AssistantBench and 0.792 on GAIA, while halving GAIA wall time. That is the tool fix, with MCP still in the loop. The native builds below then run the same tools without the protocol, and the duration column keeps dropping: `fastgoto` runs GAIA in 86 seconds against the new MCP’s 219 and the old MCP’s 416. We iterated five times to improve accuracy on the native agent:

- The first native build (`branch`) was already about twice as fast as the old MCP, but on GAIA it scored 0.698, below the 0.755 it replaced.
- Prompt caching (`cached`) brought GAIA up to 0.755 and AssistantBench to 0.606, matching the level agent-browser wrapping Lightpanda reached in the first post.
- A leaner tool surface with condensed system prompt and tool descriptions (`leaner`).
- Self-verification of answers (`verify`).
- A faster goto (`fastgoto`) is the most accurate Lightpanda configuration we have measured on either benchmark. It ties the agent-browser on Chrome reference from the first post on GAIA at 0.830, and it does it at a fraction of the wall time.

The useful pattern in this table is the duration column. It trends down across the rebuilds, and `fastgoto` runs GAIA in 86 seconds against the old MCP’s 416.

AssistantBench: the accuracy win #

AssistantBench is the longer, harder benchmark, and it’s where the native agent’s accuracy gain is clearest. It went from 0.424 strict with our MCP to 0.697. That clears the 10 pp floor with room to spare. It also passes our agent-browser on Lightpanda reference (0.606) from the first post. Across the native climb, from the first build (`branch`) to `fastgoto`, it added 15.2 pp strict, answered five more tasks, and ran 26% faster.

What the native agent adds on AssistantBench is speed. The new MCP runs the suite at 837 seconds per task. The native agent runs it at 412 (about twice as fast) because it drops the serialization boundary.

| metric | first build `branch` | latest `fastgoto` | Δ |
|---|---|---|---|
| accuracy_strict | 0.545 | 0.697 | +15.2 pp |
| accuracy_soft | 0.473 | 0.571 | +9.8 pp |
| answered | 27/33 | 32/33 | +5 |
| timeouts | 0 | 0 | - |
| avg duration | 557 s | 412 s | −26% |

Break it down by difficulty and the gain shows up in every tier, with the biggest jump in Medium.

| difficulty | metric | first build `branch` | latest `fastgoto` | Δ |
|---|---|---|---|---|
| Medium (14) | strict | 0.714 (10/14) | 0.929 (13/14) | +21.4 pp |
| Medium (14) | soft | 0.613 | 0.767 | +15.4 pp |
| Hard (19) | strict | 0.421 (8/19) | 0.526 (10/19) | +10.5 pp |
| Hard (19) | soft | 0.370 | 0.427 | +5.6 pp |
| Overall (33) | strict | 0.545 (18/33) | 0.697 (23/33) | +15.2 pp |

Medium is now nearly saturated at 13 of 14, up from 10 in the first native build. The changes across the climb (caching, a leaner tool surface, self-verification, and a faster `goto`) flipped three Medium tasks that were losing time to navigation overhead rather than reasoning.
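As a quick consistency check, the per-tier strict counts should add up to the overall figures. The sketch below recomputes them from the correct/total counts quoted in the difficulty breakdown; the dictionary layout and helper names are hypothetical.

```python
# (correct, total) strict counts per tier, as quoted in the difficulty breakdown
TIERS = {
    "Medium": {"branch": (10, 14), "fastgoto": (13, 14)},
    "Hard": {"branch": (8, 19), "fastgoto": (10, 19)},
}


def accuracy(counts: tuple) -> float:
    """Strict accuracy from a (correct, total) pair, rounded like the tables."""
    correct, total = counts
    return round(correct / total, 3)


def overall(build: str) -> tuple:
    """Sum the per-tier counts for one build into a single (correct, total) pair."""
    correct = sum(tier[build][0] for tier in TIERS.values())
    total = sum(tier[build][1] for tier in TIERS.values())
    return correct, total


def delta_pp(before: tuple, after: tuple) -> float:
    """Change in strict accuracy between two (correct, total) pairs, in pp."""
    return round((after[0] / after[1] - before[0] / before[1]) * 100, 1)


for name, tier in TIERS.items():
    print(name, accuracy(tier["branch"]), accuracy(tier["fastgoto"]),
          delta_pp(tier["branch"], tier["fastgoto"]))

overall_branch = overall("branch")
overall_fastgoto = overall("fastgoto")
print("Overall", accuracy(overall_branch), accuracy(overall_fastgoto),
      delta_pp(overall_branch, overall_fastgoto))
```

The two tiers sum to 18/33 and 23/33, which matches the overall strict row.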

Hard rose from 8 to 10 of 19, a smaller step that still leaves it the weakest tier and the benchmark’s ceiling. Hard AssistantBench tasks are long multi-source lookups that run 400 to 750 seconds each. They are bottlenecked by depth of multi-step reasoning and cross-source synthesis, not by how fast a page loads or how tight the tool surface is.

This is the same structural gap the first post flagged: the native work made the agent faster and cheaper and lifted every tier, but it cannot buy the reasoning depth hard tasks need and the model does not have.

GAIA: same accuracy, much faster #

On GAIA the native agent reached 0.830, level with the agent-browser on Chrome reference (0.83). The step up from the old MCP’s 0.755 sits inside the ±10 pp noise floor we committed to, and so does the gap to the new MCP’s 0.792. Speed and cost are the story.

Against the MCP it replaces, the native agent holds GAIA accuracy, cuts wall time per task from 416 seconds to 86, drops cost per task from $0.63 to $0.34, and takes GAIA timeouts from 6 of 53 to zero. The headline is parity with the Chrome reference, at a fraction of the time and cost.

| metric | first build `branch` | latest `fastgoto` | Δ |
|---|---|---|---|
| accuracy | 0.698 | 0.830 | +13.2 pp |
| answered | 51/53 | 52/53 | +1 |
| timeouts | 1 | 0 | −1 |
| avg duration | 155 s | 86 s | −45% |

Cost across both benchmarks #

The cost drop comes from cache reads and smaller payloads, not fewer turns. The biggest lever is prompt caching: caching the system prompt and tool definitions turns most input into cheap cache reads and brings cost down to roughly a third of the uncached first build.

The first build always dumped whole pages, while markdown can now be scoped to a single node and nodeDetails hands back a node’s unique CSS selector directly, so the agent pulls just the region it needs instead of the full page. With a faster goto reaching answers with less re-reading too, each task pushes fewer tokens through the model. We observed cost per task fall to about $0.34 on GAIA and $1.94 on AssistantBench, at higher accuracy.
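A back-of-the-envelope model shows why caching a fixed prefix dominates input cost. Every number here is an assumption for illustration: the token counts, the turn count, the base price, and the cache write and read multipliers are placeholders, not measurements from these runs. The model also ignores output tokens and the growing conversation history.

```python
def input_cost(
    turns: int,
    prefix_tokens: int,
    fresh_tokens_per_turn: int,
    price_per_mtok: float,
    cached: bool,
    write_mult: float = 1.25,  # assumed surcharge for writing the prefix to cache
    read_mult: float = 0.10,   # assumed discount for reading the prefix from cache
) -> float:
    """Input-side cost in dollars for one task under a simplified model.

    prefix_tokens is the system prompt plus tool definitions, resent every turn.
    fresh_tokens_per_turn is the new, uncached input added on each turn.
    """
    fresh = turns * fresh_tokens_per_turn
    if cached:
        # The first turn writes the prefix to the cache; later turns read it back.
        prefix = prefix_tokens * write_mult + (turns - 1) * prefix_tokens * read_mult
    else:
        # Without caching, the full prefix is billed at the base rate every turn.
        prefix = turns * prefix_tokens
    return (prefix + fresh) * price_per_mtok / 1_000_000


# Hypothetical task: 20 turns, 8,000-token prefix, 1,000 fresh tokens per turn.
uncached = input_cost(20, 8_000, 1_000, price_per_mtok=3.0, cached=False)
with_cache = input_cost(20, 8_000, 1_000, price_per_mtok=3.0, cached=True)
print(f"uncached ${uncached:.4f}  cached ${with_cache:.4f}")
```

The longer the task and the larger the fixed prefix, the more the cached path pulls ahead, and smaller per-turn payloads shrink the `fresh` term that caching cannot help with.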

Where this leaves us #

The lesson from the first post was that the tools an agent can call matter more than the engine. We rebuilt that tool surface, and the payoff shows up in both of our setups.

Try it yourself #

The benchmarks, gold answers, and harness are published here under Apache 2.0. The fastest way to reproduce the tables is to clone the repo, open Claude Code in it, and ask it to reproduce the results with the same models and timeouts.

If you want to test the agent on your own workloads, the [quickstart guide](https://lightpanda.io/docs/usage/agent) gets you going in under 10 minutes, or you can try this [end to end tutorial](https://lightpanda.io/docs/guides/lightpanda-agent-tutorial).

FAQ #

What does “native agent” mean here?

It means the agent is built into the Lightpanda binary and talks to the model directly, instead of being driven as a separate MCP server. There is no protocol round-trip on every tool call, and the system prompt and tool definitions are prompt-cached, so most input tokens after the first turn are cheap cache reads.

Why does going native help AssistantBench accuracy but not GAIA?

A strict grader scores a timed-out task as zero. AssistantBench has long tasks that hit the 1800-second cap, and our MCP timed out on 11 of 33. Faster turns recover most of those, which is where the AssistantBench gain comes from (in both the rebuilt MCP and the native agent). GAIA tasks finish well inside the cap, so there are almost no timeouts to recover, and accuracy stays flat.

How does this compare to the MCP from the first post?

Against the original MCP, the native agent is well ahead: 0.830 against 0.755 on GAIA, 0.697 against 0.424 on AssistantBench. We also rebuilt that MCP, and the new version reaches 0.697 on AssistantBench and 0.792 on GAIA. So on accuracy the native agent matches the rebuilt MCP, and its edge is speed. The accuracy gain comes from the tool surface, which both share, and which is exactly the variable the first post identified.

How many runs did you average?

One per configuration, same as the first post. The headline differences on AssistantBench and on speed and cost are well above the 10 pp noise floor. Treat the smaller per-step accuracy moves as directional.

What does the verify step do?

It’s a system-prompt rule, not a separate model or pass. When the first source is ambiguous, the agent does one more lookup before committing, and for multi-candidate questions it commits instead of leaving the answer blank. The same Sonnet 4.6 does it in its normal loop, so it costs a few extra lookups on uncertain cases, as opposed to a second run.

What’s next for the agent?

Closing the GAIA gap to agent-browser wrapping Lightpanda, which still leads at 0.887. We trail by a few tasks there and run cheaper, and the tool surface is now ours to tune directly.

Adrià Arrufat

Software Engineer

Adrià is an AI engineer at Lightpanda, where he works on making the browser more useful for AI workflows. Before Lightpanda, Adrià built machine learning systems and contributed to open-source projects across computer vision and systems programming.

source & further reading

- lightpanda.io (original article)
- The browser agent stack, explained
- Using Lightpanda with agent-browser
- Lightpanda Agent and PandaScript – LLM at buildtime, not runtime
