# 🧩 Runtime Snapshots #18 - Structured Runtime Perception: The Layer Agents Are Missing

> Source: <https://dev.to/alexey_sokolov_10deecd763/runtime-snapshots-18-structured-runtime-perception-the-layer-agents-are-missing-2hgn>
> Published: 2026-06-29 08:44:46+00:00

The last three Runtime Snapshots posts built a stack. [#15](https://dev.to/alexey_sokolov_10deecd763/runtime-snapshots-15-your-ai-agent-is-blind-were-fixing-that-4166) said your agent is blind and needs eyes and hands. [#16](https://dev.to/alexey_sokolov_10deecd763/runtime-snapshots-16-the-three-architectures-of-browser-agents-4gkc) named the three ways it can see: vision, accessibility tree, and runtime perception. [#17](https://dev.to/alexey_sokolov_10deecd763/runtime-snapshots-17-bass-browser-as-shared-space-okg) asked what happens when more than one agent shares the same live browser space. This post goes underneath all of them, to the question those posts assume an answer to: what does an agent actually perceive in the half-second before it acts?

Most browser-agent failures get filed as model failures. The model clicked the wrong button, missed the menu, filled the wrong field, thought the page had loaded when it hadn't, couldn't recover when the UI shifted. Sometimes that diagnosis is right. Often the model was handed the wrong surface and asked to reason from it.

Here's the part that gets skipped: agents don't fail the way people do. You open a tab and quietly compensate for everything the page doesn't say out loud - a greyed-out button you don't click, a spinner still turning so you wait, a modal over the page so you know the thing underneath isn't live yet. You get that continuity for free, from being a human looking at a rendered page. An LLM gets none of it for free. It gets exactly the representation we hand it, and nothing else. So the representation is the whole game.

A browser is not a screenshot, not an accessibility tree, not raw DOM. Those are views of the application - useful, but still views. Hand the model pixels and it infers state from appearance: is that button disabled, or just grey? Hand it an accessibility tree and it reads a structure built for assistive technology, which modern web apps often populate incompletely or inconsistently. Hand it raw DOM and it drowns - framework residue, stale nodes, hidden branches, generated identifiers, duplicated text, elements that exist in markup but not in the user's current experience.

The problem isn't that these surfaces are useless. They're useful. The problem is that none of them is the application. What's missing is a representation built from the live page at the moment the agent needs to act.

That layer is structured runtime perception. (In [#16](https://dev.to/alexey_sokolov_10deecd763/runtime-snapshots-16-the-three-architectures-of-browser-agents-4gkc) I called it runtime structural perception; the name has since settled.) A structured runtime snapshot answers the questions an action loop actually has: what can I see right now, what can I act on right now, what's disabled or hidden or covered or loading or stale, which element identities will survive the next action, which text is the task and which is nav and chrome and framework residue, what changed since the last step.
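To make the "what changed since the last step" and "which identities survive" questions tangible, here is a minimal sketch of a step-to-step diff. Everything in it is an assumption for illustration: `ElementState`, `Snapshot`, `diff`, and the element ids are hypothetical names, not the SiFR schema or any E2LLM API. It only assumes that each element carries an id that stays stable across steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementState:
    """Per-element state in a hypothetical snapshot; field names are illustrative."""
    role: str
    text: str = ""
    visible: bool = True
    disabled: bool = False


# A snapshot here is just a mapping from a stable element id to its state.
Snapshot = dict[str, ElementState]


def diff(before: Snapshot, after: Snapshot) -> dict[str, list[str]]:
    """Summarize what changed between two steps, by element id."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            k for k in before.keys() & after.keys() if before[k] != after[k]
        ),
    }


# Step 1: empty email field, submit still disabled.
step1 = {
    "email": ElementState("input"),
    "submit": ElementState("button", "Sign in", disabled=True),
}
# Step 2: email typed, submit enabled, a hidden error node now exists.
step2 = {
    "email": ElementState("input", "user@example.com"),
    "submit": ElementState("button", "Sign in"),
    "error": ElementState("text", "Invalid credentials", visible=False),
}

print(diff(step1, step2))
# {'added': ['error'], 'removed': [], 'changed': ['email', 'submit']}
```

The diff is only as good as the ids: if identities are regenerated on every render, every step looks like a full replacement and the agent learns nothing from it. That is why identity stability belongs in the snapshot itself rather than being left for the model to guess.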

Concretely, a snapshot carries the gap between what the HTML says and what the page is:

```
form#login (action=/auth)
  input[email]      "user@example.com"
  input[password]   required
  button[submit]    "Sign in"   [disabled]
  div.error.hidden  "Invalid credentials"
```

The disabled submit button and the not-yet-visible error are exactly the kind of state that decides whether the next action does anything. They're ambiguous in pixels, often incomplete or unreliable in a thin accessibility view, and present but noisy in raw DOM. SiFR, the Structured Interface Representation used by [E2LLM](https://e2llm.com), carries that state compactly, with every relevant node addressable, so the model can point back at the element it means. That's the practical difference between "the model saw a page" and "the model received a usable state representation."
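As a rough sketch of why that state matters to an action loop, the login form above can be modelled as a small tree and filtered down to what the agent could act on right now. The `Node` class, its fields, the `n1`..`n5` refs, and the `actionable` helper are all hypothetical and chosen for this example; they are not the SiFR format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One addressable element in a hypothetical snapshot (not the SiFR schema)."""
    ref: str                 # handle the model can point back at
    role: str                # e.g. "form", "input", "button", "text"
    label: str = ""
    visible: bool = True
    disabled: bool = False
    children: list["Node"] = field(default_factory=list)


def actionable(node: Node, ancestors_visible: bool = True) -> list[Node]:
    """Collect controls that are visible (including ancestors) and enabled."""
    visible = ancestors_visible and node.visible
    found = []
    if visible and not node.disabled and node.role in {"input", "button"}:
        found.append(node)
    for child in node.children:
        found.extend(actionable(child, visible))
    return found


# The login form from the example above, with assumed refs n1..n5.
login = Node("n1", "form", "login", children=[
    Node("n2", "input", "email"),
    Node("n3", "input", "password"),
    Node("n4", "button", "Sign in", disabled=True),
    Node("n5", "text", "Invalid credentials", visible=False),
])

print([n.ref for n in actionable(login)])  # ['n2', 'n3']
```

With the button disabled, the only sensible next actions are the two inputs; a model that was handed just the words "Sign in" has no way to know that clicking it would do nothing.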

Structured runtime perception runs next to the user's own browser session - the real session, already authenticated, with the same permissions and constraints the user already has. That matters because many important applications don't have a clean API for the task the user needs done: banking portals, government portals, internal tools, legacy admin panels, SaaS dashboards with workflow state behind login. In those places a detached bot account or a fresh logged-out browser isn't the same application state. The agent needs to perceive the state the user is actually looking at, inside an explicitly authorized session, before it can make a useful next-step proposal. That's not a convenience feature - it's the difference between "can attempt the task" and "cannot see the relevant page at all."

For several posts this has been architecture, so here, plainly, is where it stands. The substrate is live: E2LLM exposes structured browser state through a browser extension and MCP tools, documented at [e2llm.com/docs/mcp-tools](https://e2llm.com/docs/mcp-tools/). The public category and evidence surface is at [insitu.im/e2llm](https://insitu.im/e2llm/) and [insitu.im/e2llm/evidence](https://insitu.im/e2llm/evidence/), and the Runtime Snapshots index lives at [insitu.im/e2llm/runtime-snapshots](https://insitu.im/e2llm/runtime-snapshots/). The next snapshot is the part this series has been building toward: a full walkthrough of structured runtime perception driving a real task on a site with no API, start to finish - including where it gets hard.

The next wave of browser automation won't be defined only by better models. It'll be defined by better perception boundaries - by what the agent is allowed to know before it acts. Screenshots are useful, accessibility trees are useful, DOM is useful. But an agent operating inside a real browser session needs something more specific: a structured representation of what the page is right now, what can be acted on right now, and what changed since the previous step. That's the category I care about. Not "AI looking at pages." Structured runtime perception: the layer browser agents are missing.

*This is part 18 of the Runtime Snapshots series - exploring how structured browser data changes the way we build, test, and ship software. #16 named the three architectures; #17 made them share a session; this one is about what any of them perceives before it acts.*

*If you've built or used browser agents - where did yours fail first: perception, planning, or action?*
