# DOM Accessibility Tree Extraction: A Reliable Method for LLMs on Dynamic Web Tables

> Source: <https://dev.to/hottbunny/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web-tables-1j5k>
> Published: 2026-05-20 15:07:18+00:00

**Status:** Current best available technique as of 2026. Treat as standard practice, not a workaround.

## The Problem

Three naive approaches fail on modern sites:

- **view-source / static fetch**: returns server HTML before JavaScript runs. JS-rendered tables show only empty `<tbody>` tags.
- **Screenshot + OCR**: slow, pixel-dependent, brittle, compounds errors on numeric data.
- **Screenshot + vision model**: expensive, context-limited, fails on tables larger than one viewport.

**Root cause:** Much of the web has shifted to client-side rendering. On those sites the data lives in JavaScript runtime state, not in the HTML the server sends.
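To make the static-fetch failure concrete, here is a minimal sketch using only Python's standard library. `STATIC_HTML` is a made-up stand-in for what a plain HTTP fetch might return from a client-rendered page, and `BodyRowCounter` and `count_body_rows` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical server response for a client-rendered page: the table shell
# is present, but the rows are only added later by JavaScript.
STATIC_HTML = """
<table id="weather">
  <thead><tr><th>City</th><th>Temp</th></tr></thead>
  <tbody></tbody>
</table>
<script src="/app.js"></script>
"""


class BodyRowCounter(HTMLParser):
    """Counts <tr> elements that appear inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.body_rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.body_rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False


def count_body_rows(html: str) -> int:
    parser = BodyRowCounter()
    parser.feed(html)
    return parser.body_rows


print(count_body_rows(STATIC_HTML))  # 0: no data rows before JavaScript runs
```

A row count of zero from the static HTML, on a page that visibly shows a populated table in a browser, is a quick signal that the table is rendered client-side.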

## The Method

**Intuition:** Programmatic equivalent of: Highlight table → Copy → Paste into Notepad → Import to Excel → Delete irrelevant columns → Sort and count.

**Steps:**

- Load page in headless browser (Playwright recommended): JavaScript executes, table renders
- Interact with any dropdowns or filters, wait for `networkidle`
- Call `inner_text()` on the table element
- Write extracted text to file (audit trail, enables re-parsing)
- Parse in Python: split on newlines/tabs, cast numerics, filter and count
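The final parsing step needs no browser, so it can be developed and tested against saved text. The sketch below assumes the extracted text has one row per line with tab-separated cells and a header row. `SAMPLE_TEXT`, `parse_table`, `to_number`, and the `Temp` column name are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sample of text saved from inner_text() on a small table:
# one row per line, cells separated by tabs, header row first.
SAMPLE_TEXT = (
    "City\tCondition\tTemp\n"
    "Oslo\tSnow\t28 °F\n"
    "Cairo\tSunny\t104 °F\n"
    "Lima\tCloudy\t66 °F\n"
    "Nome\tClear\tN/A\n"
)

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_table(text):
    """Split tab-separated text into a list of dicts keyed by header cell."""
    lines = [line for line in text.strip().split("\n") if line.strip()]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("\t")]
    return [
        dict(zip(header, (cell.strip() for cell in line.split("\t"))))
        for line in lines[1:]
    ]


def to_number(cell):
    """Return the first number in a cell as a float, or None if there is none."""
    # Some pages use the Unicode minus sign (U+2212) instead of a hyphen.
    match = NUMBER.search(cell.replace("\u2212", "-"))
    return float(match.group()) if match else None


rows = parse_table(SAMPLE_TEXT)
temps = [t for t in (to_number(r.get("Temp", "")) for r in rows) if t is not None]
below = sum(t < 32 for t in temps)
above = sum(t > 100 for t in temps)
print(below, above)  # 1 1
```

Looking columns up by header name rather than by position keeps the parser working if the site reorders columns, and returning `None` for cells such as `N/A` avoids a crash on missing readings.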

## Why It Works

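The browser has already parsed and rendered the page, so the extracted text is structured, semantic, not pixel-dependent, and fast to obtain. Strictly speaking, `inner_text()` reads the rendered DOM text (the element's `innerText`) rather than the accessibility tree itself; both are derived from the same rendered page. Because the numbers are copied as text, there are no OCR transcription errors.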

## Pseudocode

```python
from playwright.sync_api import sync_playwright
import re

# Placeholder: set this to the page that contains the table.
URL = "https://example.com/weather-table"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")

    # Selector and label are site-specific; adjust them to the target page.
    page.select_option("select#view-filter", label="All Cities")
    page.wait_for_load_state("networkidle")

    # inner_text() returns rendered text: one row per line, cells tab-separated.
    table_text = page.locator("table").first.inner_text()
    browser.close()

# Keep the raw text as an audit trail so it can be re-parsed later.
with open("table_output.txt", "w", encoding="utf-8") as f:
    f.write(table_text)

lines = table_text.strip().split("\n")
rows = [line.split("\t") for line in lines[1:] if line.strip()]  # skip header

# Assumes temperature is the third column; cells without a number are skipped.
temps = []
for r in rows:
    match = re.search(r"-?\d+(?:\.\d+)?", r[2]) if len(r) > 2 else None
    if match:
        temps.append(float(match.group()))

print(f"Below 32°F: {sum(t < 32 for t in temps)}")
print(f"Above 100°F: {sum(t > 100 for t in temps)}")
```

## Real Example

Source: timeanddate.com, 472-city weather table, “Somewhat Popular” view.

- Execution time: ~8 seconds
- Cities below 32°F: 47
- Cities above 100°F: 12
- OCR errors: 0

## Limitations

- Requires real browser runtime (Playwright/Puppeteer)
- Some sites block headless automation
- Canvas-rendered tables require a `page.accessibility.snapshot()` fallback (note that Playwright marks this API as deprecated)
- Infinite scroll requires simulating scroll events
- Always prefer an official API if one exists

Full writeup with detailed tips and examples on GitHub: <https://github.com/hottbunny/LLM-AI-Perplexity-Skills-and-Updates/blob/hottbunny-tested-works-htmlsearchtablecrawldataretrivalskill/dom_extraction_method.md>


