{"slug": "dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web", "title": "DOM Accessibility Tree Extraction: A Reliable Method for LLMs on Dynamic Web Tables", "summary": "A method for extracting data from dynamic web tables by using Playwright to access the browser's accessibility tree, which provides structured, semantic data without OCR errors. The approach involves launching a headless browser, waiting for the page to fully load, and using the `inner_text()` method on the table element to capture the data as tab-separated text. The method successfully extracted temperature data from a 472-city weather table in approximately 8 seconds with zero OCR errors, though it requires a real browser runtime and may face limitations with headless automation blocking or canvas-rendered tables.", "body_md": "**Status:** Current best available technique as of 2026. Treat as standard practice, not a workaround.\n\n## The Problem\n\nThree naive approaches fail on modern sites:\n\n- **view-source / static fetch**: returns the server HTML before JavaScript runs. JS-rendered tables show only empty `<tbody>` tags.\n- **Screenshot + OCR**: slow, pixel-dependent, and brittle; transcription errors compound on numeric data.\n- **Screenshot + vision model**: expensive, context-limited, and fails on tables larger than one viewport.\n\n**Root cause:** The web has shifted to client-side rendering. 
Data lives in JavaScript runtime state, not in the served HTML.\n\n## The Method\n\n**Intuition:** The programmatic equivalent of: Highlight table → Copy → Paste into Notepad → Import to Excel → Delete irrelevant columns → Sort and count.\n\n**Steps:**\n\n- Load the page in a headless browser (Playwright recommended) so that JavaScript executes and the table renders\n- Interact with any dropdowns or filters, then wait for `networkidle`\n- Call `inner_text()` on the table element\n- Write the extracted text to a file (audit trail, enables re-parsing)\n- Parse in Python: split on newlines and tabs, cast numerics, filter and count\n\n## Why It Works\n\n`inner_text()` returns the table's text as the browser rendered it, read from the same parsed DOM that backs the accessibility tree. That text is structured, semantic, not pixel-dependent, and fast to read, so there are no OCR transcription errors on numbers.\n\n## Pseudocode\n\n```python\nimport re\n\nfrom playwright.sync_api import sync_playwright\n\nURL = \"https://example.com/\"  # placeholder: replace with the page that holds the table\n\nwith sync_playwright() as p:\n    browser = p.chromium.launch(headless=True)\n    page = browser.new_page()\n    page.goto(URL, wait_until=\"networkidle\")\n\n    # Apply the filter, then wait for the re-rendered table to settle\n    page.select_option(\"select#view-filter\", label=\"All Cities\")\n    page.wait_for_load_state(\"networkidle\")\n\n    # Rendered text of the first table: rows split by newlines, cells by tabs\n    table_text = page.query_selector(\"table\").inner_text()\n    browser.close()\n\n# Keep the raw text as an audit trail and for re-parsing\nwith open(\"table_output.txt\", \"w\", encoding=\"utf-8\") as f:\n    f.write(table_text)\n\nlines = table_text.strip().split(\"\\n\")\n# Skip the header row and any blank lines\nrows = [line.split(\"\\t\") for line in lines[1:] if line.strip()]\n# The third column (index 2) is assumed to hold the temperature; strip units before casting\ntemps = [float(re.sub(r\"[^\\d.\\-]\", \"\", r[2])) for r in rows if len(r) > 2 and r[2].strip()]\nprint(f\"Below 32°F: {sum(t < 32 for t in temps)}\")\nprint(f\"Above 100°F: {sum(t > 100 for t in temps)}\")\n```\n\n## Real Example\n\nSource: timeanddate.com, 472-city weather table, “Somewhat Popular” view.\n\n- Execution time: ~8 seconds\n- Cities below 32°F: 47\n- Cities above 100°F: 12\n- OCR errors: 0\n\n## Limitations\n\n- Requires a real browser runtime (Playwright/Puppeteer)\n- Some sites block headless automation\n- Canvas-rendered tables require a `page.accessibility.snapshot()` fallback\n- Infinite scroll requires simulating scroll events\n- Always prefer an official API if one exists\n\nFull writeup with detailed tips and examples on GitHub: https://github.com/hottbunny/LLM-AI-Perplexity-Skills-and-Updates/blob/hottbunny-tested-works-htmlsearchtablecrawldataretrivalskill/dom_extraction_method.md\n\n", "url": "https://wpnews.pro/news/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web", "canonical_source": "https://dev.to/hottbunny/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web-tables-1j5k", "published_at": "2026-05-20 15:07:18+00:00", "updated_at": "2026-05-20 15:33:54.678457+00:00", "lang": "en", "topics": ["developer-tools", "data", "large-language-models"], "entities": ["Playwright", "timeanddate.com"], "alternates": {"html": "https://wpnews.pro/news/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web", "markdown": "https://wpnews.pro/news/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web.md", "text": "https://wpnews.pro/news/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web.txt", "jsonld": "https://wpnews.pro/news/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web.jsonld"}}