{"slug": "justhtml-3-0-0-a-new-html5-parser-architecture", "title": "JustHTML 3.0.0: A new HTML5 parser architecture", "summary": "JustHTML 3.0.0 introduces a new plan-driven parser architecture that collapses tokenization, tree building, and sanitization into a single loop; the changelog reports about a 2x speedup. The new engine compiles behavior into an EnginePlan before parsing, allowing immediate DOM mutation and policy decisions without intermediate token objects.", "body_md": "# JustHTML 3.0.0: A new HTML5 parser architecture\n\nJustHTML 3.0.0 is out, and the biggest change is not a new API. It's a new parser core.\n\nUp until now, JustHTML looked like most HTML5 parsers. First tokenize the input, then feed those tokens into a tree builder, and only after that apply the default-safe cleanup that makes untrusted HTML usable in applications.\n\nThat's the normal structure. The HTML5 spec itself is written that way. The tokenizer is one state machine, the tree builder is another, and the boundary between them is a stream of tokens: start tags, end tags, text, comments, doctypes, parse errors.\n\n`html5lib`, browser engines, and `html5ever` all broadly follow that shape, even if the details differ a lot.\n\n## What changed in 3.0.0\n\nJustHTML 3.0.0 collapses that split into one plan-driven parser engine.\n\nSo instead of scanning characters into token objects, handing those tokens to a second subsystem, and then applying sanitizer decisions as a later pass, the new engine does that work in one loop.\n\nIt still implements the same HTML5 concepts: insertion modes, the open-element stack, active formatting elements, foster parenting, fragment parsing, RAWTEXT/RCDATA handling, foreign content rules, and all the other painful details that make browser parsing browser parsing.\n\nBut the control flow is different now. 
The parser scans the source string directly, decides what the current tag means in context, mutates the DOM immediately, and can apply default-safe policy decisions while it is still in the hot path.\n\nThis is a real architecture change, not just another round of optimization.\n\n## How it works\n\nThe key idea is the word \"plan\".\n\nBefore parsing starts, JustHTML compiles the requested behavior into an `EnginePlan`. There are different plans for the common cases:\n\n- the default safe path\n- custom sanitization policies that can be compiled into parser actions\n- the raw path used by `sanitize=False` and transform-heavy cases\n\nThat plan contains the parser-time decisions that used to be scattered across later steps: tag actions, allowed tags, attribute handling, URL policy hooks, void-element knowledge, formatting-element behavior, and other mode-specific tables.\n\nSo the hot path is no longer asking \"what should I do with this node later?\" It already knows.\n\nIn practice the engine now looks more like this:\n\n```\nplan = compile_default_engine_plan(fragment=False)\nengine = ParseEngine(html, fragment=False, plan=plan)\nroot = engine.parse()\n```\n\nInside `parse()`, the engine sets up either a document shell or fragment root, then walks the input with a single range parser. On the fast path it uses specialized start-tag and end-tag parsers for compiled-safe mode, so it avoids building generic token objects and skips the tokenizer-to-treebuilder handoff completely.\n\nAttributes are handled differently too. In the old shape, a tokenizer typically parses all attributes into token payloads, and then the tree builder or sanitizer revisits them. In the new JustHTML engine, attribute scanning can be projected directly through the current plan: preserve what is needed, drop what is not, and keep only the state required for correct tree construction.\n\nThat last part matters. HTML parsing is not just \"keep the allowed attrs\". 
Some information is needed for parser state even if it will never survive serialization.\n\n## Why this is faster\n\nThe 3.0.0 changelog reports about a **2x speedup**, and the reason is not very mysterious.\n\nTraditional parser structure pays several overhead costs:\n\n- token objects have to be allocated\n- token payloads have to be normalized and handed off\n- the tree builder has to re-interpret information the tokenizer already discovered\n- default-safe behavior often becomes a separate tree walk or transform stage\n\nThe fused engine removes a lot of that machinery from the common path.\n\nWhen JustHTML is used in its default mode, the parser can scan characters, recognize a tag, decide whether that tag is allowed, project the interesting attributes, and mutate the DOM immediately. Less indirection, fewer temporary objects, fewer full-tree passes.\n\nThis is the kind of optimization that sounds boring until you remember it's happening in Python, where object churn and extra passes cost real time.\n\n## The comparison to other parsers\n\nI still think the standard architecture is the safest place to start.\n\nIf you are implementing HTML5 from scratch, tokenizer and tree builder as separate layers is easier to reason about, easier to debug, and closer to the specification. It is also friendlier to test harnesses that want to inspect intermediate token streams.\n\nSo I don't think this proves everyone else wrong. `html5ever` and browser parsers are structured the classic way because that structure maps well to the spec and to large codebases with many contributors.\n\nWhat JustHTML 3.0.0 changes is the tradeoff. It keeps the browser-style recovery model, but stops treating token emission as a required architectural boundary.\n\nThat makes JustHTML a bit unusual among HTML parsers. 
It is still pure Python, still targets exact HTML5 behavior, and still does safe-by-default parsing for application use. But the parser core is now closer to a fused execution engine than a textbook tokenizer plus tree builder pipeline.\n\nI also think this is a better fit for what JustHTML actually is. Most users are not consuming a token stream. They want a correct DOM tree, and often they want it sanitized. If that is the real product, it makes more sense to optimize around that end-to-end job than around intermediate artifacts.\n\n## What did not change\n\nFrom the outside, this release is pleasantly boring.\n\n`JustHTML(html)` still gives you a DOM. Fragment parsing still works. Streaming, source-location tracking, strict mode, and safe-by-default behavior are still there.\n\nThe main breaking change is diagnostics. Since the old tokenizer/tree-builder internals are gone, `collect_errors=True` and `strict=True` now report a smaller, more intentional built-in error set. If you depended on exact error codes, counts, or ordering, you will need to adjust.\n\nEveryone else should mostly experience 3.0.0 as \"the same parser, only faster\".\n\nThat's exactly what I wanted.", "url": "https://wpnews.pro/news/justhtml-3-0-0-a-new-html5-parser-architecture", "canonical_source": "https://friendlybit.com/python/justhtml-3-parser-architecture/", "published_at": "2026-06-21 19:40:54+00:00", "updated_at": "2026-06-21 20:05:06.490945+00:00", "lang": "en", "topics": ["developer-tools"], "entities": ["JustHTML", "html5lib", "html5ever"], "alternates": {"html": "https://wpnews.pro/news/justhtml-3-0-0-a-new-html5-parser-architecture", "markdown": "https://wpnews.pro/news/justhtml-3-0-0-a-new-html5-parser-architecture.md", "text": "https://wpnews.pro/news/justhtml-3-0-0-a-new-html5-parser-architecture.txt", "jsonld": "https://wpnews.pro/news/justhtml-3-0-0-a-new-html5-parser-architecture.jsonld"}}