Feeding Raw HTML to Your LLM Is a Token Tax. I Measured It on 10 Real Pages: Median 7.4×, and It Hits Every Scheduled Run
A developer measured the token cost of feeding raw HTML versus extracted text to large language models across 10 real web pages, finding a median token multiplier of 7.4×. The spread ranged from 1.1× on a stripped-down test page to 47.8× on a news homepage, with the multiplier recurring on every scheduled run. The engineer published a 30-line Python meter using `requests`, BeautifulSoup, and `tiktoken` to reproduce the results.
There's a whole wave of posts right now telling you the same thing: don't feed raw HTML to your LLM, convert it to markdown first, it's more token-efficient. AlterLab has five of them. There's a popular one called "Raw HTML is where LLM context goes to die." A blog called searchcans says converting HTML to markdown cuts tokens by "20-30%." I read most of them.

Here's what bugged me. Not one shows a number you can reproduce. "More efficient": how much more? "Our benchmarks show": which benchmark, which pages, which tokenizer? searchcans says 20-30%. The "context goes to die" post is a vibe, not a measurement. AlterLab's posts say markdown wins and then sell you a managed extraction API. None of them hand you a script and say: run it on your own pages and see.

So I did the boring thing. I wrote a 30-line meter, pointed it at ten real public pages, and counted tokens. Raw HTML versus the text you actually wanted.

The median multiplier was 7.4×. The spread was wild, from 1.1× on a stripped-down test page to 47.8× on a news homepage. And the part nobody talks about: that multiplier doesn't hit you once. It hits every single scheduled run.

I've got 2,190 production scraper runs behind me. So this isn't an abstraction for me: a multiplier on a recurring job is a recurring bill.

Let me show you the whole thing. Code, numbers, the math, and the one place where "just use markdown" is only half true.

No magic. `requests` to fetch, BeautifulSoup to pull text out, `tiktoken` to count. The only thing that matters is that you count tokens with the same encoder for both sides, so the ratio is honest.

```python
#!/usr/bin/env python3
# token-cost meter: raw HTML vs extracted text, in tokens.
# stdlib + requests + bs4 + tiktoken. cl100k_base (GPT-4-era tokenizer).
import statistics

import requests
from bs4 import BeautifulSoup
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

URLS = [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://docs.python.org/3/library/json.html",
    "https://www.rfc-editor.org/rfc/rfc9110.html",
    "https://news.ycombinator.com/",
    "https://www.gnu.org/licenses/gpl-3.0.en.html",
    "https://httpbin.org/html",
    "https://blog.python.org/",
    "https://developer.mozilla.org/en-US/docs/Web/HTTP/Status",
    "https://www.bbc.com/news",
    "https://quotes.toscrape.com/",
]
H = {"User-Agent": "token-cost-meter/1.0 (+research; bs4+tiktoken)"}


def n(s):
    """Return the number of tokens in s."""
    return len(enc.encode(s))


rows = []
print(f"{'url':<50} {'tok_raw':>9} {'tok_text':>9} {'mult':>6}")
for url in URLS:
    r = requests.get(url, headers=H, timeout=20)
    # Skip anything that is not a plain 200 with a body.
    if r.status_code != 200 or not r.text:
        print(f"{url[:48]:<50} {'SKIP - ' + str(r.status_code):>26}")
        continue
    raw = r.text
    text = BeautifulSoup(raw, "html.parser").get_text(" ", strip=True)
    # max(1, ...) avoids dividing by zero on a page with no text.
    t_raw, t_txt = n(raw), max(1, n(text))
    rows.append((t_raw, t_txt, t_raw / t_txt))
    print(f"{url[:48]:<50} {t_raw:>9} {t_txt:>9} {t_raw / t_txt:>6.1f}")

mults = [m for *_, m in rows]
print(f"\nmedian multiplier: {statistics.median(mults):.1f}x")
print(f"min / max: {min(mults):.1f}x / {max(mults):.1f}x")
```

That's it. Thirty-odd lines. You can swap in your own URLs and have your own number in five minutes, which is more than any of the "markdown beats HTML" posts gave you.

Here's the raw output from my run. Every one of these ten URLs returned HTTP 200 the day I ran it (2026-05-28); none were skipped.
```
url                                                  tok_raw  tok_text   mult
https://en.wikipedia.org/wiki/Web_scraping             48975      6658    7.4
https://docs.python.org/3/library/json.html            32245      6271    5.1
https://www.rfc-editor.org/rfc/rfc9110.html           378840    110723    3.4
https://news.ycombinator.com/                          11794      1150   10.3
https://www.gnu.org/licenses/gpl-3.0.en.html           12502      7882    1.6
https://httpbin.org/html                                 886       836    1.1
https://blog.python.org/                               11279       465   24.3
https://developer.mozilla.org/.../HTTP/Status          62411      4649   13.4
https://www.bbc.com/news                              112721      2356   47.8
https://quotes.toscrape.com/                            2880       388    7.4

median multiplier: 7.4x
min / max: 1.1x / 47.8x
```

Median 7.4×. So the headline number I'll stand behind is: feeding raw HTML to a model costs you, on a typical page, around seven times what feeding the extracted text would.

But honestly, the median is the least interesting line in that table. Look at the spread.

The httpbin.org/html page is 1.1×. It's a deliberately minimal test page: almost all text, almost no markup. The Wikipedia article is 7.4×, right on the median. And then the BBC news homepage is 47.8×. Forty-seven times. That page is 112,721 tokens of HTML wrapping 2,356 tokens of text. Ninety-eight percent of what you'd send the model is scaffolding: scripts, inline JSON state blobs, tracking, navigation, the same boilerplate stamped around every card.

So when someone says "HTML is roughly N× more tokens," the honest answer is: N depends entirely on what you're scraping. A clean documentation page? Maybe 3–5×. A modern, ad-heavy, JavaScript-state-dumped news or e-commerce page? You're easily in double digits. Don't trust a single number, and especially don't trust mine. Run the meter on the actual pages you scrape. That's the whole point.

One caveat I want to be upfront about, because it's the kind of thing the vendor posts gloss over. I counted with `cl100k_base`, the GPT-4-era tokenizer. Newer models (GPT-4o onward, including the GPT-5.x family) use a different encoder. Does that move the multiplier?
Barely, because the multiplier is a ratio, and both the numerator (HTML) and the denominator (text) get encoded by the same tokenizer. Switching encoders shifts both sides together. The absolute token counts move; the ratio mostly doesn't. I'd put it at ±15% across tokenizers, but I'll be straight with you: I couldn't load `o200k_base` in my environment to prove that side by side, so treat the ±15% as my estimate, not a measurement.

Here's where I think the whole wave gets it half-right and stops.

The pitch is "convert HTML to markdown and your tokens shrink." True. But it quietly implies the hard part is the format. It isn't. The hard part is what you keep.

`BeautifulSoup.get_text()` already strips every tag. So in a sense it's already doing what markdown extraction does for token count: tags gone, text out. But that text still drags along the menu, the cookie banner copy, the footer links, the "related articles" rail, the newsletter nag. All of it is real text. All of it costs tokens. None of it is the thing you scraped the page for.

So I ran a second measurement. Same pages, but this time I dropped the obvious boilerplate tags (`script`, `style`, `nav`, `header`, `footer`, `aside`, `noscript`, `svg`, `form`) before extracting text.

```python
BOILER = ["script", "style", "nav", "header", "footer",
          "aside", "noscript", "svg", "form"]

soup = BeautifulSoup(raw, "html.parser")
for tag in soup(BOILER):  # soup(...) is shorthand for soup.find_all(...)
    tag.decompose()       # remove the tag and everything inside it
clean = soup.get_text(" ", strip=True)
```

The result:

| page | `get_text` | + drop boilerplate | change |
|---|---|---|---|
| wikipedia /Web_scraping | 6,658 tok | 6,234 tok | -6% |
| bbc.com/news | 2,356 tok | 1,607 tok | -32% |
| mozilla /HTTP/Status | 4,649 tok | 2,703 tok | -42% |

On the MDN docs page, 42% of the tokens left after naive text extraction were boilerplate. Forty-two percent, after I'd already thrown away the HTML. The boilerplate that survives tag stripping is the expensive part. On the BBC homepage, another 32% gone. That's the bit the markdown posts don't mention, because it complicates the sales pitch.
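To see all three sizes for one page side by side (raw HTML, naive text, text with boilerplate dropped), something like the following could work. It is a minimal sketch under assumptions, not the code behind the table above: it swaps BeautifulSoup for the stdlib `html.parser` so it runs with no installs, the names `TextExtractor`, `extract_text`, and `three_way` are made up for illustration, and the default counter is a crude word-and-punctuation count standing in for a tokenizer. Its numbers will not match BeautifulSoup's exactly, and an unclosed boilerplate tag will swallow the rest of the page.

```python
import re
from html.parser import HTMLParser

# Same boilerplate tags as the second measurement.
BOILER = frozenset({"script", "style", "nav", "header", "footer",
                    "aside", "noscript", "svg", "form"})
# Assumption: script/style bodies are never treated as page text.
NON_TEXT = frozenset({"script", "style"})


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping everything inside the `skip` tags."""

    def __init__(self, skip):
        super().__init__()
        self.skip = skip
        self.depth = 0   # > 0 while inside at least one skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html, skip):
    """Return the visible text of `html`, ignoring content inside `skip` tags."""
    parser = TextExtractor(skip)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def rough_count(s):
    """Stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", s))


def three_way(html, count=rough_count):
    """Return (raw, naive_text, clean_text) sizes, all from the same counter."""
    naive = extract_text(html, NON_TEXT)
    clean = extract_text(html, BOILER)
    return count(html), count(naive), count(clean)


page = ("<nav><a href='/'>Home</a></nav><script>var x = 1;</script>"
        "<p>Real text</p><footer>Legal</footer>")
print(three_way(page))
```

To get real token counts instead of the stand-in, pass the meter's encoder as the counter, for example `three_way(raw, count=lambda s: len(enc.encode(s)))`. Keeping one counter for all three sides is the same honesty rule as before: the ratios only mean something if every side is measured the same way.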
"Just convert to markdown" gets you the 7× win. Getting the rest is about deciding what counts as content on your pages, and that's site-specific work no managed API does perfectly for you. The format was never the hard problem. The judgment is. Here's why I care more than someone running this once as a curiosity. A multiplier on a single page is a rounding error. You think: it's a few thousand tokens, who cares. That's exactly what I'd have thought before I'd watched a scraper run on a schedule for a year. I've put 2,190 runs through 32 published actors on Apify — that's a raw lifetime counter off my own dashboard apify.com/knotless cadence, as of May 2026, not a sample . My Trustpilot review scraper alone is at 962 runs. The thing about production scraping that nobody warns you about: the work isn't the first run. It's the same job firing again tomorrow, and the day after, and the day after that. A scheduled scraper is a tax that compounds. So let's do the arithmetic with the median 7.4× and real prices. Take a modest pipeline: 500 pages per run, one run a day. Say the extracted text is about 4,000 tokens per page close to the median text size in my run . At 7.4×, the raw HTML is ~29,600 tokens — so you're shipping ~25,600 wasted tokens per page if you feed HTML instead of text. Now price it. From OpenAI's API pricing page developers.openai.com/api/docs/pricing, retrieved 2026-05-28 — note GPT-4o is no longer listed there; the current lineup is the GPT-5.x family , input token prices are: gpt-5.4 at $2.50 / 1M, gpt-5.4-mini at $0.75 / 1M, gpt-5.4-nano at $0.20 / 1M. Prices change — re-check before you quote them. And to be clear about what I'm not claiming: this is the token cost only. I'm not telling you what a scraper run costs me in compute or storage — I'm not going to make up infrastructure numbers I can't back. The multiplier is measured. The prices are public. The pages-per-run and runs-per-month are yours to plug in. 
I gave you a calculator, not a verdict on your bill.

But notice the shape of it. The per-page waste, about $0.06 on gpt-5.4, feels like nothing. Multiply by your real volume and your real schedule and it's a salary line. That's the gap between "I ran a token counter once" and "I run scrapers in production." A one-time multiplier is trivia. A multiplier on a cron schedule is a budget.

Not a framework, not a managed API. Three habits.

1. Strip before you tokenize, not after. `get_text()` on raw HTML is the lazy default and it leaves 6–42% on the table depending on the page. Drop `script` / `style` / `nav` / `footer` first. It's four lines.
2. Measure your own pages, not someone's blog claim. My median is 7.4×; yours could be 3× or 30× depending on whether you scrape clean docs or bloated SPAs. The meter is right there. Spend the five minutes.
3. Watch the schedule, not the page. The decision that matters isn't "HTML or markdown for this one page." It's "what am I sending the model, times how many pages, times how many runs." Optimize the thing that repeats.

I ran the meter expecting a tidy 10× to confirm the headline I had in my head. I got a 7.4× median and a 47.8× outlier that taught me more than the median did. The number isn't the lesson. The spread is, along with the fact that whatever your number is, you pay it again every night.

Written by Alexey Spinov. I run production scrapers: 2,190 lifetime runs across 32 actors, Trustpilot scraper at 962 of them (apify.com/knotless_cadence). The meter and both measurements in this post were run live on 2026-05-28; every URL returned 200, and the output blocks are copied straight from the run, not cleaned up.

Disclosure: I drafted this with help from an AI writing assistant. The code, the measurements, and the numbers are mine, run on real pages, not generated.

Paying an LLM bill that's mostly HTML markup, or running a scraper on a schedule that's quietly compounding? I've put 2,190 runs through production and I count the tokens before they hit the model. If you want a scraping-to-LLM pipeline that ships clean text instead of scaffolding, tell me what you're scraping and I'll scope it: spinov001@gmail.com.