Reduce LLM Token Waste in RAG with Markdown

A developer describes a technique to reduce token waste when feeding web content to large language models in RAG pipelines. Rendering dynamic pages in a headless browser and converting the DOM to clean Markdown can cut token consumption by 90% or more while preserving semantic structure and improving retrieval accuracy.

Feeding raw HTML to Large Language Models wastes tokens on markup, scripts, and styling. By rendering dynamic web pages in a headless browser and converting the final DOM to clean Markdown, you can reduce token consumption by 90% or more while preserving semantic structure and improving retrieval accuracy in RAG pipelines.

Building Retrieval-Augmented Generation (RAG) pipelines over web data introduces a specific data engineering problem. The web is built on HTML. Large Language Models operate on tokens. When you pass raw HTML to an embedding model or an LLM context window, you pay a steep tax. You pay for `<div class="mt-4 flex flex-col justify-center">`, `<script type="application/json">`, SVG paths, and inline CSS. These non-semantic tokens dilute the actual content. They increase latency, exhaust context limits, and drive up API costs.

Worse, this noise degrades your embeddings. When an embedding model processes a chunk of text dominated by CSS classes and HTML attributes, the resulting vector represents the markup structure more heavily than the actual information. This leads to poor retrieval performance. When a user queries your RAG system, the vector database returns chunks based on matching HTML boilerplate rather than semantic relevance.

Markdown solves the HTML tax problem. It preserves semantic meaning without the syntactic overhead of HTML. It maintains hierarchical structure through headers, relationships through links, and tabular data through Markdown tables.

A standard product page or a long-form article converted from HTML to Markdown often drops from 50,000 tokens to roughly 3,000. This 94% reduction in token count translates directly to lower inference costs and higher context density. When you feed clean Markdown into a context window, the LLM processes dense, high-signal information. It pays attention to the data you care about.
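To make the token arithmetic above concrete, here is a rough sketch that compares an HTML fragment with its Markdown equivalent. It assumes a crude regex-based token estimate rather than a real model tokenizer, and the sample strings are hypothetical, so treat the numbers as directional only.

```python
import re


def rough_token_count(text: str) -> int:
    """Crude tokenizer proxy: count word runs and individual punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def reduction_percent(before: int, after: int) -> float:
    """Percentage of tokens removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical sample inputs, made up for illustration.
html = (
    '<div class="mt-4 flex flex-col justify-center">'
    '<h2 class="text-xl font-bold">Pro Plan</h2>'
    '<p class="text-gray-600 leading-relaxed">Includes '
    '<a href="/docs/api" class="underline">API access</a>.</p></div>'
)
markdown = "## Pro Plan\n\nIncludes [API access](/docs/api)."

html_tokens = rough_token_count(html)
md_tokens = rough_token_count(markdown)
print(f"HTML: {html_tokens} tokens, Markdown: {md_tokens} tokens")
print(f"Reduction: {reduction_percent(html_tokens, md_tokens):.1f}%")

# The article's own figures: 50,000 tokens down to roughly 3,000.
print(reduction_percent(50_000, 3_000))  # 94.0
```

For real cost estimates, swap `rough_token_count` for the tokenizer that matches your embedding model or LLM, since token boundaries differ between models.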
Consider a raw HTML snippet that lists a product's specifications. Converted to Markdown, it becomes:

```markdown title="example.md"
## Specifications
- Weight: 2.4 lbs
- Battery Life: 12 hours
```

The Markdown version contains the exact same information but requires a fraction of the tokens. The LLM understands the header and the list items natively.

Converting static HTML to Markdown is straightforward using libraries like `html2text` or `turndown`. The challenge lies in modern web architecture. Most single-page applications (SPAs) ship an empty `<div id="root">` and render content client-side via JavaScript. If you fetch these pages with a standard HTTP client like `requests` in Python or `curl` in bash, your Markdown converter will output nothing. You capture the loading state, not the data.

You need a headless browser to execute the JavaScript, wait for the network to idle, and then extract the final computed DOM. Doing this at scale introduces significant infrastructure overhead. You must manage a fleet of headless Chrome instances. You have to handle memory leaks, process crashes, and concurrent execution limits.

Beyond browser management, you face access barriers. Many web servers employ strict rate limiting and automated traffic detection, even for publicly accessible data. Fetching the fully rendered DOM requires robust proxy rotation and systems capable of sophisticated anti-bot handling (https://alterlab.io/smart-rendering-api). If you trigger a firewall block or fail to solve a CAPTCHA, your RAG pipeline starves for data.

Before generating the Markdown, it is crucial to sanitize the HTML. Modern web pages contain elements like `<nav>`, `<footer>`, `<aside>`, and hidden modals that contribute no value to the core content. If you convert the entire page blindly, your Markdown will include navigation links, newsletter signups, and related article previews. This reintroduces noise into your RAG pipeline.
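The tag-level pruning described above can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not a production sanitizer: the `BOILERPLATE_TAGS` set, the class name, and the sample page are assumptions made for the example, and it drops whole subtrees by tag name only.

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as boilerplate (an assumption for this sketch).
BOILERPLATE_TAGS = {"nav", "footer", "aside", "script", "style"}


class BoilerplateStripper(HTMLParser):
    """Collects text that is not nested inside any boilerplate tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a boilerplate subtree
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags in malformed markup.
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_boilerplate(html: str) -> str:
    """Return the visible text of `html` with boilerplate subtrees removed."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page used only to exercise the sketch.
page = """
<html><body>
  <nav><a href="/">Home</a><a href="/blog">Blog</a></nav>
  <main><h1>Battery Review</h1><p>The battery lasts 12 hours.</p></main>
  <aside>Subscribe to our newsletter</aside>
  <footer>Copyright Example Inc.</footer>
</body></html>
"""
print(strip_boilerplate(page))
```

Only the heading and paragraph inside `<main>` survive. Filtering by tag name is the simplest possible rule; it will miss boilerplate wrapped in generic `<div>` elements, which is where scoring-based approaches become necessary.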
A robust extraction pipeline evaluates DOM nodes based on text density, link-to-text ratios, and semantic HTML5 tags like `<main>` or `<article>`. It prunes the DOM tree of boilerplate, ensuring the resulting Markdown represents only the primary article or data payload.

When implementing custom conversion pipelines, you must build this sanitization step yourself using tools like Mozilla's Readability.js. Offloading this step eliminates the need to maintain complex DOM pruning rules across diverse web layouts.

Instead of building a complex pipeline with Puppeteer, proxy managers, HTML parsing libraries, and Markdown converters, you can request Markdown directly from the AlterLab API. We built AlterLab to abstract this infrastructure away. Our systems handle the headless browser execution, manage the proxy rotation, sanitize the DOM, and return the data in your requested format.

You pass the target URL to the API. You specify that you want Markdown. AlterLab navigates to the page, waits for JavaScript execution to complete, parses the rendered HTML, strips navigation and footer boilerplate using heuristics, and returns a clean Markdown string.

Here is how to implement this using our Python SDK (https://alterlab.io/web-scraping-api-python).

```python title="rag_scraper.py"
import os

import alterlab

# Read the API key from the environment rather than hard-coding it.
client = alterlab.Client(api_key=os.getenv("ALTERLAB_API_KEY"))

# Render the page in a headless browser and request Markdown output.
response = client.scrape(
    url="https://example-news-site.com/article",
    formats=["markdown"],
    wait_for="networkidle",
)

markdown_content = response.markdown
print(markdown_content)
```

For systems where you prefer standard HTTP requests, the same configuration works via cURL. See the API reference (https://alterlab.io/docs) for full parameter details.
```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-news-site.com/article",
    "formats": ["markdown"],
    "wait_for": "networkidle"
  }'
```

The `wait_for: "networkidle"` parameter ensures the headless browser waits until all client-side rendering completes before extracting the DOM. The `formats: ["markdown"]` parameter handles the conversion pipeline internally.

Once you have clean Markdown, your chunking strategy improves drastically. Standard text chunking methods split text arbitrarily by character count. This often breaks paragraphs in half or separates a table header from its rows, destroying the context the LLM needs to answer queries.

With Markdown, you chunk by semantic boundaries using headers (`#`, `##`, `###`). Markdown-aware text splitters read these headers to keep related concepts together. When a section exceeds your chunk size limit, the splitter drops down to the next header level.

```python title="chunker.py"
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Map each Markdown header marker to the metadata key it should populate.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_content)

for split in md_header_splits:
    print(split.metadata)
    print(split.page_content)
```

This ensures that every chunk sent to your vector database contains complete, logically grouped information. It also preserves the header hierarchy in the metadata, allowing you to filter or weight retrieval results based on section context.

Handling Tabular Data and Complex Structures

Tables present a notorious challenge for RAG systems. HTML tables (`<table>`, `<tr>`, `<td>`) confuse text embedding models. Flattening a table into plain text removes the row and column relationships, rendering the data incomprehensible. Markdown tables maintain a rigid, predictable structure.
```markdown title="table.md"
| Parameter | Type | Description |
|---|---|---|
| url | string | The target webpage URL |
| formats | array | Requested output formats |
```

LLMs parse Markdown tables natively. When a user asks a question requiring data aggregation across columns, the LLM correctly interprets the intersections of the Markdown table provided in the context window. Converting HTML directly to Markdown preserves this critical tabular structure without writing custom extraction logic.

Beyond text and tables, web pages contain images and complex nested structures. Raw HTML encodes images with `<img>` tags, `srcset` attributes, and lazy-loading wrappers. When converting to Markdown, the process extracts the primary `src` and `alt` text, formatting it as `![alt text](image_url)`.

If your RAG system incorporates multimodal LLMs, you can parse these Markdown image tags to fetch and analyze the visual content. The LLM receives the semantic description via the alt text, maintaining context even if you choose not to download the image.

For nested structures like accordions or tabbed interfaces, headless browser execution is paramount. SPAs often delay rendering the content of an inactive tab until the user clicks it. By using interaction features to simulate user clicks before triggering the Markdown extraction, you ensure all hidden content surfaces in the final DOM. This guarantees your RAG pipeline ingests the complete dataset, rather than missing critical information hidden behind UI components.

Operating a RAG ingestion pipeline requires fault tolerance. When scraping dynamic websites, you must account for network timeouts, changing DOM structures, and temporary IP blocks. By relying on an API to handle the extraction and conversion, you reduce your surface area for errors. You do not need to debug Puppeteer timeouts or update Chrome versions. Your error handling focuses entirely on your ingestion logic. Implement exponential backoff for failed requests.
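One way to implement the exponential backoff mentioned above is sketched below. It is a generic retry wrapper, not part of any SDK: the function names, the default retry count, and the 30-second cap are assumptions, and it catches every exception for brevity where a real pipeline would catch only transient errors such as timeouts.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Upper bound on the delay before each retry: base * 2**attempt, capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def call_with_backoff(
    fetch: Callable[[], T],
    retries: int = 4,
    base: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, retrying failed attempts with exponential backoff and jitter."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:  # narrow this to transient errors in real code
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random amount up to the computed bound.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")


# Demo with a hypothetical fetcher that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "# Article title"


# A no-op sleep keeps the demo instant; omit it to use time.sleep.
print(call_with_backoff(flaky_fetch, sleep=lambda seconds: None))
print(backoff_delays(6))
```

The jitter matters when many workers retry at once: without it, they all wake up at the same moments and hit the target again in lockstep.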
Queue URLs for processing rather than executing them synchronously. Monitor the token count of the returned Markdown. If a site undergoes a major redesign, the heuristics stripping boilerplate might fail, resulting in a sudden spike in token count. Set up alerts for unexpected deviations in response size to catch these anomalies early.

Processing web data for AI requires minimizing noise. Extracting dynamically rendered pages directly as Markdown removes token bloat at the source. It simplifies your ingestion pipeline, lowers LLM API costs, and provides your embedding models with highly structured, high-signal text.

By offloading browser rendering, JavaScript execution, and Markdown conversion to an API, your engineering team can focus on improving embedding models and retrieval strategies rather than managing headless Chromium instances. Build data pipelines that scale reliably by treating web extraction as a solved infrastructure primitive.
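As a small appendix to the monitoring advice above, the following sketch flags a response whose token count jumps well past the recent baseline for a site. The median-based rule, the 2x factor, and the minimum sample count are assumptions chosen for illustration; tune them against your own traffic.

```python
from statistics import median


def is_token_spike(
    history: list[int],
    current: int,
    factor: float = 2.0,
    min_samples: int = 5,
) -> bool:
    """Flag `current` if it exceeds `factor` times the median of recent counts."""
    if len(history) < min_samples:
        return False  # not enough data to establish a baseline
    return current > factor * median(history)


# Hypothetical token counts from earlier scrapes of the same site.
recent_counts = [3000, 3100, 2900, 3200, 3050]

print(is_token_spike(recent_counts, 3300))   # within the normal range
print(is_token_spike(recent_counts, 48000))  # boilerplate stripping likely failed
```

A median baseline tolerates the occasional long article better than a mean does, so a single outlier in the history does not mask the next one.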