HTML vs Markdown for AI: which format is better for LLMs?
A comparison of raw HTML and Markdown formats for feeding web content into large language models shows that HTML contains significant boilerplate—navigation menus, scripts, cookie banners, and ads—that wastes tokens without contributing useful information. Markdown, by stripping away this extraneous code, preserves only the meaningful content, reducing token count and improving model efficiency for tasks like RAG, fine-tuning, and in-context retrieval. The choice between formats depends on the specific application, but Markdown generally offers a cleaner, more token-efficient alternative for LLM processing.
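To make the boilerplate claim concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It drops the contents of a few tags that usually hold boilerplate and keeps the remaining visible text. The tag list in `SKIP_TAGS`, the `VisibleTextExtractor` and `extract_text` names, and the sample page are all assumptions made for illustration. The output is plain text rather than real Markdown, and character counts stand in as a crude proxy for tokens.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate.
# This list is an assumption for illustration, not a complete one.
SKIP_TAGS = {"head", "script", "style", "nav", "header", "footer", "aside"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that sits outside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# A made-up page, much smaller than a real news article.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<script>window.analytics = {};</script>
<style>.nav { display: flex; }</style>
</head>
<body>
<nav><a href="/">Home</a><a href="/news">News</a></nav>
<article><h1>Headline</h1><p>The article text.</p></article>
<footer><a href="/privacy">Privacy</a></footer>
</body>
</html>"""

text = extract_text(SAMPLE_HTML)
print(text)
print(len(SAMPLE_HTML), "characters of HTML,", len(text), "characters of text")
```

A real pipeline would use a proper HTML-to-Markdown converter and the target model's own tokenizer for counting. The point here is only that most of the bytes in even a tiny page are not the article.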
If you are feeding web content into a large language model, whether for RAG, fine-tuning, or in-context retrieval, the format you use matters more than most people realise when they first start building.

The obvious choice is to grab the raw HTML from a page and pass it in. It is what the browser receives. It contains everything. Why not use it?

The problem is that "everything" includes a lot of things that are not the content you care about. Navigation menus, script tags, CSS classes, cookie banners, ad slots, footer links repeated across every page on the site. These take up tokens and contribute nothing to what the model learns or retrieves.

This article looks at what raw HTML actually contains versus what Markdown looks like for the same page, how the token count compares, what each format preserves and loses, and which one to use depending on what you are building.

What is actually in raw HTML

Take a typical news article page. Before any of the article text appears, a raw HTTP response from a modern website might look something like this:

<!DOCTYPE html