How to scrape unstructured website data and turn it into structured JSON for AI

Developers building AI search or RAG pipelines often struggle to extract structured data from websites with inconsistent page layouts. A new approach uses a scraping layer that handles AI extraction natively, converting HTML to Markdown first to reduce token overhead by 90% before sending it to an LLM, then returning clean structured JSON. This method minimizes costs and rate limits while enforcing consistent output shapes across varied pages, enabling scalable vector database ingestion.

Here is a problem a lot of developers run into when building AI search or RAG pipelines. You have a website with hundreds of pages. The content is inconsistent. Product pages, article pages, listing pages, all laid out slightly differently from each other. You want to extract structured data from all of them, push it into a vector database, and make it searchable.

The obvious approach is to loop through every page, send the HTML to an LLM with a structured output prompt, and save what comes back. It works. Then you have 300 pages to process, rate limits kicking in every few minutes, a token bill that adds up faster than you expected, and you realize the HTML you are sending is mostly navigation, scripts, ads, and footer content rather than the actual page data you care about.

This guide covers the right way to approach this problem: how to minimize token overhead, enforce consistent output shapes across inconsistent pages, avoid hammering an LLM directly for every extraction, and build a pipeline that scales.

The core problem with scraping unstructured pages

Most web pages are not designed for machines. The same information appears in different positions on different pages. Class names change between sections. Some fields are present on some pages and absent on others.
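
To make the inconsistency concrete, here is a minimal sketch using only Python's standard library. The two HTML fragments, their class names, and the `extract_by_class` helper are all hypothetical, invented for illustration; the class lookup stands in for a selector such as `.price` and is not a full CSS selector engine.

```python
from html.parser import HTMLParser

# Two hypothetical product-page fragments: same data, different markup.
PAGE_A = '<h1>Blue Kettle</h1><span class="price">$24.00</span>'
PAGE_B = '<h2 class="product-name">Blue Kettle</h2><div class="cost">$24.00</div>'


class ClassTextExtractor(HTMLParser):
    """Collects the text of the first element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute parses as None, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.text is None and self.class_name in classes:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.text = data.strip()
            self.capturing = False


def extract_by_class(html, class_name):
    """Hypothetical helper: text of the first element with `class_name`, else None."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.text


price_a = extract_by_class(PAGE_A, "price")  # matches the span on page A
price_b = extract_by_class(PAGE_B, "price")  # no match: page B uses class "cost"
```

The same rule returns the price for the first fragment and `None` for the second, even though both contain the same value. Multiply that by every field and every layout variant and the rule set grows quickly.
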
There is no guarantee that the title is always in an h1, the price is always in a .price element, or the category is labeled consistently. Traditional scraping with CSS selectors breaks here. You cannot write a selector for "the price" when the price appears in a