{"slug": "how-to-scrape-unstructured-website-data-and-turn-it-into-structured-json-for-ai", "title": "How to scrape unstructured website data and turn it into structured JSON for AI", "summary": "Developers building AI search or RAG pipelines often struggle to extract structured data from websites with inconsistent page layouts. A new approach uses a scraping layer that handles AI extraction natively, converting HTML to Markdown first to reduce token overhead by 90% before sending it to an LLM, then returning clean structured JSON. This method minimizes costs and rate limits while enforcing consistent output shapes across varied pages, enabling scalable vector database ingestion.", "body_md": "Here is a problem a lot of developers run into when building AI search or RAG pipelines.\n\nYou have a website with hundreds of pages. The content is inconsistent. Product pages, article pages, listing pages, all laid out slightly differently from each other. You want to extract structured data from all of them, push it into a vector database, and make it searchable.\n\nThe obvious approach is to loop through every page, send the HTML to an LLM with a structured output prompt, and save what comes back. It works. Then you have 300 pages to process, rate limits kicking in every few minutes, a token bill that adds up faster than you expected, and you realize the HTML you are sending is mostly navigation, scripts, ads, and footer content rather than the actual page data you care about.\n\nThis guide covers the right way to approach this problem: how to minimize token overhead, enforce consistent output shapes across inconsistent pages, avoid hammering an LLM directly for every extraction, and build a pipeline that scales.\n\n## The core problem with scraping unstructured pages\n\nMost web pages are not designed for machines. The same information appears in different positions on different pages. Class names change between sections. 
Some fields are present on some pages and absent on others. There is no guarantee that the title is always in an `h1`, the price is always in a `.price` element, or the category is labeled consistently.\n\nTraditional scraping with CSS selectors breaks here. You cannot write a selector for \"the price\" when the price appears in a `<span class=\"price\">` on one page and a `<div class=\"product-cost\">` on another.\n\nThe three options developers usually land on are:\n\n- **Write page-type-specific scrapers.** Map out all the different page layouts, write a parser for each, and maintain them as the site changes. This works but is expensive in time and breaks every time the site redesigns.\n- **Send raw HTML to an LLM.** Let the model figure out where everything is. This works, but the raw HTML includes navigation, scripts, ads, and boilerplate that can easily push a single page to 10,000+ tokens. At scale, the cost and rate limit problems become significant.\n- **Convert HTML to Markdown first, then send to an LLM.** This is better. Research has shown that the same page in Markdown averages around 90% fewer tokens than the raw HTML while preserving the structural content that matters. It still involves one LLM call per page.\n\nThe better approach is to use a scraping layer that handles AI extraction natively so the LLM calls happen inside the scraping infrastructure rather than in your application code, and you only receive clean structured JSON back.\n\n## What good structured output looks like for a vector DB\n\nBefore writing any code, it helps to think about what the output should look like. 
For a searchable product database, you want two things in each record:\n\nA **structured fields object** with the discrete data points: title, price, category, URL, date, and any other fields that are filterable or sortable.\n\nA **text content field** for semantic search: the full cleaned body text of the page, stripped of boilerplate, that you will embed and store in the vector database.\n\nKeeping these separate matters. Structured fields support exact filtering (\"show me all cars under $20,000\"). The text content supports semantic search (\"find me something with good fuel economy for city driving\"). You want both.\n\nA good record looks something like this:\n\n```\n{\n  \"url\": \"https://example.com/products/fiat-500\",\n  \"title\": \"Fiat 500 Hatchback 2024\",\n  \"price\": 18990,\n  \"currency\": \"USD\",\n  \"categories\": [\"car\", \"hatchback\", \"city car\"],\n  \"brand\": \"Fiat\",\n  \"description\": \"The Fiat 500 is a compact city car with a 1.0L mild hybrid engine...\",\n  \"content\": \"Full cleaned page text for embedding goes here...\",\n  \"scraped_at\": \"2026-05-01T10:22:00Z\"\n}\n```\n\nThe `content` field is what you embed and store in pgvector. 
The rest of the fields are stored as structured columns for filtering.\n\n## Why raw HTML is the wrong input format\n\nIf you send raw page HTML to an LLM, here is what you are actually sending:\n\n```\n<!DOCTYPE html>\n<html>\n<head>\n  <meta charset=\"UTF-8\">\n  <title>Fiat 500 | Cars for Sale</title>\n  <link rel=\"stylesheet\" href=\"/css/main.css\">\n  <script src=\"/js/analytics.js\"></script>\n  <!-- 40 more lines of head content -->\n</head>\n<body>\n  <nav class=\"main-nav\">\n    <ul>\n      <li><a href=\"/cars\">Cars</a></li>\n      <li><a href=\"/bikes\">Bikes</a></li>\n      <!-- navigation items -->\n    </ul>\n  </nav>\n  <header class=\"site-header\">\n    <!-- header content -->\n  </header>\n  <!-- actual product content buried somewhere in here -->\n  <footer>\n    <!-- footer repeating across every page -->\n  </footer>\n</body>\n</html>\n```\n\nFor a typical e-commerce product page, the markup that actually contains useful product information is maybe 10 to 20% of the total HTML. The rest is navigation, scripts, CSS, analytics tags, cookie banners, and footers that repeat on every single page. You are paying to send all of that to the LLM on every request.\n\nStripping that boilerplate before extraction is not optional, it is the first thing you should do.\n\n## The right approach: Structured extraction at the scraping layer\n\nInstead of fetching raw HTML and then calling an LLM yourself, use a scraping tool that handles both steps. You provide a URL, a plain-English description of what to extract, and an output schema. 
The tool renders the page, strips boilerplate, extracts exactly what you described, and returns structured JSON that matches your schema.\n\nThis eliminates several problems at once:\n\n- No raw HTML in your application code\n- No LLM rate limit management on your side\n- No token overhead from boilerplate\n- Consistent output shape regardless of page structure differences\n- Extraction works even on JavaScript-rendered pages\n\nSpidra handles all of this through a single API. Here is what the full pipeline looks like.\n\n## Building the pipeline with Spidra\n\n### Step 1: Install the SDK\n\n```\npip install spidra\n```\n\n### Step 2: Define your schema\n\nDefine the exact output shape you want. Spidra's AI extraction will match this schema on every page regardless of how the page is structured. Required fields always appear in the output, as `null` if the page does not have that value.\n\n```\nPRODUCT_SCHEMA = {\n    \"type\": \"object\",\n    \"required\": [\"title\", \"price\", \"categories\", \"description\"],\n    \"properties\": {\n        \"title\":       {\"type\": \"string\"},\n        \"brand\":       {\"type\": [\"string\", \"null\"]},\n        \"price\":       {\"type\": [\"number\", \"null\"]},\n        \"currency\":    {\"type\": [\"string\", \"null\"]},\n        \"categories\":  {\n            \"type\": \"array\",\n            \"items\": {\n                \"type\": \"string\",\n                \"enum\": [\"car\", \"bike\", \"truck\", \"van\", \"electric\"]\n            }\n        },\n        \"description\": {\"type\": \"string\"},\n        \"year\":        {\"type\": [\"integer\", \"null\"]},\n        \"mileage_km\":  {\"type\": [\"number\", \"null\"]},\n    }\n}\n```\n\nThe `categories` field uses an enum so the AI matches each page to your predefined taxonomy rather than inventing new category names. 
This is the right way to handle classification across inconsistent pages.\n\n### Step 3: Scrape a single page\n\n``` python\nfrom spidra import SpidraClient, ScrapeParams, ScrapeUrl\nfrom datetime import datetime, timezone\nimport os\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\ndef scrape_product_page(url: str) -> dict:\n    job = spidra.scrape.run_sync(ScrapeParams(\n        urls=[ScrapeUrl(url=url)],\n        prompt=\"\"\"\n            Extract the product details from this page.\n            Normalize price to a number without currency symbols.\n            Map the vehicle type to the closest matching category.\n            Return null for any fields not present on the page.\n        \"\"\",\n        output=\"json\",\n        schema=PRODUCT_SCHEMA,\n        extract_content_only=True,  # strips nav, ads, boilerplate\n    ))\n\n    structured = job.result.content\n    content = job.result.markdown_content or \"\"\n\n    return {\n        \"url\": url,\n        **structured,\n        \"content\": content,\n        \"scraped_at\": datetime.now(timezone.utc).isoformat(),\n    }\n\nrecord = scrape_product_page(\"https://example.com/products/fiat-500\")\nprint(record)\n# Output (shown as JSON)\n{\n    \"url\": \"https://example.com/products/fiat-500\",\n    \"title\": \"Fiat 500 Hatchback 2024\",\n    \"brand\": \"Fiat\",\n    \"price\": 18990,\n    \"currency\": \"USD\",\n    \"categories\": [\"car\"],\n    \"description\": \"The Fiat 500 is a compact city car with a 1.0L mild hybrid engine...\",\n    \"year\": 2024,\n    \"mileage_km\": null,\n    \"content\": \"Full cleaned page text...\",\n    \"scraped_at\": \"2026-05-01T10:22:00Z\"\n}\n```\n\nConsistent output, every time, regardless of how the source page is laid out.\n\n### Step 4: Process 300 URLs without rate limits\n\nInstead of calling an LLM 300 times yourself and managing rate limits, use Spidra's batch endpoint. It processes up to 50 URLs in parallel per request. 
For 300 URLs, send 6 batch requests:\n\n``` python\nfrom spidra import SpidraClient, BatchScrapeParams\nfrom datetime import datetime, timezone\nimport os, json, time\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\ndef batch_scrape(urls: list[str], batch_size: int = 50) -> list[dict]:\n    all_records = []\n\n    for i in range(0, len(urls), batch_size):\n        chunk = urls[i:i + batch_size]\n        print(f\"Processing batch {i // batch_size + 1} ({len(chunk)} URLs)...\")\n\n        batch = spidra.batch.run_sync(BatchScrapeParams(\n            urls=chunk,\n            prompt=\"\"\"\n                Extract the product details from this page.\n                Normalize price to a number without currency symbols.\n                Map the vehicle type to the closest matching category.\n                Return null for any fields not present on the page.\n            \"\"\",\n            output=\"json\",\n            schema=PRODUCT_SCHEMA,\n            extract_content_only=True,\n        ))\n\n        collected_at = datetime.now(timezone.utc).isoformat()\n\n        for item in batch.items:\n            if item.status == \"completed\" and item.result:\n                record = {\n                    \"url\": item.url,\n                    **item.result.content,\n                    \"content\": item.result.markdown_content or \"\",\n                    \"scraped_at\": collected_at,\n                }\n                all_records.append(record)\n            else:\n                print(f\"  Failed: {item.url} — {item.error}\")\n\n    return all_records\n\n# load your URL list\nwith open(\"product_urls.txt\") as f:\n    urls = [line.strip() for line in f if line.strip()]\n\nrecords = batch_scrape(urls)\nprint(f\"Scraped {len(records)} records\")\n\n# save to JSONL for vector DB ingestion\nwith open(\"products.jsonl\", \"w\") as f:\n    for record in records:\n        f.write(json.dumps(record) + \"\\n\")\n```\n\nNo rate limit handling. 
No token management. No boilerplate stripping. 300 URLs processed in 6 batches, each handling up to 50 URLs in parallel.\n\n### Step 5: Embed and store in pgvector\n\nWith the records saved, generate embeddings for the `content` field and store everything in PostgreSQL with pgvector:\n\n``` python\nimport json\nimport os\n\nimport psycopg2\nfrom openai import OpenAI\n\nopenai_client = OpenAI(api_key=os.environ[\"OPENAI_API_KEY\"])\n\ndef get_embedding(text: str) -> list[float]:\n    response = openai_client.embeddings.create(\n        input=text,\n        model=\"text-embedding-3-small\"\n    )\n    return response.data[0].embedding\n\n# connect to postgres\nconn = psycopg2.connect(os.environ[\"DATABASE_URL\"])\ncur = conn.cursor()\n\n# enable pgvector if it is not already enabled in this database\ncur.execute(\"CREATE EXTENSION IF NOT EXISTS vector\")\n\n# create the table\ncur.execute(\"\"\"\n    CREATE TABLE IF NOT EXISTS products (\n        id          SERIAL PRIMARY KEY,\n        url         TEXT UNIQUE NOT NULL,\n        title       TEXT,\n        brand       TEXT,\n        price       NUMERIC,\n        currency    TEXT,\n        categories  TEXT[],\n        description TEXT,\n        year        INTEGER,\n        mileage_km  NUMERIC,\n        content     TEXT,\n        embedding   vector(1536),\n        scraped_at  TIMESTAMPTZ\n    )\n\"\"\")\n\n# load and insert records\nwith open(\"products.jsonl\") as f:\n    for line in f:\n        record = json.loads(line)\n\n        # skip records with no text to embed\n        if not record.get(\"content\"):\n            continue\n\n        embedding = get_embedding(record[\"content\"])\n\n        cur.execute(\"\"\"\n            INSERT INTO products\n                (url, title, brand, price, currency, categories, description,\n                 year, mileage_km, content, embedding, scraped_at)\n            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)\n            ON CONFLICT (url) DO UPDATE SET\n                title       = EXCLUDED.title,\n                price       = EXCLUDED.price,\n                content     = EXCLUDED.content,\n                embedding   = EXCLUDED.embedding,\n                
scraped_at  = EXCLUDED.scraped_at\n        \"\"\", (\n            record[\"url\"],\n            record.get(\"title\"),\n            record.get(\"brand\"),\n            record.get(\"price\"),\n            record.get(\"currency\"),\n            record.get(\"categories\", []),\n            record.get(\"description\"),\n            record.get(\"year\"),\n            record.get(\"mileage_km\"),\n            record[\"content\"],\n            embedding,\n            record[\"scraped_at\"],\n        ))\n\nconn.commit()\ncur.close()\nconn.close()\nprint(\"Done\")\n```\n\n### Step 6: Semantic search with structured filtering\n\nNow query with combined semantic search and structured filters:\n\n``` python\n# assumes get_embedding() from Step 5 and an open cursor `cur`\n# (run this before closing the connection, or reconnect first)\ndef search_products(query: str, category: str = None, max_price: float = None, limit: int = 10):\n    # embed the search query\n    query_embedding = get_embedding(query)\n\n    # build the query with optional filters\n    filters = []\n    filter_params = []\n\n    if category:\n        filters.append(\"%s = ANY(categories)\")\n        filter_params.append(category)\n\n    if max_price is not None:\n        filters.append(\"price <= %s\")\n        filter_params.append(max_price)\n\n    where_clause = (\"WHERE \" + \" AND \".join(filters)) if filters else \"\"\n\n    sql = f\"\"\"\n        SELECT url, title, brand, price, categories, description,\n               1 - (embedding <=> %s::vector) AS similarity\n        FROM products\n        {where_clause}\n        ORDER BY embedding <=> %s::vector\n        LIMIT %s\n    \"\"\"\n\n    # parameters must follow placeholder order:\n    # similarity embedding, filter values, ORDER BY embedding, LIMIT\n    params = [query_embedding, *filter_params, query_embedding, limit]\n\n    cur.execute(sql, params)\n    return cur.fetchall()\n\n# find fuel-efficient city cars under $25,000\nresults = search_products(\n    query=\"fuel efficient city car easy to park\",\n    category=\"car\",\n    max_price=25000,\n    limit=5\n)\n\nfor row in results:\n    print(f\"{row[1]} ({row[4]}) — ${row[3]:,.0f} — similarity: {row[6]:.3f}\")\n# Output\nFiat 500 Hatchback 
2024 (['car']) — $18,990 — similarity: 0.891\nToyota Yaris 2024 (['car']) — $22,450 — similarity: 0.876\nMini Cooper 3-Door 2023 (['car']) — $24,100 — similarity: 0.863\n```\n\nSemantic search combined with exact price and category filtering, all from data that was structured consistently from pages that had nothing consistent about them.\n\n## Handling pages with no JavaScript\n\nFor sites where content is server-rendered and there is no bot protection, you can scrape faster without using a full browser. Spidra detects this automatically and uses the appropriate rendering strategy per page. You do not need to configure anything.\n\nFor sites with bot protection, add `use_proxy=True`:\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=url)],\n    prompt=\"Extract the product details...\",\n    output=\"json\",\n    schema=PRODUCT_SCHEMA,\n    extract_content_only=True,\n    use_proxy=True,\n    proxy_country=\"us\",\n))\n```\n\nSame code, same output, with residential proxy rotation and CAPTCHA solving handled automatically.\n\n## Keeping the corpus fresh\n\nSet up a refresh job that re-scrapes pages on a schedule and updates records where the content hash has changed:\n\n``` python\nimport hashlib\n\ndef needs_update(url: str, new_content: str) -> bool:\n    new_hash = hashlib.md5(new_content.encode()).hexdigest()\n    cur.execute(\"SELECT content FROM products WHERE url = %s\", (url,))\n    row = cur.fetchone()\n    if not row:\n        return True\n    old_hash = hashlib.md5(row[0].encode()).hexdigest()\n    return new_hash != old_hash\n```\n\nOnly re-embed and update records that actually changed. 
This keeps your vector index fresh without re-processing everything on every run.\n\n## The full pipeline summary\n\n```\nURL list\n   ↓\nSpidra batch endpoint (50 URLs in parallel)\n   ↓\nReal browser render + boilerplate stripped + AI extraction + schema enforced\n   ↓\nClean structured JSON (consistent shape regardless of source page layout)\n   ↓\nproducts.jsonl\n   ↓\nEmbed content field with text-embedding-3-small\n   ↓\nInsert into pgvector (structured fields + embedding)\n   ↓\nSemantic search + structured filters\n```\n\nNo LLM rate limits to manage. No token overhead from boilerplate. No selectors to maintain. No parser to update when the site redesigns. The schema enforces consistent output even when the source pages are completely inconsistent with each other.", "url": "https://wpnews.pro/news/how-to-scrape-unstructured-website-data-and-turn-it-into-structured-json-for-ai", "canonical_source": "https://spidra.io/blog/scrape-unstructured-web-data-structured-json-vector-db", "published_at": "2026-06-08 00:00:00+00:00", "updated_at": "2026-06-11 18:46:37.517817+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-tools", "ai-infrastructure", "ai-products"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/how-to-scrape-unstructured-website-data-and-turn-it-into-structured-json-for-ai", "markdown": "https://wpnews.pro/news/how-to-scrape-unstructured-website-data-and-turn-it-into-structured-json-for-ai.md", "text": "https://wpnews.pro/news/how-to-scrape-unstructured-website-data-and-turn-it-into-structured-json-for-ai.txt", "jsonld": "https://wpnews.pro/news/how-to-scrape-unstructured-website-data-and-turn-it-into-structured-json-for-ai.jsonld"}}