{"slug": "ai-powered-web-scraping-with-rust", "title": "AI-powered web scraping with Rust", "summary": "A new AI-powered approach to web scraping replaces fragile CSS selectors with natural language prompts and returns structured JSON output, making data extraction more resilient to site redesigns and dynamic content. The Spidra Rust SDK enables developers to describe desired data in plain English, define a JSON schema for output, and execute a single async call that returns typed results. This shift from selector-based to prompt-based extraction reduces maintenance costs and integrates directly into LLM pipelines and Rust AI agent workflows.", "body_md": "Traditional web scraping treats extraction as a parsing problem: find the right CSS selector, pull the text, repeat. That model works until the site redesigns, adds dynamic loading, or starts obfuscating class names. In 2026, AI-powered scraping replaces selectors with natural language prompts and returns guaranteed structured output, a better fit for LLM pipelines, agent workflows, and any system that needs typed data rather than raw HTML.\n\nThis guide covers how to extract structured data from websites in Rust using AI: the shift from selector-based to prompt-based extraction, JSON schema output for LLM pipelines, wiring scraped data into Rust AI agents, and production patterns for async batch pipelines.\n\n**Quick answer:** To extract structured data from websites in Rust using AI, describe what you want in plain English, define a JSON schema for the output shape, and let an AI scraping layer handle parsing and normalization. The Spidra Rust SDK (`cargo add spidra`) exposes this as a single async call that returns a typed result.\n\n## How AI changes web scraping\n\nSelector-based scrapers are tightly coupled to page structure. A `div.product-price > span.amount` selector breaks the moment a developer renames a class or restructures the DOM. 
Maintaining dozens of these across different sites is a significant ongoing cost.\n\nLLMs change the equation. Instead of encoding page structure into your code, you describe the data you want in plain English. The model reads the rendered content, understands context, and returns structured output regardless of how the underlying HTML is organized.\n\nFor Rust AI agent scraping, this matters because:\n\n- **Typed pipelines.** When your scraper returns a JSON object that maps to a Rust struct, serde rejects shape mismatches at the deserialization boundary and the compiler checks every use of the data after that, so malformed records do not travel further into the pipeline.\n- **Resilience to layout changes.** Prompt-based scrapers degrade gracefully. A redesign that breaks a CSS selector often has no impact on a well-written prompt.\n- **Less code to maintain.** Replacing fifty selectors with one prompt eliminates a whole class of breakage.\n\n## The old way vs the new way\n\nHere is the same extraction task done both ways: pulling job listing details from a page.\n\n### Selector-based (old way)\n\n```\nuse reqwest::Client;\nuse scraper::{Html, Selector};\n\n#[tokio::main]\nasync fn main() -> Result<(), Box<dyn std::error::Error>> {\n    let client = Client::builder()\n        .user_agent(\"Mozilla/5.0\")\n        .build()?;\n\n    let html = client\n        .get(\"https://jobs.example.com/senior-engineer\")\n        .send()\n        .await?\n        .text()\n        .await?;\n\n    let document = Html::parse_document(&html);\n\n    // Fragile: breaks on any DOM change\n    let title_sel = Selector::parse(\"h1.job-title\").unwrap();\n    let company_sel = Selector::parse(\"span.company-name\").unwrap();\n    let salary_sel = Selector::parse(\"div.salary-range > span.value\").unwrap();\n\n    let title = document.select(&title_sel)\n        .next()\n        .map(|e| e.text().collect::<String>())\n        .unwrap_or_default();\n\n    let company = document.select(&company_sel)\n        .next()\n        .map(|e| e.text().collect::<String>())\n        .unwrap_or_default();\n\n    let salary 
= document.select(&salary_sel)\n        .next()\n        .map(|e| e.text().collect::<String>())\n        .unwrap_or_default();\n\n    println!(\"{} at {} - {}\", title, company, salary);\n    Ok(())\n}\n```\n\nThis works until the site changes. And sites change.\n\n### Prompt-based with JSON schema (new way)\n\n```\nuse spidra::{SpidraClient, types::scrape::{ScrapeParams, OutputFormat}};\nuse serde::Deserialize;\nuse serde_json::json;\n\n#[derive(Deserialize, Debug)]\nstruct JobListing {\n    title: String,\n    company: String,\n    location: Option<String>,\n    salary_min: Option<f64>,\n    salary_max: Option<f64>,\n    // Not listed in `required`, so fall back to an empty list if the key is omitted\n    #[serde(default)]\n    skills: Vec<String>,\n}\n\n#[tokio::main]\nasync fn main() -> Result<(), Box<dyn std::error::Error>> {\n    let client = SpidraClient::new(\"your-api-key\");\n\n    let mut params = ScrapeParams::new(\"https://jobs.example.com/senior-engineer\");\n    params.prompt = Some(\"Extract the job title, company name, location, salary range, and required skills\".to_string());\n    params.output_format = Some(OutputFormat::Json);\n    params.schema = Some(json!({\n        \"type\": \"object\",\n        \"required\": [\"title\", \"company\"],\n        \"properties\": {\n            \"title\": { \"type\": \"string\" },\n            \"company\": { \"type\": \"string\" },\n            \"location\": { \"type\": [\"string\", \"null\"] },\n            \"salary_min\": { \"type\": [\"number\", \"null\"] },\n            \"salary_max\": { \"type\": [\"number\", \"null\"] },\n            \"skills\": {\n                \"type\": \"array\",\n                \"items\": { \"type\": \"string\" }\n            }\n        }\n    }));\n\n    let result = client.scrape().run(&params).await?;\n\n    // Deserialize directly into your Rust struct; panics with a clear message if no data came back\n    let listing: JobListing = serde_json::from_value(result.data.expect(\"scrape returned no data\"))?;\n    println!(\"{} at {} ({:?})\", listing.title, listing.company, listing.location);\n    println!(\"Skills: {:?}\", listing.skills);\n\n    
Ok(())\n}\n```\n\nThe schema enforces output shape. Fields in `required` always appear, as `null` if the data is not found, so declare any required field that can be missing as nullable. Fields outside `required` are omitted when unavailable. serde maps both an explicit `null` and a missing key to `None` for `Option<T>` fields, so your Rust struct deserializes cleanly against the schema.\n\n## Structured output with JSON schema for LLM pipelines\n\nThe schema field is the most important feature for AI agent scraping in Rust. It turns unpredictable HTML into a typed value you can deserialize directly into a Rust struct, pass to an LLM as structured context, or store in a database.\n\nA few rules worth knowing:\n\n- Mark anything that might be missing as `[\"type\", \"null\"]`, where `type` stands for the field's actual type, for example `[\"string\", \"null\"]`. The API then returns `null` for a required field it cannot find rather than omitting it, so your `Option<T>` fields map cleanly.\n- Put all fields you always need in `required`. Put optional enrichment fields outside it.\n- Enum fields work: `\"enum\": [\"full_time\", \"part_time\", \"contract\", null]`.\n\nExample schema for a product listing:\n\n```rust\nlet schema = json!({\n    \"type\": \"object\",\n    \"required\": [\"name\", \"price\", \"in_stock\"],\n    \"properties\": {\n        \"name\": { \"type\": \"string\" },\n        \"price\": { \"type\": \"number\" },\n        \"currency\": { \"type\": [\"string\", \"null\"] },\n        \"in_stock\": { \"type\": \"boolean\" },\n        \"rating\": { \"type\": [\"number\", \"null\"] },\n        \"review_count\": { \"type\": [\"number\", \"null\"] }\n    }\n});\n```\n\n## Wiring scraped data into a Rust AI agent\n\nHere is a complete example of a Rust AI agent that uses `async-openai` and Spidra together. 
The agent scrapes a URL for context, then passes the structured result to an LLM.\n\n```\n[dependencies]\nspidra = \"0.1\"\nasync-openai = \"0.23\"\ntokio = { version = \"1\", features = [\"full\"] }\nserde_json = \"1\"\n```\n\n```\nuse async_openai::{Client as OpenAIClient, types::{ChatCompletionRequestUserMessageArgs, CreateChatCompletionRequestArgs}};\nuse spidra::{SpidraClient, types::scrape::ScrapeParams};\n\n#[tokio::main]\nasync fn main() -> Result<(), Box<dyn std::error::Error>> {\n    let spidra = SpidraClient::new(\"your-spidra-key\");\n    let openai = OpenAIClient::new();\n\n    // Step 1: scrape the page\n    let mut params = ScrapeParams::new(\"https://news.ycombinator.com\");\n    params.prompt = Some(\"List the top 10 stories with title, URL, and point count\".to_string());\n\n    let scrape_result = spidra.scrape().run(&params).await?;\n\n    let context = serde_json::to_string_pretty(&scrape_result.content)?;\n\n    // Step 2: pass to LLM\n    let request = CreateChatCompletionRequestArgs::default()\n        .model(\"gpt-4o\")\n        .messages(vec![\n            ChatCompletionRequestUserMessageArgs::default()\n                .content(format!(\n                    \"Here are the current top stories on Hacker News:\\n\\n{}\\n\\nWhich story would be most relevant to a Rust developer and why?\",\n                    context\n                ))\n                .build()?\n                .into(),\n        ])\n        .build()?;\n\n    let response = openai.chat().create(request).await?;\n    let answer = &response.choices[0].message.content;\n    println!(\"{:?}\", answer);\n\n    Ok(())\n}\n```\n\nThe key pattern: scrape first, structure the data, pass it as context. The LLM gets clean typed information rather than raw HTML noise.\n\n## Batch extraction at scale\n\nFor processing many URLs in parallel (competitor pages, product listings, directory entries), use the batch endpoint. 
It processes up to 50 URLs concurrently:\n\n```\nuse spidra::types::batch::BatchScrapeParams;\nuse spidra::types::scrape::ScrapeUrl;\n\n// `client` is a `SpidraClient` created as in the earlier examples\nlet urls = vec![\n    ScrapeUrl::new(\"https://shop.example.com/product/1\"),\n    ScrapeUrl::new(\"https://shop.example.com/product/2\"),\n    ScrapeUrl::new(\"https://shop.example.com/product/3\"),\n];\n\nlet batch = client.batch().run(&BatchScrapeParams::new(urls)).await?;\n\nlet successful: Vec<_> = batch.items.iter()\n    .filter(|item| item.status == \"completed\")\n    .collect();\n\nprintln!(\"{}/{} succeeded\", successful.len(), batch.items.len());\n\nfor item in successful {\n    println!(\"{}: {:?}\", item.url, item.result);\n}\n```\n\n## Production patterns for async Rust scraping pipelines\n\n### Retry with exponential backoff\n\n```\nuse std::time::Duration;\nuse tokio::time::sleep;\nuse spidra::{SpidraClient, types::scrape::ScrapeParams};\n\nasync fn scrape_with_retry(\n    client: &SpidraClient,\n    params: ScrapeParams,\n    max_retries: u32,\n) -> Result<spidra::types::scrape::ScrapeResult, spidra::error::SpidraError> {\n    let mut attempt = 0;\n    loop {\n        match client.scrape().run(&params).await {\n            Ok(result) => return Ok(result),\n            Err(e) if attempt < max_retries => {\n                // Back off 1s, 2s, 4s, ... before the next attempt\n                let wait = Duration::from_secs(2u64.pow(attempt));\n                eprintln!(\"Attempt {} failed: {}. 
Retrying in {:?}\", attempt + 1, e, wait);\n                sleep(wait).await;\n                attempt += 1;\n            }\n            Err(e) => return Err(e),\n        }\n    }\n}\n```\n\n### Processing a URL queue concurrently\n\n```\nuse tokio::task::JoinSet;\nuse std::sync::Arc;\nuse spidra::{SpidraClient, types::scrape::ScrapeParams};\n\nasync fn process_urls(urls: Vec<String>, api_key: &str) {\n    let client = Arc::new(SpidraClient::new(api_key));\n    let mut set = JoinSet::new();\n\n    // Spawn one task per URL; each task shares the client through the Arc\n    for url in urls {\n        let client = Arc::clone(&client);\n        set.spawn(async move {\n            let mut params = ScrapeParams::new(&url);\n            params.prompt = Some(\"Extract the page title and main content summary\".to_string());\n            client.scrape().run(&params).await\n        });\n    }\n\n    // Results arrive in completion order, not submission order\n    while let Some(result) = set.join_next().await {\n        match result {\n            Ok(Ok(scrape_result)) => println!(\"Success: {:?}\", scrape_result.content),\n            Ok(Err(e)) => eprintln!(\"Scrape error: {}\", e),\n            Err(e) => eprintln!(\"Task panicked or was cancelled: {}\", e),\n        }\n    }\n}\n```\n\n## Full end-to-end example: competitive pricing agent\n\nA compact agent that scrapes competitor pricing pages in a single batch and prints each result as a report:\n\n```\nuse spidra::{SpidraClient, types::batch::BatchScrapeParams, types::scrape::ScrapeUrl};\nuse serde::Deserialize;\n\n// Shape a product record is expected to take; not yet wired into the batch call below\n#[allow(dead_code)]\n#[derive(Deserialize, Debug)]\nstruct Product {\n    name: String,\n    price: f64,\n    currency: Option<String>,\n    on_sale: Option<bool>,\n}\n\n#[tokio::main]\nasync fn main() -> Result<(), Box<dyn std::error::Error>> {\n    let client = SpidraClient::new(\"your-api-key\");\n\n    let competitor_urls = vec![\n        \"https://competitor-a.com/widgets\".to_string(),\n        \"https://competitor-b.com/widgets\".to_string(),\n        \"https://competitor-c.com/widgets\".to_string(),\n    ];\n\n    let urls: Vec<ScrapeUrl> = competitor_urls.into_iter().map(ScrapeUrl::new).collect();\n    let batch = client.batch().run(&BatchScrapeParams::new(urls)).await?;\n\n    for item in &batch.items {\n        if item.status == \"completed\" {\n            println!(\"--- {} ---\", item.url);\n            if let Some(result) = &item.result {\n                println!(\"{}\", serde_json::to_string_pretty(result)?);\n            }\n        }\n    }\n\n    Ok(())\n}\n```\n\n## Wrapping up\n\nAI-powered web scraping in Rust replaces brittle selectors with durable prompts and delivers typed structured output that integrates cleanly with Rust's type system. The JSON schema contract means your `Option<T>` fields and enums map directly to what the API returns, with no post-processing or defensive null checks sprinkled throughout your code.\n\nFor production Rust AI agent scraping, the Spidra SDK handles the browser automation, proxy rotation, and CAPTCHA infrastructure. Your code stays focused on what to do with the data.\n\nInstall: `cargo add spidra`\n\n*Spidra is a web scraping API with AI-powered extraction, proxy rotation, and CAPTCHA handling. **Try it free at spidra.io**.*", "url": "https://wpnews.pro/news/ai-powered-web-scraping-with-rust", "canonical_source": "https://spidra.io/blog/ai-powered-web-scraping-with-rust", "published_at": "2026-05-05 00:00:00+00:00", "updated_at": "2026-06-03 09:41:35.475597+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-tools", "ai-agents", "natural-language-processing"], "entities": ["Spidra"], "alternates": {"html": "https://wpnews.pro/news/ai-powered-web-scraping-with-rust", "markdown": "https://wpnews.pro/news/ai-powered-web-scraping-with-rust.md", "text": "https://wpnews.pro/news/ai-powered-web-scraping-with-rust.txt", "jsonld": "https://wpnews.pro/news/ai-powered-web-scraping-with-rust.jsonld"}}