Spidra crawl API: how to crawl an entire website and extract data

Spidra launched a crawl API that automatically discovers and extracts data from entire websites by following links, rendering pages, solving CAPTCHAs, and using AI extraction. Users submit a starting URL and plain-English instructions for page discovery and data extraction, then receive structured results from every matched page. The API handles the entire process asynchronously, from link discovery to final data output.

Scraping and crawling are two different problems.

Scraping is for when you know the URLs. You have a list of product pages, a set of job listings, a collection of profiles. You hand those URLs to the scrape endpoint and get data back.

Crawling is for when you do not know the URLs. You know the website, you know what kind of pages you want, but you have not sat down and enumerated them. You want to point Spidra at a domain, describe what to look for, and have it discover and extract data from everything matching that description.

That is what the [crawl API](https://docs.spidra.io/api-reference/crawling/crawl) does. You give it a starting URL, tell it which pages to discover in plain English, tell it what to extract from each one, and it handles the rest. Page discovery, link following, browser rendering, [CAPTCHA solving](https://spidra.io/products/captcha-solver), and AI extraction all happen automatically. You get back structured data from every page it found.

This guide covers the entire crawl API from your first request through re-extraction, history, and real-world pipelines.

## How the crawl API works

Crawl jobs follow the same async pattern as scraping. You submit, receive a `jobId`, and poll until complete. The internal process has five stages:

1. Submit: you send your request and get a `jobId` immediately
2. Discover: Spidra loads your `baseUrl` and finds links matching your `crawlInstruction`
3. Crawl: each discovered page is visited in a real browser, up to your `maxPages` limit
4. Solve: CAPTCHAs are handled automatically on any page that needs them
5. Transform: your `transformInstruction` runs on every crawled page via AI extraction

```
POST /api/crawl          → { jobId: "abc-123" }
GET  /api/crawl/abc-123  → { status: "active", ... }
GET  /api/crawl/abc-123  → { status: "completed", result: [...] }
```

Three fields are required on every crawl request: `baseUrl`, `crawlInstruction`, and `transformInstruction`. Everything else is optional.
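The submit-then-poll pattern above can be wrapped in a small helper so a stuck job does not leave a loop running forever. The following is a minimal sketch, not part of any Spidra SDK: `wait_for_job` and its parameters are hypothetical names, and `fetch_status` is assumed to be any callable that returns the parsed JSON from `GET /api/crawl/{jobId}` with a `status` field of `queued`, `active`, `completed`, or `failed`.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll fetch_status() until the job completes, fails, or times out.

    Hypothetical helper: fetch_status is assumed to return the JSON body
    of GET /api/crawl/{jobId} as a dict.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"Crawl failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Job still '{state}' after {timeout} seconds")
        # Any other state (for example "queued" or "active") means keep waiting.
        sleep(interval)
```

Passing `sleep` and `clock` as arguments is a design choice that keeps the helper testable without real delays. In real use, `fetch_status` could be a lambda wrapping a `requests.get(...).json()` call against the polling endpoint.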
## Your first crawl

### cURL

Submit a crawl:

```bash
curl -X POST https://api.spidra.io/api/crawl \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "baseUrl": "https://books.toscrape.com",
    "crawlInstruction": "Crawl all book listing pages and individual book pages",
    "transformInstruction": "Extract the book title, price, star rating, and availability",
    "maxPages": 10
  }'
```

Response:

```json
{
  "status": "queued",
  "jobId": "7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41",
  "message": "Crawl job queued. Poll /api/crawl/7f3a8b12 for results."
}
```

Poll until complete:

```bash
curl https://api.spidra.io/api/crawl/7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41 \
  -H "x-api-key: YOUR_API_KEY"
```

When the job finishes, the `result` array has one entry per crawled page. Each entry has the `url` and the `data` extracted by your `transformInstruction`:

```json
{
  "status": "completed",
  "jobId": "7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41",
  "result": [
    {
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "data": {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three",
        "availability": "In stock"
      }
    },
    {
      "url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
      "data": {
        "title": "Tipping the Velvet",
        "price": "£53.74",
        "rating": "One",
        "availability": "In stock"
      }
    }
  ]
}
```

## The three required fields

### baseUrl

The page where the crawl starts. Spidra loads this URL first, reads the links on it, and uses those links as the starting point for discovery. The crawl stays within the same domain by default.

Be specific with the starting URL. If you are crawling a blog, start at the blog index rather than the homepage. If you want documentation pages, start at the docs root. Starting closer to the content you want means the crawl reaches it faster and wastes fewer pages on navigation.

```
"baseUrl": "https://docs.example.com"        // good: docs root
"baseUrl": "https://example.com"             // less focused: homepage first
"baseUrl": "https://example.com/blog"        // good: blog root
"baseUrl": "https://competitor.com/pricing"  // good: specific section
```

### crawlInstruction

This tells Spidra which links to follow and which to ignore. Write it as a plain English description of the pages you want. Here are some examples:

```
"crawlInstruction": "Find all blog post pages. Skip tag pages, author pages, and the homepage."

"crawlInstruction": "Crawl all product pages. Ignore cart, checkout, account, and search result pages."

"crawlInstruction": "Find all documentation pages. Skip API reference pages, changelog pages, and login pages."
```

Be explicit about what to skip. Crawl budgets are limited by `maxPages`, and pages spent on navigation, sidebars, or boilerplate are pages not spent on the content you actually want.

### transformInstruction

This tells Spidra what to extract from each page it crawls. It runs as an AI extraction on the fully rendered content of every page. Here are some examples:

```
"transformInstruction": "Extract the article title, author name, publish date, and the full body text. Return null for any field not found on the page."

"transformInstruction": "Extract the product name, current price as a number, currency code, and whether it is in stock."

"transformInstruction": "Extract the page title and a two-sentence summary of the main content."
```

Write it the same way you would write a prompt for the scrape API. Specific field names produce consistent output. Telling the AI what to return when a field is missing prevents silent omissions.
## Python: raw REST API

```python
import requests
import time
import json
import os

API_KEY = os.environ["SPIDRA_API_KEY"]
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Submit the crawl
response = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={
        "baseUrl": "https://books.toscrape.com",
        "crawlInstruction": "Find all individual book pages. Skip category pages, the homepage, and pagination pages.",
        "transformInstruction": "Extract the book title, price excluding tax as a number, star rating as a word, UPC code, and whether it is in stock.",
        "maxPages": 15,
    },
)
response.raise_for_status()
job_id = response.json()["jobId"]
print(f"Crawl started: {job_id}")

# Poll until complete
while True:
    status = requests.get(f"{BASE_URL}/crawl/{job_id}", headers=HEADERS).json()
    print(f"Status: {status['status']}")

    if status["status"] == "completed":
        break
    elif status["status"] == "failed":
        raise Exception(f"Crawl failed: {status.get('error')}")

    time.sleep(5)

# Process results
pages = status["result"]
print(f"\nCrawled {len(pages)} pages")

for page in pages:
    print(f"\n{page['url']}")
    print(page["data"])

# Save as JSONL
with open("books.jsonl", "w") as f:
    for page in pages:
        f.write(json.dumps({"url": page["url"], "data": page["data"]}) + "\n")

print("\nSaved to books.jsonl")
```

## Python SDK

```python
from spidra import SpidraClient, CrawlParams, PollOptions
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://books.toscrape.com",
        crawl_instruction="Find all individual book pages. Skip category pages, the homepage, and pagination pages.",
        transform_instruction="Extract the book title, price excluding tax, star rating, UPC, and stock status.",
        max_pages=15,
    ),
    PollOptions(timeout=300),
)

print(f"Crawled {len(job.result)} pages")
for page in job.result:
    print(page.url, page.data)
```

For crawls that may take a while, use `submit_sync` and `get_sync` to manage polling yourself:

```python
# Submit without waiting
queued = spidra.crawl.submit_sync(
    CrawlParams(
        base_url="https://docs.example.com",
        crawl_instruction="Crawl all documentation pages. Skip changelog and community pages.",
        transform_instruction="Extract the page title and full body text as clean Markdown.",
        max_pages=20,
    )
)
job_id = queued.job_id
print(f"Job submitted: {job_id}")

# Check on it later
import time

while True:
    status = spidra.crawl.get_sync(job_id)
    print(f"Status: {status.status}")
    if status.status in ("completed", "failed"):
        break
    time.sleep(5)

for page in status.result:
    print(page.url)
```

## Node.js SDK

```js
import { SpidraClient } from 'spidra-js'
import { writeFileSync } from 'fs'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

const job = await spidra.crawl.run({
  baseUrl: 'https://books.toscrape.com',
  crawlInstruction: 'Find all individual book pages. Skip category pages, the homepage, and pagination pages.',
  transformInstruction: 'Extract the book title, price, star rating, and availability.',
  maxPages: 15,
})

console.log(`Crawled ${job.result.length} pages`)

const jsonl = job.result
  .map((page) => JSON.stringify({ url: page.url, data: page.data }))
  .join('\n')

writeFileSync('books.jsonl', jsonl)
```

## maxPages: what to expect

`maxPages` defaults to 5 and accepts values between 1 and 50. The crawl stops when it reaches this limit even if there are more pages to discover. If you need to go beyond 50 for a large-scale use case, reach out via the [contact page](https://spidra.io/contact) and the team can adjust it for you.
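To build intuition for how a page budget cuts discovery short, here is a minimal sketch of a breadth-first traversal over a small in-memory link graph. It is an illustration only and not Spidra's implementation: the `discover` function and the `links` dictionary are hypothetical, and the sketch assumes the starting URL counts toward the budget.

```python
from collections import deque
from typing import Dict, List


def discover(links: Dict[str, List[str]], base_url: str, max_pages: int = 5) -> List[str]:
    """Return the pages a breadth-first crawl would visit within max_pages.

    links is a hypothetical site map: each URL maps to the URLs it links to.
    """
    visited: List[str] = []
    seen = {base_url}
    queue = deque([base_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # the base URL itself uses one unit of budget
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


site = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
}
print(discover(site, "/", max_pages=3))
```

With a budget of 3, the deeper pages (`/a1`, `/a2`, `/b1`) are discovered but never visited, which is why pages close to the starting URL tend to win under a small budget.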
A few things worth knowing about how `maxPages` works in practice:

- The first page counted is the `baseUrl` itself. A `maxPages: 10` crawl visits the base URL plus up to 9 additional discovered pages.
- Page discovery is breadth-first. The crawler finds all links on the base URL, then all links on those pages, and so on. Pages found earlier in the discovery order are more likely to be visited within a limited budget.
- Setting a focused `crawlInstruction` matters more the lower your `maxPages` budget. A vague instruction on a large site will fill your budget with navigation and boilerplate. A specific instruction sends the budget toward the content you actually want.

For reference:

- `maxPages: 5` is appropriate for a quick proof of concept or a small documentation section.
- `maxPages: 10` works well for a moderate-sized blog or a product catalog section.
- `maxPages: 50` covers large documentation sites, full competitor blogs, and most production crawl use cases.

## Proxy routing for crawls

Add `useProxy` and `proxyCountry` to route crawl requests through residential proxies. This is useful for sites that block cloud infrastructure, sites with geo-restricted content, or competitor sites that actively rate-limit.

```python
spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://competitor.com/pricing",
        crawl_instruction="Find all pricing, plan, and features pages.",
        transform_instruction="Extract plan names, prices, and included features. Return null for any field not present.",
        max_pages=10,
        use_proxy=True,
        proxy_country="us",
    )
)
```

Country options: any two-letter ISO code (`"us"`, `"de"`, `"gb"`, `"fr"`), `"eu"` to rotate across all 27 EU states, or omit for global rotation.

## Authenticated crawling

Crawl pages behind a login by passing session cookies. Log in through your browser, open DevTools, copy the `Cookie` header from any authenticated request, and pass it as a string.
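If you hold the cookies as name/value pairs rather than as a copied header, they can be joined into the same `name=value; name=value` string format. This is a minimal sketch; `build_cookie_header` is a hypothetical helper and not part of the Spidra SDK, and the cookie names are placeholders.

```python
from typing import Dict


def build_cookie_header(cookies: Dict[str, str]) -> str:
    """Join name/value pairs into a Cookie header string.

    Hypothetical helper: rejects names or values that would break the
    'name=value; name=value' format.
    """
    parts = []
    for name, value in cookies.items():
        if not name or any(ch in name for ch in "=; "):
            raise ValueError(f"Invalid cookie name: {name!r}")
        if ";" in value:
            raise ValueError(f"Cookie value for {name!r} contains ';'")
        parts.append(f"{name}={value}")
    return "; ".join(parts)


cookie_string = build_cookie_header({"session_id": "abc123", "auth_token": "xyz789"})
print(cookie_string)
```

The resulting string can then be passed wherever the crawl request expects the cookies value.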
```python
spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://app.example.com/reports",
        crawl_instruction="Find all monthly report pages from 2025.",
        transform_instruction="Extract the report title, date, and the key metrics table.",
        max_pages=15,
        cookies="session_id=abc123; auth_token=xyz789",
    )
)
```

The cookies are applied to every page visited during the crawl.

## Getting the raw page content

After a crawl completes, you can fetch the raw HTML and Markdown for every page that was crawled. Spidra stores the page content and provides signed download URLs that are valid for one hour.

### Via REST API

```bash
curl https://api.spidra.io/api/crawl/{jobId}/pages \
  -H "x-api-key: YOUR_API_KEY"
```

Response:

```json
{
  "pages": [
    {
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "status": "success",
      "html_url": "https://spidra-storage.s3.amazonaws.com/...?expires=...",
      "markdown_url": "https://spidra-storage.s3.amazonaws.com/...?expires=..."
    }
  ]
}
```

### Via Python SDK

```python
import requests

result = spidra.crawl.pages_sync(job_id)

for page in result.pages:
    print(page.url, page.status)
    # page.html_url: download original HTML
    # page.markdown_url: download cleaned Markdown

    # Download the Markdown
    if page.markdown_url:
        md = requests.get(page.markdown_url).text
        print(md[:500])
```

The signed URLs expire after one hour. Download the content you need promptly after retrieving them.

## Re-extracting without re-crawling

This is one of the most useful features in the crawl API and the one most people do not know about.

Once a crawl job is complete, Spidra stores all the page content. If you want to extract different information from the same set of pages (because your requirements changed, you want additional fields, or you want to try a different extraction approach), you can run a new AI extraction on the stored content without making a single new browser request.

The extract endpoint takes a completed job ID and a new instruction. It returns a new job ID for the re-extraction.
You only pay transformation credits, not crawling credits.

### Via REST API

```bash
curl -X POST https://api.spidra.io/api/crawl/7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transformInstruction": "Extract only the book title and the UPC code as a flat JSON object."
  }'
```

Response:

```json
{
  "jobId": "new-job-id-for-re-extraction"
}
```

Poll this new `jobId` with `GET /api/crawl/{jobId}` until it completes.

### Via Python SDK

```python
import time

# Original crawl that extracted titles and prices
original = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://books.toscrape.com",
        crawl_instruction="Find all individual book pages",
        transform_instruction="Extract book title and price",
        max_pages=10,
    )
)
original_job_id = original.job_id

# Later: re-extract different fields from the same pages
new_job = spidra.crawl.extract_sync(
    original_job_id,
    "Extract only the UPC code and the full product description from each page",
)

# Poll the new extraction job until it reaches a final state
while True:
    result = spidra.crawl.get_sync(new_job.job_id)
    if result.status in ("completed", "failed"):
        break
    time.sleep(5)

for page in result.result:
    print(page.url, page.data)
```

This is particularly useful for iterative data collection. Crawl once, then extract different fields as your pipeline evolves, without paying for browser time again.
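When several extraction passes run over the same stored pages, the results usually need to be combined into one record per URL. The sketch below shows one way to do that, assuming each pass is a list of `{"url": ..., "data": {...}}` dictionaries shaped like the REST `result` array. `merge_extractions` is a hypothetical helper, and the sample field names and values are made up for illustration.

```python
from typing import Any, Dict, List

Page = Dict[str, Any]


def merge_extractions(*runs: List[Page]) -> Dict[str, Dict[str, Any]]:
    """Merge extraction passes into one dict of fields per URL.

    Later passes overwrite earlier ones when the same field appears twice.
    """
    merged: Dict[str, Dict[str, Any]] = {}
    for run in runs:
        for page in run:
            fields = page.get("data") or {}  # tolerate null data for a page
            merged.setdefault(page["url"], {}).update(fields)
    return merged


# Illustrative data only; real field names depend on your instructions.
first_pass = [{"url": "https://example.com/book-1", "data": {"title": "Book One", "price": 10.0}}]
second_pass = [{"url": "https://example.com/book-1", "data": {"upc": "abc123"}}]

combined = merge_extractions(first_pass, second_pass)
print(combined)
```

Keying on the URL works because a re-extraction runs over the same stored pages as the original crawl.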
## Browsing crawl history

```python
from spidra import CrawlHistoryParams

response = spidra.crawl.history_sync(CrawlHistoryParams(page=1, limit=10))
print(f"Total crawl jobs: {response.total}")

for job in response.jobs:
    print(f"{job.base_url} {job.status} {job.pages_crawled} pages")

# Overall stats
stats = spidra.crawl.stats_sync()
print(f"All-time crawls: {stats.total}")
```

Via REST:

```bash
curl "https://api.spidra.io/api/crawl/history?page=1&limit=10" \
  -H "x-api-key: YOUR_API_KEY"
```

## Real-world examples

### Crawling a competitor's blog for content analysis

```python
from spidra import SpidraClient, CrawlParams
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://competitor.com/blog",
        crawl_instruction="Follow links to blog posts only. Skip author pages, tag pages, category pages, and the blog index.",
        transform_instruction="Extract the post title, author name, publish date in ISO format, and a one-sentence summary of the main argument.",
        max_pages=20,
        use_proxy=True,
    )
)

posts = [p.data for p in job.result if p.data]
print(f"Found {len(posts)} blog posts")

# Save for analysis
with open("competitor_posts.jsonl", "w") as f:
    for page in job.result:
        if page.data:
            f.write(json.dumps({"url": page.url, "data": page.data}) + "\n")
```

### Crawling books.toscrape.com for a full product catalogue

books.toscrape.com is a fictional bookstore built specifically for scraping practice. It has 1000 books across 50 pages. With `maxPages: 30` or more you can get a thorough sample of individual book pages with full details.

```python
job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
        crawl_instruction="Crawl all individual book pages in the Mystery category. Skip pagination pages and category index pages.",
        transform_instruction="Extract the book title, price excluding tax as a number without the pound sign, price including tax as a number, star rating as a word, UPC, number in stock, and the full product description.",
        max_pages=20,
    )
)

print(f"Extracted {len(job.result)} mystery books")
for page in job.result[:3]:
    print(f"\n{page.url}")
    print(json.dumps(page.data, indent=2))
```

### Crawling a documentation site for a RAG pipeline

```python
import os, json, re, requests
from spidra import SpidraClient, CrawlParams, PollOptions

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

# Crawl the documentation
job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://docs.example.com",
        crawl_instruction="Follow all documentation pages. Skip the changelog, the API reference, and login pages.",
        transform_instruction="Extract the page title and the full body content as clean Markdown. Preserve all headings, code examples, and numbered lists. Return null for pages with no meaningful content.",
        max_pages=20,
    ),
    PollOptions(timeout=600),
)

print(f"Crawled {len(job.result)} pages")

# Fetch the Markdown for each page and chunk it
raw_pages = spidra.crawl.pages_sync(job.job_id)

chunks = []
for page in raw_pages.pages:
    if page.markdown_url and page.status == "success":
        md = requests.get(page.markdown_url).text

        # Split at heading boundaries (levels 1 to 3)
        sections = re.split(r'\n(?=#{1,3} )', md)
        for section in sections:
            # Keep only sections long enough to be useful as chunks
            if len(section.strip()) > 200:
                chunks.append({
                    "url": page.url,
                    "content": section.strip(),
                })

print(f"Created {len(chunks)} chunks for vector indexing")

with open("docs_chunks.jsonl", "w") as f:
    for chunk in chunks:
        f.write(json.dumps(chunk) + "\n")
```

### Weekly competitive intelligence crawl

```python
import os, json
from datetime import datetime, timezone
from pathlib import Path
from spidra import SpidraClient, CrawlParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

COMPETITORS = [
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/plans",
]

today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
results = {}

for url in COMPETITORS:
    print(f"Crawling {url}...")
    job = spidra.crawl.run_sync(
        CrawlParams(
            base_url=url,
            crawl_instruction="Find all pricing pages, plan comparison pages, and feature pages. Skip blog, docs, and help pages.",
            transform_instruction="Extract all plan names, monthly prices, annual prices, and the list of features or limitations for each plan. Return null for any field not present.",
            max_pages=5,
            use_proxy=True,
        )
    )
    results[url] = {
        "crawled_at": today,
        "pages": [{"url": p.url, "data": p.data} for p in job.result],
    }

# Save snapshot
Path("snapshots").mkdir(exist_ok=True)
with open(f"snapshots/pricing-{today}.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"\nSnapshot saved: snapshots/pricing-{today}.json")
```

## Crawl API vs scrape API: when to use each

The decision is straightforward once you think about it in terms of what you know upfront.

Use the scrape API when you already have the URLs. A list of product pages, a set of known job listings, specific competitor pages you monitor regularly. You have the URLs, you just need the data.

Use the crawl API when you do not have the URLs yet. You know the website, you know what kind of pages you want, but you need Spidra to discover them for you. A competitor's blog you have not catalogued. A documentation site you want to index. A new site section that was just published.

The crawl API is also the better choice when the number of pages is unpredictable. If a competitor's pricing page might be one URL or five, the crawl endpoint discovers and processes all of them without you needing to know the count in advance.

For large-scale collection where you already have hundreds of URLs, the batch scrape endpoint processes up to 50 in parallel per request and is more efficient than crawling.
## Crawl API reference

### Endpoints

| Method | Endpoint | Purpose |
|---|---|---|
| POST | `/api/crawl` | Submit a crawl job |
| GET | `/api/crawl/{jobId}` | Poll for status and results |
| GET | `/api/crawl/{jobId}/pages` | Get per-page HTML and Markdown download URLs |
| POST | `/api/crawl/{jobId}/extract` | Re-run extraction on stored pages |
| GET | `/api/crawl/{jobId}/download` | Download all results as a ZIP |
| GET | `/api/crawl/history` | List past crawl jobs |

### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `baseUrl` | string | Yes | none | Starting URL for the crawl |
| `crawlInstruction` | string | Yes | none | Which pages to discover |
| `transformInstruction` | string | Yes | none | What to extract from each page |
| `maxPages` | integer | No | 5 | Maximum pages to crawl (1–50) |
| `useProxy` | boolean | No | false | Route through residential proxies |
| `proxyCountry` | string | No | global | Country code, `"eu"`, or `"global"` |
| `cookies` | string | No | none | Session cookies for authenticated pages |
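The parameter table can be turned into a small client-side check that catches bad requests before they are sent. This is a minimal sketch under the constraints listed above (three required strings, `maxPages` between 1 and 50). `build_crawl_payload` is a hypothetical helper rather than an SDK function, and the API may apply validation rules beyond the ones shown here.

```python
from typing import Any, Dict, Optional


def build_crawl_payload(
    base_url: str,
    crawl_instruction: str,
    transform_instruction: str,
    max_pages: int = 5,
    use_proxy: bool = False,
    proxy_country: Optional[str] = None,
    cookies: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON body for POST /api/crawl, validating documented limits."""
    required = {
        "baseUrl": base_url,
        "crawlInstruction": crawl_instruction,
        "transformInstruction": transform_instruction,
    }
    for name, value in required.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be a non-empty string")

    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not 1 <= max_pages <= 50:
        raise ValueError("maxPages must be between 1 and 50")

    payload: Dict[str, Any] = dict(required, maxPages=max_pages)
    if use_proxy:
        payload["useProxy"] = True
        if proxy_country:
            payload["proxyCountry"] = proxy_country
    if cookies:
        payload["cookies"] = cookies
    return payload


payload = build_crawl_payload(
    "https://books.toscrape.com",
    "Find all individual book pages.",
    "Extract the book title and price.",
    max_pages=10,
)
print(payload)
```

Optional fields are left out of the payload unless set, so the server-side defaults from the table apply.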