Scraping and crawling are two different problems. Scraping is for when you know the URLs. You have a list of product pages, a set of job listings, a collection of profiles. You hand those URLs to the scrape endpoint and get data back.
Crawling is for when you do not know the URLs. You know the website, you know what kind of pages you want, but you have not sat down and enumerated them. You want to point Spidra at a domain, describe what to look for, and have it discover and extract data from everything matching that description.
That is what the crawl API does. You give it a starting URL, tell it which pages to discover in plain English, tell it what to extract from each one, and it handles the rest. Page discovery, link following, browser rendering, CAPTCHA solving, and AI extraction all happen automatically. You get back structured data from every page it found.
This guide covers the entire crawl API from your first request through re-extraction, history, and real-world pipelines.
How the crawl API works
Crawl jobs follow the same async pattern as scraping. You submit, receive a jobId, and poll until complete.
The internal process has five stages:
1. Submit: you send your request and get a jobId immediately
2. Discover: Spidra loads your baseUrl and finds links matching your crawlInstruction
3. Crawl: each discovered page is visited in a real browser, up to your maxPages limit
4. Solve: CAPTCHAs are handled automatically on any page that needs them
5. Transform: your transformInstruction runs on every crawled page via AI extraction
POST /api/crawl → { jobId: "abc-123" }
GET /api/crawl/abc-123 → { status: "active", ... }
GET /api/crawl/abc-123 → { status: "completed", result: [...] }
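The submit-then-poll pattern above can be sketched as a small helper. This is a minimal sketch, not part of any Spidra SDK: `poll_until_done`, its parameters, and the stand-in replies are hypothetical names for illustration. The status values ("active", "completed", "failed") are the ones shown in this guide. The helper takes any callable that returns the job as a dict, so it can wrap a real GET /api/crawl/{jobId} request or a fake one.

```python
import time
from typing import Callable, Dict

# Statuses after which polling should stop (taken from the examples in this guide).
TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[], Dict],
    interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Call fetch_status() until the job is terminal or the timeout passes."""
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"Job still {job.get('status')!r} after {timeout} seconds")
        sleep(interval)


# Stand-in for GET /api/crawl/{jobId}: two "active" replies, then "completed".
replies = iter([
    {"status": "active"},
    {"status": "active"},
    {"status": "completed", "result": []},
])
final = poll_until_done(lambda: next(replies), sleep=lambda seconds: None)
print(final["status"])
```

Passing `sleep` and `clock` in as arguments keeps the loop testable without real waiting. A hard timeout also means a stuck job cannot block a pipeline forever.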
Three fields are required on every crawl request: baseUrl, crawlInstruction, and transformInstruction. Everything else is optional.
Your first crawl
cURL
curl -X POST https://api.spidra.io/api/crawl \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "baseUrl": "https://books.toscrape.com",
    "crawlInstruction": "Crawl all book listing pages and individual book pages",
    "transformInstruction": "Extract the book title, price, star rating, and availability",
    "maxPages": 10
  }'
Response:
{
  "status": "queued",
  "jobId": "7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41",
  "message": "Crawl job queued. Poll /api/crawl/7f3a8b12 for results."
}
Poll until complete:
curl https://api.spidra.io/api/crawl/7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41 \
  -H "x-api-key: YOUR_API_KEY"
When the job finishes, the result array has one entry per crawled page. Each entry has the url and the data extracted by your transformInstruction:
{
  "status": "completed",
  "jobId": "7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41",
  "result": [
    {
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "data": {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three",
        "availability": "In stock"
      }
    },
    {
      "url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
      "data": {
        "title": "Tipping the Velvet",
        "price": "£53.74",
        "rating": "One",
        "availability": "In stock"
      }
    }
  ]
}
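Because every entry in the result array has the same url-plus-data shape, it is straightforward to flatten for a spreadsheet. The following is a minimal sketch that assumes data is a flat JSON object, as in the response above. `results_to_csv` is a hypothetical helper name, not something the API or SDK provides.

```python
import csv
import io


def results_to_csv(result):
    """Flatten [{'url': ..., 'data': {...}}, ...] into CSV text, url column first."""
    # Collect field names in first-seen order so the columns stay stable.
    fields = ["url"]
    for page in result:
        for key in (page.get("data") or {}):
            if key not in fields:
                fields.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for page in result:
        row = dict(page.get("data") or {})  # pages without data become url-only rows
        row["url"] = page["url"]
        writer.writerow(row)
    return buffer.getvalue()


# The two entries from the sample response above.
sample = [
    {
        "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "data": {"title": "A Light in the Attic", "price": "£51.77",
                 "rating": "Three", "availability": "In stock"},
    },
    {
        "url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        "data": {"title": "Tipping the Velvet", "price": "£53.74",
                 "rating": "One", "availability": "In stock"},
    },
]
print(results_to_csv(sample))
```

Fields missing from a page are left as empty cells, which makes gaps in the extraction easy to spot.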
The three required fields
baseUrl
The page where the crawl starts. Spidra loads this URL first, reads the links on it, and uses those links as the starting point for discovery. The crawl stays within the same domain by default.
Be specific with the starting URL. If you are crawling a blog, start at the blog index rather than the homepage. If you want documentation pages, start at the docs root. Starting closer to the content you want means the crawl reaches it faster and wastes fewer pages on navigation.
"baseUrl": "https://docs.example.com" // good: docs root
"baseUrl": "https://example.com" // less focused: homepage first
"baseUrl": "https://example.com/blog" // good: blog root
"baseUrl": "https://competitor.com/pricing" // good: specific section
crawlInstruction
This tells Spidra which links to follow and which to ignore. Write it as a plain English description of the pages you want.
Here are some examples:
"crawlInstruction": "Find all blog post pages. Skip tag pages, author pages, and the homepage."
"crawlInstruction": "Crawl all product pages. Ignore cart, checkout, account, and search result pages."
"crawlInstruction": "Find all documentation pages. Skip API reference pages, changelog pages, and login pages."
Be explicit about what to skip. Crawl budgets are limited by maxPages, and pages spent on navigation, sidebars, or boilerplate are pages not spent on the content you actually want.
transformInstruction
This tells Spidra what to extract from each page it crawls. It runs as an AI extraction on the fully rendered content of every page.
Here are some examples:
"transformInstruction": "Extract the article title, author name, publish date, and the full body text. Return null for any field not found on the page."
"transformInstruction": "Extract the product name, current price as a number, currency code, and whether it is in stock."
"transformInstruction": "Extract the page title and a two-sentence summary of the main content."
Write it the same way you would write a prompt for the scrape API. Specific field names produce consistent output. Telling the AI what to return when a field is missing prevents silent omissions.
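Since all three fields are required, it can help to check them locally before submitting. Here is a minimal sketch of such a check. `build_crawl_payload` is a hypothetical helper, and the rule that baseUrl must start with http:// or https:// is an assumption made for illustration, not a documented API constraint.

```python
import json


def build_crawl_payload(base_url, crawl_instruction, transform_instruction, **optional):
    """Return a request body for POST /api/crawl, rejecting empty required fields."""
    required = {
        "baseUrl": base_url,
        "crawlInstruction": crawl_instruction,
        "transformInstruction": transform_instruction,
    }
    missing = [
        name for name, value in required.items()
        if not isinstance(value, str) or not value.strip()
    ]
    if missing:
        raise ValueError("Missing required field(s): " + ", ".join(missing))
    # Assumption: the starting URL should be an absolute http(s) URL.
    if not base_url.startswith(("http://", "https://")):
        raise ValueError("baseUrl must be an absolute http(s) URL")
    # Optional fields (for example maxPages) are passed through unchanged.
    return {**optional, **required}


payload = build_crawl_payload(
    "https://example.com/blog",
    "Find all blog post pages. Skip tag pages, author pages, and the homepage.",
    "Extract the article title, author name, and publish date. "
    "Return null for any field not found on the page.",
    maxPages=10,
)
print(json.dumps(payload, indent=2))
```

Failing fast on an empty instruction avoids spending a request on a job that cannot succeed.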
Python: raw REST API
import json
import os
import time

import requests

API_KEY = os.environ["SPIDRA_API_KEY"]
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Submit the crawl job.
response = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={
        "baseUrl": "https://books.toscrape.com",
        "crawlInstruction": "Find all individual book pages. Skip category pages, the homepage, and pagination pages.",
        "transformInstruction": "Extract the book title, price excluding tax as a number, star rating as a word, UPC code, and whether it is in stock.",
        "maxPages": 15,
    },
    timeout=30,
)
response.raise_for_status()
job_id = response.json()["jobId"]
print(f"Crawl started: {job_id}")

# Poll every 5 seconds until the job completes or fails.
while True:
    poll = requests.get(f"{BASE_URL}/crawl/{job_id}", headers=HEADERS, timeout=30)
    poll.raise_for_status()
    status = poll.json()
    print(f"Status: {status['status']}")
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(f"Crawl failed: {status.get('error')}")
    time.sleep(5)

pages = status["result"]
print(f"\nCrawled {len(pages)} pages")
for page in pages:
    print(f"\n{page['url']}")
    print(page["data"])

# Save one JSON object per line.
with open("books.jsonl", "w", encoding="utf-8") as f:
    for page in pages:
        f.write(json.dumps({"url": page["url"], "data": page["data"]}) + "\n")
print("\nSaved to books.jsonl")
Python SDK
from spidra import SpidraClient, CrawlParams, PollOptions
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://books.toscrape.com",
        crawl_instruction="Find all individual book pages. Skip category pages, the homepage, and pagination pages.",
        transform_instruction="Extract the book title, price excluding tax, star rating, UPC, and stock status.",
        max_pages=15,
    ),
    PollOptions(timeout=300),
)
print(f"Crawled {len(job.result)} pages")
for page in job.result:
    print(page.url, page.data)
For crawls that may take a while, use submit_sync() and get_sync() to manage polling yourself:
import time

queued = spidra.crawl.submit_sync(CrawlParams(
    base_url="https://docs.example.com",
    crawl_instruction="Crawl all documentation pages. Skip changelog and community pages.",
    transform_instruction="Extract the page title and full body text as clean Markdown.",
    max_pages=20,
))
job_id = queued.job_id
print(f"Job submitted: {job_id}")

# Poll every 5 seconds until the job reaches a terminal status.
while True:
    status = spidra.crawl.get_sync(job_id)
    print(f"Status: {status.status}")
    if status.status in ("completed", "failed"):
        break
    time.sleep(5)

# A failed job has no results to iterate over, so stop here.
if status.status == "failed":
    raise RuntimeError(f"Crawl {job_id} failed")
for page in status.result:
    print(page.url)
Node.js SDK
import { SpidraClient } from 'spidra-js'
import { writeFileSync } from 'fs'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
const job = await spidra.crawl.run({
  baseUrl: 'https://books.toscrape.com',
  crawlInstruction: 'Find all individual book pages. Skip category pages, the homepage, and pagination pages.',
  transformInstruction: 'Extract the book title, price, star rating, and availability.',
  maxPages: 15,
})
console.log(`Crawled ${job.result.length} pages`)

// Write one JSON object per line.
const jsonl = job.result
  .map(page => JSON.stringify({ url: page.url, data: page.data }))
  .join('\n')
writeFileSync('books.jsonl', jsonl)
maxPages: what to expect
maxPages defaults to 5 and accepts values between 1 and 50. The crawl stops when it reaches this limit even if there are more pages to discover. If you need to go beyond 50 for a large-scale use case, reach out via the contact page and the team can adjust it for you.
A few things worth knowing about how maxPages works in practice:
The first page counted is the baseUrl itself. A maxPages: 10 crawl visits the base URL plus up to 9 additional discovered pages.
Page discovery is breadth-first. The crawler finds all links on the base URL, then all links on those pages, and so on. Pages found earlier in the discovery order are more likely to be visited within a limited budget.
Setting a focused crawlInstruction matters more the lower your maxPages budget. A vague instruction on a large site will fill your budget with navigation and boilerplate. A specific instruction sends the budget toward the content you actually want.
For reference: maxPages: 5 is appropriate for a quick proof of concept or a small documentation section. maxPages: 10 works well for a moderate-sized blog or a product catalog section. maxPages: 50 covers large documentation sites, full competitor blogs, and most production crawl use cases.
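To see how a page budget interacts with breadth-first discovery, here is a toy model. It is a sketch of generic breadth-first traversal over a hand-written link graph, not Spidra's actual crawler, and it ignores crawlInstruction filtering entirely. The function name and the sample site are made up for illustration.

```python
from collections import deque


def breadth_first_visit(links, base_url, max_pages=5):
    """Return URLs in breadth-first order, counting base_url as the first page."""
    if not 1 <= max_pages <= 50:
        raise ValueError("max_pages must be between 1 and 50")
    visited = []
    seen = {base_url}
    queue = deque([base_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # Links found on this page join the back of the queue.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical site: the blog index links to two posts and a tag page.
site = {
    "/blog": ["/blog/post-1", "/blog/post-2", "/blog/tags"],
    "/blog/post-1": ["/blog/post-3"],
    "/blog/post-2": ["/blog/post-4"],
    "/blog/tags": ["/blog/post-5"],
}
print(breadth_first_visit(site, "/blog", max_pages=4))
# → ['/blog', '/blog/post-1', '/blog/post-2', '/blog/tags']
```

With a budget of 4, the index and the tag page use half of it, and posts 3 to 5 are never reached. That is the situation a specific crawlInstruction is meant to prevent.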
Proxy routing for crawls
Add useProxy and proxyCountry to route crawl requests through residential proxies. This is useful for sites that block cloud infrastructure, sites with geo-restricted content, or competitor sites that actively rate-limit.
spidra.crawl.run_sync(CrawlParams(
    base_url="https://competitor.com/pricing",
    crawl_instruction="Find all pricing, plan, and features pages.",
    transform_instruction="Extract plan names, prices, and included features. Return null for any field not present.",
    max_pages=10,
    use_proxy=True,
    proxy_country="us",
))
Country options: any two-letter ISO country code (for example "us", "de", "gb", or "fr"), "eu" to rotate across all 27 EU states, or omit the field for global rotation.
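A small check on the country value can catch typos before a request is sent. This is a minimal sketch: `proxy_options` is a hypothetical helper, lowercasing the code is an assumption based on the lowercase examples above, and the check only looks at the shape of the value. It does not confirm that the code is a real ISO 3166 country or one that Spidra supports.

```python
import re


def proxy_options(country=None):
    """Build the proxy fields of a crawl request body.

    country=None or "global" omits proxyCountry, which means global rotation.
    """
    options = {"useProxy": True}
    if country is None:
        return options
    code = country.strip().lower()  # assumption: codes are sent in lowercase
    if code == "global":
        return options
    # "eu" and two-letter country codes share the same two-letter shape.
    if not re.fullmatch(r"[a-z]{2}", code):
        raise ValueError(f"Expected a two-letter code, 'eu', or 'global', got {country!r}")
    options["proxyCountry"] = code
    return options


print(proxy_options("de"))
print(proxy_options())
```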
Authenticated crawling
Crawl pages behind a login by passing session cookies. Log in through your browser, open DevTools, copy the Cookie header from any authenticated request, and pass it as a string.
spidra.crawl.run_sync(CrawlParams(
    base_url="https://app.example.com/reports",
    crawl_instruction="Find all monthly report pages from 2025.",
    transform_instruction="Extract the report title, date, and the key metrics table.",
    max_pages=15,
    cookies="session_id=abc123; auth_token=xyz789",
))
The cookies are applied to every page visited during the crawl.
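If your cookies live in a dictionary rather than a copied header, they can be joined into the same "name=value; name=value" string. This is a minimal sketch with a hypothetical helper name. It performs only basic sanity checks and does not implement the full cookie grammar.

```python
def cookie_header(cookies):
    """Join name/value pairs into a Cookie header string: 'a=1; b=2'."""
    parts = []
    for name, value in cookies.items():
        # Reject characters that would break the 'name=value; ...' layout.
        if not name or any(ch in name for ch in "=;, \t"):
            raise ValueError(f"Invalid cookie name: {name!r}")
        if ";" in value:
            raise ValueError(f"Cookie value for {name!r} must not contain ';'")
        parts.append(f"{name}={value}")
    return "; ".join(parts)


print(cookie_header({"session_id": "abc123", "auth_token": "xyz789"}))
# → session_id=abc123; auth_token=xyz789
```

The resulting string has the same format as the cookies value in the example above.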
Getting the raw page content
After a crawl completes, you can fetch the raw HTML and Markdown for every page that was crawled. Spidra stores the page content and provides signed download URLs that are valid for one hour.
Via REST API
curl https://api.spidra.io/api/crawl/{jobId}/pages \
  -H "x-api-key: YOUR_API_KEY"
Response:
{
  "pages": [
    {
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "status": "success",
      "html_url": "https://spidra-storage.s3.amazonaws.com/...?expires=...",
      "markdown_url": "https://spidra-storage.s3.amazonaws.com/...?expires=..."
    }
  ]
}
Via Python SDK
import requests

result = spidra.crawl.pages_sync(job_id)
for page in result.pages:
    print(page.url, page.status)
    if page.markdown_url:
        # Download the Markdown from the signed URL and preview the first 500 characters.
        md = requests.get(page.markdown_url, timeout=30).text
        print(md[:500])
The signed URLs expire after one hour. Download the content you need promptly after retrieving them.
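In a longer pipeline it can be worth tracking when the URLs were retrieved, so that stale ones are refreshed instead of failing mid-download. This is a minimal sketch under one assumption: the one-hour window is counted from the moment the URLs are retrieved. The helper name and the five-minute safety margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

SIGNED_URL_LIFETIME = timedelta(hours=1)  # validity stated in this guide


def urls_still_usable(retrieved_at, now=None, safety_margin=timedelta(minutes=5)):
    """Return True while signed URLs fetched at retrieved_at should still work."""
    now = now or datetime.now(timezone.utc)
    return now - retrieved_at < SIGNED_URL_LIFETIME - safety_margin


retrieved_at = datetime.now(timezone.utc)  # record this right after fetching the URLs
if urls_still_usable(retrieved_at):
    print("Safe to download")
else:
    print("Fetch fresh URLs from /api/crawl/{jobId}/pages first")
```

When the check fails, requesting the pages endpoint again should yield fresh URLs, since the stored content itself is what the links point to.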
Re-extracting without re-crawling
This is one of the most useful features in the crawl API and the one most people do not know about.
Once a crawl job is complete, Spidra stores all the page content. If you want to extract different information from the same set of pages — because your requirements changed, or you want additional fields, or you want to try a different extraction approach — you can run a new AI extraction on the stored content without making a single new browser request.
The extract endpoint takes a completed job ID and a new instruction. It returns a new job ID for the re-extraction. You only pay transformation credits, not crawling credits.
Via REST API
curl -X POST https://api.spidra.io/api/crawl/7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transformInstruction": "Extract only the book title and the UPC code as a flat JSON object."
  }'
Response:
{
  "jobId": "new-job-id-for-re-extraction"
}
Poll this new jobId with GET /api/crawl/{jobId} until it completes.
Via Python SDK
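import time

original = spidra.crawl.run_sync(CrawlParams(
    base_url="https://books.toscrape.com",
    crawl_instruction="Find all individual book pages",
    transform_instruction="Extract book title and price",
    max_pages=10,
))
original_job_id = original.job_id
new_job = spidra.crawl.extract_sync(
    original_job_id,
    "Extract only the UPC code and the full product description from each page",
)

# The re-extraction runs as its own job, so poll it until it finishes.
while True:
    result = spidra.crawl.get_sync(new_job.job_id)
    if result.status in ("completed", "failed"):
        break
    time.sleep(5)
if result.status == "failed":
    raise RuntimeError(f"Re-extraction {new_job.job_id} failed")
for page in result.result:
    print(page.url, page.data)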
This is particularly useful for iterative data collection. Crawl once, extract different fields as your pipeline evolves, without paying for browser time again.
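Once you have several extraction runs over the same pages, you may want one combined record per URL. The sketch below merges result arrays in the REST shape shown earlier (a list of objects with url and data). It assumes a re-extraction reports the same url values as the original crawl. `merge_extractions` is a hypothetical helper, and the field values are illustrative.

```python
def merge_extractions(*runs):
    """Combine several result arrays into {url: merged_fields}.

    Later runs win when two runs extract a field with the same name.
    """
    merged = {}
    for run in runs:
        for page in run:
            fields = merged.setdefault(page["url"], {})
            fields.update(page.get("data") or {})
    return merged


first_run = [
    {"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
     "data": {"title": "A Light in the Attic", "price": "£51.77"}},
]
second_run = [
    {"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
     "data": {"upc": "example-upc"}},  # placeholder value for illustration
]
for url, fields in merge_extractions(first_run, second_run).items():
    print(url, fields)
```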
Browsing crawl history
from spidra import CrawlHistoryParams

response = spidra.crawl.history_sync(CrawlHistoryParams(page=1, limit=10))
print(f"Total crawl jobs: {response.total}")
for job in response.jobs:
    print(f"{job.base_url} {job.status} {job.pages_crawled} pages")

stats = spidra.crawl.stats_sync()
print(f"All-time crawls: {stats.total}")
Via REST:
curl "https://api.spidra.io/api/crawl/history?page=1&limit=10" \
  -H "x-api-key: YOUR_API_KEY"
Real-world examples
Crawling a competitor's blog for content analysis
from spidra import SpidraClient, CrawlParams
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
job = spidra.crawl.run_sync(CrawlParams(
    base_url="https://competitor.com/blog",
    crawl_instruction="Follow links to blog posts only. Skip author pages, tag pages, category pages, and the blog index.",
    transform_instruction="Extract the post title, author name, publish date in ISO format, and a one-sentence summary of the main argument.",
    max_pages=20,
    use_proxy=True,
))

# Keep only pages where the extraction returned data.
posts = [p.data for p in job.result if p.data]
print(f"Found {len(posts)} blog posts")

with open("competitor_posts.jsonl", "w", encoding="utf-8") as f:
    for page in job.result:
        if page.data:
            f.write(json.dumps({"url": page.url, "data": page.data}) + "\n")
Crawling books.toscrape.com for a full product catalogue
books.toscrape.com is a fictional bookstore built specifically for scraping practice. It has 1000 books across 50 pages. With maxPages: 30 or more you can get a thorough sample of individual book pages with full details. The example below uses a smaller budget of 20 and restricts the crawl to the Mystery category.
job = spidra.crawl.run_sync(CrawlParams(
    base_url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    crawl_instruction="Crawl all individual book pages in the Mystery category. Skip pagination pages and category index pages.",
    transform_instruction="Extract the book title, price excluding tax as a number without the pound sign, price including tax as a number, star rating as a word, UPC, number in stock, and the full product description.",
    max_pages=20,
))
print(f"Extracted {len(job.result)} mystery books")
for page in job.result[:3]:
    print(f"\n{page.url}")
    print(json.dumps(page.data, indent=2))
Crawling a documentation site for a RAG pipeline
import os, json, re, requests
from spidra import SpidraClient, CrawlParams, PollOptions

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://docs.example.com",
        crawl_instruction="Follow all documentation pages. Skip the changelog, the API reference, and login pages.",
        transform_instruction="Extract the page title and the full body content as clean Markdown. Preserve all headings, code examples, and numbered lists. Return null for pages with no meaningful content.",
        max_pages=20,
    ),
    PollOptions(timeout=600),
)
print(f"Crawled {len(job.result)} pages")

# Fetch the stored Markdown for each page and split it into chunks.
raw_pages = spidra.crawl.pages_sync(job.job_id)
chunks = []
for page in raw_pages.pages:
    if page.markdown_url and page.status == "success":
        md = requests.get(page.markdown_url, timeout=30).text
        # Split before every level 1 to 3 heading so each chunk is one section.
        sections = re.split(r'\n(?=#{1,3} )', md)
        for section in sections:
            # Skip very short sections such as bare headings.
            if len(section.strip()) > 200:
                chunks.append({
                    "url": page.url,
                    "content": section.strip(),
                })
print(f"Created {len(chunks)} chunks for vector indexing")

with open("docs_chunks.jsonl", "w", encoding="utf-8") as f:
    for chunk in chunks:
        f.write(json.dumps(chunk) + "\n")
Weekly competitive intelligence crawl
import os, json
from datetime import datetime, timezone
from pathlib import Path
from spidra import SpidraClient, CrawlParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
COMPETITORS = [
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/plans",
]
today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
results = {}

for url in COMPETITORS:
    print(f"Crawling {url}...")
    job = spidra.crawl.run_sync(CrawlParams(
        base_url=url,
        crawl_instruction="Find all pricing pages, plan comparison pages, and feature pages. Skip blog, docs, and help pages.",
        transform_instruction="Extract all plan names, monthly prices, annual prices, and the list of features or limitations for each plan. Return null for any field not present.",
        max_pages=5,
        use_proxy=True,
    ))
    results[url] = {
        "crawled_at": today,
        "pages": [{"url": p.url, "data": p.data} for p in job.result]
    }

# Write one dated snapshot file per run.
Path("snapshots").mkdir(exist_ok=True)
with open(f"snapshots/pricing-{today}.json", "w") as f:
    json.dump(results, f, indent=2)
print(f"\nSnapshot saved: snapshots/pricing-{today}.json")
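Dated snapshots become useful once you compare them. The sketch below diffs two snapshots that follow the structure written by the script above (competitor URL mapped to crawled_at and pages). `diff_snapshots` is a hypothetical helper and the sample data is invented. Because the data comes from AI extraction, wording may vary between runs, so exact comparison can flag changes that are only cosmetic.

```python
def diff_snapshots(old, new):
    """Return {page_url: (old_data, new_data)} for pages whose data differs."""
    def index(snapshot):
        # Flatten {competitor: {"pages": [...]}} into {page_url: data}.
        return {
            page["url"]: page["data"]
            for entry in snapshot.values()
            for page in entry["pages"]
        }

    before, after = index(old), index(new)
    changes = {}
    for url in sorted(before.keys() | after.keys()):
        if before.get(url) != after.get(url):
            changes[url] = (before.get(url), after.get(url))
    return changes


# Invented snapshots for illustration; real ones would be loaded with json.load().
last_week = {"https://competitor-a.com/pricing": {"crawled_at": "2025-01-01", "pages": [
    {"url": "https://competitor-a.com/pricing", "data": {"plan": "Pro", "monthly": 49}},
]}}
this_week = {"https://competitor-a.com/pricing": {"crawled_at": "2025-01-08", "pages": [
    {"url": "https://competitor-a.com/pricing", "data": {"plan": "Pro", "monthly": 59}},
    {"url": "https://competitor-a.com/pricing/enterprise", "data": {"plan": "Enterprise"}},
]}}
for url, (old_data, new_data) in diff_snapshots(last_week, this_week).items():
    print(url, old_data, "->", new_data)
```

Pages that appear in only one snapshot show up with None on the other side, which highlights newly published or removed pricing pages.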
Crawl API vs scrape API: when to use each
The decision is straightforward once you think about it in terms of what you know upfront.
Use the scrape API when you already have the URLs. A list of product pages, a set of known job listings, specific competitor pages you monitor regularly. You have the URLs, you just need the data.
Use the crawl API when you do not have the URLs yet. You know the website, you know what kind of pages you want, but you need Spidra to discover them for you. A competitor's blog you have not catalogued. A documentation site you want to index. A new site section that was just published.
The crawl API is also the better choice when the number of pages is unpredictable. If a competitor's pricing page might be one URL or five, the crawl endpoint discovers and processes all of them without you needing to know the count in advance.
For large-scale collection where you already have hundreds of URLs, the batch scrape endpoint processes up to 50 in parallel per request and is more efficient than crawling.
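When you already hold a long URL list, the 50-per-request limit means splitting it into batches first. This is a minimal sketch of that split only. The request format of the batch scrape endpoint is outside the scope of this guide, so no request is made here, and the URLs are placeholders.

```python
def batches(urls, size=50):
    """Yield consecutive slices of urls with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# 120 placeholder URLs split into groups of at most 50.
urls = [f"https://example.com/product/{n}" for n in range(120)]
for group in batches(urls):
    print(len(group), group[0])
```

For 120 URLs this produces three groups of 50, 50, and 20.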
Crawl API reference
Endpoints
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/crawl | Submit a crawl job |
| GET | /api/crawl/{jobId} | Poll for status and results |
| GET | /api/crawl/{jobId}/pages | Get per-page HTML and Markdown download URLs |
| POST | /api/crawl/{jobId}/extract | Re-run extraction on stored pages |
| GET | /api/crawl/{jobId}/download | Download all results as a ZIP |
| GET | /api/crawl/history | List past crawl jobs |
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| baseUrl | string | Yes | - | Starting URL for the crawl |
| crawlInstruction | string | Yes | - | Which pages to discover |
| transformInstruction | string | Yes | - | What to extract from each page |
| maxPages | integer | No | 5 | Maximum pages to crawl (1–50) |
| useProxy | boolean | No | false | Route through residential proxies |
| proxyCountry | string | No | global | Country code, "eu", or "global" |
| cookies | string | No | - | Session cookies for authenticated pages |