# Spidra crawl API: how to crawl an entire website and extract data

> Source: <https://spidra.io/blog/spidra-crawl-api-guide>
> Published: 2026-06-24 00:00:00+00:00

Scraping and crawling are two different problems. Scraping is for when you know the URLs. You have a list of product pages, a set of job listings, a collection of profiles. You hand those URLs to the scrape endpoint and get data back.

Crawling is for when you do not know the URLs. You know the website, you know what kind of pages you want, but you have not sat down and enumerated them. You want to point Spidra at a domain, describe what to look for, and have it discover and extract data from everything matching that description.

That is what the [crawl API](https://docs.spidra.io/api-reference/crawling/crawl) does. You give it a starting URL, tell it which pages to discover in plain English, tell it what to extract from each one, and it handles the rest. Page discovery, link following, browser rendering, [CAPTCHA solving](https://spidra.io/products/captcha-solver), and AI extraction all happen automatically. You get back structured data from every page it found.

This guide covers the entire crawl API from your first request through re-extraction, history, and real-world pipelines.

## How the crawl API works

Crawl jobs follow the same async pattern as scraping. You submit, receive a `jobId`

, and poll until complete.

The internal process has five stages:

**Submit**— you send your request and get a`jobId`

immediately**Discover**— Spidra loads your`baseUrl`

and finds links matching your`crawlInstruction`

**Crawl**— each discovered page is visited in a real browser, up to your`maxPages`

limit**Solve**— CAPTCHAs are handled automatically on any page that needs them** Transform**— your`transformInstruction`

runs on every crawled page via AI extraction

```
POST /api/crawl          →  { jobId: "abc-123" }
GET  /api/crawl/abc-123  →  { status: "active", ... }
GET  /api/crawl/abc-123  →  { status: "completed", result: [...] }
```

Three fields are required on every crawl request: `baseUrl`

, `crawlInstruction`

, and `transformInstruction`

. Everything else is optional.

## Your first crawl

### cURL

```
# Submit a crawl
curl -X POST https://api.spidra.io/api/crawl \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "baseUrl": "https://books.toscrape.com",
    "crawlInstruction": "Crawl all book listing pages and individual book pages",
    "transformInstruction": "Extract the book title, price, star rating, and availability",
    "maxPages": 10
  }'
```

Response:

```
{
  "status": "queued",
  "jobId": "7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41",
  "message": "Crawl job queued. Poll /api/crawl/7f3a8b12 for results."
}
```

Poll until complete:

```
curl https://api.spidra.io/api/crawl/7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41 \
  -H "x-api-key: YOUR_API_KEY"
```

When the job finishes, the `result`

array has one entry per crawled page. Each entry has the `url`

and the `data`

extracted by your `transformInstruction`

:

```
{
  "status": "completed",
  "jobId": "7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41",
  "result": [
    {
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "data": {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three",
        "availability": "In stock"
      }
    },
    {
      "url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
      "data": {
        "title": "Tipping the Velvet",
        "price": "£53.74",
        "rating": "One",
        "availability": "In stock"
      }
    }
  ]
}
```

## The three required fields

### baseUrl

The page where the crawl starts. Spidra loads this URL first, reads the links on it, and uses those links as the starting point for discovery. The crawl stays within the same domain by default.

Be specific with the starting URL. If you are crawling a blog, start at the blog index rather than the homepage. If you want documentation pages, start at the docs root. Starting closer to the content you want means the crawl reaches it faster and wastes fewer pages on navigation.

```
"baseUrl": "https://docs.example.com"          // good: docs root
"baseUrl": "https://example.com"               // less focused: homepage first
"baseUrl": "https://example.com/blog"          // good: blog root
"baseUrl": "https://competitor.com/pricing"    // good: specific section
```

### crawlInstruction

This tells Spidra which links to follow and which to ignore. Write it as a plain English description of the pages you want.

Here are some examples:

```
"crawlInstruction": "Find all blog post pages. Skip tag pages, author pages, and the homepage."
"crawlInstruction": "Crawl all product pages. Ignore cart, checkout, account, and search result pages."
"crawlInstruction": "Find all documentation pages. Skip API reference pages, changelog pages, and login pages."
```

Be explicit about what to skip. Crawl budgets are limited by `maxPages`

, and pages spent on navigation, sidebars, or boilerplate are pages not spent on the content you actually want.

### transformInstruction

This tells Spidra what to extract from each page it crawls. It runs as an AI extraction on the fully rendered content of every page.

Here are some examples:

```
"transformInstruction": "Extract the article title, author name, publish date, and the full body text. Return null for any field not found on the page."
"transformInstruction": "Extract the product name, current price as a number, currency code, and whether it is in stock."
"transformInstruction": "Extract the page title and a two-sentence summary of the main content."
```

Write it the same way you would write a prompt for the scrape API. Specific field names produce consistent output. Telling the AI what to return when a field is missing prevents silent omissions.

## Python: raw REST API

``` python
import requests
import time
import json
import os

API_KEY = os.environ["SPIDRA_API_KEY"]
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Submit the crawl
response = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={
        "baseUrl": "https://books.toscrape.com",
        "crawlInstruction": "Find all individual book pages. Skip category pages, the homepage, and pagination pages.",
        "transformInstruction": "Extract the book title, price excluding tax as a number, star rating as a word, UPC code, and whether it is in stock.",
        "maxPages": 15,
    }
)
response.raise_for_status()
job_id = response.json()["jobId"]
print(f"Crawl started: {job_id}")

# Poll until complete
while True:
    status = requests.get(
        f"{BASE_URL}/crawl/{job_id}",
        headers=HEADERS
    ).json()

    print(f"Status: {status['status']}")

    if status["status"] == "completed":
        break
    elif status["status"] == "failed":
        raise Exception(f"Crawl failed: {status.get('error')}")

    time.sleep(5)

# Process results
pages = status["result"]
print(f"\nCrawled {len(pages)} pages")

for page in pages:
    print(f"\n{page['url']}")
    print(page["data"])

# Save as JSONL
with open("books.jsonl", "w") as f:
    for page in pages:
        f.write(json.dumps({
            "url": page["url"],
            "data": page["data"]
        }) + "\n")

print(f"\nSaved to books.jsonl")
```

## Python SDK

``` python
from spidra import SpidraClient, CrawlParams, PollOptions
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://books.toscrape.com",
        crawl_instruction="Find all individual book pages. Skip category pages, the homepage, and pagination pages.",
        transform_instruction="Extract the book title, price excluding tax, star rating, UPC, and stock status.",
        max_pages=15,
    ),
    PollOptions(timeout=300),
)

print(f"Crawled {len(job.result)} pages")

for page in job.result:
    print(page.url, page.data)
```

For crawls that may take a while, use `submit_sync()`

and `get_sync()`

to manage polling yourself:

```
# Submit without waiting
queued = spidra.crawl.submit_sync(CrawlParams(
    base_url="https://docs.example.com",
    crawl_instruction="Crawl all documentation pages. Skip changelog and community pages.",
    transform_instruction="Extract the page title and full body text as clean Markdown.",
    max_pages=20,
))

job_id = queued.job_id
print(f"Job submitted: {job_id}")

# Check on it later
import time
while True:
    status = spidra.crawl.get_sync(job_id)
    print(f"Status: {status.status}")
    if status.status in ("completed", "failed"):
        break
    time.sleep(5)

for page in status.result:
    print(page.url)
```

## Node.js SDK

``` js
import { SpidraClient } from 'spidra-js'
import { writeFileSync } from 'fs'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const job = await spidra.crawl.run({
  baseUrl: 'https://books.toscrape.com',
  crawlInstruction: 'Find all individual book pages. Skip category pages, the homepage, and pagination pages.',
  transformInstruction: 'Extract the book title, price, star rating, and availability.',
  maxPages: 15,
})

console.log(`Crawled ${job.result.length} pages`)

const jsonl = job.result
  .map(page => JSON.stringify({ url: page.url, data: page.data }))
  .join('\n')

writeFileSync('books.jsonl', jsonl)
```

## maxPages: what to expect

`maxPages`

defaults to 5 and accepts values between 1 and 50. The crawl stops when it reaches this limit even if there are more pages to discover. If you need to go beyond 50 for a large-scale use case, [reach out via the contact page](https://spidra.io/contact) and the team can adjust it for you.

A few things worth knowing about how maxPages works in practice:

The first page counted is the `baseUrl`

itself. A `maxPages: 10`

crawl visits the base URL plus up to 9 additional discovered pages.

Page discovery is breadth-first. The crawler finds all links on the base URL, then all links on those pages, and so on. Pages found earlier in the discovery order are more likely to be visited within a limited budget.

Setting a focused `crawlInstruction`

matters more the lower your `maxPages`

budget. A vague instruction on a large site will fill your budget with navigation and boilerplate. A specific instruction sends the budget toward the content you actually want.

For reference: `maxPages: 5`

is appropriate for a quick proof of concept or a small documentation section. `maxPages: 10`

works well for a moderate-sized blog or a product catalog section. `maxPages: 50`

covers large documentation sites, full competitor blogs, and most production crawl use cases.

## Proxy routing for crawls

Add `useProxy`

and `proxyCountry`

to route crawl requests through residential proxies. Useful for sites that block cloud infrastructure, sites with geo-restricted content, or competitor sites that actively rate-limit.

```
spidra.crawl.run_sync(CrawlParams(
    base_url="https://competitor.com/pricing",
    crawl_instruction="Find all pricing, plan, and features pages.",
    transform_instruction="Extract plan names, prices, and included features. Return null for any field not present.",
    max_pages=10,
    use_proxy=True,
    proxy_country="us",
))
```

Country options: any two-letter ISO code (`"us"`

, `"de"`

, `"gb"`

, `"fr"`

), `"eu"`

to rotate across all 27 EU states, or omit for global rotation.

## Authenticated crawling

Crawl pages behind a login by passing session cookies. Log in through your browser, open DevTools, copy the `Cookie`

header from any authenticated request, and pass it as a string.

```
spidra.crawl.run_sync(CrawlParams(
    base_url="https://app.example.com/reports",
    crawl_instruction="Find all monthly report pages from 2025.",
    transform_instruction="Extract the report title, date, and the key metrics table.",
    max_pages=15,
    cookies="session_id=abc123; auth_token=xyz789",
))
```

The cookies are applied to every page visited during the crawl.

## Getting the raw page content

After a crawl completes, you can fetch the raw HTML and Markdown for every page that was crawled. Spidra stores the page content and provides signed download URLs that are valid for one hour.

### Via REST API

```
curl https://api.spidra.io/api/crawl/{jobId}/pages \
  -H "x-api-key: YOUR_API_KEY"
```

Response:

```
{
  "pages": [
    {
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "status": "success",
      "html_url": "https://spidra-storage.s3.amazonaws.com/...?expires=...",
      "markdown_url": "https://spidra-storage.s3.amazonaws.com/...?expires=..."
    }
  ]
}
```

### Via Python SDK

```
result = spidra.crawl.pages_sync(job_id)

for page in result.pages:
    print(page.url, page.status)
    # page.html_url     — download original HTML
    # page.markdown_url — download cleaned Markdown

    # Download the Markdown
    if page.markdown_url:
        import requests
        md = requests.get(page.markdown_url).text
        print(md[:500])
```

The signed URLs expire after one hour. Download the content you need promptly after retrieving them.

## Re-extracting without re-crawling

This is one of the most useful features in the crawl API and the one most people do not know about.

Once a crawl job is complete, Spidra stores all the page content. If you want to extract different information from the same set of pages — because your requirements changed, or you want additional fields, or you want to try a different extraction approach — you can run a new AI extraction on the stored content without making a single new browser request.

The `extract`

endpoint takes a completed job ID and a new instruction. It returns a new job ID for the re-extraction. You only pay transformation credits, not crawling credits.

### Via REST API

```
curl -X POST https://api.spidra.io/api/crawl/7f3a8b12-4c21-4e98-b1d0-9a5f23c76e41/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transformInstruction": "Extract only the book title and the UPC code as a flat JSON object."
  }'
```

Response:

```
{
  "jobId": "new-job-id-for-re-extraction"
}
```

Poll this new `jobId`

with `GET /api/crawl/{jobId}`

until it completes.

### Via Python SDK

```
# Original crawl that extracted titles and prices
original = spidra.crawl.run_sync(CrawlParams(
    base_url="https://books.toscrape.com",
    crawl_instruction="Find all individual book pages",
    transform_instruction="Extract book title and price",
    max_pages=10,
))

original_job_id = original.job_id

# Later: re-extract different fields from the same pages
new_job = spidra.crawl.extract_sync(
    original_job_id,
    "Extract only the UPC code and the full product description from each page",
)

# Poll the new extraction job
result = spidra.crawl.get_sync(new_job.job_id)
for page in result.result:
    print(page.url, page.data)
```

This is particularly useful for iterative data collection. Crawl once, extract different fields as your pipeline evolves, without paying for browser time again.

## Browsing crawl history

``` python
from spidra import CrawlHistoryParams

response = spidra.crawl.history_sync(CrawlHistoryParams(page=1, limit=10))
print(f"Total crawl jobs: {response.total}")

for job in response.jobs:
    print(f"{job.base_url}  {job.status}  {job.pages_crawled} pages")

# Overall stats
stats = spidra.crawl.stats_sync()
print(f"All-time crawls: {stats.total}")
```

Via REST:

```
curl "https://api.spidra.io/api/crawl/history?page=1&limit=10" \
  -H "x-api-key: YOUR_API_KEY"
```

## Real-world examples

### Crawling a competitor's blog for content analysis

``` python
from spidra import SpidraClient, CrawlParams
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(CrawlParams(
    base_url="https://competitor.com/blog",
    crawl_instruction="Follow links to blog posts only. Skip author pages, tag pages, category pages, and the blog index.",
    transform_instruction="Extract the post title, author name, publish date in ISO format, and a one-sentence summary of the main argument.",
    max_pages=20,
    use_proxy=True,
))

posts = [p.data for p in job.result if p.data]
print(f"Found {len(posts)} blog posts")

# Save for analysis
with open("competitor_posts.jsonl", "w") as f:
    for i, page in enumerate(job.result):
        if page.data:
            f.write(json.dumps({"url": page.url, "data": page.data}) + "\n")
```

### Crawling books.toscrape.com for a full product catalogue

books.toscrape.com is a fictional bookstore built specifically for scraping practice. It has 1000 books across 50 pages. With `maxPages: 30`

or more you can get a thorough sample of individual book pages with full details.

```
job = spidra.crawl.run_sync(CrawlParams(
    base_url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    crawl_instruction="Crawl all individual book pages in the Mystery category. Skip pagination pages and category index pages.",
    transform_instruction="Extract the book title, price excluding tax as a number without the pound sign, price including tax as a number, star rating as a word, UPC, number in stock, and the full product description.",
    max_pages=20,
))

print(f"Extracted {len(job.result)} mystery books")

for page in job.result[:3]:
    print(f"\n{page.url}")
    print(json.dumps(page.data, indent=2))
```

### Crawling a documentation site for a RAG pipeline

``` python
import os, json, requests
from spidra import SpidraClient, CrawlParams, PollOptions

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

# Crawl the documentation
job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://docs.example.com",
        crawl_instruction="Follow all documentation pages. Skip the changelog, the API reference, and login pages.",
        transform_instruction="Extract the page title and the full body content as clean Markdown. Preserve all headings, code examples, and numbered lists. Return null for pages with no meaningful content.",
        max_pages=20,
    ),
    PollOptions(timeout=600),
)

print(f"Crawled {len(job.result)} pages")

# Fetch the Markdown for each page and chunk it
raw_pages = spidra.crawl.pages_sync(job.job_id)

chunks = []
for page in raw_pages.pages:
    if page.markdown_url and page.status == "success":
        md = requests.get(page.markdown_url).text

        # Split at heading boundaries
        import re
        sections = re.split(r'\n(?=#{1,3} )', md)
        for section in sections:
            if len(section.strip()) > 200:
                chunks.append({
                    "url": page.url,
                    "content": section.strip(),
                })

print(f"Created {len(chunks)} chunks for vector indexing")

with open("docs_chunks.jsonl", "w") as f:
    for chunk in chunks:
        f.write(json.dumps(chunk) + "\n")
```

### Weekly competitive intelligence crawl

``` python
import os, json
from datetime import datetime, timezone
from pathlib import Path
from spidra import SpidraClient, CrawlParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

COMPETITORS = [
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/plans",
]

today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
results = {}

for url in COMPETITORS:
    print(f"Crawling {url}...")

    job = spidra.crawl.run_sync(CrawlParams(
        base_url=url,
        crawl_instruction="Find all pricing pages, plan comparison pages, and feature pages. Skip blog, docs, and help pages.",
        transform_instruction="Extract all plan names, monthly prices, annual prices, and the list of features or limitations for each plan. Return null for any field not present.",
        max_pages=5,
        use_proxy=True,
    ))

    results[url] = {
        "crawled_at": today,
        "pages": [{"url": p.url, "data": p.data} for p in job.result]
    }

# Save snapshot
Path("snapshots").mkdir(exist_ok=True)
with open(f"snapshots/pricing-{today}.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"\nSnapshot saved: snapshots/pricing-{today}.json")
```

## Crawl API vs scrape API: when to use each

The decision is straightforward once you think about it in terms of what you know upfront.

**Use the scrape API when** you already have the URLs. A list of product pages, a set of known job listings, specific competitor pages you monitor regularly. You have the URLs, you just need the data.

**Use the crawl API when** you do not have the URLs yet. You know the website, you know what kind of pages you want, but you need Spidra to discover them for you. A competitor's blog you have not catalogued. A documentation site you want to index. A new site section that was just published.

The crawl API is also the better choice when the number of pages is unpredictable. If a competitor's pricing page might be one URL or five, the crawl endpoint discovers and processes all of them without you needing to know the count in advance.

For large-scale collection where you already have hundreds of URLs, the batch scrape endpoint processes up to 50 in parallel per request and is more efficient than crawling.

## Crawl API reference

### Endpoints

| Method | Endpoint | Purpose |
|---|---|---|
`POST` | `/api/crawl` | Submit a crawl job |
`GET` | `/api/crawl/{jobId}` | Poll for status and results |
`GET` | `/api/crawl/{jobId}/pages` | Get per-page HTML and Markdown download URLs |
`POST` | `/api/crawl/{jobId}/extract` | Re-run extraction on stored pages |
`GET` | `/api/crawl/{jobId}/download` | Download all results as a ZIP |
`GET` | `/api/crawl/history` | List past crawl jobs |

### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
`baseUrl` | string | Yes | — | Starting URL for the crawl |
`crawlInstruction` | string | Yes | — | Which pages to discover |
`transformInstruction` | string | Yes | — | What to extract from each page |
`maxPages` | integer | No | 5 | Maximum pages to crawl (1–50) |
`useProxy` | boolean | No | false | Route through residential proxies |
`proxyCountry` | string | No | global | Country code, `"eu"` , or `"global"` |
`cookies` | string | No | — | Session cookies for authenticated pages |
