Spidra API Python tutorial: scrape any website with Python

Web scraping with Python has a well-worn path. You start with requests and BeautifulSoup for simple static pages. Then you hit a JavaScript-rendered site and reach for Playwright. Then you hit Cloudflare and spend two hours debugging stealth plugins. Then your selectors break because the site redesigned.

Spidra's Python SDK cuts across that whole progression. You install one package, describe what you want in plain English, and get back structured data from any website. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all happen on Spidra's infrastructure. You get clean results back.

This tutorial walks through the entire Python SDK from installation to crawling a full website. All code examples come directly from the SDK and will work as written.

Prerequisites #

Python 3.9 or higher
A Spidra API key (get one free at app.spidra.io under Settings → API Keys)

Installation #

pip install spidra

Once installed, store your API key as an environment variable. Never hardcode it in your scripts.

export SPIDRA_API_KEY="spd_YOUR_API_KEY"

Setting up the client #

Everything in the SDK flows through a single SpidraClient instance. You initialise it once and then access all functionality through its namespaced attributes.

from spidra import SpidraClient

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")

In practice, pull the key from your environment:

import os
from spidra import SpidraClient

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
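Indexing os.environ directly raises a bare KeyError when the variable is missing. A minimal sketch of a friendlier lookup follows; load_api_key is a hypothetical helper written for this tutorial, not part of the SDK.

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "SPIDRA_API_KEY",
) -> str:
    """Return the API key from the environment, or raise a readable error."""
    # Accept an explicit mapping so the helper is easy to test.
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this script")
    return key
```

You would then construct the client with SpidraClient(api_key=load_api_key()).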

The client exposes five namespaces:

Namespace	What it does
`spidra.scrape`	Scrape one to three URLs with browser automation and AI extraction
`spidra.batch`	Process up to 50 URLs in parallel
`spidra.crawl`	Discover and scrape pages across an entire site
`spidra.logs`	Access the history of every scrape your API key has made
`spidra.usage`	Check credit and request consumption

Async by default, sync anywhere #

The SDK is async-first. Every method is an async function that you await inside an async context.

import asyncio
import os
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

async def main():
    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://news.ycombinator.com")],
        prompt="Extract the top 5 post titles and their point scores",
        output="json",
    ))
    print(job.result.content)

asyncio.run(main())

If you are working in a regular script, a Django view, a Flask route, or a Jupyter notebook, use the _sync counterpart instead. It handles the event loop automatically, including environments like Jupyter where calling asyncio.run() directly would fail.

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="Extract the top 5 post titles and their point scores",
    output="json",
))

print(job.result.content)

Every method in the SDK has both versions. The rest of this tutorial uses _sync in the examples for simplicity, but the async versions work identically: drop the _sync suffix and add await.
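One practical reason to reach for the async methods is running several independent jobs at once. The sketch below shows the general asyncio pattern with a concurrency cap. fake_scrape is a hypothetical stand-in for an awaited SDK call, and gather_limited is an illustrative helper, not an SDK function.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    urls: List[str],
    scrape_one: Callable[[str], Awaitable[object]],
    max_concurrent: int = 3,
) -> List[object]:
    """Run scrape_one(url) for every URL, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> object:
        async with semaphore:
            return await scrape_one(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_scrape(url: str) -> str:
    """Hypothetical stand-in for an awaited SDK call."""
    await asyncio.sleep(0)
    return f"scraped {url}"


if __name__ == "__main__":
    demo_urls = ["https://a.example", "https://b.example"]
    print(asyncio.run(gather_limited(demo_urls, fake_scrape)))
```

In real code, scrape_one would wrap a call such as await spidra.scrape.run(...) and return whatever part of the job you need.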

Part 1: Scraping a page #

The scrape namespace handles single-page scraping. You can pass up to three URLs per request and they run in parallel.

Your first scrape

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
))

print(job.result.content)

Without a prompt, Spidra returns the raw page content as Markdown. The page loads in a real browser, JavaScript executes, and the full rendered content is converted to clean Markdown. That is what ends up in job.result.content.

How the job lifecycle works

When you call run_sync(), the SDK submits the job, then polls in the background every 3 seconds until it is done. From your side it looks synchronous. Under the hood, the job moves through these states:

waiting → active → completed (or failed)

waiting means the job is queued. active means the browser is running. completed means the result is ready. failed means something went wrong.

If you want to submit a job and check on it later rather than waiting for it to finish, use submit() and get() separately:

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
import os, time

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

queued = spidra.scrape.submit_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))

print(f"Job submitted: {queued.job_id}")

time.sleep(5)
status = spidra.scrape.get_sync(queued.job_id)

if status.status == "completed":
    print(status.result.content)
elif status.status == "failed":
    print(f"Failed: {status.error}")
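The example above checks the job once after a fixed sleep. If you manage polling yourself, a loop with an interval and a deadline is more robust. The following is a minimal sketch of that pattern using the job states described earlier; poll_until_done is a hypothetical helper, and its defaults simply mirror the 3-second interval and 120-second timeout mentioned in this tutorial.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[], str],
    poll_interval: float = 3.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status() until it reports a terminal state or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        # Stop if the next check would land after the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(poll_interval)
```

With the SDK, fetch_status could be something like lambda: spidra.scrape.get_sync(queued.job_id).status. The sleep and clock parameters exist only so the loop can be tested without waiting.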

Part 2: Extracting data with prompts #

The prompt field is what makes Spidra different from a plain headless browser scraper. Instead of writing CSS selectors to find elements, you describe what you want in plain English and the AI figures out where it is on the page.

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="Extract the top 10 post titles and their point scores",
    output="json",
))

print(job.result.content)

Setting output="json" tells the AI to return structured JSON rather than formatted text. The default is "markdown".

The AI reads the rendered page the way a person would. It knows a number next to a currency symbol is a price, that a short bold line at the top of a product page is probably the title, and that a longer block of text is probably a description. You do not need to know the class names or DOM structure of the page.

That said, Spidra also fully supports CSS selectors and XPath for browser actions if you prefer to be explicit about where to find things. We will cover that in the browser actions section.

Part 3: Enforcing output shape with JSON schema #

Plain prompts are flexible but not predictable. The AI decides what fields to return and what to name them. That works for exploration but it is a problem in production where a database or downstream service expects a specific shape every time.

The schema field solves this. Pass a JSON Schema object and the AI must return data matching it exactly. Fields marked as required always appear in the output. If the page does not have a value for a required field, it comes back as None rather than being silently omitted.

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details. Normalize salary to a USD number.",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "company", "remote"],
        "properties": {
            "title":           {"type": "string"},
            "company":         {"type": "string"},
            "remote":          {"type": ["boolean", "null"]},
            "salary_min":      {"type": ["number", "null"]},
            "salary_max":      {"type": ["number", "null"]},
            "employment_type": {
                "type": ["string", "null"],
                "enum": ["full_time", "part_time", "contract", None]
            },
            "skills": {"type": "array", "items": {"type": "string"}},
        },
    },
))

print(job.result.content)

When you provide a schema, output is automatically set to "json". You do not need to set it yourself.
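Schema enforcement happens on Spidra's side, but a cheap local check before writing to a database can still catch surprises. The sketch below uses only the standard library and is not a full JSON Schema validator. missing_required and null_fields are hypothetical helpers, and the record is made-up sample data.

```python
import json
from typing import Any, Dict, List

# A trimmed copy of the job-listing schema from the example above.
JOB_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["title", "company", "remote"],
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "remote": {"type": ["boolean", "null"]},
    },
}


def missing_required(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return required keys that are absent from the record."""
    return [key for key in schema.get("required", []) if key not in record]


def null_fields(record: Dict[str, Any]) -> List[str]:
    """Return keys whose value is None (JSON null), sorted by name."""
    return sorted(key for key, value in record.items() if value is None)


# Made-up sample record: the page had no remote information.
record = {"title": "Senior Engineer", "company": "Example Corp", "remote": None}
print(missing_required(record, JOB_SCHEMA))  # []
print(null_fields(record))                   # ['remote']
print(json.dumps(record))                    # None is serialised as JSON null
```

The same mapping works in reverse: a None inside a Python schema dict, such as in an enum list, is sent as null when the schema is serialised to JSON.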

If you use Pydantic for data validation in your application, you can generate the schema from your existing models rather than writing it by hand:

from pydantic import BaseModel
from typing import Optional

class JobListing(BaseModel):
    title: str
    company: str
    remote: Optional[bool] = None
    salary_min: Optional[float] = None
    salary_max: Optional[float] = None

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details",
    schema=JobListing.model_json_schema(),
))

One schema definition in your codebase. Works in your application logic and in your scraping requests.

Part 4: Browser actions #

Some pages require interaction before the content you want is visible. A cookie banner blocking everything. A search form that needs filling. Lazy-loaded content that only appears after scrolling. Tabs that hide data until clicked.

The actions list inside each ScrapeUrl lets you interact with the page before extraction runs. Actions execute in order inside the browser.

from spidra import BrowserAction

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://example.com/products",
            actions=[
                BrowserAction(type="click", selector="#accept-cookies"),
                BrowserAction(type="wait", duration=1000),
                BrowserAction(type="scroll", to="80%"),
            ],
        ),
    ],
    prompt="Extract all product names and prices visible on the page",
))

For click, check, and uncheck actions, you have two options for targeting elements:

selector for a CSS selector or XPath expression like "#accept-cookies" or ".submit-btn"
value for a plain English description like "Accept cookies button", which Spidra uses to locate the element with AI

Both are valid and you can mix them in the same actions list:

actions=[
    BrowserAction(type="click", selector="#accept-cookies"),  # CSS selector
    BrowserAction(type="click", value="Search button"),        # plain English
]

Use whichever is more convenient for the page you are working with.

All available actions

Action	What it does	Key fields
`click`	Clicks a button, link, or any element	`selector` or `value`
`type`	Types text into an input field	`selector`, `value`
`check`	Checks a checkbox	`selector` or `value`
`uncheck`	Unchecks a checkbox	`selector` or `value`
`wait`	Waits for a number of milliseconds	`duration`
`scroll`	Scrolls to a percentage of the page height	`to` (e.g. `"80%"`)
`forEach`	Finds matching elements and processes each one	`value`, `mode`

The forEach action

forEach is the most powerful action in the SDK. It finds a set of matching elements on the page and processes each one individually, then combines all the results into a single output.

It works in three modes:

inline reads the content of each matched element directly. Use this for product cards, table rows, or any content that lives inside the element.

navigate follows each element as a link, loads the destination page, and scrapes it. Use this when the data you want is on detail pages you need to click into.

click clicks each element to expand or reveal content, then scrapes what appears. Use this for accordions, modals, or expandable sections.

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://directory.example.com/companies",
            actions=[
                BrowserAction(type="click", value="Accept cookies"),
                BrowserAction(
                    type="forEach",
                    value="Find all company listing cards",
                    mode="navigate",
                    max_items=20,
                    item_prompt="Extract company name, website, and industry",
                    pagination={
                        "nextSelector": "a.next-page",
                        "maxPages": 3
                    }
                ),
            ],
        ),
    ],
    output="json",
))

This dismisses the cookie banner, finds every company card on the page, navigates into each company's profile page, extracts the company details, and repeats across three pages of pagination. All in a single request.

Part 5: Proxy and geo-targeting #

Some sites block requests from cloud infrastructure IP ranges. Others show different content depending on where you are browsing from. Setting use_proxy=True routes the request through a residential proxy.

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://www.amazon.de/gp/bestsellers")],
    prompt="List the top 10 products with name and price",
    use_proxy=True,
    proxy_country="de",
))

proxy_country accepts:

A two-letter ISO country code like "us", "de", "gb", "fr", "jp"
"eu" to rotate randomly across all 27 EU member states
"global" or omit it for no country preference
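If the country comes from user input or configuration, it can help to check it against the three accepted forms before sending a request. This is a minimal sketch; normalize_proxy_country is a hypothetical helper, lowercasing is an assumption based on the lowercase examples here, and the check only tests the shape of the value, not whether the code is an assigned ISO country.

```python
import re
from typing import Optional


def normalize_proxy_country(value: Optional[str]) -> Optional[str]:
    """Return a cleaned proxy_country value, or None for no preference."""
    if value is None:
        return None
    code = value.strip().lower()
    # Accept "global", or any two-letter code (which also covers "eu").
    if code == "global" or re.fullmatch(r"[a-z]{2}", code):
        return code
    raise ValueError(f"unsupported proxy_country: {value!r}")
```

For example, normalize_proxy_country("DE") returns "de", while normalize_proxy_country("germany") raises ValueError.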

Proxy usage is billed from your bandwidth quota, not your credits. There is no credit multiplier for enabling proxy routing.

Part 6: Authenticated scraping #

To access content that requires authentication, pass your session cookies as a raw cookie header string. Log in through your browser, open DevTools, copy the Cookie header from any authenticated request, and pass it here.

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://app.example.com/dashboard")],
    prompt="Extract the monthly revenue and active user count",
    cookies="session=abc123; auth_token=xyz789",
))

Both standard cookie format (name=value; name2=value2) and Chrome DevTools paste format work. Cookies are passed ephemerally to the browser worker and never stored by Spidra.
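If your cookies live in a dictionary, for example after loading them from a secrets store, they need to be joined into the name=value; name2=value2 form first. A minimal sketch follows; cookie_header is a hypothetical helper, not an SDK function.

```python
from typing import Mapping


def cookie_header(cookies: Mapping[str, str]) -> str:
    """Join cookie names and values into a raw Cookie header string."""
    for name, value in cookies.items():
        # A semicolon or newline would corrupt the header, so reject it early.
        if any(ch in name + value for ch in ";\r\n"):
            raise ValueError(f"invalid character in cookie {name!r}")
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


print(cookie_header({"session": "abc123", "auth_token": "xyz789"}))
# session=abc123; auth_token=xyz789
```

The resulting string can be passed as the cookies argument shown above.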

Part 7: Stripping boilerplate with extract_content_only #

By default Spidra returns the full page content including navigation, headers, footers, and sidebars. If you only want the main content, turn on extract_content_only. It strips the noise before the AI sees the page, which reduces token usage and keeps the result focused.

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://blog.example.com/long-article")],
    prompt="Summarize this article in three sentences",
    extract_content_only=True,
))

This is particularly useful for article pages, documentation, and any page where the main content is surrounded by heavy navigation.

Part 8: Screenshots #

Capture screenshots of scraped pages for debugging, monitoring, or archival.

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    screenshot=True,
    full_page_screenshot=True,
))

print(job.result.screenshots)  # list of URLs

screenshot=True captures the visible viewport. full_page_screenshot=True captures the entire scrollable page.

Part 9: Controlling polling behaviour #

By default run_sync() polls every 3 seconds and gives up after 120 seconds. For complex pages or large crawls that take longer, pass a PollOptions object to override both.

from spidra import PollOptions

job = spidra.scrape.run_sync(
    ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract all content from this page",
    ),
    PollOptions(poll_interval=5, timeout=180),
)

PollOptions works on batch.run_sync() and crawl.run_sync() too.

Part 10: Batch scraping #

When you have a list of URLs to process, the batch endpoint handles up to 50 at a time in parallel. Each URL runs in its own independent worker.

Note that batch URLs are plain strings, not ScrapeUrl objects. Per-URL browser actions are not supported in batch mode.

from spidra import SpidraClient, BatchScrapeParams
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=[
        "https://shop.example.com/product/1",
        "https://shop.example.com/product/2",
        "https://shop.example.com/product/3",
    ],
    prompt="Extract product name, price, and whether it is in stock",
    output="json",
))

print(f"{batch.completed_count}/{batch.total_urls} completed")

for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result)
    else:
        print(f"Failed: {item.url} — {item.error}")

Batch with schema

The same schema enforcement that works in single scraping works in batch. Every item returns data matching the same shape:

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=urls,
    prompt="Extract the product details",
    schema={
        "type": "object",
        "required": ["name", "price"],
        "properties": {
            "name":      {"type": "string"},
            "price":     {"type": ["number", "null"]},
            "currency":  {"type": ["string", "null"]},
            "available": {"type": ["boolean", "null"]}
        }
    }
))

Managing batches

Once a batch is running, you have a few additional operations available:

Retrying failures. If some items fail due to transient errors, retry just those without re-running the ones that already succeeded:

if batch.failed_count > 0:
    # queued is assumed to be the handle returned when the batch was submitted
    spidra.batch.retry_sync(queued.batch_id)

Cancelling a batch. Stop a running batch and get credits refunded for anything that has not started yet:

response = spidra.batch.cancel_sync(batch_id)
print(f"Cancelled {response.cancelled_items} items, refunded {response.credits_refunded} credits")

Listing past batches:

from spidra import BatchListParams

page = spidra.batch.list_sync(BatchListParams(page=1, limit=20))

for job in page.jobs:
    print(job.uuid, job.status, f"{job.completed_count}/{job.total_urls}")

Processing large URL lists

The batch endpoint caps at 50 URLs per request. For larger lists, chunk them and process in batches:

import os, json
from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def scrape_url_list(urls: list[str], prompt: str, batch_size: int = 50) -> list:
    all_results = []

    for i in range(0, len(urls), batch_size):
        chunk = urls[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1} of {-(-len(urls) // batch_size)}...")

        batch = spidra.batch.run_sync(BatchScrapeParams(
            urls=chunk,
            prompt=prompt,
            output="json",
        ))

        for item in batch.items:
            if item.status == "completed":
                all_results.append({
                    "url": item.url,
                    "data": item.result
                })
            else:
                print(f"  Failed: {item.url}")

    return all_results

urls = [f"https://example.com/product/{i}" for i in range(1, 201)]
results = scrape_url_list(urls, "Extract product name and price")

with open("results.jsonl", "w") as f:
    for record in results:
        f.write(json.dumps(record) + "\n")

print(f"Saved {len(results)} results")
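The expression -(-len(urls) // batch_size) in the progress message is ceiling division: it rounds the batch count up so a final partial chunk is counted. The sketch below isolates the chunking logic so it can be checked without calling the API; chunked and chunk_count are hypothetical helper names.

```python
import math
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def chunk_count(total: int, size: int = 50) -> int:
    """Number of chunks needed for `total` items (ceiling division)."""
    return math.ceil(total / size)


urls = [f"https://example.com/product/{i}" for i in range(1, 121)]
print([len(chunk) for chunk in chunked(urls)])  # [50, 50, 20]
print(chunk_count(len(urls)))                   # 3
```

Each chunk can then be passed as the urls argument of one batch request.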

Part 11: Crawling entire websites #

Batch scraping works when you already have a list of URLs. Crawling is for when you want Spidra to discover pages for you.

You give it a starting URL, describe which pages to follow, and describe what to extract from each one. Spidra loads the base URL, finds links matching your crawl instruction, visits each one, and applies your transform instruction to every page it visits.

from spidra import SpidraClient, CrawlParams, PollOptions
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://competitor.com/blog",
        crawl_instruction="Follow links to blog posts only. Skip tag pages, category pages, and the homepage.",
        transform_instruction="Extract the post title, author name, publish date, and a one-sentence summary.",
        max_pages=20,
        use_proxy=True,
    ),
    PollOptions(timeout=360),
)

for page in job.result:
    print(page.url, page.data)

Three fields are required: base_url, crawl_instruction, and transform_instruction.

crawl_instruction tells the crawler which links to follow. transform_instruction tells the AI what to extract from each page it visits. max_pages defaults to 5 and goes up to 20. Pass a higher timeout in PollOptions for larger crawls since the default 120 seconds may not be enough.

The same use_proxy, proxy_country, and cookies options from single scraping all work here too.

Downloading the raw content

Once a crawl completes, you can fetch the raw HTML and Markdown for every page that was crawled. The URLs are signed and expire after an hour.

response = spidra.crawl.pages_sync(job_id)

for page in response.pages:
    print(page.url, page.status)

Re-extracting with a different prompt

If you crawled a site and later want to pull out different information, you do not have to re-crawl. extract() runs a new AI pass over the already-crawled content and only charges transformation credits.

queued = spidra.crawl.extract_sync(
    completed_job_id,
    "Extract only product SKUs and prices as structured JSON",
)

result = spidra.crawl.get_sync(queued.job_id)

Browsing crawl history

from spidra import CrawlHistoryParams

response = spidra.crawl.history_sync(CrawlHistoryParams(page=1, limit=10))
print(f"Total crawl jobs: {response.total}")

stats = spidra.crawl.stats_sync()
print(f"All-time crawls: {stats.total}")

Part 12: Logs and usage #

Browsing your scrape logs

Every request your API key makes is logged automatically. You can filter by status, URL, date range, and more.

from spidra import ScrapeLogsParams

response = spidra.logs.list_sync(ScrapeLogsParams(
    status="failed",
    search_term="amazon.com",
    date_start="2025-01-01",
    date_end="2025-12-31",
    page=1,
    limit=20,
))

for log in response.logs:
    print(log.urls[0].get("url"), log.status, log.credits_used)

To get full details of a single log entry including the extraction output:

log = spidra.logs.get_sync(log_uuid)
print(log.result_data)

Checking usage

Track your credit and request consumption over time:

rows = spidra.usage.get_sync("30d")  # "7d" | "30d" | "weekly"

for row in rows:
    print(row.date, row.requests, row.credits)

"7d" gives one row per day for the last week. "30d" gives the last 30 days. "weekly" gives one row per week for the last seven weeks.
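Once you have the rows, totals and an average cost per request are a short aggregation away. This sketch uses SimpleNamespace objects with made-up numbers as stand-ins for the SDK's row objects, assuming only the date, requests, and credits attributes shown above. summarize_usage is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Dict, Iterable


def summarize_usage(rows: Iterable[object]) -> Dict[str, float]:
    """Total requests and credits, plus average credits per request."""
    rows = list(rows)
    total_requests = sum(row.requests for row in rows)
    total_credits = sum(row.credits for row in rows)
    # Guard against division by zero when there was no traffic.
    average = total_credits / total_requests if total_requests else 0.0
    return {
        "requests": total_requests,
        "credits": total_credits,
        "credits_per_request": average,
    }


# Made-up sample rows standing in for spidra.usage.get_sync("7d").
sample_rows = [
    SimpleNamespace(date="2025-01-01", requests=10, credits=25),
    SimpleNamespace(date="2025-01-02", requests=30, credits=55),
]
print(summarize_usage(sample_rows))
# {'requests': 40, 'credits': 80, 'credits_per_request': 2.0}
```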

Part 13: Error handling #

Every API error maps to a typed exception class. Catch exactly what you care about and let everything else bubble up.

from spidra import (
    SpidraError,
    SpidraAuthenticationError,
    SpidraInsufficientCreditsError,
    SpidraRateLimitError,
    SpidraServerError,
)

try:
    job = spidra.scrape.run_sync(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract the main headline",
    ))
    print(job.result.content)

except SpidraAuthenticationError:
    print("API key is missing or invalid. Check your SPIDRA_API_KEY.")

except SpidraInsufficientCreditsError:
    print("Account is out of credits. Top up at app.spidra.io.")

except SpidraRateLimitError:
    print("Rate limit hit. Wait before retrying.")

except SpidraServerError as e:
    print(f"Server error ({e.status}): {e.message}. Retry is usually safe.")

except SpidraError as e:
    print(f"API error {e.status}: {e.message}")

Exception	HTTP status	When it fires
`SpidraAuthenticationError`	401	API key missing or invalid
`SpidraInsufficientCreditsError`	403	No credits remaining
`SpidraRateLimitError`	429	Too many requests
`SpidraServerError`	500	Unexpected error on Spidra's side
`SpidraError`	any	Base class for all Spidra exceptions

All exceptions expose .status for the HTTP code and .message for a human-readable explanation.
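Rate-limit and server errors are usually worth retrying after a pause. The sketch below shows a generic exponential backoff wrapper; retry_with_backoff is a hypothetical helper rather than an SDK feature, and the delay values are illustrative defaults.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on a retryable error wait 1x, 2x, 4x... base_delay and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")  # loop always returns or raises
```

With the SDK you might pass retry_on=(SpidraRateLimitError, SpidraServerError) and wrap the scrape call in a lambda. Authentication and credit errors are deliberately left out, since retrying them will not help.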

Also check the ai_extraction_failed flag in the result. If AI extraction fails for any reason, Spidra falls back to returning the raw page Markdown and sets this flag so your code can detect it:

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))

if job.result.ai_extraction_failed:
    raw = job.result.data[0].markdown_content
    print("Extraction failed, falling back to raw content")
else:
    print(job.result.content)

Putting it all together: a complete pipeline #

Here is a full example that uses browser actions with forEach to collect job listings from a directory, enforces a schema on the output, handles errors properly, and saves results to JSONL:

import os, json
from spidra import (
    SpidraClient,
    ScrapeParams,
    ScrapeUrl,
    BrowserAction,
    SpidraError,
    SpidraInsufficientCreditsError,
)

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

JOB_SCHEMA = {
    "type": "object",
    "required": ["title", "company", "location"],
    "properties": {
        "title":           {"type": "string"},
        "company":         {"type": "string"},
        "location":        {"type": ["string", "null"]},
        "remote":          {"type": ["boolean", "null"]},
        "salary_min":      {"type": ["number", "null"]},
        "salary_max":      {"type": ["number", "null"]},
        "employment_type": {
            "type": ["string", "null"],
            "enum": ["full_time", "part_time", "contract", None]
        },
    },
}

def collect_listings(board_url: str) -> list:
    try:
        job = spidra.scrape.run_sync(ScrapeParams(
            urls=[
                ScrapeUrl(
                    url=board_url,
                    actions=[
                        BrowserAction(type="click", value="Accept cookies"),
                        BrowserAction(
                            type="forEach",
                            value="Find all job listing cards",
                            mode="navigate",
                            max_items=50,
                            item_prompt="Extract job title, company, location, remote status, salary range, and employment type",
                            pagination={
                                "nextSelector": "a.next-page",
                                "maxPages": 3
                            }
                        ),
                    ],
                )
            ],
            output="json",
            schema=JOB_SCHEMA,
        ))

        if job.result.ai_extraction_failed:
            print(f"Warning: AI extraction failed for {board_url}")
            return []

        content = job.result.content
        return content if isinstance(content, list) else [content]

    except SpidraInsufficientCreditsError:
        print("Out of credits. Stopping.")
        return []
    except SpidraError as e:
        print(f"Error scraping {board_url}: {e.message}")
        return []

boards = [
    "https://jobs.example.com/engineering",
    "https://careers.anothersite.com/remote",
]

all_jobs = []
for board in boards:
    print(f"Collecting from {board}...")
    listings = collect_listings(board)
    all_jobs.extend(listings)
    print(f"  Got {len(listings)} listings")

with open("jobs.jsonl", "w") as f:
    for job in all_jobs:
        f.write(json.dumps(job) + "\n")

print(f"\nDone. {len(all_jobs)} jobs saved to jobs.jsonl")
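JSONL stores one JSON object per line, which makes the output easy to append to and to read back incrementally. A minimal round-trip sketch using only the standard library follows; write_jsonl and read_jsonl are hypothetical helpers and the records are made-up sample data.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, List


def write_jsonl(path: str, records: Iterable[Dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample records; None round-trips through JSON null.
sample_jobs = [
    {"title": "Senior Engineer", "company": "Example Corp", "location": None},
    {"title": "Data Analyst", "company": "Another Site", "location": "Remote"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "jobs.jsonl")
    write_jsonl(path, sample_jobs)
    loaded = read_jsonl(path)

print(loaded == sample_jobs)  # True
```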

All scrape parameters #

For reference, here is the full list of parameters you can pass to ScrapeParams:

Parameter	Type	Description
`urls`	list	Up to 3 `ScrapeUrl` objects. Each takes a `url` and optional `actions`.
`prompt`	str	What to extract, in plain English
`output`	str	`"markdown"` (default) or `"json"`
`schema`	dict	JSON Schema for a guaranteed output shape
`use_proxy`	bool	Route through a residential proxy
`proxy_country`	str	Two-letter country code or `"eu"` / `"global"`
`extract_content_only`	bool	Strip nav, ads, and boilerplate before AI extraction
`screenshot`	bool	Capture a viewport screenshot
`full_page_screenshot`	bool	Capture a full-page screenshot
`cookies`	str	Raw `Cookie` header string for authenticated pages

What to read next #

If you want to go deeper on any part of the SDK:

Browser actions guide covers every option for each action type including all forEach parameters
Structured output guide covers schemas in depth including Pydantic integration and schema limits
Stealth mode guide has the full country list and proxy options
Authenticated scraping guide covers how to get cookies from your browser and the formats Spidra accepts

Get your API key at app.spidra.io. The free plan has 300 credits and no card required.

Source: spidra.io (original article)