Spidra API Python tutorial: scrape any website with Python

Spidra released a Python SDK that allows developers to scrape any website — including those with JavaScript rendering, anti-bot protections, and CAPTCHAs — using a single package and plain English prompts. The SDK handles browser automation, anti-bot bypass, and AI extraction on Spidra's infrastructure, returning structured data without requiring users to manage proxies, stealth plugins, or selector maintenance. The tool is available now with a free API key from app.spidra.io.

Web scraping with Python has a well-worn path. You start with requests and BeautifulSoup for simple static pages. Then you hit a JavaScript-rendered site and reach for Playwright. Then you hit Cloudflare and spend two hours debugging stealth plugins. Then your selectors break because the site redesigned.

Spidra's Python SDK cuts across that whole progression. You install one package, describe what you want in plain English, and get back structured data from any website. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all happen on Spidra's infrastructure. You get clean results back.

This tutorial walks through the entire Python SDK from installation to crawling a full website. All code examples come directly from the SDK and will work as written.

## Prerequisites

- Python 3.9 or higher
- A Spidra API key (get one free at [app.spidra.io](https://app.spidra.io/) under Settings → API Keys)

## Installation

```bash
pip install spidra
```

Once installed, store your API key as an environment variable. Never hardcode it in your scripts.

```bash
export SPIDRA_API_KEY="spd_YOUR_API_KEY"
```

## Setting up the client

Everything in the SDK flows through a single `SpidraClient` instance. You initialise it once and then access all functionality through its namespaced attributes.

```python
from spidra import SpidraClient

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")
```

In practice, pull the key from your environment:

```python
import os
from spidra import SpidraClient

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
```

The client exposes five namespaces:

| Namespace | What it does |
|---|---|
| `spidra.scrape` | Scrape one to three URLs with browser automation and AI extraction |
| `spidra.batch` | Process up to 50 URLs in parallel |
| `spidra.crawl` | Discover and scrape pages across an entire site |
| `spidra.logs` | Access the history of every scrape your API key has made |
| `spidra.usage` | Check credit and request consumption |

## Async by default, sync anywhere

The SDK is async-first.
Every method is an async function that you await inside an async context.

```python
import asyncio
import os
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

async def main():
    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://news.ycombinator.com")],
        prompt="Extract the top 5 post titles and their point scores",
        output="json",
    ))
    print(job.result.content)

asyncio.run(main())
```

If you are working in a regular script, a Django view, a Flask route, or a Jupyter notebook, use the sync counterpart instead. It handles the event loop automatically, including in environments like Jupyter where calling `asyncio.run` directly would fail.

```python
import os
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

# Works anywhere without async/await
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="Extract the top 5 post titles and their point scores",
    output="json",
))
print(job.result.content)
```

Every method in the SDK has both versions. The rest of this tutorial uses the sync versions for simplicity; the async versions work identically: drop the `_sync` suffix and `await` the call, as in the first example above.

## Part 1: Scraping a page

The `scrape` namespace handles single-page scraping. You can pass up to three URLs per request and they run in parallel.

### Your first scrape

```python
import os
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
))
print(job.result.content)
```

Without a `prompt`, Spidra returns the raw page content as Markdown. The page loads in a real browser, JavaScript executes, and the full rendered content is converted to clean Markdown. That is what ends up in `job.result.content`.
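As background to the sync methods used from here on, the following is a minimal sketch of how a sync wrapper around an async function can cope with an event loop that is already running, which is the situation in Jupyter. It is not Spidra's implementation; `run_coro_sync` and `fake_scrape` are hypothetical names used only for illustration.

```python
import asyncio
import threading


async def fake_scrape(url: str) -> str:
    """Hypothetical stand-in for an async SDK call."""
    await asyncio.sleep(0)
    return f"content of {url}"


def run_coro_sync(coro):
    """Run a coroutine to completion from synchronous code.

    With no event loop running in this thread, asyncio.run is enough.
    If a loop is already running (as in a notebook), asyncio.run would
    raise RuntimeError, so the coroutine runs on a fresh loop in a
    helper thread instead.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)

    outcome = {}

    def worker():
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


print(run_coro_sync(fake_scrape("https://example.com")))
```

Joining the helper thread blocks the caller until the coroutine finishes, which is the point of a sync wrapper, but it also means the already-running loop is paused for that time.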
### How the job lifecycle works

When you call `run_sync`, the SDK submits the job, then polls in the background every 3 seconds until it is done. From your side it looks synchronous. Under the hood, the job moves through these states:

`waiting` → `active` → `completed` or `failed`

- `waiting` means the job is queued.
- `active` means the browser is running.
- `completed` means the result is ready.
- `failed` means something went wrong.

If you want to submit a job and check on it later rather than waiting for it to finish, use `submit` and `get` separately:

```python
import os, time
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

# Submit and get a job ID immediately
queued = spidra.scrape.submit_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))
print(f"Job submitted: {queued.job_id}")

# Come back later and check
time.sleep(5)
status = spidra.scrape.get_sync(queued.job_id)
if status.status == "completed":
    print(status.result.content)
elif status.status == "failed":
    print(f"Failed: {status.error}")
```

## Part 2: Extracting data with prompts

The `prompt` field is what makes Spidra different from a plain headless browser scraper. Instead of writing CSS selectors to find elements, you describe what you want in plain English and the AI figures out where it is on the page.

```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="Extract the top 10 post titles and their point scores",
    output="json",
))
print(job.result.content)
# [{"title": "Show HN: I built a thing", "points": 342}, ...]
```

Setting `output="json"` tells the AI to return structured JSON rather than formatted text. The default is `"markdown"`.

The AI reads the rendered page the way a person would. It knows a number next to a currency symbol is a price, that a short bold line at the top of a product page is probably the title, and that a longer block of text is probably a description.
You do not need to know the class names or DOM structure of the page.

That said, Spidra also fully supports CSS selectors and XPath for browser actions if you prefer to be explicit about where to find things. We will cover that in the browser actions section.

## Part 3: Enforcing output shape with JSON schema

Plain prompts are flexible but not predictable. The AI decides what fields to return and what to name them. That works for exploration, but it is a problem in production, where a database or downstream service expects a specific shape every time.

The `schema` field solves this. Pass a JSON Schema object and the AI must return data matching it exactly. Fields marked as `required` always appear in the output. If the page does not have a value for a required field, it comes back as `None` rather than being silently omitted.

```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details. Normalize salary to a USD number.",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "company", "remote"],
        "properties": {
            "title": {"type": "string"},
            "company": {"type": "string"},
            "remote": {"type": ["boolean", "null"]},
            "salary_min": {"type": ["number", "null"]},
            "salary_max": {"type": ["number", "null"]},
            "employment_type": {
                "type": ["string", "null"],
                "enum": ["full_time", "part_time", "contract", None],
            },
            "skills": {"type": "array", "items": {"type": "string"}},
        },
    },
))
print(job.result.content)
# {
#     "title": "Senior Software Engineer",
#     "company": "Acme Corp",
#     "remote": True,
#     "salary_min": 120000,
#     "salary_max": 160000,
#     "employment_type": "full_time",
#     "skills": ["Python", "PostgreSQL", "AWS"]
# }
```

When you provide a `schema`, `output` is automatically set to `"json"`. You do not need to set it yourself.
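Because downstream code depends on that shape, a cheap local check before writing a record to a database can still be worthwhile. The following is a minimal sketch, assuming the result has already been parsed into a Python dict. `check_record` is a hypothetical helper, not part of the SDK, and it covers only the `required` keyword and simple `type` keywords rather than full JSON Schema.

```python
# Map JSON Schema type names to Python types (a small subset only).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
    "null": type(None),
}


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")

    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            continue
        allowed = spec.get("type", [])
        if isinstance(allowed, str):
            allowed = [allowed]
        value = record[name]
        matched = False
        for type_name in allowed:
            # bool is a subclass of int in Python, so keep it out of "number".
            if type_name == "number" and isinstance(value, bool):
                continue
            if isinstance(value, JSON_TYPES[type_name]):
                matched = True
        if allowed and not matched:
            expected = " or ".join(allowed)
            problems.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return problems


schema = {
    "type": "object",
    "required": ["title", "remote"],
    "properties": {
        "title": {"type": "string"},
        "remote": {"type": ["boolean", "null"]},
        "salary_min": {"type": ["number", "null"]},
    },
}

print(check_record({"title": "Senior Engineer", "remote": None}, schema))
print(check_record({"remote": "yes", "salary_min": True}, schema))
```

For anything beyond a smoke test, a full JSON Schema validator is the better choice; this sketch only shows the idea of failing early on a missing or mistyped field.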
If you use Pydantic for data validation in your application, you can generate the schema from your existing models rather than writing it by hand:

```python
from pydantic import BaseModel
from typing import Optional

class JobListing(BaseModel):
    title: str
    company: str
    remote: Optional[bool] = None
    salary_min: Optional[float] = None
    salary_max: Optional[float] = None

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details",
    schema=JobListing.model_json_schema(),
))
```

You keep one schema definition in your codebase, and it works both in your application logic and in your scraping requests.

## Part 4: Browser actions

Some pages require interaction before the content you want is visible: a cookie banner blocking everything, a search form that needs filling, lazy-loaded content that only appears after scrolling, or tabs that hide data until clicked.

The `actions` list inside each `ScrapeUrl` lets you interact with the page before extraction runs. Actions execute in order inside the browser.

```python
from spidra import BrowserAction

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://example.com/products",
            actions=[
                BrowserAction(type="click", selector="#accept-cookies"),
                BrowserAction(type="wait", duration=1000),
                BrowserAction(type="scroll", to="80%"),
            ],
        ),
    ],
    prompt="Extract all product names and prices visible on the page",
))
```

For `click`, `check`, and `uncheck` actions, you have two options for targeting elements:

- `selector` for a CSS selector or XPath expression like `"#accept-cookies"` or `".submit-btn"`
- `value` for a plain English description like `"Accept cookies button"`, in which case Spidra locates the element using AI

Both are valid and you can mix them in the same actions list:

```python
actions=[
    BrowserAction(type="click", selector="#accept-cookies"),  # CSS selector
    BrowserAction(type="click", value="Search button"),       # plain English
]
```

Use whichever is more convenient for the page you are working with.
### All available actions

| Action | What it does | Key fields |
|---|---|---|
| `click` | Clicks a button, link, or any element | `selector` or `value` |
| `type` | Types text into an input field | `selector`, `value` |
| `check` | Checks a checkbox | `selector` or `value` |
| `uncheck` | Unchecks a checkbox | `selector` or `value` |
| `wait` | Pauses for a number of milliseconds | `duration` |
| `scroll` | Scrolls to a percentage of the page height | `to` (e.g. `"80%"`) |
| `forEach` | Finds matching elements and processes each one | `value`, `mode` |

### The forEach action

`forEach` is the most powerful action in the SDK. It finds a set of matching elements on the page and processes each one individually, then combines all the results into a single output.

It works in three modes:

- `inline` reads the content of each matched element directly. Use this for product cards, table rows, or any content that lives inside the element.
- `navigate` follows each element as a link, loads the destination page, and scrapes it. Use this when the data you want is on detail pages you need to click into.
- `click` clicks each element to expand or reveal content, then scrapes what appears. Use this for accordions, modals, or expandable sections.

```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://directory.example.com/companies",
            actions=[
                BrowserAction(type="click", value="Accept cookies"),
                BrowserAction(
                    type="forEach",
                    value="Find all company listing cards",
                    mode="navigate",
                    max_items=20,
                    item_prompt="Extract company name, website, and industry",
                    pagination={"nextSelector": "a.next-page", "maxPages": 3},
                ),
            ],
        ),
    ],
    output="json",
))
```

This dismisses the cookie banner, finds every company card on the page, navigates into each company's profile page, extracts the company details, and repeats across three pages of pagination, all in a single request.

## Part 5: Proxy and geo-targeting

Some sites block requests from cloud infrastructure IP ranges. Others show different content depending on where you are browsing from.
Setting `use_proxy=True` routes the request through a residential proxy.

```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://www.amazon.de/gp/bestsellers")],
    prompt="List the top 10 products with name and price",
    use_proxy=True,
    proxy_country="de",
))
```

`proxy_country` accepts:

- A two-letter ISO country code like `"us"`, `"de"`, `"gb"`, `"fr"`, `"jp"`
- `"eu"` to rotate randomly across all 27 EU member states
- `"global"`, or omit it, for no country preference

Proxy usage is billed from your bandwidth quota, not your credits. There is no credit multiplier for enabling proxy routing.

## Part 6: Scraping pages behind a login

To access content that requires authentication, pass your session cookies as a raw cookie header string. Log in through your browser, open DevTools, copy the `Cookie` header from any authenticated request, and pass it here.

```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://app.example.com/dashboard")],
    prompt="Extract the monthly revenue and active user count",
    cookies="session=abc123; auth_token=xyz789",
))
```

Both the standard cookie format (`name=value; name2=value2`) and the Chrome DevTools paste format work. Cookies are passed ephemerally to the browser worker and never stored by Spidra.

## Part 7: Stripping boilerplate with extract_content_only

By default Spidra returns the full page content including navigation, headers, footers, and sidebars. If you only want the main content, turn on `extract_content_only`. It strips the noise before the AI sees the page, which reduces token usage and keeps the result focused.

```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://blog.example.com/long-article")],
    prompt="Summarize this article in three sentences",
    extract_content_only=True,
))
```

This is particularly useful for article pages, documentation, and any page where the main content is surrounded by heavy navigation.

## Part 8: Screenshots

Capture screenshots of scraped pages for debugging, monitoring, or archival.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    screenshot=True,
    full_page_screenshot=True,
))

# Screenshot URLs are in the result
print(job.result.screenshots)  # list of URLs
```

`screenshot=True` captures the visible viewport. `full_page_screenshot=True` captures the entire scrollable page.

## Part 9: Controlling polling behaviour

By default `run_sync` polls every 3 seconds and gives up after 120 seconds. For complex pages or large crawls that take longer, pass a `PollOptions` object to override both.

```python
from spidra import PollOptions

job = spidra.scrape.run_sync(
    ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract all content from this page",
    ),
    PollOptions(poll_interval=5, timeout=180),
)
```

`PollOptions` works on `batch.run_sync` and `crawl.run_sync` too.

## Part 10: Batch scraping

When you have a list of URLs to process, the batch endpoint handles up to 50 at a time in parallel. Each URL runs in its own independent worker.

Note that batch URLs are plain strings, not `ScrapeUrl` objects. Per-URL browser actions are not supported in batch mode.

```python
import os
from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=[
        "https://shop.example.com/product/1",
        "https://shop.example.com/product/2",
        "https://shop.example.com/product/3",
    ],
    prompt="Extract product name, price, and whether it is in stock",
    output="json",
))

print(f"{batch.completed_count}/{batch.total_urls} completed")

for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result)
    else:
        print(f"Failed: {item.url}: {item.error}")
```

### Batch with schema

The same schema enforcement that works in single scraping works in batch.
Every item returns data matching the same shape:

```python
batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=urls,
    prompt="Extract the product details",
    schema={
        "type": "object",
        "required": ["name", "price"],
        "properties": {
            "name": {"type": "string"},
            "price": {"type": ["number", "null"]},
            "currency": {"type": ["string", "null"]},
            "available": {"type": ["boolean", "null"]},
        },
    },
))
```

### Managing batches

Once a batch is running, you have a few additional operations available.

Retrying failures. If some items fail due to transient errors, retry just those without re-running the ones that already succeeded:

```python
# `queued` is assumed to be the handle returned when the batch was submitted
if batch.failed_count > 0:
    spidra.batch.retry_sync(queued.batch_id)
```

Cancelling a batch. Stop a running batch and get credits refunded for anything that has not started yet:

```python
response = spidra.batch.cancel_sync(batch_id)
print(f"Cancelled {response.cancelled_items} items, refunded {response.credits_refunded} credits")
```

Listing past batches:

```python
from spidra import BatchListParams

page = spidra.batch.list_sync(BatchListParams(page=1, limit=20))
for job in page.jobs:
    print(job.uuid, job.status, f"{job.completed_count}/{job.total_urls}")
```

### Processing large URL lists

The batch endpoint caps at 50 URLs per request. For larger lists, chunk them and process in batches:

```python
import os, json
from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def scrape_url_list(urls: list[str], prompt: str, batch_size: int = 50) -> list:
    all_results = []
    # -(-a // b) is ceiling division: the total number of chunks
    total_batches = -(-len(urls) // batch_size)

    for i in range(0, len(urls), batch_size):
        chunk = urls[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1} of {total_batches}...")

        batch = spidra.batch.run_sync(BatchScrapeParams(
            urls=chunk,
            prompt=prompt,
            output="json",
        ))

        for item in batch.items:
            if item.status == "completed":
                all_results.append({"url": item.url, "data": item.result})
            else:
                print(f"  Failed: {item.url}")

    return all_results

urls = [f"https://example.com/product/{i}" for i in range(1, 201)]
results = scrape_url_list(urls, "Extract product name and price")

# One JSON object per line
with open("results.jsonl", "w") as f:
    for record in results:
        f.write(json.dumps(record) + "\n")

print(f"Saved {len(results)} results")
```

## Part 11: Crawling entire websites

Batch scraping works when you already have a list of URLs. Crawling is for when you want Spidra to discover pages for you.

You give it a starting URL, describe which pages to follow, and describe what to extract from each one. Spidra loads the base URL, finds links matching your crawl instruction, visits each one, and applies your transform instruction to every page it visits.

```python
import os
from spidra import SpidraClient, CrawlParams, PollOptions

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://competitor.com/blog",
        crawl_instruction="Follow links to blog posts only. Skip tag pages, category pages, and the homepage.",
        transform_instruction="Extract the post title, author name, publish date, and a one-sentence summary.",
        max_pages=20,
        use_proxy=True,
    ),
    PollOptions(timeout=360),
)

for page in job.result:
    print(page.url, page.data)
```

Three fields are required: `base_url`, `crawl_instruction`, and `transform_instruction`.

`crawl_instruction` tells the crawler which links to follow. `transform_instruction` tells the AI what to extract from each page it visits. `max_pages` defaults to 5 and goes up to 20. Pass a higher `timeout` in `PollOptions` for larger crawls, since the default 120 seconds may not be enough.

The same `use_proxy`, `proxy_country`, and `cookies` options from single scraping all work here too.
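To see why the timeout matters for longer jobs, here is a minimal sketch of the general poll-until-done pattern with an interval and a deadline. It illustrates the idea only and is not the SDK's code: `poll_until_done`, `fetch_status`, and the injectable `sleep` and `clock` parameters are hypothetical names, and the defaults simply mirror the 3-second interval and 120-second timeout quoted in this tutorial.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    poll_interval: float = 3.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status until it reports a terminal state or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        # Stop if the next poll would land after the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(poll_interval)


# Simulated job that finishes on the fourth poll. A fake clock is used so
# the example runs instantly instead of really sleeping.
states = iter(["waiting", "active", "active", "completed"])
now = [0.0]


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


result = poll_until_done(
    lambda: next(states),
    poll_interval=3,
    timeout=120,
    sleep=fake_sleep,
    clock=lambda: now[0],
)
print(result, "after", now[0], "simulated seconds")
```

A job that stays `active` longer than the timeout raises instead of returning, which is the situation a larger `timeout` is meant to avoid for big crawls.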
### Downloading the raw content

Once a crawl completes, you can fetch the raw HTML and Markdown for every page that was crawled. The URLs are signed and expire after an hour.

```python
response = spidra.crawl.pages_sync(job_id)
for page in response.pages:
    print(page.url, page.status)
    # page.html_url: download the raw HTML
    # page.markdown_url: download the cleaned Markdown
```

### Re-extracting with a different prompt

If you crawled a site and later want to pull out different information, you do not have to re-crawl. `extract` runs a new AI pass over the already-crawled content and only charges transformation credits.

```python
queued = spidra.crawl.extract_sync(
    completed_job_id,
    "Extract only product SKUs and prices as structured JSON",
)

# This creates a new job; check it like any other
result = spidra.crawl.get_sync(queued.job_id)
```

### Browsing crawl history

```python
from spidra import CrawlHistoryParams

response = spidra.crawl.history_sync(CrawlHistoryParams(page=1, limit=10))
print(f"Total crawl jobs: {response.total}")

stats = spidra.crawl.stats_sync()
print(f"All-time crawls: {stats.total}")
```

## Part 12: Logs and usage

### Browsing your scrape logs

Every request your API key makes is logged automatically. You can filter by status, URL, date range, and more.

```python
from spidra import ScrapeLogsParams

response = spidra.logs.list_sync(ScrapeLogsParams(
    status="failed",
    search_term="amazon.com",
    date_start="2025-01-01",
    date_end="2025-12-31",
    page=1,
    limit=20,
))

for log in response.logs:
    print(log.urls[0].get("url"), log.status, log.credits_used)
```

To get full details of a single log entry, including the extraction output:

```python
log = spidra.logs.get_sync(log_uuid)
print(log.result_data)
```

### Checking usage

Track your credit and request consumption over time:

```python
rows = spidra.usage.get_sync("30d")  # "7d" | "30d" | "weekly"
for row in rows:
    print(row.date, row.requests, row.credits)
```

`"7d"` gives one row per day for the last week. `"30d"` gives the last 30 days. `"weekly"` gives one row per week for the last seven weeks.
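If you want a single summary from those rows rather than a printout, a small aggregation is enough. This is a sketch, assuming each row exposes `date`, `requests`, and `credits` as in the usage loop. The `UsageRow` dataclass and the sample values are made up so the snippet runs without an API key, and `summarize` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class UsageRow:
    # Hypothetical stand-in for the rows returned by the usage call.
    date: str
    requests: int
    credits: int


def summarize(rows):
    """Total requests and credits, the costliest day, and credits per request."""
    total_requests = sum(row.requests for row in rows)
    total_credits = sum(row.credits for row in rows)
    busiest = max(rows, key=lambda row: row.credits) if rows else None
    return {
        "requests": total_requests,
        "credits": total_credits,
        "busiest_day": busiest.date if busiest else None,
        # Guard against division by zero on an idle account.
        "credits_per_request": total_credits / total_requests if total_requests else 0.0,
    }


# Made-up sample data for illustration only.
sample = [
    UsageRow("2025-06-01", requests=10, credits=25),
    UsageRow("2025-06-02", requests=4, credits=40),
    UsageRow("2025-06-03", requests=0, credits=0),
]
print(summarize(sample))
```

A rising credits-per-request figure is a quick hint that recent jobs are using more expensive features, which is worth checking against the logs.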
## Part 13: Error handling

Every API error maps to a typed exception class. Catch exactly what you care about and let everything else bubble up.

```python
from spidra import (
    SpidraError,
    SpidraAuthenticationError,
    SpidraInsufficientCreditsError,
    SpidraRateLimitError,
    SpidraServerError,
)

try:
    job = spidra.scrape.run_sync(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract the main headline",
    ))
    print(job.result.content)

except SpidraAuthenticationError:
    print("API key is missing or invalid. Check your SPIDRA_API_KEY.")
except SpidraInsufficientCreditsError:
    print("Account is out of credits. Top up at app.spidra.io.")
except SpidraRateLimitError:
    print("Rate limit hit. Wait before retrying.")
except SpidraServerError as e:
    print(f"Server error ({e.status}): {e.message}. Retry is usually safe.")
except SpidraError as e:
    # Base class last, so the specific handlers above get first pick
    print(f"API error {e.status}: {e.message}")
```

| Exception | HTTP status | When it fires |
|---|---|---|
| `SpidraAuthenticationError` | 401 | API key missing or invalid |
| `SpidraInsufficientCreditsError` | 403 | No credits remaining |
| `SpidraRateLimitError` | 429 | Too many requests |
| `SpidraServerError` | 500 | Unexpected error on Spidra's side |
| `SpidraError` | any | Base class for all Spidra exceptions |

All exceptions expose `.status` for the HTTP code and `.message` for a human-readable explanation.

Also check the `ai_extraction_failed` flag in the result.
If AI extraction fails for any reason, Spidra falls back to returning the raw page Markdown and sets this flag so your code can detect it:

```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))

if job.result.ai_extraction_failed:
    # AI extraction failed: the raw Markdown fallback is in the data array
    raw = job.result.data[0].markdown_content
    print("Extraction failed, falling back to raw content")
else:
    print(job.result.content)
```

## Putting it all together: a complete pipeline

Here is a full example that uses browser actions with `forEach` to collect job listings from a directory, enforces a schema on the output, handles errors, and saves results to JSONL:

```python
import os, json
from spidra import (
    SpidraClient,
    ScrapeParams,
    ScrapeUrl,
    BrowserAction,
    SpidraError,
    SpidraInsufficientCreditsError,
)

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

JOB_SCHEMA = {
    "type": "object",
    "required": ["title", "company", "location"],
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": ["string", "null"]},
        "remote": {"type": ["boolean", "null"]},
        "salary_min": {"type": ["number", "null"]},
        "salary_max": {"type": ["number", "null"]},
        "employment_type": {
            "type": ["string", "null"],
            "enum": ["full_time", "part_time", "contract", None],
        },
    },
}

def collect_listings(board_url: str) -> list:
    try:
        job = spidra.scrape.run_sync(ScrapeParams(
            urls=[
                ScrapeUrl(
                    url=board_url,
                    actions=[
                        BrowserAction(type="click", value="Accept cookies"),
                        BrowserAction(
                            type="forEach",
                            value="Find all job listing cards",
                            mode="navigate",
                            max_items=50,
                            item_prompt="Extract job title, company, location, remote status, salary range, and employment type",
                            pagination={"nextSelector": "a.next-page", "maxPages": 3},
                        ),
                    ],
                ),
            ],
            output="json",
            schema=JOB_SCHEMA,
        ))

        if job.result.ai_extraction_failed:
            print(f"Warning: AI extraction failed for {board_url}")
            return []

        # Normalise to a list so callers can always extend()
        content = job.result.content
        return content if isinstance(content, list) else [content]

    except SpidraInsufficientCreditsError:
        print("Out of credits. Stopping.")
        return []
    except SpidraError as e:
        print(f"Error scraping {board_url}: {e.message}")
        return []

boards = [
    "https://jobs.example.com/engineering",
    "https://careers.anothersite.com/remote",
]

all_jobs = []
for board in boards:
    print(f"Collecting from {board}...")
    listings = collect_listings(board)
    all_jobs.extend(listings)
    print(f"  Got {len(listings)} listings")

with open("jobs.jsonl", "w") as f:
    for job in all_jobs:
        f.write(json.dumps(job) + "\n")

print(f"\nDone. {len(all_jobs)} jobs saved to jobs.jsonl")
```

## All scrape parameters

For reference, here is the full list of parameters you can pass to `ScrapeParams`:

| Parameter | Type | Description |
|---|---|---|
| `urls` | list | Up to 3 `ScrapeUrl` objects. Each takes a `url` and optional `actions`. |
| `prompt` | str | What to extract, in plain English |
| `output` | str | `"markdown"` (default) or `"json"` |
| `schema` | dict | JSON Schema for a guaranteed output shape |
| `use_proxy` | bool | Route through a residential proxy |
| `proxy_country` | str | Two-letter country code or `"eu"` / `"global"` |
| `extract_content_only` | bool | Strip nav, ads, and boilerplate before AI extraction |
| `screenshot` | bool | Capture a viewport screenshot |
| `full_page_screenshot` | bool | Capture a full-page screenshot |
| `cookies` | str | Raw Cookie header string for authenticated pages |

## What to read next

If you want to go deeper on any part of the SDK:

- [Browser actions guide](https://docs.spidra.io/features/actions) covers every option for each action type, including all `forEach` parameters
- [Structured output guide](https://docs.spidra.io/features/structured-output) covers schemas in depth, including Pydantic integration and schema limits
- [Stealth mode guide](https://docs.spidra.io/features/stealth-mode) has the full country list and proxy options
- [Authenticated scraping guide](https://docs.spidra.io/features/authenticated-scraping) covers how to get cookies from your browser and the formats Spidra accepts

Get your API key at [app.spidra.io](https://app.spidra.io/). The free plan has 300 credits and no card required.