{"slug": "spidra-api-python-tutorial-scrape-any-website-with-python", "title": "Spidra API Python tutorial: scrape any website with Python", "summary": "Spidra released a Python SDK that allows developers to scrape any website — including those with JavaScript rendering, anti-bot protections, and CAPTCHAs — using a single package and plain English prompts. The SDK handles browser automation, anti-bot bypass, and AI extraction on Spidra's infrastructure, returning structured data without requiring users to manage proxies, stealth plugins, or selector maintenance. The tool is available now with a free API key from app.spidra.io.", "body_md": "Web scraping with Python has a well-worn path. You start with `requests` and BeautifulSoup for simple static pages. Then you hit a JavaScript-rendered site and reach for Playwright. Then you hit Cloudflare and spend two hours debugging stealth plugins. Then your selectors break because the site was redesigned.\n\nSpidra's Python SDK cuts across that whole progression. You install one package, describe what you want in plain English, and get back structured data from any website. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all happen on Spidra's infrastructure. You get clean results back.\n\nThis tutorial walks through the entire Python SDK from installation to crawling a full website. All code examples come directly from the SDK and will work as written.\n\n## Prerequisites\n\n- Python 3.9 or higher\n- A Spidra API key (get one free at [app.spidra.io](https://app.spidra.io/) under Settings → API Keys)\n\n## Installation\n\n```\npip install spidra\n```\n\nOnce installed, store your API key as an environment variable. Never hardcode it in your scripts.\n\n```\nexport SPIDRA_API_KEY=\"spd_YOUR_API_KEY\"\n```\n\n## Setting up the client\n\nEverything in the SDK flows through a single `SpidraClient` instance. 
You initialise it once and then access all functionality through its namespaced attributes.\n\n``` python\nfrom spidra import SpidraClient\n\nspidra = SpidraClient(api_key=\"spd_YOUR_API_KEY\")\n```\n\nIn practice, pull the key from your environment:\n\n``` python\nimport os\nfrom spidra import SpidraClient\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n```\n\nThe client exposes five namespaces:\n\n| Namespace | What it does |\n|---|---|\n| `spidra.scrape` | Scrape one to three URLs with browser automation and AI extraction |\n| `spidra.batch` | Process up to 50 URLs in parallel |\n| `spidra.crawl` | Discover and scrape pages across an entire site |\n| `spidra.logs` | Access the history of every scrape your API key has made |\n| `spidra.usage` | Check credit and request consumption |\n\n## Async by default, sync anywhere\n\nThe SDK is async-first. Every method is an `async` function that you `await` inside an async context.\n\n``` python\nimport asyncio\nimport os\nfrom spidra import SpidraClient, ScrapeParams, ScrapeUrl\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\nasync def main():\n    job = await spidra.scrape.run(ScrapeParams(\n        urls=[ScrapeUrl(url=\"https://news.ycombinator.com\")],\n        prompt=\"Extract the top 5 post titles and their point scores\",\n        output=\"json\",\n    ))\n    print(job.result.content)\n\nasyncio.run(main())\n```\n\nIf you are working in a regular script, a Django view, a Flask route, or a Jupyter notebook, use the `_sync` counterpart instead. 
It handles the event loop automatically, including environments like Jupyter where calling `asyncio.run()` directly would fail.\n\n``` python\nfrom spidra import SpidraClient, ScrapeParams, ScrapeUrl\nimport os\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\n# Works anywhere without async/await\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://news.ycombinator.com\")],\n    prompt=\"Extract the top 5 post titles and their point scores\",\n    output=\"json\",\n))\n\nprint(job.result.content)\n```\n\nEvery method in the SDK has both versions. The rest of this tutorial uses `_sync` in the examples for simplicity, but the async versions work identically: drop the `_sync` suffix and add `await`.\n\n## Part 1: Scraping a page\n\nThe `scrape` namespace handles single-page scraping. You can pass up to three URLs per request and they run in parallel.\n\n### Your first scrape\n\n``` python\nfrom spidra import SpidraClient, ScrapeParams, ScrapeUrl\nimport os\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://news.ycombinator.com\")],\n))\n\nprint(job.result.content)\n```\n\nWithout a `prompt`, Spidra returns the raw page content as Markdown. The page loads in a real browser, JavaScript executes, and the full rendered content is converted to clean Markdown. That is what ends up in `job.result.content`.\n\n### How the job lifecycle works\n\nWhen you call `run_sync()`, the SDK submits the job, then polls in the background every 3 seconds until it is done. From your side it looks synchronous. Under the hood, the job moves through these states:\n\n```\nwaiting → active → completed (or failed)\n```\n\n`waiting` means the job is queued. `active` means the browser is running. `completed` means the result is ready. 
`failed` means something went wrong.\n\nIf you want to submit a job and check on it later rather than waiting for it to finish, use `submit()` and `get()` separately:\n\n``` python\nfrom spidra import SpidraClient, ScrapeParams, ScrapeUrl\nimport os, time\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\n# Submit and get a job ID immediately\nqueued = spidra.scrape.submit_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://example.com\")],\n    prompt=\"Extract the main headline\",\n))\n\nprint(f\"Job submitted: {queued.job_id}\")\n\n# Come back later and check\ntime.sleep(5)\nstatus = spidra.scrape.get_sync(queued.job_id)\n\nif status.status == \"completed\":\n    print(status.result.content)\nelif status.status == \"failed\":\n    print(f\"Failed: {status.error}\")\n```\n\n## Part 2: Extracting data with prompts\n\nThe `prompt` field is what makes Spidra different from a plain headless browser scraper. Instead of writing CSS selectors to find elements, you describe what you want in plain English and the AI figures out where it is on the page.\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://news.ycombinator.com\")],\n    prompt=\"Extract the top 10 post titles and their point scores\",\n    output=\"json\",\n))\n\nprint(job.result.content)\n# [{\"title\": \"Show HN: I built a thing\", \"points\": 342}, ...]\n```\n\nSetting `output=\"json\"` tells the AI to return structured JSON rather than formatted text. The default is `\"markdown\"`.\n\nThe AI reads the rendered page the way a person would. It knows a number next to a currency symbol is a price, that a short bold line at the top of a product page is probably the title, and that a longer block of text is probably a description. You do not need to know the class names or DOM structure of the page.\n\nThat said, Spidra also fully supports CSS selectors and XPath for browser actions if you prefer to be explicit about where to find things. 
We will cover that in the browser actions section.\n\n## Part 3: Enforcing output shape with JSON schema\n\nPlain prompts are flexible but not predictable. The AI decides what fields to return and what to name them. That works for exploration, but it is a problem in production where a database or downstream service expects a specific shape every time.\n\nThe `schema` field solves this. Pass a JSON Schema object and the AI must return data matching it exactly. Fields marked as `required` always appear in the output. If the page does not have a value for a required field, it comes back as `None` rather than being silently omitted.\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://jobs.example.com/senior-engineer\")],\n    prompt=\"Extract the job listing details. Normalize salary to a USD number.\",\n    output=\"json\",\n    schema={\n        \"type\": \"object\",\n        \"required\": [\"title\", \"company\", \"remote\"],\n        \"properties\": {\n            \"title\":           {\"type\": \"string\"},\n            \"company\":         {\"type\": \"string\"},\n            \"remote\":          {\"type\": [\"boolean\", \"null\"]},\n            \"salary_min\":      {\"type\": [\"number\", \"null\"]},\n            \"salary_max\":      {\"type\": [\"number\", \"null\"]},\n            \"employment_type\": {\n                \"type\": [\"string\", \"null\"],\n                \"enum\": [\"full_time\", \"part_time\", \"contract\", None]\n            },\n            \"skills\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}},\n        },\n    },\n))\n\nprint(job.result.content)\n# {\n#   \"title\": \"Senior Software Engineer\",\n#   \"company\": \"Acme Corp\",\n#   \"remote\": True,\n#   \"salary_min\": 120000,\n#   \"salary_max\": 160000,\n#   \"employment_type\": \"full_time\",\n#   \"skills\": [\"Python\", \"PostgreSQL\", \"AWS\"]\n# }\n```\n\nWhen you provide a `schema`, `output` is automatically set to 
`\"json\"`. You do not need to set it yourself.\n\nIf you use Pydantic for data validation in your application, you can generate the schema from your existing models rather than writing it by hand:\n\n``` python\nfrom pydantic import BaseModel\nfrom typing import Optional\n\nclass JobListing(BaseModel):\n    title: str\n    company: str\n    remote: Optional[bool] = None\n    salary_min: Optional[float] = None\n    salary_max: Optional[float] = None\n\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://jobs.example.com/senior-engineer\")],\n    prompt=\"Extract the job listing details\",\n    schema=JobListing.model_json_schema(),\n))\n```\n\nOne schema definition in your codebase. Works in your application logic and in your scraping requests.\n\n## Part 4: Browser actions\n\nSome pages require interaction before the content you want is visible. A cookie banner blocking everything. A search form that needs filling. Lazy-loaded content that only appears after scrolling. Tabs that hide data until clicked.\n\nThe `actions` list inside each `ScrapeUrl` lets you interact with the page before extraction runs. 
Actions execute in order inside the browser.\n\n``` python\nfrom spidra import BrowserAction\n\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[\n        ScrapeUrl(\n            url=\"https://example.com/products\",\n            actions=[\n                BrowserAction(type=\"click\", selector=\"#accept-cookies\"),\n                BrowserAction(type=\"wait\", duration=1000),\n                BrowserAction(type=\"scroll\", to=\"80%\"),\n            ],\n        ),\n    ],\n    prompt=\"Extract all product names and prices visible on the page\",\n))\n```\n\nFor `click`, `check`, and `uncheck` actions, you have two options for targeting elements:\n\n- `selector` for a CSS selector or XPath expression like `\"#accept-cookies\"` or `\".submit-btn\"`\n- `value` for a plain English description like `\"Accept cookies button\"`, which Spidra uses to locate the element with AI\n\nBoth are valid and you can mix them in the same actions list:\n\n```\nactions=[\n    BrowserAction(type=\"click\", selector=\"#accept-cookies\"),  # CSS selector\n    BrowserAction(type=\"click\", value=\"Search button\"),        # plain English\n]\n```\n\nUse whichever is more convenient for the page you are working with.\n\n### All available actions\n\n| Action | What it does | Key fields |\n|---|---|---|\n| `click` | Clicks a button, link, or any element | `selector` or `value` |\n| `type` | Types text into an input field | `selector`, `value` |\n| `check` | Checks a checkbox | `selector` or `value` |\n| `uncheck` | Unchecks a checkbox | `selector` or `value` |\n| `wait` | Pauses for a number of milliseconds | `duration` |\n| `scroll` | Scrolls to a percentage of the page height | `to` (e.g. `\"80%\"`) |\n| `forEach` | Finds matching elements and processes each one | `value`, `mode` |\n\n### The forEach action\n\n`forEach` is the most powerful action in the SDK. 
It finds a set of matching elements on the page and processes each one individually, then combines all the results into a single output.\n\nIt works in three modes:\n\n- `inline` reads the content of each matched element directly. Use this for product cards, table rows, or any content that lives inside the element.\n- `navigate` follows each element as a link, loads the destination page, and scrapes it. Use this when the data you want is on detail pages you need to click into.\n- `click` clicks each element to expand or reveal content, then scrapes what appears. Use this for accordions, modals, or expandable sections.\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[\n        ScrapeUrl(\n            url=\"https://directory.example.com/companies\",\n            actions=[\n                BrowserAction(type=\"click\", value=\"Accept cookies\"),\n                BrowserAction(\n                    type=\"forEach\",\n                    value=\"Find all company listing cards\",\n                    mode=\"navigate\",\n                    max_items=20,\n                    item_prompt=\"Extract company name, website, and industry\",\n                    pagination={\n                        \"nextSelector\": \"a.next-page\",\n                        \"maxPages\": 3\n                    }\n                ),\n            ],\n        ),\n    ],\n    output=\"json\",\n))\n```\n\nThis dismisses the cookie banner, finds every company card on the page, navigates into each company's profile page, extracts the company details, and repeats across three pages of pagination. All in a single request.\n\n## Part 5: Proxy and geo-targeting\n\nSome sites block requests from cloud infrastructure IP ranges. Others show different content depending on where you are browsing from. 
Setting `use_proxy=True` routes the request through a residential proxy.\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://www.amazon.de/gp/bestsellers\")],\n    prompt=\"List the top 10 products with name and price\",\n    use_proxy=True,\n    proxy_country=\"de\",\n))\n```\n\n`proxy_country` accepts:\n\n- A two-letter ISO country code like `\"us\"`, `\"de\"`, `\"gb\"`, `\"fr\"`, `\"jp\"`\n- `\"eu\"` to rotate randomly across all 27 EU member states\n- `\"global\"`, or omit it, for no country preference\n\nProxy usage is billed from your bandwidth quota, not your credits. There is no credit multiplier for enabling proxy routing.\n\n## Part 6: Scraping pages behind a login\n\nTo access content that requires authentication, pass your session cookies as a raw cookie header string. Log in through your browser, open DevTools, copy the `Cookie` header from any authenticated request, and pass it here.\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://app.example.com/dashboard\")],\n    prompt=\"Extract the monthly revenue and active user count\",\n    cookies=\"session=abc123; auth_token=xyz789\",\n))\n```\n\nBoth standard cookie format (`name=value; name2=value2`) and Chrome DevTools paste format work. Cookies are passed ephemerally to the browser worker and never stored by Spidra.\n\n## Part 7: Stripping boilerplate with extract_content_only\n\nBy default Spidra returns the full page content including navigation, headers, footers, and sidebars. If you only want the main content, turn on `extract_content_only`. 
It strips the noise before the AI sees the page, which reduces token usage and keeps the result focused.\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://blog.example.com/long-article\")],\n    prompt=\"Summarize this article in three sentences\",\n    extract_content_only=True,\n))\n```\n\nThis is particularly useful for article pages, documentation, and any page where the main content is surrounded by heavy navigation.\n\n## Part 8: Screenshots\n\nCapture screenshots of scraped pages for debugging, monitoring, or archival.\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://example.com\")],\n    screenshot=True,\n    full_page_screenshot=True,\n))\n\n# Screenshot URLs are in the result\nprint(job.result.screenshots)  # list of URLs\n```\n\n`screenshot=True` captures the visible viewport. `full_page_screenshot=True` captures the entire scrollable page.\n\n## Part 9: Controlling polling behaviour\n\nBy default `run_sync()` polls every 3 seconds and gives up after 120 seconds. For complex pages or large crawls that take longer, pass a `PollOptions` object to override both.\n\n``` python\nfrom spidra import PollOptions\n\njob = spidra.scrape.run_sync(\n    ScrapeParams(\n        urls=[ScrapeUrl(url=\"https://example.com\")],\n        prompt=\"Extract all content from this page\",\n    ),\n    PollOptions(poll_interval=5, timeout=180),\n)\n```\n\n`PollOptions` works on `batch.run_sync()` and `crawl.run_sync()` too.\n\n## Part 10: Batch scraping\n\nWhen you have a list of URLs to process, the batch endpoint handles up to 50 at a time in parallel. Each URL runs in its own independent worker.\n\nNote that batch URLs are plain strings, not `ScrapeUrl` objects. 
Per-URL browser actions are not supported in batch mode.\n\n``` python\nfrom spidra import SpidraClient, BatchScrapeParams\nimport os\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\nbatch = spidra.batch.run_sync(BatchScrapeParams(\n    urls=[\n        \"https://shop.example.com/product/1\",\n        \"https://shop.example.com/product/2\",\n        \"https://shop.example.com/product/3\",\n    ],\n    prompt=\"Extract product name, price, and whether it is in stock\",\n    output=\"json\",\n))\n\nprint(f\"{batch.completed_count}/{batch.total_urls} completed\")\n\nfor item in batch.items:\n    if item.status == \"completed\":\n        print(item.url, item.result)\n    else:\n        print(f\"Failed: {item.url} — {item.error}\")\n```\n\n### Batch with schema\n\nThe same schema enforcement that works in single scraping works in batch. Every item returns data matching the same shape:\n\n```\nbatch = spidra.batch.run_sync(BatchScrapeParams(\n    urls=urls,\n    prompt=\"Extract the product details\",\n    schema={\n        \"type\": \"object\",\n        \"required\": [\"name\", \"price\"],\n        \"properties\": {\n            \"name\":      {\"type\": \"string\"},\n            \"price\":     {\"type\": [\"number\", \"null\"]},\n            \"currency\":  {\"type\": [\"string\", \"null\"]},\n            \"available\": {\"type\": [\"boolean\", \"null\"]}\n        }\n    }\n))\n```\n\n### Managing batches\n\nOnce a batch is running, you have a few additional operations available:\n\n**Retrying failures.** If some items fail due to transient errors, retry just those without re-running the ones that already succeeded:\n\n```\nif batch.failed_count > 0:\n    spidra.batch.retry_sync(queued.batch_id)\n```\n\n**Cancelling a batch.** Stop a running batch and get credits refunded for anything that has not started yet:\n\n```\nresponse = spidra.batch.cancel_sync(batch_id)\nprint(f\"Cancelled {response.cancelled_items} items, refunded {response.credits_refunded} 
credits\")\n```\n\n**Listing past batches:**\n\n``` python\nfrom spidra import BatchListParams\n\npage = spidra.batch.list_sync(BatchListParams(page=1, limit=20))\n\nfor job in page.jobs:\n    print(job.uuid, job.status, f\"{job.completed_count}/{job.total_urls}\")\n```\n\n### Processing large URL lists\n\nThe batch endpoint caps at 50 URLs per request. For larger lists, chunk them and process in batches:\n\n``` python\nimport os, json\nfrom spidra import SpidraClient, BatchScrapeParams\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\ndef scrape_url_list(urls: list[str], prompt: str, batch_size: int = 50) -> list:\n    all_results = []\n\n    for i in range(0, len(urls), batch_size):\n        chunk = urls[i:i + batch_size]\n        print(f\"Processing batch {i // batch_size + 1} of {-(-len(urls) // batch_size)}...\")\n\n        batch = spidra.batch.run_sync(BatchScrapeParams(\n            urls=chunk,\n            prompt=prompt,\n            output=\"json\",\n        ))\n\n        for item in batch.items:\n            if item.status == \"completed\":\n                all_results.append({\n                    \"url\": item.url,\n                    \"data\": item.result\n                })\n            else:\n                print(f\"  Failed: {item.url}\")\n\n    return all_results\n\nurls = [f\"https://example.com/product/{i}\" for i in range(1, 201)]\nresults = scrape_url_list(urls, \"Extract product name and price\")\n\nwith open(\"results.jsonl\", \"w\") as f:\n    for record in results:\n        f.write(json.dumps(record) + \"\\n\")\n\nprint(f\"Saved {len(results)} results\")\n```\n\n## Part 11: Crawling entire websites\n\nBatch scraping works when you already have a list of URLs. Crawling is for when you want Spidra to discover pages for you.\n\nYou give it a starting URL, describe which pages to follow, and describe what to extract from each one. 
Spidra loads the base URL, finds links matching your crawl instruction, visits each one, and applies your transform instruction to every page it visits.\n\n``` python\nfrom spidra import SpidraClient, CrawlParams, PollOptions\nimport os\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\njob = spidra.crawl.run_sync(\n    CrawlParams(\n        base_url=\"https://competitor.com/blog\",\n        crawl_instruction=\"Follow links to blog posts only. Skip tag pages, category pages, and the homepage.\",\n        transform_instruction=\"Extract the post title, author name, publish date, and a one-sentence summary.\",\n        max_pages=30,\n        use_proxy=True,\n    ),\n    PollOptions(timeout=360),\n)\n\nfor page in job.result:\n    print(page.url, page.data)\n```\n\nThree fields are required: `base_url`, `crawl_instruction`, and `transform_instruction`.\n\n`crawl_instruction` tells the crawler which links to follow. `transform_instruction` tells the AI what to extract from each page it visits. `max_pages` defaults to 5 and goes up to 20. Pass a higher `timeout` in `PollOptions` for larger crawls since the default 120 seconds may not be enough.\n\nThe same `use_proxy`, `proxy_country`, and `cookies` options from single scraping all work here too.\n\n### Downloading the raw content\n\nOnce a crawl completes, you can fetch the raw HTML and Markdown for every page that was crawled. The URLs are signed and expire after an hour.\n\n```\nresponse = spidra.crawl.pages_sync(job_id)\n\nfor page in response.pages:\n    print(page.url, page.status)\n    # page.html_url: download the raw HTML\n    # page.markdown_url: download the cleaned Markdown\n```\n\n### Re-extracting with a different prompt\n\nIf you crawled a site and later want to pull out different information, you do not have to re-crawl. 
`extract()` runs a new AI pass over the already-crawled content and only charges transformation credits.\n\n```\nqueued = spidra.crawl.extract_sync(\n    completed_job_id,\n    \"Extract only product SKUs and prices as structured JSON\",\n)\n\n# This creates a new job, so check it like any other\nresult = spidra.crawl.get_sync(queued.job_id)\n```\n\n### Browsing crawl history\n\n``` python\nfrom spidra import CrawlHistoryParams\n\nresponse = spidra.crawl.history_sync(CrawlHistoryParams(page=1, limit=10))\nprint(f\"Total crawl jobs: {response.total}\")\n\nstats = spidra.crawl.stats_sync()\nprint(f\"All-time crawls: {stats.total}\")\n```\n\n## Part 12: Logs and usage\n\n### Browsing your scrape logs\n\nEvery request your API key makes is logged automatically. You can filter by status, URL, date range, and more.\n\n``` python\nfrom spidra import ScrapeLogsParams\n\nresponse = spidra.logs.list_sync(ScrapeLogsParams(\n    status=\"failed\",\n    search_term=\"amazon.com\",\n    date_start=\"2025-01-01\",\n    date_end=\"2025-12-31\",\n    page=1,\n    limit=20,\n))\n\nfor log in response.logs:\n    print(log.urls[0].get(\"url\"), log.status, log.credits_used)\n```\n\nTo get full details of a single log entry including the extraction output:\n\n```\nlog = spidra.logs.get_sync(log_uuid)\nprint(log.result_data)\n```\n\n### Checking usage\n\nTrack your credit and request consumption over time:\n\n```\nrows = spidra.usage.get_sync(\"30d\")  # \"7d\" | \"30d\" | \"weekly\"\n\nfor row in rows:\n    print(row.date, row.requests, row.credits)\n```\n\n`\"7d\"` gives one row per day for the last week. `\"30d\"` gives the last 30 days. `\"weekly\"` gives one row per week for the last seven weeks.\n\n## Part 13: Error handling\n\nEvery API error maps to a typed exception class. 
Catch exactly what you care about and let everything else bubble up.\n\n``` python\nfrom spidra import (\n    SpidraError,\n    SpidraAuthenticationError,\n    SpidraInsufficientCreditsError,\n    SpidraRateLimitError,\n    SpidraServerError,\n)\n\ntry:\n    job = spidra.scrape.run_sync(ScrapeParams(\n        urls=[ScrapeUrl(url=\"https://example.com\")],\n        prompt=\"Extract the main headline\",\n    ))\n    print(job.result.content)\n\nexcept SpidraAuthenticationError:\n    print(\"API key is missing or invalid. Check your SPIDRA_API_KEY.\")\n\nexcept SpidraInsufficientCreditsError:\n    print(\"Account is out of credits. Top up at app.spidra.io.\")\n\nexcept SpidraRateLimitError:\n    print(\"Rate limit hit. Wait before retrying.\")\n\nexcept SpidraServerError as e:\n    print(f\"Server error ({e.status}): {e.message}. Retry is usually safe.\")\n\nexcept SpidraError as e:\n    print(f\"API error {e.status}: {e.message}\")\n```\n\n| Exception | HTTP status | When it fires |\n|---|---|---|\n| `SpidraAuthenticationError` | 401 | API key missing or invalid |\n| `SpidraInsufficientCreditsError` | 403 | No credits remaining |\n| `SpidraRateLimitError` | 429 | Too many requests |\n| `SpidraServerError` | 500 | Unexpected error on Spidra's side |\n| `SpidraError` | any | Base class for all Spidra exceptions |\n\nAll exceptions expose `.status` for the HTTP code and `.message` for a human-readable explanation.\n\nAlso check the `ai_extraction_failed` flag in the result. 
If AI extraction fails for any reason, Spidra falls back to returning the raw page Markdown and sets this flag so your code can detect it:\n\n```\njob = spidra.scrape.run_sync(ScrapeParams(\n    urls=[ScrapeUrl(url=\"https://example.com\")],\n    prompt=\"Extract the main headline\",\n))\n\nif job.result.ai_extraction_failed:\n    # AI extraction failed — raw Markdown fallback is in the data array\n    raw = job.result.data[0].markdown_content\n    print(\"Extraction failed, falling back to raw content\")\nelse:\n    print(job.result.content)\n```\n\n## Putting it all together: a complete pipeline\n\nHere is a full example that uses browser actions with `forEach`\n\nto collect job listings from a directory, enforces a schema on the output, handles errors properly, and saves results to JSONL:\n\n``` python\nimport os, json\nfrom spidra import (\n    SpidraClient,\n    ScrapeParams,\n    ScrapeUrl,\n    BrowserAction,\n    SpidraError,\n    SpidraInsufficientCreditsError,\n)\n\nspidra = SpidraClient(api_key=os.environ[\"SPIDRA_API_KEY\"])\n\nJOB_SCHEMA = {\n    \"type\": \"object\",\n    \"required\": [\"title\", \"company\", \"location\"],\n    \"properties\": {\n        \"title\":           {\"type\": \"string\"},\n        \"company\":         {\"type\": \"string\"},\n        \"location\":        {\"type\": [\"string\", \"null\"]},\n        \"remote\":          {\"type\": [\"boolean\", \"null\"]},\n        \"salary_min\":      {\"type\": [\"number\", \"null\"]},\n        \"salary_max\":      {\"type\": [\"number\", \"null\"]},\n        \"employment_type\": {\n            \"type\": [\"string\", \"null\"],\n            \"enum\": [\"full_time\", \"part_time\", \"contract\", None]\n        },\n    },\n}\n\ndef collect_listings(board_url: str) -> list:\n    try:\n        job = spidra.scrape.run_sync(ScrapeParams(\n            urls=[\n                ScrapeUrl(\n                    url=board_url,\n                    actions=[\n                        
BrowserAction(type=\"click\", value=\"Accept cookies\"),\n                        BrowserAction(\n                            type=\"forEach\",\n                            value=\"Find all job listing cards\",\n                            mode=\"navigate\",\n                            max_items=50,\n                            item_prompt=\"Extract job title, company, location, remote status, salary range, and employment type\",\n                            pagination={\n                                \"nextSelector\": \"a.next-page\",\n                                \"maxPages\": 3\n                            }\n                        ),\n                    ],\n                )\n            ],\n            output=\"json\",\n            schema=JOB_SCHEMA,\n        ))\n\n        if job.result.ai_extraction_failed:\n            print(f\"Warning: AI extraction failed for {board_url}\")\n            return []\n\n        content = job.result.content\n        return content if isinstance(content, list) else [content]\n\n    except SpidraInsufficientCreditsError:\n        print(\"Out of credits. Stopping.\")\n        return []\n    except SpidraError as e:\n        print(f\"Error scraping {board_url}: {e.message}\")\n        return []\n\nboards = [\n    \"https://jobs.example.com/engineering\",\n    \"https://careers.anothersite.com/remote\",\n]\n\nall_jobs = []\nfor board in boards:\n    print(f\"Collecting from {board}...\")\n    listings = collect_listings(board)\n    all_jobs.extend(listings)\n    print(f\"  Got {len(listings)} listings\")\n\nwith open(\"jobs.jsonl\", \"w\") as f:\n    for job in all_jobs:\n        f.write(json.dumps(job) + \"\\n\")\n\nprint(f\"\\nDone. {len(all_jobs)} jobs saved to jobs.jsonl\")\n```\n\n## All scrape parameters\n\nFor reference, here is the full list of parameters you can pass to `ScrapeParams`\n\n:\n\n| Parameter | Type | Description |\n|---|---|---|\n`urls` | list | Up to 3 `ScrapeUrl` objects. 
Each takes a `url` and optional `actions`. |\n`prompt` | str | What to extract, in plain English |\n`output` | str | `\"markdown\"` (default) or `\"json\"` |\n`schema` | dict | JSON Schema for a guaranteed output shape |\n`use_proxy` | bool | Route through a residential proxy |\n`proxy_country` | str | Two-letter country code or `\"eu\"` / `\"global\"` |\n`extract_content_only` | bool | Strip nav, ads, and boilerplate before AI extraction |\n`screenshot` | bool | Capture a viewport screenshot |\n`full_page_screenshot` | bool | Capture a full-page screenshot |\n`cookies` | str | Raw `Cookie` header string for authenticated pages |\n\n## What to read next\n\nIf you want to go deeper on any part of the SDK:\n\n- [Browser actions guide](https://docs.spidra.io/features/actions) covers every option for each action type, including all `forEach` parameters\n- [Structured output guide](https://docs.spidra.io/features/structured-output) covers schemas in depth, including Pydantic integration and schema limits\n- [Stealth mode guide](https://docs.spidra.io/features/stealth-mode) has the full country list and proxy options\n- [Authenticated scraping guide](https://docs.spidra.io/features/authenticated-scraping) covers how to get cookies from your browser and the formats Spidra accepts\n\nGet your API key at [app.spidra.io](https://app.spidra.io/). 
The free plan has 300 credits and no card required.", "url": "https://wpnews.pro/news/spidra-api-python-tutorial-scrape-any-website-with-python", "canonical_source": "https://spidra.io/blog/spidra-api-python-tutorial", "published_at": "2026-06-10 00:00:00+00:00", "updated_at": "2026-06-11 18:46:05.735658+00:00", "lang": "en", "topics": ["ai-tools", "ai-infrastructure", "ai-products"], "entities": ["Spidra", "Python", "Playwright", "Cloudflare", "BeautifulSoup"], "alternates": {"html": "https://wpnews.pro/news/spidra-api-python-tutorial-scrape-any-website-with-python", "markdown": "https://wpnews.pro/news/spidra-api-python-tutorial-scrape-any-website-with-python.md", "text": "https://wpnews.pro/news/spidra-api-python-tutorial-scrape-any-website-with-python.txt", "jsonld": "https://wpnews.pro/news/spidra-api-python-tutorial-scrape-any-website-with-python.jsonld"}}