{"slug": "spidra-api-tutorial-complete-guide-to-web-scraping-with-the-spidra-api", "title": "Spidra API tutorial: complete guide to web scraping with the Spidra API", "summary": "Spidra released a new API that allows developers to scrape websites by sending a URL and receiving structured data, eliminating the need for custom selectors, headless browsers, or anti-bot workarounds. The REST API handles browser rendering, CAPTCHA solving, proxy rotation, and AI extraction on its servers, with jobs running asynchronously and returning results after polling. The free plan includes 300 credits with no credit card required, and authentication requires only an API key in the request header.", "body_md": "Getting data from websites programmatically has always involved more work than it should. You write selectors, they break when the site updates. You try a headless browser, anti-bot protection blocks you. You get the data, but it is raw HTML and you still have to parse it into something useful.\n\nThe [Spidra API](https://spidra.io/products/spidra-api) is designed to solve all three of those problems in one place. You send a URL, describe what you want, and get back structured data. The browser rendering, CAPTCHA solving, proxy rotation, and AI extraction all happen on Spidra's side.\n\nThis guide walks through the entire API from authentication to crawling. By the end you will know how every endpoint works, what the response structure looks like, and how to build a real scraping pipeline around it.\n\n## Before you start\n\nYou need a Spidra account and an API key.\n\nSign up at [spidra.io](https://spidra.io/). The free plan includes 300 credits with no credit card required. Once you are in, go to **app.spidra.io → Settings → API Keys** and create a key.\n\nKeep it somewhere safe. 
Every request you make to the API includes this key in the header.\n\n## How the API works\n\nThe Spidra API is a REST API with one base URL:\n\n```\nhttps://api.spidra.io/api\n```\n\nEvery request is authenticated by including your API key in the `x-api-key` header. There are no bearer tokens, no OAuth flows, just a header on every request.\n\n```\ncurl -X POST https://api.spidra.io/api/scrape \\\n  -H \"x-api-key: YOUR_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"urls\": [{\"url\": \"https://example.com\"}]}'\n```\n\nOne important thing to understand before you make your first request: **Spidra jobs are asynchronous**. When you submit a scrape, you do not get the data back immediately. You get a job ID. You then poll a status endpoint every few seconds until the job is complete and the data is ready.\n\nThis is by design. Browser rendering, CAPTCHA solving, and AI extraction take a few seconds. The async pattern means you are not holding a connection open the whole time.\n\nThe flow for every job type looks like this:\n\n- Submit the job. Receive a job ID in the response.\n- Poll the status endpoint every 2 to 5 seconds.\n- When `status` is `completed`, read your results.\n\nNow let us go through each part of the API.\n\n## Authentication\n\nEvery request needs the `x-api-key` header. That is it.\n\n```\n-H \"x-api-key: YOUR_API_KEY\"\n```\n\nIf the key is missing or invalid, the API returns a `401`. If your credits are exhausted, you get a `403`.\n\nHere is the full set of response codes you will encounter:\n\n| Code | What it means |\n|---|---|\n| `200` | Request completed successfully |\n| `202` | Job queued successfully. Poll for results. |\n| `400` | Bad request. Missing or invalid parameters. |\n| `401` | API key missing, invalid, or expired |\n| `403` | Credits exhausted or plan limit reached |\n| `404` | Job or resource not found |\n| `422` | Malformed `schema`, rejected before the job is queued |\n| `429` | Rate limit hit. Back off and retry. 
|\n| `500` | Something went wrong on Spidra's side |\n\nAll errors come back in the same format:\n\n```\n{\n  \"status\": \"error\",\n  \"message\": \"Detailed explanation of what went wrong\"\n}\n```\n\n## Scraping a single page\n\nThe scrape endpoint is where most people start. You give it one to three URLs and it returns the content of each one, either as Markdown or as AI-extracted data.\n\n**Endpoint:** `POST /api/scrape`\n\n### The minimal request\n\nThe only required field is `urls`, which takes an array of URL objects. Each URL object requires a `url` field and optionally takes an `actions` array for browser interactions.\n\n```\ncurl -X POST https://api.spidra.io/api/scrape \\\n  -H \"x-api-key: YOUR_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"urls\": [{\"url\": \"https://news.ycombinator.com\"}]\n  }'\n```\n\nResponse:\n\n```\n{\n  \"status\": \"queued\",\n  \"jobId\": \"550e8400-e29b-41d4-a716-446655440000\",\n  \"message\": \"Scrape job has been queued. Poll /api/scrape/550e8400... to get the result.\"\n}\n```\n\nSave that `jobId`. 
You need it to check on the job.\n\n### Polling for results\n\nCall `GET /api/scrape/{jobId}` every few seconds until the status changes.\n\n```\ncurl https://api.spidra.io/api/scrape/550e8400-e29b-41d4-a716-446655440000 \\\n  -H \"x-api-key: YOUR_API_KEY\"\n```\n\nWhile the job is running, you will see something like this:\n\n```\n{\n  \"status\": \"active\",\n  \"progress\": {\n    \"message\": \"Processing content with AI...\",\n    \"progress\": 0.6\n  },\n  \"result\": null,\n  \"error\": null\n}\n```\n\nThe nested `progress.progress` value goes from 0 to 1 as the job moves through its stages: loading the browser, executing actions, solving CAPTCHAs, running AI extraction.\n\nWhen it finishes:\n\n```\n{\n  \"status\": \"completed\",\n  \"progress\": {\n    \"message\": \"Scrape completed successfully\",\n    \"progress\": 1\n  },\n  \"result\": {\n    \"content\": \"...\",\n    \"data\": [\n      {\n        \"url\": \"https://news.ycombinator.com\",\n        \"title\": \"Hacker News\",\n        \"markdownContent\": \"...\",\n        \"success\": true,\n        \"screenshotUrl\": null\n      }\n    ],\n    \"screenshots\": [],\n    \"ai_extraction_failed\": false,\n    \"stats\": {\n      \"durationMs\": 4200,\n      \"captchaSolvedCount\": 0,\n      \"inputTokens\": 312,\n      \"outputTokens\": 84,\n      \"totalTokens\": 396\n    }\n  },\n  \"error\": null\n}\n```\n\nThe `result.content` field is the main output. What it contains depends on what you asked for:\n\n- If you passed a `prompt`, `content` is the AI-extracted result\n- If you did not pass a `prompt`, `content` is the raw page content as Markdown\n\n`result.data` is an array with one entry per URL. 
Each entry has the page title, the full Markdown content for that URL, whether it succeeded, and a screenshot URL if you requested one.\n\n`result.stats` tells you how long the job took, how many CAPTCHAs were solved, and how many tokens the AI extraction used.\n\n### A polling loop in Python\n\n``` python\nimport requests\nimport time\n\nAPI_KEY = \"YOUR_API_KEY\"\nBASE_URL = \"https://api.spidra.io/api\"\nHEADERS = {\"x-api-key\": API_KEY, \"Content-Type\": \"application/json\"}\n\ndef scrape(url):\n    # Submit the job\n    response = requests.post(\n        f\"{BASE_URL}/scrape\",\n        headers=HEADERS,\n        json={\"urls\": [{\"url\": url}]}\n    )\n    response.raise_for_status()\n    job_id = response.json()[\"jobId\"]\n\n    # Poll until complete\n    while True:\n        status_response = requests.get(\n            f\"{BASE_URL}/scrape/{job_id}\",\n            headers=HEADERS\n        )\n        # Surface HTTP errors (401, 429, 500) instead of parsing an error body\n        status_response.raise_for_status()\n        data = status_response.json()\n\n        if data[\"status\"] == \"completed\":\n            return data[\"result\"]\n        elif data[\"status\"] == \"failed\":\n            raise RuntimeError(f\"Scrape failed: {data['error']}\")\n\n        time.sleep(3)\n\nresult = scrape(\"https://news.ycombinator.com\")\nprint(result[\"content\"])\n```\n\nThe same in Node.js:\n\n``` js\nconst API_KEY = \"YOUR_API_KEY\";\nconst BASE_URL = \"https://api.spidra.io/api\";\nconst HEADERS = {\n  \"x-api-key\": API_KEY,\n  \"Content-Type\": \"application/json\"\n};\n\nasync function scrape(url) {\n  const submitRes = await fetch(`${BASE_URL}/scrape`, {\n    method: \"POST\",\n    headers: HEADERS,\n    body: JSON.stringify({ urls: [{ url }] })\n  });\n  // Stop early on HTTP errors so we never poll with an undefined jobId\n  if (!submitRes.ok) throw new Error(`Submit failed: ${submitRes.status}`);\n  const { jobId } = await submitRes.json();\n\n  while (true) {\n    const statusRes = await fetch(`${BASE_URL}/scrape/${jobId}`, {\n      headers: HEADERS\n    });\n    const data = await statusRes.json();\n\n    if (data.status === \"completed\") return data.result;\n    if (data.status === \"failed\") throw new 
Error(data.error);\n\n    await new Promise(r => setTimeout(r, 3000));\n  }\n}\n\nconst result = await scrape(\"https://news.ycombinator.com\");\nconsole.log(result.content);\n```\n\n## AI extraction with prompts\n\nThe plain scrape above gives you raw Markdown. Most of the time you want something more specific. That is where the `prompt` field comes in.\n\nAdd a `prompt` and Spidra reads the rendered page and extracts exactly what you described. The AI understands context. It knows a number next to a currency symbol is a price, that a short bold line near the top of a product page is probably the title, and that a block of longer text is likely a description. You describe the output you want and it figures out where to find it.\n\n```\ncurl -X POST https://api.spidra.io/api/scrape \\\n  -H \"x-api-key: YOUR_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"urls\": [{\"url\": \"https://news.ycombinator.com\"}],\n    \"prompt\": \"Extract the top 10 post titles and their point scores\",\n    \"output\": \"json\"\n  }'\n```\n\nWhen the job completes, `result.content` contains the AI-extracted data as JSON:\n\n```\n[\n  {\"title\": \"Show HN: I built a thing\", \"points\": 342},\n  {\"title\": \"Ask HN: What are you working on?\", \"points\": 289}\n]\n```\n\nThe `output` field controls the format. It defaults to `\"json\"` but you can set it to `\"markdown\"` if you want the extracted content as formatted text instead of structured data.\n\nOne thing to know: if you set `output: \"json\"` without a `prompt`, Spidra still runs a default AI extraction pass. 
If you want the raw page content with no AI processing at all, omit both `output` and `prompt`.\n\nIf AI extraction fails for any reason (a near-empty page, a heavily obfuscated site), Spidra falls back to returning the raw page Markdown and sets `ai_extraction_failed: true` in the response so your code can detect and handle it.\n\n## Structured output with JSON schema\n\nPrompts are flexible but they are not predictable. The AI decides what fields to return and what to call them. For production pipelines where downstream systems expect a specific shape, that is a problem.\n\nThe `schema` field solves this. Pass a JSON Schema object and the AI must return data that matches it exactly. Required fields always appear in the output, as `null` if the page does not have that value. Field names match exactly what you defined. The structure never varies between runs.\n\n```\ncurl -X POST https://api.spidra.io/api/scrape \\\n  -H \"x-api-key: YOUR_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"urls\": [{\"url\": \"https://jobs.example.com/senior-engineer\"}],\n    \"prompt\": \"Extract the job details. Normalize salary to a number in USD.\",\n    \"schema\": {\n      \"type\": \"object\",\n      \"required\": [\"title\", \"company\", \"remote\", \"employment_type\"],\n      \"properties\": {\n        \"title\":           {\"type\": \"string\"},\n        \"company\":         {\"type\": \"string\"},\n        \"remote\":          {\"type\": [\"boolean\", \"null\"]},\n        \"salary_min\":      {\"type\": [\"number\", \"null\"]},\n        \"salary_max\":      {\"type\": [\"number\", \"null\"]},\n        \"employment_type\": {\n          \"type\": [\"string\", \"null\"],\n          \"enum\": [\"full_time\", \"part_time\", \"contract\", null]\n        }\n      }\n    }\n  }'\n```\n\nThe response will always have `title`, `company`, `remote`, and `employment_type` because they are in `required`. 
If the page does not mention a salary, `salary_min` and `salary_max` come back as `null` rather than being omitted.\n\nWhen you provide a `schema`, `output` is automatically set to `\"json\"`. You do not need to set it yourself.\n\nThe schema is validated before the job is queued. If it is malformed, the API returns a `422` with descriptive errors. Non-fatal issues like unsupported keywords come back as `schema_warnings` in the response.\n\nSchema limits to be aware of: maximum nesting depth is 5 levels, maximum schema size is 10 KB.\n\n## Browser actions\n\nSome pages do not show you the data you want until you interact with them first. Cookie banners blocking content. A \"Load More\" button that reveals the next batch of results. A search form you need to fill before anything appears. Tabs that hide content by default.\n\nThe `actions` array on each URL object lets you interact with the page before extraction runs. Actions execute in order, inside a real browser, before Spidra runs your prompt.\n\nHere is an example that dismisses a cookie banner, fills a search form, waits for results to load, and scrolls down the page:\n\n```\n{\n  \"urls\": [{\n    \"url\": \"https://example.com/search\",\n    \"actions\": [\n      {\"type\": \"click\", \"value\": \"Accept cookies button\"},\n      {\"type\": \"type\", \"selector\": \"input[name='q']\", \"value\": \"wireless headphones\"},\n      {\"type\": \"click\", \"selector\": \"button[type='submit']\"},\n      {\"type\": \"wait\", \"duration\": 1500},\n      {\"type\": \"scroll\", \"to\": \"80%\"}\n    ]\n  }],\n  \"prompt\": \"Extract all product names and prices from the search results\",\n  \"output\": \"json\"\n}\n```\n\nNotice that for the first `click`, the `value` field is a plain English description of the element. For the second `click`, the `selector` field is a CSS selector. 
Both approaches work and you can mix them in the same actions array.\n\nFor any `click`, `check`, or `uncheck` action:\n\n- Use `selector` for a CSS selector like `\"#accept-cookies\"` or `\".submit-btn\"`, or for an XPath expression\n- Use `value` for a plain English description like `\"Accept cookies button\"` and Spidra's AI will find the element for you\n\nBoth are equally valid. Use whichever makes more sense for the page you are working with.\n\n### Available actions\n\n| Action | What it does | Key fields |\n|---|---|---|\n| `click` | Clicks a button, link, tab, or any element | `selector` or `value` |\n| `type` | Types text into an input or search field | `selector`, `value` |\n| `check` | Checks a checkbox | `selector` or `value` |\n| `uncheck` | Unchecks a checkbox | `selector` or `value` |\n| `wait` | Pauses for a number of milliseconds | `duration` |\n| `scroll` | Scrolls to a percentage of the page height | `to` (e.g. `\"80%\"`) |\n| `forEach` | Finds matching elements and processes each one | `value`, `mode` |\n\n### The forEach action\n\n`forEach` is the most powerful action in the API. It finds a set of repeating elements on the page (product cards, search result links, accordion rows, directory listings) and processes each one individually, then combines all the results into a single output.\n\nIt supports three modes:\n\n`inline` reads the content of each matched element directly. Use this for product cards, table rows, or any content that lives inside the element itself.\n\n`navigate` follows each element as a link, loads the destination page, and scrapes it. Use this when the data you want is on detail pages that you need to navigate into.\n\n`click` clicks each element to expand or reveal content, then scrapes what appears. 
Use this for accordions, modals, or expandable sections.\n\n```\n{\n  \"urls\": [{\n    \"url\": \"https://directory.example.com/companies\",\n    \"actions\": [\n      {\"type\": \"click\", \"value\": \"Accept cookies\"},\n      {\n        \"type\": \"forEach\",\n        \"value\": \"Find all company listing cards\",\n        \"mode\": \"navigate\",\n        \"maxItems\": 20,\n        \"itemPrompt\": \"Extract company name, website, and industry\",\n        \"pagination\": {\n          \"nextSelector\": \"a.next-page\",\n          \"maxPages\": 3\n        }\n      }\n    ]\n  }],\n  \"output\": \"json\"\n}\n```\n\nThis dismisses the cookie banner, finds the company cards on the page (up to `maxItems`), navigates into each one, extracts the company details, and repeats across up to 3 pages of pagination. All in a single API call.\n\n## Proxy and geo-targeting\n\nSome sites block traffic from cloud IP ranges. Others serve different content based on location. The `useProxy` and `proxyCountry` fields route your requests through residential proxies to handle both situations.\n\n```\n{\n  \"urls\": [{\"url\": \"https://amazon.de/dp/B123456\"}],\n  \"prompt\": \"Extract the product price\",\n  \"output\": \"json\",\n  \"useProxy\": true,\n  \"proxyCountry\": \"de\"\n}\n```\n\nSetting `useProxy: true` routes the request through the residential proxy network. `proxyCountry` accepts:\n\n- A two-letter ISO country code like `\"us\"`, `\"de\"`, `\"gb\"`, `\"fr\"`\n- `\"eu\"` to rotate randomly across all 27 EU member states\n- `\"global\"`, or omit it entirely, for no country preference\n\nProxy usage is billed from your bandwidth quota, not your credits. There is no credit multiplier for using proxies.\n\n## Additional options\n\n### Extract content only\n\nStrip navigation, headers, footers, and sidebars before extraction. 
Useful when you only want the main content of a page and want to reduce noise.\n\n```\n{\n  \"urls\": [{\"url\": \"https://blog.example.com/article\"}],\n  \"prompt\": \"Summarize this article\",\n  \"extractContentOnly\": true\n}\n```\n\n### Screenshots\n\nCapture screenshots of scraped pages for debugging, archival, or visual monitoring.\n\n```\n{\n  \"urls\": [{\"url\": \"https://example.com\"}],\n  \"screenshot\": true,\n  \"fullPageScreenshot\": true\n}\n```\n\n`screenshot: true` captures the visible viewport. `fullPageScreenshot: true` captures the entire scrollable page. The screenshot URLs are returned in `result.screenshots` and in each item's `screenshotUrl` field.\n\n### Authenticated scraping\n\nPass session cookies to access pages behind a login. Get the cookies from your browser's DevTools after logging in manually, then include them in your request.\n\n```\n{\n  \"urls\": [{\"url\": \"https://app.example.com/dashboard\"}],\n  \"prompt\": \"Extract the account summary\",\n  \"cookies\": \"session_id=abc123; auth_token=xyz789\"\n}\n```\n\nStandard cookie format (`name=value; name2=value2`) and Chrome DevTools paste format both work. Cookies are passed ephemerally to the browser worker and never stored.\n\n## Batch scraping\n\nWhen you have a list of URLs to process, the batch endpoint handles up to 50 at a time in parallel. 
Each URL runs in its own independent worker.\n\n**Endpoint:** `POST /api/batch/scrape`\n\n``` python\nimport requests\nimport time\n\nAPI_KEY = \"YOUR_API_KEY\"\nBASE_URL = \"https://api.spidra.io/api\"\nHEADERS = {\"x-api-key\": API_KEY, \"Content-Type\": \"application/json\"}\n\nurls = [\n    \"https://example.com/product/1\",\n    \"https://example.com/product/2\",\n    \"https://example.com/product/3\",\n]\n\n# Submit the batch\nresponse = requests.post(\n    f\"{BASE_URL}/batch/scrape\",\n    headers=HEADERS,\n    json={\n        \"urls\": urls,\n        \"prompt\": \"Extract the product name, price, and availability\",\n        \"output\": \"json\",\n    }\n)\n# Fail fast on 401/403/429 rather than hitting a KeyError on batchId\nresponse.raise_for_status()\nbatch_id = response.json()[\"batchId\"]\n\n# Poll until the batch reaches a terminal state\nwhile True:\n    status = requests.get(\n        f\"{BASE_URL}/batch/scrape/{batch_id}\",\n        headers=HEADERS\n    ).json()\n\n    if status[\"status\"] in (\"completed\", \"failed\", \"partial\"):\n        break\n\n    time.sleep(3)\n\n# Process results\nfor item in status[\"items\"]:\n    if item[\"status\"] == \"completed\":\n        print(f\"{item['url']}: {item['result']['content']}\")\n    else:\n        print(f\"Failed: {item['url']} - {item['error']}\")\n```\n\nThe batch response includes a `status` for the overall batch and an `items` array with one entry per URL. Each item has its own `status`, `result`, and `error` so you can see exactly which URLs succeeded and which failed.\n\nCredits are reserved upfront when you submit and reconciled per item when processing completes. If a URL fails, credits for that item are returned.\n\n### Batch with structured output\n\nEverything that works in single scrape works in batch. 
Pass a schema and every item in the batch returns data matching that shape:\n\n```\nrequests.post(\n    f\"{BASE_URL}/batch/scrape\",\n    headers=HEADERS,\n    json={\n        \"urls\": urls,\n        \"prompt\": \"Extract the product details\",\n        \"schema\": {\n            \"type\": \"object\",\n            \"required\": [\"name\", \"price\"],\n            \"properties\": {\n                \"name\":      {\"type\": \"string\"},\n                \"price\":     {\"type\": [\"number\", \"null\"]},\n                \"currency\":  {\"type\": [\"string\", \"null\"]},\n                \"available\": {\"type\": [\"boolean\", \"null\"]}\n            }\n        }\n    }\n)\n```\n\n### Managing batches\n\nBeyond submitting and polling, the batch API has a few more endpoints worth knowing:\n\n| Endpoint | What it does |\n|---|---|\n| `GET /api/batch/scrape` | List all your batch jobs with status and credit usage |\n| `DELETE /api/batch/scrape/{batchId}` | Cancel a running or pending batch. Credits for unprocessed items are refunded. |\n| `POST /api/batch/scrape/{batchId}/retry` | Re-queue only the failed items in a completed batch without resubmitting the ones that already succeeded. |\n\nThe retry endpoint is particularly useful for large batches where a handful of items fail due to transient issues. You do not need to resubmit the full batch, just the failures.\n\n## Crawling\n\nBatch scraping works when you already know the URLs. Crawling is for when you want Spidra to discover pages for you.\n\nYou give it a starting URL, describe which pages to follow, and describe what to extract from each one. 
Spidra loads the base URL, finds links matching your crawl instruction, visits each one up to your `maxPages` limit, and applies your transform instruction to every page it visits.\n\n**Endpoint:** `POST /api/crawl`\n\n```\nresponse = requests.post(\n    f\"{BASE_URL}/crawl\",\n    headers=HEADERS,\n    json={\n        \"baseUrl\": \"https://docs.example.com\",\n        \"crawlInstruction\": \"Follow all documentation pages. Skip changelog and login pages.\",\n        \"transformInstruction\": \"Extract the page title and full body text as clean Markdown. Preserve all headings and code examples.\",\n        \"maxPages\": 20,\n        \"useProxy\": False\n    }\n)\njob_id = response.json()[\"jobId\"]\n```\n\nThree fields are required: `baseUrl`, `crawlInstruction`, and `transformInstruction`. Everything else is optional.\n\n`maxPages` defaults to 5 and goes up to 20. The crawl discovers links from the base URL first, then works through them in order of discovery.\n\nPoll `GET /api/crawl/{jobId}` for status. When complete, results are available through several endpoints:\n\n| Endpoint | What it returns |\n|---|---|\n| `GET /api/crawl/{jobId}` | Overall status and summary |\n| `GET /api/crawl/{jobId}/pages` | All crawled pages with extracted data and signed URLs to the original HTML and Markdown |\n| `GET /api/crawl/{jobId}/download` | ZIP archive of all results |\n| `POST /api/crawl/{jobId}/extract` | Run a new extraction on already-crawled pages without re-crawling |\n| `GET /api/crawl/history` | Paginated list of your past crawl jobs |\n\nThe `extract` endpoint is worth highlighting. If you crawl a site and later decide you want to extract different fields, you can run a new extraction on the cached pages without making a single new browser request. 
That saves time and credits.\n\n### A complete crawl example\n\n``` python\nimport requests\nimport time\nimport json\n\nAPI_KEY = \"YOUR_API_KEY\"\nBASE_URL = \"https://api.spidra.io/api\"\nHEADERS = {\"x-api-key\": API_KEY, \"Content-Type\": \"application/json\"}\n\n# Submit the crawl\njob = requests.post(\n    f\"{BASE_URL}/crawl\",\n    headers=HEADERS,\n    json={\n        \"baseUrl\": \"https://blog.example.com\",\n        \"crawlInstruction\": \"Follow all blog post pages. Skip tag pages, author pages, and the homepage.\",\n        \"transformInstruction\": \"Extract the article title, author, publish date, and full body text.\",\n        \"maxPages\": 15\n    }\n).json()\n\njob_id = job[\"jobId\"]\nprint(f\"Crawl job started: {job_id}\")\n\n# Poll until complete\nwhile True:\n    status = requests.get(\n        f\"{BASE_URL}/crawl/{job_id}\",\n        headers=HEADERS\n    ).json()\n\n    print(f\"Status: {status['status']}\")\n\n    if status[\"status\"] == \"completed\":\n        break\n    elif status[\"status\"] == \"failed\":\n        raise Exception(\"Crawl failed\")\n\n    time.sleep(5)\n\n# Fetch all crawled pages\npages = requests.get(\n    f\"{BASE_URL}/crawl/{job_id}/pages\",\n    headers=HEADERS\n).json()\n\n# Save as JSONL\nwith open(\"crawl_results.jsonl\", \"w\") as f:\n    for page in pages[\"pages\"]:\n        f.write(json.dumps({\n            \"url\": page[\"url\"],\n            \"data\": page[\"data\"]\n        }) + \"\\n\")\n\nprint(f\"Saved {len(pages['pages'])} pages\")\n```\n\n## Monitoring and logs\n\nThe Spidra API keeps a log of every scrape job you run. 
This is useful for debugging, auditing, and understanding your credit consumption.\n\n```\n# List recent scrape logs\nlogs = requests.get(\n    f\"{BASE_URL}/scrape-logs\",\n    headers=HEADERS\n).json()\n\nfor log in logs[\"data\"]:\n    print(f\"{log['started_at']} - {log['status']} - {log['latency_ms']}ms - {log['tokens_used']} tokens\")\n\n# Get full details of a specific log (here, the last one from the loop above)\nlog_detail = requests.get(\n    f\"{BASE_URL}/scrape-logs/{log['uuid']}\",\n    headers=HEADERS\n).json()\n```\n\n### Usage statistics\n\nTrack your credit consumption over time:\n\n```\nusage = requests.get(\n    f\"{BASE_URL}/account/usage\",\n    headers=HEADERS\n).json()\n\nprint(usage)\n```\n\nThis returns time-series data covering requests, tokens, crawls, and credit consumption over a configurable period.\n\n## Putting it all together: a real pipeline\n\nHere is a complete example that combines scraping, `forEach` browser actions with pagination, and structured output into a pipeline that collects job listings from multiple job boards and saves them to a JSONL file:\n\n``` python\nimport requests\nimport time\nimport json\n\nAPI_KEY = \"YOUR_API_KEY\"\nBASE_URL = \"https://api.spidra.io/api\"\nHEADERS = {\"x-api-key\": API_KEY, \"Content-Type\": \"application/json\"}\n\nJOB_SCHEMA = {\n    \"type\": \"object\",\n    \"required\": [\"title\", \"company\", \"location\"],\n    \"properties\": {\n        \"title\":           {\"type\": \"string\"},\n        \"company\":         {\"type\": \"string\"},\n        \"location\":        {\"type\": [\"string\", \"null\"]},\n        \"remote\":          {\"type\": [\"boolean\", \"null\"]},\n        \"salary_min\":      {\"type\": [\"number\", \"null\"]},\n        \"salary_max\":      {\"type\": [\"number\", \"null\"]},\n        \"employment_type\": {\n            \"type\": [\"string\", \"null\"],\n            \"enum\": [\"full_time\", \"part_time\", \"contract\", None]\n        }\n    }\n}\n\ndef collect_job_urls(board_url):\n    \"\"\"Use forEach to collect job listing 
URLs from a board page.\"\"\"\n    response = requests.post(\n        f\"{BASE_URL}/scrape\",\n        headers=HEADERS,\n        json={\n            \"urls\": [{\n                \"url\": board_url,\n                \"actions\": [\n                    {\"type\": \"click\", \"value\": \"Accept cookies\"},\n                    {\n                        \"type\": \"forEach\",\n                        \"value\": \"Find all job listing links\",\n                        \"mode\": \"navigate\",\n                        \"maxItems\": 50,\n                        \"itemPrompt\": \"Extract job title, company, location, remote status, salary range, and employment type\",\n                        \"pagination\": {\n                            \"nextSelector\": \"a.next-page\",\n                            \"maxPages\": 3\n                        }\n                    }\n                ]\n            }],\n            \"output\": \"json\",\n            \"schema\": JOB_SCHEMA\n        }\n    )\n    job_id = response.json()[\"jobId\"]\n\n    while True:\n        status = requests.get(\n            f\"{BASE_URL}/scrape/{job_id}\",\n            headers=HEADERS\n        ).json()\n\n        if status[\"status\"] == \"completed\":\n            return status[\"result\"][\"content\"]\n        elif status[\"status\"] == \"failed\":\n            raise Exception(status[\"error\"])\n\n        time.sleep(3)\n\n# Collect from multiple job boards\nboards = [\n    \"https://jobs.example.com/engineering\",\n    \"https://careers.anothersite.com/remote\",\n]\n\nall_jobs = []\nfor board in boards:\n    print(f\"Collecting from {board}...\")\n    jobs = collect_job_urls(board)\n    if isinstance(jobs, list):\n        all_jobs.extend(jobs)\n    print(f\"  Got {len(jobs) if isinstance(jobs, list) else 0} jobs\")\n\n# Save results\nwith open(\"jobs.jsonl\", \"w\") as f:\n    for job in all_jobs:\n        f.write(json.dumps(job) + \"\\n\")\n\nprint(f\"\\nTotal: {len(all_jobs)} jobs saved to 
jobs.jsonl\")\n```\n\n## Error handling\n\nWrap your API calls properly and handle the cases that actually come up in production.\n\n``` python\nimport requests\n\ndef safe_scrape(url, prompt):\n    try:\n        response = requests.post(\n            f\"{BASE_URL}/scrape\",\n            headers=HEADERS,\n            json={\n                \"urls\": [{\"url\": url}],\n                \"prompt\": prompt,\n                \"output\": \"json\"\n            }\n        )\n\n        if response.status_code == 401:\n            raise Exception(\"Invalid API key. Check your x-api-key header.\")\n\n        if response.status_code == 403:\n            raise Exception(\"Credits exhausted or plan limit reached.\")\n\n        if response.status_code == 429:\n            raise Exception(\"Rate limit hit. Wait before retrying.\")\n\n        response.raise_for_status()\n        return response.json()[\"jobId\"]\n\n    except requests.exceptions.ConnectionError:\n        raise Exception(\"Could not connect to the Spidra API.\")\n```\n\nFor polling loops, always handle the `failed` status and check `ai_extraction_failed` in the result:\n\n```\nif status[\"status\"] == \"completed\":\n    result = status[\"result\"]\n\n    if result.get(\"ai_extraction_failed\"):\n        # AI extraction failed, content is raw Markdown fallback\n        print(\"AI extraction failed, using raw content\")\n        content = result[\"data\"][0][\"markdownContent\"]\n    else:\n        content = result[\"content\"]\n```\n\n## API reference summary\n\n| Method | Endpoint | Purpose |\n|---|---|---|\n| `POST` | `/api/scrape` | Submit a scrape job (1 to 3 URLs) |\n| `GET` | `/api/scrape/{jobId}` | Poll for job status and results |\n| `POST` | `/api/batch/scrape` | Submit a batch job (up to 50 URLs) |\n| `GET` | `/api/batch/scrape/{batchId}` | Poll batch status and per-item results |\n| `GET` | `/api/batch/scrape` | List all your batch jobs |\n| `DELETE` | `/api/batch/scrape/{batchId}` | Cancel a batch and refund 
unused credits |\n| `POST` | `/api/batch/scrape/{batchId}/retry` | Retry only the failed items in a batch |\n| `POST` | `/api/crawl` | Submit a crawl job |\n| `GET` | `/api/crawl/{jobId}` | Poll crawl status |\n| `GET` | `/api/crawl/{jobId}/pages` | Get all crawled pages with extracted data |\n| `POST` | `/api/crawl/{jobId}/extract` | Re-extract from crawled pages without re-crawling |\n| `GET` | `/api/crawl/{jobId}/download` | Download crawl results as ZIP |\n| `GET` | `/api/crawl/history` | List your past crawl jobs |\n| `GET` | `/api/scrape-logs` | List recent scrape logs |\n| `GET` | `/api/scrape-logs/{id}` | Get full details of a single log |\n| `GET` | `/api/account/usage` | Get usage statistics |\n\n## What next\n\nYou now have a working understanding of every part of the Spidra API. Here are the natural next steps depending on what you are building:\n\nIf you want to go deeper on browser actions and `forEach`, read the [Browser Actions Guide](https://docs.spidra.io/features/actions) in the docs. It covers every option for each action type with real examples.\n\nIf you are building something that needs guaranteed output shapes, read the [Structured Output Guide](https://docs.spidra.io/features/structured-output) for full details on schemas, nullable fields, Zod and Pydantic integration, and schema limits.\n\nIf you are using an SDK in a specific language, each one has its own guide: [Node.js](https://docs.spidra.io/sdks/node), [Python](https://docs.spidra.io/sdks/python), [Go](https://docs.spidra.io/sdks/go), [PHP](https://docs.spidra.io/sdks/php), [Ruby](https://docs.spidra.io/sdks/ruby), [Rust](https://docs.spidra.io/sdks/rust), [.NET](https://docs.spidra.io/sdks/dotnet), [Elixir](https://docs.spidra.io/sdks/elixir), [Java](https://docs.spidra.io/sdks/java), and [Swift](https://docs.spidra.io/sdks/swift).\n\nGet your API key at [app.spidra.io](https://app.spidra.io/). 
The free plan has 300 credits and no card required.", "url": "https://wpnews.pro/news/spidra-api-tutorial-complete-guide-to-web-scraping-with-the-spidra-api", "canonical_source": "https://spidra.io/blog/spidra-api-tutorial", "published_at": "2026-06-09 00:00:00+00:00", "updated_at": "2026-06-11 18:46:16.050712+00:00", "lang": "en", "topics": ["ai-tools", "ai-products", "ai-infrastructure"], "entities": ["Spidra API", "Spidra"], "alternates": {"html": "https://wpnews.pro/news/spidra-api-tutorial-complete-guide-to-web-scraping-with-the-spidra-api", "markdown": "https://wpnews.pro/news/spidra-api-tutorial-complete-guide-to-web-scraping-with-the-spidra-api.md", "text": "https://wpnews.pro/news/spidra-api-tutorial-complete-guide-to-web-scraping-with-the-spidra-api.txt", "jsonld": "https://wpnews.pro/news/spidra-api-tutorial-complete-guide-to-web-scraping-with-the-spidra-api.jsonld"}}