Spidra API tutorial: complete guide to web scraping with the Spidra API

Spidra released a new API that allows developers to scrape websites by sending a URL and receiving structured data, eliminating the need for custom selectors, headless browsers, or anti-bot workarounds. The REST API handles browser rendering, CAPTCHA solving, proxy rotation, and AI extraction on its servers, with jobs running asynchronously and returning results after polling. The free plan includes 300 credits with no credit card required, and authentication requires only an API key in the request header.

Getting data from websites programmatically has always involved more work than it should. You write selectors, and they break when the site updates. You try a headless browser, and anti-bot protection blocks you. You get the data, but it is raw HTML and you still have to parse it into something useful.

The [Spidra API](https://spidra.io/products/spidra-api) is designed to solve all three of those problems in one place. You send a URL, describe what you want, and get back structured data. The browser rendering, CAPTCHA solving, proxy rotation, and AI extraction all happen on Spidra's side.

This guide walks through the entire API from authentication to crawling. By the end you will know how every endpoint works, what the response structure looks like, and how to build a real scraping pipeline around it.

## Before you start

You need a Spidra account and an API key. Sign up at [spidra.io](https://spidra.io/). The free plan includes 300 credits with no credit card required.

Once you are in, go to app.spidra.io → Settings → API Keys and create a key. Keep it somewhere safe. Every request you make to the API includes this key in the header.

## How the API works

The Spidra API is a REST API with one base URL:

```
https://api.spidra.io/api
```

Every request is authenticated by including your API key in the `x-api-key` header. There are no bearer tokens, no OAuth flows, just a header on every request.

```bash
curl -X POST https://api.spidra.io/api/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"urls": [{"url": "https://example.com"}]}'
```

One important thing to understand before you make your first request: Spidra jobs are **asynchronous**. When you submit a scrape, you do not get the data back immediately. You get a job ID. You then poll a status endpoint every few seconds until the job is complete and the data is ready. This is by design. Browser rendering, CAPTCHA solving, and AI extraction take a few seconds.
The async pattern means you are not holding a connection open the whole time.

The flow for every job type looks like this:

- Submit the job. Receive a job ID in the response.
- Poll the status endpoint every 2 to 5 seconds.
- When status is `completed`, read your results.

Now let us go through each part of the API.

## Authentication

Every request needs the `x-api-key` header. That is it.

```bash
-H "x-api-key: YOUR_API_KEY"
```

If the key is missing or invalid, the API returns a `401`. If your credits are exhausted, you get a `403`. Here is the full set of response codes you will encounter:

| Code | What it means |
|---|---|
| 200 | Request completed successfully |
| 202 | Job queued successfully. Poll for results. |
| 400 | Bad request. Missing or invalid parameters. |
| 401 | API key missing, invalid, or expired |
| 403 | Credits exhausted or plan limit reached |
| 404 | Job or resource not found |
| 429 | Rate limit hit. Back off and retry. |
| 500 | Something went wrong on Spidra's side |

All errors come back in the same format:

```json
{
  "status": "error",
  "message": "Detailed explanation of what went wrong"
}
```

## Scraping a single page

The scrape endpoint is where most people start. You give it one to three URLs and it returns structured data from each one.

**Endpoint:** `POST /api/scrape`

### The minimal request

The only required field is `urls`, which takes an array of URL objects. Each URL object requires a `url` field and optionally takes an `actions` array for browser interactions.

```bash
curl -X POST https://api.spidra.io/api/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [{"url": "https://news.ycombinator.com"}]
  }'
```

Response:

```json
{
  "status": "queued",
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "message": "Scrape job has been queued. Poll /api/scrape/550e8400... to get the result."
}
```

Save that `jobId`. You need it to check on the job.

### Polling for results

Call `GET /api/scrape/{jobId}` every few seconds until the status changes.
```bash
curl https://api.spidra.io/api/scrape/550e8400-e29b-41d4-a716-446655440000 \
  -H "x-api-key: YOUR_API_KEY"
```

While the job is running, you will see something like this:

```json
{
  "status": "active",
  "progress": {
    "message": "Processing content with AI...",
    "progress": 0.6
  },
  "result": null,
  "error": null
}
```

The `progress` field goes from 0 to 1 as the job moves through its stages: loading the browser, executing actions, solving CAPTCHAs, running AI extraction.

When it finishes:

```json
{
  "status": "completed",
  "progress": {
    "message": "Scrape completed successfully",
    "progress": 1
  },
  "result": {
    "content": "...",
    "data": [
      {
        "url": "https://news.ycombinator.com",
        "title": "Hacker News",
        "markdownContent": "...",
        "success": true,
        "screenshotUrl": null
      }
    ],
    "screenshots": [],
    "ai_extraction_failed": false,
    "stats": {
      "durationMs": 4200,
      "captchaSolvedCount": 0,
      "inputTokens": 312,
      "outputTokens": 84,
      "totalTokens": 396
    }
  },
  "error": null
}
```

The `result.content` field is the main output. What it contains depends on what you asked for:

- If you passed a `prompt`, `content` is the AI-extracted result
- If you did not pass a `prompt`, `content` is the raw page content as Markdown

`result.data` is an array with one entry per URL. Each entry has the page title, the full Markdown content for that URL, whether it succeeded, and a screenshot URL if you requested one.

`result.stats` tells you how long the job took, how many CAPTCHAs were solved, and how many tokens the AI extraction used.
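To make the response structure above concrete, here is a minimal sketch of a helper that condenses a completed `result` object into a few headline numbers. The function name `summarize_result` and the summary keys are illustrative and not part of the API, and the fallback flag is assumed to be spelled `ai_extraction_failed`. The sample dictionary simply mirrors the completed response shown above.

```python
from typing import Any


def summarize_result(result: dict[str, Any]) -> dict[str, Any]:
    """Condense a completed job's `result` object (hypothetical helper)."""
    pages = result.get("data") or []
    stats = result.get("stats") or {}
    # Collect the URLs whose per-page entry did not report success.
    failed_urls = [page.get("url") for page in pages if not page.get("success")]
    return {
        "pages_total": len(pages),
        "failed_urls": failed_urls,
        # Assumed field name; true means `content` is the raw Markdown fallback.
        "ai_fallback": bool(result.get("ai_extraction_failed")),
        "duration_seconds": stats.get("durationMs", 0) / 1000,
        "total_tokens": stats.get("totalTokens", 0),
    }


# Sample shaped like the completed response in this guide.
sample_result = {
    "content": "...",
    "data": [
        {
            "url": "https://news.ycombinator.com",
            "title": "Hacker News",
            "markdownContent": "...",
            "success": True,
            "screenshotUrl": None,
        }
    ],
    "screenshots": [],
    "ai_extraction_failed": False,
    "stats": {"durationMs": 4200, "captchaSolvedCount": 0, "totalTokens": 396},
}

summary = summarize_result(sample_result)
print(summary)
```

Logging a summary like this per job is one way to spot slow pages or unexpected fallbacks without storing the full response.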
### A polling loop in Python

```python
import requests
import time

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}


def scrape(url):
    # Submit the job
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers=HEADERS,
        json={"urls": [{"url": url}]},
    )
    response.raise_for_status()
    job_id = response.json()["jobId"]

    # Poll until complete
    while True:
        status_response = requests.get(f"{BASE_URL}/scrape/{job_id}", headers=HEADERS)
        data = status_response.json()
        if data["status"] == "completed":
            return data["result"]
        elif data["status"] == "failed":
            raise Exception(f"Scrape failed: {data['error']}")
        time.sleep(3)


result = scrape("https://news.ycombinator.com")
print(result["content"])
```

The same in Node.js:

```js
const API_KEY = "YOUR_API_KEY";
const BASE_URL = "https://api.spidra.io/api";
const HEADERS = { "x-api-key": API_KEY, "Content-Type": "application/json" };

async function scrape(url) {
  // Submit the job
  const submitRes = await fetch(`${BASE_URL}/scrape`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({ urls: [{ url }] }),
  });
  const { jobId } = await submitRes.json();

  // Poll until complete
  while (true) {
    const statusRes = await fetch(`${BASE_URL}/scrape/${jobId}`, { headers: HEADERS });
    const data = await statusRes.json();
    if (data.status === "completed") return data.result;
    if (data.status === "failed") throw new Error(data.error);
    await new Promise((r) => setTimeout(r, 3000));
  }
}

const result = await scrape("https://news.ycombinator.com");
console.log(result.content);
```

## AI extraction with prompts

The plain scrape above gives you raw Markdown. Most of the time you want something more specific. That is where the `prompt` field comes in.

Add a prompt and Spidra reads the rendered page and extracts exactly what you described. The AI understands context. It knows a number next to a currency symbol is a price, that a short bold line near the top of a product page is probably the title, and that a block of longer text is likely a description.
You describe the output you want and it figures out where to find it.

```bash
curl -X POST https://api.spidra.io/api/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [{"url": "https://news.ycombinator.com"}],
    "prompt": "Extract the top 10 post titles and their point scores",
    "output": "json"
  }'
```

When the job completes, `result.content` contains the AI-extracted data as JSON:

```json
[
  {"title": "Show HN: I built a thing", "points": 342},
  {"title": "Ask HN: What are you working on?", "points": 289}
]
```

The `output` field controls the format. It defaults to `"json"` but you can set it to `"markdown"` if you want the extracted content as formatted text instead of structured data.

One thing to know: if you set `output: "json"` without a `prompt`, Spidra still runs a default AI extraction pass. If you want the raw page content with no AI processing at all, omit both `output` and `prompt`.

If AI extraction fails for any reason (a near-empty page, a heavily obfuscated site), Spidra falls back to returning the raw page Markdown and sets `ai_extraction_failed: true` in the response so your code can detect and handle it.

## Structured output with JSON schema

Prompts are flexible but they are not predictable. The AI decides what fields to return and what to call them. For production pipelines where downstream systems expect a specific shape, that is a problem.

The `schema` field solves this. Pass a JSON Schema object and the AI must return data that matches it exactly. Required fields always appear in the output, as `null` if the page does not have that value. Field names match exactly what you defined. The structure never varies between runs.

```bash
curl -X POST https://api.spidra.io/api/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [{"url": "https://jobs.example.com/senior-engineer"}],
    "prompt": "Extract the job details. Normalize salary to a number in USD.",
    "schema": {
      "type": "object",
      "required": ["title", "company", "remote", "employment_type"],
      "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "remote": {"type": ["boolean", "null"]},
        "salary_min": {"type": ["number", "null"]},
        "salary_max": {"type": ["number", "null"]},
        "employment_type": {
          "type": ["string", "null"],
          "enum": ["full_time", "part_time", "contract", null]
        }
      }
    }
  }'
```

The response will always have `title`, `company`, `remote`, and `employment_type` because they are in `required`. If the page does not mention a salary, `salary_min` and `salary_max` come back as `null` rather than being omitted.

When you provide a `schema`, `output` is automatically set to `"json"`. You do not need to set it yourself.

The schema is validated before the job is queued. If it is malformed, the API returns a `422` with descriptive errors. Non-fatal issues like unsupported keywords come back as `schema_warnings` in the response.

Schema limits to be aware of: maximum nesting depth is 5 levels, maximum schema size is 10 KB.

## Browser actions

Some pages do not show you the data you want until you interact with them first. Cookie banners blocking content. A "Load More" button that reveals the next batch of results. A search form you need to fill before anything appears. Tabs that hide content by default.

The `actions` array on each URL object lets you interact with the page before extraction runs. Actions execute in order, inside a real browser, before Spidra runs your prompt.
Here is an example that dismisses a cookie banner, fills a search form, and waits for results to load:

```json
{
  "urls": [
    {
      "url": "https://example.com/search",
      "actions": [
        {"type": "click", "value": "Accept cookies button"},
        {"type": "type", "selector": "input[name='q']", "value": "wireless headphones"},
        {"type": "click", "selector": "button[type='submit']"},
        {"type": "wait", "duration": 1500},
        {"type": "scroll", "to": "80%"}
      ]
    }
  ],
  "prompt": "Extract all product names and prices from the search results",
  "output": "json"
}
```

Notice that for the first `click`, the `value` field is a plain English description of the element. For the second `click`, the `selector` field is a CSS selector. Both approaches work and you can mix them in the same actions array.

For any `click`, `check`, or `uncheck` action:

- Use `selector` for a CSS selector or XPath expression like `"#accept-cookies"` or `".submit-btn"`
- Use `value` for a plain English description like `"Accept cookies button"` and Spidra's AI will find the element for you

Both are equally valid. Use whichever makes more sense for the page you are working with.

### Available actions

| Action | What it does | Key fields |
|---|---|---|
| `click` | Clicks a button, link, tab, or any element | `selector` or `value` |
| `type` | Types text into an input or search field | `selector`, `value` |
| `check` | Checks a checkbox | `selector` or `value` |
| `uncheck` | Unchecks a checkbox | `selector` or `value` |
| `wait` | Pauses for a number of milliseconds | `duration` |
| `scroll` | Scrolls to a percentage of the page height | `to` (e.g. `"80%"`) |
| `forEach` | Finds matching elements and processes each one | `value`, `mode` |

### The forEach action

`forEach` is the most powerful action in the API. It finds a set of repeating elements on the page (product cards, search result links, accordion rows, directory listings) and processes each one individually, then combines all the results into a single output.

It supports three modes:

- `inline` reads the content of each matched element directly. Use this for product cards, table rows, or any content that lives inside the element itself.
- `navigate` follows each element as a link, loads the destination page, and scrapes it. Use this when the data you want is on detail pages that you need to navigate into.
- `click` clicks each element to expand or reveal content, then scrapes what appears. Use this for accordions, modals, or expandable sections.

```json
{
  "urls": [
    {
      "url": "https://directory.example.com/companies",
      "actions": [
        {"type": "click", "value": "Accept cookies"},
        {
          "type": "forEach",
          "value": "Find all company listing cards",
          "mode": "navigate",
          "maxItems": 20,
          "itemPrompt": "Extract company name, website, and industry",
          "pagination": {
            "nextSelector": "a.next-page",
            "maxPages": 3
          }
        }
      ]
    }
  ],
  "output": "json"
}
```

This dismisses the cookie banner, finds every company card on the page, navigates into each one, extracts the company details, and repeats across 3 pages of pagination. All in a single API call.

## Proxy and geo-targeting

Some sites block traffic from cloud IP ranges. Others serve different content based on location. The `useProxy` and `proxyCountry` fields route your requests through residential proxies to handle both situations.

```json
{
  "urls": [{"url": "https://amazon.de/dp/B123456"}],
  "prompt": "Extract the product price",
  "output": "json",
  "useProxy": true,
  "proxyCountry": "de"
}
```

Setting `useProxy: true` routes the request through the residential proxy network. `proxyCountry` accepts:

- A two-letter ISO country code like `"us"`, `"de"`, `"gb"`, `"fr"`
- `"eu"` to rotate randomly across all 27 EU member states
- `"global"` or omit it entirely for no country preference

Proxy usage is billed from your bandwidth quota, not your credits. There is no credit multiplier for using proxies.

## Additional options

### Extract content only

Strip navigation, headers, footers, and sidebars before extraction. Useful when you only want the main content of a page and want to reduce noise.
```json
{
  "urls": [{"url": "https://blog.example.com/article"}],
  "prompt": "Summarize this article",
  "extractContentOnly": true
}
```

### Screenshots

Capture screenshots of scraped pages for debugging, archival, or visual monitoring.

```json
{
  "urls": [{"url": "https://example.com"}],
  "screenshot": true,
  "fullPageScreenshot": true
}
```

`screenshot: true` captures the visible viewport. `fullPageScreenshot: true` captures the entire scrollable page. The screenshot URLs are returned in `result.screenshots` and in each item's `screenshotUrl` field.

### Authenticated scraping

Pass session cookies to access pages behind a login. Get the cookies from your browser's DevTools after logging in manually, then include them in your request.

```json
{
  "urls": [{"url": "https://app.example.com/dashboard"}],
  "prompt": "Extract the account summary",
  "cookies": "session_id=abc123; auth_token=xyz789"
}
```

Standard cookie format (`name=value; name2=value2`) and Chrome DevTools paste format both work. Cookies are passed ephemerally to the browser worker and never stored.

## Batch scraping

When you have a list of URLs to process, the batch endpoint handles up to 50 at a time in parallel. Each URL runs in its own independent worker.
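Because a batch accepts at most 50 URLs, a longer list has to be split before submission. The following is a minimal sketch of that splitting step only; `chunk_urls` and `MAX_BATCH_SIZE` are illustrative names, not part of the API, and each resulting chunk would be submitted as its own batch.

```python
from typing import Iterator

MAX_BATCH_SIZE = 50  # per-batch URL limit stated in this guide


def chunk_urls(urls: list[str], size: int = MAX_BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# 120 placeholder URLs split into batches of 50, 50, and 20.
all_urls = [f"https://example.com/product/{i}" for i in range(1, 121)]
batches = list(chunk_urls(all_urls))
print([len(batch) for batch in batches])  # prints [50, 50, 20]
```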
**Endpoint:** `POST /api/batch/scrape`

```python
import requests
import time

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

# Submit the batch
response = requests.post(
    f"{BASE_URL}/batch/scrape",
    headers=HEADERS,
    json={
        "urls": urls,
        "prompt": "Extract the product name, price, and availability",
        "output": "json",
    },
)
batch_id = response.json()["batchId"]

# Poll until complete
while True:
    status = requests.get(f"{BASE_URL}/batch/scrape/{batch_id}", headers=HEADERS).json()
    if status["status"] in ("completed", "failed", "partial"):
        break
    time.sleep(3)

# Process results
for item in status["items"]:
    if item["status"] == "completed":
        print(f"{item['url']}: {item['result']['content']}")
    else:
        print(f"Failed: {item['url']} - {item['error']}")
```

The batch response includes a `status` for the overall batch and an `items` array with one entry per URL. Each item has its own `status`, `result`, and `error` so you can see exactly which URLs succeeded and which failed.

Credits are reserved upfront when you submit and reconciled per item when processing completes. If a URL fails, credits for that item are returned.

### Batch with structured output

Everything that works in single scrape works in batch.
Pass a `schema` and every item in the batch returns data matching that shape:

```python
requests.post(
    f"{BASE_URL}/batch/scrape",
    headers=HEADERS,
    json={
        "urls": urls,
        "prompt": "Extract the product details",
        "schema": {
            "type": "object",
            "required": ["name", "price"],
            "properties": {
                "name": {"type": "string"},
                "price": {"type": ["number", "null"]},
                "currency": {"type": ["string", "null"]},
                "available": {"type": ["boolean", "null"]},
            },
        },
    },
)
```

### Managing batches

Beyond submitting and polling, the batch API has a few more endpoints worth knowing:

| Endpoint | What it does |
|---|---|
| `GET /api/batch/scrape` | List all your batch jobs with status and credit usage |
| `DELETE /api/batch/scrape/{batchId}` | Cancel a running or pending batch. Credits for unprocessed items are refunded. |
| `POST /api/batch/scrape/{batchId}/retry` | Re-queue only the failed items in a completed batch without resubmitting the ones that already succeeded. |

The retry endpoint is particularly useful for large batches where a handful of items fail due to transient issues. You do not need to resubmit the full batch, just the failures.

## Crawling

Batch scraping works when you already know the URLs. Crawling is for when you want Spidra to discover pages for you.

You give it a starting URL, describe which pages to follow, and describe what to extract from each one. Spidra loads the base URL, finds links matching your crawl instruction, visits each one up to your `maxPages` limit, and applies your transform instruction to every page it visits.

**Endpoint:** `POST /api/crawl`

```python
response = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={
        "baseUrl": "https://docs.example.com",
        "crawlInstruction": "Follow all documentation pages. Skip changelog and login pages.",
        "transformInstruction": "Extract the page title and full body text as clean Markdown. Preserve all headings and code examples.",
        "maxPages": 20,
        "useProxy": False,
    },
)
job_id = response.json()["jobId"]
```

Three fields are required: `baseUrl`, `crawlInstruction`, and `transformInstruction`. Everything else is optional. `maxPages` defaults to 5 and goes up to 20.

The crawl discovers links from the base URL first, then works through them in order of discovery.

Poll `GET /api/crawl/{jobId}` for status. When complete, results are available through several endpoints:

| Endpoint | What it returns |
|---|---|
| `GET /api/crawl/{jobId}` | Overall status and summary |
| `GET /api/crawl/{jobId}/pages` | All crawled pages with extracted data and signed URLs to the original HTML and Markdown |
| `GET /api/crawl/{jobId}/download` | ZIP archive of all results |
| `POST /api/crawl/{jobId}/extract` | Run a new extraction on already-crawled pages without re-crawling |
| `GET /api/crawl/history` | Paginated list of your past crawl jobs |

The extract endpoint is worth highlighting. If you crawl a site and later decide you want to extract different fields, you can run a new extraction on the cached pages without making a single new browser request. That saves time and credits.

### A complete crawl example

```python
import requests
import time
import json

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Submit the crawl
job = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={
        "baseUrl": "https://blog.example.com",
        "crawlInstruction": "Follow all blog post pages. Skip tag pages, author pages, and the homepage.",
        "transformInstruction": "Extract the article title, author, publish date, and full body text.",
        "maxPages": 15,
    },
).json()
job_id = job["jobId"]
print(f"Crawl job started: {job_id}")

# Poll until complete
while True:
    status = requests.get(f"{BASE_URL}/crawl/{job_id}", headers=HEADERS).json()
    print(f"Status: {status['status']}")
    if status["status"] == "completed":
        break
    elif status["status"] == "failed":
        raise Exception("Crawl failed")
    time.sleep(5)

# Fetch all crawled pages
pages = requests.get(f"{BASE_URL}/crawl/{job_id}/pages", headers=HEADERS).json()

# Save as JSONL
with open("crawl_results.jsonl", "w") as f:
    for page in pages["pages"]:
        f.write(json.dumps({"url": page["url"], "data": page["data"]}) + "\n")

print(f"Saved {len(pages['pages'])} pages")
```

## Monitoring and logs

The Spidra API keeps a log of every scrape job you run. This is useful for debugging, auditing, and understanding your credit consumption.

```python
# List recent scrape logs
logs = requests.get(f"{BASE_URL}/scrape-logs", headers=HEADERS).json()
for log in logs["data"]:
    print(f"{log['started_at']} - {log['status']} - {log['latency_ms']}ms - {log['tokens_used']} tokens")

# Get full details of a specific log
log_detail = requests.get(f"{BASE_URL}/scrape-logs/{log['uuid']}", headers=HEADERS).json()
```

### Usage statistics

Track your credit consumption over time:

```python
usage = requests.get(f"{BASE_URL}/account/usage", headers=HEADERS).json()
print(usage)
```

This returns time-series data covering requests, tokens, crawls, and credit consumption over a configurable period.
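The scrape log entries described in this section can also be aggregated locally. The sketch below assumes each entry carries `status`, `latency_ms`, and `tokens_used` fields (the underscore spellings are an assumption); `summarize_logs` and the sample entries are made up for illustration and are not real API output.

```python
from collections import Counter
from typing import Any


def summarize_logs(entries: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate scrape log entries into simple totals (hypothetical helper)."""
    status_counts = Counter(entry.get("status", "unknown") for entry in entries)
    # Skip entries without a latency value so they do not skew the average.
    latencies = [e["latency_ms"] for e in entries if e.get("latency_ms") is not None]
    return {
        "jobs": len(entries),
        "by_status": dict(status_counts),
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else None,
        "total_tokens": sum(e.get("tokens_used") or 0 for e in entries),
    }


# Invented sample entries, shaped like the fields printed in the listing example.
sample_entries = [
    {"status": "completed", "latency_ms": 4200, "tokens_used": 396},
    {"status": "completed", "latency_ms": 3000, "tokens_used": 204},
    {"status": "failed", "latency_ms": None, "tokens_used": 0},
]

log_summary = summarize_logs(sample_entries)
print(log_summary)
```

With real data, the list passed in would be the `data` array from the scrape logs response.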
## Putting it all together: a real pipeline

Here is a complete example that combines scraping, batch processing, and structured output into a pipeline that collects job listings from multiple pages and saves them to a JSONL file:

```python
import requests
import time
import json

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

JOB_SCHEMA = {
    "type": "object",
    "required": ["title", "company", "location"],
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": ["string", "null"]},
        "remote": {"type": ["boolean", "null"]},
        "salary_min": {"type": ["number", "null"]},
        "salary_max": {"type": ["number", "null"]},
        "employment_type": {
            "type": ["string", "null"],
            "enum": ["full_time", "part_time", "contract", None],
        },
    },
}


def collect_job_urls(board_url):
    """Use forEach to collect job listing URLs from a board page."""
    response = requests.post(
        f"{BASE_URL}/scrape",
        headers=HEADERS,
        json={
            "urls": [
                {
                    "url": board_url,
                    "actions": [
                        {"type": "click", "value": "Accept cookies"},
                        {
                            "type": "forEach",
                            "value": "Find all job listing links",
                            "mode": "navigate",
                            "maxItems": 50,
                            "itemPrompt": "Extract job title, company, location, remote status, salary range, and employment type",
                            "pagination": {"nextSelector": "a.next-page", "maxPages": 3},
                        },
                    ],
                }
            ],
            "output": "json",
            "schema": JOB_SCHEMA,
        },
    )
    job_id = response.json()["jobId"]

    while True:
        status = requests.get(f"{BASE_URL}/scrape/{job_id}", headers=HEADERS).json()
        if status["status"] == "completed":
            return status["result"]["content"]
        elif status["status"] == "failed":
            raise Exception(status["error"])
        time.sleep(3)


# Collect from multiple job boards
boards = [
    "https://jobs.example.com/engineering",
    "https://careers.anothersite.com/remote",
]

all_jobs = []
for board in boards:
    print(f"Collecting from {board}...")
    jobs = collect_job_urls(board)
    if isinstance(jobs, list):
        all_jobs.extend(jobs)
    print(f"  Got {len(jobs) if isinstance(jobs, list) else 0} jobs")

# Save results
with open("jobs.jsonl", "w") as f:
    for job in all_jobs:
        f.write(json.dumps(job) + "\n")

print(f"\nTotal: {len(all_jobs)} jobs saved to jobs.jsonl")
```

## Error handling

Wrap your API calls properly and handle the cases that actually come up in production.

```python
import requests

# BASE_URL and HEADERS are defined as in the earlier examples.


def safe_scrape(url, prompt):
    try:
        response = requests.post(
            f"{BASE_URL}/scrape",
            headers=HEADERS,
            json={"urls": [{"url": url}], "prompt": prompt, "output": "json"},
        )
        if response.status_code == 401:
            raise Exception("Invalid API key. Check your x-api-key header.")
        if response.status_code == 403:
            raise Exception("Credits exhausted or plan limit reached.")
        if response.status_code == 429:
            raise Exception("Rate limit hit. Wait before retrying.")
        response.raise_for_status()
        return response.json()["jobId"]
    except requests.exceptions.ConnectionError:
        raise Exception("Could not connect to the Spidra API.")
```
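The status code table says to back off and retry on a `429`. One way to do that is exponential backoff, sketched below under stated assumptions: `call_with_backoff` is a hypothetical helper, the retry count and delays are arbitrary choices, and nothing here reflects documented Spidra rate-limit behavior. The helper takes any zero-argument callable that returns an object with a `status_code`, so it can be exercised without network access.

```python
import time
from typing import Any, Callable


def call_with_backoff(
    send: Callable[[], Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: frozenset = frozenset({429}),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `send` until its status code is not in `retry_on` or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in retry_on:
            return response
        if attempt < max_attempts - 1:
            # Double the wait after each rate-limited attempt: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt)
    return response  # still rate limited after the final attempt


class FakeResponse:
    """Stand-in for a real HTTP response, used only for this demonstration."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


# Two rate-limited replies followed by an accepted job; delays are recorded, not slept.
replies = iter([FakeResponse(429), FakeResponse(429), FakeResponse(202)])
delays: list[float] = []
final = call_with_backoff(lambda: next(replies), sleep=delays.append)
print(final.status_code, delays)  # prints 202 [1.0, 2.0]
```

With `requests`, the callable could be something like `lambda: requests.post(f"{BASE_URL}/scrape", headers=HEADERS, json=payload)`. Retrying a job submission on other status codes may queue duplicate jobs, so only `429` is retried by default.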
For polling loops, always handle the `failed` status and check `ai_extraction_failed` in the result:

```python
if status["status"] == "completed":
    result = status["result"]
    if result.get("ai_extraction_failed"):
        # AI extraction failed, content is raw Markdown fallback
        print("AI extraction failed, using raw content")
        content = result["data"][0]["markdownContent"]
    else:
        content = result["content"]
```

## API reference summary

| Method | Endpoint | Purpose |
|---|---|---|
| POST | `/api/scrape` | Submit a scrape job (1 to 3 URLs) |
| GET | `/api/scrape/{jobId}` | Poll for job status and results |
| POST | `/api/batch/scrape` | Submit a batch job (up to 50 URLs) |
| GET | `/api/batch/scrape/{batchId}` | Poll batch status and per-item results |
| GET | `/api/batch/scrape` | List all your batch jobs |
| DELETE | `/api/batch/scrape/{batchId}` | Cancel a batch and refund unused credits |
| POST | `/api/batch/scrape/{batchId}/retry` | Retry only the failed items in a batch |
| POST | `/api/crawl` | Submit a crawl job |
| GET | `/api/crawl/{jobId}` | Poll crawl status |
| GET | `/api/crawl/{jobId}/pages` | Get all crawled pages with extracted data |
| POST | `/api/crawl/{jobId}/extract` | Re-extract from crawled pages without re-crawling |
| GET | `/api/crawl/{jobId}/download` | Download crawl results as ZIP |
| GET | `/api/crawl/history` | List your past crawl jobs |
| GET | `/api/scrape-logs` | List recent scrape logs |
| GET | `/api/scrape-logs/{id}` | Get full details of a single log |
| GET | `/api/account/usage` | Get usage statistics |

## What next

You now have a working understanding of every part of the Spidra API. Here are the natural next steps depending on what you are building:

If you want to go deeper on browser actions and `forEach`, read the [Browser Actions Guide](https://docs.spidra.io/features/actions) in the docs. It covers every option for each action type with real examples.

If you are building something that needs guaranteed output shapes, read the [Structured Output Guide](https://docs.spidra.io/features/structured-output) for full details on schemas, nullable fields, Zod and Pydantic integration, and schema limits.

If you are using an SDK in a specific language, each one has its own guide: [Node.js](https://docs.spidra.io/sdks/node), [Python](https://docs.spidra.io/sdks/python), [Go](https://docs.spidra.io/sdks/go), [PHP](https://docs.spidra.io/sdks/php), [Ruby](https://docs.spidra.io/sdks/ruby), [Rust](https://docs.spidra.io/sdks/rust), [.NET](https://docs.spidra.io/sdks/dotnet), [Elixir](https://docs.spidra.io/sdks/elixir), [Java](https://docs.spidra.io/sdks/java), and [Swift](https://docs.spidra.io/sdks/swift).

Get your API key at [app.spidra.io](https://app.spidra.io/). The free plan has 300 credits and no card required.