Spidra API Node.js tutorial: scrape any website with JavaScript and TypeScript

Spidra released a Node.js SDK that enables developers to scrape any website using plain English descriptions instead of CSS selectors or browser automation tools. The SDK handles browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction on Spidra's infrastructure, allowing developers to extract structured data from JavaScript-heavy and protected websites with a single API call. The TypeScript-native package supports scraping individual pages, batch processing up to 50 URLs, and crawling entire websites without additional configuration.

Web scraping in Node.js has a familiar progression. You start with axios or node-fetch for static pages. Then a modern site returns an empty HTML shell and you reach for Puppeteer. Then Cloudflare blocks you and you spend an evening on stealth plugins. Then the page structure changes and your selectors are worthless again.

Spidra's [Node.js SDK](https://docs.spidra.io/sdks/node) (`spidra-js`) cuts across all of that. You describe what you want from a page in plain English, and the SDK returns structured data. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all run on Spidra's infrastructure. Your code just handles the result.

This tutorial covers the full SDK, from installation through crawling an entire website. The SDK is TypeScript-native, so you get complete type safety out of the box. Every example works as-is with no additional configuration.

## Prerequisites

- Node.js 18 or higher

## Installation

```bash
npm install spidra-js
```

The package includes TypeScript types. You do not need a separate `@types/spidra-js` package.

Store your API key as an environment variable. Never hardcode it in source files.

```bash
export SPIDRA_API_KEY="spd_YOUR_API_KEY"
```

## Setting up the client

Import `SpidraClient` and initialise it with your API key.

TypeScript / ESM:

```js
import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })
```

CommonJS:

```js
const { SpidraClient } = require('spidra-js')

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })
```

The client exposes five namespaces:

| Namespace | What it handles |
|---|---|
| `spidra.scrape` | Scraping one to three URLs with browser automation and AI extraction |
| `spidra.batch` | Processing up to 50 URLs in parallel |
| `spidra.crawl` | Discovering and scraping pages across an entire website |
| `spidra.logs` | History of every scrape your API key has made |
| `spidra.usage` | Credit and request consumption statistics |

Every method is async and returns a `Promise`.
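
Because every method returns a `Promise`, the calls also fit codebases that have no top-level `await`, such as CommonJS scripts. The sketch below shows one way to structure an async entry point. It uses a hypothetical `fakeScrape` stand-in rather than the real SDK so that it runs without an API key; only the wrapping pattern is the point.

```typescript
// Hypothetical stand-in for an SDK call such as spidra.scrape.run().
// It only mimics the shape "async function that resolves to a result".
async function fakeScrape(url: string): Promise<{ content: string }> {
  return { content: `rendered content of ${url}` }
}

// Async entry point: everything that needs `await` lives inside it.
async function main(): Promise<string> {
  const job = await fakeScrape('https://example.com')
  return job.content
}

// Start the program and surface any rejection instead of ignoring it.
const done = main()
done
  .then((content) => console.log(content))
  .catch((err) => console.error(err))
```

In a real project you would replace `fakeScrape` with the SDK call and keep the `main()` wrapper unchanged.
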
The examples below use top-level `await` for clarity. If your project does not support top-level `await`, wrap the calls in an async function.

## Scraping a page

### Your first scrape

```js
import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

const job = await spidra.scrape.run({
  urls: [{ url: 'https://news.ycombinator.com' }],
})

console.log(job.result.content)
```

Without a `prompt`, Spidra loads the page in a real browser, executes all JavaScript, and returns the full rendered content as Markdown. That is what ends up in `job.result.content`.

### How the job lifecycle works

`run()` submits the job and polls in the background until it completes. From your side it looks like a single `await`. Under the hood, the job moves through these states:

`waiting` → `active` → `completed` (or `failed`)

If you want to submit a job and check on it yourself rather than waiting, use `submit()` and `get()` separately:

```js
// Submit and get a job ID immediately
const queued = await spidra.scrape.submit({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})
console.log(`Job submitted: ${queued.jobId}`)

// Check later
await new Promise((r) => setTimeout(r, 5000))
const status = await spidra.scrape.get(queued.jobId)

if (status.status === 'completed') {
  console.log(status.result.content)
} else if (status.status === 'failed') {
  console.error(`Failed: ${status.error}`)
}
```

## Extracting data with prompts

Add a `prompt` and Spidra uses AI to extract exactly what you described from the rendered page. You do not need to know the page structure or write any selectors.

```js
const job = await spidra.scrape.run({
  urls: [{ url: 'https://news.ycombinator.com' }],
  prompt: 'Extract the top 10 post titles and their point scores',
  output: 'json',
})

console.log(job.result.content)
// [{ "title": "Show HN: I built a thing", "points": 342 }, ...]
```

Setting `output: 'json'` tells the AI to return structured JSON. The default is `'markdown'`.

The AI understands context.
It knows a number next to a currency symbol is a price, a short bold line at the top of a product page is probably the title, and a longer block of text is likely a description. You describe the result you want and it finds it on the page.

That said, the SDK also fully supports CSS selectors and XPath for browser interactions when you want to be precise. We will cover that in the browser actions section.

## Enforcing output shape with JSON schema

Plain prompts are flexible but not predictable. The AI decides what fields to return and what to call them. That works for exploration but causes problems in production, when a database or another service expects a consistent shape every single time.

The `schema` field solves this. Pass a JSON Schema object and the AI must match it exactly. Fields in `required` always appear in the output, as `null` if the page does not have that value.

```js
const job = await spidra.scrape.run({
  urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
  prompt: 'Extract the job listing details. Normalize salary to a USD number.',
  output: 'json',
  schema: {
    type: 'object',
    required: ['title', 'company', 'remote'],
    properties: {
      title: { type: 'string' },
      company: { type: 'string' },
      remote: { type: ['boolean', 'null'] },
      salary_min: { type: ['number', 'null'] },
      salary_max: { type: ['number', 'null'] },
      employment_type: {
        type: ['string', 'null'],
        enum: ['full_time', 'part_time', 'contract', null],
      },
      skills: { type: 'array', items: { type: 'string' } },
    },
  },
})

console.log(job.result.content)
// {
//   title: "Senior Software Engineer",
//   company: "Acme Corp",
//   remote: true,
//   salary_min: 120000,
//   salary_max: 160000,
//   employment_type: "full_time",
//   skills: ["TypeScript", "PostgreSQL", "AWS"]
// }
```

Since the SDK is TypeScript-native, you can type the result directly:

```ts
interface JobListing {
  title: string
  company: string
  remote: boolean | null
  salary_min: number | null
  salary_max: number | null
  employment_type: 'full_time' | 'part_time' | 'contract' | null
  skills: string[]
}

const content = job.result.content as JobListing
console.log(`${content.title} at ${content.company}`)
```

If you use [Zod](https://www.npmjs.com/package/zod) for runtime validation, generate the schema from your existing Zod type and pass it directly:

```js
import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'

const JobListingSchema = z.object({
  title: z.string(),
  company: z.string(),
  remote: z.boolean().nullable(),
  salary_min: z.number().nullable(),
  salary_max: z.number().nullable(),
  employment_type: z.enum(['full_time', 'part_time', 'contract']).nullable(),
  skills: z.array(z.string()),
})

const job = await spidra.scrape.run({
  urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
  prompt: 'Extract the job listing details',
  schema: zodToJsonSchema(JobListingSchema),
})

const listing = JobListingSchema.parse(job.result.content)
```

One schema definition in your codebase handles both runtime validation and scraping output shape.
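
If you would rather not add Zod, keep in mind that a plain `as JobListing` cast is a compile-time assertion only and checks nothing at runtime. A small hand-written type guard is one alternative. The sketch below is illustrative: `isJobListing` and `JobListingCore` are hypothetical names, not part of the SDK, and the guard checks only a trimmed-down subset of the fields used above.

```typescript
// Trimmed-down version of the listing shape, for illustration only.
interface JobListingCore {
  title: string
  company: string
  remote: boolean | null
  skills: string[]
}

// Hypothetical helper (not part of spidra-js): narrows an unknown value
// to JobListingCore by checking each field at runtime.
function isJobListing(value: unknown): value is JobListingCore {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  const skills = v.skills
  return (
    typeof v.title === 'string' &&
    typeof v.company === 'string' &&
    (typeof v.remote === 'boolean' || v.remote === null) &&
    Array.isArray(skills) &&
    skills.every((s) => typeof s === 'string')
  )
}

// Example input shaped like a scrape result; in real code this would be
// the value you read from job.result.content.
const candidate: unknown = {
  title: 'Senior Software Engineer',
  company: 'Acme Corp',
  remote: null,
  skills: ['TypeScript', 'PostgreSQL'],
}

if (isJobListing(candidate)) {
  // Inside this branch TypeScript treats candidate as JobListingCore.
  console.log(`${candidate.title} at ${candidate.company}`)
} else {
  console.warn('Result did not match the expected shape')
}
```

The guard trades Zod's generated error messages for zero dependencies, which may be enough for a small script.
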
## Browser actions

Some pages require interaction before the content you want is visible: a cookie banner blocking everything, a search form that needs filling, lazy-loaded content that only appears after scrolling, or tabs that hide data by default.

Pass an `actions` array inside the URL object and [those actions](https://docs.spidra.io/features/actions) execute in order inside a real browser before extraction runs.

```js
const job = await spidra.scrape.run({
  urls: [
    {
      url: 'https://example.com/products',
      actions: [
        { type: 'click', selector: '#accept-cookies' },
        { type: 'wait', duration: 1000 },
        { type: 'scroll', to: '80%' },
      ],
    },
  ],
  prompt: 'Extract all product names and prices visible on the page',
})
```

For `click`, `check`, and `uncheck` actions, you have two options for targeting an element:

- `selector` for a CSS selector or XPath expression like `'#accept-cookies'` or `'.submit-btn'`
- `value` for a plain English description like `'Accept cookies button'`, which Spidra locates using AI

Both are valid, and you can mix them in the same actions array:

```js
actions: [
  { type: 'click', selector: '#accept-cookies' }, // CSS selector
  { type: 'click', value: 'Search button' }, // plain English
]
```

Use whichever is more convenient. If the element has a clean, stable ID or class, use `selector`. If the page is complex or you want the action to survive layout changes, use `value`.

### All available actions

| Action | What it does | Key fields |
|---|---|---|
| `click` | Clicks a button, link, or any element | `selector` or `value` |
| `type` | Types text into an input field | `selector`, `value` |
| `check` | Checks a checkbox | `selector` or `value` |
| `uncheck` | Unchecks a checkbox | `selector` or `value` |
| `wait` | Pauses for a number of milliseconds | `duration` |
| `scroll` | Scrolls to a percentage of the page height | `to` (e.g. `'80%'`) |
| `forEach` | Finds matching elements and processes each one | `value`, `mode` |

### The forEach action

[`forEach`](https://docs.spidra.io/features/actions#foreach-process-every-element-on-a-page) is the most powerful action in the SDK.
It finds a set of matching elements on the page and processes each one individually, combining all the results into a single output.

Three modes:

- `inline` reads the content of each matched element directly. For product cards, table rows, or content that lives inside the element itself.
- `navigate` follows each element as a link, loads the destination page, and scrapes it. For detail pages you need to click into.
- `click` clicks each element to expand or reveal content, then scrapes what appears. For accordions, modals, or expandable sections.

```js
const job = await spidra.scrape.run({
  urls: [
    {
      url: 'https://directory.example.com/companies',
      actions: [
        { type: 'click', value: 'Accept cookies' },
        {
          type: 'forEach',
          value: 'Find all company listing cards',
          mode: 'navigate',
          maxItems: 20,
          itemPrompt: 'Extract company name, website, and industry',
          pagination: {
            nextSelector: 'a.next-page',
            maxPages: 3,
          },
        },
      ],
    },
  ],
  output: 'json',
})
```

This dismisses the cookie banner, finds every company card on the page, navigates into each company profile, extracts the company details, and repeats across three pages of pagination. One request, one `await`.

## Proxy and geo-targeting

Some sites block cloud infrastructure IP ranges or serve different content based on location. Set `useProxy: true` to route through a [residential proxy](https://docs.spidra.io/features/stealth-mode).

```js
const job = await spidra.scrape.run({
  urls: [{ url: 'https://www.amazon.de/gp/bestsellers' }],
  prompt: 'List the top 10 products with name and price',
  useProxy: true,
  proxyCountry: 'de',
})
```

`proxyCountry` accepts:

- A two-letter ISO country code like `'us'`, `'de'`, `'gb'`, `'fr'`, `'jp'`
- `'eu'` to rotate randomly across all 27 EU member states
- `'global'`, or omit it, for no country preference

Proxy usage is billed from your bandwidth quota, not your credits.

## Scraping pages behind a login

Pass session cookies to access authenticated content.
Log in through your browser, open DevTools, [copy the Cookie header from any authenticated request](https://docs.spidra.io/features/authenticated-scraping), and pass it as a string.

```js
const job = await spidra.scrape.run({
  urls: [{ url: 'https://app.example.com/dashboard' }],
  prompt: 'Extract the monthly revenue and active user count',
  cookies: 'session=abc123; auth_token=xyz789',
})
```

Standard cookie format (`name=value; name2=value2`) and Chrome DevTools paste format both work.

## Stripping boilerplate

`extractContentOnly` strips navigation, headers, footers, and sidebars before extraction runs. It is useful for articles, documentation pages, and any page where the main content is surrounded by heavy navigation.

```js
const job = await spidra.scrape.run({
  urls: [{ url: 'https://blog.example.com/long-article' }],
  prompt: 'Summarize this article in three sentences',
  extractContentOnly: true,
})
```

## Screenshots

Capture screenshots of pages for debugging, monitoring, or archival.

```js
const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  screenshot: true,
  fullPageScreenshot: true,
})

console.log(job.result.screenshots) // array of URLs
```

`screenshot: true` captures the visible viewport. `fullPageScreenshot: true` captures the entire scrollable page.

## Batch scraping

When you have a list of URLs, the [batch endpoint](https://docs.spidra.io/features/batch-scraping) processes up to 50 at a time in parallel. Each URL runs in its own independent worker.
```js
const batch = await spidra.batch.run({
  urls: [
    'https://shop.example.com/product/1',
    'https://shop.example.com/product/2',
    'https://shop.example.com/product/3',
  ],
  prompt: 'Extract the product name, price, and whether it is in stock',
  output: 'json',
})

console.log(`${batch.completedCount}/${batch.totalUrls} completed`)

for (const item of batch.items) {
  if (item.status === 'completed') {
    console.log(item.url, item.result)
  } else {
    console.error(`Failed: ${item.url} - ${item.error}`)
  }
}
```

### Processing large URL lists

The batch endpoint caps at 50 URLs per request. For larger lists, chunk them:

```ts
async function scrapeAll(urls: string[], prompt: string) {
  const results: Array<{ url: string; data: unknown }> = []
  const chunkSize = 50

  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize)
    const batchNum = Math.floor(i / chunkSize) + 1
    const totalBatches = Math.ceil(urls.length / chunkSize)
    console.log(`Processing batch ${batchNum} of ${totalBatches}...`)

    const batch = await spidra.batch.run({
      urls: chunk,
      prompt,
      output: 'json',
    })

    for (const item of batch.items) {
      if (item.status === 'completed') {
        results.push({ url: item.url, data: item.result })
      } else {
        console.warn(`Failed: ${item.url}`)
      }
    }
  }

  return results
}

const urls = Array.from({ length: 200 }, (_, i) => `https://example.com/product/${i + 1}`)
const results = await scrapeAll(urls, 'Extract product name and price')
```

### Managing batches

Retry failed items without resubmitting the ones that already succeeded:

```js
if (batch.failedCount > 0) {
  await spidra.batch.retry(batch.batchId)
}
```

Cancel a running batch and get credits refunded for items that have not started yet:

```js
const response = await spidra.batch.cancel(batchId)
console.log(`Cancelled ${response.cancelledItems} items, refunded ${response.creditsRefunded} credits`)
```

## Crawling entire websites

Batch scraping works when you already know the URLs. Crawling is for when you want Spidra to discover them for you.
Give it a starting URL, describe which links to follow, and describe what to extract from each page. Spidra loads the base URL, finds matching links, visits each one up to your `maxPages` limit, and applies your `transformInstruction` to every page it visits.

```js
import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

const job = await spidra.crawl.run({
  baseUrl: 'https://competitor.com/blog',
  crawlInstruction: 'Follow links to blog posts only. Skip tag pages, category pages, and the homepage.',
  transformInstruction: 'Extract the post title, author name, publish date, and a one-sentence summary.',
  maxPages: 20,
  useProxy: true,
})

for (const page of job.result) {
  console.log(page.url, page.data)
}
```

Three fields are required: `baseUrl`, `crawlInstruction`, and `transformInstruction`. `maxPages` defaults to 5 and can be set up to 20.

For larger crawls that take more time, the default 120-second timeout may not be enough. If you are hitting timeouts, fire the crawl with `submit()` and poll with `get()` yourself:

```js
const queued = await spidra.crawl.submit({
  baseUrl: 'https://docs.example.com',
  crawlInstruction: 'Follow all documentation pages. Skip changelog and login pages.',
  transformInstruction: 'Extract the page title and full body text.',
  maxPages: 20,
})

// Poll every 10 seconds until the job reaches a terminal state
let status = await spidra.crawl.get(queued.jobId)
while (status.status !== 'completed' && status.status !== 'failed') {
  await new Promise((r) => setTimeout(r, 10000))
  status = await spidra.crawl.get(queued.jobId)
  console.log(`Status: ${status.status}`)
}

for (const page of status.result ?? []) {
  console.log(page.url, page.data)
}
```

### Re-extracting with a different prompt

If you crawled a site and want to pull out different information, use `extract()` to run a new AI pass over the already-crawled content without making new browser requests:

```js
const queued = await spidra.crawl.extract(
  completedJobId,
  'Extract only product SKUs and prices as structured JSON',
)
const result = await spidra.crawl.get(queued.jobId)
```

## Using the SDK in different environments

### Next.js API route

```js
// app/api/scrape/route.ts
import { SpidraClient } from 'spidra-js'
import { NextResponse } from 'next/server'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

export async function POST(request: Request) {
  const { url, prompt } = await request.json()

  try {
    const job = await spidra.scrape.run({
      urls: [{ url }],
      prompt,
      output: 'json',
    })
    return NextResponse.json({ data: job.result.content })
  } catch (error) {
    return NextResponse.json({ error: 'Scrape failed' }, { status: 500 })
  }
}
```

### Express

```js
import express from 'express'
import { SpidraClient } from 'spidra-js'

const app = express()
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

app.use(express.json())

app.post('/scrape', async (req, res) => {
  const { url, prompt } = req.body

  try {
    const job = await spidra.scrape.run({
      urls: [{ url }],
      prompt,
      output: 'json',
    })
    res.json({ data: job.result.content })
  } catch (err) {
    res.status(500).json({ error: 'Scrape failed' })
  }
})

app.listen(3000)
```

### Bun

The SDK works with Bun out of the box. No changes needed.

```bash
bun add spidra-js
```

```js
import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: Bun.env.SPIDRA_API_KEY })

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

console.log(job.result.content)
```

## Error handling

Every API error maps to a typed exception class. Catch exactly what you care about and let everything else propagate.
```js
import {
  SpidraError,
  SpidraAuthenticationError,
  SpidraInsufficientCreditsError,
  SpidraRateLimitError,
  SpidraServerError,
} from 'spidra-js'

try {
  const job = await spidra.scrape.run({
    urls: [{ url: 'https://example.com' }],
    prompt: 'Extract the main headline',
  })
  console.log(job.result.content)
} catch (err) {
  if (err instanceof SpidraAuthenticationError) {
    console.error('API key is missing or invalid. Check your SPIDRA_API_KEY.')
  } else if (err instanceof SpidraInsufficientCreditsError) {
    console.error('Account is out of credits. Top up at app.spidra.io.')
  } else if (err instanceof SpidraRateLimitError) {
    console.warn('Rate limit hit. Slow down and retry.')
  } else if (err instanceof SpidraServerError) {
    console.error(`Server error (${err.status}): ${err.message}. Retry is usually safe.`)
  } else if (err instanceof SpidraError) {
    console.error(`API error ${err.status}: ${err.message}`)
  } else {
    throw err
  }
}
```

| Exception | HTTP status | When it fires |
|---|---|---|
| `SpidraAuthenticationError` | 401 | API key missing or invalid |
| `SpidraInsufficientCreditsError` | 403 | No credits remaining |
| `SpidraRateLimitError` | 429 | Too many requests |
| `SpidraServerError` | 500 | Unexpected error on Spidra's side |
| `SpidraError` | any | Base class for all exceptions |

Also check the `ai_extraction_failed` flag in the result.
If AI extraction fails for any reason, Spidra falls back to raw Markdown and sets this flag:

```js
const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

if (job.result.ai_extraction_failed) {
  // Raw Markdown fallback is in the data array
  const raw = job.result.data[0]?.markdownContent
  console.warn('AI extraction failed, using raw content')
} else {
  console.log(job.result.content)
}
```

## Putting it all together: a complete pipeline

A full example that uses `forEach` with pagination to collect job listings from a directory, enforces a schema on the output, handles errors, and saves results to a JSONL file:

```ts
import { SpidraClient, SpidraError, SpidraInsufficientCreditsError } from 'spidra-js'
import { writeFileSync } from 'fs'
import * as os from 'os'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

const JOB_SCHEMA = {
  type: 'object',
  required: ['title', 'company', 'location'],
  properties: {
    title: { type: 'string' },
    company: { type: 'string' },
    location: { type: ['string', 'null'] },
    remote: { type: ['boolean', 'null'] },
    salary_min: { type: ['number', 'null'] },
    salary_max: { type: ['number', 'null'] },
    employment_type: {
      type: ['string', 'null'],
      enum: ['full_time', 'part_time', 'contract', null],
    },
  },
}

async function collectListings(boardUrl: string) {
  try {
    const job = await spidra.scrape.run({
      urls: [
        {
          url: boardUrl,
          actions: [
            { type: 'click', value: 'Accept cookies' },
            {
              type: 'forEach',
              value: 'Find all job listing cards',
              mode: 'navigate',
              maxItems: 50,
              itemPrompt:
                'Extract job title, company, location, remote status, salary range, and employment type',
              pagination: {
                nextSelector: 'a.next-page',
                maxPages: 3,
              },
            },
          ],
        },
      ],
      output: 'json',
      schema: JOB_SCHEMA,
    })

    if (job.result.ai_extraction_failed) {
      console.warn(`AI extraction failed for ${boardUrl}`)
      return []
    }

    // Normalise to an array so callers can always spread the result
    const content = job.result.content
    return Array.isArray(content) ? content : [content]
  } catch (err) {
    if (err instanceof SpidraInsufficientCreditsError) {
      throw err // bubble up and stop processing
    }
    if (err instanceof SpidraError) {
      console.error(`Error scraping ${boardUrl}: ${err.message}`)
      return []
    }
    throw err
  }
}

const boards = [
  'https://jobs.example.com/engineering',
  'https://careers.anothersite.com/remote',
]

const allJobs: unknown[] = []

for (const board of boards) {
  console.log(`Collecting from ${board}...`)
  const listings = await collectListings(board)
  allJobs.push(...listings)
  console.log(`Got ${listings.length} listings`)
}

const jsonl = allJobs.map((job) => JSON.stringify(job)).join(os.EOL)
writeFileSync('jobs.jsonl', jsonl)
console.log(`\nDone. ${allJobs.length} jobs saved to jobs.jsonl`)
```

## All scrape options

| Option | Type | Description |
|---|---|---|
| `urls` | array | Up to 3 URL objects. Each takes a `url` and optional `actions`. |
| `prompt` | string | What to extract, in plain English |
| `output` | string | `'markdown'` (default) or `'json'` |
| `schema` | object | JSON Schema for a guaranteed output shape |
| `useProxy` | boolean | Route through a residential proxy |
| `proxyCountry` | string | Two-letter country code or `'eu'` / `'global'` |
| `extractContentOnly` | boolean | Strip nav, ads, and boilerplate before extraction |
| `screenshot` | boolean | Capture a viewport screenshot |
| `fullPageScreenshot` | boolean | Capture a full-page screenshot |
| `cookies` | string | Raw Cookie header string for authenticated pages |

## What to read next

- [Browser actions guide](https://docs.spidra.io/features/actions) covers every option for each action type, including all `forEach` parameters
- [Structured output guide](https://docs.spidra.io/features/structured-output) covers schemas in depth, including Zod integration and schema limits
- [Stealth mode guide](https://docs.spidra.io/features/stealth-mode) has the full country list and proxy options
- [Python SDK tutorial](https://claude.ai/blog/spidra-api-python-tutorial) if you are working in Python
- [Full API reference](https://claude.ai/blog/spidra-api-tutorial) if you want to use the REST API directly

Get your API key at [app.spidra.io](https://app.spidra.io/). The free plan has 300 credits and no card required.