Source: dev.to · Topic: ai-products

Turn any company website into structured B2B data (one API call)

A developer built an API that turns any company website into structured B2B data in a single call. The API reads live site content, never guesses missing fields, and returns clean JSON with company name, sector, description, social links, contact email, and tech stack. It uses a two-pass tech detection system and strict schema validation to ensure reliability.

2 min read · Published Jun 13, 2026

Build your own scraper. Brittle, and every site is different. You'll spend more time maintaining selectors than using the data.

Pay a heavyweight data provider. Expensive, and the data is often a stale snapshot from months ago.

Paste HTML into an LLM and pray. Sometimes you get valid JSON. Sometimes you get a hallucinated CEO email that doesn't exist.

I kept hitting this wall while working with lists of company domains, so I built a small API that does one thing well: send a company URL, get back clean JSON.


The two rules that shaped it

1. It reads the live site at request time. Not a database snapshot from last quarter. If a company rebranded yesterday, you get today's version.

2. It never guesses. This was the hardest constraint to enforce with an LLM in the pipeline. Missing fields come back as null, never invented. If there's no contact email on the site, you get "email": null, not a plausible-looking fake you'd import straight into your CRM.
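
One way to picture the "never guess" rule is a grounding check: a value is kept only if it literally appears in the fetched page text, and anything else becomes null. The sketch below is an illustration under that assumption, not the API's actual code; the helper names `extract_email` and `grounded` and the email pattern are hypothetical.

```python
import re
from typing import Optional

# Deliberately simple email pattern for illustration; real-world matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_email(page_text: str) -> Optional[str]:
    """Return the first email found in the page text, or None if there is none."""
    match = EMAIL_RE.search(page_text)
    return match.group(0) if match else None


def grounded(candidate: Optional[str], page_text: str) -> Optional[str]:
    """Keep a model-proposed value only if it literally occurs in the page text."""
    if candidate and candidate.casefold() in page_text.casefold():
        return candidate
    return None  # not on the site, so the field stays null


page = "Questions? Write to hello@example.com and we will reply."
print(extract_email(page))                # hello@example.com
print(grounded("ceo@example.com", page))  # None
```

The second call shows the point of the rule: a plausible address that is not on the page is dropped instead of being passed downstream.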


What a call looks like

And the response:


How it works under the hood

A few design decisions, for the curious:

Two-pass tech detection. A fast pattern-matching pass first (think Wappalyzer-style fingerprints), then an LLM enrichment pass only for what patterns can't catch. Cheaper and faster than going full-LLM on everything.

Hard content trimming before the LLM. Page content is capped before any model call. This keeps latency and cost predictable instead of exploding on heavy JS-rendered sites.

Caching with a 14-day TTL. Repeat lookups on the same domain return in ~200 ms instead of re-scraping. The cached field in the response tells you which path you hit.

Strict schema validation. Every response is validated against a strict schema (Pydantic v2) before it leaves the API. Either the JSON conforms, or you get a proper error — never half-broken output.
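
The four decisions above can be sketched together in a few dozen lines. This is a simplified illustration, not the service's code: the fingerprints, the `MAX_CHARS` cap, the in-memory cache, and the field names `domain`, `tech_stack`, and `cached` are all assumptions, and a hand-written standard-library check stands in for the Pydantic v2 schema so the example has no dependencies.

```python
import re
import time
from typing import Callable, Optional

MAX_CHARS = 20_000                   # assumed cap; the article gives no number
CACHE_TTL_SECONDS = 14 * 24 * 3600   # the 14-day TTL described above

# Hypothetical fingerprints: technology name -> pattern searched in raw HTML.
FINGERPRINTS = {
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "WordPress": re.compile(r"/wp-content/", re.I),
}


def detect_tech(html: str, llm_pass: Optional[Callable] = None) -> list:
    """Pass 1: cheap regex fingerprints. Pass 2: optional model call on trimmed content."""
    found = [name for name, rx in FINGERPRINTS.items() if rx.search(html)]
    if llm_pass is not None:
        # The model only ever sees capped content, plus what pass 1 already found.
        for name in llm_pass(html[:MAX_CHARS], list(found)):
            if name not in found:
                found.append(name)
    return found


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl, self._clock, self._store = ttl, clock, {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: force a fresh scrape
            return None
        return value

    def put(self, key: str, value: dict) -> None:
        self._store[key] = (self._clock(), value)


def validate(record: dict) -> dict:
    """Accept only {domain: str, tech_stack: [str], cached: bool}; raise otherwise."""
    if set(record) != {"domain", "tech_stack", "cached"}:
        raise ValueError(f"unexpected or missing fields: {sorted(record)}")
    if not isinstance(record["domain"], str) or not isinstance(record["cached"], bool):
        raise ValueError("domain must be str and cached must be bool")
    stack = record["tech_stack"]
    if not isinstance(stack, list) or not all(isinstance(t, str) for t in stack):
        raise ValueError("tech_stack must be a list of strings")
    return record


def enrich(domain: str, fetch: Callable[[str], str], cache: TTLCache,
           llm_pass: Optional[Callable] = None) -> dict:
    """Serve from cache when possible; otherwise fetch, detect, store, and validate."""
    hit = cache.get(domain)
    if hit is not None:
        return validate({**hit, "cached": True})
    fresh = {"domain": domain, "tech_stack": detect_tech(fetch(domain), llm_pass)}
    cache.put(domain, fresh)
    return validate({**fresh, "cached": False})
```

Here `fetch` and `llm_pass` are injected callables, so the scraper and the model can be swapped or stubbed in tests. Validation runs on both the cached and the fresh path, so a malformed record raises an error instead of being returned.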


Use cases I built it for

Lead enrichment: turn a list of prospect domains into CRM-ready records.

Tech-based targeting: filter prospects by their stack ("show me companies running Shopify").

Data hygiene: verify and refresh company records against the live web instead of stale databases.
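
As a rough illustration of the tech-based targeting case, the snippet below filters already-enriched records by stack. The field names `domain` and `tech_stack` and the sample records are assumptions made for the example, not the API's documented response shape.

```python
from typing import Iterable, List


def companies_running(records: Iterable[dict], tech: str) -> List[dict]:
    """Return records whose tech list contains `tech` (case-insensitive)."""
    wanted = tech.casefold()
    return [
        record for record in records
        # A null tech list is treated as "nothing detected".
        if any(wanted == name.casefold() for name in (record.get("tech_stack") or []))
    ]


# Made-up sample records standing in for enrichment responses.
records = [
    {"domain": "shop.example", "tech_stack": ["Shopify", "Klaviyo"]},
    {"domain": "blog.example", "tech_stack": ["WordPress"]},
    {"domain": "unknown.example", "tech_stack": None},
]

shopify_leads = [r["domain"] for r in companies_running(records, "shopify")]
print(shopify_leads)  # ['shop.example']
```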


Try it

There's a free tier (100 requests/month), enough to test it against your own data:

👉 AI Live Company Enrichment & Tech Detector on RapidAPI

I'd genuinely love feedback from other builders on the positioning, the pricing, and especially: what field would you want it to extract next? Drop a comment below.
