# Schema.org Is Now the API Contract Your AI Agents Read

> Source: <https://blog.r-lopes.com/posts/2026-06-06-schema-org-is-now-the-api-contract-your-ai-agents-read>
> Published: 2026-06-06 14:00:00+00:00

## The Problem

Agentic shoppers, research bots, and answer engines are increasingly the *first* consumers of public web pages — they extract, summarize, and recombine content rather than rank URLs [Source 4](#source-4). Sites that rely on rendered DOM and prose for meaning force agents into HTML scraping or screenshot loops that burn thousands of tokens per page and guess at button semantics [Source 9](#source-9). Without a machine-readable contract, your product, article, or event pages are ambiguous input; with one, they are a typed API. Structured data adoption is already at 50% of home pages and JSON-LD dominates at 43% — the contract layer is being written around you whether you participate or not [Source 1](#source-1).

## The Shape

Render JSON-LD server-side in Next.js, typed against `schema-dts` and sanitized for XSS:

```tsx
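// app/products/[id]/page.tsx
import type { Product, WithContext } from 'schema-dts'
// Assumed project-local modules; adjust these paths to your codebase.
import { getProduct } from '@/lib/products'
import { ProductView } from '@/components/product-view'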

export default async function Page({ params }: { params: Promise<{ id: string }> }) {
  const { id } = await params
  const product = await getProduct(id)

  const jsonLd: WithContext<Product> = {
    '@context': 'https://schema.org',
    '@type': 'Product',
    name: product.name,
    image: product.image,
    description: product.description,
    sku: product.sku,
    brand: { '@type': 'Brand', name: product.brand },
    offers: {
      '@type': 'Offer',
      price: product.price.toFixed(2),
      priceCurrency: product.currency,
      availability: product.inStock
        ? 'https://schema.org/InStock'
        : 'https://schema.org/OutOfStock',
      url: `https://example.com/products/${id}`,
    },
    aggregateRating: product.ratingCount > 0 ? {
      '@type': 'AggregateRating',
      ratingValue: product.ratingValue,
      reviewCount: product.ratingCount,
    } : undefined,
  }

  return (
    <section>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{
          __html: JSON.stringify(jsonLd).replace(/</g, '\\u003c'),
        }}
      />
      <ProductView product={product} />
    </section>
  )
}
```

Validate the output in CI against the [Schema Markup Validator](https://validator.schema.org/) and Google's [Rich Results Test](https://search.google.com/test/rich-results) [Source 7](#source-7). The `\u003c` replacement is non-negotiable: `JSON.stringify` does not sanitize HTML, and a `</script>` in a product description ends the JSON-LD block and opens an XSS vector [Source 7](#source-7).
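To make the escaping step concrete, here is a minimal sketch that isolates it in a helper and feeds it a hostile description. The names `serializeJsonLd`, `hostile`, and `safe` are hypothetical and exist only for illustration; the replacement itself is the same one used in the page component above.

```typescript
// Hypothetical helper: serialize a JSON-LD object for inlining in a <script> tag.
type JsonLd = Record<string, unknown>

function serializeJsonLd(data: JsonLd): string {
  // JSON.stringify escapes for JSON, not for HTML. Escaping every "<" means
  // a "</script>" inside a string value can no longer close the block.
  return JSON.stringify(data).replace(/</g, '\\u003c')
}

// Illustrative input: a description that tries to break out of the script tag.
const hostile: JsonLd = {
  '@context': 'https://schema.org',
  '@type': 'Product',
  name: 'Widget',
  description: 'Nice</script><script>alert(1)</script>',
}

const safe = serializeJsonLd(hostile)

// The serialized text contains no raw "<", yet parses back to the same value.
console.log(safe.includes('<')) // false
console.log(JSON.parse(safe).description === hostile.description) // true
```

Because `\u003c` is a valid JSON escape for `<`, any consumer that parses the block sees the original string, so the contract is unchanged while the HTML parser never sees a closing tag.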

## How It Works

JSON-LD embedded in the initial HTML response is the cheapest contract you can offer an extractor. Google's own guidance treats it as the recommended structured-data form precisely because it sidesteps JavaScript hydration delays that LLM-based crawlers handle poorly [Source 1](#source-1). Crawlers like GPTBot can parse schema directly out of HTML, and the trend over the last three years is unambiguous: WebSite, Organization, and Product schemas keep climbing while microdata declines [Source 3](#source-3). Inner pages remain undercovered — JSON-LD sits at ~39% on desktop versus 43% on home pages — and that gap is where most teams leak ambiguity to agents [Source 1](#source-1).
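As a rough illustration of why server-rendered JSON-LD is cheap to consume, the sketch below pulls every JSON-LD block out of a raw HTML string without executing any JavaScript. The function name `extractJsonLd` and the sample markup are assumptions made for this example; real crawlers use proper HTML parsers, and a regular expression is used here only to keep the sketch dependency-free.

```typescript
// Hypothetical extractor: read JSON-LD straight from the initial HTML response.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  const blocks: unknown[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(html)) !== null) {
    const body = match[1]
    if (body === undefined) continue
    try {
      blocks.push(JSON.parse(body))
    } catch {
      // Malformed JSON-LD is skipped rather than guessed at.
    }
  }
  return blocks
}

// Illustrative page: the contract is present before any hydration happens.
const html =
  '<html><head><script type="application/ld+json">' +
  '{"@context":"https://schema.org","@type":"Product","name":"Widget"}' +
  '</script></head><body><div id="root"></div></body></html>'

const found = extractJsonLd(html) as Array<{ '@type': string; name: string }>
console.log(found.length, found[0]?.name) // 1 Widget
```

If the same block were injected by client-side code after hydration, the string passed to `extractJsonLd` would contain only the empty `#root` element and the function would return an empty array.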

The contract framing matters because schema-on-write systems give the *reader* a stable surface to plan against, the same lesson Netflix learned with NMDB: a validated schema acts as an API contract that decouples writers from the many applications consuming the data [Source 2](#source-2). Without it, every consumer reimplements schema-on-read parsing logic with its own quirks [Source 5](#source-5). For an LLM agent, "schema-on-read" means the model invents a structure during inference, which is exactly the imagination problem Anthropic's tool-design guidance warns against ("if your schema just says user ID is a string, the agent might pass `John`, or `user 123`, or literally anything") [Source 10](#source-10).
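The user-ID example in that quote can be sketched as two field contracts, one loose and one tight. Everything here is hypothetical: the `StringField` shape loosely mirrors JSON Schema's `type`, `description`, and `pattern` keywords, and the `usr_` plus eight digits format is an invented convention, not one taken from any cited source.

```typescript
// Hypothetical field contract, loosely modeled on JSON Schema keywords.
interface StringField {
  type: 'string'
  description: string
  pattern?: string
}

// Loose contract: "user ID is a string" and nothing more.
const looseUserId: StringField = { type: 'string', description: 'User ID' }

// Tight contract: the format is stated and enforceable (assumed format).
const strictUserId: StringField = {
  type: 'string',
  description: 'User ID: "usr_" followed by 8 digits, e.g. usr_00012345',
  pattern: '^usr_[0-9]{8}$',
}

// Returns true when the value satisfies the field contract.
function accepts(field: StringField, value: unknown): boolean {
  if (typeof value !== 'string') return false
  return field.pattern === undefined || new RegExp(field.pattern).test(value)
}

const guesses = ['John', 'user 123', 'usr_00012345']
console.log(guesses.map((g) => accepts(looseUserId, g)))  // every guess passes
console.log(guesses.map((g) => accepts(strictUserId, g))) // only the last passes
```

The loose contract cannot distinguish a valid identifier from a guess, so the burden shifts to whatever code runs after the call. The tight contract rejects the guess at the boundary and tells the agent, through the description, what a valid value looks like.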

WebMCP and similar emerging standards push this further: sites expose declarative tools whose schemas the agent calls directly, replacing thousands of vision tokens or DOM-parsing tokens with a single typed call [Source 9](#source-9). JSON-LD is the lowest-rung version of that same idea, a passive, indexable contract, and the structured-output APIs every major model now ships (OpenAI's guaranteed JSON [Source 6](#source-6), Anthropic's `output_config.format` [Source 12](#source-12), Pydantic AI [Source 11](#source-11), Outlines [Source 13](#source-13)) mean the consumer side is fully aligned with typed I/O. The agent expects typed inputs from your page and produces typed outputs from your tools. Untyped HTML in the middle is the only mismatched link.

```
   Page render               Indexed contract            Agent runtime
 ┌────────────┐    JSON-LD   ┌─────────────────┐  query  ┌──────────────┐
 │ Server     │ ───────────► │ Crawler /       │ ──────► │ LLM extractor│
 │ (RSC/SSR)  │  in initial  │ vector store /  │ typed   │ + tool call  │
 │            │     HTML     │ knowledge graph │  facts  │ (structured  │
 └────────────┘              └─────────────────┘ ◄────── │  output)     │
       ▲                            ▲                    └──────┬───────┘
       │ schema-dts types           │ schema.org vocab          │
       └─── compile-time check ─────┴─── runtime validation ────┘
```
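The "runtime validation" edge in the diagram could be approximated in a test suite with a small structural check like the one below. This is a sketch under stated assumptions: `checkProduct` is a hypothetical helper, and the fields it insists on (`name`, `offers.price`, `offers.priceCurrency`) are chosen for illustration rather than taken from any validator's rule set. It complements the hosted validators rather than replacing them.

```typescript
type JsonLdNode = Record<string, unknown>

// Hypothetical check: returns a list of problems, empty when the node passes.
function checkProduct(node: JsonLdNode): string[] {
  const problems: string[] = []
  if (node['@context'] !== 'https://schema.org') problems.push('@context must be https://schema.org')
  if (node['@type'] !== 'Product') problems.push('@type must be Product')

  const name = node['name']
  if (typeof name !== 'string' || name.length === 0) problems.push('name is missing')

  const offers = node['offers'] as JsonLdNode | undefined
  if (!offers) {
    problems.push('offers is missing')
  } else {
    // Assumed conventions: decimal string price, 3-letter uppercase currency.
    if (!/^\d+(\.\d{1,2})?$/.test(String(offers['price']))) {
      problems.push('offers.price is not a plain decimal string')
    }
    if (!/^[A-Z]{3}$/.test(String(offers['priceCurrency']))) {
      problems.push('offers.priceCurrency is not a 3-letter uppercase code')
    }
  }
  return problems
}

const good: JsonLdNode = {
  '@context': 'https://schema.org',
  '@type': 'Product',
  name: 'Widget',
  offers: { '@type': 'Offer', price: '19.99', priceCurrency: 'USD' },
}
const bad: JsonLdNode = {
  '@context': 'https://schema.org',
  '@type': 'Product',
  offers: { '@type': 'Offer', price: '$19.99', priceCurrency: 'usd' },
}

console.log(checkProduct(good).length) // 0
console.log(checkProduct(bad))         // name, price, and currency problems
```

The compile-time types catch shape errors in code you control, while a check like this catches bad values that only appear once real product data flows through the page.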

## When It Breaks

| Condition | What happens | Use instead |
|---|---|---|
| Schema injected post-hydration via client JS | LLM crawlers and many bots miss it; only ~2% of sites use JS-injected schema for a reason | […] `layout`/`page` server components so it ships in initial HTML [Source 7](#source-7) |
| […] `WebSite` markup | […] [Source 3](#source-3) | `WebSite`/`Organization` only on home and one canonical About page [Source 1](#source-1) |
| […] `<` or `</script>` | […] [Source 7](#source-7) | `JSON.stringify(jsonLd).replace(/</g, '\\u003c')` or `serialize-javascript` [Source 7](#source-7) |
| […] `200` status | […] `<meta name="robots" content="noindex">` is the only signal extractors get [Source 8](#source-8) | […] [Source 8](#source-8) |
| […] | […] [Source 4](#source-4) | […] [Source 4](#source-4) |

## CEMENT Brick

If your public pages ship meaning only in rendered prose and DOM, then AI agents — answer engines, shopping bots, research crawlers — will reconstruct that meaning probabilistically at thousands of tokens per page and disagree with each other about what your product, article, or organization actually *is*, because the consumer side of the web has already moved to typed I/O (JSON schemas in tool calls, structured outputs in model APIs, knowledge graphs as agent context) and an untyped HTML middle is now the weakest contract in the chain.

## Sources

- Source 1: Engineering Docs
- Source 2: [Implementing the Netflix Media Database](https://netflixtechblog.com/implementing-the-netflix-media-database-53b5a840b42a)
- Source 3: Engineering Docs
- Source 4: Engineering Docs
- Source 5: Engineering Docs
- Source 6: [Agentic Info Extraction with Structured Outputs](https://www.youtube.com/watch?v=hpMCvfIIM_A)
- Source 7: [How to implement JSON-LD in your Next.js application](https://nextjs.org/docs/app/guides/json-ld)
- Source 8: [loading.js](https://nextjs.org/docs/app/api-reference/file-conventions/loading)
- Source 9: [The Rise of WebMCP](https://www.youtube.com/watch?v=35oWt7u2b-g)
- Source 10: [The 7 Skills You Need to Build AI Agents](https://www.youtube.com/watch?v=mtiOK2QG9Q0)
- Source 11: [PydanticAI - The NEW Agent Builder on the Block](https://www.youtube.com/watch?v=UnH7S5044GA)
- Source 12: Engineering Docs
- Source 13: [A new short course created with DotTxt is available now](https://www.youtube.com/watch?v=qUt0-B8s1vE)
