# How Much of Your Blog Does AI Search Actually Grab? Breaking Down Claude's WebSearch and WebFetch

> Source: <https://dev.to/isray_notarray/how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claudes-websearch-and-webfetch-538f>
> Published: 2026-06-19 14:23:46+00:00

A while back I wrote [Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search](https://israynotarray.com/en/ai/2026/04/08/aeo-answer-engine-optimization-guide/), and right after finishing it a question hit me: when AI does a web search, how much of my blog does it actually grab? The whole article verbatim? The first 500 characters? Or does it bail after seeing just the title? So I dug into it, and this post walks through Anthropic's official `web_search` and `web_fetch` tool specs, runs a quick test against my own blog, and ends with what all this concretely means for how you should write posts and copy.

Before going further, the one thing worth being crystal clear on: when AI runs a query, **"search"** and **"fetching the page body"** are not the same operation. They're two separate stages.

Why doesn't it just grab the body during the search stage? Context window limits. If every search shoved 10 results' worth of full bodies in, your usable context would blow up fast — and then you'd start complaining the AI is dumb and forgets what you just asked it (because the context did overflow). So it's split into two stages: search first for a list, then decide which entries from that list to actually fetch.

Once that two-stage split makes sense, the rest of this post is about what each stage actually pulls in.

Going straight to Anthropic's official `web_search` tool docs, every search result entry has only four fields:

- `url`: the page URL
- `title`: the page title
- `page_age`: when the page was last updated
- `encrypted_content`: encrypted content that is not there for the AI to read the article; it's used for citations in multi-turn conversations

That's it. Four fields.

What the AI sees during the search stage is "URL, title, last updated" — three pieces of human-readable info. No body content at all.
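To make that concrete, here's a minimal Python sketch of one search result entry and the part of it a model can actually read. The four field names are the ones listed above; the sample values and the `readable_fields` helper are made up for illustration.

```python
# Hypothetical search result entry; the values are placeholders, not real API output.
search_result = {
    "url": "https://example.com/posts/aeo-guide",
    "title": "AEO Guide: Getting Your Site Found by AI Search",
    "page_age": "April 8, 2026",
    "encrypted_content": "<opaque-blob>",  # not readable text
}


def readable_fields(result: dict) -> dict:
    """Return only the fields that carry human-readable information."""
    opaque = {"encrypted_content"}
    return {key: value for key, value in result.items() if key not in opaque}


print(readable_fields(search_result))
```

Nothing in the readable subset comes from the article body, which is the whole point of this stage.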

What if the AI cites your content? There's a cap on that too:

> Each `web_search_result_location`'s `cited_text` is up to 150 characters of the cited content

In short: at most 150 characters of quoted text. And that's just the API-level spec.

Claude Code's built-in WebSearch shaves it down further. According to Mikhail Shilkov's [breakdown of Claude Code's internal behavior](https://mikhail.io/2025/10/claude-code-web-tools/), Claude Code even drops `page_age` and `encrypted_content`, keeping only `title` and `url`.

So basically — at the search stage, the AI sees nothing more than **one title and one URL** from your site. That's it.

Now for when the body actually gets pulled in — Stage 2, WebFetch.

Once the AI has the search results, if it decides to open up a few entries, it fires one WebFetch request per URL, and that's when the full body comes back. How much of it?

This needs to be split into two layers, because the API and Claude Code work differently.

Note

When I say "API" here, I mean the Anthropic API's `web_fetch` tool. "Claude Code" means the WebFetch feature built into Anthropic's own product. The two have different specs and flows.

The Anthropic API's `web_fetch` tool has a parameter called `max_content_tokens` that developers can set themselves, though the official docs use 100,000 tokens in their examples.
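For reference, a tool definition that sets this cap might look like the sketch below. The `max_content_tokens` parameter is the one discussed above; the versioned `type` string and the `max_uses` value are written from memory and are assumptions, so check them against the current docs before copying.

```python
# Sketch of a web_fetch tool definition for the Anthropic Messages API.
# Assumption: the versioned type string may have changed; verify it in the docs.
web_fetch_tool = {
    "type": "web_fetch_20250910",
    "name": "web_fetch",
    "max_uses": 5,                  # illustrative limit on fetches per request
    "max_content_tokens": 100_000,  # cap on how much page content enters the context
}

print(web_fetch_tool["max_content_tokens"])
```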

The docs also give a reference conversion:

| Content size | Estimated tokens |
|---|---|
| Average web page 10 KB | ~2,500 tokens |
| Large doc page 100 KB | ~25,000 tokens |
| Research paper PDF 500 KB | ~125,000 tokens |

So a medium-length blog post in plain text is usually 1,000–2,000 tokens, way below the 100K ceiling. Truncation basically isn't a concern unless you wrote a 50,000-character monster.
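The table works out to roughly 4 bytes per token, so you can sanity-check your own pages with a few lines of Python. This is a rough sketch: the 4-bytes-per-token ratio is my own reading of the table, not an official constant, and non-English text will tokenize differently.

```python
BYTES_PER_TOKEN = 4  # assumption derived from the table: 10 KB is about 2,500 tokens


def estimate_tokens(size_bytes: int) -> int:
    """Rough token estimate for plain text of the given size."""
    return size_bytes // BYTES_PER_TOKEN


def fits_budget(size_bytes: int, max_content_tokens: int = 100_000) -> bool:
    """True if the estimated token count stays within the fetch cap."""
    return estimate_tokens(size_bytes) <= max_content_tokens


pages = [
    ("average web page", 10_000),
    ("large doc page", 100_000),
    ("research paper PDF", 500_000),
]
for label, size in pages:
    print(label, estimate_tokens(size), fits_budget(size))
```

Only the 500 KB PDF goes over a 100,000-token cap by this estimate; an ordinary blog post is nowhere near it.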

One thing to note: `web_fetch`'s citation works differently from `web_search`. It uses `start_char_index` / `end_char_index` to pick out a specific position in the article (although the docs don't pin down a hard character limit).

Claude Code's built-in WebFetch goes a different route.

Per Mikhail Shilkov's breakdown, the WebFetch flow is:

The real kicker is step 3. The main model — the Claude model you're actually using — never sees the page's original text. It only sees the version Haiku summarized. Which means **what your writing turns into by the time it reaches the main model is decided by how Haiku reads it, not by how much you wrote**.
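Here's a rough Python sketch of that kind of flow, pieced together from details mentioned in this post (HTML converted to text, a 100 KB cap, a small model writing the summary). It is not Claude Code's actual implementation: every function name is hypothetical, the HTML conversion is deliberately crude, and the summarizer is a stub so the example runs offline.

```python
import re
from typing import Callable

MAX_BYTES = 100 * 1024  # assumption: the 100 KB cap discussed in this post


def html_to_text(html: str) -> str:
    """Very crude stand-in for an HTML-to-Markdown converter."""
    no_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", no_tags).strip()


def truncate_bytes(text: str, limit: int = MAX_BYTES) -> str:
    """Cut text to at most `limit` UTF-8 bytes without splitting a character."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def web_fetch(html: str, prompt: str, summarize: Callable[[str, str], str]) -> str:
    """Convert, truncate, then hand the page to a small summarizer model."""
    page = truncate_bytes(html_to_text(html))
    return summarize(page, prompt)  # only this summary reaches the main model


def fake_summarizer(page: str, prompt: str) -> str:
    # Placeholder for the small model: keeps the first 80 characters.
    return page[:80]


print(web_fetch("<h1>Hello</h1><p>Body text</p>", "What is this page?", fake_summarizer))
```

The thing to notice is the return value of `web_fetch`: whatever the summarizer drops never reaches the main model at all.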

The citation has a limit too. The rule Mikhail extracted from Claude Code's internal prompt is:

> Enforce a strict 125-character maximum for quotes from any source document.

So quotes max out at 125 characters.

You're probably wondering: how much is 100 KB of plain text? For Chinese, where each character is roughly 3 bytes in UTF-8, 100 KB fits 30,000+ characters; for English at 1 byte per character, that's roughly 100,000 characters. No regular blog post is going to hit that ceiling. So the real concern should be "how do I get Haiku to extract the parts I want to be quoted," not "how much content am I feeding Haiku."
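If you want to check the arithmetic, a quick Python sketch does it. I'm treating 100 KB as 100,000 bytes for round numbers and assuming UTF-8 encoding; the helper name is made up.

```python
LIMIT_BYTES = 100 * 1000  # treating 100 KB as 100,000 bytes for round numbers


def chars_that_fit(sample_char: str, limit: int = LIMIT_BYTES) -> int:
    """How many copies of one character fit in `limit` bytes of UTF-8."""
    return limit // len(sample_char.encode("utf-8"))


print(chars_that_fit("字"))  # CJK character, 3 bytes in UTF-8 -> 33333
print(chars_that_fit("a"))   # ASCII letter, 1 byte in UTF-8 -> 100000
```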

Enough theory — let's actually test it.

I had Claude call the WebFetch tool against my blog's homepage (`israynotarray.com`) and asked it to grab the body.

To make this measurable, I leaned on Claude Code's Hook feature to log every stage's input and output. Here's the hook config — every WebSearch and WebFetch I/O gets logged:

```
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "WebSearch|WebFetch",
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/log-web-tool.sh"
          }
        ]
      }
    ],
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/dump-transcript.sh"
          }
        ]
      }
    ]
  }
}
```
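I'm not reproducing `log-web-tool.sh` itself here, but the idea is simple enough to sketch in Python. This assumes the hook receives a JSON payload with `tool_name`, `tool_input`, and `tool_response` keys, which is my understanding of Claude Code's `PostToolUse` hooks; check the hooks docs before relying on it. The `build_log_line` helper and the sample payload are hypothetical.

```python
import json
from datetime import datetime, timezone


def build_log_line(payload: dict, logged_at: str) -> str:
    """Turn one hook payload into a single JSON log line."""
    record = {
        "logged_at": logged_at,
        "tool": payload.get("tool_name"),
        "input": payload.get("tool_input"),
        "output": payload.get("tool_response"),
    }
    return json.dumps(record, ensure_ascii=False)


# In a real hook you would read the payload with json.load(sys.stdin) and
# append the line to a log file; a sample payload keeps this sketch runnable.
sample_payload = {
    "tool_name": "WebFetch",
    "tool_input": {"url": "https://israynotarray.com"},
    "tool_response": {"code": 200, "bytes": 1068},
}
print(build_log_line(sample_payload, datetime.now(timezone.utc).isoformat()))
```

One JSON object per line keeps the log easy to grep and easy to load back later.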

The prompt I used:

```
Please use the WebFetch tool to grab the content of israynotarray.com's homepage, then tell me what you extracted from that page.
```

The actual input passed to WebFetch:

```
{
  "url": "https://israynotarray.com",
  "prompt": "Tell me the main content of this homepage: site name, author bio, latest article list (titles and summaries), categories, plus any navbar or sidebar info."
}
```

After it ran, the WebFetch result — the summary that came back — looked roughly like this:

```
{
  "bytes": 1068,
  "code": 200,
  "codeText": "OK",
  "result": "# Page Content Analysis\n\nBased on the content snippet provided, I can share the following:\n\n## Site Structure\n- **Main category**: AI-related content, including Gemini and Gemini Omni products\n\n## Latest Article\n** Title**: \"Gemini Omni Video Generation: 30 Prompt Templates for Reels, Product Demos, and Logo Reveals\"\n\n** Summary**:\nThis post introduces Gemini Omni video generation, released by Google at I/O in May 2026, and explains how it replaces the earlier Veo 3.1. The article covers new features like '10-second clips, native synced audio, and Chinese text rendering,' and provides 30 practical prompt templates for use cases including Instagram Reels, product demos, logo animations, B-roll, transitions, quote cards, lifestyle, and food and travel.\n\n**Publish date**: 2026-06-06  \n** Reading time**: ~23 min\n\n## Limitations\nThe content snippet provided does not include author bio or full navbar info.",
  "durationMs": 4603,
  "url": "https://israynotarray.com"
}
```

See it? What the AI actually pulled from my blog is just this tiny under-1,000-character summary. A whole website obviously contains way more than this — the rest was never seen by the main model. The Haiku middle layer reads the page and only extracts what it judges relevant to my prompt. If I opened the same page in a browser, I'd see a full grid of article cards plus a sidebar — but Haiku doesn't ship the full grid back.

I also tried an older post with a deliberately broken URL path, and got this:

```
{
  "bytes": 0,
  "code": 404,
  "codeText": "Not Found",
  "result": "The server returned HTTP 404 Not Found.\n\nThe response body was not retrieved. If this URL requires authentication, use an authenticated tool (e.g. `gh` for GitHub, or an MCP-provided fetch tool) instead of WebFetch.",
  "durationMs": 588,
  "url": "https://israynotarray.com/dqwdqwdqwd"
}
```

Even the content of your 404 page is invisible to the AI — WebFetch just reports the 404 and the AI has no way to see what your 404 page says. Which means if your site has path issues, you've refactored URLs, or you only have frontend routing without real pages, the AI can't pull anything.

Side note — this lines up with a caveat in Claude's official docs:

> The web fetch tool currently does not support websites dynamically rendered with JavaScript.

If your blog is a frontend SPA where content is entirely rendered by JavaScript at runtime, what the AI grabs might just be empty-shell HTML with no articles visible. Static generators (Hexo, Astro, Next.js in SSG mode) are relatively safe, since the build output is fully rendered HTML — the AI grabs and immediately sees content.
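A quick way to check which camp your site falls into is to look at the raw HTML without running any JavaScript and see whether any readable text is left. Here's a minimal sketch using only Python's standard library; the two sample pages are made up, and in practice you'd feed in the HTML you get from a plain HTTP request to your own URL.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-JavaScript fetcher could read from raw HTML."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


spa_shell = '<html><body><div id="app"></div><script>render()</script></body></html>'
static_page = "<html><body><h1>My Post</h1><p>Hello readers.</p></body></html>"
print(repr(visible_text(spa_shell)))
print(repr(visible_text(static_page)))
```

If your real pages come back looking like `spa_shell`, with nothing readable outside the script tags, a fetcher that doesn't run JavaScript has nothing to summarize.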

There's one more important piece — whether the AI can pull your site has a major prerequisite: robots.txt.

AI crawlers basically split into two types: search-style (cite and link back to your site) and training-style (eat content to feed the model, not necessarily linking back). The common mapping:

| Crawler | Type | Behavior |
|---|---|---|
| Claude-SearchBot / Claude-User | Search | Real-time fetch when Claude answers, cites back |
| ClaudeBot | Training | Fetches content to feed Claude training |
| OAI-SearchBot / ChatGPT-User | Search | Real-time fetch when ChatGPT answers, cites back |
| GPTBot | Training | Fetches content to feed GPT training |
| PerplexityBot | Search | Used by Perplexity engine, cites back |
| Google-Extended | Training | For Gemini training |
| CCBot | Training | Common Crawl public dataset |

If you want to be cited by AI but don't want your content used for training, the most common strategy is "allow search-style, block training-style."

Here's a robots.txt template you can copy-paste:

```
# Search-style AI crawlers: allow (they cite back)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-style AI crawlers: block (consume data without citing)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Content signal for all other crawlers: searchable but not for training, not for direct AI input
User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
```
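Before deploying a robots.txt like this, it's worth checking that the rules do what you think. Below is a small sketch using Python's built-in `urllib.robotparser` against a shortened copy of the template. The `allowed` helper and the example URL are made up, and real crawlers may interpret edge cases differently from Python's parser, so treat it as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the template above, enough to exercise both kinds of rule.
ROBOTS_TXT = """\
User-agent: Claude-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""


def allowed(user_agent: str, url: str = "https://example.com/some-post/") -> bool:
    """Check whether the given crawler may fetch the URL under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("Claude-User", "OAI-SearchBot", "ClaudeBot", "GPTBot"):
    print(bot, allowed(bot))
```

The search-style bots should come back `True` and the training-style bots `False`.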

For a full Agent Readiness setup to score 100, see [From 3 to 100! How to Get Your Site to Pass isitagentready's AI Agent Readiness Check](https://israynotarray.com/en/ai/2026/04/21/from-3-to-100-isitagentready-readiness-guide/).

Once you understand all the constraints above, there are four things worth specifically working on.

At the search stage, all the AI sees about your article is two fields — title and URL.

If your title needs a subtitle or context to make sense, when the AI lines it up against ten other results it'll get skipped.

A quick comparison:

The stronger version packs in "topic, tool, what it does, and article type" — the AI doesn't even need to open the page to know whether it's worth fetching.

At the WebFetch stage, the Haiku middle layer reads top-down. The first 300–500 characters decide what it summarizes back. If your opening is "Before we get into X, let's recap a bit of history…", Haiku reads halfway through and discovers the intro is all background and no answer — so it just summarizes the background.

The right move is to make the first sentence of every H2 a direct conclusion, then add the context after. I covered this principle in [Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search](https://israynotarray.com/en/ai/2026/04/08/aeo-answer-engine-optimization-guide/) too — worth reading alongside this.
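To audit an existing post for this, you can pull out the first sentence under every H2 and read them in a row. The sketch below does that for a Markdown string; the function name and sample text are made up, and the sentence splitting is naive (it breaks on `.`, `!`, or `?` followed by whitespace).

```python
import re


def section_openers(markdown: str) -> dict:
    """Map each H2 heading to the first sentence of its section."""
    openers = {}
    # Split on lines starting with "## "; the capture group keeps the heading text.
    parts = re.split(r"^## +(.+)$", markdown, flags=re.MULTILINE)
    # parts looks like [preamble, heading1, body1, heading2, body2, ...]
    for heading, body in zip(parts[1::2], parts[2::2]):
        text = " ".join(body.split())
        first_sentence = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]
        openers[heading.strip()] = first_sentence
    return openers


sample = """\
## How long can a quote be?
Quotes max out at 125 characters in Claude Code. The rule comes from an internal prompt.

## Background
Before we get into it, let's recap a bit of history. Then the answer.
"""
for heading, opener in section_openers(sample).items():
    print(f"{heading} -> {opener}")
```

In the sample, the first section opens with its conclusion and the second opens with warm-up, which is exactly the pattern to hunt for.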

`cited_text` is 150 characters on `web_search` and 125 characters on Claude Code's built-in WebFetch. That means when the AI quotes you, the slot it has is one short sentence that "makes sense without context."

Consciously design sentences like that. For example:

After writing a paragraph, pick one sentence and ask yourself: if someone who hadn't read the rest of the post saw just this sentence, would they get it? If yes, it has a shot at being quoted.
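The length half of that check is easy to automate. Here's a small sketch that lists which sentences in a paragraph fit inside the tighter 125-character limit; the helper name and sample paragraph are made up, and it only checks length, so whether a sentence makes sense on its own is still your call.

```python
import re

QUOTE_BUDGET = 125  # the tighter of the two limits mentioned in this post


def quotable_sentences(paragraph: str, budget: int = QUOTE_BUDGET) -> list:
    """Return the sentences short enough to be quoted whole."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in sentences if 0 < len(s) <= budget]


paragraph = (
    "At the search stage, the AI sees only a title and a URL. "
    "There is a lot more to say about how the fetch stage works, including the conversion step, "
    "the size cap, the summarizer model, and the way the main model only receives a condensed version."
)
for sentence in quotable_sentences(paragraph):
    print(len(sentence), sentence)
```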

H2, H3, `-` lists, tables, and fenced code blocks tagged `js`: these Markdown structures are especially useful for the middleware summary layer. When Haiku reads the Markdown converted from your HTML, it treats headings as "what this section is about" indexes, lists as "main points" signals, and tables as "supporting data" units.

If your whole article is pure prose paragraphs, Haiku has no markers and has to grind through it semantically — what comes out is scattered. If you have clear structure, Haiku can summarize along the markers, and the result lines up with the points you actually want quoted.

So how much of your blog does AI search actually pull?

The answer breaks down into three layers:

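- Search stage: only the title and URL (the API also returns `page_age`), with no body content
- Fetch stage: the API's `web_fetch` returns the body up to `max_content_tokens`, while in Claude Code the main model only sees Haiku's summary
- Citation stage: `web_search` is 150 characters, Claude Code WebFetch is 125 characters

Writing for AI search means targeting those three gates; it's not about getting the AI to memorize your entire post.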

If your blog hasn't set up AI bot routing yet, copy the robots.txt template above to get the basics in place — the rest is just content over time.
