Source: dev.to

How Much of Your Blog Does AI Search Actually Grab? Breaking Down Claude's WebSearch and WebFetch

An engineer tested how much of a blog post Anthropic's Claude actually retrieves during AI search and found that the search stage returns only metadata such as the title and URL, with no body content. The fetch stage can retrieve the page body: the API's web_fetch tool takes a developer-set max_content_tokens (the docs use 100,000 in their examples), while Claude Code's built-in WebFetch passes the page through a Haiku summarization layer, so the main model sees only a summary. AI search visibility therefore depends heavily on titles and URLs, not on full article text.

11 min read · published Jun 19, 2026

A while back I wrote Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search, and right after finishing it a question hit me: when AI does a web search, how much of my blog does it actually grab? The whole article verbatim? The first 500 characters? Or does it bail after seeing just the title? So I dug into it, and this post walks through Anthropic's official web_search and web_fetch tool specs, runs a quick test against my own blog, and ends with what all this concretely means for how you should write posts and copy.

Before going further, the one thing worth being crystal clear on: when AI runs a query, "search" and "fetching the page body" are not the same operation. They're two separate stages.

Why doesn't it just grab the body during the search stage? Context window limits. If every search shoved 10 results' worth of full bodies in, your usable context would blow up fast — and then you'd start complaining the AI is dumb and forgets what you just asked it (because the context did overflow). So it's split into two stages: search first for a list, then decide which entries from that list to actually fetch.

Once that two-stage split makes sense, the rest of this post is about what each stage actually pulls in.

Going straight to Anthropic's official web_search tool docs: every search result entry has only four fields.

- url: the page URL
- title: the page title
- page_age: when the page was last updated
- encrypted_content: encrypted content that is not there for the AI to read the article; it's for multi-turn conversation citations

That's it. Four fields.

What the AI sees during the search stage is "URL, title, last updated" — three pieces of human-readable info. No body content at all.
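To make that concrete, here is a minimal Python sketch of one search result entry. The four field names come from the docs quoted above; the SearchResult class, the readable_view helper, and the sample values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One web_search result entry, using the four field names from the docs."""
    url: str
    title: str
    page_age: str            # when the page was last updated
    encrypted_content: str   # opaque blob used for citations, not readable text


def readable_view(result: SearchResult) -> dict:
    """Return only the fields that carry human-readable information."""
    return {"url": result.url, "title": result.title, "page_age": result.page_age}


# Hypothetical sample values, for illustration only.
sample = SearchResult(
    url="https://example.com/post",
    title="Example post title",
    page_age="June 1, 2026",
    encrypted_content="opaque-encrypted-blob",
)
print(readable_view(sample))
```

Nothing in that readable view contains a single sentence of the article body, which is the whole point of the search stage.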

What if the AI cites your content? There's a cap on that too:

Each web_search_result_location's cited_text is up to 150 characters of the cited content

In short: at most 150 characters of quoted text. And that's just the API-level spec.

Claude Code's built-in WebSearch shaves it down further. According to Mikhail Shilkov's breakdown of Claude Code's internal behavior, Claude Code even drops page_age and encrypted_content, keeping only title and url.

So basically — at the search stage, the AI sees nothing more than one title and one URL from your site. That's it.

Now for when the body actually gets pulled in — Stage 2, WebFetch.

Once the AI has the search results, if it decides to open up a few entries, it fires one WebFetch request per URL, and that's when the full body comes back. How much of it?

This needs to be split into two layers, because the API and Claude Code work differently.

Note

When I say "API" here, I mean the Anthropic API's web_fetch tool. "Claude Code" means the WebFetch feature built into Anthropic's own product. The two have different specs and flows.

The Anthropic API's web_fetch tool has a parameter called max_content_tokens that developers can set themselves, though the official docs use 100,000 tokens in their examples.

The docs also give a reference conversion:

| Content | Size | Estimated tokens |
| --- | --- | --- |
| Average web page | 10 KB | ~2,500 tokens |
| Large doc page | 100 KB | ~25,000 tokens |
| Research paper PDF | 500 KB | ~125,000 tokens |

So a medium-length blog post in plain text is usually 1,000 to 2,000 tokens, way below the 100K ceiling. Truncation basically isn't a concern unless you wrote a 50,000-character monster.
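The conversion in the docs' table works out to about 250 tokens per KB, so a rough estimator is easy to sketch in Python. The function names and the 250-tokens-per-KB constant are my own shorthand derived from that table, not anything the API exposes, and real token counts will vary with language and markup.

```python
TOKENS_PER_KB = 250  # implied by the docs' table: 10 KB -> ~2,500 tokens


def estimate_tokens(size_kb: float) -> int:
    """Rough token estimate for a page of the given size in KB."""
    return round(size_kb * TOKENS_PER_KB)


def fits_budget(size_kb: float, max_content_tokens: int = 100_000) -> bool:
    """True if the estimated token count stays within max_content_tokens."""
    return estimate_tokens(size_kb) <= max_content_tokens


for label, size_kb in [
    ("Average web page", 10),
    ("Large doc page", 100),
    ("Research paper PDF", 500),
]:
    print(label, estimate_tokens(size_kb), fits_budget(size_kb))
```

By this estimate only the 500 KB PDF row goes over a 100,000-token budget; ordinary blog posts are nowhere near it.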

One thing to note: web_fetch's citation works differently from web_search. It uses start_char_index / end_char_index to pick out a specific position in the article (although the docs don't pin down a hard character limit).

Claude Code's built-in WebFetch goes a different route.

Per Mikhail Shilkov's breakdown, the WebFetch flow is:

The real kicker is step 3. The main model — the Claude model you're actually using — never sees the page's original text. It only sees the version Haiku summarized. Which means what your writing turns into by the time it reaches the main model is decided by how Haiku reads it, not by how much you wrote.
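Pieced together from what this post describes (HTML converted to Markdown, a 100 KB cap, then a Haiku summary), the flow can be sketched roughly as follows. This is not Claude Code's actual implementation: every function name is hypothetical, the HTML-to-text step is a crude stand-in for a real Markdown conversion, and the summarizer is passed in as a plain callable so the sketch runs without any API call.

```python
import re
from typing import Callable

MAX_BYTES = 100 * 1024  # the 100 KB cap discussed in this post


def html_to_text(html: str) -> str:
    """Crude stand-in for the HTML-to-Markdown step: drop tags, squeeze whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", no_tags).strip()


def truncate_bytes(text: str, limit: int = MAX_BYTES) -> str:
    """Cut text to at most `limit` UTF-8 bytes without leaving a broken character."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def web_fetch_sketch(html: str, prompt: str, summarize: Callable[[str, str], str]) -> str:
    """Return what the main model would see: the summary, never the raw page."""
    page_text = truncate_bytes(html_to_text(html))
    return summarize(prompt, page_text)


def fake_summarizer(prompt: str, page_text: str) -> str:
    """Hypothetical stand-in for the small model: keeps only the first 80 characters."""
    return page_text[:80]


html = "<html><body><h1>My post</h1><p>" + "Lots of body text. " * 50 + "</p></body></html>"
seen_by_main_model = web_fetch_sketch(html, "What is this page about?", fake_summarizer)
print(len(html_to_text(html)), len(seen_by_main_model))
```

Whatever the summarizer drops is gone for good: the caller only ever holds the returned string.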

The citation has a limit too. The rule Mikhail extracted from Claude Code's internal prompt is:

Enforce a strict 125-character maximum for quotes from any source document.

So quotes max out at 125 characters.

You're probably wondering: how much is 100 KB of plain text? For Chinese, where each character is roughly 3 bytes, 100 KB fits 30,000+ characters; for English at 1 byte per character, that's roughly 100,000 characters. No regular blog post is going to hit that ceiling. So the real concern should be "how do I get Haiku to extract the parts I want to be quoted," not "how much content am I feeding Haiku."
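The arithmetic is easy to check. Here is a small Python sketch, assuming the limit means 102,400 bytes of UTF-8 text (my assumption; it could also be counted as a flat 100,000 bytes). The max_chars helper is made up for this example.

```python
LIMIT_BYTES = 100 * 1024  # 100 KB, assuming 1 KB = 1,024 bytes


def max_chars(sample_char: str, limit: int = LIMIT_BYTES) -> int:
    """How many copies of `sample_char` fit in `limit` bytes when UTF-8 encoded."""
    return limit // len(sample_char.encode("utf-8"))


print(max_chars("a"))   # ASCII letter, 1 byte each -> 102400
print(max_chars("中"))  # common Chinese character, 3 bytes each -> 34133
```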

Enough theory — let's actually test it.

I had Claude call the WebFetch tool against my blog's homepage (israynotarray.com) and asked it to grab the body.

To make this measurable, I leaned on Claude Code's Hook feature to log every stage's input and output. Here's the hook config — every WebSearch and WebFetch I/O gets logged:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "WebSearch|WebFetch",
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/log-web-tool.sh"
          }
        ]
      }
    ],
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/dump-transcript.sh"
          }
        ]
      }
    ]
  }
}
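For reference, a logger along these lines could look like the Python sketch below. It is not the actual log-web-tool.sh. It assumes the hook receives a JSON payload on stdin with tool_name, tool_input, and tool_response fields (check the Claude Code hooks docs for the exact shape), and the function names and sample payload are hypothetical.

```python
import json
from datetime import datetime, timezone


def build_entry(payload: dict) -> dict:
    """Keep only the parts of the hook payload that matter for this experiment."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "tool": payload.get("tool_name"),
        "input": payload.get("tool_input"),
        "response": payload.get("tool_response"),
    }


def append_entry(entry: dict, log_path: str) -> None:
    """Append one entry as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, ensure_ascii=False) + "\n")


# In a real hook you would read the payload with json.load(sys.stdin)
# and pass the result to append_entry; here a sample payload is inlined.
sample_payload = json.loads(
    '{"tool_name": "WebFetch",'
    ' "tool_input": {"url": "https://israynotarray.com"},'
    ' "tool_response": {"code": 200}}'
)
entry = build_entry(sample_payload)
print(entry["tool"], entry["input"]["url"])
```

One JSON object per line keeps the log easy to grep and to load back for counting bytes per call.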

The prompt I used:

Please use the WebFetch tool to grab the content of israynotarray.com's homepage, then tell me what you extracted from that page.

The actual input passed to WebFetch:

{
  "url": "https://israynotarray.com",
  "prompt": "Tell me the main content of this homepage: site name, author bio, latest article list (titles and summaries), categories, plus any navbar or sidebar info."
}

After it ran, the WebFetch result — the summary that came back — looked roughly like this:

{
  "bytes": 1068,
  "code": 200,
  "codeText": "OK",
  "result": "# Page Content Analysis\n\nBased on the content snippet provided, I can share the following:\n\n## Site Structure\n- **Main category**: AI-related content, including Gemini and Gemini Omni products\n\n## Latest Article\n** Title**: \"Gemini Omni Video Generation: 30 Prompt Templates for Reels, Product Demos, and Logo Reveals\"\n\n** Summary**:\nThis post introduces Gemini Omni video generation, released by Google at I/O in May 2026, and explains how it replaces the earlier Veo 3.1. The article covers new features like '10-second clips, native synced audio, and Chinese text rendering,' and provides 30 practical prompt templates for use cases including Instagram Reels, product demos, logo animations, B-roll, transitions, quote cards, lifestyle, and food and travel.\n\n**Publish date**: 2026-06-06  \n** Reading time**: ~23 min\n\n## Limitations\nThe content snippet provided does not include author bio or full navbar info.",
  "durationMs": 4603,
  "url": "https://israynotarray.com"
}

See it? What the AI actually pulled from my blog is just this tiny under-1,000-character summary. A whole website obviously contains way more than this — the rest was never seen by the main model. The Haiku middle layer reads the page and only extracts what it judges relevant to my prompt. If I opened the same page in a browser, I'd see a full grid of article cards plus a sidebar — but Haiku doesn't ship the full grid back.

I also tried an older post with a deliberately broken URL path, and got this:

{
  "bytes": 0,
  "code": 404,
  "codeText": "Not Found",
  "result": "The server returned HTTP 404 Not Found.\n\nThe response body was not retrieved. If this URL requires authentication, use an authenticated tool (e.g. `gh` for GitHub, or an MCP-provided fetch tool) instead of WebFetch.",
  "durationMs": 588,
  "url": "https://israynotarray.com/dqwdqwdqwd"
}

Even the content of your 404 page is invisible to the AI — WebFetch just reports the 404 and the AI has no way to see what your 404 page says. Which means if your site has path issues, you've refactored URLs, or you only have frontend routing without real pages, the AI can't pull anything.

Side note — this lines up with a caveat in Claude's official docs:

The web fetch tool currently does not support websites dynamically rendered with JavaScript.

If your blog is a frontend SPA where content is entirely rendered by JavaScript at runtime, what the AI grabs might just be empty-shell HTML with no articles visible. Static generators (Hexo, Astro, Next.js in SSG mode) are relatively safe, since the build output is fully rendered HTML — the AI grabs and immediately sees content.
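A quick way to see your page the way a non-JavaScript fetcher does is to look only at the text present in the raw HTML. The sketch below uses Python's standard html.parser; the VisibleText class, the visible_text helper, and both sample pages are hypothetical, and this only approximates what any real fetcher extracts.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script>/<style>, roughly what a non-JS fetcher can read."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a reader would find in the HTML without running any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


spa_shell = '<html><body><div id="app"></div><script>renderApp()</script></body></html>'
static_page = "<html><body><article><h1>Title</h1><p>Body text.</p></article></body></html>"

print(repr(visible_text(spa_shell)))    # ''
print(repr(visible_text(static_page)))  # 'Title Body text.'
```

If running this on your own page's raw response body gives back almost nothing, the content is probably being rendered client-side.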

There's one more important piece — whether the AI can pull your site has a major prerequisite: robots.txt.

AI crawlers basically split into two types: search-style (cite and link back to your site) and training-style (eat content to feed the model, not necessarily linking back). The common mapping:

| Crawler | Type | Behavior |
| --- | --- | --- |
| Claude-SearchBot / Claude-User | Search | Real-time fetch when Claude answers, cites back |
| ClaudeBot | Training | Fetches content to feed Claude training |
| OAI-SearchBot / ChatGPT-User | Search | Real-time fetch when ChatGPT answers, cites back |
| GPTBot | Training | Fetches content to feed GPT training |
| PerplexityBot | Search | Used by Perplexity engine, cites back |
| Google-Extended | Training | For Gemini training |
| CCBot | Training | Common Crawl public dataset |

If you want to be cited by AI but don't want your content used for training, the most common strategy is "allow search-style, block training-style."

Here's a robots.txt template you can copy-paste:

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
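Before deploying, it's worth sanity-checking that the rules do what you expect. The sketch below feeds the per-crawler groups from the template above to Python's standard urllib.robotparser and asks whether each one may fetch a post. The example.com URL is a placeholder, and real crawlers may match user agents slightly differently from this parser, so treat it as a smoke test rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

POST_URL = "https://example.com/some-post"  # placeholder URL
for agent in ["Claude-User", "Claude-SearchBot", "OAI-SearchBot", "ClaudeBot", "GPTBot", "CCBot"]:
    verdict = "allowed" if parser.can_fetch(agent, POST_URL) else "blocked"
    print(f"{agent}: {verdict}")
```

The search-style crawlers should come back allowed and the training-style ones blocked.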

For a full Agent Readiness setup to score 100, see From 3 to 100! How to Get Your Site to Pass isitagentready's AI Agent Readiness Check.

Once you understand all the constraints above, there are four things worth specifically working on.

At the search stage, all the AI sees about your article is two fields — title and URL.

If your title needs a subtitle or context to make sense, when the AI lines it up against ten other results it'll get skipped.

A quick comparison:

The stronger version packs in "topic, tool, what it does, and article type" — the AI doesn't even need to open the page to know whether it's worth fetching.

At the WebFetch stage, the Haiku middle layer reads top-down. The first 300–500 characters decide what it summarizes back. If your opening is "Before we get into X, let's recap a bit of history…", Haiku reads halfway through and discovers the intro is all background and no answer — so it just summarizes the background.

The right move is to make the first sentence of every H2 a direct conclusion, then add the context after. I covered this principle in Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search too — worth reading alongside this.

cited_text is 150 characters on web_search and 125 characters on Claude Code's built-in WebFetch. That means when the AI quotes you, the slot it has is one short sentence that "makes sense without context."

Consciously design sentences like that. For example:

After writing a paragraph, pick one sentence and ask yourself: if someone who hadn't read the rest of the post saw just this sentence, would they get it? If yes, it has a shot at being quoted.
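The length half of that check can be automated. Here is a rough Python sketch that lists which sentences in a paragraph fit inside the 125-character quote limit. The sentence splitter is deliberately naive, the helper names and sample paragraph are made up for this example, and whether a sentence makes sense on its own still needs a human read.

```python
import re


def split_sentences(paragraph: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def quotable(paragraph: str, limit: int = 125) -> list:
    """Sentences short enough to be quoted whole within `limit` characters."""
    return [s for s in split_sentences(paragraph) if len(s) <= limit]


paragraph = (
    "WebFetch returns a summary, not the page. "
    "The main model never sees the original text because a smaller model reads "
    "the converted Markdown first and passes along only the parts it judges "
    "relevant to the prompt it was given."
)

for sentence in quotable(paragraph):
    print(len(sentence), sentence)
```

Pass limit=150 to check against the web_search cited_text cap instead.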

H2, H3, - lists, tables, ```js code blocks: these Markdown structures are especially useful for the middleware summary layer. When Haiku reads the Markdown converted from your HTML, it treats headings as "what this section is about" indexes, lists as "main points" signals, and tables as "supporting data" units.

If your whole article is pure prose paragraphs, Haiku has no markers and has to grind through it semantically — what comes out is scattered. If you have clear structure, Haiku can summarize along the markers, and the result lines up with the points you actually want quoted.

So how much of your blog does AI search actually pull?

The answer breaks down into three layers:

- web_search is 150 characters, Claude Code WebFetch is 125 characters

Writing for AI search means targeting those three gates; it's not about getting the AI to memorize your entire post.

If your blog hasn't set up AI bot routing yet, copy the robots.txt template above to get the basics in place — the rest is just content over time.
