How Much of Your Blog Does AI Search Actually Grab? Breaking Down Claude's WebSearch and WebFetch

An engineer tested how much of a blog post Anthropic's Claude actually retrieves during AI search and found that the search stage returns only metadata such as the title and URL, with no body content. The WebFetch stage can retrieve the full page body: the API exposes a max_content_tokens parameter (the official examples use 100,000 tokens), while Claude Code's built-in tool passes the page through a Haiku summarization layer, so the main model sees only a short summary. This means AI search visibility depends heavily on titles and URLs, not full article text.

A while back I wrote Is SEO Not Enough? Meet AEO: Getting Your Site Found by AI Search (https://israynotarray.com/en/ai/2026/04/08/aeo-answer-engine-optimization-guide/), and right after finishing it a question hit me: when AI does a web search, how much of my blog does it actually grab? The whole article verbatim? The first 500 characters? Or does it bail after seeing just the title?

So I dug into it. This post walks through Anthropic's official web search and web fetch tool specs, runs a quick test against my own blog, and ends with what all this concretely means for how you should write posts and copy.

Before going further, the one thing worth being crystal clear on: when AI runs a query, "search" and "fetching the page body" are not the same operation. They're two separate stages.

Why doesn't it just grab the body during the search stage? Context window limits. If every search shoved 10 results' worth of full bodies in, your usable context would blow up fast, and then you'd start complaining that the AI is dumb and forgets what you just asked it (because the context really did overflow). So it's split into two stages: search first for a list, then decide which entries from that list to actually fetch.

Once that two-stage split makes sense, the rest of this post is about what each stage actually pulls in.

Going straight to Anthropic's official web search tool docs, every search result entry has only four fields:

- url: the page URL
- title: the page title
- page_age: when the page was last updated
- encrypted_content: encrypted content that is not there for the AI to read the article; it's used for multi-turn conversation citations

That's it. Four fields. What the AI sees during the search stage is "URL, title, last updated": three pieces of human-readable info. No body content at all.

What if the AI cites your content? There's a cap on that too. Per the docs, the cited_text of each web_search_result_location is "up to 150 characters of the cited content". In short: at most 150 characters of quoted text.
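To make the four-field shape concrete, here is a minimal Python sketch of one search result entry. The class name SearchResultEntry, the two helper functions, and all sample values are hypothetical; only the four field names and the 150-character cap come from the docs quoted above.

```python
from dataclasses import dataclass
from typing import Optional

# Cap on cited_text for a web search citation, per the docs quoted above.
CITED_TEXT_LIMIT = 150


@dataclass
class SearchResultEntry:
    """Hypothetical model of one web search result entry."""
    url: str
    title: str
    page_age: Optional[str]
    encrypted_content: str  # opaque blob, not readable article text


def readable_fields(entry: SearchResultEntry) -> dict:
    """Return only the fields that carry human-readable information."""
    return {"url": entry.url, "title": entry.title, "page_age": entry.page_age}


def fits_citation(sentence: str, limit: int = CITED_TEXT_LIMIT) -> bool:
    """True if the sentence could be quoted whole within the citation cap."""
    return len(sentence.strip()) <= limit


entry = SearchResultEntry(
    url="https://example.com/post",  # placeholder URL
    title="Example post title",      # placeholder title
    page_age="April 8, 2026",        # placeholder value
    encrypted_content="<opaque>",    # placeholder value
)
print(readable_fields(entry))
print(fits_citation("Search results carry a title and a URL, not the article body."))
```

Nothing in readable_fields is article text, which is the whole point: at this stage the title has to carry the pitch on its own.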
And that's just the API-level spec. Claude Code's built-in WebSearch shaves it down further. According to Mikhail Shilkov's breakdown of Claude Code's internal behavior (https://mikhail.io/2025/10/claude-code-web-tools/), Claude Code even drops page_age and encrypted_content, keeping only title and url.

So at the search stage, the AI sees nothing more than one title and one URL from your site. That's it.

Now for when the body actually gets pulled in: Stage 2, WebFetch. Once the AI has the search results, if it decides to open up a few entries, it fires one WebFetch request per URL, and that's when the full body comes back. How much of it? This needs to be split into two layers, because the API and Claude Code work differently.

Note: When I say "API" here, I mean the Anthropic API's web fetch tool. "Claude Code" means the WebFetch feature built into Anthropic's own product. The two have different specs and flows.

The Anthropic API's web fetch tool has a parameter called max_content_tokens that developers can set themselves, and the official docs use 100,000 tokens in their examples. The docs also give a reference conversion:

| Content size | Estimated tokens |
|---|---|
| Average web page (10 KB) | ~2,500 tokens |
| Large doc page (100 KB) | ~25,000 tokens |
| Research paper PDF (500 KB) | ~125,000 tokens |

So a medium-length blog post in plain text is usually 1,000 to 2,000 tokens, way below the 100K ceiling. Truncation basically isn't a concern unless you wrote a 50,000-character monster.

One thing to note: web fetch's citation works differently from web search's. It uses start_char_index / end_char_index to pick out a specific position in the article (although the docs don't pin down a hard character limit).

Claude Code's built-in WebFetch goes a different route. Per Mikhail Shilkov's breakdown, the fetched page does not go straight to the main model: it is handed to a smaller model, Haiku, which summarizes it first. That summarization step is the real kicker. The main model, the Claude model you're actually using, never sees the page's original text.
It only sees the version Haiku summarized. Which means what your writing turns into by the time it reaches the main model is decided by how Haiku reads it, not by how much you wrote.

The citation has a limit too. The rule Mikhail extracted from Claude Code's internal prompt is: "Enforce a strict 125-character maximum for quotes from any source document." So quotes max out at 125 characters.

You're probably wondering: so how much is 100 KB of plain text? For Chinese, where each character is roughly 3 bytes, 100 KB fits 30,000+ characters; for English at 1 byte per character, that's roughly 100,000 characters. No regular blog post is going to hit that ceiling. So the real concern should be "how do I get Haiku to extract the parts I want to be quoted," not "how much content am I feeding Haiku."

Enough theory. Let's actually test it. I had Claude call the WebFetch tool against my blog's homepage israynotarray.com and asked it to grab the body. To make this measurable, I leaned on Claude Code's Hook feature to log every stage's input and output.

Here's the hook config. Every WebSearch and WebFetch I/O gets logged:

    {
      "hooks": {
        "PostToolUse": [
          {
            "matcher": "WebSearch|WebFetch",
            "hooks": [
              { "type": "command", "command": "/claude-tools/hooks/log-web-tool.sh" }
            ]
          }
        ],
        "SessionEnd": [
          {
            "hooks": [
              { "type": "command", "command": "/claude-tools/hooks/dump-transcript.sh" }
            ]
          }
        ]
      }
    }

The prompt I used:

    Please use the WebFetch tool to grab the content of israynotarray.com's homepage, then tell me what you extracted from that page.

The actual input passed to WebFetch:

    {
      "url": "https://israynotarray.com",
      "prompt": "Tell me the main content of this homepage: site name, author bio, latest article list (titles and summaries), categories, plus any navbar or sidebar info."
    }

After it ran, the WebFetch result (the summary that came back) looked roughly like this:

    {
      "bytes": 1068,
      "code": 200,
      "codeText": "OK",
      "result": " Page Content Analysis\n\nBased on the content snippet provided, I can share the following:\n\n Site Structure\n- Main category : AI-related content, including Gemini and Gemini Omni products\n\n Latest Article\n Title : \"Gemini Omni Video Generation: 30 Prompt Templates for Reels, Product Demos, and Logo Reveals\"\n\n Summary :\nThis post introduces Gemini Omni video generation, released by Google at I/O in May 2026, and explains how it replaces the earlier Veo 3.1. The article covers new features like '10-second clips, native synced audio, and Chinese text rendering,' and provides 30 practical prompt templates for use cases including Instagram Reels, product demos, logo animations, B-roll, transitions, quote cards, lifestyle, and food and travel.\n\n Publish date : 2026-06-06 \n Reading time : ~23 min\n\n Limitations\nThe content snippet provided does not include author bio or full navbar info.",
      "durationMs": 4603,
      "url": "https://israynotarray.com"
    }

See it? What the AI actually pulled from my blog is just this tiny under-1,000-character summary. A whole website obviously contains way more than this; the rest was never seen by the main model. The Haiku middle layer reads the page and only extracts what it judges relevant to my prompt. If I opened the same page in a browser, I'd see a full grid of article cards plus a sidebar, but Haiku doesn't ship the full grid back.

I also tried an older post with a deliberately broken URL path, and got this:

    {
      "bytes": 0,
      "code": 404,
      "codeText": "Not Found",
      "result": "The server returned HTTP 404 Not Found.\n\nThe response body was not retrieved. If this URL requires authentication, use an authenticated tool e.g.
    gh for GitHub, or an MCP-provided fetch tool instead of WebFetch.",
      "durationMs": 588,
      "url": "https://israynotarray.com/dqwdqwdqwd"
    }

Even the content of your 404 page is invisible to the AI: WebFetch just reports the 404, and the AI has no way to see what your 404 page says. Which means if your site has path issues, you've refactored URLs, or you only have frontend routing without real pages, the AI can't pull anything.

Side note: this lines up with a caveat in Claude's official docs: "The web fetch tool currently does not support websites dynamically rendered with JavaScript." If your blog is a frontend SPA where content is entirely rendered by JavaScript at runtime, what the AI grabs might just be empty-shell HTML with no articles visible. Static generators (Hexo, Astro, Next.js in SSG mode) are relatively safe, since the build output is fully rendered HTML, so the AI grabs the page and immediately sees content.

There's one more important piece. Whether the AI can pull your site has a major prerequisite: robots.txt. AI crawlers basically split into two types: search-style (cite and link back to your site) and training-style (eat content to feed the model, not necessarily linking back). The common mapping:

| Crawler | Type | Behavior |
|---|---|---|
| Claude-SearchBot / Claude-User | Search | Real-time fetch when Claude answers, cites back |
| ClaudeBot | Training | Fetches content to feed Claude training |
| OAI-SearchBot / ChatGPT-User | Search | Real-time fetch when ChatGPT answers, cites back |
| GPTBot | Training | Fetches content to feed GPT training |
| PerplexityBot | Search | Used by Perplexity engine, cites back |
| Google-Extended | Training | For Gemini training |
| CCBot | Training | Common Crawl public dataset |

If you want to be cited by AI but don't want your content used for training, the most common strategy is "allow search-style, block training-style."
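The "allow search-style, block training-style" split is easy to get wrong with a stray typo, so it can be worth checking a robots.txt before deploying it. A minimal sketch using Python's standard urllib.robotparser: the shortened ROBOTS_TXT sample, the crawler_access helper, and the example.com URL are hypothetical, and this only shows how this particular parser reads the rules, not how each crawler behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, shortened robots.txt following the allow-search / block-training split.
ROBOTS_TXT = """\
User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""


def crawler_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    ROBOTS_TXT,
    ["Claude-SearchBot", "Claude-User", "ClaudeBot", "GPTBot"],
    "https://example.com/some-post/",  # placeholder URL
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Swapping in your real robots.txt text and your own post URL gives a quick regression check whenever the file changes.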
Here's a robots.txt template you can copy-paste:

    # Search-style AI crawlers: allow (they cite back)
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    User-agent: Claude-User
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    # Training-style AI crawlers: block (consume data without citing)
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    # Content signal: searchable, but not for training and not for direct AI input
    Content-Signal: ai-train=no, search=yes, ai-input=no

For a full Agent Readiness setup to score 100, see From 3 to 100: How to Get Your Site to Pass isitagentready's AI Agent Readiness Check (https://israynotarray.com/en/ai/2026/04/21/from-3-to-100-isitagentready-readiness-guide/).

Once you understand all the constraints above, there are four things worth specifically working on.

At the search stage, all the AI sees about your article is two fields: title and URL. If your title needs a subtitle or context to make sense, it'll get skipped when the AI lines it up against ten other results. When you compare a weak title with a strong one, the stronger version packs in "topic, tool, what it does, and article type", so the AI doesn't even need to open the page to know whether it's worth fetching.

At the WebFetch stage, the Haiku middle layer reads top-down. The first 300 to 500 characters decide what it summarizes back. If your opening is "Before we get into X, let's recap a bit of history…", Haiku reads halfway through and discovers the intro is all background and no answer, so it just summarizes the background. The right move is to make the first sentence of every H2 a direct conclusion, then add the context after. I covered this principle in Is SEO Not Enough? Meet AEO: Getting Your Site Found by AI Search (https://israynotarray.com/en/ai/2026/04/08/aeo-answer-engine-optimization-guide/) too, so it's worth reading alongside this.
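One way to audit the "conclusion first" rule is to pull out the opening sentence under each H2 and read those sentences in isolation. A minimal sketch, assuming the post is available as Markdown with "## " headings; the function name, the naive regex-based sentence split, and the sample text are all hypothetical.

```python
import re


def section_openers(markdown: str) -> dict[str, str]:
    """Map each H2 heading to the first sentence of the text that follows it."""
    openers: dict[str, str] = {}
    # Splitting on '## ' heading lines yields: preamble, heading, body, heading, body, ...
    parts = re.split(r"^## +(.+)$", markdown, flags=re.MULTILINE)
    for heading, body in zip(parts[1::2], parts[2::2]):
        text = " ".join(body.split())  # collapse newlines and repeated spaces
        # Naive split: first '.', '!' or '?' followed by whitespace or end of text.
        match = re.search(r"(.+?[.!?])(?:\s|$)", text)
        openers[heading.strip()] = match.group(1) if match else text
    return openers


SAMPLE = """\
## What the search stage returns
The search stage returns only a title and a URL. Background comes later.

## History of search
Before we get into this, let's recap a bit of history. Then the answer.
"""

for heading, opener in section_openers(SAMPLE).items():
    print(f"{heading} -> {opener}")
```

If an opener reads like the second sample (background, no answer), that section is a candidate for rewriting so the conclusion comes first.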
The cited text limit is 150 characters on web search and 125 characters on Claude Code's built-in WebFetch. That means when the AI quotes you, the slot it has is one short sentence that "makes sense without context." Consciously design sentences like that. For example, after writing a paragraph, pick one sentence and ask yourself: if someone who hadn't read the rest of the post saw just this sentence, would they get it? If yes, it has a shot at being quoted.

H2 and H3 headings, - lists, tables, and code blocks: these Markdown structures are especially useful for the middleware summary layer. When Haiku reads the Markdown converted from your HTML, it treats headings as "what this section is about" indexes, lists as "main points" signals, and tables as "supporting data" units. If your whole article is pure prose paragraphs, Haiku has no markers and has to grind through it semantically, and what comes out is scattered. If you have clear structure, Haiku can summarize along the markers, and the result lines up with the points you actually want quoted.

So how much of your blog does AI search actually pull? The answer breaks down into three layers:

- Search stage: only the title and URL (the API adds page_age), with no body content
- Fetch stage: the body is fetched, but in Claude Code the main model sees only Haiku's summary
- Citation: web search is 150 characters, Claude Code WebFetch is 125 characters

Writing for AI search means targeting those three gates; it's not about getting the AI to memorize your entire post. If your blog hasn't set up AI bot routing yet, copy the robots.txt template above to get the basics in place. The rest is just content over time.
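As a closing check, the numeric limits discussed in this post can be turned into a rough pre-publish report. This is a sketch under stated assumptions: the 4-bytes-per-token ratio is inferred from the docs' "10 KB is about 2,500 tokens" conversion and is only an estimate, the sentence split is naive, and every name in the code is hypothetical.

```python
import re

WEB_SEARCH_QUOTE_LIMIT = 150   # cited_text cap on web search
CLAUDE_CODE_QUOTE_LIMIT = 125  # quote cap in Claude Code's WebFetch prompt
BYTES_PER_TOKEN = 4            # assumption: 10 KB ~ 2,500 tokens in the docs' table


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on UTF-8 byte length."""
    return len(text.encode("utf-8")) // BYTES_PER_TOKEN


def quotable_sentences(text: str, limit: int) -> list[str]:
    """Return sentences short enough to be quoted whole within limit characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s and len(s) <= limit]


post = (
    "The search stage returns only a title and a URL. "
    "Claude Code passes the fetched page to Haiku, and the main model only ever sees "
    "the summary that Haiku writes, which is why the opening lines of each section "
    "matter so much."
)
print("estimated tokens:", estimate_tokens(post))
print("quotable at 125 chars:", quotable_sentences(post, CLAUDE_CODE_QUOTE_LIMIT))
print("quotable at 150 chars:", quotable_sentences(post, WEB_SEARCH_QUOTE_LIMIT))
```

A paragraph with no quotable sentence at either limit is a hint to add one short, self-contained statement of its main point.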