{"slug": "how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claude-s-and", "title": "How Much of Your Blog Does AI Search Actually Grab? Breaking Down Claude's WebSearch and WebFetch", "summary": "An engineer tested how much of a blog post Anthropic's Claude actually retrieves during AI search, finding that the search stage only returns the title and URL—no body content. The WebFetch stage can retrieve the full page body, with the API allowing up to 100,000 tokens, but Claude Code's built-in tool truncates content to 2,000 characters. This means AI search visibility depends heavily on titles and URLs, not full article text.", "body_md": "A while back I wrote [Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search](https://israynotarray.com/en/ai/2026/04/08/aeo-answer-engine-optimization-guide/), and right after finishing it a question hit me: when AI does a web search, how much of my blog does it actually grab? The whole article verbatim? The first 500 characters? Or does it bail after seeing just the title? So I dug into it, and this post walks through Anthropic's official `web_search` and `web_fetch` tool specs, runs a quick test against my own blog, and ends with what all this concretely means for how you should write posts and copy.\n\nBefore going further, the one thing worth being crystal clear on: when AI runs a query, **\"search\"** and **\"fetching the page body\"** are not the same operation. They're two separate stages.\n\nWhy doesn't it just grab the body during the search stage? Context window limits. If every search shoved 10 results' worth of full bodies in, your usable context would blow up fast — and then you'd start complaining the AI is dumb and forgets what you just asked it (because the context did overflow). 
So it's split into two stages: search first for a list, then decide which entries from that list to actually fetch.\n\nOnce that two-stage split makes sense, the rest of this post is about what each stage actually pulls in.\n\nGoing straight to Anthropic's official `web_search` tool docs — every search result entry has only four fields:\n\n- `url`: the page URL\n- `title`: the page title\n- `page_age`: when the page was last updated\n- `encrypted_content`: encrypted content, not for the AI to read the article — it's for multi-turn conversation citations\n\nThat's it. Four fields.\n\nWhat the AI sees during the search stage is \"URL, title, last updated\" — three pieces of human-readable info. No body content at all.\n\nWhat if the AI cites your content? There's a cap on that too:\n\nEach `web_search_result_location`'s `cited_text` is up to 150 characters of the cited content\n\nIn short: at most 150 characters of quoted text. And that's just the API-level spec.\n\nClaude Code's built-in WebSearch shaves it down further. According to Mikhail Shilkov's [breakdown of Claude Code's internal behavior](https://mikhail.io/2025/10/claude-code-web-tools/), Claude Code even drops `page_age` and `encrypted_content`, keeping only `title` and `url`.\n\nSo basically — at the search stage, the AI sees nothing more than **one title and one URL** from your site. That's it.\n\nNow for when the body actually gets pulled in — Stage 2, WebFetch.\n\nOnce the AI has the search results, if it decides to open up a few entries, it fires one WebFetch request per URL, and that's when the full body comes back. How much of it?\n\nThis needs to be split into two layers, because the API and Claude Code work differently.\n\nNote\n\nWhen I say \"API\" here, I mean the Anthropic API's `web_fetch` tool. \"Claude Code\" means the WebFetch feature built into Anthropic's own product. 
The two have different specs and flows.\n\nThe Anthropic API's `web_fetch` tool has a parameter called `max_content_tokens` that developers can set themselves — though the official docs use 100,000 tokens in their examples.\n\nThe docs also give a reference conversion:\n\n| Content size | Estimated tokens |\n|---|---|\n| Average web page 10 KB | ~2,500 tokens |\n| Large doc page 100 KB | ~25,000 tokens |\n| Research paper PDF 500 KB | ~125,000 tokens |\n\nSo a medium-length blog post in plain text is usually 1,000–2,000 tokens, way below the 100K ceiling. Truncation basically isn't a concern unless you wrote a 50,000-character monster.\n\nOne thing to note: `web_fetch`'s citation works differently from `web_search`. It uses `start_char_index` / `end_char_index` to pick out a specific position in the article (although the docs don't pin down a hard character limit).\n\nClaude Code's built-in WebFetch goes a different route.\n\nPer Mikhail Shilkov's breakdown, the WebFetch flow is:\n\nThe real kicker is step 3. The main model — the Claude model you're actually using — never sees the page's original text. It only sees the version Haiku summarized. Which means **what your writing turns into by the time it reaches the main model is decided by how Haiku reads it, not by how much you wrote**.\n\nThe citation has a limit too. The rule Mikhail extracted from Claude Code's internal prompt is:\n\nEnforce a strict 125-character maximum for quotes from any source document.\n\nSo quotes max out at 125 characters.\n\nYou're probably wondering — so how much is 100 KB of plain text? For Chinese, where each character is roughly 3 bytes, 100 KB fits 30,000+ characters; for English at 1 byte per character, that's well over 100,000 characters. No regular blog post is going to hit that ceiling. 
So the real concern should be \"how do I get Haiku to extract the parts I want to be quoted,\" not \"how much content am I feeding Haiku.\"\n\nEnough theory — let's actually test it.\n\nI had Claude call the WebFetch tool against my blog's homepage (`israynotarray.com`) and asked it to grab the body.\n\nTo make this measurable, I leaned on Claude Code's Hook feature to log every stage's input and output. Here's the hook config — every WebSearch and WebFetch I/O gets logged:\n\n```\n{\n  \"hooks\": {\n    \"PostToolUse\": [\n      {\n        \"matcher\": \"WebSearch|WebFetch\",\n        \"hooks\": [\n          {\n            \"type\": \"command\",\n            \"command\": \"/claude-tools/hooks/log-web-tool.sh\"\n          }\n        ]\n      }\n    ],\n    \"SessionEnd\": [\n      {\n        \"hooks\": [\n          {\n            \"type\": \"command\",\n            \"command\": \"/claude-tools/hooks/dump-transcript.sh\"\n          }\n        ]\n      }\n    ]\n  }\n}\n```\n\nThe prompt I used:\n\n```\nPlease use the WebFetch tool to grab the content of israynotarray.com's homepage, then tell me what you extracted from that page.\n```\n\nThe actual input passed to WebFetch:\n\n```\n{\n  \"url\": \"https://israynotarray.com\",\n  \"prompt\": \"Tell me the main content of this homepage: site name, author bio, latest article list (titles and summaries), categories, plus any navbar or sidebar info.\"\n}\n```\n\nAfter it ran, the WebFetch result — the summary that came back — looked roughly like this:\n\n```\n{\n  \"bytes\": 1068,\n  \"code\": 200,\n  \"codeText\": \"OK\",\n  \"result\": \"# Page Content Analysis\\n\\nBased on the content snippet provided, I can share the following:\\n\\n## Site Structure\\n- **Main category**: AI-related content, including Gemini and Gemini Omni products\\n\\n## Latest Article\\n** Title**: \\\"Gemini Omni Video Generation: 30 Prompt Templates for Reels, Product Demos, and Logo Reveals\\\"\\n\\n** Summary**:\\nThis post introduces 
Gemini Omni video generation, released by Google at I/O in May 2026, and explains how it replaces the earlier Veo 3.1. The article covers new features like '10-second clips, native synced audio, and Chinese text rendering,' and provides 30 practical prompt templates for use cases including Instagram Reels, product demos, logo animations, B-roll, transitions, quote cards, lifestyle, and food and travel.\\n\\n**Publish date**: 2026-06-06  \\n** Reading time**: ~23 min\\n\\n## Limitations\\nThe content snippet provided does not include author bio or full navbar info.\",\n  \"durationMs\": 4603,\n  \"url\": \"https://israynotarray.com\"\n}\n```\n\nSee it? What the AI actually pulled from my blog is just this tiny under-1,000-character summary. A whole website obviously contains way more than this — the rest was never seen by the main model. The Haiku middle layer reads the page and only extracts what it judges relevant to my prompt. If I opened the same page in a browser, I'd see a full grid of article cards plus a sidebar — but Haiku doesn't ship the full grid back.\n\nI also tried an older post with a deliberately broken URL path, and got this:\n\n```\n{\n  \"bytes\": 0,\n  \"code\": 404,\n  \"codeText\": \"Not Found\",\n  \"result\": \"The server returned HTTP 404 Not Found.\\n\\nThe response body was not retrieved. If this URL requires authentication, use an authenticated tool (e.g. `gh` for GitHub, or an MCP-provided fetch tool) instead of WebFetch.\",\n  \"durationMs\": 588,\n  \"url\": \"https://israynotarray.com/dqwdqwdqwd\"\n}\n```\n\nEven the content of your 404 page is invisible to the AI — WebFetch just reports the 404 and the AI has no way to see what your 404 page says. 
Which means if your site has path issues, you've refactored URLs, or you only have frontend routing without real pages, the AI can't pull anything.\n\nSide note — this lines up with a caveat in Claude's official docs:\n\nThe web fetch tool currently does not support websites dynamically rendered with JavaScript.\n\nIf your blog is a frontend SPA where content is entirely rendered by JavaScript at runtime, what the AI grabs might just be empty-shell HTML with no articles visible. Static generators (Hexo, Astro, Next.js in SSG mode) are relatively safe, since the build output is fully rendered HTML — the AI grabs and immediately sees content.\n\nThere's one more important piece — whether the AI can pull your site has a major prerequisite: robots.txt.\n\nAI crawlers basically split into two types: search-style (cite and link back to your site) and training-style (eat content to feed the model, not necessarily linking back). The common mapping:\n\n| Crawler | Type | Behavior |\n|---|---|---|\n| Claude-SearchBot / Claude-User | Search | Real-time fetch when Claude answers, cites back |\n| ClaudeBot | Training | Fetches content to feed Claude training |\n| OAI-SearchBot / ChatGPT-User | Search | Real-time fetch when ChatGPT answers, cites back |\n| GPTBot | Training | Fetches content to feed GPT training |\n| PerplexityBot | Search | Used by Perplexity engine, cites back |\n| Google-Extended | Training | For Gemini training |\n| CCBot | Training | Common Crawl public dataset |\n\nIf you want to be cited by AI but don't want your content used for training, the most common strategy is \"allow search-style, block training-style.\"\n\nHere's a robots.txt template you can copy-paste:\n\n```\n# Search-style AI crawlers: allow (they cite back)\nUser-agent: OAI-SearchBot\nAllow: /\n\nUser-agent: ChatGPT-User\nAllow: /\n\nUser-agent: Claude-SearchBot\nAllow: /\n\nUser-agent: Claude-User\nAllow: /\n\nUser-agent: PerplexityBot\nAllow: /\n\n# Training-style AI crawlers: block 
(consume data without citing)\nUser-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n\nUser-agent: Google-Extended\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n\nUser-agent: Bytespider\nDisallow: /\n\n# Content signal: searchable but not for training, not for direct AI input\nContent-Signal: ai-train=no, search=yes, ai-input=no\n```\n\nFor a full Agent Readiness setup to score 100, see [From 3 to 100! How to Get Your Site to Pass isitagentready's AI Agent Readiness Check](https://israynotarray.com/en/ai/2026/04/21/from-3-to-100-isitagentready-readiness-guide/).\n\nOnce you understand all the constraints above, there are four things worth specifically working on.\n\nAt the search stage, all the AI sees about your article is two fields — title and URL.\n\nIf your title needs a subtitle or context to make sense, when the AI lines it up against ten other results it'll get skipped.\n\nA quick comparison:\n\nThe stronger version packs in \"topic, tool, what it does, and article type\" — the AI doesn't even need to open the page to know whether it's worth fetching.\n\nAt the WebFetch stage, the Haiku middle layer reads top-down. The first 300–500 characters decide what it summarizes back. If your opening is \"Before we get into X, let's recap a bit of history…\", Haiku reads halfway through and discovers the intro is all background and no answer — so it just summarizes the background.\n\nThe right move is to make the first sentence of every H2 a direct conclusion, then add the context after. I covered this principle in [Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search](https://israynotarray.com/en/ai/2026/04/08/aeo-answer-engine-optimization-guide/) too — worth reading alongside this.\n\n`cited_text` is 150 characters on `web_search` and 125 characters on Claude Code's built-in WebFetch. 
That means when the AI quotes you, the slot it has is one short sentence that \"makes sense without context.\"\n\nConsciously design sentences like that. For example:\n\nAfter writing a paragraph, pick one sentence and ask yourself: if someone who hadn't read the rest of the post saw just this sentence, would they get it? If yes, it has a shot at being quoted.\n\nH2, H3, `-` lists, tables, `` ```js `` code blocks — these Markdown structures are especially useful for the middleware summary layer. When Haiku reads the Markdown converted from your HTML, it treats headings as \"what this section is about\" indexes, lists as \"main points\" signals, and tables as \"supporting data\" units.\n\nIf your whole article is pure prose paragraphs, Haiku has no markers and has to grind through it semantically — what comes out is scattered. If you have clear structure, Haiku can summarize along the markers, and the result lines up with the points you actually want quoted.\n\nSo how much of your blog does AI search actually pull?\n\nThe answer breaks down into three layers:\n\n`web_search` is 150 characters, Claude Code WebFetch is 125 characters\n\nWriting for AI search means targeting those three gates — it's not about getting the AI to memorize your entire post.\n\nIf your blog hasn't set up AI bot routing yet, copy the robots.txt template above to get the basics in place — the rest is just content over time.", "url": "https://wpnews.pro/news/how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claude-s-and", "canonical_source": "https://dev.to/isray_notarray/how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claudes-websearch-and-webfetch-538f", "published_at": "2026-06-19 14:23:46+00:00", "updated_at": "2026-06-19 14:37:03.524550+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "natural-language-processing", "developer-tools"], "entities": ["Anthropic", "Claude", "Mikhail Shilkov", "Claude Code"], 
"alternates": {"html": "https://wpnews.pro/news/how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claude-s-and", "markdown": "https://wpnews.pro/news/how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claude-s-and.md", "text": "https://wpnews.pro/news/how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claude-s-and.txt", "jsonld": "https://wpnews.pro/news/how-much-of-your-blog-does-ai-search-actually-grab-breaking-down-claude-s-and.jsonld"}}