{"slug": "what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs", "title": "What AI Crawlers Actually Do to a Small Blog: 9 Days of Logs", "summary": "A small Home Assistant blog received 18,209 AI crawler requests in nine days, accounting for 5.2% of total traffic. The largest share came from ChatGPT-User (6,687 requests), which fetches pages live to answer a user's question rather than crawling for training data. The blog's developer found that many AI crawlers do not execute JavaScript, which makes structured data injected client-side invisible to them, and built an open-source tool to audit this gap.", "body_md": "I run a small Home Assistant / self-hosting blog. On a normal day a few dozen humans show up. So when I finally grepped my nginx logs for AI crawlers, the number made me stop: in nine days, AI bots hit the site **18,209 times**. On a blog this size, the machines reading me now outnumber the people.\n\nHere's the full breakdown, the things that surprised me, and a few points most *\"should I block AI bots?\"* threads get wrong.\n\nOf **348,667** total requests, **18,209 (5.2%)** came from AI/LLM user-agents:\n\n| Bot | Requests | What it actually is |\n|---|---|---|\n| ChatGPT-User | 6,687 | OpenAI: live fetch when someone asks ChatGPT about a page |\n| Bytespider | 3,369 | ByteDance / TikTok crawler |\n| meta-externalagent | 3,274 | Meta AI |\n| Amazonbot | 1,923 | Amazon |\n| OAI-SearchBot | 1,211 | OpenAI search index |\n| ClaudeBot | 850 | Anthropic: training / index crawler |\n| PerplexityBot | 319 | Perplexity |\n| DuckAssistBot | 225 | DuckDuckGo AI |\n| GPTBot | 172 | OpenAI training crawler |\n| CCBot | 86 | Common Crawl (feeds many models) |\n| YouBot | 68 | You.com |\n\nThe biggest source by far, **ChatGPT-User** with 6,687 requests, isn't crawling to train a model. It's a *live fetch*: someone asked ChatGPT a question, ChatGPT decided my page was relevant, and pulled it in real time to answer them. 
Same story with `Perplexity-User` and the other assistant-side fetchers.\n\nThat flips the *\"should I block it?\"* math. ChatGPT-User isn't scraping you. It's a real person, through their assistant, reading your page right now. Block it and you don't stop any training; you just stop showing up in answers and lose the visit. (I can see the other end of it in my analytics: real sessions arriving from `claude.ai` and `gemini.google.com`.)\n\nSo the mental model *\"AI bot = scraper to block\"* is wrong for a big chunk of the traffic. There are **training crawlers** (GPTBot, ClaudeBot, CCBot) and there are **live answer-engine fetchers** (ChatGPT-User, Perplexity-User, DuckAssistBot). Treating them the same is the mistake.\n\nI checked who actually requests `/robots.txt`.\n\nThe practical upshot: a blanket `Disallow` is obeyed by the well-behaved **training crawlers** and ignored by the **user-fetchers**, because robots.txt was never meant for them. If your goal is *\"don't feed training, but keep appearing in answers,\"* the default already does roughly the right thing, but only for the bots that honour it.\n\nThis is the one that actually cost me, and the reason I ended up building a little tool around it.\n\nAlmost every AI crawler fetches your **raw server HTML and does not execute JavaScript**. So if your framework injects JSON-LD structured data on the client, or your streaming-SSR setup flushes meta tags into the `<body>` instead of the initial `<head>`, those signals are **invisible** to the crawler, even though Google renders them and every SEO browser extension tells you you're fine.\n\nI found six pages on my own site doing exactly that, and only because I built a crawler that deliberately parses the **JS-less view** and diffs it against the hydrated DOM. Googlebot renders; GPTBot and ClaudeBot mostly don't. 
If you care about being represented correctly in AI answers, your structured data and metadata have to live in the **server HTML**, not get painted on after hydration.\n\nFor a blog with a few dozen human readers a day, AI crawlers are now the single largest non-search audience hitting the server. They aren't going away, so it's worth knowing which ones are reading you, and making sure they can actually see what you publish.\n\n*The JS-less-view crawler I used is open-source (seo-geo-audit). It flags exactly this gap, plus the usual SEO checks, in plain Node with no paid dependencies.*", "url": "https://wpnews.pro/news/what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs", "canonical_source": "https://dev.to/cloudapp_dev/what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs-4nf0", "published_at": "2026-06-28 14:03:43+00:00", "updated_at": "2026-06-28 14:33:53.127621+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "developer-tools"], "entities": ["OpenAI", "ByteDance", "Meta", "Amazon", "Anthropic", "Perplexity", "DuckDuckGo", "Common Crawl"], "alternates": {"html": "https://wpnews.pro/news/what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs", "markdown": "https://wpnews.pro/news/what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs.md", "text": "https://wpnews.pro/news/what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs.txt", "jsonld": "https://wpnews.pro/news/what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs.jsonld"}}