# What AI Crawlers Actually Do to a Small Blog: 9 Days of Logs

> Published: 2026-06-28 14:03:43+00:00

I run a small Home Assistant / self-hosting blog. On a normal day a few dozen humans show up. So when I finally grepped my nginx logs for AI crawlers, the number made me stop: in nine days, AI bots hit the site **18,209 times**. On a blog this size, the machines reading me now outnumber the people.

Here's the full breakdown, the things that surprised me, and a few points most *"should I block AI bots?"* threads get wrong.

Of **348,667** total requests, **18,209 (5.2%)** came from AI/LLM user-agents:

| Bot | Requests | What it actually is |
|---|---|---|
| ChatGPT-User | 6,687 | OpenAI: live fetch when someone asks ChatGPT about a page |
| Bytespider | 3,369 | ByteDance / TikTok crawler |
| meta-externalagent | 3,274 | Meta AI |
| Amazonbot | 1,923 | Amazon |
| OAI-SearchBot | 1,211 | OpenAI search index |
| ClaudeBot | 850 | Anthropic: training / index crawler |
| PerplexityBot | 319 | Perplexity |
| DuckAssistBot | 225 | DuckDuckGo AI |
| GPTBot | 172 | OpenAI training crawler |
| CCBot | 86 | Common Crawl (feeds many models) |
| YouBot | 68 | You.com |

The biggest source by far, **ChatGPT-User** with **6,687 requests**, isn't crawling to train a model. It's a *live fetch*: someone asked ChatGPT a question, ChatGPT decided my page was relevant, and pulled it in real time to answer them. Same story with `Perplexity-User` and the other assistant-side fetchers.

That flips the *"should I block it?"* math. ChatGPT-User isn't scraping you; it's a real person, through their assistant, reading your page right now. Block it and you don't stop any training; you just stop showing up in answers and lose the visit. (I can see the other end of it in my analytics: real sessions arriving from `claude.ai` and `gemini.google.com`.)

So the mental model *"AI bot = scraper to block"* is wrong for a big chunk of the traffic.
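
A per-bot tally like the one in the table can be reproduced with a few lines of Python. This is a minimal sketch, not the exact command I ran: it assumes nginx's default `combined` log format (user-agent as the last quoted field), matches on the bot names from the table as plain substrings, and the function name `tally_ai_bots` and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# User-agent substrings taken from the bots named in this post; extend as needed.
AI_BOT_TOKENS = [
    "ChatGPT-User", "Bytespider", "meta-externalagent", "Amazonbot",
    "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Perplexity-User",
    "DuckAssistBot", "GPTBot", "CCBot", "YouBot",
]

# Assumption: "combined" log format, where the user-agent is the last
# double-quoted field on the line. A custom log_format may need a new pattern.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def tally_ai_bots(lines):
    """Return (total_requests, Counter of requests per AI bot token)."""
    counts = Counter()
    total = 0
    for line in lines:
        total += 1
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break  # count each request once
    return total, counts


# Invented sample lines with simplified user-agent strings, for illustration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [20/Jun/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.6 - - [20/Jun/2026:10:00:01 +0000] "GET /b HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '203.0.113.7 - - [20/Jun/2026:10:00:02 +0000] "GET /c HTTP/1.1" 200 901 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"',
]

total, counts = tally_ai_bots(SAMPLE_LINES)
for bot, hits in counts.most_common():
    print(f"{bot}: {hits} of {total} requests")
```

To run it on a real log, pass an open file handle instead of `SAMPLE_LINES`; the function only iterates over lines, so it never loads the whole file into memory.
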
There are **training crawlers** (GPTBot, ClaudeBot, CCBot) and there are **live answer-engine fetchers** (ChatGPT-User, Perplexity-User, DuckAssistBot). Treating them the same is the mistake.

I checked who actually requests `/robots.txt`. The practical upshot: a blanket `Disallow` is obeyed by the well-behaved **training crawlers** and ignored by the **user-fetchers**, because robots.txt was never meant for them. If your goal is *"don't feed training, but keep appearing in answers,"* the default already does roughly the right thing, but only for the bots that honour it.

This is the one that actually cost me, and the reason I ended up building a little tool around it. Almost every AI crawler fetches your **raw server HTML and does not execute JavaScript**. So if your framework injects JSON-LD structured data on the client, or your streaming-SSR setup flushes meta tags into the `<body>` instead of the initial `<head>`, those signals are **invisible** to the crawler, even though Google renders them and every SEO browser extension tells you you're fine.

I found six pages on my own site doing exactly that, and only because I built a crawler that deliberately parses the **JS-less view** and diffs it against the hydrated DOM. Googlebot renders; GPTBot and ClaudeBot mostly don't. If you care about being represented correctly in AI answers, your structured data and metadata have to live in the **server HTML**, not get painted on after hydration.

For a blog with a few dozen human readers a day, AI crawlers are now the single largest non-search audience hitting the server. They aren't going away, so it's worth knowing which ones are reading you, and making sure they can actually see what you publish.

*The JS-less-view crawler I used is open-source (seo-geo-audit). It flags exactly this gap, plus the usual SEO checks, in plain Node with no paid dependencies.*
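
To make the JS-less check concrete, here is a rough sketch of the idea using only the Python standard library. It is not the seo-geo-audit code (that tool is Node); the class and function names and the sample page are hypothetical. It parses HTML exactly as served, never runs scripts, and reports whether JSON-LD is present and whether named meta tags sit inside `<head>` or arrive later.

```python
from html.parser import HTMLParser


class RawHtmlAudit(HTMLParser):
    """Collects what a non-rendering crawler would see in the served HTML."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.json_ld_blocks = 0
        self.head_meta = []   # meta names/properties found inside <head>
        self.late_meta = []   # meta names/properties found outside <head>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "script":
            # Only JSON-LD that exists in the served markup counts here.
            if (attr.get("type") or "").lower() == "application/ld+json":
                self.json_ld_blocks += 1
        elif tag == "meta":
            name = attr.get("name") or attr.get("property")
            if name:
                target = self.head_meta if self.in_head else self.late_meta
                target.append(name)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit_raw_html(html):
    """Return a small report for one page's server HTML (no JS executed)."""
    parser = RawHtmlAudit()
    parser.feed(html)
    parser.close()
    return {
        "json_ld_blocks": parser.json_ld_blocks,
        "head_meta": parser.head_meta,
        "late_meta": parser.late_meta,
    }


# Invented page: JSON-LD is injected by client-side JS and the description
# meta tag is flushed into the body, so neither is where a crawler expects it.
SAMPLE_HTML = """<!doctype html>
<html>
<head><title>Post</title></head>
<body>
<meta name="description" content="Streamed in late">
<script>injectJsonLd({"@type": "Article"});</script>
</body>
</html>"""

print(audit_raw_html(SAMPLE_HTML))
```

A page is suspect when `json_ld_blocks` is 0 or `late_meta` is non-empty while the rendered DOM in a browser shows both in place. Fetching the HTML (for example with `urllib.request.urlopen`) is left out so the sketch runs offline.
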