# What AI Crawlers Actually Do to a Small Blog: 9 Days of Logs

> Source: <https://dev.to/cloudapp_dev/what-ai-crawlers-actually-do-to-a-small-blog-9-days-of-logs-4nf0>
> Published: 2026-06-28 14:03:43+00:00

I run a small Home Assistant / self-hosting blog. On a normal day a few dozen humans show up. So when I finally grepped my nginx logs for AI crawlers, the number made me stop: in nine days, AI bots hit the site **18,209 times**. On a blog this size, the machines reading me now outnumber the people.

Here's the full breakdown, the things that surprised me, and a few points most *"should I block AI bots?"* threads get wrong.

Of **348,667** total requests, **18,209 (5.2%)** came from AI/LLM user-agents:

| Bot | Requests | What it actually is |
|---|---|---|
| ChatGPT-User | 6,687 | OpenAI: live fetch when someone asks ChatGPT about a page |
| Bytespider | 3,369 | ByteDance / TikTok crawler |
| meta-externalagent | 3,274 | Meta AI |
| Amazonbot | 1,923 | Amazon |
| OAI-SearchBot | 1,211 | OpenAI search index |
| ClaudeBot | 850 | Anthropic — training / index crawler |
| PerplexityBot | 319 | Perplexity |
| DuckAssistBot | 225 | DuckDuckGo AI |
| GPTBot | 172 | OpenAI training crawler |
| CCBot | 86 | Common Crawl (feeds many models) |
| YouBot | 68 | You.com |
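
A count like the table above can be reproduced with a few lines of Python. This is a minimal sketch, not the exact command I ran: it assumes nginx's default `combined` log format (user-agent as the last quoted field), and the function name `count_ai_bots` is made up for illustration. The bot names are the ones from the table.

```python
import re
from collections import Counter

# User-agent substrings taken from the table above; matched case-insensitively.
AI_BOTS = [
    "ChatGPT-User", "Bytespider", "meta-externalagent", "Amazonbot",
    "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "DuckAssistBot",
    "GPTBot", "CCBot", "YouBot",
]

# Assumption: "combined" log format, where the user-agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def count_ai_bots(lines):
    """Return (total_requests, Counter of requests per AI bot)."""
    counts = Counter()
    total = 0
    for line in lines:
        total += 1
        match = UA_RE.search(line)
        if not match:
            continue  # line does not end in a quoted field; skip it
        user_agent = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                counts[bot] += 1
                break  # count each request once
    return total, counts
```

To use it, pass any iterable of log lines, for example an open file handle on your access log (the path depends on your setup), then print `counts.most_common()`.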

The biggest source by far (**ChatGPT-User, 6,687 requests**) isn't crawling to train a model. It's a *live fetch*: someone asked ChatGPT a question, ChatGPT decided my page was relevant, and pulled it in real time to answer them. Same story with `Perplexity-User` and the other assistant-side fetchers.

That flips the *"should I block it?"* math. ChatGPT-User isn't scraping you: it's a real person, through their assistant, reading your page right now. Block it and you don't stop any training; you just stop showing up in answers and lose the visit. (I can see the other end of it in my analytics: real sessions arriving from `claude.ai` and `gemini.google.com`.)

So the mental model *"AI bot = scraper to block"* is wrong for a big chunk of the traffic. There are **training crawlers** (GPTBot, ClaudeBot, CCBot) and there are **live answer-engine fetchers** (ChatGPT-User, Perplexity-User, DuckAssistBot). Treating them the same is the mistake.
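
To put rough numbers on that split, here is a small sketch that buckets the request counts from the table by the two groups named above. The bucket names and the `totals_by_bucket` helper are illustrative. Bots I haven't assigned to either group in this post are left as `unclassified` rather than guessed at, and `Perplexity-User` has no row in the table, so it contributes nothing here.

```python
# Request counts copied from the table earlier in the post.
REQUESTS = {
    "ChatGPT-User": 6687, "Bytespider": 3369, "meta-externalagent": 3274,
    "Amazonbot": 1923, "OAI-SearchBot": 1211, "ClaudeBot": 850,
    "PerplexityBot": 319, "DuckAssistBot": 225, "GPTBot": 172,
    "CCBot": 86, "YouBot": 68,
}

# The two groups exactly as named in the paragraph above.
TRAINING = {"GPTBot", "ClaudeBot", "CCBot"}
LIVE_FETCH = {"ChatGPT-User", "Perplexity-User", "DuckAssistBot"}


def bucket(bot):
    """Map a bot name to 'training', 'live_fetch', or 'unclassified'."""
    if bot in TRAINING:
        return "training"
    if bot in LIVE_FETCH:
        return "live_fetch"
    return "unclassified"


def totals_by_bucket(requests):
    """Sum request counts per bucket."""
    totals = {"training": 0, "live_fetch": 0, "unclassified": 0}
    for bot, count in requests.items():
        totals[bucket(bot)] += count
    return totals


print(totals_by_bucket(REQUESTS))
# {'training': 1108, 'live_fetch': 6912, 'unclassified': 10164}
```

On these numbers the live fetchers outweigh the named training crawlers by roughly six to one.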

I checked who actually requests `/robots.txt`:

The practical upshot: a blanket `Disallow` is obeyed by the well-behaved **training crawlers** and ignored by the **user-fetchers**, because robots.txt was never meant for them. If your goal is *"don't feed training, but keep appearing in answers,"* the default already does roughly the right thing, but only for the bots that honour it.
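
If you would rather spell that goal out than rely on a blanket rule, one option is to name the training crawlers explicitly and leave everything else allowed. The sketch below is an assumption about what such a policy could look like, not the robots.txt I run: the agent names come from this post, `example.com` is a placeholder, and the `allowed` helper is hypothetical. It uses Python's standard `urllib.robotparser` to show how a compliant bot would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: refuse the training crawlers named in this post,
# allow every other user-agent.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(agent, url="https://example.com/some-post", robots_txt=ROBOTS_TXT):
    """Return True if a robots.txt-compliant `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "ClaudeBot", "CCBot", "ChatGPT-User"):
    print(agent, allowed(agent))
# GPTBot False
# ClaudeBot False
# CCBot False
# ChatGPT-User True
```

Keep in mind this only models what a compliant bot would do. robots.txt is advisory, so a fetcher that never requests the file is unaffected either way.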

This is the one that actually cost me, and the reason I ended up building a little tool around it.

Almost every AI crawler fetches your **raw server HTML and does not execute JavaScript**. So if your framework injects JSON-LD structured data on the client, or your streaming-SSR setup flushes meta tags into the `<body>` instead of the initial `<head>`, those signals are **invisible** to the crawler, even though Google renders them and every SEO browser extension tells you you're fine.

I found six pages on my own site doing exactly that, and only because I built a crawler that deliberately parses the **JS-less view** and diffs it against the hydrated DOM. Googlebot renders; GPTBot and ClaudeBot mostly don't. If you care about being represented correctly in AI answers, your structured data and metadata have to live in the **server HTML**, not get painted on after hydration.
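
The JS-less half of that check is easy to approximate with the Python standard library. The sketch below is not the tool mentioned in this post, just an illustration under simple assumptions: it parses raw HTML without running any scripts, counts JSON-LD blocks, and records whether each named `<meta>` tag appears in `<head>` or `<body>`. The class and function names are made up, and the diff against the hydrated DOM is out of scope because that needs a real browser.

```python
from html.parser import HTMLParser


class RawHtmlAudit(HTMLParser):
    """Collect what a non-rendering crawler would see in the server HTML."""

    def __init__(self):
        super().__init__()
        self.section = None      # last of "head" / "body" seen so far
        self.jsonld_blocks = 0   # <script type="application/ld+json"> count
        self.meta_in_head = []   # meta name/property values outside <body>
        self.meta_in_body = []   # meta name/property values inside <body>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("head", "body"):
            self.section = tag
        elif tag == "script":
            script_type = (attributes.get("type") or "").lower()
            if script_type == "application/ld+json":
                self.jsonld_blocks += 1
        elif tag == "meta":
            name = attributes.get("name") or attributes.get("property")
            if name:
                if self.section == "body":
                    self.meta_in_body.append(name)
                else:
                    self.meta_in_head.append(name)


def audit_raw_html(html):
    """Parse raw HTML (no JavaScript executed) and return the audit object."""
    audit = RawHtmlAudit()
    audit.feed(html)
    audit.close()
    return audit
```

Feed it the response body you get from a plain HTTP fetch of your own page. A page with `jsonld_blocks == 0`, or with entries in `meta_in_body`, is a candidate for the problem described above.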

For a blog with a few dozen human readers a day, AI crawlers are now the single largest non-search audience hitting the server. They aren't going away — so it's worth knowing which ones are reading you, and making sure they can actually see what you publish.

*The JS-less-view crawler I used is open-source (seo-geo-audit). It flags exactly this gap, plus the usual SEO checks, in plain Node with no paid dependencies.*
