{"slug": "your-recurring-scraper-is-re-downloading-data-that-didn-t-change-here-s-the-15", "title": "Your recurring scraper is re-downloading data that didn't change. Here's the 15-line fix (conditional GET)", "summary": "After 2,190 production runs across 32 scrapers, a developer found that conditional GET requests and rate limiting — not robots.txt compliance — are the key to ethical web scraping that avoids server strain and bans. The 15-line Python fix stores ETag and Last-Modified headers from each response, then sends them back as If-None-Match and If-Modified-Since on subsequent requests, allowing servers to return a 304 status code when content hasn't changed. The approach, built on HTTP standards RFC 9110 and RFC 7232, prevents redundant downloads and duplicate data pipeline entries while minimizing server load.", "body_md": "Note: This is a cross-post. Canonical version (full long-form) lives on my blog: [https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/](https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/)\n\nThe \"ethical scraping\" debate keeps arguing about robots.txt and ToS. Those are real, but they're decisions you make *once*, before the first request. They tell you nothing about run 200, 600, or 900 — and that's where you actually load someone's server and where you actually get banned. (Good prompt for this post: Federico Trotta's [\"How to Scrape Open-Source Datasets Ethically\"](https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets) on The Web Scraping Club, May 24, 2026 — his line that a scraper \"that would barely register as noise on Amazon's servers could genuinely degrade performance for a public data portal\" is the part the robots.txt debate keeps skipping.)\n\nAfter **2,190 production scrapes** across 32 scrapers (the busiest, a Trustpilot review scraper, has **962 runs** on its own), I'm convinced of one thing: on a real schedule, \"polite to the source\" and \"doesn't get banned\" stop being two questions and become one. 
And the answer is mostly **conditional GET** plus a sane rate limit — not a robots.txt checkbox.\n\nWhere those numbers come from: my own Apify dashboard ([apify.com/knotless_cadence](https://apify.com/knotless_cadence)), as of May 2026. 2,190 = total runs summed across my 32 published actors; 962 = the Trustpilot scraper's own lifetime counter. Raw platform numbers, not sampled or extrapolated.\n\nThis is the practical, code-first version. The long-form reasoning (and what 962 runs against one site actually taught me) is on the canonical post above.\n\nConditional GET is not a hack: it's in the HTTP standard ([RFC 9110 §13](https://httpwg.org/specs/rfc9110.html), and the older, focused [RFC 7232: Conditional Requests](https://datatracker.ietf.org/doc/html/rfc7232), which RFC 9110 obsoletes). Most servers will tell you whether a page changed *before* sending the body — for free — if you ask right:\n\n1. The server sends `ETag` and/or `Last-Modified` on the response.\n2. You send those values back as `If-None-Match` / `If-Modified-Since` on the next request.\n3. If nothing changed, the server answers `304 Not Modified` with no body.\n\nA `304` is the most considerate response you can get: you confirmed there's no new data without making the server render and ship a page you already have. You also stop feeding duplicate rows into your pipeline.\n\nPlain `httpx`. Persists its cache to disk so it survives across runs. Throttles itself so it doesn't hammer one host. 
`requests` works identically — same header names, same `304`.\n\n```python\nimport time\nimport json\nimport os\nimport hashlib\nimport httpx\n\nclass PoliteFetcher:\n    \"\"\"Conditional-GET fetcher.\n\n    Stores each URL's ETag / Last-Modified, sends them back as\n    If-None-Match / If-Modified-Since on the next fetch, and sleeps\n    `min_interval` seconds between hits to keep load on the source low.\n\n    A 304 response means: nothing changed, no body sent, skip parsing.\n    \"\"\"\n\n    def __init__(self, cache_path=\"cache.json\", min_interval=1.0,\n                 user_agent=\"polite-scraper/1.0 (+you@example.com)\"):\n        self.cache_path = cache_path\n        self.min_interval = min_interval\n        self.user_agent = user_agent\n        self._last_hit = 0.0\n        self.cache = {}\n        if os.path.exists(cache_path):\n            with open(cache_path) as f:\n                self.cache = json.load(f)\n\n    def _throttle(self):\n        wait = self.min_interval - (time.monotonic() - self._last_hit)\n        if wait > 0:\n            time.sleep(wait)\n        self._last_hit = time.monotonic()\n\n    def get(self, url):\n        meta = self.cache.get(url, {})\n        headers = {\"User-Agent\": self.user_agent}\n        if meta.get(\"etag\"):\n            headers[\"If-None-Match\"] = meta[\"etag\"]\n        if meta.get(\"last_modified\"):\n            headers[\"If-Modified-Since\"] = meta[\"last_modified\"]\n\n        self._throttle()\n        r = httpx.get(url, headers=headers, timeout=20)\n\n        if r.status_code == 304:\n            # No new data. The server did almost no work. 
Reuse what we have.\n            return {\"status\": 304, \"changed\": False,\n                    \"body_hash\": meta.get(\"body_hash\")}\n\n        if r.status_code == 200:\n            body_hash = hashlib.sha256(r.content).hexdigest()\n            self.cache[url] = {\n                \"etag\": r.headers.get(\"etag\"),\n                \"last_modified\": r.headers.get(\"last-modified\"),\n                \"body_hash\": body_hash,\n            }\n            with open(self.cache_path, \"w\") as f:\n                json.dump(self.cache, f)\n            return {\"status\": 200, \"changed\": True,\n                    \"body_hash\": body_hash, \"content\": r.content}\n\n        # 4xx / 5xx — let the caller decide on retry/backoff.\n        return {\"status\": r.status_code, \"changed\": None, \"body_hash\": None}\n```\n\n`httpbingo.org` has an `/etag/{tag}` endpoint that hands back an ETag and honors `If-None-Match`:\n\n```python\nf = PoliteFetcher(min_interval=0.5)\nurl = \"https://httpbingo.org/etag/demo123\"\n\nprint(f.get(url)[\"status\"])   # 200  -> first time, full download\nprint(f.get(url)[\"status\"])   # 304  -> server says \"you already have it\"\nprint(f.get(url)[\"status\"])   # 304  -> still nothing new\n```\n\nOutput when I ran it, printing each full result (minus `content`) instead of just the status:\n\n```\nrun 1: {'status': 200, 'changed': True,  'body_hash': '<your-hash>'}\nrun 2: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}\nrun 3: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}\n```\n\nYour `body_hash` will differ — httpbingo echoes your request headers (User-Agent, timestamps) into the body, so the hex is yours, not mine. What's reproducible is the status sequence `200 → 304 → 304`, not the hash.\n\nThe `_throttle()` above is deliberately dumb: one fixed delay between consecutive requests, regardless of host. You usually don't need clever. You need a delay a human reading the access log wouldn't flinch at. Three rules I actually follow:\n\n`429` / `Retry-After`.\n\nNone of these live in robots.txt. 
The ethical rate limit lives in your code.\n\nI can't hand you a ranked uptime table of named sites — I don't have clean enough per-source numbers to publish one without inventing it, and inventing numbers is the fastest way to make a scraping post worthless. What I can say from 2,190 runs: the sources that kept working were the ones where my scraper behaved like a considerate guest (conditional GET, a delay, an honest User-Agent). The ones I lost were usually the ones where I got greedy with concurrency or skipped conditional GET because \"it's just a few thousand pages.\"\n\nI know that last one from getting it wrong. The first version of one of my recurring scrapers had no conditional-GET layer — I skipped it thinking \"it's a couple thousand pages, I'll add caching later.\" Around run 200 (rough memory, not a logged number) it started catching throttling it hadn't before. I blamed the site for a week. Then I added the ETag / `If-None-Match` layer, the per-run request count dropped, and the throttling stopped. The bug was me.\n\nThat's a correlation, not a controlled experiment. Some of those lost-access incidents were probably the site changing its own defenses, nothing to do with me — and I can't cleanly separate those out, so I won't pretend the politeness *caused* the uptime. I'm not going to inflate it into an industry trend with a percentage either. But the direction isn't subtle: **politeness and persistence track together.** The scraper that's kind to the source is the one still running next quarter.\n\nFull long-form (the reasoning, the 962-runs story, the Monday checklist): [https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/](https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/)\n\nI've run 2,190 production scrapes across 32 scrapers (profile: [https://apify.com/knotless_cadence](https://apify.com/knotless_cadence)). 
If you need a recurring scraper that stays up instead of getting throttled on run 200, I build those — tell me the source and the schedule: **spinov001@gmail.com**.\n\n*Drafted with AI assistance, edited and fact-checked by me.*", "url": "https://wpnews.pro/news/your-recurring-scraper-is-re-downloading-data-that-didn-t-change-here-s-the-15", "canonical_source": "https://dev.to/0012303/your-recurring-scraper-is-re-downloading-data-that-didnt-change-heres-the-15-line-fix-25lc", "published_at": "2026-05-26 01:12:12+00:00", "updated_at": "2026-05-26 01:33:24.333083+00:00", "lang": "en", "topics": ["ai-ethics"], "entities": ["Federico Trotta", "The Web Scraping Club", "Trustpilot", "Apify", "Amazon"], "alternates": {"html": "https://wpnews.pro/news/your-recurring-scraper-is-re-downloading-data-that-didn-t-change-here-s-the-15", "markdown": "https://wpnews.pro/news/your-recurring-scraper-is-re-downloading-data-that-didn-t-change-here-s-the-15.md", "text": "https://wpnews.pro/news/your-recurring-scraper-is-re-downloading-data-that-didn-t-change-here-s-the-15.txt", "jsonld": "https://wpnews.pro/news/your-recurring-scraper-is-re-downloading-data-that-didn-t-change-here-s-the-15.jsonld"}}