Your recurring scraper is re-downloading data that didn't change. Here's the 15-line fix (conditional GET)

After 2,190 production runs across 32 scrapers, a developer found that conditional GET requests and rate limiting, not robots.txt compliance, are the key to ethical web scraping that avoids server strain and bans. The 15-line Python fix stores the ETag and Last-Modified headers from each response, then sends them back as If-None-Match and If-Modified-Since on subsequent requests, allowing servers to return a 304 status code when content hasn't changed. The approach, built on HTTP standards RFC 9110 and RFC 7232, prevents redundant downloads and duplicate data pipeline entries while minimizing server load.

Note: This is a cross-post. The canonical, full long-form version lives on my blog: https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/

The "ethical scraping" debate keeps arguing about robots.txt and ToS. Those are real, but they're decisions you make once, before the first request. They tell you nothing about run 200, 600, or 900, and that's where you actually load someone's server and where you actually get banned.

Good prompt for this post: Federico Trotta's "How to Scrape Open-Source Datasets Ethically" (https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets) on The Web Scraping Club, May 24, 2026. His line that a scraper "that would barely register as noise on Amazon's servers could genuinely degrade performance for a public data portal" is the part the robots.txt debate keeps skipping.

After 2,190 production scrapes across 32 scrapers (the busiest, a Trustpilot review scraper, has 962 runs on its own), I'm convinced of one thing: on a real schedule, "polite to the source" and "doesn't get banned" stop being two questions and become one. And the answer is mostly conditional GET plus a sane rate limit, not a robots.txt checkbox.
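To make the "run 200, 600, or 900" point concrete, here is a minimal offline sketch of the ETag handshake over a schedule of repeated runs. Everything in it is an assumption for illustration: `toy_origin` is a hypothetical stand-in for a server that supports ETag validation, `run_schedule` is a hypothetical client loop, and the page bodies are made up. It makes no network calls and is not the fetcher from this post.

```python
import hashlib


def toy_origin(body, request_headers):
    """Hypothetical stand-in for a server that supports ETag validation."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""  # validator matches: no body sent
    return 200, {"ETag": etag}, body     # first visit, or the content changed


def run_schedule(versions):
    """Fetch once per scheduled run; return (status codes, body bytes received)."""
    saved_etag = None
    statuses, bytes_received = [], 0
    for body in versions:
        headers = {"If-None-Match": saved_etag} if saved_etag else {}
        status, resp_headers, payload = toy_origin(body, headers)
        if status == 200:
            saved_etag = resp_headers["ETag"]  # remember the validator for next run
        statuses.append(status)
        bytes_received += len(payload)
    return statuses, bytes_received


# Five scheduled runs; the made-up page changes only once, on run 4.
page_v1 = b"<html>reviews: 10</html>"
page_v2 = b"<html>reviews: 11</html>"
schedule = [page_v1, page_v1, page_v1, page_v2, page_v2]

statuses, received = run_schedule(schedule)
unconditional = sum(len(body) for body in schedule)
print(statuses)
print(received, "body bytes with conditional GET vs", unconditional, "without")
```

In this toy schedule only the first run and the run where the page changed transfer a body; every other run is a 304 with nothing to download or parse.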
Where those numbers come from: my own Apify dashboard, https://apify.com/knotless_cadence, as of May 2026. 2,190 = total runs summed across my 32 published actors; 962 = the Trustpilot scraper's own lifetime counter. Raw platform numbers, not sampled or extrapolated.

This is the practical, code-first version. The long-form reasoning (and what 962 runs against one site actually taught me) is in the canonical post linked above.

Conditional GET is not a hack: it's in the HTTP standard, RFC 9110 §13 (https://httpwg.org/specs/rfc9110.html), and in the older, more focused RFC 7232: Conditional Requests (https://datatracker.ietf.org/doc/html/rfc7232). Most servers will tell you whether a page changed before sending the body, for free, if you ask right:

- The server sends `ETag` and/or `Last-Modified` on the response.
- You send those values back as `If-None-Match` / `If-Modified-Since` on the next request.
- If nothing changed, the server answers `304 Not Modified` with no body.

A 304 is the most considerate response you can get: you confirmed there's no new data without making the server render and ship a page you already have. You also stop feeding duplicate rows into your pipeline.

The fetcher below uses plain `httpx`. It persists its cache to disk so it survives across runs, and it throttles itself so it doesn't hammer one host. `requests` works identically: same header names, same `304`.

```python
import time
import json
import os
import hashlib

import httpx


class PoliteFetcher:
    """Conditional-GET fetcher.

    Stores each URL's ETag / Last-Modified, sends them back as
    If-None-Match / If-Modified-Since on the next fetch, and sleeps
    min_interval seconds between hits to keep load on the source low.
    A 304 response means: nothing changed, no body sent, skip parsing.
    """

    def __init__(self, cache_path="cache.json", min_interval=1.0,
                 user_agent="polite-scraper/1.0 (+you@example.com)"):
        self.cache_path = cache_path
        self.min_interval = min_interval
        self.user_agent = user_agent
        self._last_hit = 0.0
        self.cache = {}
        # Reload validators saved by a previous run, if any.
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                self.cache = json.load(f)

    def _throttle(self):
        # Sleep just long enough to keep min_interval seconds between requests.
        wait = self.min_interval - (time.monotonic() - self._last_hit)
        if wait > 0:
            time.sleep(wait)
        self._last_hit = time.monotonic()

    def get(self, url):
        meta = self.cache.get(url, {})
        headers = {"User-Agent": self.user_agent}
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]

        self._throttle()
        r = httpx.get(url, headers=headers, timeout=20)

        if r.status_code == 304:
            # No new data. The server did almost no work. Reuse what we have.
            return {"status": 304, "changed": False,
                    "body_hash": meta.get("body_hash")}

        if r.status_code == 200:
            body_hash = hashlib.sha256(r.content).hexdigest()
            self.cache[url] = {
                "etag": r.headers.get("etag"),
                "last_modified": r.headers.get("last-modified"),
                "body_hash": body_hash,
            }
            # Persist after every successful fetch so the cache survives across runs.
            with open(self.cache_path, "w") as f:
                json.dump(self.cache, f)
            return {"status": 200, "changed": True,
                    "body_hash": body_hash, "content": r.content}

        # 4xx / 5xx: let the caller decide on retry/backoff.
        return {"status": r.status_code, "changed": None, "body_hash": None}
```

httpbingo.org has an `/etag/{tag}` endpoint that hands back an ETag and honors `If-None-Match`:

```python
f = PoliteFetcher(min_interval=0.5)
url = "https://httpbingo.org/etag/demo123"

print(f.get(url)["status"])  # 200 - first time, full download
print(f.get(url)["status"])  # 304 - server says "you already have it"
print(f.get(url)["status"])  # 304 - still nothing new
```

Output when I ran it:

run 1: {'status': 200, 'changed': True, 'body hash': '