# Your recurring scraper is re-downloading data that didn't change. Here's the 15-line fix (conditional GET)

> Source: <https://dev.to/0012303/your-recurring-scraper-is-re-downloading-data-that-didnt-change-heres-the-15-line-fix-25lc>
> Published: 2026-05-26 01:12:12+00:00

Note: This is a cross-post. The canonical version (full long-form) lives on my blog: [https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/](https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/)

The "ethical scraping" debate keeps arguing about robots.txt and ToS. Those are real, but they're decisions you make *once*, before the first request. They tell you nothing about run 200, 600, or 900 — and that's where you actually load someone's server and where you actually get banned. (Good prompt for this post: Federico Trotta's ["How to Scrape Open-Source Datasets Ethically"](https://substack.thewebscraping.club/p/how-to-scrape-open-source-datasets) on The Web Scraping Club, May 24, 2026 — his line that a scraper "that would barely register as noise on Amazon's servers could genuinely degrade performance for a public data portal" is the part the robots.txt debate keeps skipping.)

After **2,190 production scrapes** across 32 scrapers (the busiest, a Trustpilot review scraper, has **962 runs** on its own), I'm convinced of one thing: on a real schedule, "polite to the source" and "doesn't get banned" stop being two questions and become one. And the answer is mostly **conditional GET** plus a sane rate limit — not a robots.txt checkbox.

Where those numbers come from: my own Apify dashboard ([apify.com/knotless_cadence](https://apify.com/knotless_cadence)), as of May 2026. 2,190 = total runs summed across my 32 published actors; 962 = the Trustpilot scraper's own lifetime counter. Raw platform numbers, not sampled or extrapolated.

This is the practical, code-first version. The long-form reasoning (and what 962 runs against one site actually taught me) is on the canonical post above.

Conditional GET isn't a hack. It's in the HTTP standard ([RFC 9110 §13](https://httpwg.org/specs/rfc9110.html), which replaced the older, focused [RFC 7232: Conditional Requests](https://datatracker.ietf.org/doc/html/rfc7232)). Most servers will tell you whether a page changed *before* sending the body, for free, if you ask right:

1. The server sends `ETag` and/or `Last-Modified` on the response.
2. You send those values back as `If-None-Match` / `If-Modified-Since` on the next request.
3. If nothing changed, the server answers `304 Not Modified` with no body.

A `304` is the most considerate response you can get: you confirmed there's no new data without making the server render and ship a page you already have. You also stop feeding duplicate rows into your pipeline.
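You can watch that exchange end to end without touching anyone's server. The sketch below is only an illustration under stated assumptions: it starts a throwaway local server with Python's standard library, and the names `Handler`, `fetch`, `BODY`, and `ETAG` are made up for the example. The toy server only does an exact match on `If-None-Match`; real servers also handle lists of tags and weak validators.

```python
import hashlib
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html>same page as last time</html>"
# An ETag is a quoted opaque string; a hash of the body is one common choice.
ETAG = '"' + hashlib.sha256(BODY).hexdigest()[:16] + '"'


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            # The client already holds this version: status line and headers only.
            self.send_response(304)
            self.send_header("ETag", ETAG)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", ETAG)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # silence per-request logging for the demo


# Bypass any configured proxy, since the demo only talks to localhost.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url, etag=None):
    """Return (status, etag, body_length), sending If-None-Match if etag is known."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with OPENER.open(req, timeout=5) as resp:
            return resp.status, resp.headers.get("ETag"), len(resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises on any non-2xx status, including 304; not a failure here.
        err.close()
        if err.code == 304:
            return 304, etag, 0
        raise


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/page"

first = fetch(url)                    # no validator yet: full download
second = fetch(url, etag=first[1])    # validator sent back: no body returned
server.shutdown()
server.server_close()

print(first[0], second[0])  # 200 304
```

The second call transfers zero body bytes, which is the whole point: the check costs the server a header comparison instead of a full response.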

Plain `httpx`. Persists its cache to disk so it survives across runs. Throttles itself so it doesn't hammer one host. `requests` works identically: same header names, same `304`.

``` python
import time
import json
import os
import hashlib
import httpx

class PoliteFetcher:
    """Conditional-GET fetcher.

    Stores each URL's ETag / Last-Modified, sends them back as
    If-None-Match / If-Modified-Since on the next fetch, and sleeps
    `min_interval` seconds between hits to keep load on the source low.

    A 304 response means: nothing changed, no body sent, skip parsing.
    """

    def __init__(self, cache_path="cache.json", min_interval=1.0,
                 user_agent="polite-scraper/1.0 (+you@example.com)"):
        self.cache_path = cache_path
        self.min_interval = min_interval
        self.user_agent = user_agent
        self._last_hit = 0.0
        self.cache = {}
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                self.cache = json.load(f)

    def _throttle(self):
        wait = self.min_interval - (time.monotonic() - self._last_hit)
        if wait > 0:
            time.sleep(wait)
        self._last_hit = time.monotonic()

    def get(self, url):
        meta = self.cache.get(url, {})
        headers = {"User-Agent": self.user_agent}
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]

        self._throttle()
        r = httpx.get(url, headers=headers, timeout=20)

        if r.status_code == 304:
            # No new data. The server did almost no work. Reuse what we have.
            return {"status": 304, "changed": False,
                    "body_hash": meta.get("body_hash")}

        if r.status_code == 200:
            body_hash = hashlib.sha256(r.content).hexdigest()
            self.cache[url] = {
                "etag": r.headers.get("etag"),
                "last_modified": r.headers.get("last-modified"),
                "body_hash": body_hash,
            }
            with open(self.cache_path, "w") as f:
                json.dump(self.cache, f)
            return {"status": 200, "changed": True,
                    "body_hash": body_hash, "content": r.content}

        # Anything else (3xx, since httpx does not follow redirects by default,
        # or 4xx / 5xx): let the caller decide on retry/backoff.
        return {"status": r.status_code, "changed": None, "body_hash": None}
```

`httpbingo.org` has an `/etag/{tag}` endpoint that hands back an ETag and honors `If-None-Match`:

``` python
f = PoliteFetcher(min_interval=0.5)
url = "https://httpbingo.org/etag/demo123"

print(f.get(url)["status"])   # 200  -> first time, full download
print(f.get(url)["status"])   # 304  -> server says "you already have it"
print(f.get(url)["status"])   # 304  -> still nothing new
```

Output when I ran it (full result dicts, with `content` left out and the hash redacted):

```
run 1: {'status': 200, 'changed': True,  'body_hash': '<your-hash>'}
run 2: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}
run 3: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}
```

Your `body_hash` will differ: httpbingo echoes your request headers (User-Agent, timestamps) into the body, so the hex is yours, not mine. What's reproducible is the status sequence `200 → 304 → 304`, not the hash.
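The stored `body_hash` can also cover sources that answer `200` every time because they ignore conditional headers. Here is a minimal sketch, assuming result dicts shaped like the ones `PoliteFetcher.get()` returns. `needs_parsing` and `previous_hash` are hypothetical names, not part of the class above, and the sample dicts are hand-written rather than real responses.

```python
def needs_parsing(result, previous_hash):
    """Decide whether a fetch result carries data worth re-parsing.

    `result` is assumed to look like the dicts returned by PoliteFetcher.get();
    `previous_hash` is the body hash recorded before this fetch, or None.
    """
    if result["status"] == 304:
        return False  # server confirmed nothing changed
    if result["status"] != 200:
        return False  # nothing to parse; the caller handles retry/backoff
    # A 200 with an identical body: the server ignored our validators,
    # but the bytes are the same, so the parse step can be skipped.
    return result["body_hash"] != previous_hash


# Hypothetical results, written by hand for illustration.
print(needs_parsing({"status": 304, "body_hash": "abc"}, "abc"))  # False
print(needs_parsing({"status": 200, "body_hash": "abc"}, "abc"))  # False
print(needs_parsing({"status": 200, "body_hash": "def"}, "abc"))  # True
print(needs_parsing({"status": 200, "body_hash": "abc"}, None))   # True
```

Two caveats. Read the previous hash (for example `fetcher.cache.get(url, {}).get("body_hash")`) *before* calling `get()`, because a `200` overwrites the cached entry. And this fallback only saves parsing and duplicate rows, not bandwidth: the body was already downloaded. Pages that embed timestamps or per-request tokens will hash differently on every fetch.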

The `_throttle()` above is deliberately dumb: one fixed delay between consecutive requests, whichever host they go to. You usually don't need clever. You need a delay a human reading the access log wouldn't flinch at. Three rules I actually follow:

`429` / `Retry-After`.

None of these live in robots.txt. The ethical rate limit lives in your code.
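One way to act on `429` / `Retry-After` in code, sketched under assumptions: RFC 9110 lets `Retry-After` be either a number of seconds or an HTTP-date, so the helper below accepts both. The name `retry_after_seconds` and the `default` and `cap` values are mine, chosen for illustration, not numbers from my scrapers.

```python
import email.utils
from datetime import datetime, timezone


def retry_after_seconds(header_value, now=None, default=60.0, cap=3600.0):
    """Turn a Retry-After header value into a number of seconds to wait.

    Accepts the two forms RFC 9110 allows: delay-seconds or an HTTP-date.
    `default` (used when the header is missing or unparseable) and `cap`
    are arbitrary illustrative values.
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return min(float(value), cap)
    try:
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # not a date either: fall back rather than guess
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now"; never return a negative wait.
    return min(max((when - now).total_seconds(), 0.0), cap)


print(retry_after_seconds("120"))  # 120.0
print(retry_after_seconds(None))   # 60.0
```

A caller that gets `status == 429` back could sleep for `retry_after_seconds(r.headers.get("retry-after"))` before trying that host again, instead of retrying on its normal interval.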

I can't hand you a ranked uptime table of named sites — I don't have clean enough per-source numbers to publish one without inventing it, and inventing numbers is the fastest way to make a scraping post worthless. What I can say from 2,190 runs: the sources that kept working were the ones where my scraper behaved like a considerate guest (conditional GET, a delay, an honest User-Agent). The ones I lost were usually the ones where I got greedy with concurrency or skipped conditional GET because "it's just a few thousand pages."

I know that last one from getting it wrong. The first version of one of my recurring scrapers had no conditional-GET layer. I skipped it, thinking "it's a couple thousand pages, I'll add caching later." Around run 200 (rough memory, not a logged number) it started catching throttling it hadn't before. I blamed the site for a week. Then I added the ETag / `If-None-Match` layer, the per-run request count dropped, and the throttling stopped. The bug was me.

That's a correlation, not a controlled experiment. Some of those lost-access incidents were probably the site changing its own defenses, nothing to do with me — and I can't cleanly separate those out, so I won't pretend the politeness *caused* the uptime. I'm not going to inflate it into an industry trend with a percentage either. But the direction isn't subtle: **politeness and persistence track together.** The scraper that's kind to the source is the one still running next quarter.

Full long-form (the reasoning, the 962-runs story, the Monday checklist): [https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/](https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/)

I've run 2,190 production scrapes across 32 scrapers (profile: [https://apify.com/knotless_cadence](https://apify.com/knotless_cadence)). If you need a recurring scraper that stays up instead of getting throttled on run 200, I build those. Tell me the source and the schedule: **spinov001@gmail.com**.

*Drafted with AI assistance, edited and fact-checked by me.*
