What are the best practices for detecting and fetching deltas from a dataset?

Hugging Face Hub lacks a first-class feature for row-level delta detection in datasets. Best practices involve tracking repository revisions, comparing file manifests between commits, and downloading only changed files or shards using Hub APIs like snapshot_download.

Hmm… my current read is that this does not exist as a first-class official feature yet. Depending on the dataset, there may be workable workarounds (just in case, pinging @lhoestq).

I would split this into two separate problems:

1. Detecting that the dataset repo changed
2. Deciding what the smallest safe fetch/reprocessing unit is

For detection, I would look at [Hub Webhooks](https://huggingface.co/docs/hub/webhooks) first, or poll the dataset repo SHA with `huggingface_hub` if webhooks are not an option.

For deltas, I would not assume a generic “fetch only the new/changed rows since last time” API for arbitrary HF datasets. The practical workaround I would try first is **revision/file/shard-level tracking**:

```text
last_processed_sha
  -> detect new_sha
  -> compare file manifests between the two revisions
  -> find added/modified/deleted data files
  -> download/reprocess only those changed files or shards
  -> advance last_processed_sha only after success
```

This is not the same as true row-level CDC, but it is a low-risk pattern that is testable with current Hub APIs.

## Suggested minimal workflow

### 1. Keep your own processing state

Do not rely only on the local Hub cache or on “latest seen revision”. Keep explicit downstream state, for example:

- `last_processed_sha`: the last dataset repo commit fully processed by your pipeline

Then:

- A webhook fires, or polling detects a newer SHA.
- Your worker compares `last_processed_sha` with the new SHA.
- Your pipeline processes the changed files/shards.
- Only after success, update `last_processed_sha`.

This matters because a webhook event only means “the repo changed”. It does not mean your downstream job finished successfully.

### 2. Detect the new revision

If using webhooks, the payload includes repo/revision information such as `repo.headSha` and `updatedRefs` with `oldSha` / `newSha`, according to the [Hub Webhooks docs](https://huggingface.co/docs/hub/webhooks).
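As a rough illustration of the webhook path, here is a minimal sketch that pulls the SHA range for one branch out of a payload. The helper name `extract_sha_range`, the example payload values, and the `refs/heads/main` ref format are my assumptions based on the field names mentioned above, so check them against a real payload before relying on them. The worker should still compare against its own `last_processed_sha` rather than trusting `oldSha` alone.

```python
from typing import Optional, Tuple


def extract_sha_range(
    payload: dict, ref: str = "refs/heads/main"
) -> Optional[Tuple[Optional[str], str]]:
    """Return (old_sha, new_sha) for `ref`, or None if this event did not move it."""
    for updated in payload.get("updatedRefs") or []:
        if updated.get("ref") == ref and updated.get("newSha"):
            return updated.get("oldSha"), updated["newSha"]
    return None


# Hypothetical payload, trimmed to the fields used above (values are made up).
example_payload = {
    "event": {"action": "update", "scope": "repo.content"},
    "repo": {"type": "dataset", "name": "namespace/dataset_name", "headSha": "bbb222"},
    "updatedRefs": [
        {"ref": "refs/heads/main", "oldSha": "aaa111", "newSha": "bbb222"}
    ],
}

sha_range = extract_sha_range(example_payload)
if sha_range is not None:
    old_sha, new_sha = sha_range
    print("branch moved:", old_sha, "->", new_sha)
```

Returning `None` for events that do not touch the consumed branch makes it easy to ignore discussion events or updates to other refs.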
If polling, `HfApi.list_repo_commits` can be used to get recent commits, and `HfApi.dataset_info` can expose dataset metadata such as the current repo SHA.

### 3. Compare file manifests between revisions

Use repo tree metadata, for example `HfApi.list_repo_tree`, with:

- `recursive=True`
- `expand=True`
- `revision=some_sha`
- `repo_type="dataset"`

Then compare fields such as:

- `path`
- `size`
- `blob_id`
- `lfs.sha256`, if present
- `xet_hash`, if present
- `last_commit`, if useful

The output you want is something like:

- added files
- modified files
- deleted files
- metadata-only changes

Then filter to actual data files such as:

- `.parquet`
- `.jsonl`
- `.json`
- `.csv`
- `.tar`
- `.zip`
- `data/`

### 4. Download only changed paths

After you know the changed paths, use the Hub download APIs as the fetching layer. For example:

```python
from huggingface_hub import snapshot_download

# changed_data_files is the list of paths produced by the manifest comparison in step 3.
snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    local_dir="./changed_shards",
)
```

You can also use `dry_run=True` to check what would be downloaded and whether files are already cached:

```python
from huggingface_hub import snapshot_download

infos = snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    dry_run=True,
)

for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)
```

I would treat this as a download optimization step, not as the source of truth for semantic dataset changes.

## The key design question: what is the change boundary?

The workaround is much easier if the dataset is already organized into stable shards.

| Dataset layout | Practical downstream strategy |
| --- | --- |
| Append-only Parquet shards | Process only new/changed Parquet files. Best case. |
| Append-only WebDataset shards | Process only new/changed `.tar` shards. Also good. |
| Stable date/source partitions | Reprocess changed partitions or shards. |
| Existing shards are rewritten | Reprocess the rewritten shards. Do not infer row deltas unless you have IDs/changelog. |
| One giant file/archive | Hard. Any small logical change may force whole-file reprocessing. |
| Custom loading script hides file boundaries | Harder. Inspect actual repo files if possible. |
| Stable row IDs + changelog/tombstones | Row-level workflows become possible. |
| No stable IDs/changelog | Row-level delta is risky to infer. |

So my first implementation would probably target **file/shard-level deltas**, not row-level deltas.

## Why dataset layout matters

For large datasets, the Hub documentation generally points users toward formats such as Parquet and WebDataset.

There is also relevant forum guidance from lhoestq in a large-dataset loading discussion: for a large dataset packaged as one huge zip plus a custom loader, the recommendation was to use multiple shards and formats such as WebDataset or Parquet instead. See [How to load a large HF dataset efficiently?](https://discuss.huggingface.co/t/how-to-load-a-large-hf-dataset-efficiently/69288).

That advice is about loading and dataset layout, not directly about delta fetching. But it points to the same practical boundary: if the dataset is split into meaningful shards, downstream consumers can reprocess only changed shards. If everything is hidden inside one huge file, delta processing becomes much harder.
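To make the "change boundary" idea a bit more concrete, here is a small sketch that groups changed file paths into reprocessing units by parent directory. The function name `reprocessing_units` and the partition-style paths are hypothetical. The sketch assumes each directory is a meaningful partition, which only holds if the producer lays the dataset out that way.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def reprocessing_units(changed_paths):
    """Group changed repo paths by parent directory (the assumed partition)."""
    units = defaultdict(list)
    for path in changed_paths:
        # Repo paths always use forward slashes, so PurePosixPath is safe on any OS.
        parent = str(PurePosixPath(path).parent)
        units[parent].append(path)
    return {unit: sorted(paths) for unit, paths in sorted(units.items())}


# Hypothetical changed paths, as produced by a manifest comparison.
changed = [
    "data/date=2026-07-02/part-00000.parquet",
    "data/date=2026-07-01/part-00001.parquet",
    "data/date=2026-07-02/part-00001.parquet",
]

for unit, paths in reprocessing_units(changed).items():
    print(unit, len(paths))
```

With one giant archive, every change collapses into a single unit, which is exactly the hard case from the table.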
If you control the producer side, I would make the dataset downstream-friendly:

```text
data/
  date=2026-07-01/part-00000.parquet
  date=2026-07-01/part-00001.parquet
  date=2026-07-02/part-00000.parquet
manifest.jsonl
changelog/
  2026-07-01.jsonl
  2026-07-02.jsonl
schema/
  v1.json
```

Useful producer-side properties:

- append-only shard names
- stable row/example IDs
- no rewriting of old shards unless necessary
- explicit tombstones or delete markers if deletes matter
- schema versioning
- a manifest or changelog that downstream consumers can trust
- batched commits rather than many tiny high-frequency commits

There is a related `datasets` issue for producer-side appending, `push_to_hub(..., append=True)`. That issue is not the same as consumer-side delta fetching, but it is a useful signal that incremental dataset workflows are not just a one-line solved path for every case.

## Things I would not over-assume

The main trap is mixing **transport/cache-level optimization** with **application-level delta semantics**.

| Mechanism | Helps with | Does not automatically give |
| --- | --- | --- |
| Webhook | Detecting repo updates | Changed rows/examples |
| Repo SHA | Version boundary | Semantic delta by itself |
| Manifest comparison | Changed files/shards | Exact row-level insert/update/delete |
| `snapshot_download` / `hf_hub_download` | Fetching selected files | Delta detection by itself |
| Hub cache | Avoiding redundant downloads | Pipeline processing state |
| Xet/chunk deduplication | Storage/transfer efficiency | Dataset-level changelog |
| Dataset Viewer Parquet files | Inspection/queryable derived artifacts | General delta API |

For example, the Hub docs on [editing datasets](https://huggingface.co/docs/hub/datasets-editing) and [PyArrow integration](https://huggingface.co/docs/hub/datasets-pyarrow) discuss optimized Parquet/Xet behavior that can reduce upload/download/storage costs. That is useful, but I would not treat chunk-level deduplication as a replacement for stable row IDs, manifests, or changelogs.
Similarly, the Dataset Viewer `/parquet` endpoint can list converted Parquet files for a dataset. That can be useful for inspection, but I would not treat viewer-converted Parquet files as a general-purpose dataset changelog unless your pipeline is explicitly designed around that derived representation.

## A few edge cases I would handle explicitly

- **README/card-only updates**: do not trigger heavy processing.
- **Deleted files**: remove or mark corresponding downstream state.
- **Renamed files**: may initially appear as delete + add.
- **Schema changes**: treat separately from normal row additions.
- **Branch/tag updates**: filter to the branch you actually consume, usually `main`.
- **Private/gated datasets**: make sure the token used by the webhook worker or poller has access.
- **Retries**: make processing idempotent for the same old/new SHA range.
- **Concurrent updates**: if several commits arrive while processing, compare from `last_processed_sha` to the newest target SHA, or process ranges in order.
- **Cache cleanup**: manage Hub cache separately from pipeline state.
- **Streaming**: useful for avoiding full materialization, but not a delta detector by itself.
- **Existing shard rewrites**: reprocess the whole changed shard unless the dataset gives a stronger row-level contract.

There is also a practical HF blog example that uses a Hub compare URL with `oldSha..newSha` to identify changed files and skip README-only updates: [The 5 Most Under-Rated Tools on Hugging Face](https://github.com/huggingface/blog/blob/main/unsung-heroes.md). I would treat that as a useful implementation pattern, while still considering manifest comparison through `huggingface_hub` APIs as the more general approach to describe.

## If you really need row-level CDC

If the real requirement is “give me inserted/updated/deleted rows since version X”, I would treat that as a **changelog/table-format requirement**, not something to infer from arbitrary Hub files.
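To show why that contract matters, here is a minimal sketch of a row-level diff between two versions of the same shard. It only works under a loud assumption: every row carries a stable, unique ID (the `id` key here is hypothetical) that the producer never reuses or reassigns. Without that guarantee, the same logic cannot tell an update from a delete plus an insert.

```python
def row_delta(old_rows, new_rows, id_key="id"):
    """Classify row IDs as inserted, updated, or deleted between two versions.

    Assumes `id_key` is a stable unique identifier published by the producer.
    """
    old_by_id = {row[id_key]: row for row in old_rows}
    new_by_id = {row[id_key]: row for row in new_rows}

    inserted = sorted(new_by_id.keys() - old_by_id.keys())
    deleted = sorted(old_by_id.keys() - new_by_id.keys())
    updated = sorted(
        row_id
        for row_id in old_by_id.keys() & new_by_id.keys()
        if old_by_id[row_id] != new_by_id[row_id]
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Hypothetical rows from the old and new revision of one shard.
old_rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
new_rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "B"}, {"id": 4, "text": "d"}]

print(row_delta(old_rows, new_rows))
```

Note that this still needs both versions of the shard in memory, so it is a per-shard refinement on top of file-level deltas, not a replacement for them.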
For comparison, systems that do offer row-level change feeds have explicit metadata, snapshots, operation logs, or change streams. A generic HF dataset repo may not have that contract unless the dataset producer publishes it.

For HF datasets, row-level CDC usually requires at least one of:

- stable row/example IDs
- producer-published changelog
- producer-published manifest
- tombstone/delete markers
- table-format metadata
- or accepting shard-level reprocessing instead

## Minimal test I would run first

Before building a full system, I would try this small experiment:

- Pick a dataset repo and two revisions: `old_sha` and `new_sha`.
- Use `list_repo_tree(..., recursive=True, expand=True)` for both revisions.
- Filter to likely data files.
- Compare `path`, `size`, `blob_id`, `lfs.sha256`, and/or `xet_hash`.
- Classify added / modified / deleted files.
- Dry-run download only those changed paths.
- Check whether the changed files are a useful reprocessing unit for your pipeline.
- If yes, build the webhook/polling + retryable pipeline around that.
- If no, the dataset layout or missing producer metadata is probably the limiting factor.
## Sketch of the manifest comparison test

```python
from huggingface_hub import HfApi, snapshot_download

api = HfApi()
repo_id = "namespace/dataset_name"


def file_identity(item):
    """Collect the fields used to decide whether a file changed."""
    lfs = getattr(item, "lfs", None)
    return {
        "path": getattr(item, "path", None),
        "size": getattr(item, "size", None),
        "blob_id": getattr(item, "blob_id", None),
        "lfs_sha256": getattr(lfs, "sha256", None) if lfs else None,
        "xet_hash": getattr(item, "xet_hash", None),
    }


def build_manifest(repo_id: str, revision: str):
    """Map each file path at `revision` to its identity fields."""
    manifest = {}
    for item in api.list_repo_tree(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        recursive=True,
        expand=True,
    ):
        path = getattr(item, "path", None)
        if path is None:
            continue
        # Folder entries carry no blob_id; skip them so only files are compared.
        if getattr(item, "blob_id", None) is None:
            continue
        if path in {"README.md", ".gitattributes"}:
            continue
        manifest[path] = file_identity(item)
    return manifest


old_sha = "old_sha_here"
new_sha = "new_sha_here"

old = build_manifest(repo_id, old_sha)
new = build_manifest(repo_id, new_sha)

old_paths = set(old)
new_paths = set(new)

added = new_paths - old_paths
deleted = old_paths - new_paths
modified = {path for path in old_paths & new_paths if old[path] != new[path]}

data_suffixes = (".parquet", ".jsonl", ".json", ".csv", ".tar", ".zip")
changed_data_files = sorted(
    path
    for path in added | modified
    if path.endswith(data_suffixes) or path.startswith("data/")
)

print("added:", sorted(added))
print("deleted:", sorted(deleted))
print("modified:", sorted(modified))
print("changed data files:", changed_data_files)

# Dry run: report what would be fetched without downloading anything.
infos = snapshot_download(
    repo_id=repo_id,
    repo_type="dataset",
    revision=new_sha,
    allow_patterns=changed_data_files,
    dry_run=True,
)

for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)
```

This test does not solve row-level CDC. It only checks whether file/shard-level delta processing is practical for the dataset.

## Short version

I would start with this:

1. Use Webhooks or SHA polling for detection.
2. Store `last_processed_sha`.
3. Compare repo file manifests between old and new revisions.
4. Treat changed data files/shards as the delta.
5. Download and reprocess only those changed shards.
6. Handle deletions and schema changes explicitly.
7. Advance `last_processed_sha` only after downstream success.
8. Do not expect arbitrary row-level deltas unless the dataset provides stable IDs, changelog, or table-format metadata.

That is not a perfect built-in delta system, but it is a practical and testable path with current Hub APIs.
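The steps above can be tied together in one small loop. This is only a sketch under stated assumptions: the state file format, the function names (`run_once`, `load_last_processed_sha`, `save_last_processed_sha`), and the three injected callables are all hypothetical. In a real pipeline, `get_head_sha` would wrap a webhook event or a Hub API call, and `diff_files` would wrap the manifest comparison.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def load_last_processed_sha(state_path: Path) -> Optional[str]:
    """Read the last fully processed commit, or None on the first run."""
    if not state_path.exists():
        return None
    return json.loads(state_path.read_text())["last_processed_sha"]


def save_last_processed_sha(state_path: Path, sha: str) -> None:
    """Write state via a temp file plus rename so a crash cannot leave it half-written."""
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent)
    with os.fdopen(fd, "w") as handle:
        json.dump({"last_processed_sha": sha}, handle)
    os.replace(tmp_name, state_path)


def run_once(
    state_path: Path,
    get_head_sha: Callable[[], str],
    diff_files: Callable[[Optional[str], str], List[str]],
    process: Callable[[str, List[str]], None],
) -> bool:
    """Process one SHA range. Returns True if state advanced, False if up to date."""
    last_sha = load_last_processed_sha(state_path)
    head_sha = get_head_sha()
    if head_sha == last_sha:
        return False

    # last_sha is None on the first run; diff_files decides what "everything" means.
    changed = diff_files(last_sha, head_sha)

    # process() must raise on failure so the state below is never advanced early.
    process(head_sha, changed)

    save_last_processed_sha(state_path, head_sha)
    return True
```

Because the state only advances after `process` returns, a failed or interrupted run is retried over the same SHA range the next time, which is why `process` should be idempotent.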