Hmm… my current read is that this does not exist as a first-class official feature yet; depending on the dataset, there may be workable workarounds:
just in case, @lhoestq
I would split this into two separate problems:
- Detecting that the dataset repo changed
- Deciding what the smallest safe fetch/reprocessing unit is
For detection, I would look at Hub Webhooks first, or poll the dataset repo SHA with huggingface_hub if webhooks are not an option.
For deltas, I would not assume a generic “fetch only the new/changed rows since last time” API for arbitrary HF datasets. The practical workaround I would try first is revision/file/shard-level tracking:
last_processed_sha
-> detect new_sha
-> compare file manifests between the two revisions
-> find added/modified/deleted data files
-> download/reprocess only those changed files or shards
-> advance last_processed_sha only after success
This is not the same as true row-level CDC, but it is a low-risk pattern that is testable with current Hub APIs.
Suggested minimal workflow
- Keep your own processing state
Do not rely only on the local Hub cache or on “latest seen revision”. Keep explicit downstream state, for example:
last_processed_sha = the last dataset repo commit fully processed by your pipeline
Then:
- A webhook fires, or polling detects a newer SHA.
- Your worker compares last_processed_sha with the new SHA.
- Your pipeline processes the changed files/shards.
- Only after success, update last_processed_sha.
This matters because a webhook event only means “the repo changed”. It does not mean your downstream job finished successfully.
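As a rough illustration of "advance state only after success", here is a minimal sketch that keeps last_processed_sha in a local JSON file. The file name pipeline_state.json, the helper names, and the process callback are all hypothetical; a real pipeline might keep this state in a database instead.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical location for the pipeline's own state (not the Hub cache).
STATE_FILE = Path("pipeline_state.json")


def load_last_processed_sha(state_file: Path = STATE_FILE) -> Optional[str]:
    """Return the last fully processed commit SHA, or None on the first run."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text())["last_processed_sha"]


def save_last_processed_sha(sha: str, state_file: Path = STATE_FILE) -> None:
    """Write the SHA to a temp file, then rename it over the state file."""
    fd, tmp_name = tempfile.mkstemp(dir=state_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"last_processed_sha": sha}, tmp)
    os.replace(tmp_name, state_file)


def run_once(
    new_sha: str,
    process: Callable[[Optional[str], str], None],
    state_file: Path = STATE_FILE,
) -> bool:
    """Process old_sha -> new_sha and advance state only if process() succeeds."""
    old_sha = load_last_processed_sha(state_file)
    if old_sha == new_sha:
        return False  # nothing new to do
    process(old_sha, new_sha)  # an exception here leaves the state untouched
    save_last_processed_sha(new_sha, state_file)
    return True
```

If process raises, the next webhook or poll simply retries from the same last_processed_sha, which is why the processing step itself should be idempotent.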
- Detect the new revision
If using webhooks, the payload includes repo/revision information such as repo.headSha and updatedRefs with oldSha / newSha, according to the Hub Webhooks docs.
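A minimal sketch of pulling the SHA range out of a webhook payload, assuming it has already been parsed into a dict. The updatedRefs, oldSha, and newSha names come from the description above; the repo.type and ref fields, and the sample payload values, are my reading of the Webhooks docs and should be checked against a real payload. The function name extract_sha_range is hypothetical.

```python
from typing import Optional, Tuple

# Assumption: the pipeline only consumes the main branch.
TRACKED_REF = "refs/heads/main"


def extract_sha_range(
    payload: dict, tracked_ref: str = TRACKED_REF
) -> Optional[Tuple[Optional[str], str]]:
    """Return (old_sha, new_sha) for the tracked ref, or None if irrelevant."""
    if payload.get("repo", {}).get("type") != "dataset":
        return None
    for ref in payload.get("updatedRefs") or []:
        if ref.get("ref") == tracked_ref and ref.get("newSha"):
            return ref.get("oldSha"), ref["newSha"]
    return None


# Illustrative payload with made-up values, not copied from a real event.
example_payload = {
    "repo": {"type": "dataset", "name": "namespace/dataset_name", "headSha": "bbb222"},
    "updatedRefs": [
        {"ref": "refs/heads/main", "oldSha": "aaa111", "newSha": "bbb222"}
    ],
}

print(extract_sha_range(example_payload))
```

I would still compare against your own last_processed_sha rather than trusting oldSha alone, since events can be missed or arrive while a previous run is still in progress.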
If polling, HfApi.list_repo_commits can be used to get recent commits, and HfApi.dataset_info can expose dataset metadata such as the current repo SHA.
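For the polling variant, a small sketch could look like this. The helper names check_for_update and hub_head_sha are hypothetical; the only Hub call used is HfApi.dataset_info, and I am assuming its returned object exposes the commit as .sha. The comparison logic takes a callable so it can be tested without network access.

```python
from typing import Callable, Optional


def check_for_update(
    get_head_sha: Callable[[], str], last_processed_sha: Optional[str]
) -> Optional[str]:
    """Return the current head SHA if it differs from the processed one, else None."""
    head_sha = get_head_sha()
    return head_sha if head_sha != last_processed_sha else None


def hub_head_sha(repo_id: str, revision: str = "main") -> str:
    """Fetch the current commit SHA of a dataset repo from the Hub."""
    # Imported lazily so the comparison logic above has no hard dependency.
    from huggingface_hub import HfApi

    return HfApi().dataset_info(repo_id, revision=revision).sha
```

Usage would be something like check_for_update(lambda: hub_head_sha("namespace/dataset_name"), last_processed_sha), run on whatever schedule fits the update frequency of the dataset.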
- Compare file manifests between revisions
Use repo tree metadata, for example HfApi.list_repo_tree, with:
recursive=True
expand=True
revision=some_sha
repo_type="dataset"
Then compare fields such as:
path
size
blob_id
lfs.sha256, if present
xet_hash, if present
last_commit, if useful
The output you want is something like:
added files
modified files
deleted files
metadata-only changes
Then filter to actual data files such as:
*.parquet
*.jsonl
*.json
*.csv
*.tar
*.zip
data/**
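One way to apply that filter is a small glob matcher over the changed paths. This is only a sketch: the function names are hypothetical, and I use data/* rather than data/** because fnmatch's * already matches across / separators.

```python
from fnmatch import fnmatchcase

# Patterns taken from the list above; adjust to the dataset's actual layout.
DATA_PATTERNS = (
    "*.parquet", "*.jsonl", "*.json", "*.csv", "*.tar", "*.zip", "data/*",
)


def is_data_file(path: str) -> bool:
    """True if the repo path looks like a data file rather than card/config."""
    return any(fnmatchcase(path, pattern) for pattern in DATA_PATTERNS)


def split_changes(changed_paths):
    """Split changed paths into (data_files, metadata_only), both sorted."""
    data_files, metadata_only = [], []
    for path in sorted(changed_paths):
        (data_files if is_data_file(path) else metadata_only).append(path)
    return data_files, metadata_only


data_files, metadata_only = split_changes(
    {"README.md", ".gitattributes", "data/train/part-00000.parquet", "images/shard-0001.tar"}
)
print("data:", data_files)
print("metadata-only:", metadata_only)
```

Note that a broad pattern like *.json will also match small metadata files if the repo has any, so the pattern list usually needs tuning per dataset.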
- Download only changed paths
After you know the changed paths, use the Hub download APIs as the fetching layer. For example:
from huggingface_hub import snapshot_download

# changed_data_files is the list of changed paths from the manifest comparison.
snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    local_dir="./changed_shards",
)
You can also use dry_run=True to check what would be downloaded and whether files are already cached:
from huggingface_hub import snapshot_download

# With dry_run=True nothing is downloaded; the call reports what it would fetch.
infos = snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    dry_run=True,
)
for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)
I would treat this as a download optimization step, not as the source of truth for semantic dataset changes.
The key design question: what is the change boundary?
The workaround is much easier if the dataset is already organized into stable shards.
| Dataset layout | Practical downstream strategy |
|---|---|
| Append-only Parquet shards | Process only new/changed Parquet files. Best case. |
| Append-only WebDataset shards | Process only new/changed .tar shards. Also good. |
| Stable date/source partitions | Reprocess changed partitions or shards. |
| Existing shards are rewritten | Reprocess the rewritten shards. Do not infer row deltas unless you have IDs/changelog. |
| One giant file/archive | Hard. Any small logical change may force whole-file reprocessing. |
| Custom script hides file boundaries | Harder. Inspect actual repo files if possible. |
| Stable row IDs + changelog/tombstones | Row-level workflows become possible. |
| No stable IDs/changelog | Row-level delta is risky to infer. |
So my first implementation would probably target file/shard-level deltas, not row-level deltas.
Why dataset layout matters
For large datasets, the Hub documentation generally points users toward formats such as Parquet and WebDataset.
There is also relevant forum guidance from lhoestq in a large-dataset discussion: for a large dataset packaged as one huge zip plus a custom loading script, the recommendation was to use multiple shards and formats such as WebDataset or Parquet instead: How to load a large HF dataset efficiently?.
That advice is about efficient loading and dataset layout, not directly about delta fetching. But it points to the same practical boundary: if the dataset is split into meaningful shards, downstream consumers can reprocess only changed shards. If everything is hidden inside one huge file, delta processing becomes much harder.
If you control the producer side, I would make the dataset downstream-friendly:
data/
  date=2026-07-01/part-00000.parquet
  date=2026-07-01/part-00001.parquet
  date=2026-07-02/part-00000.parquet
_manifest.jsonl
_changelog/
  2026-07-01.jsonl
  2026-07-02.jsonl
_schema/
  v1.json
Useful producer-side properties:
- append-only shard names
- stable row/example IDs
- no rewriting of old shards unless necessary
- explicit tombstones or delete markers if deletes matter
- schema versioning
- a manifest or changelog that downstream consumers can trust
- batched commits rather than many tiny high-frequency commits
There is a related datasets issue for producer-side appending, push_to_hub(..., append=True). That issue is not the same as consumer-side delta fetching, but it is a useful signal that incremental dataset workflows are not just a one-line solved path for every case.
Things I would not over-assume
The main trap is mixing transport/cache-level optimization with application-level delta semantics.
| Mechanism | Helps with | Does not automatically give |
|---|---|---|
| Webhook | Detecting repo updates | Changed rows/examples |
| Repo SHA | Version boundary | Semantic delta by itself |
| Manifest comparison | Changed files/shards | Exact row-level insert/update/delete |
| snapshot_download / hf_hub_download | Fetching selected files | Delta detection by itself |
| Hub cache | Avoiding redundant downloads | Pipeline processing state |
| Xet/chunk deduplication | Storage/transfer efficiency | Dataset-level changelog |
| Dataset Viewer Parquet files | Inspection/queryable derived artifacts | General delta API |
For example, the Hub docs on editing datasets and PyArrow integration discuss optimized Parquet/Xet behavior that can reduce upload/download/storage costs. That is useful, but I would not treat chunk-level deduplication as a replacement for stable row IDs, manifests, or changelogs.
Similarly, the Dataset Viewer /parquet endpoint can list converted Parquet files for a dataset. That can be useful for inspection, but I would not treat viewer-converted Parquet files as a general-purpose dataset changelog unless your pipeline is explicitly designed around that derived representation.
A few edge cases I would handle explicitly
README/card-only updates: do not trigger heavy processing.
Deleted files: remove or mark corresponding downstream state.
Renamed files: may initially appear as delete + add.
Schema changes: treat separately from normal row additions.
Branch/tag updates: filter to the branch you actually consume, usually main.
Private/gated datasets: make sure the token used by the webhook worker or poller has access.
Retries: make processing idempotent for the same old/new SHA range.
Concurrent updates: if several commits arrive while processing, compare from last_processed_sha to the newest target SHA, or process ranges in order.
Cache cleanup: manage Hub cache separately from pipeline state.
Streaming: useful for avoiding full materialization, but not a delta detector by itself.
Existing shard rewrites: reprocess the whole changed shard unless the dataset gives a stronger row-level contract.
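For the renamed-files case, one heuristic is to pair a deleted path with an added path that has the same blob_id, so the file can be relabeled downstream instead of reprocessed. This sketch assumes manifests shaped as {path: {"blob_id": ...}} and assumes blob_id stays the same when only the path changes; detect_renames and the sample manifests are hypothetical.

```python
def detect_renames(old_manifest: dict, new_manifest: dict):
    """Pair deleted and added paths that share a blob_id.

    Returns (renames, added, deleted), where renames maps old_path -> new_path
    and added/deleted exclude the paths explained by a rename.
    """
    added = set(new_manifest) - set(old_manifest)
    deleted = set(old_manifest) - set(new_manifest)

    # Index deleted paths by blob_id so added paths can be matched against them.
    deleted_by_blob = {}
    for path in sorted(deleted):
        blob_id = old_manifest[path].get("blob_id")
        if blob_id:
            deleted_by_blob.setdefault(blob_id, []).append(path)

    renames = {}
    for path in sorted(added):
        blob_id = new_manifest[path].get("blob_id")
        candidates = deleted_by_blob.get(blob_id)
        if blob_id and candidates:
            renames[candidates.pop(0)] = path

    return renames, added - set(renames.values()), deleted - set(renames)


old = {
    "data/a.parquet": {"blob_id": "b1"},
    "data/b.parquet": {"blob_id": "b2"},
}
new = {
    "data/part-0.parquet": {"blob_id": "b1"},
    "data/c.parquet": {"blob_id": "b3"},
}
renames, added, deleted = detect_renames(old, new)
print(renames, added, deleted)
```

If the heuristic is ever wrong for a dataset, falling back to plain delete + add is still correct, just more expensive.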
There is also a practical HF blog example that uses a Hub compare URL with oldSha..newSha to identify changed files and skip README-only updates: The 5 Most Under-Rated Tools on Hugging Face. I would treat that as a useful implementation pattern, while still considering manifest comparison through huggingface_hub APIs as the more general approach to describe.
If you really need row-level CDC
If the real requirement is “give me inserted/updated/deleted rows since version X”, I would treat that as a changelog/table-format requirement, not something to infer from arbitrary Hub files.
For comparison, dedicated table formats and CDC systems have explicit metadata, snapshots, operation logs, or change streams. A generic HF dataset repo may not have that contract unless the dataset producer publishes it.
For HF datasets, row-level CDC usually requires at least one of:
stable row/example IDs
producer-published changelog
producer-published manifest
tombstone/delete markers
table-format metadata
or accepting shard-level reprocessing instead
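To make the "producer-published changelog" option concrete, here is a sketch of a consumer replaying one. The JSONL entry format ({"op": "upsert" | "delete", "id": ..., "row": ...}) is purely hypothetical; the Hub does not define such a format, so the producer and consumer would have to agree on it.

```python
import json


def apply_changelog(table: dict, changelog_lines) -> dict:
    """Replay changelog entries, in order, onto a dict keyed by stable row ID."""
    for line in changelog_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry["op"] == "upsert":
            table[entry["id"]] = entry["row"]
        elif entry["op"] == "delete":
            # Tombstone: deleting a row that is already absent is a no-op.
            table.pop(entry["id"], None)
        else:
            raise ValueError(f"unknown changelog op: {entry['op']!r}")
    return table


rows = {"r1": {"text": "old"}}
changelog = [
    '{"op": "upsert", "id": "r1", "row": {"text": "new"}}',
    '{"op": "upsert", "id": "r2", "row": {"text": "added"}}',
    '{"op": "delete", "id": "r1"}',
]
print(apply_changelog(rows, changelog))
```

Because upserts and tombstones are both idempotent here, replaying the same changelog file after a failed run gives the same result, which fits the retry behaviour described earlier.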
Minimal test I would run first
Before building a full system, I would try this small experiment:
- Pick a dataset repo and two revisions: old_sha and new_sha.
- Use list_repo_tree(..., recursive=True, expand=True) for both revisions.
- Filter to likely data files.
- Compare path, size, blob_id, lfs.sha256, and/or xet_hash.
- Classify added / modified / deleted files.
- Dry-run download only those changed paths.
- Check whether the changed files are a useful reprocessing unit for your pipeline.
- If yes, build the webhook/polling + retryable pipeline around that.
- If no, the dataset layout or missing producer metadata is probably the limiting factor.
Sketch of the manifest comparison test
from huggingface_hub import HfApi, snapshot_download

api = HfApi()
repo_id = "namespace/dataset_name"


def file_identity(item):
    # Collect the fields that identify a file's content at one revision.
    lfs = getattr(item, "lfs", None)
    return {
        "path": getattr(item, "path", None),
        "size": getattr(item, "size", None),
        "blob_id": getattr(item, "blob_id", None),
        "lfs_sha256": getattr(lfs, "sha256", None) if lfs else None,
        "xet_hash": getattr(item, "xet_hash", None),
    }


def build_manifest(repo_id: str, revision: str):
    # Map each file path to its identity at the given revision.
    manifest = {}
    for item in api.list_repo_tree(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        recursive=True,
        expand=True,
    ):
        path = getattr(item, "path", None)
        if path is None:
            continue
        # Skip folder entries; as far as I can tell they carry no size.
        if getattr(item, "size", None) is None:
            continue
        # Card/config files should not trigger data reprocessing.
        if path in {"README.md", ".gitattributes"}:
            continue
        manifest[path] = file_identity(item)
    return manifest


old_sha = "old_sha_here"
new_sha = "new_sha_here"

old = build_manifest(repo_id, old_sha)
new = build_manifest(repo_id, new_sha)

old_paths = set(old)
new_paths = set(new)

added = new_paths - old_paths
deleted = old_paths - new_paths
modified = {
    path
    for path in old_paths & new_paths
    if old[path] != new[path]
}

# Keep only paths that look like data files.
data_suffixes = (".parquet", ".jsonl", ".json", ".csv", ".tar", ".zip")
changed_data_files = sorted(
    path
    for path in added | modified
    if path.endswith(data_suffixes) or path.startswith("data/")
)

print("added:", sorted(added))
print("deleted:", sorted(deleted))
print("modified:", sorted(modified))
print("changed data files:", changed_data_files)

# Dry run: report what would be downloaded without fetching anything.
infos = snapshot_download(
    repo_id=repo_id,
    repo_type="dataset",
    revision=new_sha,
    allow_patterns=changed_data_files,
    dry_run=True,
)
for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)
This test does not solve row-level CDC. It only checks whether file/shard-level delta processing is practical for the dataset.
Short version
I would start with this:
Use Webhooks or SHA polling for detection.
Store last_processed_sha.
Compare repo file manifests between old and new revisions.
Treat changed data files/shards as the delta.
Download and reprocess only those changed shards.
Handle deletions and schema changes explicitly.
Advance last_processed_sha only after downstream success.
Do not expect arbitrary row-level deltas unless the dataset provides stable IDs, changelog, or table-format metadata.
That is not a perfect built-in delta system, but it is a practical and testable path with current Hub APIs.