What are the best practices for detecting and fetching deltas from a dataset?


Hmm… my current read is that this does not exist as a first-class official feature yet; depending on the dataset, there may be workable workarounds:

just in case, @lhoestq

I would split this into two separate problems:

Detecting that the dataset repo changed
Deciding what the smallest safe fetch/reprocessing unit is

For detection, I would look at Hub Webhooks first, or poll the dataset repo SHA with huggingface_hub if webhooks are not an option.

For deltas, I would not assume a generic “fetch only the new/changed rows since last time” API for arbitrary HF datasets. The practical workaround I would try first is revision/file/shard-level tracking:

last_processed_sha
  -> detect new_sha
  -> compare file manifests between the two revisions
  -> find added/modified/deleted data files
  -> download/reprocess only those changed files or shards
  -> advance last_processed_sha only after success

This is not the same as true row-level CDC, but it is a low-risk pattern that is testable with current Hub APIs.


Suggested minimal workflow

Keep your own processing state

Do not rely only on the local Hub cache or on “latest seen revision”. Keep explicit downstream state, for example:

last_processed_sha = the last dataset repo commit fully processed by your pipeline

Then:

A webhook fires, or polling detects a newer SHA.
Your worker compares last_processed_sha with the new SHA.
Your pipeline processes the changed files/shards.
Only after success, update last_processed_sha.

This matters because a webhook event only means “the repo changed”. It does not mean your downstream job finished successfully.
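A minimal sketch of that state handling, assuming the state lives in a small local JSON file. The file layout and the names load_last_processed_sha, save_last_processed_sha, and run_once are hypothetical; a database row or object-store key would work the same way.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def load_last_processed_sha(state_file: Path) -> Optional[str]:
    """Return the last fully processed SHA, or None on the first run."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text())["last_processed_sha"]


def save_last_processed_sha(state_file: Path, sha: str) -> None:
    """Write the state atomically so a crash cannot leave a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=state_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"last_processed_sha": sha}, tmp)
    os.replace(tmp_name, state_file)


def run_once(
    state_file: Path,
    new_sha: str,
    process_range: Callable[[Optional[str], str], None],
) -> bool:
    """Process old..new and advance the state only if processing succeeded."""
    old_sha = load_last_processed_sha(state_file)
    if old_sha == new_sha:
        return False  # already processed, nothing to do
    # If process_range raises, the line below never runs and the state stays put.
    process_range(old_sha, new_sha)
    save_last_processed_sha(state_file, new_sha)
    return True
```

Because the SHA is written only after process_range returns, a failed or interrupted run is simply retried from the same last_processed_sha on the next trigger.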

Detect the new revision

If using webhooks, the payload includes repo/revision information such as repo.headSha and updatedRefs with oldSha / newSha, according to the Hub Webhooks docs.
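As a rough sketch, a webhook worker could pull the SHA range out of the parsed JSON payload like this. The function name extract_sha_range is hypothetical, and the repo.type, repo.name, and updatedRefs[].ref fields are assumptions based on the payload shape shown in the Hub Webhooks docs, so check them against a real delivery before relying on them.

```python
from typing import Optional, Tuple


def extract_sha_range(
    payload: dict,
    repo_id: str,
    ref: str = "refs/heads/main",
) -> Optional[Tuple[Optional[str], str]]:
    """Return (old_sha, new_sha) for the consumed branch, or None if irrelevant."""
    repo = payload.get("repo", {})
    # Ignore events for other repos or for non-dataset repos.
    if repo.get("type") != "dataset" or repo.get("name") != repo_id:
        return None
    for updated in payload.get("updatedRefs") or []:
        # Only react to the branch this pipeline actually consumes.
        if updated.get("ref") == ref and updated.get("newSha"):
            return updated.get("oldSha"), updated["newSha"]
    return None
```

The returned old SHA is informational only. The pipeline should still diff from its own last_processed_sha, since earlier webhook deliveries may have been missed.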

If polling, HfApi.list_repo_commits can be used to get recent commits, and HfApi.dataset_info can expose dataset metadata such as the current repo SHA.
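A polling loop could look roughly like the sketch below. The helper names fetch_head_sha and poll_for_new_sha, the 300-second default interval, and the max_polls cap are illustrative assumptions; only HfApi.dataset_info and its sha attribute come from huggingface_hub.

```python
import time
from typing import Callable, Optional


def fetch_head_sha(repo_id: str, revision: str = "main") -> Optional[str]:
    """Ask the Hub for the current commit SHA of a dataset repo."""
    # Imported lazily so the polling logic below can be tested without network access.
    from huggingface_hub import HfApi

    return HfApi().dataset_info(repo_id, revision=revision).sha


def poll_for_new_sha(
    get_sha: Callable[[], Optional[str]],
    last_seen: Optional[str],
    interval_s: float = 300.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the first SHA that differs from last_seen, or None if max_polls is hit."""
    polls = 0
    while True:
        sha = get_sha()
        if sha and sha != last_seen:
            return sha
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return None
        sleep(interval_s)
```

In use, get_sha would be something like lambda: fetch_head_sha("namespace/dataset_name"), and last_seen would be the stored last_processed_sha.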

Compare file manifests between revisions

Use repo tree metadata, for example HfApi.list_repo_tree, with:

recursive=True
expand=True
revision=some_sha
repo_type="dataset"

Then compare fields such as:

path
size
blob_id
lfs.sha256, if present
xet_hash, if present
last_commit, if useful

The output you want is something like:

added files
modified files
deleted files
metadata-only changes

Then filter to actual data files such as:

*.parquet
*.jsonl
*.json
*.csv
*.tar
*.zip
data/**
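One way to apply that filter is sketched below, using only the standard library. The names DATA_PATTERNS, is_data_file, and split_changes are hypothetical, and data/* stands in for data/** because fnmatch-style wildcards already match across path separators.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple

# Patterns taken from the list above; adjust them to the dataset's real layout.
DATA_PATTERNS = ("*.parquet", "*.jsonl", "*.json", "*.csv", "*.tar", "*.zip", "data/*")


def is_data_file(path: str, patterns: Iterable[str] = DATA_PATTERNS) -> bool:
    """True if the repo path looks like a data file rather than card/config metadata."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def split_changes(changed_paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split changed paths into (data files, metadata-only paths), both sorted."""
    paths = list(changed_paths)
    data = sorted(p for p in paths if is_data_file(p))
    metadata_only = sorted(p for p in paths if not is_data_file(p))
    return data, metadata_only
```

If the data list comes back empty, the commit was a metadata-only change and heavy reprocessing can be skipped. Note that *.json is a broad pattern and may also catch configuration files, so it is worth tightening per dataset.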

Download only changed paths

After you know the changed paths, use the Hub download APIs as the fetching layer. For example:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    local_dir="./changed_shards",
)

You can also use dry_run=True to check what would be downloaded and whether files are already cached:

from huggingface_hub import snapshot_download

infos = snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    dry_run=True,
)

for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)

I would treat this as a download optimization step, not as the source of truth for semantic dataset changes.

The key design question: what is the change boundary?

The workaround is much easier if the dataset is already organized into stable shards.

So my first implementation would probably target file/shard-level deltas, not row-level deltas.


Why dataset layout matters

For large datasets, the Hub documentation generally points users toward formats such as Parquet and WebDataset.

There is also relevant forum guidance from lhoestq in a large-dataset discussion: for a large dataset packaged as one huge zip plus a custom , the recommendation was to use multiple shards and formats such as WebDataset or Parquet instead: How to load a large HF dataset efficiently?.

That advice is about dataset layout, not directly about delta fetching. But it points to the same practical boundary: if the dataset is split into meaningful shards, downstream consumers can reprocess only changed shards. If everything is hidden inside one huge file, delta processing becomes much harder.

If you control the producer side, I would make the dataset downstream-friendly:

data/
  date=2026-07-01/part-00000.parquet
  date=2026-07-01/part-00001.parquet
  date=2026-07-02/part-00000.parquet

_manifest.jsonl

_changelog/
  2026-07-01.jsonl
  2026-07-02.jsonl

_schema/
  v1.json

Useful producer-side properties:

append-only shard names
stable row/example IDs
no rewriting of old shards unless necessary
explicit tombstones or delete markers if deletes matter
schema versioning
a manifest or changelog that downstream consumers can trust
batched commits rather than many tiny high-frequency commits
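With a layout like the one above, a consumer can map changed files back to partitions and reprocess per partition. This is only a sketch: the date=YYYY-MM-DD directory convention comes from the example layout, while PARTITION_RE and group_by_date_partition are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Matches a hive-style "date=YYYY-MM-DD/" directory anywhere in the path.
PARTITION_RE = re.compile(r"(?:^|/)date=(\d{4}-\d{2}-\d{2})/")


def group_by_date_partition(paths: Iterable[str]) -> Dict[Optional[str], List[str]]:
    """Group changed paths by their date partition; unpartitioned paths go under None."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for path in sorted(paths):
        match = PARTITION_RE.search(path)
        key = match.group(1) if match else None
        groups[key].append(path)
    return dict(groups)
```

Each key is then a natural unit of work: a downstream job can rebuild exactly the partitions that appear in the result and leave the rest alone.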

There is a related datasets issue for producer-side appending, push_to_hub(..., append=True). That issue is not the same as consumer-side delta fetching, but it is a useful signal that incremental dataset workflows are not just a one-line solved path for every case.

Things I would not over-assume

The main trap is mixing transport/cache-level optimization with application-level delta semantics.

For example, the Hub docs on editing datasets and PyArrow integration discuss optimized Parquet/Xet behavior that can reduce upload/download/storage costs. That is useful, but I would not treat chunk-level deduplication as a replacement for stable row IDs, manifests, or changelogs.

Similarly, the Dataset Viewer /parquet endpoint can list converted Parquet files for a dataset. That can be useful for inspection, but I would not treat viewer-converted Parquet files as a general-purpose dataset changelog unless your pipeline is explicitly designed around that derived representation.


A few edge cases I would handle explicitly

README/card-only updates: do not trigger heavy processing.
Deleted files: remove or mark corresponding downstream state.
Renamed files: may initially appear as delete + add.
Schema changes: treat separately from normal row additions.
Branch/tag updates: filter to the branch you actually consume, usually main.
Private/gated datasets: make sure the token used by the webhook worker or poller has access.
Retries: make processing idempotent for the same old/new SHA range.
Concurrent updates: if several commits arrive while processing, compare from last_processed_sha to the newest target SHA, or process ranges in order.
Cache cleanup: manage Hub cache separately from pipeline state.
Streaming: useful for avoiding full materialization, but not a delta detector by itself.
Existing shard rewrites: reprocess the whole changed shard unless the dataset gives a stronger row-level contract.
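For the rename case, one possible heuristic is to pair a deleted path with an added path that carries the same blob_id, since identical content should yield an identical blob. The sketch below assumes each manifest has been reduced to a path -> blob_id mapping; detect_renames is a hypothetical helper, not a Hub API.

```python
from typing import Dict, List, Optional, Set, Tuple


def detect_renames(
    old: Dict[str, Optional[str]],
    new: Dict[str, Optional[str]],
) -> Tuple[Dict[str, str], Set[str], Set[str]]:
    """Return (renames old_path -> new_path, truly added paths, truly deleted paths)."""
    added = set(new) - set(old)
    deleted = set(old) - set(new)

    # Index deleted paths by blob_id, skipping entries without one.
    deleted_by_blob: Dict[str, List[str]] = {}
    for path in sorted(deleted):
        blob_id = old[path]
        if blob_id is not None:
            deleted_by_blob.setdefault(blob_id, []).append(path)

    renames: Dict[str, str] = {}
    for path in sorted(added):
        blob_id = new[path]
        candidates = deleted_by_blob.get(blob_id) if blob_id is not None else None
        if candidates:
            # Same content under a new name: treat as a rename, not delete + add.
            renames[candidates.pop(0)] = path

    return renames, added - set(renames.values()), deleted - set(renames)
```

A detected rename only needs a downstream path update, whereas the remaining added and deleted paths still need real processing or cleanup.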

There is also a practical HF blog example that uses a Hub compare URL with oldSha..newSha to identify changed files and skip README-only updates: The 5 Most Under-Rated Tools on Hugging Face. I would treat that as a useful implementation pattern, while still considering manifest comparison through huggingface_hub APIs as the more general approach to describe.

If you really need row-level CDC

If the real requirement is “give me inserted/updated/deleted rows since version X”, I would treat that as a changelog/table-format requirement, not something to infer from arbitrary Hub files.

Systems built for this kind of change tracking have explicit metadata, snapshots, operation logs, or change streams. A generic HF dataset repo may not have that contract unless the dataset producer publishes it.

For HF datasets, row-level CDC usually requires at least one of:

stable row/example IDs
producer-published changelog
producer-published manifest
tombstone/delete markers
table-format metadata
or accepting shard-level reprocessing instead
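To illustrate why stable IDs matter, here is a minimal sketch of a row-level diff between two versions of the same shard once both are loaded as lists of dicts. It assumes every row carries a stable "id" field and JSON-serializable values; row_fingerprint and diff_rows are hypothetical helpers.

```python
import hashlib
import json
from typing import Dict, Hashable, Iterable, List


def row_fingerprint(row: dict, id_field: str = "id") -> str:
    """Hash the row content (excluding the ID) in a key-order-independent way."""
    content = {key: value for key, value in row.items() if key != id_field}
    encoded = json.dumps(content, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def diff_rows(
    old_rows: Iterable[dict],
    new_rows: Iterable[dict],
    id_field: str = "id",
) -> Dict[str, List[Hashable]]:
    """Classify row IDs as inserted, updated, or deleted between two versions."""
    old = {row[id_field]: row_fingerprint(row, id_field) for row in old_rows}
    new = {row[id_field]: row_fingerprint(row, id_field) for row in new_rows}
    return {
        "inserted": sorted(set(new) - set(old)),
        "updated": sorted(i for i in set(old) & set(new) if old[i] != new[i]),
        "deleted": sorted(set(old) - set(new)),
    }
```

Without a stable ID there is nothing to join the two versions on, which is why shard-level reprocessing is the fallback in that case.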

Minimal test I would run first

Before building a full system, I would try this small experiment:

Pick a dataset repo and two revisions: old_sha and new_sha.
Use list_repo_tree(..., recursive=True, expand=True) for both revisions.
Filter to likely data files.
Compare path, size, blob_id, lfs.sha256, and/or xet_hash.

Classify added / modified / deleted files.
Dry-run download only those changed paths.
Check whether the changed files are a useful reprocessing unit for your pipeline.
If yes, build the webhook/polling + retryable pipeline around that.
If no, the dataset layout or missing producer metadata is probably the limiting factor.


Sketch of the manifest comparison test

from huggingface_hub import HfApi, snapshot_download

api = HfApi()
repo_id = "namespace/dataset_name"

def file_identity(item):
    lfs = getattr(item, "lfs", None)

    return {
        "path": getattr(item, "path", None),
        "size": getattr(item, "size", None),
        "blob_id": getattr(item, "blob_id", None),
        "lfs_sha256": getattr(lfs, "sha256", None) if lfs else None,
        "xet_hash": getattr(item, "xet_hash", None),
    }

def build_manifest(repo_id: str, revision: str):
    manifest = {}

    for item in api.list_repo_tree(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        recursive=True,
        expand=True,
    ):
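        path = getattr(item, "path", None)
        # A recursive listing also yields folder entries, which have no blob_id;
        # keep only files so folders are not reported as added/deleted data.
        if path is None or getattr(item, "blob_id", None) is None:
            continue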

        if path in {"README.md", ".gitattributes"}:
            continue

        manifest[path] = file_identity(item)

    return manifest

old_sha = "old_sha_here"
new_sha = "new_sha_here"

old = build_manifest(repo_id, old_sha)
new = build_manifest(repo_id, new_sha)

old_paths = set(old)
new_paths = set(new)

added = new_paths - old_paths
deleted = old_paths - new_paths
modified = {
    path
    for path in old_paths & new_paths
    if old[path] != new[path]
}

data_suffixes = (".parquet", ".jsonl", ".json", ".csv", ".tar", ".zip")

changed_data_files = sorted(
    path
    for path in added | modified
    if path.endswith(data_suffixes) or path.startswith("data/")
)

print("added:", sorted(added))
print("deleted:", sorted(deleted))
print("modified:", sorted(modified))
print("changed data files:", changed_data_files)

infos = snapshot_download(
    repo_id=repo_id,
    repo_type="dataset",
    revision=new_sha,
    allow_patterns=changed_data_files,
    dry_run=True,
)

for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)

This test does not solve row-level CDC. It only checks whether file/shard-level delta processing is practical for the dataset.

Short version

I would start with this:

Use Webhooks or SHA polling for detection.
Store last_processed_sha.
Compare repo file manifests between old and new revisions.
Treat changed data files/shards as the delta.
Download and reprocess only those changed shards.
Handle deletions and schema changes explicitly.
Advance last_processed_sha only after downstream success.
Do not expect arbitrary row-level deltas unless the dataset provides stable IDs, changelog, or table-format metadata.

That is not a perfect built-in delta system, but it is a practical and testable path with current Hub APIs.

source & further reading

discuss.huggingface.co (original article)