Source: discuss.huggingface.co

What are the best practices for detecting and fetching deltas from a dataset?

Hugging Face Hub lacks a first-class feature for row-level delta detection in datasets. Best practices involve tracking repository revisions, comparing file manifests between commits, and downloading only changed files or shards using Hub APIs like snapshot_download.

9 min read · Published Jul 3, 2026

Hmm… my current read is that this does not exist as a first-class official feature yet; depending on the dataset, there may be workable workarounds:

just in case, @lhoestq

I would split this into two separate problems:

  • Detecting that the dataset repo changed
  • Deciding what the smallest safe fetch/reprocessing unit is

For detection, I would look at Hub Webhooks first, or poll the dataset repo SHA with huggingface_hub if webhooks are not an option.

For deltas, I would not assume a generic “fetch only the new/changed rows since last time” API for arbitrary HF datasets. The practical workaround I would try first is revision/file/shard-level tracking:

last_processed_sha
  -> detect new_sha
  -> compare file manifests between the two revisions
  -> find added/modified/deleted data files
  -> download/reprocess only those changed files or shards
  -> advance last_processed_sha only after success

This is not the same as true row-level CDC, but it is a low-risk pattern that is testable with current Hub APIs.

Suggested minimal workflow

  1. Keep your own processing state

Do not rely only on the local Hub cache or on “latest seen revision”. Keep explicit downstream state, for example:

last_processed_sha = the last dataset repo commit fully processed by your pipeline

Then:

  • A webhook fires, or polling detects a newer SHA.
  • Your worker compares last_processed_sha with the new SHA.
  • Your pipeline processes the changed files/shards.
  • Only after success, update last_processed_sha.

This matters because a webhook event only means “the repo changed”. It does not mean your downstream job finished successfully.
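A minimal sketch of that state handling, assuming a local JSON file is an acceptable state store. The function names, the file layout, and the `process` callback are all hypothetical; `process` stands in for whatever your pipeline does with an old/new SHA range.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def load_last_processed_sha(state_path: Path) -> Optional[str]:
    """Return the last fully processed SHA, or None on the first run."""
    if not state_path.exists():
        return None
    return json.loads(state_path.read_text())["last_processed_sha"]


def save_last_processed_sha(state_path: Path, sha: str) -> None:
    """Write the state via a temp file + rename so a crash cannot leave a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"last_processed_sha": sha}, tmp)
    os.replace(tmp_name, state_path)


def run_once(
    state_path: Path,
    new_sha: str,
    process: Callable[[Optional[str], str], None],
) -> bool:
    """Process old_sha -> new_sha and advance the state only if process() succeeds."""
    old_sha = load_last_processed_sha(state_path)
    if old_sha == new_sha:
        return False  # already processed, nothing to do
    process(old_sha, new_sha)  # an exception here leaves the stored SHA untouched
    save_last_processed_sha(state_path, new_sha)
    return True
```

Because the SHA is written only after `process` returns, a failed run is retried from the same `last_processed_sha` the next time a webhook or poll fires.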

  2. Detect the new revision

If using webhooks, the payload includes repo/revision information such as repo.headSha and updatedRefs with oldSha / newSha, according to the Hub Webhooks docs.
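As a rough sketch, a webhook worker could pull the SHA range out of the payload like this. `updatedRefs`, `oldSha`, and `newSha` come from the text above; the `repo.type`, `repo.name`, and `ref` fields are my reading of the Webhooks docs and should be checked against a real payload. The function name is hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def extract_sha_range(
    payload: Dict[str, Any],
    repo_id: str,
    ref: str = "refs/heads/main",
) -> Optional[Tuple[Optional[str], str]]:
    """Return (old_sha, new_sha) for the watched dataset branch, or None if irrelevant."""
    repo = payload.get("repo") or {}
    # Assumed fields: repo.type ("dataset") and repo.name ("namespace/dataset_name").
    if repo.get("type") != "dataset" or repo.get("name") != repo_id:
        return None
    for updated in payload.get("updatedRefs") or []:
        # Ignore tag updates and other branches; only the consumed ref matters.
        if updated.get("ref") == ref and updated.get("newSha"):
            return updated.get("oldSha"), updated["newSha"]
    return None
```

I would still compare against the stored last_processed_sha rather than trusting `oldSha` alone, since events can be missed or arrive while an earlier run is in progress.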

If polling, HfApi.list_repo_commits can be used to get recent commits, and HfApi.dataset_info can expose dataset metadata such as the current repo SHA.
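A small polling sketch, assuming `HfApi.dataset_info(...).sha` is the head commit of the requested revision. The helper names are hypothetical, and the fetcher is injectable so the comparison logic can be tested without network access.

```python
from typing import Callable, Optional


def hub_head_sha(repo_id: str, revision: str = "main") -> Optional[str]:
    """Fetch the current commit SHA of a dataset repo (performs a network call)."""
    # Imported lazily so detect_new_sha() below stays usable offline with a fake fetcher.
    from huggingface_hub import HfApi

    return HfApi().dataset_info(repo_id, revision=revision).sha


def detect_new_sha(
    last_processed_sha: Optional[str],
    repo_id: str,
    fetch_head_sha: Callable[[str], Optional[str]] = hub_head_sha,
) -> Optional[str]:
    """Return the head SHA if it differs from the last processed one, else None."""
    head = fetch_head_sha(repo_id)
    if head is None or head == last_processed_sha:
        return None
    return head
```

A poller would call `detect_new_sha` on a schedule and hand any non-None result to the same worker a webhook would trigger.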

  3. Compare file manifests between revisions

Use repo tree metadata, for example HfApi.list_repo_tree, with:

recursive=True
expand=True
revision=some_sha
repo_type="dataset"

Then compare fields such as:

path
size
blob_id
lfs.sha256, if present
xet_hash, if present
last_commit, if useful

The output you want is something like:

added files
modified files
deleted files
metadata-only changes

Then filter to actual data files such as:

*.parquet
*.jsonl
*.json
*.csv
*.tar
*.zip
data/**
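One way to separate data changes from metadata-only changes, sketched with the standard library's fnmatch. The pattern list mirrors the one above (with `data/*` standing in for `data/**`, since fnmatch's `*` also matches `/`); the function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Patterns from the list above; tune these to the dataset you actually consume.
DATA_PATTERNS = ("*.parquet", "*.jsonl", "*.json", "*.csv", "*.tar", "*.zip", "data/*")


def is_data_file(path: str, patterns: Iterable[str] = DATA_PATTERNS) -> bool:
    """True if the repo path looks like a data file under the given glob patterns."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def split_changes(changed_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Split changed paths into data files and everything else (cards, attributes, ...)."""
    data: List[str] = []
    metadata: List[str] = []
    for path in sorted(changed_paths):
        (data if is_data_file(path) else metadata).append(path)
    return {"data": data, "metadata": metadata}
```

If `split_changes(...)["data"]` is empty, the commit range is a metadata-only change and heavy processing can be skipped. Note that a broad pattern like `*.json` may also catch config-style files, so the list deserves a look per dataset.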

  4. Download only changed paths

After you know the changed paths, use the Hub download APIs as the fetching layer. For example:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    local_dir="./changed_shards",
)

You can also use dry_run=True to check what would be downloaded and whether files are already cached:

from huggingface_hub import snapshot_download

infos = snapshot_download(
    repo_id="namespace/dataset_name",
    repo_type="dataset",
    revision="new_sha_here",
    allow_patterns=changed_data_files,
    dry_run=True,
)

for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)

I would treat this as a download optimization step, not as the source of truth for semantic dataset changes.
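One detail worth checking when passing exact paths as allow_patterns: as far as I know, those patterns are matched with fnmatch-style globbing, so a path containing `[`, `*`, or `?` would be read as a pattern rather than a literal name. A hedged sketch of escaping paths first, using only the standard library (`exact_path_patterns` and `pattern_matches` are hypothetical helpers, and `pattern_matches` is just a local stand-in for the Hub-side matching):

```python
import glob
from fnmatch import fnmatchcase
from typing import Iterable, List


def exact_path_patterns(paths: Iterable[str]) -> List[str]:
    """Escape glob metacharacters so each pattern matches only its literal path."""
    return [glob.escape(path) for path in paths]


def pattern_matches(path: str, pattern: str) -> bool:
    """Offline check of how an fnmatch-style pattern treats a given repo path."""
    return fnmatchcase(path, pattern)
```

The escaped list could then be passed as `allow_patterns=exact_path_patterns(changed_data_files)`. For typical shard names such as `data/date=2026-07-01/part-00000.parquet` the escaping is a no-op.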

The key design question: what is the change boundary?

The workaround is much easier if the dataset is already organized into stable shards.

| Dataset layout | Practical downstream strategy |
| --- | --- |
| Append-only Parquet shards | Process only new/changed Parquet files. Best case. |
| Append-only WebDataset shards | Process only new/changed .tar shards. Also good. |
| Stable date/source partitions | Reprocess changed partitions or shards. |
| Existing shards are rewritten | Reprocess the rewritten shards. Do not infer row deltas unless you have IDs/changelog. |
| One giant file/archive | Hard. Any small logical change may force whole-file reprocessing. |
| Custom script hides file boundaries | Harder. Inspect actual repo files if possible. |
| Stable row IDs + changelog/tombstones | Row-level workflows become possible. |
| No stable IDs/changelog | Row-level delta is risky to infer. |

So my first implementation would probably target file/shard-level deltas, not row-level deltas.

Why dataset layout matters

For large datasets, the Hub documentation generally points users toward formats such as Parquet and WebDataset.

There is also relevant forum guidance from lhoestq in a large-dataset discussion: for a large dataset packaged as one huge zip plus a custom script, the recommendation was to use multiple shards and formats such as WebDataset or Parquet instead (see the thread "How to load a large HF dataset efficiently?").

That advice is about loading and dataset layout, not directly about delta fetching. But it points to the same practical boundary: if the dataset is split into meaningful shards, downstream consumers can reprocess only changed shards. If everything is hidden inside one huge file, delta processing becomes much harder.

If you control the producer side, I would make the dataset downstream-friendly:

data/
  date=2026-07-01/part-00000.parquet
  date=2026-07-01/part-00001.parquet
  date=2026-07-02/part-00000.parquet

_manifest.jsonl

_changelog/
  2026-07-01.jsonl
  2026-07-02.jsonl

_schema/
  v1.json

Useful producer-side properties:

  • append-only shard names
  • stable row/example IDs
  • no rewriting of old shards unless necessary
  • explicit tombstones or delete markers if deletes matter
  • schema versioning
  • a manifest or changelog that downstream consumers can trust
  • batched commits rather than many tiny high-frequency commits
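To make the changelog/tombstone idea concrete, here is a sketch of a consumer replaying one changelog file into downstream state. The line format (`op`, `id`, `row`) is purely hypothetical; nothing on the Hub defines it, and a real producer would have to publish and document its own schema.

```python
import json
from typing import Any, Dict, Iterable


def apply_changelog(
    state: Dict[str, Dict[str, Any]],
    lines: Iterable[str],
) -> Dict[str, Dict[str, Any]]:
    """Replay JSONL changelog entries onto a dict keyed by stable row ID."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if entry["op"] == "upsert":
            state[entry["id"]] = entry["row"]
        elif entry["op"] == "delete":
            # Tombstone: removing an already-missing ID is a no-op, so replays are safe.
            state.pop(entry["id"], None)
        else:
            raise ValueError(f"unknown changelog op: {entry['op']!r}")
    return state
```

Replaying the same file twice gives the same state, which is the property that makes retries cheap on the consumer side.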

There is a related datasets issue for producer-side appending, push_to_hub(..., append=True). That issue is not the same as consumer-side delta fetching, but it is a useful signal that incremental dataset workflows are not just a one-line solved path for every case.

Things I would not over-assume

The main trap is mixing transport/cache-level optimization with application-level delta semantics.

| Mechanism | Helps with | Does not automatically give |
| --- | --- | --- |
| Webhook | Detecting repo updates | Changed rows/examples |
| Repo SHA | Version boundary | Semantic delta by itself |
| Manifest comparison | Changed files/shards | Exact row-level insert/update/delete |
| snapshot_download / hf_hub_download | Fetching selected files | Delta detection by itself |
| Hub cache | Avoiding redundant downloads | Pipeline processing state |
| Xet/chunk deduplication | Storage/transfer efficiency | Dataset-level changelog |
| Dataset Viewer Parquet files | Inspection/queryable derived artifacts | General delta API |

For example, the Hub docs on editing datasets and PyArrow integration discuss optimized Parquet/Xet behavior that can reduce upload/download/storage costs. That is useful, but I would not treat chunk-level deduplication as a replacement for stable row IDs, manifests, or changelogs.

Similarly, the Dataset Viewer /parquet endpoint can list converted Parquet files for a dataset. That can be useful for inspection, but I would not treat viewer-converted Parquet files as a general-purpose dataset changelog unless your pipeline is explicitly designed around that derived representation.

A few edge cases I would handle explicitly

  • README/card-only updates: do not trigger heavy processing.
  • Deleted files: remove or mark corresponding downstream state.
  • Renamed files: may initially appear as delete + add.
  • Schema changes: treat separately from normal row additions.
  • Branch/tag updates: filter to the branch you actually consume, usually main.
  • Private/gated datasets: make sure the token used by the webhook worker or poller has access.
  • Retries: make processing idempotent for the same old/new SHA range.
  • Concurrent updates: if several commits arrive while processing, compare from last_processed_sha to the newest target SHA, or process ranges in order.
  • Cache cleanup: manage Hub cache separately from pipeline state.
  • Streaming: useful for avoiding full materialization, but not a delta detector by itself.
  • Existing shard rewrites: reprocess the whole changed shard unless the dataset gives a stronger row-level contract.
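For the renamed-files case, one possible sketch pairs a deleted path with an added path when their content identity matches, so the shard is not needlessly reprocessed. It assumes each manifest entry is a dict with optional `lfs_sha256`, `xet_hash`, and `blob_id` keys (the fields listed in the manifest comparison step); the function names are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Manifest = Dict[str, Dict[str, Optional[str]]]


def content_key(entry: Dict[str, Optional[str]]) -> Optional[str]:
    """Pick the strongest available content identifier for a file entry."""
    return entry.get("lfs_sha256") or entry.get("xet_hash") or entry.get("blob_id")


def detect_renames(old: Manifest, new: Manifest) -> Tuple[Dict[str, str], Set[str], Set[str]]:
    """Return (renames old->new, truly added paths, truly deleted paths)."""
    added = set(new) - set(old)
    deleted = set(old) - set(new)

    # Index deleted files by content so an added file with the same content can claim one.
    deleted_by_content: Dict[str, str] = {}
    for path in sorted(deleted):
        key = content_key(old[path])
        if key is not None:
            deleted_by_content.setdefault(key, path)

    renames: Dict[str, str] = {}
    for path in sorted(added):
        key = content_key(new[path])
        old_path = deleted_by_content.pop(key, None) if key is not None else None
        if old_path is not None:
            renames[old_path] = path

    return renames, added - set(renames.values()), deleted - set(renames)
```

Whether a rename should update downstream state in place or be treated as delete + add is a pipeline decision; the sketch only surfaces the pairing.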

There is also a practical HF blog example that uses a Hub compare URL with oldSha..newSha to identify changed files and skip README-only updates: The 5 Most Under-Rated Tools on Hugging Face. I would treat that as a useful implementation pattern, while still considering manifest comparison through huggingface_hub APIs as the more general approach to describe.

If you really need row-level CDC

If the real requirement is “give me inserted/updated/deleted rows since version X”, I would treat that as a changelog/table-format requirement, not something to infer from arbitrary Hub files.

For comparison, systems designed for change tracking have explicit metadata, snapshots, operation logs, or change streams. A generic HF dataset repo may not have that contract unless the dataset producer publishes it.

For HF datasets, row-level CDC usually requires at least one of:

stable row/example IDs
producer-published changelog
producer-published manifest
tombstone/delete markers
table-format metadata
or accepting shard-level reprocessing instead
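If stable row IDs do exist, a rewritten shard can be diffed at row level after loading both versions. A minimal in-memory sketch, assuming rows are plain dicts with an `id` field (the field name and both function names are hypothetical) and that a shard fits in memory:

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def row_fingerprint(row: Dict[str, Any]) -> str:
    """Hash a row's content in a key-order-independent way."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_rows(
    old_rows: Iterable[Dict[str, Any]],
    new_rows: Iterable[Dict[str, Any]],
    id_field: str = "id",
) -> Dict[str, List[Any]]:
    """Classify row IDs as inserted, updated, or deleted between two shard versions."""
    old = {row[id_field]: row_fingerprint(row) for row in old_rows}
    new = {row[id_field]: row_fingerprint(row) for row in new_rows}
    return {
        "inserted": sorted(new.keys() - old.keys()),
        "updated": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "deleted": sorted(old.keys() - new.keys()),
    }
```

Without stable IDs this comparison has nothing to key on, which is why shard-level reprocessing is the safer fallback.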

Minimal test I would run first

Before building a full system, I would try this small experiment:

  • Pick a dataset repo and two revisions: old_sha and new_sha.
  • Use list_repo_tree(..., recursive=True, expand=True) for both revisions.
  • Filter to likely data files.
  • Compare path, size, blob_id, lfs.sha256, and/or xet_hash.
  • Classify added / modified / deleted files.
  • Dry-run download only those changed paths.
  • Check whether the changed files are a useful reprocessing unit for your pipeline.
  • If yes, build the webhook/polling + retryable pipeline around that.
  • If no, the dataset layout or missing producer metadata is probably the limiting factor.

Sketch of the manifest comparison test

from huggingface_hub import HfApi, snapshot_download

api = HfApi()
repo_id = "namespace/dataset_name"

def file_identity(item):
    lfs = getattr(item, "lfs", None)

    return {
        "path": getattr(item, "path", None),
        "size": getattr(item, "size", None),
        "blob_id": getattr(item, "blob_id", None),
        "lfs_sha256": getattr(lfs, "sha256", None) if lfs else None,
        "xet_hash": getattr(item, "xet_hash", None),
    }

def build_manifest(repo_id: str, revision: str):
    manifest = {}

    for item in api.list_repo_tree(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        recursive=True,
        expand=True,
    ):
        path = getattr(item, "path", None)
        if path is None:
            continue

        if path in {"README.md", ".gitattributes"}:
            continue

        manifest[path] = file_identity(item)

    return manifest

old_sha = "old_sha_here"
new_sha = "new_sha_here"

old = build_manifest(repo_id, old_sha)
new = build_manifest(repo_id, new_sha)

old_paths = set(old)
new_paths = set(new)

added = new_paths - old_paths
deleted = old_paths - new_paths
modified = {
    path
    for path in old_paths & new_paths
    if old[path] != new[path]
}

data_suffixes = (".parquet", ".jsonl", ".json", ".csv", ".tar", ".zip")

changed_data_files = sorted(
    path
    for path in added | modified
    if path.endswith(data_suffixes) or path.startswith("data/")
)

print("added:", sorted(added))
print("deleted:", sorted(deleted))
print("modified:", sorted(modified))
print("changed data files:", changed_data_files)

infos = snapshot_download(
    repo_id=repo_id,
    repo_type="dataset",
    revision=new_sha,
    allow_patterns=changed_data_files,
    dry_run=True,
)

for info in infos:
    print(info.filename, info.file_size, info.is_cached, info.will_download)

This test does not solve row-level CDC. It only checks whether file/shard-level delta processing is practical for the dataset.

Short version

I would start with this:

Use Webhooks or SHA polling for detection.
Store last_processed_sha.
Compare repo file manifests between old and new revisions.
Treat changed data files/shards as the delta.
Download and reprocess only those changed shards.
Handle deletions and schema changes explicitly.
Advance last_processed_sha only after downstream success.
Do not expect arbitrary row-level deltas unless the dataset provides stable IDs, changelog, or table-format metadata.

That is not a perfect built-in delta system, but it is a practical and testable path with current Hub APIs.
