{"slug": "what-are-the-best-practices-for-detecting-and-fetching-deltas-from-a-dataset", "title": "What are the best practices for detecting and fetching deltas from a dataset?", "summary": "Hugging Face Hub lacks a first-class feature for row-level delta detection in datasets. Best practices involve tracking repository revisions, comparing file manifests between commits, and downloading only changed files or shards using Hub APIs like snapshot_download.", "body_md": "Hmm… my current read is that this does not exist as a first-class official feature yet; depending on the dataset, there may be workable workarounds:\n\njust in case, [@lhoestq](/u/lhoestq)\n\nI would split this into two separate problems:\n\n- **Detecting that the dataset repo changed**\n- **Deciding what the smallest safe fetch/reprocessing unit is**\n\nFor detection, I would look at [Hub Webhooks](https://huggingface.co/docs/hub/webhooks) first, or poll the dataset repo SHA with `huggingface_hub` if webhooks are not an option.\n\nFor deltas, I would not assume a generic “fetch only the new/changed rows since last time” API for arbitrary HF datasets. The practical workaround I would try first is **revision/file/shard-level tracking**:\n\n```\nlast_processed_sha\n  -> detect new_sha\n  -> compare file manifests between the two revisions\n  -> find added/modified/deleted data files\n  -> download/reprocess only those changed files or shards\n  -> advance last_processed_sha only after success\n```\n\nThis is not the same as true row-level CDC, but it is a low-risk pattern that is testable with current Hub APIs.\n\n## Suggested minimal workflow\n\n1. Keep your own processing state\n\nDo not rely only on the local Hub cache or on “latest seen revision”. 
Keep explicit downstream state, for example:\n\n```\nlast_processed_sha = the last dataset repo commit fully processed by your pipeline\n```\n\nThen:\n\n- A webhook fires, or polling detects a newer SHA.\n- Your worker compares `last_processed_sha` with the new SHA.\n- Your pipeline processes the changed files/shards.\n- Only after success, update `last_processed_sha`.\n\nThis matters because a webhook event only means “the repo changed”. It does not mean your downstream job finished successfully.\n\n2. Detect the new revision\n\nIf using webhooks, the payload includes repo/revision information such as `repo.headSha` and `updatedRefs` with `oldSha` / `newSha`, according to the [Hub Webhooks docs](https://huggingface.co/docs/hub/webhooks).\n\nIf polling, `HfApi.list_repo_commits` can be used to get recent commits, and `HfApi.dataset_info` can expose dataset metadata such as the current repo SHA.\n\n3. Compare file manifests between revisions\n\nUse repo tree metadata, for example `HfApi.list_repo_tree`, with:\n\n```\nrecursive=True\nexpand=True\nrevision=some_sha\nrepo_type=\"dataset\"\n```\n\nThen compare fields such as:\n\n```\npath\nsize\nblob_id\nlfs.sha256, if present\nxet_hash, if present\nlast_commit, if useful\n```\n\nThe output you want is something like:\n\n```\nadded files\nmodified files\ndeleted files\nmetadata-only changes\n```\n\nThen filter to actual data files such as:\n\n```\n*.parquet\n*.jsonl\n*.json\n*.csv\n*.tar\n*.zip\ndata/**\n```\n\n4. 
Download only changed paths\n\nAfter you know the changed paths, use the Hub download APIs as the fetching layer. For example:\n\n``` python\nfrom huggingface_hub import snapshot_download\n\n# changed_data_files: the changed paths found by the manifest comparison\nsnapshot_download(\n    repo_id=\"namespace/dataset_name\",\n    repo_type=\"dataset\",\n    revision=\"new_sha_here\",\n    allow_patterns=changed_data_files,\n    local_dir=\"./changed_shards\",\n)\n```\n\nYou can also use `dry_run=True` to check what would be downloaded and whether files are already cached:\n\n``` python\nfrom huggingface_hub import snapshot_download\n\n# With dry_run=True nothing is downloaded; one info object per matching file is returned\ninfos = snapshot_download(\n    repo_id=\"namespace/dataset_name\",\n    repo_type=\"dataset\",\n    revision=\"new_sha_here\",\n    allow_patterns=changed_data_files,\n    dry_run=True,\n)\n\nfor info in infos:\n    print(info.filename, info.file_size, info.is_cached, info.will_download)\n```\n\nI would treat this as a download optimization step, not as the source of truth for semantic dataset changes.\n\nThe key design question: what is the change boundary?\n\nThe workaround is much easier if the dataset is already organized into stable shards.\n\n| Dataset layout | Practical downstream strategy |\n| --- | --- |\n| Append-only Parquet shards | Process only new/changed Parquet files. Best case. |\n| Append-only WebDataset shards | Process only new/changed `.tar` shards. Also good. |\n| Stable date/source partitions | Reprocess changed partitions or shards. |\n| Existing shards are rewritten | Reprocess the rewritten shards. Do not infer row deltas unless you have IDs/changelog. |\n| One giant file/archive | Hard. Any small logical change may force whole-file reprocessing. |\n| Custom loading script hides file boundaries | Harder. Inspect actual repo files if possible. |\n| Stable row IDs + changelog/tombstones | Row-level workflows become possible. |\n| No stable IDs/changelog | Row-level delta is risky to infer. 
|\n\nSo my first implementation would probably target **file/shard-level deltas**, not row-level deltas.\n\n## Why dataset layout matters\n\nFor large datasets, the Hub documentation generally points users toward formats such as Parquet and WebDataset.\n\nThere is also relevant forum guidance from lhoestq in a large-dataset loading discussion: for a large dataset packaged as one huge zip plus a custom loader, the recommendation was to use multiple shards and formats such as WebDataset or Parquet instead: [How to load a large HF dataset efficiently?](https://discuss.huggingface.co/t/how-to-load-a-large-hf-dataset-efficiently/69288).\n\nThat advice is about loading and dataset layout, not directly about delta fetching. But it points to the same practical boundary: if the dataset is split into meaningful shards, downstream consumers can reprocess only changed shards. If everything is hidden inside one huge file, delta processing becomes much harder.\n\nIf you control the producer side, I would make the dataset downstream-friendly:\n\n```\ndata/\n  date=2026-07-01/part-00000.parquet\n  date=2026-07-01/part-00001.parquet\n  date=2026-07-02/part-00000.parquet\n\n_manifest.jsonl\n\n_changelog/\n  2026-07-01.jsonl\n  2026-07-02.jsonl\n\n_schema/\n  v1.json\n```\n\nUseful producer-side properties:\n\n- append-only shard names\n- stable row/example IDs\n- no rewriting of old shards unless necessary\n- explicit tombstones or delete markers if deletes matter\n- schema versioning\n- a manifest or changelog that downstream consumers can trust\n- batched commits rather than many tiny high-frequency commits\n\nThere is a related `datasets` issue for producer-side appending, `push_to_hub(..., append=True)`. 
That issue is not the same as consumer-side delta fetching, but it is a useful signal that incremental dataset workflows are not yet a solved one-liner for every case.\n\nThings I would not over-assume\n\nThe main trap is mixing **transport/cache-level optimization** with **application-level delta semantics**.\n\n| Mechanism | Helps with | Does not automatically give |\n| --- | --- | --- |\n| Webhook | Detecting repo updates | Changed rows/examples |\n| Repo SHA | Version boundary | Semantic delta by itself |\n| Manifest comparison | Changed files/shards | Exact row-level insert/update/delete |\n| `snapshot_download` / `hf_hub_download` | Fetching selected files | Delta detection by itself |\n| Hub cache | Avoiding redundant downloads | Pipeline processing state |\n| Xet/chunk deduplication | Storage/transfer efficiency | Dataset-level changelog |\n| Dataset Viewer Parquet files | Inspection/queryable derived artifacts | General delta API |\n\nFor example, the Hub docs on [editing datasets](https://huggingface.co/docs/hub/datasets-editing) and [PyArrow integration](https://huggingface.co/docs/hub/datasets-pyarrow) discuss optimized Parquet/Xet behavior that can reduce upload/download/storage costs. That is useful, but I would not treat chunk-level deduplication as a replacement for stable row IDs, manifests, or changelogs.\n\nSimilarly, the Dataset Viewer `/parquet` endpoint can list converted Parquet files for a dataset. 
That can be useful for inspection, but I would not treat viewer-converted Parquet files as a general-purpose dataset changelog unless your pipeline is explicitly designed around that derived representation.\n\n## A few edge cases I would handle explicitly\n\n- **README/card-only updates**: do not trigger heavy processing.\n- **Deleted files**: remove or mark corresponding downstream state.\n- **Renamed files**: may initially appear as delete + add.\n- **Schema changes**: treat separately from normal row additions.\n- **Branch/tag updates**: filter to the branch you actually consume, usually `main`.\n- **Private/gated datasets**: make sure the token used by the webhook worker or poller has access.\n- **Retries**: make processing idempotent for the same old/new SHA range.\n- **Concurrent updates**: if several commits arrive while processing, compare from `last_processed_sha` to the newest target SHA, or process ranges in order.\n- **Cache cleanup**: manage Hub cache separately from pipeline state.\n- **Streaming**: useful for avoiding full materialization, but not a delta detector by itself.\n- **Existing shard rewrites**: reprocess the whole changed shard unless the dataset gives a stronger row-level contract.\n\nThere is also a practical HF blog example that uses a Hub compare URL with `oldSha..newSha` to identify changed files and skip README-only updates: [The 5 Most Under-Rated Tools on Hugging Face](https://github.com/huggingface/blog/blob/main/unsung-heroes.md). I would treat that as a useful implementation pattern, while still considering manifest comparison through `huggingface_hub` APIs as the more general approach.\n\nIf you really need row-level CDC\n\nIf the real requirement is “give me inserted/updated/deleted rows since version X”, I would treat that as a **changelog/table-format requirement**, not something to infer from arbitrary Hub files.\n\nFor comparison, systems built around a table format or change log have explicit metadata, snapshots, operation logs, or change streams. 
A generic HF dataset repo may not have that contract unless the dataset producer publishes it.\n\nFor HF datasets, row-level CDC usually requires at least one of:\n\n```\nstable row/example IDs\nproducer-published changelog\nproducer-published manifest\ntombstone/delete markers\ntable-format metadata\nor accepting shard-level reprocessing instead\n```\n\nMinimal test I would run first\n\nBefore building a full system, I would try this small experiment:\n\n- Pick a dataset repo and two revisions: `old_sha` and `new_sha`.\n- Use `list_repo_tree(..., recursive=True, expand=True)` for both revisions.\n- Filter to likely data files.\n- Compare `path`, `size`, `blob_id`, `lfs.sha256`, and/or `xet_hash`.\n- Classify added / modified / deleted files.\n- Dry-run download only those changed paths.\n- Check whether the changed files are a useful reprocessing unit for your pipeline.\n- If yes, build the webhook/polling + retryable pipeline around that.\n- If no, the dataset layout or missing producer metadata is probably the limiting factor.\n\n## Sketch of the manifest comparison test\n\n``` python\nfrom huggingface_hub import HfApi, snapshot_download\n\napi = HfApi()\nrepo_id = \"namespace/dataset_name\"\n\ndef file_identity(item):\n    # Collect the fields that identify a file's content at one revision.\n    lfs = getattr(item, \"lfs\", None)\n\n    return {\n        \"path\": getattr(item, \"path\", None),\n        \"size\": getattr(item, \"size\", None),\n        \"blob_id\": getattr(item, \"blob_id\", None),\n        \"lfs_sha256\": getattr(lfs, \"sha256\", None) if lfs else None,\n        \"xet_hash\": getattr(item, \"xet_hash\", None),\n    }\n\ndef build_manifest(repo_id: str, revision: str):\n    # Map each file path to its identity fields for the given revision.\n    manifest = {}\n\n    for item in api.list_repo_tree(\n        repo_id=repo_id,\n        repo_type=\"dataset\",\n        revision=revision,\n        recursive=True,\n        expand=True,\n    ):\n        path = getattr(item, \"path\", None)\n        if path is None:\n            continue\n\n        # Skip folder entries: only file entries carry a blob_id.\n        if getattr(item, \"blob_id\", None) is None:\n            continue\n\n        if path in 
{\"README.md\", \".gitattributes\"}:\n            continue\n\n        manifest[path] = file_identity(item)\n\n    return manifest\n\nold_sha = \"old_sha_here\"\nnew_sha = \"new_sha_here\"\n\nold = build_manifest(repo_id, old_sha)\nnew = build_manifest(repo_id, new_sha)\n\nold_paths = set(old)\nnew_paths = set(new)\n\nadded = new_paths - old_paths\ndeleted = old_paths - new_paths\nmodified = {\n    path\n    for path in old_paths & new_paths\n    if old[path] != new[path]\n}\n\ndata_suffixes = (\".parquet\", \".jsonl\", \".json\", \".csv\", \".tar\", \".zip\")\n\nchanged_data_files = sorted(\n    path\n    for path in added | modified\n    if path.endswith(data_suffixes) or path.startswith(\"data/\")\n)\n\nprint(\"added:\", sorted(added))\nprint(\"deleted:\", sorted(deleted))\nprint(\"modified:\", sorted(modified))\nprint(\"changed data files:\", changed_data_files)\n\ninfos = snapshot_download(\n    repo_id=repo_id,\n    repo_type=\"dataset\",\n    revision=new_sha,\n    allow_patterns=changed_data_files,\n    dry_run=True,\n)\n\nfor info in infos:\n    print(info.filename, info.file_size, info.is_cached, info.will_download)\n```\n\nThis test does not solve row-level CDC. 
It only checks whether file/shard-level delta processing is practical for the dataset.\n\nShort version\n\nI would start with this:\n\n```\nUse Webhooks or SHA polling for detection.\nStore last_processed_sha.\nCompare repo file manifests between old and new revisions.\nTreat changed data files/shards as the delta.\nDownload and reprocess only those changed shards.\nHandle deletions and schema changes explicitly.\nAdvance last_processed_sha only after downstream success.\nDo not expect arbitrary row-level deltas unless the dataset provides stable IDs, changelog, or table-format metadata.\n```\n\nThat is not a perfect built-in delta system, but it is a practical and testable path with current Hub APIs.", "url": "https://wpnews.pro/news/what-are-the-best-practices-for-detecting-and-fetching-deltas-from-a-dataset", "canonical_source": "https://discuss.huggingface.co/t/what-are-the-best-practices-for-detecting-and-fetching-deltas-from-a-dataset/177360#post_3", "published_at": "2026-07-03 11:42:49+00:00", "updated_at": "2026-07-03 21:28:51.846934+00:00", "lang": "en", "topics": ["developer-tools", "ai-infrastructure"], "entities": ["Hugging Face Hub", "HfApi", "snapshot_download", "Hub Webhooks"], "alternates": {"html": "https://wpnews.pro/news/what-are-the-best-practices-for-detecting-and-fetching-deltas-from-a-dataset", "markdown": "https://wpnews.pro/news/what-are-the-best-practices-for-detecting-and-fetching-deltas-from-a-dataset.md", "text": "https://wpnews.pro/news/what-are-the-best-practices-for-detecting-and-fetching-deltas-from-a-dataset.txt", "jsonld": "https://wpnews.pro/news/what-are-the-best-practices-for-detecting-and-fetching-deltas-from-a-dataset.jsonld"}}