
Dataset Usage Inference without Shadow Models or Held-out Data

Researchers introduced a practical Dataset Usage Inference (DUI) framework that estimates what fraction of a dataset was used to train a machine learning model without requiring shadow models or held-out data. The method generates synthetic non-member samples and uses mixture proportion estimation to quantify dataset usage, demonstrated on large image generative models.

Published Jun 26, 2026

arXiv:2606.26257v1

Abstract: How much of my data was used to train a machine learning model? Dataset Usage Inference (DUI) aims to answer this by estimating what fraction of a dataset contributed to a model's training. However, existing DUI methods rely on assumptions that rarely hold in practice: they require training expensive shadow models to imitate the target model, and they assume access to both known training samples and an in-distribution held-out set confirmed to be absent from training. These conditions make current approaches impractical for modern large models and real data ownership disputes. We introduce a practical DUI framework that removes these constraints. Our method requires neither shadow models nor real held-out data. Instead, it generates synthetic non-member samples, extracts diverse membership signals, and casts DUI as a mixture proportion estimation problem to estimate what share of the candidate dataset was used during training. Experiments on large image generative models show that our method reliably quantifies dataset usage, providing a practical tool for data owners to determine how much of their data was used to train a model.
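To make the mixture proportion estimation idea concrete, here is a minimal Python sketch. It is not the paper's estimator: it swaps in a simple CDF-ratio bound and assumes each sample has already been reduced to one scalar membership score, where higher means "more member-like". The function name `estimate_usage_fraction`, its parameters, and the Gaussian toy scores are all hypothetical. The reasoning is that if the candidate scores are a mixture of member and non-member scores with member share alpha, then F_cand(t) >= (1 - alpha) * F_non(t) at every threshold t, so the smallest observed ratio F_cand(t) / F_non(t) approximates the non-member share.

```python
import numpy as np


def estimate_usage_fraction(candidate_scores, nonmember_scores,
                            min_mass=0.2, grid=50):
    """Estimate the member share of a candidate set from scalar scores.

    Illustrative CDF-ratio estimator, not the method from the paper.
    Assumes members rarely receive low scores, so the low-score region
    of the candidate set is populated almost only by non-members.
    """
    cand = np.sort(np.asarray(candidate_scores, dtype=float))
    non = np.sort(np.asarray(nonmember_scores, dtype=float))

    # Thresholds at non-member quantiles; skipping the extreme lower
    # tail (below min_mass) keeps the ratio from being noise-dominated.
    qs = np.linspace(min_mass, 1.0, grid)
    thresholds = np.quantile(non, qs)

    # Empirical CDFs of both sets evaluated at the same thresholds.
    cdf_cand = np.searchsorted(cand, thresholds, side="right") / cand.size
    cdf_non = np.searchsorted(non, thresholds, side="right") / non.size

    # The smallest ratio approximates the non-member share (1 - alpha).
    kappa = np.min(cdf_cand / cdf_non)
    return float(np.clip(1.0 - kappa, 0.0, 1.0))


# Toy data (assumed): non-member scores ~ N(0, 1), member scores ~ N(3, 1).
rng = np.random.default_rng(0)
nonmember = rng.normal(0.0, 1.0, 50_000)   # stands in for synthetic non-members
candidate = np.concatenate([
    rng.normal(3.0, 1.0, 20_000),          # 40% of candidates were "used"
    rng.normal(0.0, 1.0, 30_000),          # 60% were not
])
print(estimate_usage_fraction(candidate, nonmember))
```

Taking a minimum over noisy ratios pushes the non-member share slightly down, so this sketch tends to overestimate usage a little on finite samples. A practical system would also need to combine several membership signals into the score, which the sketch leaves out.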
