{"slug": "dataset-usage-inference-without-shadow-models-or-held-out-data", "title": "Dataset Usage Inference without Shadow Models or Held-out Data", "summary": "Researchers introduced a practical Dataset Usage Inference (DUI) framework that estimates what fraction of a dataset was used to train a machine learning model without requiring shadow models or held-out data. The method generates synthetic non-member samples and uses mixture proportion estimation to quantify dataset usage, demonstrated on large image generative models.", "body_md": "arXiv:2606.26257v1\nAbstract: How much of my data was used to train a machine learning model? Dataset Usage Inference (DUI) aims to answer this by estimating what fraction of a dataset contributed to a model's training. However, existing DUI methods rely on assumptions that rarely hold in practice: they require training expensive shadow models to imitate the target model, and they assume access to both known training samples and an in-distribution held-out set confirmed to be absent from training. These conditions make current approaches impractical for modern large models and real data ownership disputes. We introduce a practical DUI framework that removes these constraints. Our method requires neither shadow models nor real held-out data. Instead, it generates synthetic non-member samples, extracts diverse membership signals, and casts DUI as a mixture proportion estimation problem to estimate what share of the candidate dataset was used during training. 
Experiments on large image generative models show that our method reliably quantifies dataset usage, providing a practical tool for data owners to determine how much of their data was used to train a model.", "url": "https://wpnews.pro/news/dataset-usage-inference-without-shadow-models-or-held-out-data", "canonical_source": "https://arxiv.org/abs/2606.26257", "published_at": "2026-06-26 04:00:00+00:00", "updated_at": "2026-06-26 04:17:43.872809+00:00", "lang": "en", "topics": ["machine-learning", "ai-research", "ai-ethics"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/dataset-usage-inference-without-shadow-models-or-held-out-data", "markdown": "https://wpnews.pro/news/dataset-usage-inference-without-shadow-models-or-held-out-data.md", "text": "https://wpnews.pro/news/dataset-usage-inference-without-shadow-models-or-held-out-data.txt", "jsonld": "https://wpnews.pro/news/dataset-usage-inference-without-shadow-models-or-held-out-data.jsonld"}}