# New framework for auditing machine unlearning

> Source: <https://research.google/blog/new-framework-for-auditing-machine-unlearning/>
> Published: 2026-06-10 17:34:55+00:00

June 10, 2026

Mónica Ribero, Research Scientist, Google Research

We introduce a method for determining whether there is statistically significant evidence that two sets of observations come from different underlying distributions.

[Machine unlearning](https://blog.bluedot.org/p/what-is-machine-unlearning-and-why-is-it-useful?utm_source=google&utm_medium=pmax&utm_campaign=FoAI_&utm_term=&utm_content=&gad_source=1&gad_campaignid=22554833691&gbraid=0AAAAA_kXnYMbcfHY6n_122eOSgMgJ8Srz&gclid=CjwKCAiAzZ_NBhAEEiwAMtqKyzqcP658gcW-LVhkrI9dNSUngywj-IBJW7CAErduYTAMG2rKSYq5BhoCowcQAvD_BwE) allows AI systems to "forget" specific parts of their training data without the massive cost of retraining a model from scratch. This is essential for regulatory compliance (like [GDPR’s "Right to be Forgotten"](https://gdpr.eu/right-to-be-forgotten/)), AI safety, and model quality.

As models process increasingly massive and highly sensitive datasets, verifying machine unlearning has moved from a theoretical ideal to a strict requirement: developers must now mathematically prove privacy. However, because auditors often don’t have access to the model's internal workings or original training data, they must verify the system strictly by querying it and analyzing the output samples.

One method data scientists and researchers rely on for verification is [two-sample testing](https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing), a statistical method that determines whether two sets of observations come from different underlying distributions. For example, to verify unlearning, auditors might compare outputs from a model that never saw a specific record against outputs from a model that supposedly "forgot" it. If the two sets of outputs differ by a statistically significant margin, beyond a defined threshold, the unlearning is judged to have failed.
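As a minimal sketch of the idea (not the test proposed in this post), a permutation two-sample test can be written in a few lines. The function names, the choice of statistic (absolute difference of means), and the synthetic Gaussian "model outputs" are all illustrative assumptions.

```python
import numpy as np

def permutation_two_sample_test(x, y, statistic, num_permutations=500, seed=0):
    """Permutation p-value for H0: x and y are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(num_permutations):
        # Under H0 the group labels are exchangeable, so shuffle them.
        perm = rng.permutation(len(pooled))
        if statistic(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
            exceed += 1
    # Counting the observed statistic itself (+1) keeps the p-value valid.
    return (exceed + 1) / (num_permutations + 1)

def abs_mean_difference(x, y):
    """Illustrative statistic: only sensitive to a shift in the mean."""
    return abs(x.mean() - y.mean())

# Hypothetical scalar outputs from three models (synthetic stand-ins).
rng = np.random.default_rng(1)
reference_outputs = rng.normal(0.0, 1.0, size=300)
shifted_outputs = rng.normal(1.0, 1.0, size=300)
same_outputs = rng.normal(0.0, 1.0, size=300)

p_shifted = permutation_two_sample_test(reference_outputs, shifted_outputs, abs_mean_difference)
p_same = permutation_two_sample_test(reference_outputs, same_outputs, abs_mean_difference)
print(f"p-value (shifted model): {p_shifted:.4f}")
print(f"p-value (identical distribution): {p_same:.4f}")
```

A small p-value is evidence that the two sets of outputs come from different distributions. The statistic is the part that determines which kinds of differences the test can see.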

As models grow in size and complexity, two-sample testing and other statistical tools used for auditing machine unlearning become harder to implement and lose statistical power. To distinguish a real violation from the random noise inherent in large-scale models with enough statistical significance, an auditor needs to draw a large number of samples. This makes real-world testing computationally very expensive.

To address this growing challenge, we introduce [Regularized f-Divergence Kernel Tests](https://arxiv.org/pdf/2601.19755), presented at [AISTATS 2026](https://virtual.aistats.org/), a new framework designed to make auditing ML models much more sensitive, flexible, and accurate. We theoretically prove that our tests control the false positive rate at any sample size, and that the false negative rate converges to zero as the number of available samples increases.

Evaluating model safety often requires measuring the distance, or divergence, between two complex data sets. Different applications naturally require different notions of “distance”. While popular standard tools like [maximum mean discrepancy](https://www.onurtunali.com/ml/2019/03/08/maximum-mean-discrepancy-in-machine-learning.html) (MMD) excel at detecting broad, global shifts across data (such as a model systematically generating brighter images than its counterpart), they often lack the necessary specificity to capture complex anomalies. For instance, if the addition of a specific person's data causes a model to generate a highly specific outlier output only when prompted in a very exact way — while having an equal distribution on all other samples — traditional MMD tests might completely overlook this local shift.
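To make the contrast between global and local shifts concrete, here is a small sketch that computes a biased (V-statistic) estimate of squared MMD with a Gaussian kernel. The bandwidth, sample sizes, and synthetic data are assumptions chosen for illustration, not settings from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
n = 500
baseline = rng.normal(size=(n, 2))

# Global shift: every point moves by 0.5 in each coordinate.
global_shift = rng.normal(loc=0.5, size=(n, 2))

# Local shift: same distribution, except 1% of points become a far outlier.
local_shift = rng.normal(size=(n, 2))
local_shift[:5] = [6.0, 6.0]

mmd_global = mmd_squared(baseline, global_shift)
mmd_local = mmd_squared(baseline, local_shift)
print(f"MMD^2, global shift: {mmd_global:.4f}")
print(f"MMD^2, local shift:  {mmd_local:.4f}")
```

In this toy setup the statistic for the local shift is much smaller than for the global one, and is close to what sampling noise alone produces, which is why a test calibrated on it can struggle to flag the outliers.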

Also, most existing testing frameworks force researchers to make error-prone manual choices, such as picking the specific statistic best suited for either global or local shifts or tuning complex settings like [kernel bandwidths](https://en.wikipedia.org/wiki/Kernel_density_estimation#Bandwidth_selection) and [regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)#Tikhonov_regularization_(ridge_regression)) parameters.

In addition to being hard to apply in practice, two-sample testing is a flawed way to verify unlearning in ML models. Consider the example below, which shows how two models trained from scratch on the exact same data can produce different distributions. The blue distribution is the distribution of a model retrained without compromised data. However, it differs from the standard (green) distribution because the retraining used different batch sizes. A two-sample test therefore returns a false positive, wrongly indicating that the tested model is unsafe.

Furthermore, [recent work](https://arxiv.org/abs/2510.16629) shows that an AI model can never perfectly “forget” data just by tweaking its current settings; unless it re-traces every step of its original training, it will always leave behind a permanent footprint of the information it was supposed to delete. Accordingly, achieving perfect “[retrain equivalence](https://proceedings.neurips.cc/paper/2021/hash/9627c45df543c816a3ddf2d8ea686a99-Abstract.html)” is fundamentally impossible for standard, local unlearning algorithms and a traditional two-sample test can always find a dependence on the “forget set”.

We resolve this challenge by proposing a relative distance test that measures whether an unlearned model is distributionally closer to a safely retrained model or to the original, compromised one.
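The relative comparison can be sketched as a difference of two distances. This toy version uses the one-dimensional Wasserstein-1 distance as a stand-in, not the regularized f-divergence estimators from the paper, and it omits the calibration step that turns the statistic into a test with a p-value. All names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two equal-size one-dimensional samples."""
    if len(x) != len(y):
        raise ValueError("this simple estimator needs equal sample sizes")
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def relative_distance(unlearned, retrained, original, distance=wasserstein_1d):
    """Positive when `unlearned` is closer to `retrained` than to `original`."""
    return distance(unlearned, original) - distance(unlearned, retrained)

rng = np.random.default_rng(0)
n = 2000
retrained = rng.normal(0.0, 1.0, n)        # safely retrained model
original = rng.normal(1.0, 1.0, n)         # original, compromised model
good_unlearning = rng.normal(0.1, 1.0, n)  # ends up near the retrained model
bad_unlearning = rng.normal(0.9, 1.0, n)   # stays near the original model

gap_good = relative_distance(good_unlearning, retrained, original)
gap_bad = relative_distance(bad_unlearning, retrained, original)
print(f"relative gap, good unlearning: {gap_good:+.3f}")
print(f"relative gap, bad unlearning:  {gap_bad:+.3f}")
```

Note that the good unlearning sample is not identical in distribution to the retrained one, so a plain two-sample test with enough data would still reject it. The relative statistic only asks which reference it is closer to.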

Our test acts as a highly adaptable statistical toolkit that leverages [f-divergences](https://en.wikipedia.org/wiki/F-divergence) to allow auditors to pinpoint highly specific types of data shifts, including:

Calculating these divergences on high-dimensional, real-world data is notoriously difficult. To make these complex optimization problems tractable without requiring massive amounts of compute, we use [kernel regularization methods](https://arxiv.org/pdf/2409.14980) to estimate the differences efficiently.
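For reference, the standard definition of an f-divergence between distributions $P$ and $Q$ (with $P$ absolutely continuous with respect to $Q$) is given below, for a convex function $f$ with $f(1) = 0$. The two special cases are textbook examples and are not meant as a list of the divergences used in the paper.

$$
D_f(P \,\|\, Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ,
\qquad
\begin{aligned}
f(t) &= t \log t && \Rightarrow\ \text{Kullback-Leibler divergence},\\
f(t) &= \tfrac{1}{2}\,|t - 1| && \Rightarrow\ \text{total variation distance}.
\end{aligned}
$$

Different choices of $f$ weight the density ratio $dP/dQ$ differently, which is what makes some divergences more sensitive to particular kinds of shift.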

Our adaptive testing approach automatically selects the best divergence and the optimal hyperparameter configurations to maximize the reliability of the test, entirely eliminating the need for sample splitting.

Because our proposed tests are general, we experimented across a wide variety of problems. We evaluated our framework on [perturbed uniforms](https://arxiv.org/abs/2110.15073) (synthetic two-sample benchmarks), as well as [the Expo1D outlier detection task](https://arxiv.org/abs/1806.02350) within physics datasets, a specialized area that uses ML to search for new physical phenomena outside the Standard Model of particle physics. We used high-energy physics data because that field requires the world’s most precise “difference detectors”. The idea is that if the framework can spot a rare particle that defies the laws of physics, it can spot a tiny privacy leak in an AI model.

We then shifted our primary focus to the critical, real-world applications of auditing differential privacy and evaluating machine unlearning:

Our framework matched or outperformed all previous baseline methods with significantly less manual tuning.

The experimental results demonstrated that no single test consistently outperforms the others across every possible scenario. Instead, different f-divergences act as specialized sensors that "light up" for different types of localized data shifts. By using an aggregated approach across diverse statistics, our framework successfully caught subtle errors and anomalies that standard tests completely missed.

For privacy auditing, the hockey-stick divergence test proved to be a powerful and effective tool. Because it directly aligns with the mathematical foundations of (ε, δ)-differential privacy, it allows auditors to tightly control the acceptable degree of data shift. Our adaptive testing framework caught privacy violations using significantly fewer data samples and far less hyperparameter tuning than previous baseline testers.
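The link to differential privacy can be made concrete on a toy mechanism. A mechanism is (ε, δ)-differentially private exactly when the hockey-stick divergence of order e^ε between its output distributions on any two neighboring inputs is at most δ. The sketch below evaluates that divergence in closed form for binary randomized response. It assumes the output distributions are known, unlike a real audit, which has to estimate the divergence from samples. The function names are illustrative.

```python
import numpy as np

def hockey_stick_divergence(p, q, gamma):
    """Hockey-stick divergence of order gamma between discrete distributions p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.maximum(p - gamma * q, 0.0)))

def randomized_response(epsilon):
    """Output distributions of binary randomized response for true bits 0 and 1."""
    keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return np.array([keep, 1.0 - keep]), np.array([1.0 - keep, keep])

# Mechanism designed for epsilon = 1.
p, q = randomized_response(1.0)

# At the claimed epsilon the divergence is zero (up to rounding): delta = 0 suffices.
delta_at_claimed = hockey_stick_divergence(p, q, np.exp(1.0))

# Auditing against a stricter epsilon = 0.5 exposes a positive gap.
delta_at_half = hockey_stick_divergence(p, q, np.exp(0.5))

print(f"delta needed at epsilon = 1.0: {delta_at_claimed:.6f}")
print(f"delta needed at epsilon = 0.5: {delta_at_half:.6f}")
```

A full check takes the maximum over both orderings of the pair. For this mechanism the two orderings give the same value by symmetry.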

In one notable instance, our framework detected violations in a specific [sparse vector technique mechanism](https://arxiv.org/abs/1603.01699) (SVT3) using only a few thousand samples, while previously studied [techniques](https://arxiv.org/abs/2307.05608) like [DP-Auditorium](https://arxiv.org/abs/2307.05608) required millions of samples to reach a comparable violation detection rate.

Our findings also suggest a redefinition of how to evaluate machine unlearning. As shown in the table below, we observed that none of the approximate unlearning methods we evaluated were compliant with the strict, standard two-sample unlearning definition. Because two-sample tests simply look for any distributional difference, they incorrectly flagged perfectly safe, retrained models as unlearning failures.

In contrast, our proposed relative three-sample test successfully overcame this flaw. It correctly and consistently identified the safely retrained models as "safe". When evaluating the approximate unlearning algorithms, only the random label technique passed the evaluation.

Other popular methods, such as finetuning, pruning, and [Selective Synaptic Dampening](https://github.com/if-loops/selective-synaptic-dampening), were found to be ineffective at truly forgetting the targeted data. We emphasize that our primary goal in these experiments was the evaluation of the unlearning methodologies, rather than designing the algorithms themselves. Consequently, we used simplified implementations of these unlearning procedures; more rigorous setups will be required to rank unlearning methods in practical production environments.

Our newly proposed framework provides a much more precise, adaptable, and mathematically sound lens for examining ML behavior. By leveraging [regularized f-Divergence kernel tests](https://arxiv.org/pdf/2601.19755), researchers and auditors can now statistically prove whether a model is behaving unsafely or leaking data across a massive class of problems and complex distributional shifts.

As this field evolves, theoretically grounding our empirical observations to characterize exactly which specific divergence is optimal for other novel tasks remains an exciting direction for future work. Establishing tighter sample complexity bounds will also be a key focus to make these audits even more efficient.

*The work described here was done jointly with Antonin Schrab and Arthur Gretton. We thank Nicole Mitchell and Eleni Triantafillou for insightful feedback, and Kimberly Schwede for the graphics and Mark Simborg for helpful edits.*