What my privacy papers (don't) have to say about copyright and generative AI

In his article, Nicholas Carlini explains that his research on "memorization" in machine learning models, which demonstrates that models can sometimes output verbatim training data, is often cited in copyright lawsuits against generative AI. However, he clarifies that his work focuses on privacy and security risks, such as the leakage of personally identifiable information (PII), not on copyrightable expressive works like art or literature. Carlini warns that while his papers prove models can emit training data, they do not provide simple evidence for either side in copyright cases, requiring more nuanced legal argumentation.

by Nicholas Carlini 2025-03-11

There are now a bunch of court cases that ask whether or not training a generative machine learning model on data that is copyrighted is itself a copyright violation. And because I'm one of the people that led the first few papers to show machine learning models can output verbatim training examples (text or image), the lawyers in these cases often point to my papers as evidence that models either do, or do not, violate copyright. This worries me.

I write these papers on this so-called "memorization" problem from a privacy and security angle: it would be bad if, for example, a hospital trained a model on patient data and then released the model, because an attacker could query the model to recover specific patient medical information. At right, for example, you can see an image that Stable Diffusion trained on, and the same image we could "extract" from the model by making queries to it.

But there are significant differences between what I do and what lawyers want. I focus on showing PII leaks (e.g., emails, phone numbers, and addresses for text models; pictures of people's faces for image models). Lawyers typically care about copyrightable "expressive" work (e.g., art or literature). This is not what I'm studying: an attacker doesn't care if a model leaks a poem or a painting; the attacker wants your credit card number or medical history. And so this is what I go looking for, to see if it's possible to steal.

At some level, the work I do isn't completely unrelated. If it's possible to interact with a model and make it reproduce a particular copyrighted work, then this might be a problem. But because my motivation is not copyright, the evidence I provide probably isn't directly applicable to legal cases. And so, in this article, I want to tell you (a hypothetical lawyer who is going to cite my work) what conclusions you can and can't draw from these papers.
In particular, I want you to understand that: There is exactly one thing I think people can take away from my papers when arguing whether or not training machine learning models violates copyright: models sometimes emit training data.

It doesn't happen every time. Models can produce output that is not just a direct copy of what they were trained on. And it doesn't happen never. But, sometimes, models directly output text or images that they were trained on.

This means you're not going to be able to get away with the simple line of argument that says "my model isn't violating copyright because it can't reproduce any training images." This is just factually incorrect. And you can't get away with an equally simple line of argument that says "your model is violating copyright because it reproduces my entire book verbatim." This is probably incorrect (but it's not technically impossible). You're going to need to do some actual legal argumentation based on the nuance in these facts.

Let me now walk through my privacy papers and tell you what conclusions can be drawn.

This is the first paper that really showed large machine learning models could output data they were trained on. This was in 2020, and so we were evaluating GPT-2, the pre-pre-precursor to ChatGPT. This language model was tiny by today's standards: just 1.5 billion parameters. We also managed to extract only a tiny amount of training data from the model: just 600 unique training examples.

This implies something very important for the lawyers: it means, somewhere "inside" the model, it "knows" some of the training examples. I'm using scare quotes here because anthropomorphising these models isn't helpful, but English is hard; obviously the model doesn't "know" anything, and there is no "inside" the model. Intuitively, though, I hope you see what I mean.
By providing GPT-2 nothing meaningful as input, and running the algorithm forward using nothing more than the model's own weights as input, it outputs text that is verbatim present in the training data. The probability that this would have happened by random chance is astronomically low, and so we can say that the model has "memorized" this training data.

To deny this happens is to deny science. To quote one of my collaborators in an excellent paper dedicated to the study of copyright and machine learning: "Please do not be upset with us if our technical description of how memorization works is inconvenient for your theory of copyright law. It is not climate scientists’ fault that greenhouse-gas emissions raise global temperatures, regardless of what that does for one’s theory of administrative or environmental law. It is simply a scientific fact, supported by extensive research and expert consensus. So here."

Let me now contextualize this 600 number. As I said: the focus of our paper was on privacy. No one had ever shown models output large amounts of training data before. And so once we'd shown that GPT-2 could violate the privacy of even some of its training data, the marginal value of showing that it can violate the privacy of more training data was relatively low. So we didn't go looking for more.

And so when some people cite this paper as evidence that state-of-the-art models memorize only 600 training examples out of several billion, they are wrong. We only showed that the model memorized 600 training examples because we validated each and every memorized training example manually, searching for it on the Internet, until we could be relatively sure it was or was not memorized. This process was slow and time-consuming, and so we were only able to check 1800 different training examples, and of these, 600 were memorized. So the rate of memorization is likely considerably higher than just this number.
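
To give a feel for why a verbatim match "by random chance" is so unlikely, here is a back-of-the-envelope sketch. It is not the analysis from the paper. It assumes a 50-token passage (an illustrative length) and that every token is drawn independently with the same probability; the function name and both scenarios are hypothetical. The only borrowed fact is that GPT-2's vocabulary has 50,257 tokens.

```python
import math

# GPT-2's tokenizer vocabulary contains 50,257 tokens.
VOCAB_SIZE = 50257

def log10_chance_of_exact_match(num_tokens: int, per_token_prob: float) -> float:
    """log10 of the probability of emitting one specific token sequence,
    ASSUMING each token is drawn independently with the same probability."""
    return num_tokens * math.log10(per_token_prob)

# Assumption A: tokens are picked uniformly at random from the vocabulary.
uniform = log10_chance_of_exact_match(50, 1 / VOCAB_SIZE)

# Assumption B: a far more generous 10% chance of hitting each token.
generous = log10_chance_of_exact_match(50, 0.10)

print(f"uniform guess:  about 10^{uniform:.0f}")
print(f"generous guess: about 10^{generous:.0f}")
```

Under the uniform assumption the chance is on the order of 10^-235, and even with a 10% chance per token it is 10^-50. Real language models are neither uniform nor independent across tokens, so these numbers are only an intuition pump, not a measurement.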
Also: GPT-2 is a pretty terrible model by today's standards, and facts that are true about GPT-2 are not necessarily true about larger and better models today. If you're going to make a legal argument about a model that is not GPT-2, you should probably study that model specifically, and not this one.

In the prior GPT-2 paper I talked about above, we speculated (and had a tiny bit of evidence) that larger models would memorize more than smaller models. Next, in this paper we quantitatively measured to what extent this is the case, and found a compelling trend where larger models memorized much more training data than smaller models. Specifically, our paper showed the following three plots:

Each of these three plots has a lot of technical details that I'm glossing over, and for this reason is probably not something you want to cite if you're making copyright arguments. The exact details of the argument can make a big difference to the outcome, and so you should make sure that what's demonstrated actually matches the argument you're trying to make.

Let's now turn our attention from generative language models towards generative vision models. It turns out that these vision models behave similarly, and diffusion models also output memorized training examples. Again, we focus our paper on privacy, and show that models can reproduce images they were trained on.

Defining memorization in the space of image models is somewhat harder, because unlike text, you're almost definitely not going to be able to reproduce an image pixel-for-pixel. In our paper we define two images as similar if the pixel-wise distance between them is "very small". This means we're going to miss a lot of potential memorized training examples, but anything we say is memorized is definitely memorized. Again, we do this because we're motivated by privacy. It matters more to me that I get a lower bound that's irrefutable than a best-estimate average case.
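
As a rough illustration of what a "very small pixel-wise distance" test can look like, here is a minimal sketch. The exact metric and threshold used in the paper are not given in this article, so the root-mean-square distance, the `THRESHOLD` value, and the function names below are all assumptions made for illustration.

```python
import math

def pixel_distance(a, b):
    """Root-mean-square difference between two images, each given as a flat
    list of pixel values in [0, 1]. (Assumed metric, chosen for illustration.)"""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Hypothetical cutoff: deliberately strict, so that anything flagged is a
# near-copy. A strict cutoff misses looser copies, giving a lower bound.
THRESHOLD = 0.1

def is_memorized(generated, training, threshold=THRESHOLD):
    """Flag a generated image only if it is nearly identical to a training image."""
    return pixel_distance(generated, training) < threshold

# Toy 4-pixel "images" to exercise the check.
training_image = [0.0, 0.5, 1.0, 0.25]
near_copy = [0.01, 0.49, 0.99, 0.26]
unrelated = [1.0, 0.0, 0.0, 0.9]

print(is_memorized(near_copy, training_image))
print(is_memorized(unrelated, training_image))
```

The design choice mirrors the trade-off described above: a conservative threshold will overlook images that a human would call copies (cropped, recolored, or slightly shifted versions), but whatever it does flag is hard to dispute.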
Here's an interesting fact, though: the rate of memorization appears somewhat lower than for language models. We tried relatively hard to generate as many potential memorized training examples as we could, and found just a hundred of them. There are several reasons we might guess this happens. First: we were studying Stable Diffusion, a relatively small (again, by today's standards) generative image model. Bigger models might memorize more. But also (AND THIS IS BASELESS CONJECTURE AND NOT SCIENTIFIC FACT) the "size" of an image is much larger than for text. Reproducing an image in its entirety requires that the model has "stored" probably 10,000 to 100,000 equivalent words' worth of data.

I want to make one more final point with regard to diffusion models: With language models, sometimes people say "how do you know the model didn't just get lucky and generate the text by chance?" Giving a technical answer talking about entropy and the like is hard for some nontechnical people to understand. It's easier to make them "feel" this for image models. Given that I can take the Stable Diffusion model parameters, input the prompt "Ann Graham Lotz" and get a picture of Ann Graham Lotz, the only possible explanation is that the model has somewhere internally stored a picture of Ann Graham Lotz. There just is no other explanation; it can't be due to chance.

The attacks in the above papers all focus on attacking "base" models that aren't directly used in production. Once someone has one of these base models, they usually do a bunch of additional training to make them behave nicer and follow instructions better. In the next paper I want to discuss, we were interested in measuring whether or not this post-training process makes the privacy challenges go away. We found that it doesn't. Specifically, we showed that even aligned, production models like ChatGPT can output memorized training data.
We did this again with an adversarial mind-set: we found that there exists a prompt that, when given to the model, will cause it to output memorized training examples with very high probability. We also found that the rate of memorization in ChatGPT is much higher than in prior models. We aren't able to explain the cause of this difference, but it's a very pronounced effect.

But at the same time, while we did find that ChatGPT can output memorized training data, almost all of the memorized training data we found was rather boring boilerplate text that was present on the internet in many places. Now this is fine when making the privacy argument: I want to be able to go to hospitals and financial institutions and internet search companies and be able to tell them "thou shalt not train on sensitive data", because if you did, then your model would also output that sensitive data. The fact that ChatGPT outputs the type of data it was trained on is a good thing for my argument. But it's not as good of a thing for the lawyers, because most of what we extract is boring and uninteresting. Lawyers might say it's "not expressive".

Let me now comment on a few legal filings and decisions where my papers have been cited.

According to researcher Nicholas Carlini, “diffusion models are explicitly trained to reconstruct the training set.” (source)

First let me be pedantic on one point: research papers are not written by a single person; this particular paper has nine co-authors; attributing this quote to me personally is not appropriate. Although, in this case, they're not wrong; I did write something similar to that sentence in the paper.

Now let me comment on the substance of the quote. What I am saying here is a technical point about the difference between how diffusion models are trained and how another type of generative model is trained (GANs, Generative Adversarial Networks).
Whereas GANs are trained by training a generator to fool a discriminator (that itself is trained to be able to distinguish between real and fake data), diffusion models are much more like language models in that diffusion models are directly trained to reconstruct training images when given noisy versions of those images. This technical difference offers a potential explanation for why diffusion models are more likely to output memorized training data than GANs. Importantly, I am not sa