Welcome! Thank you all for the wonderful discussions that we’ve already been having.
This is a unique time. We’re excited to bring technical scholars and legal scholars together to learn from each other.
Today, we’re going to focus on IP and on privacy. In between, we’ll have posters.
What are we talking about today? There are many ways to break down the problem. Dataset creation, model training, and model serving all involve choices.
For dataset creation: what type of data, how much data, web data, which web data, and what makes data “good?” Should we use copyrighted data? Private data? What even makes data private? (We delve into some of these in the explainers on the conference website.)
For model training, questions include whether we will use differential privacy? A retrieval model?
For model serving, that includes prompting, generation, fine-tuning, and human feedback. Fine-tuning does more training on a base or foundation model to make it more useful for a particular domain. Human feedback asks users whether a generation is good or bad, and that feedback can be used to further refine the model.
The recurring theme: there are lots of choices. All of these choices implicate concrete research questions. ML researchers have the pleasure of engaging with these questions. These choices also implicate values beyond just ML. The lawsuits against AI companies raise significant questions. The research we do is incorporated into products that serve the public, which has broader implications, such as copyright.
The big theme for today: bring together experts working on both sets of questions to share knowledge. How do concrete legal issues affect the choices ML researchers make and the questions ML researchers pursue?
##Some Nonobvious Observation About Copyright’s Scope For Generative AI Developers ##
Section 1202 makes it illegal to remove or alter CMI, knowing that it will facilitate copyright infringement. It was enacted in 1998 out of fear that hackers would strip out CMI from works, or that infringers would use false CMI to offer up copyrighted works as their own. Congress put in a $2500 minimum statutory damages provision, so these are billion-dollar lawsuits.
Getty is claming that its watermarks on it stock photographs are CMI. Stability is the defendant in two of these cases. One is brought by Getty; the other is a class-action complaint in which Sarah Anderson is the lead plaintiff. OpenAI is being sued twice by the same lawyer (the Saveri group) in the Silverman and Tremlay lawsuits. Meta is being sued by that group as well, and another legal group is suing Alphabet and DeepMind. There are also a substantial number of non-copyright lawsuits. The most significant is the suit by four John Does against OpenAI, GitHub, and Microsoft over GitHub Copilot (which doesn’t actually include a copyright claim).
In DC, Congress is holding hearings, and the Copyright Office is holding listening sessions, and then it will have a notice of inquiry this fall. People in the generative AI community should take this seriously.
Why are these lawsuits being brought? The lawyers in class actions may take as much as a third of any recovery. And many authors fiercely object to the use of their works as training data (even things that were posted on the Internet).
Copyright only protects original expression. The melody of a musical work, words in a poem, lines of source code. There is also a recognition that two works can be substantially similar (not identical) even though they don’t have exact literal identity. Often the ultimate question is from the perspective of a “lay observer/listener/viewer.”
Some works like visual art are highly expressive, while others like factual compilations have a lot of unprotectable (e.g. factual) elements. Courts will often filter out the unprotectable elements from these “thin” copyrights. Example: Saul Steinberg’s “New Yorker’s View of the World,” which was infringed by the movie poster for Moscow on the Hudson). There are lots of differences, but the overall appearance is similar. A more recent example, where there are differences of opinion, is Warhol’s _Orange Prince vs. Lynn Goldsmith’s photograph of Prince, on which it was based.
Another example: Ho v. Taflove. Chang started working for Ho, then switched to Taflove. Ho developed a new mathematical model of electron behavior. Chang and Taflove published a paper drawing from Ho’s work, causing Ho’s work to be rejected (since the model had already been published). But Ho lost his copyright infringement lawsuit (even though it was academic misconduct) because it was an idea, not expression. It didn’t matter how creative it was, it wasn’t part of the expression in the work. Under the merger doctrine of Baker v. Selden, copyright gives no monopoly on scientific ideas.
The Copyright Act gives exclusive rights including the derivative works right. It’s defined in an open-ended way. Can a generative AI system infringe? The Silverman complaint argues that ChatGPT can produce a detailed summary of her book, which is a derivative work. And training data can contain may copies of a work, so the AI model can essentially memorize them. (Slide with lots of examples of Snoopy)
Is the person who puts in the prompt “Snoopy doghouse at Christmas” the infringer? Is the AI system a direct infringer? I don’t think so, because there is no human volition to bring about that image, making the user the infringer. But there is indirect infringement, so perhaps Midjourney could be the indirect infringer! There are four different categories. The most relevant is “vicarious” infringement: does the entity have the “right and ability to control” users’ conduct, and do they have a financial benefit from the user’s infringement?
In general, the outputs will not be substantially similar to expressive elements in particular input works. Insofar as there isn’t infringement of a specific work, there isn’t infringement. The Anderson complaint all but admits this. The GitHub complain reports that 1% of outputs included code matching training data. Is that enough to make Copilot a direct or indirect infringer? The Getty complaint includes a specific Stable Diffusion output, which has enough differences that it’s probably not substantially similar.
These lawsuits are in the very early stages. Note that challenges to a lot of other new technologies have failed, while others have succeeded.
##Is Training AI Copyright Infringement? ##
I come to you as a time traveller from a time before PowerPoint. Training data is, as Pam said, the big kahuna. Even if these systems can be prompted to create similar outputs, it doesn’t mean the end of the enterprise. But if it is illegal to train on any work created after 1928 (the cutoff for copyright), training is in big trouble. This is true of all kinds of AI, not just generative AI. To train a self-driving car, you need to train on copyrighted photos of stop signs.
Some incumbents happen to have big databases of training data. Google has its web index, Facebook has billions of uploaded photos. The most common approach is to look at the web – Common Crawl has gathered it together (subject to robots.txt). The LAION image database is built on Common Crawl. All of this is still copyrighted, even if it’s out there on the web. And you can’t do anything with a computer that doesn’t involve making new copies. The potential for copyright liability is staggering if all of those copies of all of those works are infringing.
I will focus on fair use. Fair use allows some uses of works that copyright presumptively forbid, when the use serves a valuable social purpose, or doesn’t interfere with the copyright owner’s market. There are some cases holding that making temporary copies or internal copies for noninfringing uses is okay. The closest analog is the Google Book Search case. Google scanned millions of books and made a search engine. It wouldn’t give you the whole book text, but it would give you a small four-line snippet around the search result. Authors and publishers sued, but the courts said that Google’s internal copies were fair use because the purpose was not to substitute for the books. It didn’t interfere with a market for book snippets; there is no such market.
Other analogies: there are video-game cases. A company made its game to run on a platform, or vice-versa, and in both cases you have to reverse-engineer a copyrighted work (the game or the console). The product is noninfringing (a new game) but the intermediate step involved copying. This was fair use.
The thing about law is that it is backward-looking. When you find a new question, you look for the closest old question that it’s an analogy to. Is this closer to cases finding infringement, or to cases finding no infringement? Most of these cases in prior eras were clear that internal uses are fair use.
Now, past performance is no guarantee of future results. We’re in the midst of a big anti-tech backlash. “AI will destroy the world, and it will also put me out of a job.” This might affect judges; there is a potential risk here.
Fair use has always also worried about displacing markets. So if there were a market for selling licenses to train on their data, and I took it for free instead of paying, that would be less likely to be a fair use. There hasn’t been such a market, but you could imagine it developing. Suppose that Getty and Shutterstock developed a licensing program. That could be harder.
Mark is a lawyer for Stability AI in one of these cases. But it’s notable to him that Stable Diffusion is trained on two billion images. What’s the market price for one of those images? Is it a dollar? If so, there goes the entire market cap of Stability AI. Is it a hundredth of a cent? That doesn’t sound like a market that’s likely to happen?
Finally, fair use is a U.S. legal doctrine. Even if you’re developing your technology in the U.S., you’re probably using it somewhere else. Other countries don’t have fair use as such. Fair dealing is narrower; some countries have specific exceptions for text and data mining (e.g. Israel, Japan, U.K.). And there are lots of countries that just haven’t thought about it. So while the lawsuits have been filed here, and that’s where the money is, the bigger legal risks could be in other jurisdictions, like Europe or India or China.
I think the law will and should conclude that AI training is fair use. We get better AI: less bias, better speech recognition, safer cars. It’s unlikely that Getty’s image set alone is broad and representative. But this is by no means a sure thing.
Just a couple of other things. The output question is a harder question for copyright law. It’s a much more occasional question, because the vast majority of outputs are not infringing. But it’s still interesting and important, and system design can have a major imapct on it. If you find your system produces an output that is very similar to a specific input. Why?
One of the things that computer scientists can do is to help us clearly articulate the ways that the technology work. In Anderson, the plaintiffs say that Stable Diffusion is a “collage” technology. That’s a bad metaphor, and it misleads. We need the technical community to outline good ways of understanding the math behind it.
I’m not a lawyer; my background is in STS. I lead an interdisiplinary team that focuses on the impacts of OpenAI’s work.
My goal is to clarify things that may not be obvious about “how things work” in AI development and deployment. Insights about where things happen rather than the technical details.
AI development vs. deployment = building the thing vs. shipping the thing. Development is pre-training and fine-tuning; deployment is exposing those artifacts to users and applying them to tasks.
Lots of focus is on the model, but other levels of abstraction also matter, such as the components and systems built on top of those models. There are typically many components to a system or a platform. ChatGPT+ has plugins, a moderation API, usage policies, Whisper for speech recognition, the GPT4 model itself (fine-tuned), and custom instructions.
These distinctions matter. The legal issues in AI are not all about the base model. You can imagine a harmful-to-helpful spectrum. You might improve the safety or legality of a specific use case, but things are much more complex for a model that has a wide range of uses. It has a distribution on the spectrum, so policies shift and modify the distribution.
Development and deployment are non-linear. Lots of learning and feedback loops.
GPT-4 was a meme from before it existed. Then there were small-scale experiments, followed by pre-training until August 2022, testing and red-teaming, fine-tuning and RLHF, early commercial testing, later-stage red-teaming, system card and public announcement, scaling up of the platform, plugins, and continued iteration on the model and the system. This is an evolving service in many versions. An in parallel with all of this, OpenAI iterated on its platform and usage policies, and launched ChatGPT4. The system card included an extensive risk assessment.
You can imagine a wide variety of use cases, and layers raise legal issues that are different from base models. The moderation endpoint, for example, is a protective layer on top of the base model. Fine-tuning and the system includes attempts to insert disclaimers and refuse certain requests, and there are mechanisms for user feedback.
Lesson 1: It’s not all about the base model. It’s important but not everything. Outside of specific context like open-source releases, it’s not the only thing you have to think about. Example: regulated advice (e.g. finance, legal, medical). This is much more a function of fine-tuning and moderation and filters and disclaimers than it is of the base model. Any reasonably competent base model will know something about some of these topics, but that’s different from allowing it to proclaim itself as knowledgable. Another example: using tools to browse or draw from a database.
Lesson 2: David Collingridge quote. Important insight is that comprehensive upfront analysis is not possible. Many components and tools don’t exist when you would do that; they were adapted in response to experience with how people use it. The “solution” is reversability and iteration. Even with an API, it is not trivial to reverse changes (let alone with an open-source model). But still, some systems can be iterated on more easily than others. Example of a hard-to-anticipate use case: simplifying surgical consent forms. Example: fine-tuning is massively cheaper than pre-training. And use-case policies can be changed much more easily than retraining. System-level interventions for responding to bad inputs/outputs can be deployed even if retraining the model is infeasible.
Lesson 3: Law sometimes provides little guidance for decision-making. There is wide latitude as to which use cases a company can allow. That’s not unreasonable. There’s limited clarity as to what is an appropriate degree of caution (not just in relation to IP and privacy). Examples: human oversight is important sometimes, but norms of what is an appropriate human-in-the-loop can be loose. Many companies go beyond what is obviously legally required. Similarly, companies need to disclose limitations and dangers of the system. But sometimes, when the system behaves humbly, it can paradoxically make users think it’s more sophisticated. Disclosure to the general public – a challenge is that companies radically disagree about what the risks are and where the technology is going.
To pull this all together, the law fits in at lots of places and lots of times in the AI development process. Even when it does fit in, the implications aren’t always clear. So what to do?
Cooper: Can you talk a bit about the authorship piece of generative AI?
Samuelson: Wrote about this 35 years ago when AI was hot. Then it died down again. She thought most of the articles then were dumb, because people said “the computer should own it.” She went through the arguments people made, and said, “The person who is generating the content can figure out whether it is something that has commercial viability.” And that’s not very far from where the Copyright Office is today. If you have computer-generated material in your work, you need to identify it and disclaim the computer-generated part.
Lemley: I agree, but this isn’t sustainable. It’s not sustainable economically, because there will be a strong demand for someone to own these works with exact value. They’ll say “if value then right.” We should resist this pressure, but it’s out there. But also, in the short term, people will provide creativity by prompt creation. Short prompts won’t get copyright, but if the prompt becomes detailed enough, at some point they will say my contributions reach the threshold of creativity. That threshold in copyright law is very low. People will try to navigate that line by identifying their contributions.
But that will create problems for copyright law. AI inverts the usual idea-expression dichotomy because the creative expressive details are easy now, even though my creativity seems to come out in the “idea” (the prompt). This will also create problems for infringement, because we test substantial similarity at the levels of the output works. Maybe my picture of penguins at a beach looks like your picture of penguins at a beach because we both asked the same software for “penguins at the beach.” Similarity of works will no longer be evidence of copying.
Villa: Push back on this. I’m the open-source guy. We’ve seen in the past twenty years a lot of pushback against “if value then right.” Open source peole want to share, and that has created a constituency for making it easier to give things away. We’re seeing this with the ubiquity of photography. Is there protectable expression if we all take photographs of the same thing. Wikipedia was able to get the freedom of panorama into European law. If you took a picture of the Atomium, you were infringing the copyright in the building. There are now constituencies that reject “if value then right.”
Balkin: Go back to the purposes of copyright. In the old days, we imagined a world where there would be lots of public-domain material. We’re in that world now. There are lots of things that will potentially be free of copyright. What’s the right incentive structure for the people who still want to make things?
Lemley: Clearly, no AI would create if it didn’t get copyright it its outputs. That’s what motivates it. One response is that yeah, there will be more stuff that’s uncopyrighted. People will be happy to create without copyright. People make photos on their phones not because of copyright but because they want the pictures. People make music and release it for free. These models can coexist with proprietary models. But one reason we see the current moral panic around generative AI is that there is a generation of people who are facing automated competition. They are not the first people to worry about competing with free. My sense is, it works out. Yes, there is competition from free stuff, but they also reduce artists’ costs of creation, so artists will use generative AI just as they use Photoshop. It’s cheaper to get music and to make music. And people who create don’t do it because it’s the best incentive, but because they’re driven to. (The market price of a law-review article is $0.) But that’s not a complete answer, because they still have to make a living. A third answer: there’s a desire for the story, the authenticity, the human connection. So artisinal stuff coexists with mass-produced stuff in every area where we see human creativity. There will be disruption, but human creativity won’t disappear.
Lee: The way that we conceive of generative AI in products today isn’t the way that they will always work. How will creative people work with AIs? It will depend on those AI products’ UIs.
Samuelson: One of the things that happened in the 1980s when there was the last “computer-generated yada yada” was that the U.K. passed a sui generis copyright-like system for computer works. That’s an option that will get some attention. There are people in Congress who are interested in these issues. So if there’s a perception that there needs to be something, that’s an option. Also, we could consider what copyright lawyers call “formalities”: notice meant you had to opt in to copyright. Anything else is in the public domain. We used to think that was a good thing. Creators can also take comfort that many computer-generated works will be within a genre. So the people who do “real art” will benefit because they can create things that are better and more interesting.
Lemley: There’s a good CS paper: if you train language models on content generated by language models, they eat themselves.
Villa: We may have been in a transitory period where lots of creativity was publicly shared collaboratively in a commons. Think of Stack Overflow, Reddit, etc. ChatGPT reduced StackOverflow posts. Reddit is self-immolating for other reasons.
Lemley: They’re doing it because they want to capture the value of AI training data.
Villa: We’ve taken public collaborative creativity for granted. But if we start asking AIs to produce these things, it could damage or dry up the commons we’ve been using for our training. Individual rational choices could create collective action challenges.
Lee: The world of computer-generated training data is very big. There may be some valid uses, but some have to be very careful.
Villa: Software law has often been copyright and speech law. We’ve had to add in privacy law. But the combinatorial issues in machine learning are fiendishly complex. The legal profession is not ready for it. But being an ML lawyer will be extremely challenging. The issues connect, and being “only” a copyright lawyer is missing important context. See Amanda Levendowski’s paper on fair use and ML bias.
Lee: It’s hard even to define what regurgitation or memorization is. “Please like and subscribe” is probably fine, but other phrases might be different and problematic in context.
Lemley: There’s a fundmaental tension here. If we’re worried about and false speech, a solution would be to repeat training data exactly. That’s not what copyright law prefers. We may have a choice between hallucination and memorization. [JG: he put it much better.]
Samuelson: Lots of lawyers operate in silos. But when we teach Internet law, we have to get much broader. [JG: did someone say Internet Law?] There is a nice strand of people in our space who know something about the First Amendment, about jurisdiction, etc. etc.
Lemley: Academia is probably better about this than practice.
Balkin: There is a distinction between authenticity (a conception of human connection) and originality (this has never been seen before). A real difference between what we do to form connection, and where people do stuff that’s entertaining to them. Just because what comes out of these systems right now is not every interesting, maybe it will be more so in a few years. But that won’t solve the problem of human connection, which is one of the purposes of art. Just because people want to create doesn’t tell you what the economics of creation will be.
Lemley: One point of art is to promote human connection. AI doesn’t do that. Another point is to provoke and challenge; AI can sometimes do that better than humans because it comes from a perspective that no one has. An exciting thing about generative AI is that it can do inhuman things. It has certainly done that in scientific domains. My view here is colored by 130 years of history. Every time this comes up, creators say this new technology will destroy art.
Balkin: Let’s go further back to patronage, where dukes and counts subsidize creativity. We moved to democracy plus markets as how we pay for art.
Lemley: Every time these objections have been made, they have been wrong. Sousa was upset about the phonograph because people wouldn’t buy his sheet music. But technology has always made it easier to create; I think AI will do the same. I have no artistic talent but I can make a painting now. The business model of being one of the small people who can make a good painting is now in trouble. But we have broadened the universe of people who can create.
Lee: We should talk about the people who are using and the people are providing the sources. Take Stack Overflow, where people post to help.
Villa: Look at Wikipedia and open-source software. Wikipedia is in some sense anti-human-connection in its tone. But there is a huge amount of connection in its creation rather than in consumption. Maybe there will be alternative means of connection. I’m not as economically optimistic as Mark is. But we have definitely now seen new patronage models based on the idea that everyone will be famous to fifteen people. Part of being an artist is the performance of authenticity. That works for some artists and not for others. It’s the Internet doing a lot of this, not ML.
Lee: Some people will create traditionally, a larger group will use generative AI. Will we need copyright protections for people using generative AI to create art?
Lemley: This is the prompt-engineering question again. I think we will go in that direction, but I don’t know that we need it. The Internet, 3D printing, generative AI – these shrink the universe of people who need copyright. We’re better off with a world where the people who need it participate in that model and the larger universe of people don’t need to.
Samuelson: People like Kris Kashtanova want to be recognized as authors. The Copyright Office issued a certificate, then cancelled it after learning about Kashtanova’s use of Midjourney and amended it to narrow it to reject prompt engineering as basis for copyright. We as people can recognize someone as an author even if the copyright system doesn’t. (Kashtanova has another application in.)
I like the metaphor of “Copilot.” It’s a nice metaphor for the optimistic version of what AI could be. I’ll use it to generate ideas and refine them, and use it as inspiration for my creative vision. That’s different from thinking about it as something that’s destroying my field. Some writers and visual artists are sincere that they think these technologies will kill their field.
Villa: The U.S. copyright system assumes that it’s a system of incentives. The European tradition has a moral-rights focus that even if I have given away everything, I have some inalienable rights and would be harmed if my work is exploited in certain ways. We’re seeing that idea in software now. 25 years ago, it was a very technolibertarian space. But AI developers now worry that the world can be harmed by their code. They want to give it away at no cost, but with moral strings attached. Every lawyer here has looked at these ethical licenses and cringed, because American copyright is not designed as a tool to make these kinds of ethical judgments. There is a fundamental mismatch between copyright and these goals.
Paul Ohm: I’m on team law. Do you want to defend the proposition that copyright is the first think we talk about, or one of the major things we talk about? There are lots of vulnerable people that copyright does nothing for.
Villa: Software developers have been told for 25 years that copyright licenses are how you create ethical change in the world. Richard Stallman told me that this is how you create that change. The open-source licensing community is now saying to developers “No, but not like that!”
Lee: Is this about copyright in models?
Balkin: In the early days of the Internet, the hottest issues were copyright. Five years ago it was all speech. Now it’s copyright again. Why? It’s because copyright is a kind of property, and property is central to governance and control. I say “You can’t do this because it’s my property.” Property is often a very bad tool, but in a pinch it’s the first thing you turn to and the only thing you have.
Consider social media. You get a person – Elon Musk – who uses not copyright but control over a system to get people to do what he wants. But that’s a mature development of an industry. But we’re not at that point in the development of AI. It’s very early on.
Samuelson: It’s because copyright has the most generous remedy toolkit of any law in the universe, including $150,000 statutory damages per infringed work. Contract law has lots of limitations on its remedies. Breach of an open-source license is practically worthless without a copyright claim attached, because you probably can’t get an injunction. It’s the first thing out there because we’ve been overly generous with the remedy package. It applies automatically, lasts life plus 70, and has very broad rights. There are dozens of lawsuits that claim copyright because they want those tools, even though the harm is not a copyright harm. You use copyright to protect privacy, or deal with open-source license breach, or suppress criticism. Copyright has blossomed (or you could use a cancer metaphor) into something much bigger. Courts are good at saying, you may be hurt by this, but this is not a copyright claim. But you can see from these lawsuits why copyright is so tempting.
Villa: I’m shocked there are only ten lawsuits and not hundreds.
Corynne McSherry: What about rights of publicity? That was the focus in the congressional hearing two weeks ago.
Samuelson: The impersonation (Drake + The Weeknd) is a big issue because you can use their voices to say things they didn’t say and sing things they didn’t sing. I think impersonation might be narrower than the right of publicity. (The full RoP is a kind of quasi-property right in a person’s name, likeness, and other attributes of their personality. You’re appropriating that, and sometimes implying that they sponsor or agree with you.
Lemley: We should prohibit some deepfakes and impersonation, but current right of publicity law is a disaster. My first instinct is that a federal law might not be bad if it gets rid of state law. My second instinct is that it’s Congress, so I’m always nervous. If it becomes a right to stop people from criticizing me or publicizing accurate things about me, that’s much worse.
Villa: Copyright is standardized globally by treaty. A right of publicity law doesn’t even work across all fifty states. That will be a major challenge for practitioners. All of these additional legal layers are not standardized across the world. Implementation will be a real challenge for creative communities that cross borders.
Samuelson: It will all be standardized through terms of service, so they will essentially become the law. And that isn’t a good thing either.
I’m going to talk about free speech and generative AI.
AI systems require huge amounts of data, which must be collected and analyzed. We want to start by asking the extent to which the state can regulate the collection and analysis of the data. Here, I’m going to invoke Privacy’s Great Change of Being: collection, collation, use, distribution, sale, or destruction of data. The First Amendment is most concerned with the back end: sale, distribution, and destruction. On the front end, it has much less to say. You can record a police officer in a public place, but otherwise it doesn’t say much. Privacy law can limit the collection of data.
The first caveat is that lots of data is not personal data. You can’t use privacy law as a general-purpose regulatory tool, just like you can’t use IP. The company might say, “We think what we’re doing is an editorial function like at a newspaper.” That would be a First Amendment right to train and tune however they want. That argument hasn’t been made yet, but it will come.
A very similar argument is being made for social media: is there a First Amendment right to program the algorithms however you want? I think the First Amendment argument is weaker here in AI than it is in social media. But the law here is uncertain.
If you are going to claim that you the company have First Amendment interests, then you are claiming that you are the speaker or publisher or the AI’s outputs. And that’s important for what we’re going to turn to: the relation between the outputs of AI models and the First Amendment.
First, does the AI have rights on its own? No. It’s not a person, it’s not sentient, it doesn’t have the characteristics of personhood that we use to accord First Amendment rights. But there is a wrinkle: the First Amendment protects the speech of artificial persons, like churches and corporations. So maybe we should treat AI systems as artificial persons.
I think that this isn’t going to work. The reason the law gives artificial entities rights is that it’s a social device for people’s purposes. That’s not the case for AI systems. OpenAI has First Amendment rights; that makes sense because people organize. But ChatGPT is not a device for achieving the ends of the people in the company.
Second, whose rights are at stake here? Which people have rights and what rights are they, and who will be responsible for the harms AI causes? Technology is a means of mediating relationships of social power and changes how people can exercise power over each other. Who has the right to use this power, and who can get remedies for harms?
The company will claim that the AI speech is their speech, or the user who prompts the AI will claim that the speech is theirs. In this case, ordinary First Amendment law applies. A human speaker is using this technology as a tool.
The next problem is where someone says, “This is not my speech” but the law says it is. (E.g., someone who repeats a defamatory statement.) But in defamation law, you don’t just have to show that it’s harmful, but you also have to show willfulness. So, for example, in Counterman v. Colorado, you have to show that the person making a threat knew that they were making one.
You can see the problem here. What is my intention here when I provide an AI that hallucinates threats? When you have the intermediation of an AI system that engages in torts. Here’s another problem: The AI system incites a riot. Here again we have a mens rea problem. The company doesn’t have the intent, even though the effects are the same. We’ll need new doctrine, because otherwise the company is insulated from liability.
The second case is an interesting one. First Amendment law is interested in the rights of listeners as well as speakers. Suppose I want to read the Communist Manifesto. Marx and Engels don’t have First Amendment rights: they’re overseas non-Americans who are also dead. So it must be that listeners have a right to access information.
Once again, we’ll have to come up with a different way of thinking about it. Here’s a possibility. There’s an entire body of law organized for listener rights: it’s commercial speech. It’s speech whose purpose is to tell you whether to buy a product. The justification is not that speakers have rights here, it’s that listeners have a right to access true useful information. So we could treat AI outputs under rules designed to deal with listener rights.
There will be some problems with this. Sometimes we wouldn’t want to say that the fact that speech is false is a reason to ban it, e.g. in matters of opinion. So commercial speech is not an answer to the problem, but it could be an inspiration.
Another possibility. There’s speech, but it’s part of a product. Take Google Maps. You push a button. You don’t have a conversation, it gives you directions. It’s simply a product that provides information upon request. The law has treated this as an ordinary case of products liability. But if it’s anything beyond this – if it’s an encyclopedia or a book – the law will treat it as a free-speech case. That third example will be very minor.
In privacy, collection, collation, and use regulations are consistent with the First Amendment. But that’s not a complete solution because most data is not personal data. In the First Amendment, the central problem is the mens rea problem, and that’s a problem whether or not someene claims the speech as their own. In both cases, we’ll need new doctrines.
Colin Doyle, The Restatement (Artificial) of Torts
LLMs seem poised to take over legal research and writing tasks. This article proposes that we can task them with creating Restatements of law, and use that as testing grounds for their performances.
The Restatements are treatises designed to synthesize U.S. law and provide clear concise summaries of what the laws are. Laws differ from state to state, so that the Restatements are both descriptive and normative projects that aim to clarify the law.
How can we do this? The process I’m using is similar to how a human would craft a Restatement. We give the LLM a large number of cases on a topic. Its first step is to write a casebrief for each case in the knowledge base. Its second task is loop through the casebriefs and copy the relevant rules from each brief. Its third iteration is to distill the rules into shorter ones, group them together, and have the language model mark the trends in the law. THen use those notes to write a blackletter law provision, lifted from the American Law Institute’s model for how to write Restatement provisions. Then it writes comments, illustrations, and a Reporter’s Note listing the cases it was derived from. I also asked it to apply the ALI style manual.
It does produce something credible. We get accurate blackletter provisions. We get sensible commentary. We get Reporter’s Notes that cite to the right cases. The comments can be generic, and it breaks down when there are too many cases (keeping the list of cases spills out of the context window).
I’m excited about the possibility of comparing the artificial restatement with the human restatement. Where there’s a consensus, the two are identical. But where the rules get more complicated, we have a divergence. Also, there’s not just one artificial Restatement – we can prime a language model to produce rules with different values and goals.
Shayne Longpre, The Data Provenance Project:
I hope to convince you that we’re experiencing a crisis in data transparency. Models train on billions of tokens from tens of thousands of sources and thousands of distilled datasets. That leads us to lose a basic understanding of the underlying data, and to lose the provenance of that data.
Lots of data from large companies is just undocumented. We cannot properly audit the risks without understanding the underlying data. Reuse, biases, toxicities, all kinds of unintended behavior, and poor-quality models. The best repository we have is HuggingFace. We took a random sample of datasets and found that the licenses for 50% were misattributed.
We’re doing a massive dataset audit to trace the provenance from text sources to dataset creators to licenses. We’re releasing tools to let developers filter by the properties we’ve identified so they can drill down on the data that they’re using. We’re looking not just at the original sources but also at the datasets and at the data collections.
Practitioners seem to fall into two categories. Either they train on everything, or they are extremely cautious and are concerned about every license they look at.
Rui-Jie Yew, Break It Till You Make It:
When is training on copyrighted data legal? One common claim is that when the final use is noninfringing, then the intermediate training process is fair use.This presumes a close relationship between training and application. This is a contrast to a case where there’s an expressive input and an infringing expressive output, e.g., make a model of one song and then make an output that sounds like it.
The real world is much messier. Developers will make one pre-trained model that can be used for many applications, some of which are expressive and some of which are non-expressive. So if there’s a pretrained model that is then fine-tuned to classify images and also to generate music, the model has both expressive and non-expressive uses. This complicates liability attribution and allocation.
This also touches on the point that different architectures introduce different legal pressures. Pre-training is an important part of the AI supply chain.
Jonas A. Geiping, Diffusion Art or Digital Forgery?:
Diffusion models copy. If you generate lots of random images, and then compare them to the training dataset, you find lots of close images. Sometimes there are close matches for both content and style, sometimes they match more on style and less on the content. We find that about 1.88% of generated images are replications.
What causes replication? One cause is data duplication, where the same image is in the dataset with lots of variations (e.g. a couch with different posters in the background). These duplications can be partial, because humans have varied them.
Another cause is text conditioning, where the model associates specific images with specific captions. If you see the same caption, you’re much more likely to generate the same image. If you break the link – you train on examples where you resample when the captions are the same – this phenomenon goes away. The model no longer treats the caption as a key.
Mitigations: introduce variation in captions during training, or add a bit of noise to captions at test time.
Rui-Jie Yew (for Stephen Casper), Measuring the Success of Diffusion Models at Imitating Human Artists:
Diffusion models are getting better at imitating specific artists. How do we evaluate how effectively diffusion models imitate specific artists?
Our goal is not to automate determinations of infringement. Two visual images can be similar in non-obvious ways that are relevant to copyright law.
We start by identifying 70 artists and reclassify them based on Stable Diffusion’s imitation of their works. Most of them were successfully identified. We also used cosine different to measure similarity between artists’ images and Stable Diffusion imitations of their work versus other artists. Again, there are statistically significant similarities with the imitations.
So yes, Stable Diffusion is successful at imitating human artists!
Machine learning is a really simple thing that tries to train a model to do something useful. Consider trying to become a lawyer. One way to train them is to go to law school, learn from professors, take exams, etc. Another way is to memorize every test you can and memorize all of the answers. At the end of both, you can probably pass the bar. But one way is actually useful; the other is only useful for passing the bar.
We train machine learning models by doing the second thing. We show them all the tests in the world, and ask them to memorize all the answers at the same time. Any amount of generalization is by luck alone. When we train models on text, we train them by asking them to predict the last blank in a sequence. When we train models on images, we add noise, and then ask them to reconstruct the original image from the noise. Image generation is a side effect of denoising. It makes sense that they memorize the images, because there’s no way to go back to the originals unless you’ve memorized them.
We like to think of generalization as a human kind of generalization. But suppose I’m learning how to play baseball. To an ML researcher, generalization is not getting good at baseball. Rather, it that means that if my parents take me to the same field and throw the ball the same way, I can still hit the ball. Change the field or the way of throwing the ball, the model will fail.
Some policy-space observers argue that the copying from any given input is minimal. So we did experiments to show that models do in fact memorize substantial portions of their inputs. We found examples of people’s personal information, which the model leaks to anyone who wants. Yes, your information is still online anyway, but the model is still doxing you. Several thousand lines of code is not de minimis.
Circa 2020, we knew that text models did this. It turns out that image models also do this. Stable Diffusions memorizes specific input images. It turns out that the different between those output images and the originals is smaller than the difference from compressing it into a JPEG. We can do this for a lot of images.
Takeaways. There are three worlds we might have been in:
We are in the second world, the blurry world where models sometimes emit training data, and sometimes don’t.
Let’s take for granted that ML models are likely to infringe. In the U.S., that requires access plus substantial similarity. So one response is to fix access: remove any copyrighted image from the training set inputs. But it might not be obvious what is copyrighted. Another response is to fix substantial similarity: filter the outputs to remove substantially similar images. Again, though, we don’t have a clear definition of substantial similarity.
So let’s relax a requirement. Access-freeness may be hard to guarantee, and may be too strict. So we’ll go for near access-freeness (as defined by Vyas, Kakade, and Barak) We’ll try to use a model that is close to one that didn’t have access to the copyrighted work.
Is this kosher? Would it hold up in a court of law? I have no idea; I’m not a lawyer. But if it does, it will help with some of the previous challenges. No hard removal problems for training data.
So how do you do this? Let’s turn to differential privacy for inspiration. We feed a dataset into an algorithm to produce a model. Imagine, however, that we had a dataset that was different in one entry. An adversary is trying to tell which of the two datasets we trained on. If they can’t do much better than random guessing, then the training procedure was differentially private. The procedure is differentially private when the model distributions don’t change much when adding or removing one point. DP is widely used in government and industry.
Concretely, DP protects against membership inference (is a single data point in the dataset?), and against revealing any information that wouldn’t be present if a particular data point weren’t trained on.
We can train a differentially private model on training data. Now consider any arbitray copyrighted point. The resulting model is close to one trained on the same dataset minus that point. So differential privacy is very close to near access freeness.
Note that copyright and privacy violations are not identical. Consider also lawsuits claiming copying of artists’ styles. Differential privacy might not help here, because style is not a property of a single data point.
Lee: What is privacy?
Balkin: Dan Solove wrote a famous taxonomy of privacy. Take your pick. Sometimes it’s about control of information. Sometimes it’s about manipulation, or disclosure. It’s a starfish concepts.
Brundage: I’ve read various definitions but I don’t want to regurgitate one I’ve memorized.
Carlini: You know it when you see it. I try to do most of my work with things that are obvious violations that everyone agrees on. For most attack research, you don’t need a working definition.
Kamath: Computer scientists often like to work with specialized definitions. I work on DP which is a very specific notation.
Vaccaro: I spent a lot of time trying to explain privacy guarantees to people and see what they understand. I’m also interested in networked conceptions of privacy. When you share a picture of yourself, you’re also sharing information about your friends, family, and associates.
Ganguli: What are the most important privacy concerns you worry about in your work? There’s a show on Netflix called Deep Fake Love in which people are shown deep fakes of their partners cheating. As a community, what types of privacy should we be thinking about?
Balkin: People most worry about the loss of control and the obliteration of circles of trust. Another possibility is about false light: the deep fakes problem. You can construct what looks like the person in almost any situation. The method of collection can be one of the most serious violations. We could think about privacy at each stage.
Carlini: I’m mostly worried about data leakage about individual people. If we can’t solve that, what are we even doing?
Balkin: Vulnerability to attack might be an independent privacy problem. It’s a new kind of privacy problem.
Lee: Is generative AI different than search or other kinds of data collection? Take the personal-info leak. Is this worse than Google Search?
Carlini: The reason we do what we do on models trained online is so that we can do it without violating privacy. We could do this on models trained on hospital records, but I don’t want to violate medical privacy. The takeaway is that if this model was trained on your private emails, it would have leaked them, so we should be very careful about training on more private data.
Brundage: It’s important to compare deployed systems to deployed systems; think about ChatGPT refusing certain requests.
Vaccaro: A question I still had at the end of the morning is about the power dynamics. Obviously I spend a lot of time working on social media. People post things there with certain expectations about how they’re going to be used. With generative AI, people have a sense that there is something larger coming out of these uses.
Kamath: I think we don’t have a very good idea of what we agree to and where our data winds up. People click to agree and think things will be fine. Maybe with generative AI, things won’t be fine.
Balkin: Suppose I’ve trained a model And then I put in a picture of this guy here, and ask the model to predict his medical situation and intimate details, which we would ordinarily consider private? (Asking about true predictions, not hallucinations.)
Kamath: You could tell name, demographics, and some broad predictions based on that (e.g. of Indian descent so of higher diabetes risk, but young so maybe not).
Balkin: So a possible future problem is that you can infer sensitive information, not just memorize it.
Ganguli: I used to ask who is Deep Ganguli, and the model would say “I don’t know who that is and I don’t know anything about anyone who works at Anthropic.” Now it says I work at the University of Calcutta, because Ganguli is a Brahmin last name and it knows I do some kind of research.
Balkin: A model of privacy violation is that information is given to a trusted person (e.g. a doctor) and then it leaks out to the wrong person who uses it for an inappropriate purpose. But suppose I could generate that information without any individual leaking it, that solely through different pieces of information it could infer things about me. That seems to be the new thing that would be quite different from search. Once you can do it for one person, you can do it for lots of people.
Lee: This reminds me of recommendation algorithms. Any one product might not lead to an inference, but a collection of them might.
Balkin: We have the same thing with political behavior.
Kamath: Suppose you see a person smoking one cigarette, then another. You say to them, “I think you might develop cancer.” Is that a privacy violation? Now consider a machine learning model that draws complex interfaces. Does that become a privacy violation?
Ganguli: How should we think about malicious access versus incidental access? Is this a useful distinction for generative AI?
Carlini: Suppose I had a model that would reveal all kinds of personal information about me if you asked “Who is Nicholas Carlini?” That would be much worse than a model that would require a very specialized prompt to generate these outputs. We had this discussion in the copyright context as well – a model that infringes when prompted very specifically for infringement, versus one that does it on its own. RLHF does fairly well at preventing the latter, but has a much harder time preventing the first type.
Brundage: Alternatively, at what point is it sufficiently difficult that it’s not an attractive alternative than just Googling it?
Vaccaro: It’s also buying personal data on the market, not just search.
Lee: Does malicious versus incidental changes how you think about the legal analysis?
Balkin: You’re interested in both. If you know there are bad actors, you might have some responsibility to fine-tune to avoid that. Suppose I want to make some money off of vulnerable individuals. I know there are people who are financial idiots who will buy worthless coins. Can I use LLMs to identify these people? Do we have the capacity to do that?
Vaccaro: We don’t need generative models for this. You can do this with regular ML. And in fact you can buy lists of people with vulnerable attributes.
Balkin: Do these new technologies make it easier or more plausible to prey on people? Are they useful for even more efficient preying?
Kamath: How much do these datasets cost?
Carlini: Datasets with credit-card numbers are about $5 to $20 per person. But that’s illegal.
Vaccaro: The legal lists are very cheap. Are there ways for generative models to extract more information? Frankly, social media is very good at this.
Balkin: In the world of social media, the dirty little secret is that the companies promise that their algorithms are very good, and they’re bilking their advertisers. But could these generative AI technologies actually make good on these promises?
Kamath: Do we consider targeted advertising a privacy violation?
Balkin: It has to do with the vulnerability of the target.
Lee: It kind of sounds like we’re talking about traditional ML here. Generative AI has different media: text, audio, video. So you could spoof people by pretending to be someone they know using their voice.
Kamath: I’m not sure copying a voice is a privacy violation.
Carlini: I don’t see this as a privacy attack; that’s a security attack.
Vaccaro: At the same time, this does expose the fact that there should be even more hackers working on these models. You could be very creative at misusing these models.
Carlini: We can go to automated spearphishing.
Balkin: Our conversation is about why we care about privacy. If you think the harm is the creation of vulnerability, then you can see the privacy connection to exploitation.
Question: With generative AI, we have huge pre-trained models. Do you think it’s better to partition the model and only allow access through a service, or is it better to have a public model like Meta has done.
Carlini: I want public models. I think holding things secret only delays the problems. Only don’t release it or actually release it and let people do the security analysis. Trying to hold it back is not going to work. Maybe the OpenAI person feels differently.
Brundage: I think it’s a complex issue. I worry a lot about the implications of big jumps in capabilities. I worry about “just release it” especially when there is a Cambrian Explosion of pairing models up with tools.
Kamath: A classic issue in image classification is adversarial examples. We tried to solve them. And then there was recently a talk showing that you can do the same things against LLMs. Seems like the stakes have been raised. It’s helpful if we can see these issues in the simplest models to predict future issues.
Paul Ohm: Invitation to the Privacy Law Scholars Conference, which happens every June, where we have these conversations. Could you train a generative AI that would never say anything about any individual person ever? You’d be very confident that certain privacy harms wouldn’t be achievable. Would that be tractable?
Fatemehsadat Mireshghallah: You could use theory of mind to keep track of this. (There was a theory of mind workshop here yesterday.) You have to figure out who is a person and reason about what the user is asking about them.
Lee: This is another thing you can try to align a model to do.
Question: I come from Los Alamos, where the concept of putting a technology out there is starting to feel a bit poignant. We have fictional examples of people like Sherlock Holmes who can infer personal details from small observations. From a philosophical/legal standpoint, whats’ the difference between having a few people like that walking around and having a high-volume tool?
Carlini: There’s a difference between standing on the street and looking in their window and looking in through a telescope from far away. But technically they’re the same thing. This is mostly a legal quesiton.
Balkin: The difference has to do with power relations. Sherlock Holmes works with Scotland Yard. If he were to use his powers for blackmail, that would be different. Another difference is that there’s only one of him. If he existed, what restrictions would be put on him? In the law, we’re very worried about concentrations of extreme power. A lot of privacy law is about this.
Kamath: We do have large teams of Holmeses running around. The descendants of the Pinkertons attack unions by collecting intrusive information.
Balkin: That’s not necessarily legal; the problem is often that they’re not called to account for the illegal things they do. Privacy is about the use of information power to take unfair advantage of others and treat them unjustly.
Vaccaro: In San Diego we’re fighting about license plate readers.
Balkin: Unfortunately, this probably isn’t a Fourth Amendment violation.
Vaccaro: Maybe meeting the law is way too low a bar.
Lee: This is also a problem for making a product. User perceptions of privacy matter a lot, too.
Question: I feel like privacy and AI is a problem for rich people. Take GDPR. The big companies benefitted from this regulation, because they had the tools and competencies and scale to implement them. Same thing with AI Act in EU.
Brundage: Privacy is for everyone.
Vaccaro: My concern is that you can pay for privacy. If I want email without Google reading it, I can pay more for it and set up my own server.
Kamath: It’s a tradeoff. We can ask with DP about how much society benefits, and over time there may be more expertise and tools.
Andres Guadamuz: I have to strongly disagree. GDPR and the AI Act aren’t just being pushed by large companies. They’re fighting the enforcement of these privacy laws. But GDPR benefits citizens; I’ve used it personally. The AI Act is going to be a disaster, but that’s not being pushed by the big companies.
Question: You (Balkin) said AI companies will make free-speech arguments. Why will their arguments be weaker than social media companies’ arguments? Will AI alignment efforts improve their First Amendment arguments?
Balkin: The immediate analogy they will make is that fine-tuning to prevent hate speech is a kind of editorial function, like newspapers, and that it’s no different than content moderation at social-media companies. But no one doubts that social media hosts people’s speech. Does that mean that the AI company is conceding that it hosts the speech of the AI? That would put them in a different litigation posture than claiming it’s just a product. That’s the puzzle. (There are people who think that what social media companies do is not editorial.) It would follow from the editorial-function argument that an AI company would have a First Amendment right not to fine-tune.
Question: From an EU perspective, data protection has a lot of overlap with privacy but is considered a distinct right. Under the GDPR, a lot of targeted advertising and cookie targeting is unlawful. It might be that they could be done lawfully, but currently the IAB’s consent framework legally falls short. So when you take these profiling technologies, and you layer on top of that generative AI, what happens when the model not only predicts the most likely token, but also predicts what the person wants to hear?
Kamath: So is model personalization a privacy violation?
Carlini: I don’t see how this being an LLM makes it different.
Lee: The harms are potentially worse because it’s interactive.
Carlini: Of course, if you see my personalized search results, that’s the same problem. I don’t see how the LLM makes it fundamentally a different object.
Lee: We’ll end there, on the question of whether generative AI changes the privacy lawsuits.