The Laboratorium (3d ser.)

A blog by James Grimmelmann

Soyez réglé dans votre vie et ordinaire afin
d'être violent et original dans vos oeuvres.

Posts about law

Scholars' Amicus Brief in the NetChoice Cases

Yesterday, along with twenty colleagues — in particular Gautam Hans, who served as counsel of record — I filed an amicus brief in the Supreme Court’s cases on Florida and Texas’s anti-content-moderation social-media laws, Moody v. NetChoice and NetChoice v. Paxton. The cases involve First Amendment challenges to laws that would prohibit platforms from wide swaths of content moderation. Florida’s prohibits removing or downranking any content posted by journalistic enterprises or by or about candidates for public office; Texas’s prohibits any viewpoint-based moderation of any content at all.

Our brief argues that these laws are unconstitutional restrictions on the rights of social-media users to find and receive the speech that they want to listen to. By prohibiting most content moderation, they force platforms to show users floods of content those users find repugnant, or are simply not interested in. This, we claim, is a form of compelled listening in violation of the First Amendment.

Here is the summary of our argument:

This case raises complex questions about social-media platforms’ First Amendment rights. But Florida Senate Bill 7072 (SB 7072) and Texas House Bill 20 (HB 20) also severely restrict platform users’ First Amendment rights to select the speech they listen to. Any question here is straightforward: such intrusions on listeners’ rights are flagrantly unconstitutional.

SB 7072 and HB 20 are the most radical experiments in compelled listening in United States history. These laws would force millions of Internet users to read billions of posts they have no interest in or affirmatively wish to avoid. This is compulsory, indiscriminate listening on a mass scale, and it is flagrantly unconstitutional.

Users rely on platforms’ content moderation to cope with the overwhelming volume of speech on the Internet. When platforms prevent unwanted posts from showing up in users’ feeds, they are not engaged in censorship. Quite the contrary. They are protecting users from a neverending torrent of harassment, spam, fraud, pornography, and other abuse — as well as material that is perfectly innocuous but simply not of interest to particular users. Indeed, if platforms did not engage in these forms of moderation against unwanted speech, the Internet would be completely unusable, because users would be unable to locate and listen to the speech they do want to receive.

Although these laws purport to impose neutrality among speakers, their true effect is to systematically favor speakers over listeners. SB 7072 and HB 20 prevent platforms from routing speech to users who want it and away from users who do not. They convert speakers’ undisputed First Amendment right to speak without government interference into something much stronger and far more dangerous: an absolute right for speakers to have their speech successfully thrust upon users, despite those users’ best efforts to avoid it.

In the entire history of the First Amendment, listeners have always had the freedom to seek out the speech of their choice. The content-moderation restrictions of SB 7072 and HB 20 take away that freedom. On that basis alone, they can and should be held unconstitutional.

This brief brings together nearly two decades of my thinking about Internet platforms, and while I’m sorry that it has been necessary to get involved in this litigation, I’m heartened at the breadth and depth of scholars who have joined together to make sure that users are heard. On a day when it felt like everyone was criticizing universities over their positions on free speech, it was good to be able to use my position at a university to take a public stand on behalf of free speech against one of its biggest threats: censorious state governments.



Katherine Lee and A. Feder Cooper

Welcome! Thank you all for the wonderful discussions that we’ve already been having.

This is a unique time. We’re excited to bring technical scholars and legal scholars together to learn from each other.

Our goals:

  1. Have a common language. When we leave, we’ll all agree on what a “model” is. Today we can tease apart differences in definitions.
  2. Have some shared research directions.

Today, we’re going to focus on IP and on privacy. In between, we’ll have posters.

What are we talking about today? There are many ways to break down the problem. Dataset creation, model training, and model serving all involve choices.

For dataset creation: what type of data, how much data, web data, which web data, and what makes data “good?” Should we use copyrighted data? Private data? What even makes data private? (We delve into some of these in the explainers on the conference website.)

For model training, the questions include: will we use differential privacy? A retrieval model?

For model serving, that includes prompting, generation, fine-tuning, and human feedback. Fine-tuning does more training on a base or foundation model to make it more useful for a particular domain. Human feedback asks users whether a generation is good or bad, and that feedback can be used to further refine the model.

The recurring theme: there are lots of choices. All of these choices implicate concrete research questions. ML researchers have the pleasure of engaging with these questions. These choices also implicate values beyond just ML. The lawsuits against AI companies raise significant questions. The research we do is incorporated into products that serve the public, which has broader implications, such as copyright.

The big theme for today: bring together experts working on both sets of questions to share knowledge. How do concrete legal issues affect the choices ML researchers make and the questions ML researchers pursue?

Some Nonobvious Observations About Copyright’s Scope For Generative AI Developers

Pamela Samuelson

Big questions:

  1. When does making copies of works as training data infringe copyright? (Mark Lemley will address.)
  2. When are outputs of a generative AI infringing derivative works? (This talk will address.)
  3. When is the alteration or removal of copyright management information (CMI) illegal? (This talk will address briefly to get it out of the way.)

Section 1202 makes it illegal to remove or alter CMI, knowing that it will facilitate copyright infringement. It was enacted in 1998 out of fear that hackers would strip out CMI from works, or that infringers would use false CMI to offer up copyrighted works as their own. Congress put in a $2500 minimum statutory damages provision, so these are billion-dollar lawsuits.

Getty is claiming that its watermarks on its stock photographs are CMI. Stability is the defendant in two of these cases. One is brought by Getty; the other is a class-action complaint in which Sarah Andersen is the lead plaintiff. OpenAI is being sued twice by the same lawyer (the Saveri group) in the Silverman and Tremblay lawsuits. Meta is being sued by that group as well, and another legal group is suing Alphabet and DeepMind. There are also a substantial number of non-copyright lawsuits. The most significant is the suit by four John Does against OpenAI, GitHub, and Microsoft over GitHub Copilot (which doesn’t actually include a copyright claim).

In DC, Congress is holding hearings, and the Copyright Office is holding listening sessions, and then it will have a notice of inquiry this fall. People in the generative AI community should take this seriously.

Why are these lawsuits being brought? The lawyers in class actions may take as much as a third of any recovery. And many authors fiercely object to the use of their works as training data (even things that were posted on the Internet).

Copyright protects only original expression: the melody of a musical work, the words in a poem, lines of source code. Courts also recognize that two works can be substantially similar even though they are not literally identical. Often the ultimate question is judged from the perspective of a “lay observer/listener/viewer.”

Some works like visual art are highly expressive, while others like factual compilations have a lot of unprotectable (e.g. factual) elements. Courts will often filter out the unprotectable elements from these “thin” copyrights. Example: Saul Steinberg’s “New Yorker’s View of the World,” which was infringed by the movie poster for Moscow on the Hudson. There are lots of differences, but the overall appearance is similar. A more recent example, where there are differences of opinion, is Warhol’s Orange Prince vs. Lynn Goldsmith’s photograph of Prince, on which it was based.

Another example: Ho v. Taflove. Chang started working for Ho, then switched to Taflove. Ho developed a new mathematical model of electron behavior. Chang and Taflove published a paper drawing from Ho’s work, causing Ho’s work to be rejected (since the model had already been published). But Ho lost his copyright infringement lawsuit (even though it was academic misconduct) because it was an idea, not expression. It didn’t matter how creative it was, it wasn’t part of the expression in the work. Under the merger doctrine of Baker v. Selden, copyright gives no monopoly on scientific ideas.

The Copyright Act gives exclusive rights including the derivative works right. It’s defined in an open-ended way. Can a generative AI system infringe? The Silverman complaint argues that ChatGPT can produce a detailed summary of her book, which is a derivative work. And training data can contain many copies of a work, so the AI model can essentially memorize them. (Slide with lots of examples of Snoopy.)

Is the person who puts in the prompt “Snoopy doghouse at Christmas” the infringer? Is the AI system a direct infringer? I don’t think so, because there is no human volition to bring about that image, making the user the infringer. But there is indirect infringement, so perhaps Midjourney could be the indirect infringer! There are four different categories. The most relevant is “vicarious” infringement: does the entity have the “right and ability to control” users’ conduct, and do they have a financial benefit from the user’s infringement?

In general, the outputs will not be substantially similar to expressive elements in particular input works. Insofar as there isn’t infringement of a specific work, there isn’t infringement. The Andersen complaint all but admits this. The GitHub complaint reports that 1% of outputs included code matching training data. Is that enough to make Copilot a direct or indirect infringer? The Getty complaint includes a specific Stable Diffusion output, which has enough differences that it’s probably not substantially similar.

These lawsuits are in the very early stages. Note that challenges to a lot of other new technologies have failed, while others have succeeded.

Is Training AI Copyright Infringement?

Mark Lemley

I come to you as a time traveller from a time before PowerPoint. Training data is, as Pam said, the big kahuna. Even if these systems can be prompted to create similar outputs, it doesn’t mean the end of the enterprise. But if it is illegal to train on any work created after 1928 (the cutoff for copyright), training is in big trouble. This is true of all kinds of AI, not just generative AI. To train a self-driving car, you need to train on copyrighted photos of stop signs.

Some incumbents happen to have big databases of training data. Google has its web index, Facebook has billions of uploaded photos. The most common approach is to look at the web – Common Crawl has gathered it together (subject to robots.txt). The LAION image database is built on Common Crawl. All of this is still copyrighted, even if it’s out there on the web. And you can’t do anything with a computer that doesn’t involve making new copies. The potential for copyright liability is staggering if all of those copies of all of those works are infringing.

I will focus on fair use. Fair use allows some uses of works that copyright presumptively forbids, when the use serves a valuable social purpose, or doesn’t interfere with the copyright owner’s market. There are some cases holding that making temporary copies or internal copies for noninfringing uses is okay. The closest analog is the Google Book Search case. Google scanned millions of books and made a search engine. It wouldn’t give you the whole book text, but it would give you a small four-line snippet around the search result. Authors and publishers sued, but the courts said that Google’s internal copies were fair use because the purpose was not to substitute for the books. It didn’t interfere with a market for book snippets; there is no such market.

Other analogies: there are video-game cases. A company made its game to run on a platform, or vice-versa, and in both cases you have to reverse-engineer a copyrighted work (the game or the console). The product is noninfringing (a new game) but the intermediate step involved copying. This was fair use.

The thing about law is that it is backward-looking. When you find a new question, you look for the closest old question that it’s an analogy to. Is this closer to cases finding infringement, or to cases finding no infringement? Most of these cases in prior eras were clear that internal uses are fair use.

Now, past performance is no guarantee of future results. We’re in the midst of a big anti-tech backlash. “AI will destroy the world, and it will also put me out of a job.” This might affect judges; there is a potential risk here.

Fair use has always also worried about displacing markets. So if there were a market for selling licenses to train on their data, and I took it for free instead of paying, that would be less likely to be a fair use. There hasn’t been such a market, but you could imagine it developing. Suppose that Getty and Shutterstock developed a licensing program. That could be harder.

Mark is a lawyer for Stability AI in one of these cases. But it’s notable to him that Stable Diffusion is trained on two billion images. What’s the market price for one of those images? Is it a dollar? If so, there goes the entire market cap of Stability AI. Is it a hundredth of a cent? That doesn’t sound like a market that’s likely to happen.
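To make the arithmetic concrete, here is a back-of-the-envelope sketch. The two-billion image count and the two candidate prices come from the remarks above; the function name and the flat per-image price are assumptions made purely for illustration, not a description of any real licensing scheme.

```python
from fractions import Fraction

# Figure from the talk: Stable Diffusion was trained on about two billion images.
NUM_IMAGES = 2_000_000_000


def total_license_cost(price_per_image_usd: Fraction, num_images: int = NUM_IMAGES) -> Fraction:
    """Total cost in USD of licensing every image at one flat price (hypothetical)."""
    return price_per_image_usd * num_images


# Scenario 1: one dollar per image.
dollar_each = total_license_cost(Fraction(1))

# Scenario 2: a hundredth of a cent per image, i.e. $0.0001.
hundredth_cent_each = total_license_cost(Fraction(1, 10_000))

print(f"$1.00 per image:   ${float(dollar_each):,.0f} total")
print(f"$0.0001 per image: ${float(hundredth_cent_each):,.0f} total")
```

At a dollar per image the bill is two billion dollars. At a hundredth of a cent the whole training set costs $200,000, but each image earns its owner only $0.0001.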

Finally, fair use is a U.S. legal doctrine. Even if you’re developing your technology in the U.S., you’re probably using it somewhere else. Other countries don’t have fair use as such. Fair dealing is narrower; some countries have specific exceptions for text and data mining (e.g. Israel, Japan, U.K.). And there are lots of countries that just haven’t thought about it. So while the lawsuits have been filed here, and that’s where the money is, the bigger legal risks could be in other jurisdictions, like Europe or India or China.

I think the law will and should conclude that AI training is fair use. We get better AI: less bias, better speech recognition, safer cars. It’s unlikely that Getty’s image set alone is broad and representative. But this is by no means a sure thing.

Just a couple of other things. The output question is a harder question for copyright law. It’s a much more occasional question, because the vast majority of outputs are not infringing. But it’s still interesting and important, and system design can have a major impact on it. Suppose you find your system produces an output that is very similar to a specific input. Why?

  1. Overtraining on one specific input: it was copied into the training dataset lots of times. Solvable with better deduplication.
  2. Users deliberately triggering infringement. If you ask ChatGPT to write you a story about children at a wizarding school, it will write a new story. But if you give it the first paragraph of Harry Potter, it will spit out the first several chapters because it knows that the user wants something specific. (Interesting question about who is responsible for this.)
  3. Why is Snoopy so easy to generate images of? Because the system has identified “Snoopy” as a category. Like Baby Yoda, he shows up in lots of images. Usually you can’t own categories, but in a few cases – cartoon characters, superheroes, etc. – you can. Maybe this is worth adding filters to limit generations.
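The first cause in the list above, the same work appearing many times in the training set, is the one most directly addressed by preprocessing. What follows is a minimal sketch of exact-match deduplication over text examples. The function names and the toy corpus are made up for illustration, and nothing here is claimed to be how any particular model's training pipeline works. Near-duplicates (lightly edited text, resized images) need fuzzier techniques that this sketch does not attempt.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(examples: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized example (exact-match only)."""
    seen: set[str] = set()
    unique: list[str] = []
    for example in examples:
        digest = hashlib.sha256(normalize(example).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(example)
    return unique


# Toy corpus: the first, second, and fourth entries are copies of one another.
corpus = [
    "The quick brown fox.",
    "the quick   brown fox.",
    "A different sentence entirely.",
    "The quick brown fox.",
]
deduped = deduplicate(corpus)
print(len(corpus), "->", len(deduped))
```

Hashing the normalized text keeps memory use proportional to the number of distinct examples rather than their total length.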

One of the things that computer scientists can do is to help us clearly articulate the ways that the technology works. In Andersen, the plaintiffs say that Stable Diffusion is a “collage” technology. That’s a bad metaphor, and it misleads. We need the technical community to outline good ways of understanding the math behind it.

Where and when does the law fit into AI development and deployment?

Miles Brundage

I’m not a lawyer; my background is in STS. I lead an interdisciplinary team that focuses on the impacts of OpenAI’s work.

My goal is to clarify things that may not be obvious about “how things work” in AI development and deployment. Insights about where things happen rather than the technical details.

AI development vs. deployment = building the thing vs. shipping the thing. Development is pre-training and fine-tuning; deployment is exposing those artifacts to users and applying them to tasks.

Lots of focus is on the model, but other levels of abstraction also matter, such as the components and systems built on top of those models. There are typically many components to a system or a platform. ChatGPT+ has plugins, a moderation API, usage policies, Whisper for speech recognition, the GPT4 model itself (fine-tuned), and custom instructions.

These distinctions matter. The legal issues in AI are not all about the base model. You can imagine a harmful-to-helpful spectrum. You might improve the safety or legality of a specific use case, but things are much more complex for a model that has a wide range of uses. It has a distribution on the spectrum, so policies shift and modify the distribution.

Development and deployment are non-linear. Lots of learning and feedback loops.

GPT-4 was a meme from before it existed. Then there were small-scale experiments, followed by pre-training until August 2022, testing and red-teaming, fine-tuning and RLHF, early commercial testing, later-stage red-teaming, system card and public announcement, scaling up of the platform, plugins, and continued iteration on the model and the system. This is an evolving service in many versions. And in parallel with all of this, OpenAI iterated on its platform and usage policies, and launched ChatGPT4. The system card included an extensive risk assessment.

You can imagine a wide variety of use cases, and layers raise legal issues that are different from base models. The moderation endpoint, for example, is a protective layer on top of the base model. Fine-tuning and the system include attempts to insert disclaimers and refuse certain requests, and there are mechanisms for user feedback.
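As a rough illustration of what "layers on top of a base model" can mean, here is a toy sketch. Everything in it is hypothetical: the stand-in `base_model`, the keyword-based `moderation_check`, the blocked terms, and the disclaimer text are invented for this example and do not reflect OpenAI's moderation endpoint or any real provider's policies.

```python
from dataclasses import dataclass

# Hypothetical policy data, invented for this sketch.
BLOCKED_TERMS = {"write spam"}
REGULATED_TOPICS = {"diagnosis": "This is not medical advice."}
REFUSAL = "Sorry, I can't help with that request."


def base_model(prompt: str) -> str:
    """Stand-in for a base model: it just wraps the prompt."""
    return f"[model output for: {prompt}]"


def moderation_check(text: str) -> bool:
    """Toy moderation layer: flag text that contains a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


@dataclass
class Response:
    text: str
    refused: bool


def respond(prompt: str) -> Response:
    """Pass a prompt through the layers that surround the base model."""
    if moderation_check(prompt):  # screen the input
        return Response(REFUSAL, refused=True)
    output = base_model(prompt)
    if moderation_check(output):  # screen the output as well
        return Response(REFUSAL, refused=True)
    lowered = prompt.lower()
    for topic, disclaimer in REGULATED_TOPICS.items():
        if topic in lowered:  # append a disclaimer on regulated topics
            output = f"{output}\n\n{disclaimer}"
    return Response(output, refused=False)


print(respond("Explain my diagnosis").text)
```

The point of the sketch is structural: the refusal and the disclaimer both live outside `base_model`, so they can be changed without retraining anything.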

Lesson 1: It’s not all about the base model. It’s important but not everything. Outside of specific contexts like open-source releases, it’s not the only thing you have to think about. Example: regulated advice (e.g. finance, legal, medical). This is much more a function of fine-tuning and moderation and filters and disclaimers than it is of the base model. Any reasonably competent base model will know something about some of these topics, but that’s different from allowing it to proclaim itself as knowledgeable. Another example: using tools to browse or draw from a database.

Lesson 2: David Collingridge quote. Important insight is that comprehensive upfront analysis is not possible. Many components and tools don’t exist when you would do that; they were adapted in response to experience with how people use it. The “solution” is reversibility and iteration. Even with an API, it is not trivial to reverse changes (let alone with an open-source model). But still, some systems can be iterated on more easily than others. Example of a hard-to-anticipate use case: simplifying surgical consent forms. Example: fine-tuning is massively cheaper than pre-training. And use-case policies can be changed much more easily than retraining. System-level interventions for responding to bad inputs/outputs can be deployed even if retraining the model is infeasible.

Lesson 3: Law sometimes provides little guidance for decision-making. There is wide latitude as to which use cases a company can allow. That’s not unreasonable. There’s limited clarity as to what is an appropriate degree of caution (not just in relation to IP and privacy). Examples: human oversight is important sometimes, but norms of what is an appropriate human-in-the-loop can be loose. Many companies go beyond what is obviously legally required. Similarly, companies need to disclose limitations and dangers of the system. But sometimes, when the system behaves humbly, it can paradoxically make users think it’s more sophisticated. Disclosure to the general public – a challenge is that companies radically disagree about what the risks are and where the technology is going.

To pull this all together, the law fits in at lots of places and lots of times in the AI development process. Even when it does fit in, the implications aren’t always clear. So what to do?

  • Broaden the scope of the research on the legal implications. More layers of the system (use cases, content filters, fine-tuning, development process).
  • Make regulation adaptive.
  • Private actors should solicit public input so they’re not making these decisions on their own.

Panel: IP

Moderators: A. Feder Cooper and Jack Balkin

Panelists: Pamela Samuelson, Mark Lemley, Luis Villa, and Katherine Lee

Cooper: Can you talk a bit about the authorship piece of generative AI?

Samuelson: Wrote about this 35 years ago when AI was hot. Then it died down again. She thought most of the articles then were dumb, because people said “the computer should own it.” She went through the arguments people made, and said, “The person who is generating the content can figure out whether it is something that has commercial viability.” And that’s not very far from where the Copyright Office is today. If you have computer-generated material in your work, you need to identify it and disclaim the computer-generated part.

Lemley: I agree, but this isn’t sustainable. It’s not sustainable economically, because there will be a strong demand for someone to own these works with exact value. They’ll say “if value then right.” We should resist this pressure, but it’s out there. But also, in the short term, people will provide creativity by prompt creation. Short prompts won’t get copyright, but if the prompt becomes detailed enough, at some point they will say my contributions reach the threshold of creativity. That threshold in copyright law is very low. People will try to navigate that line by identifying their contributions.

But that will create problems for copyright law. AI inverts the usual idea-expression dichotomy because the creative expressive details are easy now, even though my creativity seems to come out in the “idea” (the prompt). This will also create problems for infringement, because we test substantial similarity at the levels of the output works. Maybe my picture of penguins at a beach looks like your picture of penguins at a beach because we both asked the same software for “penguins at the beach.” Similarity of works will no longer be evidence of copying.

Villa: Push back on this. I’m the open-source guy. We’ve seen in the past twenty years a lot of pushback against “if value then right.” Open source people want to share, and that has created a constituency for making it easier to give things away. We’re seeing this with the ubiquity of photography. Is there protectable expression if we all take photographs of the same thing? Wikipedia was able to get the freedom of panorama into European law. If you took a picture of the Atomium, you were infringing the copyright in the building. There are now constituencies that reject “if value then right.”

Balkin: Go back to the purposes of copyright. In the old days, we imagined a world where there would be lots of public-domain material. We’re in that world now. There are lots of things that will potentially be free of copyright. What’s the right incentive structure for the people who still want to make things?

Lemley: Clearly, no AI would create if it didn’t get copyright in its outputs. That’s what motivates it. One response is that yeah, there will be more stuff that’s uncopyrighted. People will be happy to create without copyright. People make photos on their phones not because of copyright but because they want the pictures. People make music and release it for free. These models can coexist with proprietary models. But one reason we see the current moral panic around generative AI is that there is a generation of people who are facing automated competition. They are not the first people to worry about competing with free. My sense is, it works out. Yes, there is competition from free stuff, but they also reduce artists’ costs of creation, so artists will use generative AI just as they use Photoshop. It’s cheaper to get music and to make music. And people who create don’t do it because it’s the best incentive, but because they’re driven to. (The market price of a law-review article is $0.) But that’s not a complete answer, because they still have to make a living. A third answer: there’s a desire for the story, the authenticity, the human connection. So artisanal stuff coexists with mass-produced stuff in every area where we see human creativity. There will be disruption, but human creativity won’t disappear.

Lee: The way that we conceive of generative AI in products today isn’t the way that they will always work. How will creative people work with AIs? It will depend on those AI products’ UIs.

Samuelson: One of the things that happened in the 1980s when there was the last “computer-generated yada yada” was that the U.K. passed a sui generis copyright-like system for computer works. That’s an option that will get some attention. There are people in Congress who are interested in these issues. So if there’s a perception that there needs to be something, that’s an option. Also, we could consider what copyright lawyers call “formalities”: notice meant you had to opt in to copyright. Anything else is in the public domain. We used to think that was a good thing. Creators can also take comfort that many computer-generated works will be within a genre. So the people who do “real art” will benefit because they can create things that are better and more interesting.

Lemley: There’s a good CS paper: if you train language models on content generated by language models, they eat themselves.

Villa: We may have been in a transitory period where lots of creativity was publicly shared collaboratively in a commons. Think of Stack Overflow, Reddit, etc. ChatGPT reduced Stack Overflow posts. Reddit is self-immolating for other reasons.

Lemley: They’re doing it because they want to capture the value of AI training data.

Villa: We’ve taken public collaborative creativity for granted. But if we start asking AIs to produce these things, it could damage or dry up the commons we’ve been using for our training. Individual rational choices could create collective action challenges.

Lee: The world of computer-generated training data is very big. There may be some valid uses, but some have to be very careful.

Villa: Software law has often been copyright and speech law. We’ve had to add in privacy law. But the combinatorial issues in machine learning are fiendishly complex. The legal profession is not ready for it. But being an ML lawyer will be extremely challenging. The issues connect, and being “only” a copyright lawyer is missing important context. See Amanda Levendowski’s paper on fair use and ML bias.

Lee: It’s hard even to define what regurgitation or memorization is. “Please like and subscribe” is probably fine, but other phrases might be different and problematic in context.

Lemley: There’s a fundamental tension here. If we’re worried about hallucination and false speech, a solution would be to repeat training data exactly. That’s not what copyright law prefers. We may have a choice between hallucination and memorization. [JG: he put it much better.]

Samuelson: Lots of lawyers operate in silos. But when we teach Internet law, we have to get much broader. [JG: did someone say Internet Law?] There is a nice strand of people in our space who know something about the First Amendment, about jurisdiction, etc. etc.

Lemley: Academia is probably better about this than practice.

Balkin: There is a distinction between authenticity (a conception of human connection) and originality (this has never been seen before). A real difference between what we do to form connection, and where people do stuff that’s entertaining to them. Just because what comes out of these systems right now is not very interesting, maybe it will be more so in a few years. But that won’t solve the problem of human connection, which is one of the purposes of art. Just because people want to create doesn’t tell you what the economics of creation will be.

Lemley: One point of art is to promote human connection. AI doesn’t do that. Another point is to provoke and challenge; AI can sometimes do that better than humans because it comes from a perspective that no one has. An exciting thing about generative AI is that it can do inhuman things. It has certainly done that in scientific domains. My view here is colored by 130 years of history. Every time this comes up, creators say this new technology will destroy art.

Balkin: Let’s go further back to patronage, where dukes and counts subsidize creativity. We moved to democracy plus markets as how we pay for art.

Lemley: Every time these objections have been made, they have been wrong. Sousa was upset about the phonograph because people wouldn’t buy his sheet music. But technology has always made it easier to create; I think AI will do the same. I have no artistic talent but I can make a painting now. The business model of being one of the small number of people who can make a good painting is now in trouble. But we have broadened the universe of people who can create.

Lee: We should talk about the people who are using and the people who are providing the sources. Take Stack Overflow, where people post to help.

Villa: Look at Wikipedia and open-source software. Wikipedia is in some sense anti-human-connection in its tone. But there is a huge amount of connection in its creation rather than in consumption. Maybe there will be alternative means of connection. I’m not as economically optimistic as Mark is. But we have definitely now seen new patronage models based on the idea that everyone will be famous to fifteen people. Part of being an artist is the performance of authenticity. That works for some artists and not for others. It’s the Internet doing a lot of this, not ML.

Lee: Some people will create traditionally, a larger group will use generative AI. Will we need copyright protections for people using generative AI to create art?

Lemley: This is the prompt-engineering question again. I think we will go in that direction, but I don’t know that we need it. The Internet, 3D printing, generative AI – these shrink the universe of people who need copyright. We’re better off with a world where the people who need it participate in that model and the larger universe of people don’t need to.

Samuelson: People like Kris Kashtanova want to be recognized as authors. The Copyright Office issued a certificate, then cancelled it after learning about Kashtanova’s use of Midjourney and amended it to narrow it, rejecting prompt engineering as a basis for copyright. We as people can recognize someone as an author even if the copyright system doesn’t. (Kashtanova has another application in.)

I like the metaphor of “Copilot.” It’s a nice metaphor for the optimistic version of what AI could be. I’ll use it to generate ideas and refine them, and use it as inspiration for my creative vision. That’s different from thinking about it as something that’s destroying my field. Some writers and visual artists are sincere that they think these technologies will kill their field.

Villa: The U.S. copyright system assumes that it’s a system of incentives. The European tradition has a moral-rights focus that even if I have given away everything, I have some inalienable rights and would be harmed if my work is exploited in certain ways. We’re seeing that idea in software now. 25 years ago, it was a very technolibertarian space. But AI developers now worry that the world can be harmed by their code. They want to give it away at no cost, but with moral strings attached. Every lawyer here has looked at these ethical licenses and cringed, because American copyright is not designed as a tool to make these kinds of ethical judgments. There is a fundamental mismatch between copyright and these goals.

Paul Ohm: I’m on team law. Do you want to defend the proposition that copyright is the first thing we talk about, or one of the major things we talk about? There are lots of vulnerable people that copyright does nothing for.

Villa: Software developers have been told for 25 years that copyright licenses are how you create ethical change in the world. Richard Stallman told me that this is how you create that change. The open-source licensing community is now saying to developers “No, but not like that!”

Lee: Is this about copyright in models?

Villa: People want to copyright models, datasets, and put terms of use on services. Open-source software means you can use it for whatever you want. If a corporation says “You can use it for everything except X,” it’s not open source. We as a legal community have failed to provide other tools.

Balkin: In the early days of the Internet, the hottest issues were copyright. Five years ago it was all speech. Now it’s copyright again. Why? It’s because copyright is a kind of property, and property is central to governance and control. I say “You can’t do this because it’s my property.” Property is often a very bad tool, but in a pinch it’s the first thing you turn to and the only thing you have.

Consider social media. You get a person – Elon Musk – who uses not copyright but control over a system to get people to do what he wants. But that’s a mature development of an industry. But we’re not at that point in the development of AI. It’s very early on.

Samuelson: It’s because copyright has the most generous remedy toolkit of any law in the universe, including $150,000 statutory damages per infringed work. Contract law has lots of limitations on its remedies. Breach of an open-source license is practically worthless without a copyright claim attached, because you probably can’t get an injunction. It’s the first thing out there because we’ve been overly generous with the remedy package. It applies automatically, lasts life plus 70, and has very broad rights. There are dozens of lawsuits that claim copyright because they want those tools, even though the harm is not a copyright harm. You use copyright to protect privacy, or deal with open-source license breach, or suppress criticism. Copyright has blossomed (or you could use a cancer metaphor) into something much bigger. Courts are good at saying, you may be hurt by this, but this is not a copyright claim. But you can see from these lawsuits why copyright is so tempting.

Villa: I’m shocked there are only ten lawsuits and not hundreds.

Corynne McSherry: What about rights of publicity? That was the focus in the congressional hearing two weeks ago.

Samuelson: The impersonation (Drake + The Weeknd) is a big issue because you can use their voices to say things they didn’t say and sing things they didn’t sing. I think impersonation might be narrower than the right of publicity. (The full RoP is a kind of quasi-property right in a person’s name, likeness, and other attributes of their personality. You’re appropriating that, and sometimes implying that they sponsor or agree with you.)

Lemley: We should prohibit some deepfakes and impersonation, but current right of publicity law is a disaster. My first instinct is that a federal law might not be bad if it gets rid of state law. My second instinct is that it’s Congress, so I’m always nervous. If it becomes a right to stop people from criticizing me or publicizing accurate things about me, that’s much worse.

Villa: Copyright is standardized globally by treaty. A right of publicity law doesn’t even work across all fifty states. That will be a major challenge for practitioners. All of these additional legal layers are not standardized across the world. Implementation will be a real challenge for creative communities that cross borders.

Samuelson: It will all be standardized through terms of service, so they will essentially become the law. And that isn’t a good thing either.

Artificial Intelligence and the First Amendment

Jack Balkin

I’m going to talk about free speech and generative AI.

AI systems require huge amounts of data, which must be collected and analyzed. We want to start by asking the extent to which the state can regulate the collection and analysis of the data. Here, I’m going to invoke Privacy’s Great Chain of Being: collection, collation, use, distribution, sale, or destruction of data. The First Amendment is most concerned with the back end: sale, distribution, and destruction. On the front end, it has much less to say. You can record a police officer in a public place, but otherwise it doesn’t say much. Privacy law can limit the collection of data.

The first caveat is that lots of data is not personal data. You can’t use privacy law as a general-purpose regulatory tool, just like you can’t use IP. The company might say, “We think what we’re doing is an editorial function like at a newspaper.” That would be a First Amendment right to train and tune however they want. That argument hasn’t been made yet, but it will come.

A very similar argument is being made for social media: is there a First Amendment right to program the algorithms however you want? I think the First Amendment argument is weaker here in AI than it is in social media. But the law here is uncertain.

If you are going to claim that you the company have First Amendment interests, then you are claiming that you are the speaker or publisher of the AI’s outputs. And that’s important for what we’re going to turn to: the relation between the outputs of AI models and the First Amendment.

First, does the AI have rights on its own? No. It’s not a person, it’s not sentient, it doesn’t have the characteristics of personhood that we use to accord First Amendment rights. But there is a wrinkle: the First Amendment protects the speech of artificial persons, like churches and corporations. So maybe we should treat AI systems as artificial persons.

I think that this isn’t going to work. The reason the law gives artificial entities rights is that it’s a social device for people’s purposes. That’s not the case for AI systems. OpenAI has First Amendment rights; that makes sense because people organize. But ChatGPT is not a device for achieving the ends of the people in the company.

Second, whose rights are at stake here? Which people have rights and what rights are they, and who will be responsible for the harms AI causes? Technology is a means of mediating relationships of social power and changes how people can exercise power over each other. Who has the right to use this power, and who can get remedies for harms?

The company will claim that the AI speech is their speech, or the user who prompts the AI will claim that the speech is theirs. In this case, ordinary First Amendment law applies. A human speaker is using this technology as a tool.

The next problem is where someone says, “This is not my speech” but the law says it is. (E.g., someone who repeats a defamatory statement.) But in defamation law, you don’t just have to show that it’s harmful, but you also have to show willfulness. So, for example, in Counterman v. Colorado, you have to show that the person making a threat knew that they were making one.

You can see the problem here. What is my intention here when I provide an AI that hallucinates threats? When you have the intermediation of an AI system that engages in torts. Here’s another problem: The AI system incites a riot. Here again we have a mens rea problem. The company doesn’t have the intent, even though the effects are the same. We’ll need new doctrine, because otherwise the company is insulated from liability.

The second case is an interesting one. First Amendment law is interested in the rights of listeners as well as speakers. Suppose I want to read the Communist Manifesto. Marx and Engels don’t have First Amendment rights: they’re overseas non-Americans who are also dead. So it must be that listeners have a right to access information.

Once again, we’ll have to come up with a different way of thinking about it. Here’s a possibility. There’s an entire body of law organized for listener rights: it’s commercial speech. It’s speech whose purpose is to tell you whether to buy a product. The justification is not that speakers have rights here, it’s that listeners have a right to access true useful information. So we could treat AI outputs under rules designed to deal with listener rights.

There will be some problems with this. Sometimes we wouldn’t want to say that the fact that speech is false is a reason to ban it, e.g. in matters of opinion. So commercial speech is not an answer to the problem, but it could be an inspiration.

Another possibility. There’s speech, but it’s part of a product. Take Google Maps. You push a button. You don’t have a conversation, it gives you directions. It’s simply a product that provides information upon request. The law has treated this as an ordinary case of products liability. But if it’s anything beyond this – if it’s an encyclopedia or a book – the law will treat it as a free-speech case. That third example will be very minor.

In privacy, collection, collation, and use regulations are consistent with the First Amendment. But that’s not a complete solution because most data is not personal data. In the First Amendment, the central problem is the mens rea problem, and that’s a problem whether or not someone claims the speech as their own. In both cases, we’ll need new doctrines.

Spotlight Talks

Colin Doyle, The Restatement (Artificial) of Torts

LLMs seem poised to take over legal research and writing tasks. This article proposes that we can task them with creating Restatements of law, and use that as testing grounds for their performances.

The Restatements are treatises designed to synthesize U.S. law and provide clear concise summaries of what the laws are. Laws differ from state to state, so that the Restatements are both descriptive and normative projects that aim to clarify the law.

How can we do this? The process I’m using is similar to how a human would craft a Restatement. We give the LLM a large number of cases on a topic. Its first step is to write a casebrief for each case in the knowledge base. Its second step is to loop through the casebriefs and copy the relevant rules from each brief. Its third step is to distill the rules into shorter ones, group them together, and mark the trends in the law. Then it uses those notes to write a blackletter law provision, lifted from the American Law Institute’s model for how to write Restatement provisions. Then it writes comments, illustrations, and a Reporter’s Note listing the cases it was derived from. I also asked it to apply the ALI style manual.
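
The steps described above could be sketched roughly as follows. This is only an illustration, not the speaker's code: `llm` is a hypothetical callable that takes a prompt and returns text, and the function name `draft_restatement` and the prompt wording are assumptions made for the example.

```python
from typing import Callable, Dict, List


def draft_restatement(cases: List[str], llm: Callable[[str], str]) -> Dict[str, object]:
    """Hypothetical sketch of the multi-step pipeline; `llm` is any prompt -> text function."""
    # Step 1: write a case brief for each case in the knowledge base.
    briefs = [llm("Write a case brief for this case:\n" + case) for case in cases]
    # Step 2: loop through the briefs and pull out the relevant rules.
    rules = [llm("Copy the relevant rules from this brief:\n" + brief) for brief in briefs]
    # Step 3: distill and group the rules, noting trends in the law.
    notes = llm("Distill, group, and note trends in these rules:\n" + "\n".join(rules))
    # Step 4: draft the blackletter provision from the notes.
    blackletter = llm("Write a blackletter provision from these notes:\n" + notes)
    # Step 5: add comments and illustrations.
    commentary = llm("Write comments and illustrations for:\n" + blackletter)
    return {
        "blackletter": blackletter,
        "commentary": commentary,
        # Simplification: the Reporter's Note here is just the list of input cases,
        # rather than text written by the model.
        "reporters_note": list(cases),
    }
```

Every case passes through the model twice before anything is combined, so the number of model calls grows with the number of cases.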

It does produce something credible. We get accurate blackletter provisions. We get sensible commentary. We get Reporter’s Notes that cite to the right cases. The comments can be generic, and it breaks down when there are too many cases (keeping the list of cases spills out of the context window).

I’m excited about the possibility of comparing the artificial restatement with the human restatement. Where there’s a consensus, the two are identical. But where the rules get more complicated, we have a divergence. Also, there’s not just one artificial Restatement – we can prime a language model to produce rules with different values and goals.

Shayne Longpre, The Data Provenance Project:

I hope to convince you that we’re experiencing a crisis in data transparency. Models train on billions of tokens from tens of thousands of sources and thousands of distilled datasets. That leads us to lose a basic understanding of the underlying data, and to lose the provenance of that data.

Lots of data from large companies is just undocumented. We cannot properly audit the risks without understanding the underlying data. Reuse, biases, toxicities, all kinds of unintended behavior, and poor-quality models. The best repository we have is HuggingFace. We took a random sample of datasets and found that the licenses for 50% were misattributed.

We’re doing a massive dataset audit to trace the provenance from text sources to dataset creators to licenses. We’re releasing tools to let developers filter by the properties we’ve identified so they can drill down on the data that they’re using. We’re looking not just at the original sources but also at the datasets and at the data collections.

Practitioners seem to fall into two categories. Either they train on everything, or they are extremely cautious and are concerned about every license they look at.

Rui-Jie Yew, Break It Till You Make It:

When is training on copyrighted data legal? One common claim is that when the final use is noninfringing, then the intermediate training process is fair use. This presumes a close relationship between training and application. This is a contrast to a case where there’s an expressive input and an infringing expressive output, e.g., make a model of one song and then make an output that sounds like it.

The real world is much messier. Developers will make one pre-trained model that can be used for many applications, some of which are expressive and some of which are non-expressive. So if there’s a pretrained model that is then fine-tuned to classify images and also to generate music, the model has both expressive and non-expressive uses. This complicates liability attribution and allocation.

This also touches on the point that different architectures introduce different legal pressures. Pre-training is an important part of the AI supply chain.

Jonas A. Geiping, Diffusion Art or Digital Forgery?:

Diffusion models copy. If you generate lots of random images, and then compare them to the training dataset, you find lots of close images. Sometimes there are close matches for both content and style, sometimes they match more on style and less on the content. We find that about 1.88% of generated images are replications.

What causes replication? One cause is data duplication, where the same image is in the dataset with lots of variations (e.g. a couch with different posters in the background). These duplications can be partial, because humans have varied them.

Another cause is text conditioning, where the model associates specific images with specific captions. If you see the same caption, you’re much more likely to generate the same image. If you break the link – you train on examples where you resample when the captions are the same – this phenomenon goes away. The model no longer treats the caption as a key.

Mitigations: introduce variation in captions during training, or add a bit of noise to captions at test time.
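
As a rough illustration of what "adding a bit of noise to captions" might look like, here is a minimal sketch that randomly swaps some caption words for random vocabulary words. Random word replacement is only one possible form of caption noise and is an assumption here, as are the function name `perturb_caption`, the `vocab` list, and the default probability; the talk does not specify the exact mechanism.

```python
import random
from typing import List, Optional


def perturb_caption(caption: str, vocab: List[str], p: float = 0.1,
                    rng: Optional[random.Random] = None) -> str:
    """Replace each word with a random vocabulary word with probability p (illustrative only)."""
    rng = rng or random.Random()
    words = caption.split()
    # Each word is independently kept or replaced, so the caption length is preserved.
    noisy = [rng.choice(vocab) if rng.random() < p else word for word in words]
    return " ".join(noisy)
```

With `p=0` the caption comes back unchanged; larger values weaken the link between a specific caption and a specific training image, at some cost to how closely outputs follow the prompt.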

Rui-Jie Yew (for Stephen Casper), Measuring the Success of Diffusion Models at Imitating Human Artists:

Diffusion models are getting better at imitating specific artists. How do we evaluate how effectively diffusion models imitate specific artists?

  1. Human judgment. That’s subjective and difficult to apply consistently.
  2. Using training data. But models are increasingly the result of complicated processes.
  3. Using AI image classification.

Our goal is not to automate determinations of infringement. Two visual images can be similar in non-obvious ways that are relevant to copyright law.

We start by identifying 70 artists and then reclassify them based on Stable Diffusion’s imitation of their works. Most of them were successfully identified. We also used cosine distance to measure similarity between artists’ images and Stable Diffusion imitations of their work versus other artists. Again, there are statistically significant similarities with the imitations.
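
For readers unfamiliar with cosine-based similarity measures, here is a minimal sketch. It assumes each image has already been turned into a numeric feature vector by some image encoder; the notes do not say which encoder was used, and the function name is illustrative.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b): 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this kind of measure, an imitation counts as successful when its vector is closer to the imitated artist's real images than to other artists' images.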

So yes, Stable Diffusion is successful at imitating human artists!

A Brief Introduction to Machine Learning and Memorization

Nicholas Carlini

Machine learning is a really simple thing that tries to train a model to do something useful. Consider trying to become a lawyer. One way to train is to go to law school, learn from professors, take exams, etc. Another way is to memorize every test you can and memorize all of the answers. At the end of both, you can probably pass the bar. But one way is actually useful; the other is only useful for passing the bar.

We train machine learning models by doing the second thing. We show them all the tests in the world, and ask them to memorize all the answers at the same time. Any amount of generalization is by luck alone. When we train models on text, we train them by asking them to predict the last blank in a sequence. When we train models on images, we add noise, and then ask them to reconstruct the original image from the noise. Image generation is a side effect of denoising. It makes sense that they memorize the images, because there’s no way to go back to the originals unless you’ve memorized them.
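
To make the "add noise, then reconstruct" setup concrete, here is a minimal sketch of building one training example for an image model. The mixing formula is a common choice in diffusion models but is an assumption here, as are the function name and the `keep` parameter. Many real systems train the model to predict the added noise rather than the clean image.

```python
import math
import random
from typing import List, Tuple


def make_denoising_example(pixels: List[float], keep: float,
                           rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return (noisy_input, target); `keep` in [0, 1] is how much of the original signal survives."""
    # Draw independent Gaussian noise for every pixel.
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    # Blend the original image with noise; keep=1 leaves it intact, keep=0 is pure noise.
    noisy = [math.sqrt(keep) * x + math.sqrt(1.0 - keep) * n
             for x, n in zip(pixels, noise)]
    # The model is asked to recover the original pixels from the noisy input.
    return noisy, list(pixels)
```

When `keep` is close to zero the input carries almost no information about the original image.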

We like to think of generalization as a human kind of generalization. But suppose I’m learning how to play baseball. To an ML researcher, generalization is not getting good at baseball. Rather, it means that if my parents take me to the same field and throw the ball the same way, I can still hit the ball. Change the field or the way of throwing the ball, the model will fail.

Some policy-space observers argue that the copying from any given input is minimal. So we did experiments to show that models do in fact memorize substantial portions of their inputs. We found examples of people’s personal information, which the model leaks to anyone who wants. Yes, your information is still online anyway, but the model is still doxing you. Several thousand lines of code is not de minimis.

Circa 2020, we knew that text models did this. It turns out that image models also do this. Stable Diffusion memorizes specific input images. It turns out that the difference between those output images and the originals is smaller than the difference from compressing an image into a JPEG. We can do this for a lot of images.

Takeaways. There are three worlds we might have been in:

  1. Models always emit training data.
  2. Models sometimes emit training data.
  3. Models never emit training data.

We are in the second world, the blurry world where models sometimes emit training data, and sometimes don’t.

Gautam Kamath

Let’s take for granted that ML models are likely to infringe. In the U.S., that requires access plus substantial similarity. So one response is to fix access: remove any copyrighted image from the training set inputs. But it might not be obvious what is copyrighted. Another response is to fix substantial similarity: filter the outputs to remove substantially similar images. Again, though, we don’t have a clear definition of substantial similarity.

So let’s relax a requirement. Access-freeness may be hard to guarantee, and may be too strict. So we’ll go for near access-freeness (as defined by Vyas, Kakade, and Barak). We’ll try to use a model that is close to one that didn’t have access to the copyrighted work.

Is this kosher? Would it hold up in a court of law? I have no idea; I’m not a lawyer. But if it does, it will help with some of the previous challenges. No hard removal problems for training data.

So how do you do this? Let’s turn to differential privacy for inspiration. We feed a dataset into an algorithm to produce a model. Imagine, however, that we had a dataset that was different in one entry. An adversary is trying to tell which of the two datasets we trained on. If they can’t do much better than random guessing, then the training procedure was differentially private. The procedure is differentially private when the model distributions don’t change much when adding or removing one point. DP is widely used in government and industry.
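
For reference, the standard formal statement of this idea (not spelled out in the talk) is as follows. A randomized training algorithm $M$ is $(\varepsilon, \delta)$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in one entry and every set $S$ of possible output models, the inequality below holds. The symbols $M$, $D$, $D'$, $S$, $\varepsilon$, and $\delta$ are the usual textbook notation rather than anything taken from the slides.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
```

Smaller $\varepsilon$ (and $\delta$) means the two output distributions are harder to tell apart, which is the sense in which the adversary cannot do much better than random guessing.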

Concretely, DP protects against membership inference (is a single data point in the dataset?), and against revealing any information that wouldn’t be present if a particular data point weren’t trained on.

We can train a differentially private model on training data. Now consider any arbitrary copyrighted point. The resulting model is close to one trained on the same dataset minus that point. So differential privacy is very close to near access freeness.

Design choices:

  • What value of epsilon is sufficient? There’s no free lunch; the smaller you set epsilon, the lower the utility of the model.
  • What is a point? A sentence? A document? An image? A portion of one? All images in a series? All of an artist’s works?
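
The epsilon tradeoff in the first bullet can be seen in a toy example. The sketch below is not differentially private model training; it applies the classic Laplace mechanism to a simple count, purely to show that a smaller epsilon means more noise and therefore less accurate answers. All function names are illustrative.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records: list, epsilon: float, rng: random.Random) -> float:
    # Adding or removing one record changes the count by at most 1 (sensitivity 1),
    # so Laplace noise with scale 1/epsilon gives an epsilon-DP count.
    return len(records) + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    # Average error of the noisy count over many runs, for a dataset of 100 records.
    rng = random.Random(seed)
    records = list(range(100))
    total = sum(abs(private_count(records, epsilon, rng) - 100) for _ in range(trials))
    return total / trials
```

Comparing `mean_abs_error(0.1)` with `mean_abs_error(10.0)` shows the tradeoff: the stricter privacy setting gives a much noisier count.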

Note that copyright and privacy violations are not identical. Consider also lawsuits claiming copying of artists’ styles. Differential privacy might not help here, because style is not a property of a single data point.

Panel: Privacy

Moderators: Katherine Lee and Deep Ganguli
Panelists: Kristen Vaccaro, Nicholas Carlini, Miles Brundage, Gautam Kamath, and Jack Balkin

Lee: What is privacy?

Balkin: Dan Solove wrote a famous taxonomy of privacy. Take your pick. Sometimes it’s about control of information. Sometimes it’s about manipulation, or disclosure. It’s a starfish concept.

Brundage: I’ve read various definitions but I don’t want to regurgitate one I’ve memorized.

Carlini: You know it when you see it. I try to do most of my work with things that are obvious violations that everyone agrees on. For most attack research, you don’t need a working definition.

Kamath: Computer scientists often like to work with specialized definitions. I work on DP which is a very specific notion.

Vaccaro: I spent a lot of time trying to explain privacy guarantees to people and see what they understand. I’m also interested in networked conceptions of privacy. When you share a picture of yourself, you’re also sharing information about your friends, family, and associates.

Ganguli: What are the most important privacy concerns you worry about in your work? There’s a show on Netflix called Deep Fake Love in which people are shown deep fakes of their partners cheating. As a community, what types of privacy should we be thinking about?

Balkin: People most worry about the loss of control and the obliteration of circles of trust. Another possibility is about false light: the deep fakes problem. You can construct what looks like the person in almost any situation. The method of collection can be one of the most serious violations. We could think about privacy at each stage.

Carlini: I’m mostly worried about data leakage about individual people. If we can’t solve that, what are we even doing?

Balkin: Vulnerability to attack might be an independent privacy problem. It’s a new kind of privacy problem.

Lee: Is generative AI different than search or other kinds of data collection? Take the personal-info leak. Is this worse than Google Search?

Carlini: The reason we do what we do on models trained online is so that we can do it without violating privacy. We could do this on models trained on hospital records, but I don’t want to violate medical privacy. The takeaway is that if this model was trained on your private emails, it would have leaked them, so we should be very careful about training on more private data.

Brundage: It’s important to compare deployed systems to deployed systems; think about ChatGPT refusing certain requests.

Vaccaro: A question I still had at the end of the morning is about the power dynamics. Obviously I spend a lot of time working on social media. People post things there with certain expectations about how they’re going to be used. With generative AI, people have a sense that there is something larger coming out of these uses.

Kamath: I think we don’t have a very good idea of what we agree to and where our data winds up. People click to agree and think things will be fine. Maybe with generative AI, things won’t be fine.

Balkin: Suppose I’ve trained a model. And then I put in a picture of this guy here, and ask the model to predict his medical situation and intimate details, which we would ordinarily consider private? (Asking about true predictions, not hallucinations.)

Kamath: You could tell name, demographics, and some broad predictions based on that (e.g. of Indian descent so of higher diabetes risk, but young so maybe not).

Balkin: So a possible future problem is that you can infer sensitive information, not just memorize it.

Ganguli: I used to ask who is Deep Ganguli, and the model would say “I don’t know who that is and I don’t know anything about anyone who works at Anthropic.” Now it says I work at the University of Calcutta, because Ganguli is a Brahmin last name and it knows I do some kind of research.

Balkin: A model of privacy violation is that information is given to a trusted person (e.g. a doctor) and then it leaks out to the wrong person who uses it for an inappropriate purpose. But suppose I could generate that information without any individual leaking it, that solely through different pieces of information it could infer things about me. That seems to be the new thing that would be quite different from search. Once you can do it for one person, you can do it for lots of people.

Lee: This reminds me of recommendation algorithms. Any one product might not lead to an inference, but a collection of them might.

Balkin: We have the same thing with political behavior.

Kamath: Suppose you see a person smoking one cigarette, then another. You say to them, “I think you might develop cancer.” Is that a privacy violation? Now consider a machine learning model that draws complex inferences. Does that become a privacy violation?

Ganguli: How should we think about malicious access versus incidental access? Is this a useful distinction for generative AI?

Carlini: Suppose I had a model that would reveal all kinds of personal information about me if you asked “Who is Nicholas Carlini?” That would be much worse than a model that would require a very specialized prompt to generate these outputs. We had this discussion in the copyright context as well – a model that infringes when prompted very specifically for infringement, versus one that does it on its own. RLHF does fairly well at preventing the latter, but has a much harder time preventing the first type.

Brundage: Alternatively, at what point is it sufficiently difficult that it’s not an attractive alternative to just Googling it?

Vaccaro: It’s also buying personal data on the market, not just search.

Lee: Does malicious versus incidental change how you think about the legal analysis?

Balkin: You’re interested in both. If you know there are bad actors, you might have some responsibility to fine-tune to avoid that. Suppose I want to make some money off of vulnerable individuals. I know there are people who are financial idiots who will buy worthless coins. Can I use LLMs to identify these people? Do we have the capacity to do that?

Vaccaro: We don’t need generative models for this. You can do this with regular ML. And in fact you can buy lists of people with vulnerable attributes.

Balkin: Do these new technologies make it easier or more plausible to prey on people? Are they useful for even more efficient preying?

Kamath: How much do these datasets cost?

Carlini: Datasets with credit-card numbers are about $5 to $20 per person. But that’s illegal.

Vaccaro: The legal lists are very cheap. Are there ways for generative models to extract more information? Frankly, social media is very good at this.

Balkin: In the world of social media, the dirty little secret is that the companies promise that their algorithms are very good, and they’re bilking their advertisers. But could these generative AI technologies actually make good on these promises?

Kamath: Do we consider targeted advertising a privacy violation?

Balkin: It has to do with the vulnerability of the target.

Lee: It kind of sounds like we’re talking about traditional ML here. Generative AI has different media: text, audio, video. So you could spoof people by pretending to be someone they know using their voice.

Kamath: I’m not sure copying a voice is a privacy violation.

Carlini: I don’t see this as a privacy attack; that’s a security attack.

Vaccaro: At the same time, this does expose the fact that there should be even more hackers working on these models. You could be very creative at misusing these models.

Carlini: We can go to automated spearphishing.

Balkin: Our conversation is about why we care about privacy. If you think the harm is the creation of vulnerability, then you can see the privacy connection to exploitation.

Question: With generative AI, we have huge pre-trained models. Do you think it’s better to partition the model and only allow access through a service, or is it better to have a public model like Meta has done?

Carlini: I want public models. I think holding things secret only delays the problems. Either don’t release it or actually release it and let people do the security analysis. Trying to hold it back is not going to work. Maybe the OpenAI person feels differently.

Brundage: I think it’s a complex issue. I worry a lot about the implications of big jumps in capabilities. I worry about “just release it” especially when there is a Cambrian Explosion of pairing models up with tools.

Kamath: A classic issue in image classification is adversarial examples. We tried to solve them. And then there was recently a talk showing that you can do the same things against LLMs. Seems like the stakes have been raised. It’s helpful if we can see these issues in the simplest models to predict future issues.

Paul Ohm: Invitation to the Privacy Law Scholars Conference, which happens every June, where we have these conversations. Could you train a generative AI that would never say anything about any individual person ever? You’d be very confident that certain privacy harms wouldn’t be achievable. Would that be tractable?

Fatemehsadat Mireshghallah: You could use theory of mind to keep track of this. (There was a theory of mind workshop here yesterday.) You have to figure out who is a person and reason about what the user is asking about them.

Lee: This is another thing you can try to align a model to do.

Question: I come from Los Alamos, where the concept of putting a technology out there is starting to feel a bit poignant. We have fictional examples of people like Sherlock Holmes who can infer personal details from small observations. From a philosophical/legal standpoint, what’s the difference between having a few people like that walking around and having a high-volume tool?

Carlini: There’s a difference between standing on the street and looking in their window and looking in through a telescope from far away. But technically they’re the same thing. This is mostly a legal question.

Balkin: The difference has to do with power relations. Sherlock Holmes works with Scotland Yard. If he were to use his powers for blackmail, that would be different. Another difference is that there’s only one of him. If he existed, what restrictions would be put on him? In the law, we’re very worried about concentrations of extreme power. A lot of privacy law is about this.

Kamath: We do have large teams of Holmeses running around. The descendants of the Pinkertons attack unions by collecting intrusive information.

Balkin: That’s not necessarily legal; the problem is often that they’re not called to account for the illegal things they do. Privacy is about the use of information power to take unfair advantage of others and treat them unjustly.

Vaccaro: In San Diego we’re fighting about license plate readers.

Balkin: Unfortunately, this probably isn’t a Fourth Amendment violation.

Vaccaro: Maybe meeting the law is way too low a bar.

Lee: This is also a problem for making a product. User perceptions of privacy matter a lot, too.

Question: I feel like privacy and AI is a problem for rich people. Take GDPR. The big companies benefitted from this regulation, because they had the tools and competencies and scale to implement them. Same thing with AI Act in EU.

Brundage: Privacy is for everyone.

Vaccaro: My concern is that you can pay for privacy. If I want email without Google reading it, I can pay more for it and set up my own server.

Kamath: It’s a tradeoff. We can ask with DP about how much society benefits, and over time there may be more expertise and tools.

Andres Guadamuz: I have to strongly disagree. GDPR and the AI Act aren’t just being pushed by large companies. Large companies are fighting the enforcement of these privacy laws. But GDPR benefits citizens; I’ve used it personally. The AI Act is going to be a disaster, but that’s not being pushed by the big companies.

Question: You (Balkin) said AI companies will make free-speech arguments. Why will their arguments be weaker than social media companies’ arguments? Will AI alignment efforts improve their First Amendment arguments?

Balkin: The immediate analogy they will make is that fine-tuning to prevent hate speech is a kind of editorial function, like newspapers, and that it’s no different than content moderation at social-media companies. But no one doubts that social media hosts people’s speech. Does that mean that the AI company is conceding that it hosts the speech of the AI? That would put them in a different litigation posture than claiming it’s just a product. That’s the puzzle. (There are people who think that what social media companies do is not editorial.) It would follow from the editorial-function argument that an AI company would have a First Amendment right not to fine-tune.

Question: From an EU perspective, data protection has a lot of overlap with privacy but is considered a distinct right. Under the GDPR, a lot of targeted advertising and cookie targeting is unlawful. It might be that they could be done lawfully, but currently the IAB’s consent framework legally falls short. So when you take these profiling technologies, and you layer on top of that generative AI, what happens when the model not only predicts the most likely token, but also predicts what the person wants to hear?

Kamath: So is model personalization a privacy violation?

Carlini: I don’t see how this being an LLM makes it different.

Lee: The harms are potentially worse because it’s interactive.

Carlini: Of course, if you see my personalized search results, that’s the same problem. I don’t see how the LLM makes it fundamentally a different object.

Lee: We’ll end there, on the question of whether generative AI changes the privacy lawsuits.

Some Skepticism About NFT Copyright

I am participating today in the Copyright Office’s NFT Roundtable. Here is the text of my (brief) opening remarks:

Good morning, and thank you. I’m a professor at Cornell Law School and Cornell Tech. I would like to make one point, at a high level of abstraction. It may seem obvious, but I think it is urgent not to lose sight of.

One promise of blockchain is that it is a perfect paper trail. But all paper trails can fail. Sometimes this is because of technical failures in a record-keeping system itself, which is the problem that blockchain attempts to address. But more often it is because the information that parties attempt to record never corresponded to reality in the first place.

Some transactions that look valid on paper are not, because of forgery, fraud, duress, or mistake. And some otherwise valid transactions are imperfectly documented: perhaps the contract was signed in the wrong place. If a transactional form is used enough times, everything that can go wrong with it eventually will.

The legal system deals with these cases all the time. When the facts on the ground and the records in the books get out of sync, a court or agency will step in to bring them back into alignment. Sometimes in such cases the paper trail prevails. But often it does not, and the legal system will ignore the records, or correct them to match reality.

The application to NFTs should be apparent. The transfer of an NFT by entering a smart-contract transaction on a blockchain is a kind of paper trail, and all paper trails can fail. Some intended NFT transfers will not go through, and some NFT owners will lose control of their NFTs without giving legally valid consent.

This means that if the state of copyright ownership or licensing is tied to ownership of an NFT, one of two propositions must be true. Either the legal system must have some mechanism for correcting the blockchain when its records are in error, or else in some cases copyright owners will lose legal control of their works through preventable forgery, fraud, duress, or mistake.

It is sometimes said that the advantage of a blockchain is that on-chain records are immutable and authoritative. That is precisely why I am skeptical of blockchains in the copyright space. To quote Douglas Adams, “The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair.”

Reading Without Footprints

The phrase “I’m not going to link to it” has 385,000 results on Google. The idea is usually that the author wants to explain how someone is wrong on the Internet, but doesn’t want to reward that someone with pageviews, ad impressions, and other attention-based currencies. “Don’t feed the trolls,” goes the conventional wisdom, telling authors not to write about them. But in an age when silent analytics sentinels observe and report everything anyone does online, readers can feed the trolls without saying a word.

Actually, the problem is even worse. You can feed the trolls without ever interacting with them or their websites. If you Google “[person’s name] bad take,” you tell Google that [person’s name] is important right now. If you click on a search result, you reward a news site for writing an instant reaction story about the take. Every click teaches the Internet to supply more car crashes.

Not linking to the bad thing is usually described as a problem of trolls, and of social media, and of online discourse. But I think that it is also a problem of privacy. Reader privacy is well-recognized in law and in legal scholarship, and the threats it faces online are well-described. Not for nothing did Julie Cohen call for a right to read anonymously. Surveillance deters readers from seeking out unpopular opinions, facilitates uncannily manipulative advertising, and empowers the state to crush dissent.

To these I would add that attention can be a signal wrapped in an incentive. Sometimes, these signals and incentives are exactly what I want: I happily invite C.J. Sansom to shut up and take my money every time he publishes a new Shardlake book. But other times, I find myself uneasily worrying about how to find out a thing without causing there to be more of it in the world. There’s a weird new meme from an overrated TV show going around, and I want to know what actually happened in the scene. There’s a book out whose premise sounds awful, and I want to know if it’s as bad as I’ve been told. Or you-know-who just bleated out something typically terrible on his Twitter clone, and I don’t understand what all the people who are deliberately Not Linking To It are talking about.

We are losing the ability to read without consequences. There is something valuable about having a realm of contemplation that precedes the realm of action, a place to pause and gather one’s thoughts before committing. Leaving footprints everywhere you roam doesn’t just allow people to follow you. It also tramples paths, channeling humanity’s collective thoughts in ways they should perhaps not go.

I Do Not Think That NFT Means What You Think It Does

I recently tweeted that every sentence of this “explanation” of blockchain-based non-fungible tokens (NFTs) from the Harvard Business Review is false:

NFTs have fundamentally changed the market for digital assets. Historically there was no way to separate the “owner” of a digital artwork from someone who just saved a copy to their desktop. Markets can’t operate without clear property rights: Before someone can buy a good, it has to be clear who has the right to sell it, and once someone does buy, you need to be able to transfer ownership from the seller to the buyer. NFTs solve this problem by giving parties something they can agree represents ownership. In doing so, they make it possible to build markets around new types of transactions — buying and selling products that could never be sold before, or enabling transactions to happen in innovative ways that are more efficient and valuable.

In a follow-up thread, I expanded on why I am so skeptical about NFTs. I thought it would be useful to clean up and collect my thoughts in one place. I am a law professor who thinks a lot about digital property and about decentralized systems, and I think the idea that NFTs are about to revolutionize property law misunderstands how property law actually works.

Loosely speaking, there are three kinds of property you could use an NFT to try to control ownership of: physical things like houses, cars, or tungsten cubes; information like digital artworks; and intangible rights like corporate shares.

By default, buying an NFT “of” one of these three things doesn’t give you possession of them. Getting an NFT representing a tungsten cube doesn’t magically move the cube to your house. It’s still somewhere else in the world. If you want NFTs to actually control ownership of anything besides themselves, you need the legal system to back them up and say that whoever holds the NFT actually owns the thing.

Right now, the legal system doesn’t work that way. Transfer of an NFT doesn’t give you any legal rights in the thing. That’s not how IP and property work. Lawyers who know IP and property law are in pretty strong agreement on this.

It’s possible to imagine systems that would tie legal ownership to possession of an NFT. But they’re (1) not what most current NFTs do, (2) technically ambitious to the point of absurdity, and (3) profoundly dystopian. To see why, suppose we had a system that made the NFT on a blockchain legally authoritative for ownership of a copyright, or of an original object, etc. There would still be the enforcement problem of getting everyone to respect the owner’s rights.

There are two ways to enforce NFT “ownership.” The first is to get the legal system to do it. Judges would issue orders saying you own this widget because you have the Widget NFT, and then county sheriffs would show up to take possession of the widget and give it to you. The thing is, if you’re going to do that, there’s no point to the blockchain. We already have land registries, the DMV, and the Copyright Office. The blockchain is just an inefficient way of telling judges and sheriffs to do the same thing.

The other is to enforce everything digitally, by linking the physical world to the blockchain using secure digital hardware devices. That way, your car won’t start unless you prove ownership of the YourCar NFT. There are some serious downsides here. When your computer gets hacked, you also lose ownership of your car!

Sometimes, NFT advocates avoid dealing with the inconvenient fact that the physical world doesn’t run on a blockchain by shifting to a future in online spaces that do. They propose a blockchain-based metaverse, or online games with NFT-based economies, etc. The thing is that we’ve had digital property in those virtual spaces for decades. None of them needed a blockchain to work.

The bottom line is that almost[1] everything NFT advocates want to do on a blockchain can be done more easily and efficiently without one, and the legal infrastructure needed to make NFTs work defeats the point of using a blockchain in the first place.

[1] I say “almost” everything because NFT art may be an exception. A lot of the current hype around NFTs consists of the belief that the rest of the world will follow the same rules as NFT art. But of course part of the point of art is that it doesn’t follow the same rules as the rest of the world.

Some Mistakes I Have Personally Found in Published Federal Judicial Opinions

Applying these principles, the court [in Armstrong v. Eagle Rock Entm’t, Inc., 655 F. Supp. 2d 779, 786 (E.D. Mich. 2009)] found that Eagle Rock Entertainment’s decision to use Louis Armstrong’s picture on the cover liner of its DVD entitled, ‘Mahavishnu Orchestra, Live at Montreux, 1984, 1974,’ without consent was protected by the First Amendment. Rosa & Raymond Parks Inst. for Self Development v. Target Corp., 90 F. Supp. 3d 1256, 1264 (2015).

Armstrong involved Ralphe Armstrong, not Louis Armstrong, who died in 1971.

The examiner’s final rejection, repeated in his Answer on appeal to the Patent and Trademark Office (PTO) Board of Appeals (board), was on the grounds that claims 1 and 2 are anticipated (fully met) by, and claim 3 would have been obvious from, an article by Kalabukhova and Mikheyew, Investigation of the Mechanical Properties of Ti-Mo-Ni Alloys, Russian Metallurgy (Metally) No. 3, pages 130-133 (1970) (in the court below and hereinafter called “the Russian article”) under 35 U.S.C. §§ 102 and 103, respectively. Titanium Metals Corp. of America v. Banner, 778 F.2d 775, 776 (Fed. Cir. 1985)

The author’s surname is Михеев, i.e., Mikheyev, not Mikheyew. There is no letter in the Cyrillic alphabet that transliterates to “w” under any commonly used system of Romanization.

GCC filed a trademark application for the mark GUANTANAMERA for use in connection with cigars on May, 14, 2001. When translated, “guantanamera” means “(i) the female adjectival form of GUANTANAMO, meaning having to do with or belonging to the city or province of Guantanamo, Cuba; and/or (ii) a woman from the city or province of Guantanamo, Cuba.” (Op. U.S.P.T.O. at 2.) Many people are also familiar with the Cuban folk song, Guantanamera, which was originally recorded in 1966. (Id. at 12-13.) Guantanamera Cigar Co. v. Corporacion Habanos, SA, 729 F. Supp. 2d 246, 250 (D.D.C. 2010)

The first recording of “Guantanamera” (lyrics adapted by Julián Orbón from poetry by José Martí, music by Joseíto Fernández) was probably sometime in the 1930s by Fernández. It was released in the United States in two well-known versions in 1963, one by the Weavers (from a 1955 concert) and another by Pete Seeger. All of these predate the 1966 easy-listening version by the Sandpipers.

The Copyright Law of Embedding Just Got a Lot More Interesting

Tim Lee has a remarkable story at Ars Technica about a remarkable copyright case, McGucken v. Newsweek. Its headline, “Instagram just threw users of its embedding API under the bus,” is not an exaggeration. (Disclosure: I am quoted in the story, and I learned about the case from being interviewed for it.) The facts are simple:

Photographer Elliot McGucken took a rare photo (perhaps this one) of an ephemeral lake in Death Valley. Ordinarily, Death Valley is bone dry, but occasionally a heavy rain will create a sizable body of water. Newsweek asked to license the image, but McGucken turned down their offer. So instead Newsweek embedded a post from McGucken’s Instagram feed containing the image.

This is the third case I am aware of in the Southern District of New York in the last two years on nearly identical facts. One of them, Sinclair v. Ziff Davis, held that Mashable was not liable for an Instagram embed. The court reasoned that by uploading her photograph to Instagram, photographer Stephanie Sinclair agreed to Instagram’s terms of service, including a copyright license to Instagram to display the photograph – and also thereby allowed Instagram to sublicense the photograph to its users who used the embedding API. Thus, Mashable had a valid license from Sinclair by way of Instagram, so no infringement.

McGucken agrees with most of this reasoning, but stops just short of the crucial step.

The Court finds Judge Wood’s decision [in Sinclair] to be well-reasoned and sees little cause to disagree with that court’s reading of Instagram’s Terms of Use and other policies. Indeed, insofar as Plaintiff contends that Instagram lacks the right to sublicense his publicly posted photographs to other users, the Court flatly rejects that argument. The Terms of Use unequivocally grant Instagram a license to sublicense Plaintiff’s publicly posted content, and the Privacy Policy clearly states that “other Users may search for, see, use, or share any of your User Content that you make publicly available through” Instagram.

Nevertheless, the Court cannot dismiss Plaintiff’s claims based on this licensing theory at this stage in the litigation. As Plaintiff notes in his supplemental opposition brief, there is no evidence before the Court of a sublicense between Instagram and Defendant. Although Instagram’s various terms and policies clearly foresee the possibility of entities such as Defendant using web embeds to share other users’ content, none of them expressly grants a sublicense to those who embed publicly posted content. Nor can the Court find, on the pleadings, evidence of a possible implied sublicense. (citations omitted)

Lee did something smart with this dueling pair of cases: he got Facebook (Instagram’s owner) to go on record with its interpretation of its own terms of use.

“While our terms allow us to grant a sub-license, we do not grant one for our embeds API,” a Facebook company spokesperson told Ars in a Thursday email. “Our platform policies require third parties to have the necessary rights from applicable rights holders. This includes ensuring they have a license to share this content, if a license is required by law.”

In plain English, before you embed someone’s Instagram post on your website, you may need to ask the poster for a separate license to the images in the post. If you don’t, you could be subject to a copyright lawsuit.

This statement, I think it is fair to say, comes as a surprise to Mashable, to Judge Wood, and to all of the Instagram users who embed photos using its API. Major online services offer widely-used embedding APIs, and media outlets make extensive use of them. I would not say that it is universal, but it is certainly a widespread practice for which, it is widely assumed, no further license is needed. If that is not true, it is a very big deal, and a great many Internet users are now suddenly exposed to serious and unexpected copyright liability.

McGucken is not the end of the story. I would have said – and in fact I initially told Lee – that it is possible the court would reach a different conclusion at a later stage of the case, once it had more facts about Instagram’s terms of use. That … no longer seems likely. But it is still quite possible Newsweek could win and be allowed to use the embedded photograph. It raised a fair use defense, and might well prevail on that at a later stage. It might also be able to rely on the server rule.

The server rule, which can be traced to Perfect 10 v. Amazon.com from the Ninth Circuit in 2007, holds that only the person whose server transmits a copy of an image “displays” that image within the meaning of the Copyright Act. In an embedding case like Sinclair or McGucken, that would be Instagram, not Mashable or Newsweek – that is how embedding works. There is no dispute that Instagram is licensed to publicly display copies of these photographs; the photographers agreed as much when they uploaded them. So on the server test, no sublicense is needed; embeds are noninfringing.
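For readers who want to see the mechanics the server rule turns on, here is a minimal sketch in Python. It is only an illustration, not a description of Instagram's actual embed code: the hostnames `news.example` and `photos.example`, the page markup, and the helper names are all hypothetical. The point it shows is that an embedding page contains a pointer to the media, and the reader's browser then fetches the media from whichever host the pointer names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class EmbedFinder(HTMLParser):
    """Collect the URLs a browser would fetch to render embedded media."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # <img> and <iframe> tell the browser to fetch content from `src`.
        if tag in ("img", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def serving_hosts(page_html):
    """Return the hostnames that would transmit the embedded content."""
    finder = EmbedFinder()
    finder.feed(page_html)
    return [urlparse(src).hostname for src in finder.sources]


# Hypothetical article page served by news.example. It contains no image
# bytes, only a pointer to a post hosted on photos.example.
ARTICLE_HTML = """
<article>
  <h1>A lake appears in the desert</h1>
  <iframe src="https://photos.example/p/abc123/embed"></iframe>
</article>
"""

print(serving_hosts(ARTICLE_HTML))
```

In this sketch the only host that would transmit the embedded content is the photo service, which is the factual premise of the server rule. Whether that premise should control the legal meaning of "display" is exactly what the cases disagree about.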

The server test, although widely relied on by Internet users and Internet services, has also been criticized. The third SDNY embedding API case, Goldman v. Breitbart, held that the defendant websites could be liable for Twitter embeds of Goldman’s photograph. In a detailed opinion, the Goldman court considered and rejected the server test. (Side note: There was an important potential factual distinction in Goldman. There, unlike in Sinclair and in McGucken, the photograph had been uploaded to Twitter by unauthorized third parties, who could give no license to Twitter and thus none to the defendants. But this distinction played no part in Goldman’s legal analysis. While these facts could be relevant to the existence of a license, they don’t affect whether the image was displayed or by whom.)

To summarize, there are two possible routes to finding that API embeds of a photographer’s own uploads are allowed: either the service itself displays the image under the server rule, or the embedder displays it but has a valid sublicense. Goldman rejected the server rule, but did not consider the existence of a sublicense. Sinclair did not consider the server rule but held there was a sublicense. McGucken did not consider the server rule – inexplicably, Newsweek did not ask the court to hold that there was no direct infringement under the server rule – and held that there was no sublicense. No court has considered and ruled on both arguments together, despite the fact that they are joined at the hip.

A particularly careful and thorough critique of the server rule is Embedding Content or Interring Copyright: Does the Internet Need the “Server Rule”?, by Jane Ginsburg and Luke Ali Budiardjo. They argue that the server rule misreads the Copyright Act and should, with Goldman, be rejected. They believe, however, that the sky will not fall, because licenses will fill any gaps that should be filled. They note that YouTube’s terms of service, for example, explicitly provide for a license grant from uploaders to YouTube’s users, and they predict that this practice will be common:

Therefore, it seems likely that platforms can (and will) utilize Terms of Service agreements that are sufficiently broad to protect themselves and their users from infringement claims based on user “sharing” of platform content through platform mechanisms.

I would have thought so, too. Hence my surprise at Instagram’s position. There are two possibilities here. One is that Instagram does not explicitly grant a license because it believes the server test is the law. That position has been risky ever since Goldman. The other is that Instagram is willing to expose its users to copyright liability when they use its system as intended. I think it is not unreasonable to describe this, as Ars does, as throwing its users under the bus.

One last twist. In late April, Sinclair filed a motion for reconsideration of the holding that Mashable had a sublicense from Instagram, including some challenges to the court’s interpretation of Instagram’s terms of use. The main brief in support of reconsideration could be clearer, but her reply brief puts the issue squarely: “Nowhere has Mashable put in the record any proof as to how Instagram ‘validly exercised’ its right in granting Mashable a sublicense of Plaintiff’s photo.” There things sat, until on June 2, Sinclair called the court’s attention to the McGucken order, and then today, June 4, called its attention to the Ars story published just hours before. I speculated to Lee that McGucken “is going to blow up the Sinclair case.” I shouldn’t have used the future tense. It already has.

A Few Thoughts on Cisco v. Beccela’s

Rebecca Tushnet blogged a trainwreck of a copyright opinion in Cisco Systems, Inc. v. Beccela’s Etc. from the Northern District of California. The software-licensing caselaw was not good, but this is one of the most confused opinions I’ve seen.

In brief, Cisco sells networking devices through a network of authorized dealers. The defendants allegedly sell Cisco devices outside of these authorized channels. Cisco sued on a variety of theories, including copyright infringement. In response, the defendants claimed they were making legal first sales.

Ninth Circuit caselaw (see Vernor, Psystar, and Christenson) has held that first sale doesn’t apply to software distributed on CD-ROMs or DVDs which are “licensed” rather than “sold,” and uses a messy multi-factor test to determine whether a given shiny plastic disc is licensed or sold. The defendants here argued that the result should be different where the software is “embedded in hardware,” but the court disagreed that this made a difference. “The Ninth Circuit in these cases did not distinguish the first sale doctrine’s application between software and hardware … .” As a result, “[T]he first sale doctrine does not apply to software licensees even when the software is embedded in lawfully purchased hardware … .”

To which I can say only, what does the court think that software IS?

Zoolander: The files are in the computer?!

“Software” could refer to the information in a program – the sequence of bits or characters – or it could refer to a specific physical instantiation of the program – a chip, printout, or other object encoding that information. In copyright terms, the former is a “work”; the latter is a “copy.” Cisco has a copyright in the work, and we can assume that the copyright has never been validly licensed to the defendants. But in first sale, that’s irrelevant. If I’m “the owner of a particular copy … lawfully made,” then I can distribute that copy regardless of whether I have any license from the copyright owner. That’s what first sale is. The reason that Vernor and other cases rejected the application of first sale is that the copy had been licensed rather than sold: that messy multi-factor test tries to figure out what rights the possessor has over a particular shiny plastic disc. For example, does the copyright owner have the right to demand the shiny plastic disc back? If so, then the possessor may not be an “owner” of that “particular copy” and so first sale may not apply.

This reasoning doesn’t on its face distinguish between shiny plastic discs and computer hardware. But that doesn’t mean the two cases are the same. It’s right there in the Beccela’s opinion. In fact, it’s right there in the same sentence where the court announces its conclusion. Cisco’s software isn’t just “embedded in hardware”; it’s “embedded in lawfully purchased hardware,” in the court’s own phrase. That ought to end the case. If the hardware is lawfully purchased (note: “purchased” and not “licensed”), then the defendants are owners of the copies of the software and have full first sale rights. Remember: “copies” are “material objects … in which a work is fixed,” a definition that includes both shiny plastic discs and dense arrays of transistors.

The court here honestly seems to believe that software can somehow be “embedded” in hardware without the hardware being a copy of the software, as though a file were in the computer but not of it. But there is no such thing. That is what it means to store digital information in a thing: the physical structure of the thing becomes an encoding of the information. Or, in copyright terms, a copy is a physical thing “from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.” That’s how software works.

To be fair, I don’t think that courts in previous first-sale and software-licensing cases have been terribly careful about the work/copy distinction or about what software is. The opinions cited in Beccela’s are full of sloppy language that seems to invite this result. But that language was unnecessary; you could come out the same way in a DVD software first sale case while being careful about your terminology. Beccela’s takes these unintelligible fictions about how software works and turns them into an actual holding that is essential to the outcome of the case. It is rare to see the confusion at the heart of modern software copyright licensing so plainly stated.