The Laboratorium (3d ser.)

A blog by James Grimmelmann

Be regular and orderly in your life,
so that you may be violent and original in your work.

GenLaw 2024

I’m virtually attending the GenLaw 2024 workshop today, and I will be liveblogging the presentations.

Introduction

A. Feder Cooper and Katherine Lee: Welcome!

The generative AI supply chain includes many stages, actors, and choices. But wherever there are choices, there are research questions: how do ML developers make those choices? And wherever there are choices, there are policy questions: what are the consequences of those choices for law and policy?

GenLaw is not an archival venue, but if you are interested in publishing work in this space, consider the ACM CS&Law conference, happening next in March 2025 in Munich.

Kyle Lo

Kyle Lo, Demystifying Data Curation for Language Models.

I think of data in three stages:

  1. Shopping for data, or acquiring it.
  2. Cooking your data, or transforming it.
  3. Tasting your data, or testing it.

Someone once told me, “Infinite tokens, you could just train on the whole Internet.” Scale is important. What’s the best way to get a lot of data? Our #1 choice is public APIs leading to bulk data. 80 to 100% of the data comes from web scrapers (CommonCrawl, Internet Archive, etc.). These are nonprofits that have been operating long before generative AI was a thing. A small percentage (about 1%) is user-created content like Wikipedia or ArXiv. And about 5% or less is open publishers, like PubMed. Datasets also heavily remix existing datasets.

Nobody crawls the data themselves unless they’re really big and have a lot of good programmers. You can either do deep domain-specific crawls or a broad and wide crawl. A lot of websites require you to follow links and click buttons to get at the content. Coaxing out this content, hidden behind JavaScript, requires a lot of site-specific code. For each website, one has to ask whether going through this is worth the trouble.

It’s also getting harder to crawl. A lot more sites have robots.txt that ask not to be crawled or have terms of service restricting crawling. This makes CommonCrawl’s job harder. Especially if you’re polite, you spend a lot more energy working through a decreasing pile of sources. More data is now available only to those who pay for it. We’re not running out of training data, we’re running out of open training data, which raises serious issues of equitable access.
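[JG: To make the politeness point concrete, here is a minimal sketch of a robots.txt check using Python’s standard library. It’s my illustration, not anything from the talk; the URL and user-agent string are placeholders.]

```python
# Minimal politeness check before crawling: consult robots.txt first.
# Illustrative sketch only; the URL and user-agent are placeholders.
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "research-crawler-example"  # hypothetical crawler name

def allowed_to_fetch(url: str) -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    print(allowed_to_fetch("https://example.com/some/page"))
```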

Moving on to transformation, the first step is to filter out low-quality pages (e.g., site navigation or r/microwavegang). You typically need to filter out sensitive data like passwords, NSFW content, and duplicates.
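[JG: A toy sketch of this kind of cleaning pass, mine rather than Kyle’s: exact deduplication by hashing normalized text, plus a crude redaction of email-like strings. Real pipelines use near-duplicate detection and trained PII classifiers.]

```python
# Toy cleaning pass: exact deduplication plus a crude email redaction.
# Real pipelines use near-dedup (e.g. MinHash) and trained PII classifiers.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean(docs):
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:          # exact duplicate: skip
            continue
        seen.add(digest)
        yield EMAIL_RE.sub("[EMAIL]", doc)   # redact email-like strings

if __name__ == "__main__":
    sample = ["Contact me at a@b.com", "Contact me at  a@b.com", "Hello world"]
    print(list(clean(sample)))
```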

Next is linearization: remove header text, navigational links on pages, etc., and convert to a stream of tokens. Poor linearization can be irrecoverable. It can break up sentences and render source content incoherent.
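[JG: A minimal linearization sketch of my own, assuming BeautifulSoup is available: strip navigation, header, and script tags, then flatten the rest to text. Getting the separators wrong is exactly how sentences get glued together and become incoherent.]

```python
# Naive HTML linearization: drop boilerplate tags, keep visible text.
# Sketch only; assumes `pip install beautifulsoup4`.
from bs4 import BeautifulSoup

BOILERPLATE_TAGS = ["nav", "header", "footer", "script", "style", "aside"]

def linearize(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(BOILERPLATE_TAGS):
        tag.decompose()                     # remove boilerplate subtrees
    # separator="\n" keeps block boundaries; dropping it can glue
    # sentences together and render the source incoherent.
    return soup.get_text(separator="\n", strip=True)
```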

There is filtering: cleaning up data. Every data source needs its own pipeline! For example, for code, you might want to include Python but not Fortran. Training on user-uploaded CSVs in a code repository is usually not helpful.

Using small-model classifiers to do filtering has side effects. There are a lot of terms of service out there. If you do deduplication, you may wind up throwing out a lot of terms of service. Removing PII with low-precision classifiers can have legal consequences. Or, sometimes we see data that includes scientific text in English and pornography in Chinese—a poor classifier will misunderstand it.

My last point: people have pushed for a safe harbor for AI research. We need something similar for open-data research. In doing open research, am I taking on too much risk?

Gabriele Mazzini

Gabriele Mazzini, Introduction to the AI Act and Generative AI.

The AI Act is the first law of its kind in the world. In the EU, the Commission proposes legislation and also implements it. The draft is sent to the Council, which represents the governments of member states, and to the Parliament, which is directly elected. The Council and Parliament have to agree to enact legislation. Implementation is carried out via member states. The Commission can provide some executive action and some guidance.

The AI Act required some complex choices: it should be horizontal, applying to all of AI, rather than being sector-specific. But different fields do have different legal regimes (e.g. financial regulation).

The most important concept in the AI Act is its risk-based approach. The greater the risk, the stricter the rules—but there is no regulation of AI as such. It focuses on use cases, with stricter rules for riskier uses.

  • From the EU’s point of view, a few uses, such as social scoring, pose unacceptable risk and are prohibited.
  • The high-risk category covers about 90% of the rules in the AI Act. This includes AI systems that are safety components of physical products (e.g. robotics). It also includes some specifically listed uses, such as recruitment in employment. These AI systems are subject to compliance with specific requirements ex ante.
  • The transparency risk category requires disclosures (e.g. that you are interacting with an AI chatbot and not a human). This is where generative AI mostly comes in: you should know when content was created by AI.
  • Everything else is minimal or no risk and is not regulated.

Most generative AI systems are in the transparency category (e.g. disclosure of training data). But some systems, e.g. those trained over a certain compute threshold, are subject to stricter rules.
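[JG: For a sense of scale, the threshold the AI Act uses to presume systemic risk is 10^25 floating-point operations of training compute. A back-of-the-envelope check, with assumed model and dataset sizes and the standard 6 × parameters × tokens approximation:]

```python
# Back-of-the-envelope: does a hypothetical training run cross 1e25 FLOPs?
# Uses the common approximation: training FLOPs ≈ 6 * params * tokens.
THRESHOLD_FLOPS = 1e25          # AI Act presumption-of-systemic-risk threshold

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Hypothetical run: a 70B-parameter model trained on 15T tokens.
flops = training_flops(70e9, 15e12)   # ≈ 6.3e24, just under the threshold
print(f"{flops:.2e} FLOPs -> exceeds threshold: {flops > THRESHOLD_FLOPS}")
```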

Martin Senftleben

Martin Senftleben, Copyright and GenAI Development – Regulatory Approaches and Challenges in the EU and Beyond

AI forces us to confront the dethroning of the human author. Copyright has long been based on the unique creativity of human authors, but now generative AI generates outputs that appear as though they were human-created.

In copyright, we give one person a monopoly right to decide what can be done with a work, but that makes follow-on innovation difficult. That was difficult enough in the past, when the follow-on innovation came from other authors (parody, pastiche, etc.). Here, the follow-on innovation comes from the machine. This makes copyright policy complex right now: it’s an attempt to reconcile fair remuneration for human authors with a successful AI sector.

The copyright answer would be licensing—on the input side, pay for each and every piece of data that goes into the data set, and on the output side, pay for outputs. If you do this, you get problems for the AI sector. You get very limited access to data, with a few large players paying for data from publishers, but others getting nothing. This produces bias in the sense that it only reflects mainstream inputs (English, but not Dutch and Slovak).

If you try to favor a vibrant AI sector, you don’t require licensing for training and you make all the outputs legal (e.g. fair use). This increases access and you have less bias on the output, but you have no remuneration for authors.

From a legal-comparative perspective, it’s fascinating to see how different legislators approach these questions. Japan and Southeast Asian countries have tried to support AI developers, e.g. broad text and data mining (TDM) exemptions as applied to AI training. In the U.S., the discussion is about fair use and there are about 25 lawsuits. Fair use opens up the copyright system immediately because users can push back.

In the E.U., forget about fair use. We have the 2019 Directive on Copyright in the Digital Single Market, which was written without generative AI in mind. The focus was on scientific TDM. That exception doesn’t cover commercial or even non-profit activity, only scientific research. A research organization can work with a private partner. There is also a broader TDM exemption that enables TDM unless the copyright owner has opted out using “machine-readable means” (e.g. in robots.txt).

The AI Act makes things more complex; it has copyright-related components. It confirms that reproductions for TDM are still within the scope of copyright and require an exemption. It confirms that opt-outs must be observed. What about training in other countries? If you later want to offer your trained models in the EU, you must have evidence that you trained in accordance with EU policy. This is an intended Brussels effect.

The AI Act also has transparency obligations: specifically a “sufficiently detailed summary of the content used for training.” Good luck with that one! Even knowing what’s in the datasets you’re using is a challenge. There will be an AI Office, which will set up a template. Also, is there a risk that AI trained in the EU will simply be less clever than AI trained elsewhere? That it will marginalize the EU cultural heritage?

That’s where we stand in the E.U. Codes of practice will start in May 2025 and become enforceable against AI providers in August 2025. If you seek licenses now, make sure they cover the training you have done in the past.

Panel: Data Curation and IP

Panelists: Julia Powles, Kyle Lo, Martin Senftleben, A. Feder Cooper (moderator)

Cooper: Julia, tell us about the view from Australia.

Julia: Outside the U.S., copyright law also includes moral rights, especially attribution and integrity. Three things: (1) Artists are feeling disempowered. (2) Lawyers have gotten preoccupied with where (geographically) acts are taking place. (3) Governments are in a giant game of chicken over who will insist that AI providers comply. Everyone is waiting for artists to mount challenges that they don’t have the resources to mount. Most people who are savvy about IP hate copyright. We don’t show students, or others who are impacted by copyright, the concern that we show for the AI industry. Australia is being very timid, as are most countries.

Cooper: Martin, can you fill us in on moral rights?

Martin: Copyright is not just about the money. It’s about the personal touch of what we create as human beings. Moral rights:

  • To decide whether a work will be made available to the public at all.
  • Attribution, to have your name associated with the work.
  • Integrity, to decide on modifications to the work.
  • Integrity, to object to the use of the work in unwanted contexts (such as pornography).

The impact on AI training is very unclear. It’s not clear what will happen in the courts. Perhaps moral rights will let authors avoid machine training entirely. Or perhaps they will apply at the output level. It’s also not clear whether these rights will fly, given the idea/expression dichotomy.

Cooper: Kyle, can you talk about copyright considerations in data curation?

Kyle: Two things I’m worried about: (1) it’s important to develop techniques like fine-tuning and unlearning, but (2) will my company let me work on projects where we hand off control to others? Without some sort of protection for developing unlearning, we won’t have research on these techniques.

Cooper: Follow-up: you went right to memorization. Are we caring too much about memorization?

Kyle: There’s a simplistic view that I want to get away from: that it’s only regurgitation that matters. There are other harmful behaviors, such as a perfect style imitator for an author. It’s hard to form an opinion about good legislation without knowledge of what the state of the technology is, and what’s possible or not.

Julia: It feels like the wave of large models we’ve had in the last few years has really consumed our thinking about the future of AI. Especially the idea that we “need” scale and access to all copyrighted works. Before ChatGPT, the idea was that these models were too legally dangerous to release. We have impeded the release of bioscience because we have gone through the work of deciding what we want to allow. In many cases, having the large general model is not the best solution to a problem. In many cases, the promise remains unrealized.

Martin: Memorization and learning of concepts is one of the most fascinating and difficult problems. From a copyright perspective, getting knowledge about the black box is interesting and important. Cf. Matthew Sag’s “Snoopy problem.” CC licenses often come with a share-alike restriction. If it can be demonstrated that there are traces of this material in fully-trained models, those models would need to be shared under those terms.

Kyle: Do we need scale? I go back and forth on this all the time. On the one hand, I detest the idea of a general-purpose model. It’s all domain effects. That’s ML 101. On the other hand, these models are really impressive. The science-specific models are worse than GPT-4 for their use case. I don’t know why these giant proprietary models are so good. The more I deviate my methods from common practice, the less applicable my findings are. We have to hyperscale to be relevant, but I also hate it.

Cooper: How should we evaluate models?

Kyle: When I work on general-purpose models, I try to reproduce what closed models are doing. I set up evaluations to try to replicate how they think. But I haven’t even reached the point of being able to reproduce their results. Everyone’s hardware is different and training runs can go wrong in lots of ways.

When I work on smaller and more specific models, not very much has changed. The story has been to focus on the target domain, and that’s still the case. It’s careful scientific work. Maybe the only wrench is that general-purpose models can be prompted for outputs that are different than the ones they were created to focus on.

Cooper: Let’s talk about guardrails.

Martin: Right now, the copyright discussion focuses on the AI training stage. In terms of costs, this means that AI training is burdened with copyright issues, which makes training more expensive. Perhaps we should diversify legal tools by moving from input to output. Let the trainers do what they want, and we’ll put requirements on outputs and require them to create appropriate filters.

Julia: I find the argument that it’ll be too costly to respect copyright to be bunk. There are 100 countries that have to negotiate with major publishers for access to copyrighted works. There are lots of humans that we don’t make these arguments for. We should give these permissions to humans before machines. It seems obvious that we’d have impressive results at hyperscale. For 25 years, IP has debated traditional cultural knowledge. There, we have belatedly recognized the origin of this knowledge. The same goes for AI: it’s about acknowledging the source of the knowledge they are trained on.

Turning to supply chains, in addition to the reproduction right, there are rights of authorization, importation, and communication to the public, plus moral rights. An interesting avenue for regulation is to ask where the sweatshops of people doing content moderation and data labeling are located.

Cooper: Training is resource-intensive, but so is inference.

Question: Why are we treating AI differently than biotechnology?

Julia: We have a strong physical bias. Dolly the sheep had an impact that 3D avatars didn’t. Also, it’s different power players.

Martin: Pam Samuelson has a good paper on historical antecedents for new copying technologies. Although I think that generative AI dethrones human authors and that is something new.

Kyle: AI is a proxy for other things; it doesn’t feel genuine until it’s applied.

Question: There have been a lot of talks about the power of training on synthetic data. Is copyright the right mechanism for training on synthetic data?

Kyle: It is hard to govern these approaches on the output side; you would really have to deal with it on the input side.

Martin: I hate to say this as a lawyer, but … it depends.

Question: We live in a fragmented import/export market. (E.g., the data security executive order.)

Martin: There have been predictions that territoriality will die, but so far it has persisted.

Connor Dunlop

Connor Dunlop, GPAI Governance and Oversight in the EU – And How You Might be Able to Contribute

Three topics:

  1. Role of civil society
  2. My work and how we fit in
  3. How you can contribute

AI operates within a complex system of social and economic structures. The ecosystem includes industry and more. It also includes government actors, and NGOs exist to support those actors. There are many types of expertise involved here. The Ada Lovelace Institute is an organization that thinks about how AI and data impact people in society. We aim to provide research expertise, promote AI literacy, and build technical tools like audits and evaluations. A possible gap in the ecosystem is strategic litigation expertise.

At Ada Lovelace, we try to identify key topics early on and ground them in research. We do a lot of polling and engagement on public perspectives. And we recognize nuance and try to make sure that people know what the known unknowns are and where people disagree.

On AI governance, we have been asking about different accountability mechanisms. What mechanisms are available, how are they employed in the real world, do they work, and can they be reflected in standards, law, or policy?

Sabrina Küspert

Sabrina Küspert, Implementing the AI Act

The AI Act follows a risk-based approach. (Review of the risk-based-approach pyramid.) It adopts harmonized rules across all 27 member states. The idea is that if you create trust, you also create excellence. If a provider complies, it gets access to the entire EU market.

For general-purpose models, the rules are transparency obligations. Anyone who wants to build on a general-purpose model should be able to understand its capabilities and what it is based on. Providers must mitigate systemic risks through evaluations, mitigation measures, cybersecurity, incident reporting, and corrective measures.

The EU AI Office is part of the Commission and is the center of AI expertise for the EU. It will facilitate a process to detail the rules around transparency, copyright, risk assessment, and risk mitigation via codes of practice. It is also building enforcement structures, and it will have technical capacity and regulatory powers (e.g. to compel assessments).

Finally, we’re facilitating international cooperation on AI. We’re working with the U.S. AI Safety Institute, building an international network among key partners, and engaging in bilateral and multilateral activities.

Spotlight Poster Presentations

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI (Robert Hönig, Javier Rando, Nicholas Carlini, Florian Tramer): We investigated methods that artists can use to prevent AI training on their work, and found that these protections can often be disabled. These tools (e.g. Glaze) work by adding adversarial perturbations to an artist’s images in ways that are unnoticeable to humans but degrade models trained on them. You can use an off-the-shelf HuggingFace model to remove the perturbations and recover the original images. In some cases, adding Gaussian noise or using a different fine-tuning tool also suffices to disable the protections.
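[JG: The simplest of these countermeasures is easy to picture. A toy sketch, not the authors’ code, that adds Gaussian pixel noise to an image; the filenames are hypothetical.]

```python
# Toy illustration: add Gaussian pixel noise to an image.
# Not the paper's method; just the simplest of the countermeasures mentioned.
import numpy as np
from PIL import Image

def add_gaussian_noise(path_in: str, path_out: str, sigma: float = 10.0) -> None:
    img = np.asarray(Image.open(path_in).convert("RGB"), dtype=np.float32)
    noisy = img + np.random.normal(0.0, sigma, img.shape)
    Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8)).save(path_out)

# add_gaussian_noise("glazed_artwork.png", "noised.png")  # hypothetical filenames
```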

Training Foundation Models as Data Compression: On Information, Model Weights and Copyright Law (Giorgio Franceschelli, Claudia Cevenini, Mirco Musolesi): Our motivation is the knowledge that models tend to memorize and regurgitate. We observe that model weights are smaller than the training data, so there is an analogy that training is compression. Given this, is a model a copy or derivative work of training data?
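[JG: A rough size comparison, with numbers I made up for illustration, shows why the compression framing is tempting: the weights are orders of magnitude smaller than the text they were trained on.]

```python
# Illustrative arithmetic: model weights vs. training text, in bytes.
# Assumed numbers, not drawn from the paper.
params = 7e9                      # a 7B-parameter model
bytes_per_param = 2               # fp16 weights
weight_bytes = params * bytes_per_param          # ~14 GB

tokens = 2e12                     # 2T training tokens
bytes_per_token = 4               # ~4 bytes of text per token (rough)
data_bytes = tokens * bytes_per_token            # ~8 TB

print(f"weights: {weight_bytes/1e9:.0f} GB, data: {data_bytes/1e12:.0f} TB, "
      f"ratio ~{data_bytes/weight_bytes:.0f}x")
```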

Machine Unlearning Fails to Remove Data Poisoning Attacks (Martin Pawelczyk, Ayush Sekhari, Jimmy Z Di, Yiwei Lu, Gautam Kamath, Seth Neel): Real-world motivations for unlearning are to remove data due to revoked consent or to unlearn bad/adversarial data that impact performance. Typical implementations use likelihood ratio tests (LRTs) that involve hundreds of shadow models. We put poisons in part of the training data; then we apply an unlearning algorithm to our poisoned model and then ask whether the algorithm removed the effects of the poison. We add Gaussian poisoning to existing indiscriminate and targeted poisoning methods. Unlearning can be evaluated by measuring correlation between our Gaussians and the output model. We observe that the state-of-the-art methods we tried weren’t really successful at removing Gaussian poison and no method performs well across both vision and language tasks.
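[JG: A heavily simplified toy of the evaluation idea as I understood it, not the authors’ actual metric: check whether the poison’s fingerprint survives unlearning by correlating the injected Gaussian direction with the parameter shift that remains.]

```python
# Toy check, NOT the paper's metric: does the unlearned model still carry
# the poison's fingerprint? Correlate the injected Gaussian direction with
# the parameter difference that remains after unlearning.
import numpy as np

def residual_correlation(theta_clean, theta_unlearned, poison_direction):
    """Pearson correlation between the poison direction and the leftover
    parameter shift; a value near zero suggests the poison's influence is gone."""
    residual = theta_unlearned - theta_clean
    return np.corrcoef(residual.ravel(), poison_direction.ravel())[0, 1]

# Hypothetical usage with flattened parameter vectors:
# corr = residual_correlation(theta_clean, theta_unlearned, poison)
```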

Ordering Model Deletion (Daniel Wilf-Townsend): Model deletion (a.k.a. model destruction or algorithmic disgorgement) is a remedial tool that courts and agencies can use that requires discontinuing use of a model trained on unlawfully used data. Why do it? First, in a privacy context, the inferences are what you care about, so just deleting the underlying data isn’t sufficient to prevent the harm. Second, it provides increased deterrence. But there are problems, including proportionality. Think of OpenAI vs. a blog post: if GPT-4 trains on a single blog post of mine, then I could force deletion, which is massively disproportionate to the harm. It could be unfair, or create massive chilling effects. Model deletion is an equitable remedy, and equitable doctrines should be used to enforce proportionality and tie the remedy to culpability.

Ignore Safety Directions. Violate the CFAA? (Ram Shankar Siva Kumar, Kendra Albert, Jonathon Penney): We explore the legal aspects of prompt injection attacks. We define prompt injection as inputting data into an LLM that causes it to behave in ways contrary to the model provider’s intentions. There are legal and cybersecurity risks, including under the CFAA, and a history of government and companies targeting researchers and white-hat hackers. Our paper attempts to show the complexity of applying the CFAA to generative-AI systems. One takeaway: whether prompt injection violates the CFAA depends on many factors. Sometimes it does, but there are uncertainties. Another takeaway: we need more clarity from courts and from scholars and researchers. Thus, we need a safe harbor for security researchers.

Fantastic Copyrighted Beasts and How (Not) to Generate Them (Luxi He, Yangsibo Huang, Weijia Shi, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson): We have all likely seen models that generate copyrighted characters—and models that refuse to generate them. It turns out that using generic keywords like “Italian plumber” suffices. There was a recent Chinese case holding a service provider liable for generations of Ultraman. Our work introduces a copyrighted-characters reproduction benchmark. We also develop an evaluation suite that measures consistency with user intent while avoiding copyrighted characters. We applied this suite to various models, and we propose methods to avoid copyrighted characters. We find that prompt rewriting is not fully effective on its own. But we find that using copyrighted character names as negative prompts increases effectiveness from about 50% to about 85%.
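[JG: The negative-prompt trick can be sketched with the diffusers library. This is a generic illustration, not the paper’s evaluation suite; the model ID and character names are placeholder choices.]

```python
# Sketch: use a copyrighted character's name as a *negative* prompt so the
# model steers away from it even when the user prompt hints at the character.
# Assumes `pip install diffusers torch`; the model ID is a placeholder choice.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="an Italian plumber in a red cap jumping over a pipe",
    negative_prompt="Mario, Super Mario, Nintendo",   # steer away from the character
).images[0]
image.save("plumber.png")
```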

Matthew Jagielski and Katja Filippova

Matthew Jagielski and Katja Filippova, Machine Unlearning: [JG: I missed this due to a livestream hiccup, but will go back and fill it in.]

Kimberly Mai

Kimberly Mai, Data Protection in the Era of Generative AI

Under the GDPR, personal data is “any information relating to an identified or identifiable” person. That includes hashed identifiers of participants in an experimental study, or license plate numbers. It depends on how easy it is to identify someone. The UK AI framework has principles that already map to data protection law.
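[JG: A quick illustration of why a hashed identifier can still be personal data: anyone who holds the original identifiers can recompute the hashes and re-identify people. My example, with made-up license plates.]

```python
# Why a hashed identifier can still be "personal data": anyone who holds the
# original identifiers can recompute the hashes and re-identify people.
import hashlib

def pseudonymize(plate: str) -> str:
    return hashlib.sha256(plate.encode("utf-8")).hexdigest()[:12]

stored = pseudonymize("AB12 CDE")          # what the study stores
candidates = ["AB12 CDE", "XY99 ZZZ"]      # plates an adversary already knows
matches = [p for p in candidates if pseudonymize(p) == stored]
print(matches)                              # -> ['AB12 CDE']: re-identified
```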

Our view is that data protection law applies at every stage of the AI lifecycle. This makes the UK ICO a key regulator in the AI space. AI is a key area of focus for us. Generative AI raises some significant issues, and the ICO has launched a consultation.

What does “accuracy” mean in a generative-AI context? This isn’t a statistical notion; instead, data must be correct, not misleading, and where necessary up-to-date. In a creative context, that might not require factual accuracy. At the output level, a hallucinating model that produces incorrect outputs about a person might be inaccurate. We think this might require labeling, attribution, etc., but I am eager to hear your thoughts.

Now, for individual rights. We believe that rights to be informed and to access are crucial here. On the remaining four, it’s a more difficult picture. It’s very hard to unlearn, which makes the right to erasure quite difficult to apply. We want to hear from you how machine learning applies to data protection concepts. We will be releasing something on controllership shortly, and please share your thoughts with us. We can also provide advice on deploying systems. (We also welcome non-U.K. input.)

Herbie Bradley

Herbie Bradley, Technical AI Governance

Technical AI governance is technical analysis and tools for supporting effective AI governance. There are problems around data, compute, models, and user interaction. For example, is hardware-enabled compute governance feasible? Or, how should we think about how often to evaluate fine-tuned models for safety? What are best practices for language-model benchmarking? And, looking to the future, how likely is it that certain research directions will pan out? (Examples include unlearning, watermarking, differential privacy, etc.)

Here is another example: risk thresholds. Can we translate benchmark results into assessments that are useful to policymakers? The problems: the assessment depends on the benchmark, it has to have a qualitative element, and knowledge and best practices shift rapidly. Any implementation will likely be iterative and involve conversations between policy experts and technical researchers.
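[JG: A toy of what such a translation might look like, with thresholds I invented; real tiers would be set iteratively with policy experts and revised as benchmarks and best practices shift.]

```python
# Toy mapping from a benchmark score to a policy-facing risk tier.
# All thresholds are made up; real thresholds would be set iteratively
# with policy experts and revised as benchmarks and best practices shift.
def risk_tier(benchmark_score: float) -> str:
    if benchmark_score >= 0.9:
        return "high: trigger in-depth evaluation"
    if benchmark_score >= 0.7:
        return "moderate: require additional reporting"
    return "low: routine monitoring"

print(risk_tier(0.82))  # -> "moderate: require additional reporting"
```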

It is useful to have technical capacity within governments. First, to carry out the actual technical work of implementing a policy or conducting safety testing. Second, to provide advisory capacity, which is often even more useful.

Takeaways. First, if you’re a researcher, consider joining government or a think tank that supports government. Second, if you’re a policy maker, consider uncertainties that could be answered by technical capacity.

Panel: Privacy and Data Policy

Sabrina Ross, Herbie Bradley, Niloofar Mireshgallah, Matthew Jagielski, Paul Ohm (moderator), Katherine Lee (moderator)

Paul: We have this struggle in policy to come up with rules and standards that can be measured. What do we think about Herbie’s call for metrics?

Sabrina: We are at the beginning; the conversation is being led by discussions around safety. How do you measure data minimization, for example: comparing utility loss to data reduction. I’m excited by the trend.

Niloofar: There are multiple ways. Differential privacy (DP) was a theory concept, used for the census, and now is treated as a good tool. But with LLMs, it becomes ambiguous again. Tools can work in one place but not in another. Events like this help technical people understand what’s missing. I learned that most NLP people think of copyright as verbatim copying, but that’s not the only form of copying.

Paul: I worry that if we lean too hard into evaluation, we’ll lose values. What are we missing here?

Matthew: In the DP community, we have our clear epsilon values, and then we have our vibes, which aren’t measured but are built into the algorithm. The data minimization paper has a lot of intuitive value.

Herbie: Industry, academia, and government have different incentives and needs. Academia may like evaluations that are easily measurable and cheap. Industry may like it for marketing, or reducing liability risk. Government may want it to be robust or widely used, or relatively cheap.

Niloofar: It depends on what’s considered valuable. It used to be that data quality wasn’t valued. A few years ago, at ICML you’d only see theory papers, now there is more applied work.

Paul: You used this word “publish”: I thought you just uploaded things to ArXiv and moved on.

Katherine: Let’s talk about unlearning. Can we talk about evaluations that might be useful, and how unlearning might fit into content moderation?

Matthew: To evaluate unlearning, you need to say something about a counterfactual world. State-of-the-art techniques include things like “train your model a thousand times,” which is impractical for big models. There are also provable techniques; evaluation there looks much different. For content moderation, it’s unclear that this is an intervention on data and not alignment. If you have a specific goal, you can measure that directly.

Herbie: With these techniques, it’s very easy to target adjacent knowledge, which isn’t relevant and isn’t what you want to target. Often, various pieces of PII are available on the Internet, and the system could locate them even if information on them has been removed from the model itself.

Paul: Could we map the right to be forgotten onto unlearning?

Sabrina: There are lots of considerations here (e.g. public figures versus private ones), so I don’t see a universal application.

Paul: Maybe what we want is a good output filter.

Niloofar: Even if you’re able to verify deletion, you may still be leaking information. There are difficult questions about prospective vs. retrospective activity. It’s a hot potato situation: people put out papers then other people show they don’t work. We could use more systematic frameworks.

Sabrina: I prefer to connect the available techniques to the goals we’re trying to achieve.

Katherine: This is a fun time to bring up the copyright/privacy parallel. People talk about the DMCA takedown process, which isn’t quite applicable to generative AI but people do sometimes wonder about it.

Niloofar: I see that NLP people have a memorization idea, so they write a paper, and they need an application, so they look to privacy or copyright. They appeal to these two and put them together. The underlying phenomenon is the same, but in copyright you can license it. I feel like privacy is more flexible, and you have complex inferences. In copyright, you have idea and expression, and those are treated differently.

Matthew: It’s interesting to see what changes in versions of a model. You are weighing the threat of a passive adversary against one who is really going to try. For computer scientists, this idea of a weak vs. strong adversary is radioactive.

Paul: My Myth of the Superuser paper was about how laws are written to deal with powerful hackers but then used against ordinary users. Licensing is something you can do for copyright risk; in privacy, we talk about consent. Strategically, are they the same?

Sabrina: For a long time, consent was seen as a gold standard. More recently, we’ve started to consider consent fatigue. For some uses it’s helpful, for others it’s not.

Paul: The TDM exception is interesting. The conventional wisdom in privacy was that those dumb American rules were opt-out. In copyright, the tables have turned.

Matthew: Licensing and consent change your distribution. Some people are more likely to opt in or opt out.

Herbie: People don’t have a good sense of how the qualities of licensable data differ from what is available on the Internet.

Niloofar: There is a dataset of people chatting with ChatGPT who affirmatively consented. But people share a lot of their private data through this, and become oblivious to what they have put in the model. You’re often sharing information about other people too. A journalist put their conversation with a private source into the chat!

Paul: Especially for junior grad students, the fact that every jurisdiction is doing this alone might be confusing. Why is that?

Herbie: I.e., why is there no international treaty?

Paul: Or even talk more and harmonize?

Herbie: We do. The Biden executive order influenced the E.U.’s thinking. But a lot of it comes down to cultural values and how different communities think.

Paul: Can you compare the U.K. to the E.U.?

Herbie: We’re watching the AI Act closely. I quite like what we’re doing.

Sabrina: We have to consider the incentives that regulators are balancing. But in some ways, I think there is a ton of similarity. Singapore and the E.U. both have data minimization.

Herbie: There are significant differences between the thinking of different government systems in terms of how up-to-date they are.

Paul: This is where I explain to my horrified friends that the FTC has 45 employees working on this. There is a real resource imbalance.

Matthew: The point about shared values is why junior grad students shouldn’t be disheartened. The data minimization paper pulled out things that can be technicalized.

Niloofar: I can speak from the side of when I was a young grad student. When I came here, I was surprised by copyright. It’s always easier to build on legacy than to create something new.

Paul: None of you signed onto the cynical “It’s all trade war all the way down.” On our side of the pond, one story was that the rise of Mistral changed the politics considerably. If true, Mistral is the best thing ever to happen to Silicon Valley, because it tamps down protectionism. Or maybe this is the American who has no idea what he’s talking about.

Katherine: We’ve talked copyright, privacy, and safety. What else should we think about as we go off into the world?

Sabrina: The problem is the organizing structure of the work to be done. Is fairness a safety problem, a privacy problem, or an inclusion problem? We’ve seen how some conceptions of data protection can impede fairness conversations.

Paul: I am genuinely curious. Are things hardening so much that you’ll find yourself in a group where people say, “We do copyright here; toxicity is down the hall”? (I think this would be bad.)

Herbie: Right now, academics are incentivized to talk about the general interface.

Paul: Has anyone said “antitrust” today? Right now, there is a quiet struggle between the antitrust Lina Khan/Tim Wu camp and all the other information harms. There are some natural monopoly arguments when it comes to large models.

Niloofar: At least on the academic side, people who work in theory do both privacy and fairness. When people who work in NLP started to care more, there started to be more division. So toxicity/ethics people are a little separate. When you say “safety,” it’s mostly about jailbreaking.

Paul: Maybe these are different techniques for different problems? Let me give you a thought about the First Amendment. Justice Kagan gets five justices to agree that social media is core protected speech. Lots of American scholars think this will also apply to large language models. This Supreme Court is putting the First Amendment on the rise.

Matthew: I think alignment is the big technique overlap I’m seeing right now. But when I interact with the privacy community, people who do that are privacy people.

Katherine: That’s partly because those are the tools that we have.

Question: If we had unlearning, would that be okay with GDPR?

Question: If we go forward 2-3 years and there are some problems and clear beliefs about how they should be regulated, then how will this be enforced, and what skills do these people have?

Niloofar: On consent, I don’t know what we do about children.

Paul: In the U.S., we don’t consider children to be people.

Niloofar: I don’t know what this solution would look like.

Kimberly: In the U.K., if you’re over 13 you can consent. GDPR has protections for children. You have to consider risks and harms to children when you are designing under data protection by design.

Herbie: If you have highly adversarial users, unlearning might not be sufficient.

Sabrina: We already have computer scientists working with economists. The more we can bring to bear, the more successful we’ll be.

Paul: I’ve spent my career watching agencies bring in technologists. Some succeed, some fail. Europe has had success with investing a lot. But the state of Oregon will hire half a technologist and pay them 30% of what they would make. Europe understands that you have to write a big check, create a team, and plan for managing them.

Matthew: As an Oregonian, I’m glad Oregon was mentioned. I wanted to mention that people want unlearning to do some things it is suited for, and there are other goals that really are about data management. (Unless we start calling unlearning techniques “alignment.”)


And that’s it!

law  ✳ conferences