I’m at the GenLaw workshop on Evaluating Generative AI Systems: the Good, the Bad, and the Hype today, and I will be liveblogging the presentations.
Hoda Heidari: Welcome! Today’s event is sponsored by the K&L Gates Endowment at CMU, and presented by a team from GenLaw, CDT, and the Georgetown ITLP.
Katherine Lee: It feels like we’re at an inflection point. There are lots of models, and they’re being evaluated against each other. There’s also a major policy push. There’s the Biden executive order, privacy legislation, the generative-AI disclosure bill, etc.
All of these require the ability to balance capabilities and risks. The buzzword today is evaluations. Today’s event is about what evaluations are: ways of measuring generative-AI systems. Evaluations are proxies for things we care about, like bias and fairness. These proxies are limited, and we need many of them. And there are things like justice that we can’t even hope to measure. Today’s four specific topics will explore the tools we have and their limits.
A. Feder Cooper: Here is a concrete example of the challenges. One popular benchmark is MMLU. It’s advertised as testing whether models “possess extensive world knowledge and problem solving ability.” It includes multiple-choice questions from tests found online, standardized tests of mathematics, history, computer science, and more.
But evaluations are surprisingly brittle, and standardized tests are imperfect proxies even for humans: CS programs don’t always rely on the GRE. In addition, it’s not clear what the benchmark measures. In the past week, MMLU has come under scrutiny. It turns out that if you reorder the questions as you give them to a language model, you get wide variations in overall scores.
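To make the order-sensitivity point concrete, here is a minimal sketch (not the actual MMLU harness; `model_answer` is a hypothetical stand-in for however you query a model) of how you might measure it: score the same multiple-choice questions under several random orderings and compare the spread.

```python
# Minimal sketch: probe a benchmark's order sensitivity by scoring the same
# multiple-choice questions under several random orderings.
import random

def run_benchmark(model_answer, questions):
    """Score (prompt, choices, correct_index) items in the given order."""
    history, correct = [], 0
    for prompt, choices, answer_idx in questions:
        # `model_answer` is a hypothetical hook into whatever model API you use;
        # it sees the questions asked so far, so ordering can matter.
        predicted = model_answer(prompt, choices, history)
        history.append(prompt)
        correct += int(predicted == answer_idx)
    return correct / len(questions)

def order_sensitivity(model_answer, questions, trials=10, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = list(questions)
        rng.shuffle(shuffled)
        scores.append(run_benchmark(model_answer, shuffled))
    return min(scores), max(scores)  # a wide gap signals a brittle evaluation
```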
This gets at another set of questions about generative-AI systems. MMLU benchmarks a model, but systems are much more than just models. Most people interact with a deployed system that wraps the model in a program with a user interface, filters, etc. There are numerous levels of indirection between the user’s engagement and the model itself.
And even the system itself is embedded in a much larger supply chain, from data through training to alignment to generation. There are numerous stages, each of which may be carried out by different actors doing different portions. We have just started to reason about all of these different actors and how they interact with each other. These policy questions are not just about technology, they’re about the actors and interactions.
Alexandra Givens: CDT’s work involves making sure that policy interventions are grounded in a solid understanding of how technologies work. Today’s focus is on how we can evaluate systems, in four concrete areas:
We also have a number of representatives from government giving presentations on their ongoing work.
Our goals for today are to provide insights through cross-disciplinary, cross-community engagement. In addition, we want to pose concrete questions for research and policy, and help people find future collaborators.
Paul Ohm: Some of you may remember the first GenLaw workshop; we want to bring that same energy today. Here at Georgetown Law, we take seriously the idea that we’re down the street from the Capitol and want to be engaged. We have a motto, “Justice is the end, law is but the means.” I encourage you to bring that spirit to today’s workshop. This is about evaluations in service of justice and the other values we care about.
Zack Lipton: “Why Evaluations are Hard”
One goal of evaluations is simply quality: is this system fit for purpose? One question we can ask is, what is different about evaluations in the generative-AI era? And an important distinction is whether a system does everything for everyone or it has a more pinned-down use case with a more laser-targeted notion of quality.
Classic discriminative learning involves a prediction or recognition problem (or a problem that you can twist into one). For example, I want to give doctors guidance on whether to discharge a patient, so I predict mortality.
Generically, I have some input and I want to classify it. I collect a large dataset of input-output pairs and learn a model: the learned pattern. Then I can test how well the model works on some data we didn’t train on. The paradigm of machine learning that came to dominate is that I hold out a test set and measure how well the model performs on it.
So when we evaluate a discriminative model, there are only a few kinds of errors. For a yes-no classifier, those are false positives and false negatives. For a regression problem, they are over- and under-estimates. We might look at how well the model performs on different strata, either to explore how it works or to check for disparities across salient demographic groups in the population. And then we are concerned with whether the model is valid at all outside the distribution it was trained on, e.g., at a different hospital, or in the wild.
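As a concrete picture of this kind of stratified evaluation, here is a minimal sketch (the field names are illustrative, not from any particular dataset) that computes false-positive and false-negative rates overall and within each demographic group.

```python
# Minimal sketch of stratified evaluation for a yes/no classifier: error rates
# overall and per demographic group.
from collections import defaultdict

def error_rates(examples):
    """examples: iterable of dicts with keys 'label', 'prediction', and 'group'."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "pos": 0, "neg": 0})
    for ex in examples:
        for stratum in ("overall", ex["group"]):
            c = counts[stratum]
            if ex["label"] == 1:
                c["pos"] += 1
                c["fn"] += int(ex["prediction"] == 0)   # missed positive
            else:
                c["neg"] += 1
                c["fp"] += int(ex["prediction"] == 1)   # false alarm
    return {
        stratum: {
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for stratum, c in counts.items()
    }
```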
[editor crash, lost some text]
Now we have general-purpose systems like ChatGPT which are provided without a specific task. They’re also being provided to others as a platform for building their own tools. Now language models are not just language models. Their job is not just to accurately predict the next word but to perform other tasks.
We have some ways of assessing quality, but there is no ground truth we can evaluate against. There is no single dataset that represents the domain we care about. Evaluation shifts toward sweeping benchmarks; the tasks we worked on in NLP before now function as a giant battery of tests we can administer to probe the “general” capabilities of ML models. And the discourse shifts toward trying to predict catastrophic outcomes.
At the same time, these generic capabilities provide a toolkit for building stronger domain-specific technologies. Now people in the marketplace are shipping products without any data. There is a floor level of performance they have with no data at all. Generative AI has opened up new domains, but with huge evaluation challenges.
Right now, for example, health-care systems are underwater, and the clerical burden of documenting all of these actions is immense: two hours of form-filling for every one hour of patient care. At Abridge, we’re building a generative-AI system for physicians to document clinical notes. So how do we evaluate it? There’s no gold standard, and we can’t use simple tricks. The problem isn’t red-teaming; it’s about consistently high-quality documentation at a statistical level. The possible errors are completely open-ended, and we don’t have a complete account of our goals.
Finally, evaluation takes place at numerous times. Before deployment, we can look at automated metrics–but at the end of the day, no evaluation will capture everything we care about. A lot of the innovation happens when we have humans in the loop to give feedback on notes. We use human spot checks, we have relevant experts judging notes, across specialties and populations, and also tagging errors with particular categories. We do testing during rollout, using staged releases and A/B tests. There are also dynamic feedback channels from clinician feedback (star ratings, free-form text, edits to notes, etc.). There are lots of new challenges–the domain doesn’t stand still either. And finally, there are regulatory challenges.
Emily Lanza: “Update from the Copyright Office”
The Copyright Office is part of the legislative branch, providing advice to Congress and the courts. It also administers the copyright registration system.
The Copyright Office has been weighing in on the copyrightability of computer-generated works since as far back as 1965. Recently, these issues have become far less theoretical. We have asked applicants to disclaim copyright in more-than-de-minimis AI-generated portions of their works. In August, we published a notice of inquiry and received more than 10,000 comments. And a human has read every single one of those comments.
Three main topics:
First, AI outputs that imitate human artists. These are issues like the Drake/Weeknd deepfake. Copyright law doesn’t cover these, but some state rights do. We have asked whether there should be a federal AI law.
Second, copyrightability of outputs. We have developed standards for examination. The first application involving AI-generated material came five years ago, filed entirely as a test case. We refused registration on the ground that human authorship is required; the D.C. district court agreed, and the case is on appeal. Other cases present less clear-cut facts. Our examiners have issued over 200 registrations with appropriate disclaimers, but we have also refused registration in three high-profile cases.
The central question in these more complex scenarios is when and how a human can exert control over the creativity developed by the AI system. We continue to draw these lines on a case-by-case basis, and at some point the courts will weigh in as well.
Third, the use of human works to train AIs. There are 20 lawsuits in U.S. courts. The fair use analysis is complex, including precedents such as Google Books and Warhol v. Goldsmith. We have asked follow-up questions about consent and compensation. Can it be done through licensing, or through collective licensing, or would a new form of compulsory licensing be desirable? Can copyright owners opt in or out? How would it work?
Relatedly, the study will consider how to allocate liability between developers, operators, and users. Our goal is balance. We want to promote the development of this exciting technology, while continuing to allow human creators to thrive.
We also need to be aware of developments elsewhere. Our study asks whether approaches in any other countries should be adopted or avoided in the United States.
We are not the only ones evaluating this. Congress has been busy, too, holding hearings as recently as last week. The Biden Administration issued an executive order in October. Other agencies are involved, including the FTC (prohibition on impersonation through AI-enabled deepfakes), and FEC (AI in political ads).
We plan to issue a report. The first section will focus on digital replicas and will be published this spring. The second section will be published this summer and will deal with the copyrightability of outputs. Later sections will deal with training and more. We aim to publish all of it by the end of the fiscal year, September 30. We will also revise the Compendium, and there will be a study by economists on copyright and generative AI.
Sejal Amin (CTO at Shutterstock): “Ensuring TRUST; Programs for Royalties in the Age of AI”
Shutterstock was founded in 2003 and has since become an immense marketplace for images, video, music, 3D, design tools, etc. It has been investing in AI capabilities as well. Showing images generated by Shutterstock’s AI tools. Not confined to any genre or style. Shutterstock’s framework is TRUST. I’m going to focus today on the R, Royalties.
Today’s AI economy is not really contributing to the creators who enable it. Unregulated crawling helps a single beneficiary. In 2023, Shutterstock launched a contributor fund that provides ongoing royalties tied to licensing for newly generated assets.
The current model provides an equal share per contributed image that is used in training Shutterstock’s models. There are also alternatives: compensation by similarity, or by popularity. These models have problems. Popularity is not a proxy for quality; it leads to a rich-get-richer phenomenon. And similarity is also flawed without a comprehensive understanding of the world.
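As a rough illustration (not Shutterstock’s actual formula, and the numbers are invented), here is how the choice of allocation rule changes who gets paid from a fixed royalty pool.

```python
# Rough illustration of two allocation rules for a royalty pool: an equal share per
# contributed image, versus weighting by popularity (e.g., downloads), which
# concentrates payouts on already-popular contributors.

def equal_share(pool, images_per_contributor):
    total = sum(images_per_contributor.values())
    return {c: pool * n / total for c, n in images_per_contributor.items()}

def popularity_weighted(pool, downloads_per_contributor):
    total = sum(downloads_per_contributor.values())
    return {c: pool * d / total for c, d in downloads_per_contributor.items()}

print(equal_share(1000.0, {"alice": 100, "bob": 100}))          # {'alice': 500.0, 'bob': 500.0}
print(popularity_weighted(1000.0, {"alice": 900, "bob": 100}))  # {'alice': 900.0, 'bob': 100.0}
```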
For us, quality is a high priority. High-quality content is an essential input into the training process. How could we measure that? Of course, the word quality is nebulous. I’m going to focus on:
A shared success model will need to understand customer demand.
Aesthetic excellence depends on technical proficiency (lighting, color balance) and visual appeal. Shutterstock screens materials for safety both automatically and through human review. We have techniques to prevent generation of unsafe concepts. Diversity is important to all of us. We balance and improve representations of different genders, ethnicities, and orientations. Our fund attempts to support historically excluded creator groups. Our goal is shared success.
David Bau: “Unlearning from Generative AI Models”
Unlearning asks: “Can I make my neural network forget something it learned?”
In training, a dataset with billions of inputs is run through the training process, producing a model that can then generate a potentially infinite range of outputs. The network’s job is to generalize, not memorize. If you prompt Stable Diffusion for “astronaut riding a horse on the moon,” there is no such image in the training set; the model generalizes to create one.
SD is trained on about 100 TB of data, but the SD model is only about 4GB of network weights. We intentionally make these nets too small to memorize everything. That’s why they must generalize.
But still, sometimes a network does memorize. Carlini et al. showed that there is substantial memorization in some LLMs, and the New York Times found out that there is memorization in ChatGPT.
In a search engine, takedowns are easy to implement because you know “where” the information is. In a neural network, however, it’s very hard to localize where the information is.
There are two kinds of things you might want to unlearn: first, verbatim regurgitation; second, unwanted generalized knowledge (an artist’s style, undesired concepts like nudity or hate, or dangerous knowledge like hacking techniques).
Three approaches to unlearning:
Fundamentally, unlearning is tricky and will require combining approaches. The big challenge is how to improve the transparency of a system not directly designed by people.
Alicia Solow-Niederman: “Privacy, Transformed? Lessons from GenAI”
GenAI exposes underlying weak spots. One kind of weak spot is weaknesses in a discipline’s understanding (e.g., U.S. privacy law’s individualistic focus). Another is weaknesses in cross-disciplinary conversations (technologists and lawyers talking about privacy).
Privacy in GenAI: cases where private data flows into the system. This can happen if I prompt a system with my medical data, or if a law-firm associate uses a chatbot to generate a contract containing confidential client data. It can arise indirectly when a non-AI company licenses sensitive data for training. For example, 404 Media reported that Automattic was negotiating to license Tumblr data. Automattic offered an opt-out, a solution that embraces the individual-control model. This is truly a privacy question, not a copyright one. And we can’t think about it without thinking about what privacy should be as a social value.
Privacy out of GenAI: When does private data leak out of a GenAI system? We’ve already talked about memorization followed by a prompt that exposes it. (E.g., the poem poem poem attack.) Another problem is out-of-context disclosures. E.g., ChatGPT 3.5 “leaked a random dude’s photo”–a working theory is that this photo was uploaded in 2016 and ChatGPT created a random URL as part of its response. Policy question: how much can technical intervention mitigate this kind of risk?
Privacy through GenAI: ways in which the use of the technology itself violates privacy. E.g., GenAI tools used to infer personal information: chatbots can discern age and geography from datasets like Reddit. The very use of a GenAI tool might lead to violations of existing protections. The GenAI-for-health-care system discussed earlier is a good example of this kind of setting.
Technical patches risk distracting us from more important policy questions.
Niloofar Mireshghallah: “What is differential privacy? And what is it not?”
A big part of the success of generative AI is the role of training data. Most of the data is web-scraped, but this might not have been intended to be public.
But the privacy issues are not entirely new. The census collects data on name, age, sex, race, etc. This is used for purposes like redistricting. But this data could also be used to make inferences, e.g., where are there mixed-race couples? The obvious approach is to withhold some fields, such as names, but often the data can be reconstructed.
Differential privacy is a way of formalizing the idea that nothing can be learned about a participant in a database: is the database with the participant’s record distinguishable from the database without it? The key concept here is a privacy budget, which quantifies how much privacy can be lost through queries of a (partially obscured) database. Common patterns are still visible, but uncommon patterns are not.
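For reference (an aside, not part of the talk), the standard formalization: a randomized mechanism M is (ε, δ)-differentially private if, for every pair of databases D and D′ differing in a single record and every set of outputs S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```

Here ε is the privacy budget: the smaller it is, the harder the two worlds are to tell apart, and the more noise the mechanism must add.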
But privacy under DP comes at the cost of data utility. The more privacy you want, the more noise you need to add, and hence the less useful the data. And it has a disproportionate impact on the tails of the distribution, e.g., more inaccuracy in the census measurements of the Hispanic population.
Back to GenAI. Suppose I want to release a medical dataset with textual data about patients. Three patients have covid and a cough; one patient has a lumbar puncture. It’s very hard to apply differential privacy to text rather than tabular data: it’s not easy to draw clear boundaries between records in text. There are also ownership issues, e.g., “Bob, did you hear about Alice’s divorce?” applies to both Bob and Alice.
If we try applying DP with each patient’s data as a record, we get a many-to-many version. The three covid patients get converted into similar covid patients; we can still see the covid/cough relationship. But DP does not detect and obfuscate “sensitive” information while keeping “necessary” information intact. We’ll still see “the CT machine at the hospital is broken.” This is repeated, but in context it could be identifying and shouldn’t be revealed. That is, repeated information might be sensitive! The fact that a lumbar puncture requires local anesthesia might appear only once, but it’s still a fact that ought to be learned; it’s not sensitive. DP is not good at capturing these nuances or these needle-in-a-haystack situations. There are similarly messy issues with images. Do we even care about celebrity photos? There are lots of contextual nuances.
[Panel omitted because I’m on it]
Andreas Terzis: “Privacy Review”
Language models learn from their training data a probability distribution over sequences: the probability of the next token given the previous tokens. Can they memorize rare or unique training-data sequences? Yes, yes, yes. So we ask: do actual LLMs memorize their training data?
Approach: use the LLM to generate a lot of data, then predict the membership of each example in the training data. If an example has a high likelihood of being generated, it’s probably memorized; if not, then not. In 2021, they showed that memorization happens in actual models, and since then we’ve seen that scale exacerbates the issue: larger models memorize more.
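A minimal sketch of that generate-then-test recipe, assuming hypothetical `sample_text` and `log_likelihood` hooks into whatever model API you use. The score follows one published heuristic: compare the model’s negative log-likelihood against zlib compressibility, so strings the model finds unusually likely relative to their generic compressibility float to the top.

```python
# Minimal sketch of "generate a lot, then rank by a membership score."
import zlib

def membership_score(text, log_likelihood):
    model_bits = -log_likelihood(text)                         # model's negative log-likelihood
    zlib_bits = 8 * len(zlib.compress(text.encode("utf-8")))   # generic compressibility
    return zlib_bits / max(model_bits, 1e-9)                   # high ratio => candidate memorization

def extraction_candidates(sample_text, log_likelihood, n_samples=10_000, top_k=100):
    samples = {sample_text() for _ in range(n_samples)}        # drop exact repeats
    ranked = sorted(samples, key=lambda t: membership_score(t, log_likelihood), reverse=True)
    return ranked[:top_k]                                      # inspect by hand / search the training set
```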
Alignment seems to hide memorization, but not to prevent it. An aligned model might not return training data, but it can be prompted (e.g., “poem poem poem”) in ways that elicit it. And memorization happens with multimodal models too.
Privacy testing approaches: first, “secret sharer” involves inserting controlled canaries into the training data. This requires access to the model and can also pollute it. “Data extraction” only requires access to the interface but may underestimate the actual amount of memorization.
There are tools to remove what might be sensitive data from training datasets. But they may not find all sensitive data (“andreas at google dot com”), and on the flipside, LLMs might benefit from knowing what sensitive data looks like.
There are also safety-filter tools, which stop LLMs from generating outputs that violate the provider’s policies. This is helpful in preventing verbatim outputs of memorized data, but it can potentially be circumvented.
Differential privacy: use training-time noise to reduce sensitivity to specific rarer examples. This introduces a privacy-utility tradeoff. (And as we saw in the morning, it can be hard to adapt DP to some types of data and models.)
Deduplication can reduce memorization, because the more often an example is trained on, the more likely it is to be memorized. The model itself is also likely to be better (faster to train, with fewer resources spent on memorizing duplicates).
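A minimal sketch of the simplest version of this: exact deduplication by hashing normalized text. Production pipelines also do near-duplicate detection (e.g., over n-grams), which this toy version skips.

```python
# Minimal sketch: exact deduplication of a training corpus by hashing normalized text.
import hashlib

def deduplicate(docs):
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())   # collapse whitespace, ignore case
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```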
Privacy-preserving LLMs train on data intended to be public, and then fine-tune locally on user-contributed data. This and the techniques on the previous slides can be combined to provide a layered defense.
Dave Willner: “How to Build Trust & Safety for and with AI”
Top-down take from a risk management perspective. We are in a world where a closed system is a very different thing to build trust and safety for than an open system, and I will address them both.
Dealing with AI isn’t a new problem. Generative AI is a new way of producing content. But we have 15-20 years of experience in moderating content. There is good reason to think that generative-AI systems will make us better at moderating content; they may be able to substitute for human moderators. And the models offer us new sites of intervention, in the models themselves.
First, do product-specific risk assessment. (Standard T&S approach: actors, behaviors, and content.) Think about genre (text, image, multimodal, etc.). Ask how frequent this kind of content is. And how is this system specifically useful to people who want to generate content you don’t want them to?
Next, take a defense-in-depth approach. You have your central model and a series of layers around it; the only viable approach is to stack as many layers of mitigations as possible.
In addition, invest in learning to use AI to augment all of the things I just talked about. All of these techniques rely on human classification. This is error-prone work that humans are not good at and that takes a big toll on them. We should expect generative-AI systems to play a significant role here; early experiments are promising.
In an open-source world, that removes centralized gatekeepers … which means removing centralized gatekeepers. I do worry we’re facing a tragedy of the commons. Pollution-style externalities from models are a thing to keep in mind, especially with the more severe risks. We are already seeing significant CSAM.
There may not be good solutions here with no downsides. Openness versus safety may involve hard tradeoffs.
Nicholas Carlini: “What watermarking can and can not do”
A watermark is a mark placed on top of a piece of media to identify it as machine generated.
For example: an image with a bunch of text put on top of it, or a disclaimer at the start of a text passage. Yes, we can watermark, but these are not good watermarks; they obscure the content.
Better question: can we usefully watermark? The image has a subtle watermark present in the pixels. And the text model was watermarked, too, based on some of the bigram probabilities.
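As a hedged illustration of one published style of text watermark along these lines (a “green list” scheme, not necessarily the one demoed here): the sampler nudges each token toward a pseudorandom subset of the vocabulary chosen by hashing the previous token, and the detector recomputes those subsets and checks whether “green” tokens are statistically over-represented.

```python
# Sketch of detecting a green-list text watermark keyed on the previous token.
import hashlib
import math

def is_green(prev_token: str, token: str, green_fraction: float = 0.5) -> bool:
    h = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    return (h[0] / 255.0) < green_fraction

def watermark_z_score(tokens, green_fraction: float = 0.5) -> float:
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(is_green(prev, tok, green_fraction) for prev, tok in zip(tokens, tokens[1:]))
    expected = green_fraction * n
    return (hits - expected) / math.sqrt(n * green_fraction * (1 - green_fraction))

# Unwatermarked text scores near 0; text sampled with the matching green-list bias
# produces a large positive z-score.
```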
But even this isn’t enough. The question is: what are your requirements? We want watermarks to be useful for some end task. For example, people want undetectable watermarks. But most undetectable watermarks are easy to remove, e.g., by flipping the image left-to-right or JPEG-compressing it. Other people want unremovable watermarks. By whom? An 8-year-old or a CS professional? Unremovable watermarks are also often detectable. Some people want unforgeable watermarks, so they can verify the authenticity of photos.
Some examples of watermarks designed to be unremovable.
Here’s a watermarked image of a tabby cat. An ML image-recognition model recognizes it as a tabby cat with 88% confidence. Adversarial perturbation can make the image look indistinguishable to us humans, but it is classified as “guacamole” with 99% confidence. Almost all ML classifiers are vulnerable to this. Synthetic fake images can be tweaked to look like real ones with trivial variations, such as texture in the pattern of hair.
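For the curious, a minimal sketch of the simplest published attack of this kind (FGSM), assuming a PyTorch classifier; the perturbation is usually imperceptible to humans but can flip the prediction, which is the same fragility that lets synthetic images be tweaked to evade classifier-based detectors.

```python
# Minimal FGSM sketch: nudge the input image in the direction that increases the loss.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=2 / 255):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)    # how wrong the model currently is
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()   # tiny step that maximally hurts the model
    return adversarial.clamp(0.0, 1.0).detach()
```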
Should we watermark? It comes down to whether we’re watermarking in a setting where we can achieve our goals. What are you using it for? How robustly? Who is the adversary? Is there even an adversary?
Here are five goals of watermarking:
Raquel Vazquez Llorente: “Provenance, Authenticity and Transparency in Synthetic Media”
Talking about indirect disclosure mechanisms, but I consider detection to be a close cousin. We just tested detection tools and broke them all.
Witness helps people use media and tech to protect their rights. We are moving fast to a world where human and AI don’t just coexist but intermingle. Think of inpainting and outpainting. Think of how phones include options to enhance image quality or allow in-camera editing.
It’s hard to address AI content in isolation from this. We’ve also seen that deception is as much about context as it is about content. Watermarks, fingerprints, and metadata all provide important information, but don’t provide the truth of data.
Finally, legal authentication. There is a lot of work in open-source investigations. The justice system plays an essential role in protecting rights and democracy. People in power have dismissed content as “fake” or “manipulated” when they want to avoid its impact.
Three challenges:
Erie Meyer: “Algorithmic Disgorgement in the Age of Generative AI”
The CFPB sues companies to protect consumers from unfair, deceptive, or abusive practices: medical debt, credit reports, repeat-offender firms, etc. We investigate, litigate, and supervise.
Every one of the top-ten commercial banks uses chatbots. CFPB found that people were being harmed by poorly deployed chatbots that sent users into doom loops. You get stuck with a robot that doesn’t make sense.
Existing federal financial laws say that impeding customers from solving problems can be a violation of law, as when the technology fails to recognize that consumers are invoking their federal rights, or fails to protect their private information. Firms also have an obligation to respond to consumer disputes and competently interact with customers. It’s not radical to say that technology should make things better, not worse. CFPB knew it needed to do this report because it publishes its complaints online. Searching those complaints for the word “human” pulled up a huge number of them.
Last year, CFPB put out a statement that “AI is Not an Excuse for Breaking the Law.” Bright-line rules benefit small companies by giving them clear guidance without needing a giant team to study the law. They also make it clear when a company is compliant or not, and make it clear when an investigation is needed.
An example: black-box credit models. Existing credit laws require firms making credit decisions to tell consumers why they made a decision. FCRA has use limitations, accuracy and explainability requirements, and a private right of action. E.g., targeted advertising is not on the list of allowed uses. CFPB has a forthcoming FCRA rulemaking.
When things go wrong, I think about competition, repeat offenders, and relationships. A firm shouldn’t get an edge over its competitors from using ill-gotten data. Repeat offenders are an indication that enforcement hasn’t shifted the firm’s incentives. Relationships: does someone answer the phone, can you get straight answers, do you know that Erica isn’t a human?
The audiences for our work are individuals, corporations, and the industry as a whole. For people: What does a person do when their data is misused? What makes me whole? How do I get my data out? For corporations, some companies violate federal laws repeatedly. And for the industry, what do others in the industry learn from enforcement actions?
Finally, disgorgement: I’ll cite an FTC case against Google. The reason not to let Google simply settle was that even though the “data” was deleted, the data enhancements derived from it were used to target others.
What keeps me up at night is that it’s hard to get great legislation on the books.
Elham Tabassi: “Update from US AI Safety Institute”
NIST is a non-regulatory agency under the Department of Commerce. We cultivate trust in technology and promote innovation. We promote measurement science and technologically valid standards. We work through a multi-stakeholder process. We try to identify which measurement techniques are valuable and effective.
Since 2023, we have:
Nitarshan Rajkumar: “Update from UK AI Safety Institute”
Our focus is to equip governments with an empirical understanding of the safety of frontier AI systems. It’s built as a startup within government, with seed funding of £100 million, and extensive talent, partnerships, and access to models and compute.
The UK government has tried to mobilize international coordination, starting with an AI safety summit at Bletchley Park. We’re doing consensus-building at a scientific level, trying to do for AI safety what the IPCC has done for climate change.
We have four domains of testing work:
We have four approaches to evaluations:
Katherine: What kinds of legislation and rules do we need?
Amba: The lobbying landscape complicates efforts. Industry has pushed for auditing mandates to undercut bright-line rules. E.g., facial-recognition auditing was used to undercut pushes to ban facial-recognition. Maybe we’re not talking enough about incentives.
Raquel: When we’re talking about generative-AI, we’re also talking about the broader information landscape. Content moderation is incredibly thorny. Dave knows so much, but the incentives are so bad. If companies are incentivized by optimizing advertising, data collection, and attention, then content moderation is connected to enforcing a misaligned system. We have a chance to shape these rules right now.
Dave: I think incentive problems affect product design rather than content moderation. The ugly reality of content moderation is that we’re not very good at it. There are huge technique gaps, humans don’t scale.
Katherine: What’s the difference between product design and content moderation?
Dave: ChatGPT is a single-player experience, so some forms of abuse are mechanically impossible. That kind of choice has much more of an impact on abuse as a whole.
Katherine: We’ve talked about standards. What about when standards fail? What are the remedies? Who’s responsible?
Amba: Regulatory proposals and regimes (e.g. DSA) that focus on auditing and evaluation have two weaknesses. First, they’re weakest on consequences: what if harm is discovered? Second, internal auditing is most effective (that’s where the expertise and resources are) but it’s not a substitute for external auditing. (“Companies shouldn’t be grading their own homework.”) Too many companies are on the AI-auditing gravy train, and they haven’t done enough to show that their auditing is at the level of effectiveness it needs to be. Scrutinize the business dynamics.
Nicholas: In computer security, there are two types of audits. Compliance audits check boxes to sell products, and actual audits where someone is telling you what you’re doing wrong. There are two different kinds of companies. I’m worried about the same thing happening here.
Elham: Another exacerbating factor is that we don’t know how to do this well. From our point of view, we’re trying to untangle these two, and come up with objective methods for passing and failing.
Question: Do folks have any reflection on approaches more aligned with transparency?
Nicholas: Happy to talk when I’m not on the panel.
Raquel: A few years ago, I was working on developing an authentication product. We got a lot of backlash from the human-rights community. We hired different sets of penetration testers to audit the technology, and then we’d spend resources on patching. We equate open source with security, but despite the number of times we offered people the code, there’s not a huge amount of technical expertise available to review it.
Hoda: Right now, we don’t even have the right incentives to create standards except for companies’ bottom line. How do your agencies try to balance industry expertise with impacted communities?
Elham: Technologies change fast, so expertise is very important. We don’t know enough; the operative word is “we,” and collaboration is important.
Nitarshan: Key word is “iterative.” Do the work, make mistakes, learn from them, improve software, platform, and tooling.
Elham: We talk about policies we can put in place afterwards to check for safety and security. But these should also be part of the discussion of design. We want technologies that make it easy to do the right thing, hard to do the wrong thing, and easy to recover. Think of three-prong power outlets. We are not a standards-development organization; industry can lead standards development. The government’s job is to support these efforts as a neutral, objective third party.
Question: What are the differences in how various institutions understand AI safety? E.g., protecting the company versus addressing threats to democracy and human rights?
Nitarshan: People had an incorrect perception that we were focused on existential risk, but we have prominently platformed societal and other risks. We think of the risks as quite broad.
Katherine: Today, we’ve been zooming in and out. Safety is really interesting because we have tools that are the same across all of these topics, while the same techniques don’t necessarily work for privacy and copyright. Alignment, filters, etc. form a toolkit that is not necessarily specified; it’s about models that don’t do what we want them to do.
Let’s talk about trust and safety. Some people think there’s a tradeoff between safe and private systems.
Dave: That is true especially early on in the development of a technology when we don’t understand it. But maybe not in the long run. For now, observation for learning purposes is important.
Andreas: Why would the system need to know more about individuals to protect them?
Dave: It depends on what you mean by privacy. If privacy means “personal data,” then no; but if privacy means “scrutiny of your usage,” then yes.
Katherine: Maybe I’m generating a picture of a Mormon holding a cup of coffee. Depending on what we consider a violation, we’d need to know more about the person, or to know what they care about. Or we’d need to know the age and context of a child.
Andreas: People have control over what they choose to disclose; that can also be used in responding.
Question: How do you think about whether models are fine to use only with certain controls, or should we avoid models that are brittle?
Dave: I’m very skeptical of brittle controls (terms of service, some refusals). Solving the brittleness of model-level mitigations is an important technical problem if you want to see open-source flourish. The right level to work at is the level you can make stick in the face of someone who is trying to be cooperative. Miscalibration is different than adversarial misuse. Right now, nothing is robust if someone can download the model and run it themselves.
Erie: What advice do you have for federal regulators who want to develop relationships with technical communities? How do you encourage whistleblowers?
Amba: Researchers are still telling us that problems with existing models remain unsolved; the genie is out of the bottle. These aren’t risks out on the horizon: privacy, security, and bias harms are here right now.
Nicholas: I would be fine raising problems if I noticed them; I say things that get me in trouble in many circumstances. There are cases where it’s not worth getting in trouble–when I don’t have anything technically useful to add to the conversation.
Dave: People who work in these parts of companies are not doing it for glory or relaxation. They’re doing it because they genuinely care. That sentiment is fairly widespread.
Andreas: We’re here and we publish. There is a fairly vibrant community of open-source evaluations; in many ways, they’re the most trustworthy. Maybe it’s starting to happen for security as well.
Katherine: Are proposed requirements for watermarking misguided?
Nicholas: As a technical problem, I want to know whether it works. In adversarial settings, not yet. In non-adversarial settings, it can work fine.
Katherine: People also mention homomorphic encryption–
Nicholas: That has nothing to do with watermarking.
Katherine: –blockchain–
Nicholas: That’s dumb.
Raquel: There’s been too much emphasis on watermarking from a regulatory perspective. If we don’t embed media literacy, I’m worried about people looking at a content credential and misunderstanding what it covers.
Question: Is there value in safeguards that are easy to remove but hard to remove by accident?
Dave: It depends on the problem you’re trying to solve.
Nicholas: This is the reason why depositions exist.
Raquel: This veers into UX, and the design of the interface the user engages with.
Question: What makes a good scientific underpinning for an evaluation? Compare the standards for cryptographic hashes versus the standards for penetration testing? Is it about math versus process?
Nitarshan: These two aren’t in tension. It’s just that right now ML evaluation is more alchemy than science. We can work on developing better methods.
And that’s it, wrapping up a nearly nine-hour day!