Beyond Red Teaming: Facilitating User-based Data Donation to Study Generative AI

Zeve Sanderson, Joshua Tucker / Nov 1, 2023

Zeve Sanderson is the Executive Director and Joshua A. Tucker is one of the co-founders and co-directors of NYU’s Center for Social Media and Politics (CSMaP).

Academics, technologists, policymakers, and journalists alike are in the midst of trying to ascertain the benefits and harms of generative AI tools. One of the primary and most discussed approaches for identifying potential harms is red teaming, which was included in President Biden’s executive order this week. In this approach, someone attempts to get an AI system to do something that it shouldn’t — with the goal of then fixing the issue. As a team from Data & Society details, red teaming “often remains a highly technical exercise” that works in some cases, while failing in others.

One limitation of red teaming is that it doesn’t give us information about how people actually use a generative AI system. Instead, red teaming focuses on learning how a system could be used by bad actors. While vital, this approach leaves us flying blind about how these tools are being used in the wild: by whom, for what purposes, and to what effects.

This limitation is especially stark in the context of consumer-facing generative AI tools. While there are well-established cases of citizen red-teamers figuring out how to get chatbots to break their own rules (and then posting the results on social media), we know little about how most people are using these services. And with ChatGPT reaching an estimated 100 million users, information about how people actually use this and similar tools is necessary to understand how the ever-evolving digital information environment is impacting society, especially as generative AI becomes integrated into other widely used services (e.g., online search).

Our work at the NYU Center for Social Media & Politics is animated by the belief that public policy and discourse should be founded on empirically rigorous research. A key challenge to studying the public’s use of services like ChatGPT is the lack of data, mirroring debates around access to social media platform data. So how could independent researchers gather the requisite data to study the use of generative AI? As one of us has previously proposed with regard to social media, there are essentially three ways to collect digital trace data:

  • Work with the platforms to voluntarily share data through mechanisms such as APIs or sandboxed analysis environments;
  • Work independently of the platforms, which includes data collection strategies like scraping and data donations from consenting study participants;
  • Work through government regulation to mandate platform data sharing with independent researchers, such as through the EU Digital Services Act.

There are trade-offs to each of these approaches. As researchers, we’ve spent a lot of time and energy discussing the need for federal legislation in the US to mandate data access. Although the Platform Accountability and Transparency Act was released in 2021 and introduced in 2022, little progress has been made. The EU, through the Digital Services Act, is leading in this regard, though no generative AI companies currently surpass the threshold to be included under the mandatory data sharing provisions.

The pressure on the government to act should remain. But given the speed with which these technologies are advancing and the potential impact they might have in the near term, we want to focus on another approach that would support independent research, introduce few legal concerns, and align with company incentives: consumer-focused AI companies, such as OpenAI, should voluntarily make it easy for users to donate their data to research.

Why this will help research

Data donation refers to the process by which users consent to share their data with researchers for agreed-upon purposes. As is the case with all data collection procedures, there are trade-offs with data donations. The main downside is that the people willing to participate (i.e., donate data) may be a peculiar sample, and that peculiarity can’t be directly measured and thus controlled for statistically. The upside is that we can learn something about the people who consent to be part of our research by asking them questions about their demographic characteristics, attitudes, beliefs, and behaviors. In this way, we’re able to design research that untangles the complex relationship between online and offline behavior.

Why can’t we scrape data from chatbots, using tools like custom web browsers that study participants install? The short answer is that we can, but doing so is technically challenging, potentially harms the sample, and introduces legal liability for the researcher.

Technically challenging: To scrape data, we would need to build a tool, such as a Chrome browser extension, that captures the HTML of the page. This approach is used in a number of studies that rely on digital media data donations. However, it introduces two challenges. First, custom tools require significant time and resources to build, limiting this option to the most technically sophisticated and well-resourced researchers. Second, it’s nearly impossible to scrape data from phone usage (especially on iOS), meaning we get a restricted view into people’s online behavior. This limitation is especially important if people use generative AI services differently on desktop versus mobile.
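To make the engineering burden concrete, the capture step such an extension would perform can be sketched as follows. This is a minimal, hypothetical illustration and not code from any actual study tool; the record shape, the `participantId` field, and the study-server handoff are all assumptions.

```typescript
// Hypothetical shape of one donated page snapshot. All field names
// are illustrative assumptions, not any real study's schema.
interface DonationRecord {
  participantId: string; // assigned when the participant consents and enrolls
  url: string;           // page the snapshot came from
  capturedAt: string;    // ISO 8601 timestamp of capture
  html: string;          // raw page HTML, to be parsed server-side
}

// Package one page snapshot together with study metadata.
function buildDonationRecord(
  html: string,
  url: string,
  participantId: string,
): DonationRecord {
  return { participantId, url, capturedAt: new Date().toISOString(), html };
}

// Inside a Chrome content script, the call site would look roughly like:
//   const record = buildDonationRecord(
//     document.documentElement.outerHTML,
//     location.href,
//     PARTICIPANT_ID, // hypothetical constant set at install time
//   );
//   chrome.runtime.sendMessage(record); // hand off for upload to a study server
```

Even this toy version hints at the real cost: a production tool also needs an extension manifest, consent screens, upload retry logic, and redaction of sensitive fields, none of which is shown here.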

Sample issues: As mentioned, we want as representative a sample of AI users as possible. One thing we’ve learned from collecting social media data for over a decade is that lowering the barrier to donating data improves the sample. Asking participants to follow multiple instructions to install a custom plug-in or download their YouTube watch history has lower uptake than asking them to provide a Twitter handle. If companies streamline data donation, sample quality will improve and the science will be better.

Legal issues: Scraping data falls into a legal gray zone that introduces potential risks for researchers. The case law is complex, and it’s unclear under what circumstances scraping is and isn’t allowable, which has a chilling effect on research. As seen in the recent case of NYU researchers collecting data on Facebook, companies can threaten legal action over scraping tools that consenting participants download. If companies come to the table and facilitate data donations, these legal issues need never arise in the first place.

Why this doesn’t introduce privacy concerns

In feed-based social media, there was a challenging privacy consideration around donating data that someone was exposed to but didn’t create. For example, if Jane participates in a research study and wants to donate her social media data, the most useful version of these data would include the posts on her feed. But the people on her feed didn’t consent to being part of a study. Should she be allowed to share these posts with researchers? There’s disagreement around this question; for the purposes of this article, the point is not to adjudicate the debate, but rather to acknowledge that it’s complicated.

Generative AI is different. There isn’t a social component that introduces complex privacy considerations. As made clear in OpenAI’s terms of service (Section 3.1), people own the output from their inputs. In other words, a user would be donating data that, at least in the case of OpenAI, is hers.

3.1 Customer Content. You and End Users may provide input to the Services (“Input”), and receive output from the Services based on the Input (“Output”). We call Input and Output together “Customer Content.” As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain all ownership rights in Input and (b) own all Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.

Why this will help companies

In lieu of government mandates, companies will need to voluntarily facilitate processes by which users can easily donate data to research. This will require technical and human resources, and it will lead to more research on the platform. Why should they do it?

In short, in the absence of the research that data donations would facilitate, the public discourse will focus on red teaming. Again, red teaming is an important process that will make these systems more robust, safe, and secure. However, relying on it risks setting up a dynamic where much of the available information about a system’s performance is generated in contexts in which the system is made to fail.

This will likely mirror the news cycle around social media. Social media data are large, more available than other types of media data, and easily searchable. This created a dynamic where examples of normatively bad behavior online could be found quite readily, leading to a consistent stream of negative coverage. This coverage rarely situated the behavior within the overall ecosystem to measure quantities of interest, such as prevalence or changes over time — in part because of the difficulty of getting these more comprehensive data.

Generative AI data are, at least conceptually, even larger and even more prone to this type of coverage: whereas social media data are limited by the size and output of the user base, generative AI output is essentially infinite if thought of as every possible input-output pair. It’s also easier to access, as anybody can red team a system from their own computer.

Rigorous research that is able to measure how generative AI is being used by people “in the wild” will likely show a complex picture of who is using these services and to what ends. Like the recent papers from the US 2020 Facebook & Instagram Election Study (to which members of our lab contributed), some insights will likely paint the companies in a good light and others in a bad light compared to the received wisdom. But in terms of getting internal buy-in within companies, it’s important that independent research infuses the public discourse with nuanced and rigorous analysis, displacing narratives founded on conventional wisdom, cherry-picked data, and anecdotes.

Moving forward

The goal of this article is straightforward: to convince companies building consumer-facing generative AI to facilitate user-based data donation for research. The technical solutions could take many forms, ranging from easy-to-access downloads of usage history to APIs. But first, there needs to be buy-in internally. As outlined here, we believe such a commitment would be good for independent research, good for companies themselves, and ultimately good for an all-of-society approach to supporting innovation while mitigating harms.


Zeve Sanderson
Zeve Sanderson is the Executive Director of NYU’s Center for Social Media and Politics (CSMaP). His research interests include misinformation, radicalization, and fact-checking. His research and writing have appeared in numerous academic journals and popular publications.
Joshua Tucker
Joshua A. Tucker is one of the co-founders and co-directors of NYU’s Center for Social Media and Politics (CSMaP). He is Professor of Politics, an affiliated Professor of Russian and Slavic Studies, and an affiliated Professor of Data Science at New York University, as well as the Director of NYU’s ...