
Exposing the Rotten Reality of AI Training Data

Justin Hendrix / Dec 31, 2023

Audio of this conversation is available via your favorite podcast service.

In a report released December 20, 2023, the Stanford Internet Observatory said it had detected more than 1,000 instances of verified child sexual abuse imagery in a significant dataset utilized for training generative AI systems such as Stable Diffusion 1.5.

This troubling discovery builds on prior research into the “dubious curation” of large-scale datasets used to train AI systems, and raises concerns that such content may have contributed to the ability of AI image generators to produce realistic counterfeit images of child sexual exploitation, in addition to other harmful and biased material.

On Friday, December 29, I spoke to the report’s author, Stanford Internet Observatory Chief Technologist David Thiel.

What follows is a lightly edited transcript of the discussion.

David Thiel:

I'm David Thiel. I am chief technologist at the Stanford Internet Observatory, which is an institute at Stanford focused on essentially reduction of harm from various kinds of technological mis-implementations.

Justin Hendrix:

And David, can you tell me a little bit about your background?

David Thiel:

Sure. So most of my background is in computer security. I was a consultant and researcher for about 10 years before working at Facebook and Instagram on security across a pretty wide array of technologies, as well as starting to focus more on human safety issues, so issues of, for example, sextortion, stalking, abuse, those kinds of things. So when I moved to the Stanford Internet Observatory about three and a half years ago or so, I focused mostly on building up technology for analysis of social media platforms and other ways to help research, but also brought along some of that experience with other human safety issues, child safety issues, and started performing research into a number of intersections of child safety and technology.

Justin Hendrix:

I suppose one of those intersections is with this newest generation of generative AI systems and the training sets that have gone into them, and that's what's led to this report, which you published just prior to the Christmas holiday: "Identifying and Eliminating CSAM in Generative ML Training Data and Models." Seems like this came together relatively quickly. How long was this underway?

David Thiel:

We started work on this in about September, and it was one of those projects which was inherently a bit slow and methodical. Effectively what happened was we had an external party come to us and say that they thought that they had identified some problematic content in the training set and were not able to get any action taken. We had the ability to use an API called PhotoDNA that Microsoft offers, and that allowed us to basically submit URLs to them to be scanned, to see whether they represented known instances of child abuse material.

We didn't actually get any hits from that sample, but it kicked off a process of figuring out, if we were going to try to identify whether that material was present in the training set, what the best methodology for doing so would be. So starting in about September, we started taking samples from that dataset that were potentially the most likely to contain CSAM, either by analyzing keywords or by analyzing how they had been classified in the dataset, and found enough samples that were positive hits that we were able to refine that methodology and continue from there. Given that we were working on hundreds of thousands of images and with an API that I think allows five queries a second by default, that started stretching out, and so we started trying to find ways to accelerate that process somewhat.
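
To give a concrete sense of the pacing problem Thiel describes, here is a minimal sketch of rate-limited URL scanning. The `matches_known_hash` callable is a hypothetical stand-in for whatever fingerprint-matching service is actually used; it is not a real PhotoDNA client.

```python
import time
from typing import Callable, Iterable, List

def scan_urls(
    urls: Iterable[str],
    matches_known_hash: Callable[[str], bool],
    queries_per_second: float = 5.0,
) -> List[str]:
    """Submit URLs to a hash-matching callable, paced to a query-rate cap."""
    min_interval = 1.0 / queries_per_second
    hits: List[str] = []
    last_call = 0.0
    for url in urls:
        # Sleep just long enough to stay under the per-second limit.
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        if matches_known_hash(url):  # placeholder for the real service call
            hits.append(url)
    return hits
```

At five queries per second, a few hundred thousand URLs take on the order of a day, which is why the team looked for ways to accelerate the process.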

Justin Hendrix:

And this is not just any dataset; it's LAION-5B, the second or third iteration of a large-scale dataset created by unsupervised crawling of the internet. Can you just explain what this thing is and what's present in it? Are we talking about images, image URLs? How is this thing put together?

David Thiel:

Sure. So effectively the way that the dataset was compiled was they took what is kind of a wide-scale crawl of the internet based on a project called Common Crawl, and they took effectively the image URLs from that crawl, extracted the alt tag descriptions, so the text that's there to describe what an image is in the absence of being able to view it, and they basically used a little bit of machine learning to look at those alt tags and ask, does that seem like what this image actually is? So trying to discard things that were complete mismatches, where the tag said it was a fruit when it was an ant or something.

The actual dataset itself is effectively just a giant blob of metadata. It's URLs to the image, some basic info about the dimensions, the caption that it had, caption translated to English, and a couple of predictive values of whether it might be not-safe-for-work, whether it might contain a watermark or not, a couple of things like that.

So when somebody gets this dataset, they effectively download this still rather massive bundle of metadata, and then, if they were going to train a model on it, they would go ahead and kick off a process that downloads all of the original images. The images themselves aren't distributed for a number of reasons, copyright, et cetera. But this also has some limitations for how we were trying to use the dataset, which is that these URLs go dead over time. In our analysis, about 30% of all of the URLs that we checked were already offline. Since this is used to train machine learning models, that raises the possibility that there were potentially a fair number more instances that were actually used to train the model that are no longer accessible on the internet. So it's gone into a bit of a black box.
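
For readers unfamiliar with this kind of dataset, here is a rough sketch of what one metadata entry and a simple training-time filter might look like. The field names are illustrative assumptions, not the actual LAION column names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ImageRecord:
    """Illustrative shape of one metadata entry; real column names differ."""
    url: str                    # link to the original image, which may now be dead
    caption: str                # alt-text scraped alongside the image
    caption_en: Optional[str]   # caption translated to English
    width: int
    height: int
    p_unsafe: float             # predicted probability the image is NSFW
    p_watermark: float          # predicted probability the image is watermarked

def select_for_training(records: List[ImageRecord],
                        max_unsafe: float = 0.9) -> List[ImageRecord]:
    """Toy filter: keep entries under an NSFW-probability threshold.

    A real pipeline would then attempt to download each `url`; per the
    interview, roughly 30% of those links were already offline.
    """
    return [r for r in records if r.p_unsafe < max_unsafe]
```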

Justin Hendrix:

And so the headline that emerged around the report, and lots of the news coverage of it, was, I suppose, based on one sentence: "we find that having possession of a LAION-5B dataset populated even in late 2023 implies the possession of thousands of illegal images, not including all of the intimate imagery published and gathered non-consensually, the legality of which is more variable by jurisdiction." So a bit of a bombshell in some ways, that you've been able to quantify something that prior researchers had suggested was likely the case: that there was an enormous amount of dangerous material, including child sexual abuse material, present in the dataset.

David Thiel:

Yeah, and it had been something that was known about to some degree within that community, but everything was effectively anecdotal. People had said that they had seen inappropriate things, or said that these captions indicate that there's inappropriate material here, but nobody had gone through to actually try and quantify that. So it was left in this ambiguous state where nobody was really saying there was any bad material in there, particularly because if somebody had run across material in there, well, A, they're legally not allowed to go in and search for child abuse material, and if they did run across it, they would've been legally obligated to report it. So it just stayed in limbo for a bit. In our case, we had programmatic ways to interrogate this that did not involve actually downloading, analyzing, or doing anything with the images other than just fingerprinting, and we relied on an external agency that was actually permitted to verify the material in cases where we needed verification.

Justin Hendrix:

So I want to try to understand exactly how that works, just so my listeners can get a sense of how you were able to connect multiple bits of technology and also other organizations that work on this problem, and, I suppose, figure out how to address or quarantine this information in ways that make it possible to analyze and assess. These are groups like NCMEC and C3P. Can you talk a little bit about how you put together the web of entities that was involved in making this work possible?

David Thiel:

So the way that we started with this is that we had prior access to PhotoDNA, the service that identifies known instances, and we use that for every piece of social media that we ingest in the regular course of our work, whether we're looking at disinformation or conspiracy theories or things like that. All of our media gets analyzed, just from a researcher safety and ethics perspective, and we make sure that we don't store any of that material. So we used PhotoDNA as that first layer: taking the content judged most likely to be not-safe-for-work by the original classifiers and running it through PhotoDNA.

The way that service works, at least the way that we use it, is we just submit a URL to the API. We say, "Go fetch this, go look at it." None of our servers download it or anything like that. Once we got a hit, we would basically record those URLs and pass them to the Canadian Centre for Child Protection, C3P, and they were able to actually look at those and classify them, saying, "No, this is something else," because it's possible that there are false positives, not likely, but there will be some, or into subcategories of what type of abuse it might entail, or whether it was visually age-ambiguous and they couldn't classify it, that kind of thing.

The way that they did that classification helped us, because if there's some random photo of a toaster in there or something, that would mess up our next steps. So we tried to get the most accurate sample of URLs, and then we effectively leveraged part of what was in the dataset itself, which is basically the mathematical representation of what those images are in the model. You can effectively take those and say, "Given this representation, what do you think is conceptually closest to this?" It's a little bit like going to a Google reverse image search and saying, "Find me other images that are kind of like this," but it's more thematic rather than "visually looks just like this."
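
The "conceptually closest" step Thiel describes is, in essence, a nearest-neighbor search over the image embeddings distributed with the dataset. Below is a minimal sketch of the idea using cosine similarity; the real analysis used the dataset's own precomputed embeddings and indexes at far larger scale, so treat this as illustration only.

```python
import numpy as np

def nearest_neighbors(query: np.ndarray, embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k embeddings most similar to `query` by cosine similarity."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return np.argsort(-(e @ q))[:k]

# Toy usage: random vectors standing in for the dataset's precomputed image embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 512))
seed = vectors[42]                        # embedding of a confirmed hit
candidates = nearest_neighbors(seed, vectors, k=10)
```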

So that gave us hundreds of thousands of new candidates that we could then scan with PhotoDNA. They might not be the ones that rose to the top in our first layer of analysis, but we were able to throw those at PhotoDNA to see if we got hits there. We also used a machine learning classifier from Thorn, a child safety organization that we've worked with in the past. For instances that were not flagged as known CSAM, we effectively set up a temporary cloud environment that would run them through that analyzer, which was built to detect CSAM that could be new, unknown instances; it tries to assign a probability that something is CSAM. We took the top hits from that classifier and also passed those URLs to C3P.

So it was a multifactorial approach with some verification on the back end that gave us an idea of whether what we were doing was effective and making sense. The last layer was that we worked with the National Center for Missing and Exploited Children, NCMEC, who gave us access to a cryptographic hash database of known instances of CSAM. We leveraged that because the metadata contains fingerprints of these images, but they're exact fingerprints: if a single bit of that image is changed, the match is totally broken. Even so, at that sheer scale, we were able to find exact fingerprints of a number of known instances. Not the most desirable way to do it, but it leveraged what was already in the dataset. So that's how all of those parties came together.
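
The exact-fingerprint layer amounts to intersecting the hashes carried in the metadata with a database of hashes of known material. A sketch of that idea follows; the field name and hash values are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Set

def exact_hash_matches(records: Iterable[Dict[str, str]],
                       known_hashes: Set[str]) -> List[Dict[str, str]]:
    """Flag records whose exact fingerprint appears in a known-hash database.

    Unlike a perceptual hash, an exact hash stops matching if even one byte
    of the image changes, so this only catches unmodified copies.
    """
    return [r for r in records if r.get("hash") in known_hashes]

# Toy usage with made-up values; real hashes would come from a vetted database.
known = {"d41d8cd98f00b204e9800998ecf8427e"}
rows = [{"url": "https://example.com/a.jpg", "hash": "d41d8cd98f00b204e9800998ecf8427e"}]
flagged = exact_hash_matches(rows, known)
```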

Justin Hendrix:

Do you have a sense of how much larger the problem may be than what you were able to observe?

David Thiel:

Well, we can say that it is at least probably 30% bigger, and that is simply because of the amount of content that has come down. And we do expect that the attrition rate of actual CSAM is going to be relatively high. If any content is going to go dead on the internet, that's fairly likely to be it, or at least one would hope. So you could say maybe somewhere 30 to 50% bigger than what we were able to actually analyze. But my gut feeling is it could be anywhere from two to five times what we found. It is going to vary, because there is a lot of ambiguous material out there.

As we mentioned, there's just a lot of NCII, or non-consensual intimate imagery, that shows up in that dataset, and people won't necessarily know whether the person depicted is of age or not. That we consider its own problem, because somebody's nude mirror selfie is not something that should be populating this dataset; the majority of the time, those people probably didn't want that image distributed at all in the first place. But yeah, it's not massively bigger than what we were able to detect, though there's definitely a significant percentage more than we were able to analyze.

Justin Hendrix:

I want to get onto some of the implications of this, including for Stable Diffusion and other models that have been trained on this dataset, and onto policy implications as well. But maybe first I want to ask you how this training set got into the wild with this problem in it in the first place. The paper that presented the dataset was, I think, one of the best papers at NeurIPS, one of the more well-known computer science conferences that focuses on these issues. And the paper describes a mechanism that the researchers used to try to identify not-safe-for-work and other problematic content. What was wrong with that methodology? How did it fail? Is this a failure of the researchers who put this together in the first place?

David Thiel:

They did make some attempts to filter material. The fact that they did use a classifier to try and identify, quote/unquote, not-safe-for-work material was definitely helpful to this research. If nothing else, it was helpful for making decisions about what to include in training various models down the line. But if they had consulted with a child safety organization, they would've been able to be much more effective here, and they would've been able to filter out essentially 100% of all known samples of CSAM during the actual crawl and dataset assembly process. Instead, they used their own mechanism with a bunch of keywords. Basically, they threw it at their own machine learning analyzer, which would look at the image and try to figure out what was depicted, in textual form, and they put in some keywords for things that they didn't want to see.

Some of those made sense, and some of them did not make a ton of sense. If you're finding child abuse materials on the internet, the people posting them are not going to come out and use the term "underaged," for example, yet that's the kind of thing the filter was looking for. It's also not a concept that a machine learning model really knows about; that's a legal concept and not something a model is really trained on. So they put in some keywords, and undoubtedly that got rid of some material that would've otherwise been included. But it was unfortunately a missed opportunity to do a really thorough job. If we can get one of these organizations to put in the compute resources necessary to actually calculate perceptual fingerprints of all of the images in their possession, then maybe we can come up with a more thorough way to do it. They tried, but there were some lapses there.
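
For contrast, the keyword approach Thiel critiques boils down to matching captions or predicted tags against a blocklist, which misses anything not literally described with those words. A toy sketch, with an invented blocklist:

```python
# Invented blocklist terms, purely for illustration.
BLOCKLIST = {"underaged", "explicit"}

def caption_flagged(caption: str) -> bool:
    """Naive keyword filter of the kind described: it only flags a caption
    if a blocklisted term literally appears, so material described in any
    other words slips straight through."""
    words = set(caption.lower().split())
    return bool(words & BLOCKLIST)
```

A perceptual-hash check against a vetted database, by comparison, does not depend on how the uploader chose to describe the image.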

I would say the secondary lapse is in the decisions that were made at model training time. When models were trained on this, the initial ones were trained on a pretty broad cross-section of the dataset, and that includes all of the material that was calculated to be not-safe-for-work. The majority of the CSAM that we found intersected with that category, at over 90% predicted probability, and that material was all included. Images of children were included, images of adult sexual activity were included. The combination of these factors made it so that these models were effectively able to produce CSAM themselves. So part of it is the way in which the data was crawled, and part of it is the decisions that were made when training the models. But ultimately, those first really usable releases of Stable Diffusion have unfortunately caused a lot of harm as a result.

Justin Hendrix:

I want to ask you about some of the fallout. Of course, LAION, the not-for-profit organization that hosted the dataset, has pulled it down, and others have pulled down items that are based on this dataset. The statement from LAION was sort of defensive, it seemed to me at least; it invited you to work with it to help improve its filters, and mentioned somewhat tersely that the material had been deleted immediately in accordance with Article 17 of GDPR. It seems like they had already heard from a German state data protection commissioner almost instantly on news of the inclusion of these images. What else have you seen in terms of the fallout from this disclosure?

David Thiel:

Right now the immediate impact has been that those metadata sets are now inaccessible. Presumably some research groups have started to evaluate what they want to do with copies of the dataset that are in their possession. It is legally a little bit of a weird area, because usually when you've got such a huge amount of data stored, all of these billions of images, the people doing that are platform providers that have these safety mechanisms in place. And the laws that apply don't really contemplate, "Oh, did you have a thousand images of child abuse within the almost 6 billion images stored on your set of hard drives?" They're usually focused on something a bit more intentional. So I think nobody's quite sure legally what that means; we don't really know, and certainly not on an international level. So I'd say mostly the effect has been limiting access to those datasets.

I'm not aware of removal of subsequent models or any action by platforms that are hosting the trained models. So while it makes some sense to limit access to those datasets, the majority of the harm is really in the models. We'll see how that actually shakes out. When it comes to cleaning up datasets, part of the problem we ran into is that at the beginning of the project I was a bit more optimistic: "Let's find all these things, let's get them out of the datasets, problem solved." It just became a bit more complicated as time went on. Effectively, these datasets are all stored in Git repositories; they have revision control. If we go and say, "Take these few thousand URLs out of there," there's literally going to be a change log that says, "We removed these exact couple thousand URLs," which amounts to publishing the URLs to a couple of thousand images of CSAM.

So that makes publishing updates rather difficult. If I just threw a couple thousand URLs to CSAM out onto the internet, I'm not even sure what the legal status of that kind of action is. So we had difficulty coming up with a way that would allow those datasets to be cleaned without providing a roadmap to the material itself. What I've really been saying is that if these datasets are going to continue to be used, they need to be substantially reformulated in some way, and that means there needs to be a significant change in what type of material is included. That can be removing explicit material, it can be removing images of children, it can be removing images that artists have opted out of including in datasets, but we're going to have to come up with something a bit more drastic than just, "Let's trim out these URLs and call it a day."

Justin Hendrix:

So you say in the report that models based on Stable Diffusion 1.5 that have not had safety measures applied to them should be deprecated and distribution ceased where feasible. How many models are out there based on Stable Diffusion 1.5? How many people are using this?

David Thiel:

There are probably thousands of models based on Stable Diffusion 1.5. One way to think about these models, and the reason they're referred to in that community as checkpoints, is that they're basically a point-in-time snapshot where people decided, "Okay, we're done with training the model now." There's now technology that lets people quite easily say, "Well, I'm going to continue training that model on whatever material I want to provide." And so we've had these communities of people taking models and basically making them better and better at a variety of tasks, but mostly at producing explicit content. Platforms such as Civitai host a huge number of checkpoints, along with model augmentations that make them better at producing explicit content. And as we looked at in a previous report, they also have models, used within communities producing illegal content, that make the subject appear younger.

So there are tons that are in use. And as I mentioned, this has caused a not inconsiderable amount of real-world harm. Right now, people tend to think of things in terms of, "Oh, well, if this is generated CSAM, it's computer generated, it's not a real person, it's not a big deal." But a lot of what is happening in the real world right now is that real people are having their likenesses used in the generation of this material. We've seen at least five different instances in the last couple of months where what appear to be Stable Diffusion-derived models have been used to "undress" schoolchildren, or by children to produce explicit content of their peers in school.

So the fallout is pretty significant, beyond just people generating a bunch of either legal or illegal porn on their laptops. These models are really good at combining concepts, and they're particularly good at sexualizing women; they're very biased towards doing so. I think overall, those models based on 1.5 have been a pretty clear net negative. They're horrendously biased in a number of ways. Sometimes you can't even get them to produce images of women with their clothes on, even if you tell them to. Given the biases they have inherently and the fact that they were trained on that wide scope of material, I just think we have to move on from what should be considered a toxic base-level model.

Justin Hendrix:

I guess there's a slight tension in the idea that we know about this problem with Stable Diffusion because this was an open dataset and you were able to study it. With other models that are now being widely deployed across the world, including ChatGPT, OpenAI's platform, we're not able to scrutinize the datasets. What degree of confidence do you have that this is not a problem that generalizes to pretty much any generative AI system, whether we can scrutinize the dataset or not?

David Thiel:

I'd suspect that other generative models have this problem to varying degrees. One thing that I had anticipated with the publishing of this report is that people would say, "Look, this is the problem with these open-source models." And yeah, as you point out, the only reason we were able to do this analysis is because it was an open-source model, because they were transparent about their training material. As somebody who's worked in open-source software and technology for a very long time, I'm not opposed to open-source principles, though I have colleagues who say, "Open-source models should just not be a thing." I don't think that's the case. I think that when it comes to policy questions, the way to go here is towards more transparency about what goes into these models.

For the generative models that are closed-source, I think a similar dataset should, if not be made publicly available, at least be publicly auditable, for a number of reasons, not just child safety or other safety issues. I think it's important to know what goes into those models. So yes, I do think these problems are present to varying degrees in other generative models. The fact that those models are usually hosted by a company that can put after-the-fact controls on what prompts you can enter, analyze the output and say, "Oh, no, that doesn't look right," and then drop it, obviously makes things easier to patch up than just handing out a raw model to the world. But I do think there should be enough transparency for similar projects to be conducted on the training sets for other models.

Justin Hendrix:

One thing that your report does point out is that this isn't just a problem of the datasets behind generative AI systems; a lot of these URLs were hosted on some pretty big platforms: Reddit, Twitter/X, and I believe WordPress as a platform is mentioned in the report as well. To what extent do you characterize this as a problem of the platforms themselves, or a problem of internet hosts not being able to address these things upfront? Where do you situate the problem?

David Thiel:

So this actually brought up something interesting that I had not really put a lot of thought into before, which is that a lot of these are general-purpose hosting platforms and social media platforms, and many of them at some point went ahead and implemented industry-standard scanning methodology for detecting known instances of CSAM. But given that this was a wide internet crawl, it captured content that was posted a decade or more ago as well as stuff that was posted just recently. There's actually a bit of a gap, because usually that detection happens when somebody uploads something. A company will start implementing these safety systems, somebody uploads an image, and they say, "Ah, no. We don't want that to be posted."

Everything that was posted before they implemented that at-upload check is still there, and new instances of known CSAM or known NCII are being added to the hash databases all the time. So there is actually this lag between when a platform implemented the technology and when new hashes are added to those databases. For some of these platforms, it's probably a good idea to go back and do retrospective scanning of the content they already have in their possession. Obviously, that's going to take some degree of compute, but my assumption, giving the benefit of the doubt here, is that a reasonable amount of the material on those major platforms was posted before they implemented scanning technology, or matches hashes that were added after the content was uploaded. So there's a bit of a gap, I think, in some of the child safety imagery protection mechanisms that those platforms might have.
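
The gap Thiel identifies implies a retrospective sweep: re-checking everything already stored against the current hash database rather than only scanning new uploads. A minimal sketch, where `fingerprint` stands in for whatever perceptual-hashing routine a platform actually uses:

```python
from typing import Callable, Iterable, List, Set

def retrospective_scan(stored_items: Iterable[str],
                       fingerprint: Callable[[str], str],
                       known_hashes: Set[str]) -> List[str]:
    """Re-check already-hosted content against the current hash database.

    Upload-time checks only cover content posted after scanning was enabled,
    using the hashes known at that moment; this sweep covers the backlog and
    any hashes added to the database since.
    """
    return [item for item in stored_items if fingerprint(item) in known_hashes]
```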

Justin Hendrix:

Just a couple of days after your report published, two Democratic congressional representatives, Anna Eshoo and Don Beyer, put out a proposed bill, the AI Foundation Model Transparency Act, that would appear to address some of these types of concerns. Have you had a chance to look at that bill or the reporting on it? Is it the kind of thing you would have in mind?

David Thiel:

I have not had time to dig into that bill. We have policy people at SIO who are much better at teasing apart the contents of those bills and how they compare to other laws that have been implemented. But I do think that, like I say, transparency should be the focus here, not necessarily trying to put strict legal regulations on how things are distributed or how they're hosted, that kind of thing. So I do think that's at least the right direction to start from.

Justin Hendrix:

I suppose your work points to the possibility that these should be surmountable problems. We should be able to figure out how to build generative AI systems that effectively weed out stuff like child sexual abuse material, maybe copyrighted material, other types of problematic content. Are you confident that that will be the case in future, or do you think we have a rough road ahead of us?

David Thiel:

I think that things will improve in the future. Part of what we wanted to communicate with this paper is that datasets just can't be compiled that way anymore. They have to be better curated, there have to be far better safety mechanisms, and it should have been that way from the start. But a couple of years ago, everybody was rushing to get the latest revision of their technology and models on the market. So I do think things will change; I think they have changed to some degree, even with newer versions of Stable Diffusion. But unfortunately, some of that has also included less transparency about how those models were trained. So I think we need to reach a better state when it comes to that kind of transparency. And when it comes to the safety mechanisms that are implemented, we tried to give a decent example of ways that safer systems could be built.

When it comes to the more recent regulation, both the executive order in the US and recent legislation in the EU, I think those are going in some interesting directions, and part of what they are looking for is transparency. I do think they're flavored by a somewhat different way of looking at risk. For example, legislation in the EU looks at how powerful a model is. But the raw power, the number of parameters in a model, or how much compute it took to train it is not necessarily indicative of potential harm. That's the stuff that people who stoke up all this imagery about artificial general intelligence are worried about.

But the harms that these models can produce are not really dependent on the number of parameters or how much compute was used. It's really about what they can do. So I think that having more of a focus on what those models can actually do and what the real-world outcomes are would probably be beneficial. You don't need a lot of compute power to decide to deny somebody a home loan based on their ethnicity or something. The compute power necessary to generate explicit imagery from Stable Diffusion these days is quite small, and people can do training at home. So I think there's a bit of that AGI-type existential risk concern coloring some of that stuff, and a bit more of an immediate-harm focus is also necessary.

Justin Hendrix:

Is there any possibility that the loose collaboration model that you've had in doing this report could or should become permanent, that we might need a consortium of the types of organizations you've involved in this effort to be able to carry on these interrogations going forward?

David Thiel:

I think that there definitely should be greater collaboration between organizations that are building training sets and models and other trust and safety-focused organizations. Collaboration with child safety organizations is useful, and we certainly are interested in continuing this kind of cross-org collaborative research. So I think the arrangements that produced the outcomes we're dealing with right now should become a bit more standard in the future. We're at a point right now, in technology and on the internet in general, where things are starting to decentralize a fair bit. So much of our trust and safety effort in the industry has been confined to these really large platforms and the in-house knowledge that the Facebooks of the world, and formerly Twitter, would've had internally. Now we've got a lot more distributed technology, a lot more open-source hobbyists, and so we have to figure out how to make those child safety organizations and child safety models apply to this wider array of stakeholders. So I think there's room for some positive change there.

Justin Hendrix:

What's next for you? What's your next point of inquiry when it comes to generative AI or the training sets that are used to populate these systems?

David Thiel:

Well, there are some newer training sets that are even larger and potentially a bit more difficult to interrogate, so maybe taking a look at those. We may also look at ways to refine the accuracy of that nearest-neighbor detection, and at working with child safety organizations so that they can use some of those findings to train their models to detect novel material. I hope there are some additional beneficial research outcomes to be had from that. And then we're going to try to figure out the best ways to use the residual LAION data, whether there's a way to recombine it so that it is significantly safer and researchers can continue to have access to it. We still need to study Stable Diffusion 1.5, how it was made, how the material affected its training. Some people still need to do that, and we need to figure out ways for them to do it as safely as possible. So those are a few of the future directions.

Justin Hendrix:

David, thank you so much for speaking to me about this research and wish you the best in your next study.

David Thiel:

Thanks very much for having me.
