AI’s Content Moderation Moment is Here

Maximilian Gahntz / Oct 17, 2023

Maximilian Gahntz is a Senior Policy Researcher at Mozilla.

Alina Constantin / Better Images of AI / Handmade A.I / CC-BY 4.0

After the hype comes the reckoning. AI is no different.

While the world still marvels at all the different things that generative AI systems like ChatGPT or Stable Diffusion can do, others are rightfully pointing to their limitations and possible harms. For instance, they can spit out false information — so-called hallucinations — or hateful language; perpetuate harmful stereotypes; script online scams; and be used to generate disinformation and deepfakes.

It may feel as though we are in uncharted territory — and in many ways, we are. But not all the issues posed by generative AI are new. For more than a decade, researchers, technologists, activists, and regulators have hotly debated the risks of and appropriate limits for online speech, imagery, and videos in the context of content moderation — how social media platforms decide what stays up and what gets taken down.

Of course, there are fundamental differences: content on Twitter or Instagram is (mostly) composed by humans; content generated by AI may be prompted by a human, but it is composed algorithmically. We also haven’t “fixed” content moderation. Far from it — it is still awful. Still, we need not start at square one when it comes to generative AI — there are lessons to be learned for AI-generated content and the guardrails we set for it.

So what does AI content "moderation" look like? Who gets to moderate? And who does the dirty work?

Moderation can occur at different points. It can come built into the AI model or as a filter added to it. Or it can be imposed via use restrictions. Each of these approaches effectively limits what the AI model will generate in response to users’ prompts.

Take, for example, OpenAI's ChatGPT. Regardless of how ChatGPT is accessed, be it through OpenAI's own interface or as a "plug-and-play" service, OpenAI takes various steps to moderate the system's outputs. For one, human moderators train the system to prevent it from generating, for example, hateful language. OpenAI also places the responsibility for what ChatGPT generates on users themselves: its usage policy states that users' accounts may be suspended for violations, such as using OpenAI's models to generate violent content, malware, or fraud.

However, this approach is already ruffling some feathers, evoking parallels to debates around content moderation on social media. For one, AI now appears to be becoming part of the political culture wars. OpenAI's attempts at moderation, for instance, triggered a (conservative) backlash and sparked cries of "woke AI" when it was discovered that ChatGPT could not be prompted to write a positive poem about former US President Donald Trump but would write one about President Joe Biden. Indeed, complaints of an alleged left-wing bias in ChatGPT are widespread, and it doesn't help that the exact workings of AI systems like ChatGPT are still inscrutable.

Another example that illustrates some of these challenges is the image generator Midjourney, which faced criticism for its problematic approach to preventing the generation of, amongst other things, sexualized images. Until recently, Midjourney's fix to the problem included prohibiting a catalog of terms about the human reproductive system when prompting the model to generate images. For example, if someone typed "cervix" into the command line, no image would be generated.
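To see why a term blocklist of this kind is such a blunt instrument, consider a minimal sketch in Python. It is an illustration only: the function name, the matching logic, and the one-entry list are assumptions made for this example, not a description of how Midjourney actually implemented its filter.

```python
import re

# Hypothetical blocklist for illustration. It contains only the one term
# mentioned above; a real list would be longer and is not public here.
BLOCKED_TERMS = {"cervix"}


def check_prompt(prompt: str) -> tuple[bool, list[str]]:
    """Return (allowed, matched_terms) using whole-word, case-insensitive matching."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    matched = sorted(words & BLOCKED_TERMS)
    return (not matched, matched)


# A legitimate medical-education prompt is rejected outright...
medical = check_prompt("anatomical diagram of the cervix for a nursing textbook")
# ...while an unrelated prompt passes.
landscape = check_prompt("a watercolor landscape at dusk")
```

A filter like this cannot tell a medical illustration from a sexualized image. It only sees the word, so it blocks both.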

But interventions like these aren’t just the responsibility of the companies developing generative AI. As systems like ChatGPT are integrated into an increasing number of consumer-facing apps, we can expect more and more app developers to add their own safety precautions to account for the needs of their users and the specific context of use — as, for example, Snapchat has done after the rollout of its My AI chatbot.

Finally, another point of moderation is hosting. Here, it's not about what is generated, but rather about whether and how AI models should be accessible in the first place. In June 2022, the developer and YouTuber Yannic Kilcher uploaded a language model called GPT-4chan to Hugging Face, a platform for hosting AI models and datasets. GPT-4chan, trained on content from the notorious far-right troll board 4chan and effectively a hate generation engine, drew outrage and criticism from the Hugging Face community for the harm it could (or, if unaddressed, inevitably would) cause. Ultimately, after internal discussions, Hugging Face blocked downloads of the model, though it did not remove the repository, including its documentation and community discussion. This, in turn, led to accusations of censorship from some community members.

As Hugging Face explained in a public comment to Kilcher: "Although we can appreciate the research interest in probing / evaluating this model, we couldn't identify a licensing / gating mechanism that would ensure others use the model exclusively for research purposes." Following this, in August 2022, Hugging Face expanded its code of conduct and published content guidelines to formalize a process, and developers’ responsibilities, for handling high-risk content like GPT-4chan.

So what can we learn from these early rumblings in AI moderation and from past experiences with platform content moderation?

For one, transparency around moderation still matters. To inform discussions of what good moderation should look like, companies should provide clear documentation of their content and usage policies, how those policies are enforced, and other safety precautions. Similarly, transparency reporting on content moderation is standard practice for most social media companies, and it is worth exploring what useful transparency reporting could entail in the context of generative AI. For example, should companies report how often and for what reasons users' prompts were rejected, or how often content generated by their models was reported?
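As a rough illustration of what such a report might aggregate, the sketch below tallies moderation events by type and reason. Everything in it is hypothetical: the event schema, the category labels, and the sample records are invented for this example and do not reflect any company's actual logs or figures.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ModerationEvent:
    kind: str    # assumed labels: "prompt_rejected" or "output_reported"
    reason: str  # assumed policy categories, e.g. "hate" or "fraud"


def summarize(events):
    """Count events per kind and reason, as a report table might."""
    counts = Counter((e.kind, e.reason) for e in events)
    return {f"{kind}:{reason}": n for (kind, reason), n in sorted(counts.items())}


# Made-up sample records, for illustration only.
sample_events = [
    ModerationEvent("prompt_rejected", "hate"),
    ModerationEvent("prompt_rejected", "fraud"),
    ModerationEvent("prompt_rejected", "hate"),
    ModerationEvent("output_reported", "hate"),
]
report = summarize(sample_events)
```

Even a simple tally like this would let outsiders see which policies are enforced most often. The harder questions, such as how reasons are categorized and how much detail to disclose, are policy choices rather than engineering ones.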

Finally, public interest research is imperative to help understand how generative AI systems may cause harm. Civil society has long fought for access to data and tools to study platforms like YouTube or TikTok. In fact, the EU's recently enacted Digital Services Act (DSA) makes such data sharing mandatory for the largest online platforms. Generative AI companies should also explore how they can better enable research on the systems they develop and market.

These steps are no cure-all, and many other challenges remain. Nevertheless, creating transparency around how AI-generated content is moderated can help build a first line of defense against harmful AI content. And this time, we may get there faster than we did with social media, since some of the building blocks are already in place.


Maximilian Gahntz
Maximilian Gahntz is a senior policy researcher at Mozilla, where he works on issues related to AI policy and platform regulation. Before joining Mozilla, Max was a fellow of the Mercator Fellowship on International Affairs, working with the European Commission. He holds degrees in Public Policy and...