
Using LLMs for Policy-Driven Content Classification

Dave Willner, Samidh Chakrabarti / Jan 29, 2024

Dave Willner and Samidh Chakrabarti are non-resident fellows at the Stanford Cyber Policy Center’s Program on the Governance of Emerging Technologies. Dave previously led the Trust & Safety team at OpenAI and Samidh led the Civic Integrity product team at Facebook/Meta.

Anton Grabolle / Better Images of AI / AI Architecture / CC-BY 4.0

Large language models (LLMs) have the potential to revolutionize the economics of content moderation, but building them into production-grade trust & safety systems requires thoughtful planning and consideration of their limitations. In this post, we describe a set of practices that can tune LLMs to be more robust interpreters of content policies. We show how proper formatting and sequencing of prompts, as well as careful specification of policy taxonomies, can greatly improve the fidelity of LLMs in content moderation. These techniques won’t solve fundamental challenges around cost and latency, but as the technology evolves, we expect LLMs to become crucial in maintaining safe and welcoming online communities.

Introduction: Why LLM-Tuned Content Policies Matter

Dealing with harmful and disruptive material is an unavoidable aspect of managing any platform that hosts content, regardless of whether it is user or AI generated. Not only do platforms have a growing number of legal compliance requirements they must satisfy, but almost all platforms go beyond the law and voluntarily moderate at least some content (e.g., spam) to ensure a high quality user experience and to meet the expectations of advertisers.

Despite the ubiquity of moderation, performing the task remains labor intensive, even with the help of machine learning. We believe that LLMs like GPT-4 represent the greatest change to the dynamics of content moderation in at least a decade. LLMs can now directly automate the core activity of moderation: the classification of content according to a set of written, human-intelligible policies.

In theory, LLMs have numerous advantages over people in content labeling, particularly at scale. Large human labeling teams are complicated to set up, labor intensive to train, and slow to redirect in the face of new rules. And the toll the work takes on moderators’ mental health cannot be ignored. By contrast, an LLM-powered alternative should be easier to set up, simpler to quality control, quicker to redirect, and better able to maintain consistency as scale increases.

In practice, however, using LLMs for policy-driven content labeling is, at present, quite brittle. LLMs lack an intuitive understanding of human norms and tend to “lose the plot” when interpreting longer instructions. We’ve found that to get around these limitations, content policies must be meticulously crafted specifically for LLM interpretation, a skill for which few best practices currently exist.

To help fill that gap, we outline below a set of helpful tips for those interested in using off-the-shelf LLMs for content classification. These practices are informed by direct experimentation as well as our experience leading Trust & Safety at OpenAI and Civic Integrity at Meta, respectively. We hope this jumpstarts a conversation about how to write policies that LLMs can effectively follow, and we encourage others to share their lessons as well.

Tips for Writing LLM-Interpretable Content Policies

The simplest way to use LLMs for content moderation is to first write a content policy document that describes your rules. Then, at evaluation time, feed the policy document in as a prompt along with the text to be classified.
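To make the pattern concrete, here is a minimal sketch using the OpenAI Python client. The model name, file name, and instruction wording are illustrative assumptions rather than recommendations, and any capable instruction-following model could be substituted.

# Sketch of policy-prompted classification: load a markdown policy document
# and ask the model to apply it to a piece of text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("hate_speech_policy.md") as f:
    policy = f.read()  # the policy document, written per the tips below

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use any capable model
        temperature=0,   # deterministic output simplifies quality control
        messages=[
            {"role": "system", "content": policy},
            {"role": "user", "content": (
                "Text to be classified:\n" + text +
                "\n\nReply with only the final classification label (e.g., HS0)."
            )},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("An example post to moderate"))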

To work well, the policy document must be written carefully. What follows are our top six practical tips for maximizing the policy-interpretation fidelity of your off-the-shelf LLM. These tips fall into two broad buckets: (A) how to write the policy instructions in a format and sequence an LLM can better follow, and (B) how to define your core taxonomy to maximize accuracy and clarity.

Policy Format & Sequencing

1/ Write in Markdown Format

Markdown is a simple syntax for formatting text on the web. Markdown documents are widely distributed online and commonly make their way into the training material of LLMs. As a result, structuring content policy documents with markdown formatting can make them more understandable to these models.

Specifically, incorporate headers, bolding, and lists to enhance clarity and organization. Using bolding on key terms when defining them, and again every subsequent time they occur, seems to be especially useful.

The use of markdown formatting appears to help the LLM understand the relationships between different sections and concepts. We’ve found that policy documents that do not use markdown generally yield worse results, even if they are otherwise well written and formatted.
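For example, the skeleton of a markdown-formatted policy might look like the following sketch (the section and category names here are illustrative placeholders, not drawn from a real policy):

# Policy Title
## Definition of Terms:
A **key term** is defined once here and **bolded** again every time it reappears
## Categories:
1. **C0 Allowed Content** Criteria: ...
2. **C1 Prohibited Content** Criteria: ...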

2/ Sequence Sections as Sieves

The ordering of a policy document can have a significant impact on LLM interpretation. By way of analogy, think of an LLM as treating the policy document as a sequence of category “sieves”: if a piece of content fulfills a category’s definition, it will not pass through to the next category. Therefore it is advisable to:

  • Generally order categories by the frequency you expect matching examples to occur in your data, from most to least. In a moderation context this will often mean first specifying what content is allowed on your platform rather than what is prohibited.
  • Similarly, if a category is more important to “get right” than others (e.g., due to high false negative risk), move it further up the policy document to effectively bias the model towards sorting content into that category.
  • Keep categories mutually exclusive. However, if it is important to include a category that is a descriptive subset of another category (e.g., child sexual abuse material, or CSAM, which is descriptively a subset of Sexual Content), the subset should likely be placed before the superset.

If categories are not ordered in this manner, overall results tend to be worse, since content that should land in the more frequent or important categories will tend to be mis-sorted into whichever categories precede them in the document.
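To make the ordering concrete, a sieve-ordered list of category headers for the situation described in the bullets above might look like the following sketch (the labels are illustrative placeholders):

## **S0 Allowed Content** Criteria: ...
## **S1 Child Sexual Abuse Material (CSAM)** Criteria: ...
## **S2 Sexual Content** Criteria: ...
## **S3 Other Prohibited Content** Criteria: ...

The allowed category comes first because most content should fall through into it, and CSAM is placed ahead of the broader Sexual Content category because it is a descriptive subset that is especially important to get right.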

3/ Use Chain-of-Thought Logic

As with other complex prompts, tell the model to reason in order, step by step. Then, when introducing new sections, briefly explain how they relate to previous sections. LLMs can then better follow this chain of thought to “reason” through relationships much as humans do.

Example:

# Hate Text Taxonomy
This is a policy for classifying text as **Hate Speech**. Let’s think step-by-step while following these instructions as exactly as possible:
1. Determine if the **text to be classified** meets the criteria of **HS0 Non-Hateful Text** provided below. If it does, apply the **text classification** HS0 and stop.
2. If the **text to be classified** does not meet the criteria of **HS0 Non-Hateful Text**, determine if the **text to be classified** meets one of the other criteria provided below and select the highest severity **text classification** that applies:
   1. **HS1 Hate Crime Text** (most severe)
   2. **HS2 Dehumanizing Hate Text**
   3. **HS3 Contextualized Hate Text** (least severe)

Core Taxonomy Specifications

4/ Establish Key Concepts

Important concepts should be defined discretely and then referenced throughout the document. Simple concepts should be used as building blocks in the definition of more complex ones, which can then be used as a sort of shorthand.

Example:

# Definition of Terms:
The following traits are **protected characteristics**: Race, Nationality, Ethnicity, Caste, Religion, Disability, Serious disease, Sex, Gender, Gender identity, Gender expression, Sexual orientation
A **protected class** is a group of people defined by sharing a specific **protected characteristic**
A person with a **specified protected characteristic** is an individual who is identified as having a **protected characteristic** in the **text to be classified**. The use of gendered pronouns to refer to a person is not, on its own, enough to meet this definition.
A **derogatory slur** is a name for a **protected class** or a person with a **specified protected characteristic** that has an inherent negative connotation. This excludes reclaimed slurs, which had a negative connotation in the past but now typically have positive or neutral meaning.
The following traits are **non-protected characteristics**: Political ideology, Political party, Profession, Hobby, Intelligence, Character, Criminal Status
A **non-protected class** is a group of people defined by sharing a specific **non-protected characteristic**
A person with a **specified non-protected characteristic** is an individual who is identified as having a **non-protected characteristic** in the **text to be classified**

Policies that fail to concretely define the terms they use will often result in misinterpretation by the LLM. For example, failing to provide an explicit definition of exactly which kinds of groups ought to be protected for the purposes of a hate speech policy will often result in the model either neglecting certain groups (e.g. gender) or including groups that were not intended (e.g. hobbyists) when categorizing examples.

5/ Make Categories Granular

As with human labelers, it is important to clearly define the specific categories you want the model to sort content into (i.e., label). Avoid trying to fit everything under one definition. For example, instead of one lengthy "Prohibited Speech" section covering various types of content, create smaller, separate sections for hate speech, threats, harassment, and slurs.

Example:

# Hate Speech Categories:
## **HS0 Non-Hateful Text** Criteria:
**Text to be classified** that is not hateful because it either does not include mention of a **protected class** or **protected characteristic** or which includes only neutral or positive mentions of a **protected class** or **protected characteristic**
## **HS1 Hate Crime Text** Criteria:
**Text to be classified** that seems to plan, organize, or coordinate activities that would be illegal if they happened to a person with a **specified protected characteristic** or a **protected class**. This is the most severe category of hate speech.
## **HS2 Dehumanizing Hate Text** Criteria:
**Text to be classified** that celebrates harm to or wishes harm upon a **protected class** or a person with a **specified protected characteristic**, or that asserts or implies that a **protected class** or a person with a **specified protected characteristic** is less than fully human.
## **HS3 Contextualized Hate Text** Criteria:
**Text to be classified** that includes hateful language, but does so as part of a passage that explains or relates a broader situation which is not hateful by itself. One way to think about contextualized hate is to ask whether some sentences or sub-sentences in the **text to be classified** would alone qualify as hate speech if they were taken out of context.

Policies with coarse categories are harder to calibrate than those which are more granular, because it is difficult to isolate the source of the model’s “confusion.” Such policies are also simply less useful for platforms where subtle distinctions matter.

6/ Specify Exclusions and Inclusions

Within each category, provide examples of both included and adjacent-but-excluded content. For typical categories, less content will match the category than will fall outside it, so it is usually helpful to put the exclusions first. This helps the LLM infer where to draw the lines around complex categories.

Example:

## **HS1 Hate Crime Text** Criteria:
**Text to be classified** that seems to plan, organize, or coordinate activities that would be illegal if they happened to a person with a **specified protected characteristic** or a **protected class**. This is the most severe category of hate speech.
### **HS1 Hate Crime Text** does not include:
**Text to be classified** that includes planning, organizing, or threatening violence, injury, or death against a person with a **specified non-protected characteristic**, a **non-protected class**, an unidentified person, or an unidentified group of people is not **HS1 Hate Crime Text** and should not receive the **text classification** HS1.
**Text to be classified** that includes planning, organizing, or threatening expulsion, segregation, exclusion, or discrimination against a person with a **specified non-protected characteristic**, a **non-protected class**, an unidentified person, or an unidentified group of people is not **HS1 Hate Crime Text** and should not receive the **text classification** HS1.
**Text to be classified** that includes planning, organizing, or threatening to damage or steal property owned by a person with a **specified non-protected characteristic**, a **non-protected class**, an unidentified person, or an unidentified group of people is not **HS1 Hate Crime Text** and should not receive the **text classification** HS1.
**Text to be classified** that includes planning, organizing, or threatening the genocide, ethnic cleansing, genetic purification, or extermination of a person with a **specified non-protected characteristic**, a **non-protected class**, an unidentified person, or an unidentified group of people is not **HS1 Hate Crime Text** and should not receive the **text classification** HS1.
### **HS1 Hate Crime Text** includes:
**Text to be classified** that includes planning, organizing, or threatening violence, injury, or death against a **protected class** or a person with a **specified protected characteristic** is **HS1 Hate Crime Text** and should receive the **text classification** HS1.
**Text to be classified** that includes planning, organizing, or threatening expulsion, segregation, exclusion, or discrimination against a **protected class** or a person with a **specified protected characteristic** is **HS1 Hate Crime Text** and should receive the **text classification** HS1.
**Text to be classified** that includes planning, organizing, or threatening to damage or steal property owned by a **protected class** or a person with a **specified protected characteristic** is **HS1 Hate Crime Text** and should receive the **text classification** HS1.
**Text to be classified** that includes planning, organizing, or threatening the genocide, ethnic cleansing, genetic purification, or extermination of a **protected class** or a person with a **specified protected characteristic** is **HS1 Hate Crime Text** and should receive the **text classification** HS1.

This convention significantly helps LLMs accurately discern the “shape” of the policy constructs by specifying both the “inner” and “outer” contours of a category.

Open Issues and Next Steps

While we believe implementing the practices described above can dramatically improve content moderation using LLMs, we acknowledge that they can result in policy documents that are onerous to write and maintain. Such documents can also be redundant to the point that they are hard for people to read.

Furthermore, general-purpose LLMs have two fundamental impediments that, at present, make them difficult to justify in production-grade content moderation systems: they cost too much per inference, and they are too slow for real-time moderation decisions. Further research and experimentation are clearly required before LLMs can be used robustly at scale.

However, these are solvable problems! We envision, for example, a new generation of LLMs that specialize in content policy interpretation and are fast, cheap, and accurate. Through our own research at the Stanford Cyber Policy Center, we aim to explore the state of the art of what’s possible, and we invite collaborators in the community to join us as we explore these issues.
