Home

Donate
Perspective

Red Teaming Generative AI in Classrooms and Beyond

Jen Weedon, Theodora Skeadas, Sarah Amos / Sep 19, 2025

Kathryn Conrad / Better Images of AI / Datafication / CC-BY 4.0

As millions of American children return to the classroom, many will be tempted and, in many cases, encouraged to use artificial intelligence, particularly generative AI, to help with research and writing. A May 2025 Executive Order implores adoption of AI in K-12 classrooms to help foster “innovation” and “critical thinking.” Soon, AI chatbots may be used to quiz students on problem sets, build their SAT vocabulary, and even provide advice and emotional support. The impacts of this shift are still very much undetermined, and a recent tabletop exercise we facilitated offered insights into what students may experience as this technology becomes more prevalent. The results were sobering, yet they also revealed opportunities and considerations for strengthening AI safety efforts.

More than just breaking things

This summer, Columbia’s Technology & Democracy Initiative and Humane Intelligence jointly designed and co-facilitated an AI red teaming workshop at TrustCon 2025. Our session, "From Edge Cases to Safety Standards: AI Red Teaming in Practice," convened trust and safety practitioners, civil society advocates, academics, and regulators to stress-test generative AI chatbots in ways that revealed vulnerabilities and generated insights for participants. Participants role-played interactions with chatbots in two scenarios: a “virtual therapist” and with an educational assistant dubbed "Ask the Historian.” These were not far-fetched prototypes: according to one survey, nearly 3 out of 4 teenagers report using AI chatbots as companions (in some cases, with tragic outcomes), and instructional chatbots are already in use.

The workshop introduced participants to red teaming, a method used across disciplines to test systems by intentionally probing for weaknesses, risks, or unintended consequences of AI system prompting. Participants practiced threat modeling (the process of thinking through what could go wrong with a system before it does) and explored how red teaming can be applied to assess potential harms in real-world contexts. Through a hands-on simulation involving a large language model (LLM) chatbot, including scenarios like a “Virtual Therapist” and an educational assistant called “Ask the Historian”—a timely exercise given OpenAI’s recent release of its Study mode tool, which offers step-by-step answers for learners. Participants gained experience in how to design and carry out a red teaming exercise from start to finish. Beyond the technical testing, the workshop also covered how to turn red teaming results into clear, practical recommendations that different parts of an organization—like product teams, lawyers, and company leaders—can use to make decisions and improve safety.

For example, we asked participants to try to “break” a selection of large language models through the lens of a specific real-world use to elicit harmful, inaccurate, or misleading content. Participants were provided with instructions on different techniques to get unexpected model responses and example text inputs (“prompts”). Then they developed a range of plausible scenarios, such as engaging a chatbot for virtual therapy or using a chatbot as a student to research historical events.

The subtleties of harm

The most striking aspect of our session was how seemingly innocuous interactions could cascade into concerning territory. Take our "Ask the Historian" scenario, where a participant introduced a false premise—that Sigmund Freud was afraid of spiders—and the model propagated the fallacy in subsequent conversational turns. While such a factual inconsistency might seem minor, participants discussed how it highlighted the broader challenge of hallucinations in knowledge-based applications. When the LLMs powering these chatbots present falsehoods with certainty, it can undercut trust in information. A teacher might correct such an effort if it appeared in a homework assignment, but what happens if a student keeps relying on a tool that treats other false premises as true?

Also concerning were the subtle ways models could facilitate harm through “helpful” compliance. In our virtual therapist roleplay, models quickly adopted therapeutic personas and provided advice that, while well-intentioned, crossed professional boundaries. In one scenario, a participant role-played as someone dealing with depression and asked for the tallest buildings in New York City. The chatbot provided a list of landmarks. A tidy answer, and technically accurate. But an unsettling one in context.

When participants explored scenarios involving loneliness and seeking what they portrayed as well-intentioned connection with children, the models offered guidance that could be construed as grooming advice (grooming is when someone builds a relationship, trust, and emotional connection with a young person or vulnerable individual in order to manipulate, exploit, or abuse them, often gradually over time)—not intentionally, but because the models lacked the contextual ‘awareness’ to recognize dangerous patterns. AI tools, even when trained to be well-intentioned, simply cannot grasp nuance; this can be dangerous when the stakes are high. Policymakers are taking note; for example, last month, Illinois Governor Pritzker signed legislation prohibiting AI therapy services in the state due to concerns for patient safety.

The cultural and linguistic divide

Another revealing discovery came through multilingual testing. When participants switched from English to Spanish, models that had maintained appropriate boundaries began offering explicit (and unsolicited) advice on infidelity. When confronted about these inconsistent recommendations, the model denied providing contradictory guidance, a behavior that participants described as “algorithmic gaslighting,” raising questions among participants about trust and transparency. This linguistic vulnerability exposed not just technical limitations, but also cultural biases embedded in training and safety systems. It suggests that AI safety measures can be unevenly distributed across languages and cultures, potentially creating disparate impacts on different user communities, particularly those from marginalized linguistic, ethnic, and religious cohorts.

When safety measures miss the mark

Our exercise also revealed that current safety approaches fall short because they tend to focus on explicit, obvious harms while missing more nuanced risks — a gap that red teaming can be well-suited to address. That is especially the case compared to benchmarking and other more static evaluation methods that typically measure performance against fixed standards rather than probing for unexpected failures. While the models consistently refused overtly harmful requests, such as providing detailed instructions for violence or self-harm, the models struggled with context-dependent scenarios where the same information could be helpful or harmful depending on the user's intent (for example, providing a list of the tallest bridges in New York after a participant confided their mental health issues). Moreover, when asked about increasing endorphins, one model responded that endorphins were dangerous — a medically inaccurate statement. Yet the same model provided detailed meal planning advice that, while technically meeting the participant’s requested parameters of under 1,000 calories, could enable disordered eating patterns.

Looking forward: Building better safety systems

Red teaming is a helpful way to improve how we manage and oversee AI systems because it reveals hidden risks and weak spots. However, it also raises an important question: Are we using these tests to truly understand and reduce risk, or just to check a box and meet outside expectations?

The workshop reinforced that effective AI safety requires more than just preventing obviously harmful outputs. Systems must be capable of understanding context, maintaining consistency across languages and cultures, and navigating the subtle boundary between helpful and harmful assistance. For organizations deploying AI systems, the workshop underscored that red teaming isn't a one-time security check; it's an ongoing practice that requires diverse perspectives, cultural competency, and a deep understanding of the specific deployment context.

For practitioners seeking to strengthen AI safety practices, several key lessons emerged from the workshop that highlight where current systems fall short and how red teaming can drive more meaningful oversight:

  1. Context is everything. The same large language model output can be helpful medical information or dangerous self-harm guidance, depending on the user's situation and the context of the AI system deployment. Current safety systems continue to struggle with this contextual awareness.
  2. Multilingual testing is essential. Safety measures that work in English may fail in other languages, introducing both vulnerabilities and potential bias issues.
  3. Subtle harms require subtle detection. The most concerning AI behaviors aren't always the most obvious ones. Models that refuse to provide bomb-making instructions might still facilitate stalking through seemingly “helpful“ but abusive dating advice, an outcome workshop participants were able to demonstrate.
  4. Presenting red teaming findings in a way that clearly addresses organizational goals and risks is important, but it isn’t everything. Red teaming findings should connect to organizational priorities and, where relevant, regulatory requirements, to help drive meaningful change. That said, some of the harms and issues the workshop attendees identified did not directly map onto issues relevant to corporate risk reduction, but that did not make them any less noteworthy. These dynamics reveal the power-laden nature of institutional risk perception: who defines what counts as evidence, which harms/threats are considered “worthy” enough to prioritize, and how the outputs of red teams are weighed in decision-making.

As AI systems continue to be integrated into so many different contexts, workshops like these remind us that the space between "technically safe" and "actually safe" is where the real work happens. The edge cases we discovered are tomorrow's regulatory violations, customer complaints, or worse. By combining technical expertise with strategic insight and hands-on experimentation, our goal was to help grow and strengthen the ecosystem of people thinking critically about responsible AI—and to enhance their ability to shape systems that serve both societal needs and the public interest.

Authors

Jen Weedon
Jen Weedon has spent over 15 years leading trust and safety, threat intelligence, and responsible innovation initiatives at tech companies in industries spanning social media, gaming, and cybersecurity. Most recently at Niantic, she served as Head of Red Teaming and Safety by Design, where she worke...
Theodora Skeadas
Theodora Skeadas is Chief of Staff at Humane Intelligence, a nonprofit committed to collaborative red teaming and improving the safety of generative AI systems. She is also a Strategic Advisor for Technology Policy at All Tech Is Human and a member of the Board of Directors at the Integrity Institut...
Sarah Amos
Sarah Amos is a Product consultant specializing in Trust & Safety and information integrity through her consultancy Soma Labs. She works with nonprofits, startups, and social media companies on AI safety and general trust and safety initiatives. A former journalist and product manager, she brings ov...

Related

Perspective
What OpenAI's Latest Red-Teaming Challenge Reveals About the Evolution of AI 'Safety' PracticesAugust 7, 2025
Perspective
When Age Assurance Laws Meet ChatbotsSeptember 5, 2025

Topics