How to Say No to the Next AI Release
Jessica Newman, Deepika Raman, Krystal A. Jackson, Nada Madkour, Evan R. Murphy / Jun 16, 2025
Anthropic CEO Dario Amodei photographed at a TechCrunch event in 2023. Wikimedia
New AI models are being released at a dizzying pace, with announcements of capability improvements nearly every week. But something has meaningfully changed over the last few months, and it requires a change of course. We have started to approach established risk thresholds, which means we must learn how to say no to future AI releases.
Risk thresholds have largely been defined by prominent AI companies, including OpenAI, Anthropic, Google, and Meta. They are designed to address the most disturbing and dangerous risks from AI: the potential to carry out sophisticated cyberattacks, aid in the development of biological weapons, or self-replicate without human oversight.
These risks may sound like future concerns or even science fiction, but the reality is that currently available AI models have already made significant advances on all three. The situation is serious enough that even in the absence of regulatory mandates, developers have begun proactively implementing safeguards and mitigations, an implicit admission that the risks are real, imminent, and demand action.
But let’s be clear: these are risks to all of us, and private companies must not be the only ones deciding what is “safe enough.” Other industries with the potential to cause critical and possibly irreversible harm are subject to rigorous safety standards, and we have unfortunately seen the tragic outcomes when those standards slip from the forefront.
The development and enforcement of risk thresholds is a new cornerstone of AI governance, and these thresholds must be determined and overseen by a broad subset of society that includes government, academia, and civil society.
What are the risks?
On May 22nd, Anthropic released the model Claude Opus 4, which it described as “the world’s best coding model, with sustained performance on complex, long-running tasks and agent workflows.” In a 123-page System Card, the company described the risks of this new model in detail. It states, “Overall, we find concerning behavior in Claude Opus 4 along many dimensions.”
The concerning behavior includes taking “extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down”; “locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing”; “readily taking actions like planning terrorist attacks when prompted”; and “substantially increased risk in certain parts of the bioweapons acquisition pathway.”
Anthropic has determined that these behaviors do not constitute major new risks, but the company is sufficiently rattled that it has implemented additional security measures, which it calls AI Safety Level 3 (ASL-3) Standards. These include protections against universal jailbreaks and the potential theft of model weights.
It is not only Anthropic that finds its latest models nearing established risk thresholds. Google released Gemini 2.5 Pro Preview in March and reported notable improvements on cyberattack tasks, specifically in assisting with high-impact cyberattacks that could reduce the cost and resources required by an order of magnitude or more. In its Model Card, Google wrote, “The model’s performance is strong enough that it has passed our early warning alert threshold.”
While these risks have mostly been contained to testing and controlled environments, the concern is that it is difficult to adequately anticipate or mitigate all of the possible manifestations of these risks. We have already seen how bias and discrimination, AI risks that have long been talked about, tracked, and recorded, continue to show up in the real world despite the attention they have received. It will take many actors, including policymakers and sociotechnical experts, to ensure foreseeable AI risks do not cause large-scale harm.
Moreover, these risks may be drastically amplified as AI model capabilities evolve as a result of model-to-model collaboration. For example, a recent study found that a team-of-agents framework achieved a 4.3x improvement over single-agent frameworks on a cyber vulnerability benchmark.
What are intolerable risk thresholds?
Intolerable risk thresholds – and related concepts including capability thresholds, critical capability levels, red lines, and others – are designed to define a measurable point at which specific dangerous scenarios become possible. Intolerable risk thresholds are one of the only barriers to the development and release of truly dangerous AI models, and one of the few mechanisms by which companies may be willing to say no to a future release.
Many leading AI companies from around the world have committed to upholding the 2024 Seoul Summit's Frontier AI Safety Commitments, which called for publishing Safety Frameworks and establishing intolerable risk thresholds “with input from trusted actors.”
Google, Anthropic, Meta, and OpenAI, among others, have now released safety frameworks. While these industry frameworks are a welcome and important first step, the differences in their scope, the ambiguity of their methods and reasoning, the divergence in their taxonomies, and the absence of real external accountability make them insufficient in the face of the scale of risks that frontier AI models may pose to public safety.
One of the challenges is how to measure and quantify meaningful thresholds. Compute thresholds are common in policy frameworks but have significant limitations. Industry has largely focused on capability thresholds, while risk thresholds have been steadily gaining traction in reports on AI safety and governance, with promising efforts emerging across the AI ecosystem.
Why do we need independent oversight?
Relying solely on voluntary self-governance leaves critical gaps in protections against high-risk AI systems that can destabilize economies, manipulate societies, exacerbate structural inequalities, or result in the loss of lives and property. In a recent survey of AI experts, intolerable risk thresholds were identified as one of the highest-rated measures for reducing the systemic risks associated with general-purpose AI, with experts also commenting that their effectiveness may be contingent upon the thresholds being identified by legitimate third parties.
Sectors such as food safety and pharmaceuticals have long recognized the dangers of self-governance and do not allow manufacturers to serve as the sole arbiters of safety, instead establishing independent institutions and procedures to determine risk tolerance, evaluate harms, and enforce compliance. These arrangements reflect a broader governance principle: entities with vested interests in the deployment of technologies or the distribution of goods must be subject to external scrutiny. As AI adoption continues to grow and permeate safety-critical applications, the justification for treating it as an exception to this principle has weakened.
The need for independent testing has been echoed since the inaugural AI Safety Summit at Bletchley Park, where the concept of AI Safety Institutes (AISIs) emerged. But while the AISIs’ early work is promising, independent oversight needs to go beyond evaluation: it should help determine thresholds, monitor risk levels, and provide warnings as we approach them.
What can we do about it?
Independent oversight is becoming increasingly central to operationalizing reliable thresholds for managing intolerable risks from AI. Oversight here refers to the participation of independent actors, including governments, academic institutions, public-interest research organizations, standard-setting bodies, and civil society groups, with adequate autonomy not only to verify thresholds but also to identify and operationalize them. While cross-industry initiatives are promising, they cannot necessarily be characterized as independent, given their near-exclusive funding by leading frontier model developers.
In a recent white paper from the Center for Long-Term Cybersecurity at the University of California, Berkeley, we provide some independent academic guidance. We identify "thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable." The paper provides key principles and considerations on setting and operationalizing intolerable AI risk thresholds, along with case studies that contain detailed risk and threshold analyses on a subset of risks. We provide specific threshold recommendations across eight AI risk categories: 1) Chemical, Biological, Radiological, and Nuclear (CBRN) Weapons, 2) Cyber Attacks, 3) Model Autonomy, 4) Persuasion and Manipulation, 5) Deception, 6) Toxicity, 7) Discrimination, and 8) Socioeconomic Disruption.
Our key recommendations include designing thresholds with adequate margins of safety to accommodate uncertainty and establishing thresholds through multi-stakeholder deliberation. The paper aims to inform both industry and government efforts to operationalize AI safety by recommending specific risk thresholds. Forthcoming work will provide further detail and recommendations.
Why now?
Companies at the forefront of AI development are indicating that their models are already close to surpassing their established thresholds. In the Model Card for Gemini 2.5 Pro Preview back in March, Google admitted, “we find it possible that subsequent revisions in the next few months could lead to a model that reaches the CCL [critical capability level]. In anticipation of this possibility, we have accelerated our mitigation efforts and are putting in place our response plan.”
It is important that Google shared this update publicly and is pursuing mitigations, but the fact that it alone decides how the threshold is set, whether it has been crossed, and how to respond is chilling.
In light of multiple AI models nearing risk thresholds, the underwhelming response to critical and potentially dangerous advancements, and the insufficiency of organizational safety frameworks, the need for independent oversight of risk thresholds cannot be overstated.
Model developers must confront the reality that AI models may reach intolerable risk thresholds in the near future, and unfortunately, mitigation efforts are often not sufficiently robust to be the sole line of defense. Having independently identified risk thresholds in place provides a commonsense mechanism – and one that can be shared across the AI industry – to avoid excessive risk. If an AI model approaches an intolerable risk threshold, response plans need to include saying no.