Why AI ‘Model Cards’ Are an Urgent Necessity for Child Safety
Camille François, Margaret Mitchell, Yacine Jernite, Vinay Rao, J. Nathan Matias / Apr 2, 2026

Last November, when the Canadian Centre for Child Protection published nationally representative data on the scale of sextortion and other forms of image-based online abuse, it was a wake-up call for child safety online. According to the survey of nearly 1,300 Canadian teens who were sexually victimized online, over half of the cases involved images, and 23% involved threats to post, send, or show others real or AI-generated images. Despite child safety protections by major tech platforms that promise to detect and intervene on intimate imagery of minors, 39% of respondents reported these experiences on Snapchat, 20% on Instagram, and 20% on Facebook.
Sextortion detection, particularly in settings involving minors, is increasingly urgent—in the United States, the National Center for Missing and Exploited Children (NCMEC) received nearly 100 reports of financial sextortion per day in 2024, with further surges in 2025. And while child safety algorithms have become an important talking point for technology makers eager to assure concerned parents, the open secret of child safety is that no one can say how well those algorithms work. This supply chain of ignorance represents a devastating risk to children's lives each day they log on. It's also an unexploded landmine for designers trying to create safer products, for technologists working to provide safety services to those designers, and for investors financing safer tech.
The rest of the AI industry routinely expects transparent documentation about algorithm performance. In 2019, researchers Margaret Mitchell, Timnit Gebru, and colleagues proposed “model cards”—short documents accompanying machine learning models that detail their intended use, performance, and limitations. This addressed a critical gap in the AI lifecycle by informing stakeholders about the specific strengths and weaknesses of models, for instance helping researchers shape their priorities, consumers navigate product choices, and developers decide which models to use on their platforms.
Since then, model cards have become an industry norm. The popular AI platform Hugging Face hosts over two million models, most with some form of documentation of the model’s strengths, limitations, and appropriate uses. Major AI labs publish detailed “system cards” alongside new releases. The EU AI Act is codifying similar transparency requirements into law.
Yet in child safety—a domain where AI models make high-stakes decisions affecting both the protection of vulnerable children and the rights of millions of users—these baseline transparency standards remain entirely absent.
The models used to detect CSAM, identify grooming, or flag self-harm, overwhelmingly remain black boxes despite being widely deployed. Little standardized documentation is provided about how they work, where they fail, and, importantly, what biases they carry. This has significant ramifications for creating safe online environments: When we cannot know what these systems work well for, who they are failing, and where there are gaps that may be augmented in other ways, we risk harming those who are most vulnerable online.
This needs to change.
How black box algorithms put children at risk
Our argument for model cards is practical. The absence of structured performance data on child safety systems is a significant impediment to good policymaking on child safety. In the EU, for instance, the lack of technical transparency about how these tools work and where they fail has brought a number of child safety programs to a halt. The same opacity is just as damaging for developers. Based on our experience, organizations routinely give up on child safety tools because they cannot evaluate their performance or figure out how to integrate them.
Consider the two main approaches to novel CSAM detection in use across the industry. The first trains classifiers directly on confirmed CSAM—material verified by organizations like NCMEC or the Internet Watch Foundation (IWF)—teaching models to recognize visual patterns of abuse. The second uses proxy methods: combining separate models for nudity detection and age estimation to infer whether content may constitute CSAM, without having to train AI models on actual abuse material, which is illegal to possess in most jurisdictions.
These approaches produce meaningfully different outcomes. A model trained on real CSAM can detect a broader range of indicators that proxy methods miss—for example, abuse imagery where the child’s face is not visible, or where nudity is not explicit. Proxy systems may handle certain edge cases differently but miss CSAM that doesn’t conform to their signals. Each carries different trade-offs in precision and recall, different failure modes, and different implications for those most affected by those failures.
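The proxy approach described above can be sketched in a few lines. This is a purely illustrative mock-up, not any vendor's actual system: the thresholds, field names, and decision rule are all hypothetical, and real deployments combine many more signals. The sketch makes the failure mode concrete: content is flagged only when both proxy signals fire, so imagery where nudity is not explicit, or where age estimation is unreliable, slips through.

```python
# Illustrative sketch of a "proxy" CSAM-detection pipeline: two independent
# classifiers (nudity detection and age estimation) whose outputs are
# combined to decide whether to flag content for human review.
# All thresholds and names here are hypothetical.

from dataclasses import dataclass


@dataclass
class ProxyResult:
    nudity_score: float   # 0.0-1.0, output of a nudity classifier
    estimated_age: float  # years, output of an age-estimation model
    flagged: bool
    reason: str


def proxy_csam_check(nudity_score: float, estimated_age: float,
                     nudity_threshold: float = 0.8,
                     age_threshold: float = 18.0) -> ProxyResult:
    """Flag content only when BOTH proxy signals fire.

    Illustrates why proxy systems miss CSAM that does not conform to
    their signals: if either model fails (e.g. no explicit nudity, or
    the face is not visible for age estimation), nothing is flagged.
    """
    flagged = nudity_score >= nudity_threshold and estimated_age < age_threshold
    if flagged:
        reason = "explicit nudity + apparent minor"
    elif nudity_score < nudity_threshold:
        reason = "no explicit nudity detected (possible false negative)"
    else:
        reason = "estimated age above threshold (possible false negative)"
    return ProxyResult(nudity_score, estimated_age, flagged, reason)
```

Note how the trade-offs discussed below fall directly out of the two thresholds: lowering them raises recall at the cost of precision, which is exactly the kind of information a model card should disclose.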
Yet a trust and safety engineer evaluating tools today typically cannot determine which approach a given vendor uses—let alone how the model performs across critical dimensions. What is the false negative rate on AI-generated CSAM? How is this affected by factors such as gender and skin tone? How does age estimation in proxy CSAM models perform across different demographics? Does the model perform differently on video versus still images? Without answers, teams set thresholds incorrectly, misallocate human review resources, lose confidence in their tools, and in the worst cases abandon them entirely—leaving users more vulnerable to harm.
Knowing where a model fails is not a weakness to hide; it is critical intelligence for organizations. A team that knows its classifier underperforms on certain kinds of imagery can build complementary safeguards, route that content to human review, or discuss platform-level changes to take pressure off the automatic classification systems. A team that lacks an understanding of a system's gaps cannot create an appropriately robust safety net.
Why child safety lags behind the rest of AI
The story of how model cards became standard elsewhere is partly an economic one. The general-purpose machine learning (ML) ecosystem is intensely competitive. Hundreds of models compete for developer adoption. In that environment, transparency became a competitive advantage: developers choosing between similar models needed reliable information about performance and limitations. Providers who offered it earned trust. Those who didn’t were passed over.
Child safety experienced none of this pressure, and its underlying market is among the most concentrated in all of technology. For detecting known CSAM (material previously verified and hashed), one tool dominates: Microsoft's PhotoDNA, dating back to 2009. For detecting novel or previously unseen abuse material (the category most likely to represent a child in active, ongoing harm, and the fastest-growing type of CSAM), the picture is starker still. Fewer than five commercially available classifiers exist for this purpose, with Thorn's Safer and Google's Content Safety API accounting for the overwhelming majority of the third-party novel CSAM classifiers disclosed by platforms in regulatory filings, congressional testimony, and transparency reports. For grooming detection, the market is even thinner.
So the market is small, buyers are few, and switching costs are high. When you are one of only a few providers in a domain where the underlying data is illegal to possess, you face many hurdles (and few incentives) to produce rigorous analyses and the accompanying transparency that would inform parents, technologists, and policymakers about the level of protection children actually receive.
What would a model card include? Aside from baseline technical specifications, algorithm makers could report how well the model performs on photographic versus AI-generated CSAM. They could report an algorithm's detection and error rates by skin tone, gender, age, and region. When organizations considering these models have a clearly legible artifact of how well a system performs, they can design their child safety programs around the strengths and weaknesses of a given algorithm.
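To make the proposal concrete, here is a minimal sketch of what such a model card might look like as structured data. Every field name and figure below is hypothetical and illustrative, not drawn from any real vendor's tool; the point is that disaggregated numbers let a deployer read a model's gaps directly.

```python
# A hypothetical, minimal model-card structure for a child safety
# classifier, with the disaggregated fields suggested above.
# All names and numbers are illustrative.

model_card = {
    "model": "example-novel-csam-classifier",  # hypothetical name
    "intended_use": "Triage of user uploads for human review",
    "out_of_scope": ["standalone enforcement without human review"],
    "training_data": "Described at the level the law permits",
    "evaluation": {
        # Detection rates disaggregated as the article recommends.
        "photographic": {"recall": 0.91, "precision": 0.88},
        "ai_generated": {"recall": 0.74, "precision": 0.80},
    },
    "known_limitations": [
        "Lower recall on AI-generated imagery",
        "Age estimation degrades when the face is not visible",
    ],
}

# A deploying team can mechanically surface weak spots and route
# those content categories to additional human review:
weak_spots = [category for category, metrics in model_card["evaluation"].items()
              if metrics["recall"] < 0.8]
```

In this toy example, `weak_spots` would contain `"ai_generated"`, telling a trust and safety team exactly where to add complementary safeguards.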
The ecosystem is now beginning to diversify, with new open-source tools and more providers entering the space. As it does, the field has an opportunity—and an obligation—to catch up with AI transparency norms.
Transparency improves human-AI cooperation on safety
Child safety is not a single problem solved by a single tool but a constellation of challenges, each addressed by different detection technologies that share the same transparency deficit.
Grooming, for instance, describes interactions in which someone methodically builds up trust with a child to set up further abuse. Identifying these interactions early is an important component of child safety. Grooming detection models analyze conversational patterns across languages, slang, and evolving predatory tactics. What data were they trained on—real predator conversations, synthetic data, or content from specific platforms? How do they handle multilingual contexts? Without documentation, deployers cannot assess whether a model fits their user base.
Self-harm detection involves especially sensitive trade-offs. Overly aggressive detection can surveil vulnerable young people seeking support, while under-detection misses genuine crises. How does a model distinguish content that discusses self-harm from content that promotes it, and does it encode potentially marginalizing assumptions about what counts as normal behavior? What do these choices mean for handling false negatives without introducing new risks to the user?
Across every sub-domain, the pattern is the same: Organizations deploying these models need structured information about what the models do, how they were built, and where they fail. Model cards are the established mechanism for providing it.
Don’t wait for perfect benchmarks
Anyone who has worked in this space is familiar with two legitimate objections to broad calls for transparency. The first is about adversaries: doesn't publishing information about how detection models work (including where they fail) help bad actors evade them? The second objection is about benchmarks: how do you document model performance when standardized evaluation infrastructure doesn't exist? The data is illegal to possess in most jurisdictions; reliable labels require expert annotation; the threat landscape evolves constantly.
The adversarial concern has a name in parallel domains like cybersecurity: security through obscurity. The field's broad consensus is that relying solely on hiding or obscuring information to ensure safety does more harm than good. Adversaries find their own methods for discovering obscured information and exploiting weaknesses. The recommended approach in cybersecurity is “defense in depth”: designing systems so that no single exploit can lead to compromise. This logic translates well to child safety.
The benchmarks problem is partly a tragedy of first movers: no one wants to be the first to publish imperfect performance data without comparable figures from competitors to contextualize it. But this logic, followed collectively, produces permanent silence, and we must start with small steps wherever we can. Model cards can document what evaluations have been conducted, even imperfect ones. They can describe test data to the extent possible, the metrics computed, and the known limitations of those metrics.
A promising path forward involves closed evaluation servers operated by trusted entities (e.g., NCMEC or the IWF): a curated test set of verified content maintained securely, against which developers submit models for evaluation without direct data access. Results, disaggregated by content type, demographics, and provenance, would be returned in a standardized format suitable for model cards. This approach could help address the benchmarks problem without requiring anyone to possess CSAM: it mirrors paradigms already in use in sensitive domains such as secure medical data enclaves.
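The closed-evaluation workflow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a specification: the trusted entity holds the labeled test set inside its enclave, the developer submits only a prediction function (or inference endpoint), and only aggregate, disaggregated metrics ever leave.

```python
# Sketch of a closed evaluation server: the developer-submitted
# `predict` function runs inside the enclave against data it never
# sees directly, and only per-group aggregate metrics are returned.
# Data schema and grouping keys here are hypothetical.

from collections import defaultdict


def closed_evaluation(predict, test_set):
    """Evaluate a submitted model against the enclave's test set.

    `test_set` is a list of dicts with keys "features", "label"
    (1 = positive), and a grouping attribute "provenance"
    (e.g. "photographic" vs. "ai_generated").
    Returns only recall per group, suitable for a model card.
    """
    stats = defaultdict(lambda: {"tp": 0, "fn": 0})
    for item in test_set:
        prediction = predict(item["features"])
        group = item["provenance"]
        if item["label"] == 1:
            stats[group]["tp" if prediction == 1 else "fn"] += 1
    # Only these aggregates leave the enclave -- never raw content.
    return {group: s["tp"] / (s["tp"] + s["fn"])
            for group, s in stats.items() if s["tp"] + s["fn"] > 0}
```

The design choice doing the work here is the boundary: the developer's code comes in, aggregate numbers go out, and the verified material never changes hands, which is the same pattern used in secure medical data enclaves.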
Child safety sits at an inflection point. The ecosystem of tools is diversifying, regulatory pressure is mounting, and the scale of the challenge demands that we get this right.
The broader ML community arrived at the understanding years ago that models should come with documentation. It is past time for the child safety field to catch up.
We call on model developers, platforms, technologists, regulators, and civil society to work together on standardized model card templates for child safety tools, on shared evaluation infrastructure, and on a culture of disclosure that treats transparency not as a burden but as a baseline.
Children deserve to be protected as robustly as possible—and that requires tools we can actually understand.