Safety by Design for Generative AI: A Comprehensive Framework to Prevent Child Sexual Abuse
By CAIROS AI Research Team
In July 2024, Thorn and All Tech Is Human released a groundbreaking document that represents a milestone in the fight to protect children from AI-enabled exploitation: “Safety by Design for Generative AI: Preventing Child Sexual Abuse.”
This isn’t a theoretical framework or aspirational guidance. It’s a tactical, actionable, multidisciplinary resource developed collaboratively by AWS AI, Civitai, Hugging Face, Inflection, Metaphysic, Stability AI, and Teleperformance—companies spanning the entire generative AI ecosystem. The framework is designed so that technical, policy, product, or trust and safety teams can implement its recommendations with minimal friction.
The document’s central argument is clear and urgent: “We are at a crossroads with generative AI.” Just as the internet accelerated offline sexual harms against children into the online realm, generative AI is creating new vectors for exploitation. But unlike the internet’s development—where child safety considerations came as an afterthought in the 1990s—this moment offers the opportunity to build protection into AI systems from the ground up.
The choice is stark: act proactively now, or face retrofitting safeguards after children have already been harmed.
The Crisis Context: 100 Million Files and Growing
The child safety ecosystem is already stretched beyond capacity:
National Center for Missing and Exploited Children (NCMEC) Reports:
- 2022: Over 88 million files of CSAM and related exploitation material
- 2023: Over 100 million such files
This exponential growth predates the widespread availability of generative AI tools. Now, as AI-generated child sexual abuse material (AIG-CSAM) becomes easier to create, the crisis is accelerating.
Four Profound Implications for Child Safety
The document outlines how misuse of generative AI creates harm across four critical dimensions:
1. Impedes Victim Identification
“Victim identification is already a needle in the haystack problem for law enforcement: sifting through huge amounts of content to find the child in active harm’s way. The expanding prevalence of AIG-CSAM is growing that haystack even further, making victim identification more difficult.”
When law enforcement investigators cannot quickly distinguish between AI-generated content depicting a synthetic child and authentic CSAM depicting a real child currently being abused, resources get misdirected. Every hour spent trying to identify a non-existent child is an hour not spent rescuing a real child in danger.
2. Creates New Ways to Victimize and Re-Victimize Children
Bad actors are using broadly shared models and fine-tuning them on existing child abuse imagery to:
- Generate additional explicit images of identified and unidentified survivors
- Create images matching the exact likeness of particular children in new poses and acts
- Sexualize benign imagery of children without their knowledge or involvement
- Scale grooming and sexual extortion efforts
- Enable children to create explicit AI-generated images of their peers for bullying and harassment
Each of these represents either new victimization or re-traumatization of survivors.
3. Reduces Social and Technical Barriers to Sexualizing Minors
“The ease of creating AIG-CSAM, and the ability to do so without the victim’s involvement or knowledge, may perpetuate the misconception of this content being ‘harmless’.”
This is particularly dangerous because:
- Research suggests viewing CSAM reinforces abusers’ fantasies
- Viewing CSAM is associated with heightened risk for committing hands-on abuse
- The technical barrier to creating abuse material has collapsed—anyone with internet access can now generate such content
- The psychological barrier is lowered by the false belief that “no real child was involved”
Bad actors are also using generative AI companions that mimic children’s voices for fantasy sexual role-play, further normalizing the sexualization of minors.
4. Enables Information Sharing for Abuse Proliferation
Generative AI models—particularly text and image editing tools—are being used to provide:
- Instruction for hands-on sexual abuse of children
- Information on coercive control techniques
- Details on destroying evidence and manipulating artifacts of abuse
- Advice on ensuring victims don’t disclose
This transforms AI systems into abuse enablement tools.
Safety by Design: Expanding the Concept for AI
The document builds on existing Safety by Design frameworks from Australia’s eSafety Commissioner and others, but expands the concept to encompass the entire machine learning/AI lifecycle, regardless of:
- Data modality (text, image, video, audio)
- Release model (closed source, open source, or hybrid)
- Organizational role in the AI ecosystem
The Three Phases: Develop, Deploy, Maintain
Rather than treating safety as a one-time consideration before release, the framework requires continuous attention throughout:
- Develop: Research and development to build the desired ML/AI model
- Deploy: Integrating the model into production or otherwise making it available for use
- Maintain: Sustaining model quality in the face of data drift and evolving threats
The Five Stakeholder Categories
The framework recognizes that child safety requires action across the entire digital ecosystem:
- AI Developers: Organizations that build generative AI technology
- AI Providers: Organizations that host ML/AI models (first-party and third-party)
- Data Hosting Platforms: Organizations that provide datasets for training
- Social Platforms: Services that facilitate user interactions and content sharing
- Search Engines: Services that index and provide access to online content
Each stakeholder has specific responsibilities and opportunities to prioritize child safety.
The Nine Core Principles
The framework articulates nine foundational principles across the AI lifecycle:
Develop Phase (3 Principles)
1. Responsibly source your training datasets, and safeguard them from CSAM and CSEM
The presence of CSAM and child sexual exploitation material (CSEM) in training datasets is one avenue through which models learn to reproduce abusive content. Some models can also use “compositional generalization” to combine concepts (like adult sexual content and non-sexual depictions of children) to produce AIG-CSAM even without explicit abuse material in training data.
2. Incorporate feedback loops and iterative stress-testing strategies in your development process
“If you don’t stress test your models for these capabilities, bad actors will do so regardless.” Structured, scalable red teaming throughout development—with findings integrated back into training—is essential.
3. Employ content provenance with adversarial misuse in mind
AI-generated content is photorealistic and can be produced at scale. Content provenance solutions that reliably identify AI-generated content are “crucial to effectively respond to AIG-CSAM” and prevent resource misdirection by law enforcement.
Deploy Phase (3 Principles)
4. Safeguard your generative AI products and services from abusive content and conduct
Combat and respond to abusive content throughout generative AI systems. Incorporate prevention efforts and user reporting mechanisms.
5. Responsibly host your models
Assess models via red teaming or phased deployment before hosting. Have clear rules prohibiting models that generate child safety violative content. For third-party hosting platforms, implement screening before allowing models to be shared.
6. Encourage developer ownership in safety by design
Provide information about models, including a child safety section detailing steps taken to prevent downstream misuse. Support the developer ecosystem in addressing child safety risks.
Maintain Phase (3 Principles)
7. Prevent your services from scaling access to harmful tools
Remove from platforms and search results: models built specifically to produce AIG-CSAM, services used to “nudify” content of children, and other tools explicitly designed for child exploitation.
8. Invest in research and future technology solutions
Stay current with new harm vectors and threats. Maintain quality of mitigations to meet emerging avenues of misuse. Invest in technology to protect user content from AI manipulation.
9. Fight CSAM, AIG-CSAM and CSEM on your platforms
Detect and remove child safety violative content. Combat fraudulent uses of generative AI to sexually harm children.
Recommended Mitigations: From Principles to Practice
The bulk of the document provides specific, actionable mitigations to implement these principles. Each mitigation includes:
- Expected impact (significant or incremental)
- Scope (narrow, medium, or wide—indicating coordination required)
- Variables that can be scaled based on resources
- Relevance to specific stakeholder types and open/closed source models
- Resources for implementation
The mitigations are ordered by impact and scope, allowing organizations to prioritize the most significant and feasible interventions first.
Critical Mitigations in the Develop Phase
Responsibly Source Your Training Data (Impact: Significant, Scope: Narrow)
“Avoid ingesting into your training data, any data that have a known risk (as identified by relevant experts in the space) of containing CSAM and CSEM.”
This includes removing from data collection pipelines sources known for proliferating CSAM. Document procedures thoroughly and train employees on them.
Why it matters: Makes it more difficult for bad actors to directly misuse or fine-tune models to generate AIG-CSAM and CSEM.
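As a toy illustration of excluding known-risk sources at collection time, the sketch below filters crawl URLs against a domain blocklist maintained with input from child-safety experts. The file name, function names, and blocklist format are assumptions for illustration only; the framework itself does not prescribe an implementation.

```python
# Hypothetical collection-time filter: drop URLs whose host is on an expert-supplied
# blocklist of domains known to proliferate abusive content. Names and file format are assumptions.
from urllib.parse import urlparse

def load_blocklist(path: str = "blocked_domains.txt") -> set[str]:
    """Load a newline-delimited list of blocked domains."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_allowed_source(url: str, blocked_domains: set[str]) -> bool:
    """Reject a URL if its host is a blocked domain or any subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in blocked_domains)
```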
Detect, Remove and Report CSAM and CSEM from Training Data (Impact: Significant, Scope: Narrow/Medium)
Use available tools (classifiers, hashing/matching technology) to identify abuse data in datasets and exclude it before training. Report content to governing authorities and notify dataset curators.
Resources include: Google Child Safety Toolkit, Microsoft PhotoDNA, NCMEC Hash Sharing API, IWF Services, Thorn’s Safer, and others.
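As a minimal sketch of the hash-matching step, the snippet below streams each dataset file, computes an exact cryptographic hash, and flags anything that matches a shared hash list so it can be excluded, reviewed, and reported. Real deployments rely on the perceptual-hashing and classification services listed above rather than exact matching alone; the function names and directory layout here are assumptions.

```python
# Illustrative exact-match screening against a shared hash list; production systems
# additionally use perceptual hashing and classifiers via the vendor tools listed above.
import hashlib
from pathlib import Path

def quarantine_known_matches(dataset_dir: Path, known_hashes: set[str]) -> list[Path]:
    """Return dataset files whose SHA-256 appears in a shared hash list."""
    matches = []
    for path in dataset_dir.rglob("*"):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            digest = hashlib.file_digest(f, "sha256").hexdigest()  # Python 3.11+
        if digest in known_hashes:
            matches.append(path)  # exclude from training, then review and report
    return matches
```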
Separate Children from Adult Sexual Content in Training Datasets (Impact: Significant, Scope: Medium)
“Make best efforts to not include images/videos of children, or audio recordings of children in datasets that contain adult sexual content.”
Critical for open source models: Ensure training data does not combine imagery of children with adult sexual content. For de-aging models, do not include adult sexual content in training datasets.
Why it matters: Models can learn to combine these concepts through compositional generalization, enabling generation of AIG-CSAM even without explicit abuse material in training data.
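A minimal sketch of enforcing this separation at dataset-assembly time, assuming each item already carries labels (from upstream classifiers or human annotation) indicating whether it depicts a minor and whether it contains sexual content; the flag names are hypothetical.

```python
# Hypothetical partitioning step: keep depictions of minors and adult sexual content
# in disjoint corpora, and set aside anything that is both for review and reporting.
def split_corpora(items):
    """items: iterable of dicts with boolean flags 'depicts_minor' and 'sexual_content'."""
    general_corpus, adult_corpus, excluded = [], [], []
    for item in items:
        if item["depicts_minor"] and item["sexual_content"]:
            excluded.append(item)        # candidate abuse material: exclude, review, report
        elif item["sexual_content"]:
            adult_corpus.append(item)    # adult sexual content, no minors
        else:
            general_corpus.append(item)  # may include benign depictions of minors
    return general_corpus, adult_corpus, excluded
```

A single training mix would then draw from general_corpus alone, or from adult_corpus combined only with data that contains no depictions of children.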
Conduct Red Teaming for AIG-CSAM and CSEM (Impact: Significant, Scope: Medium)
Incorporate structured, scalable stress testing throughout the development process and update models based on the findings. If a model is substantively updated such that its capabilities increase in risk areas, conduct a corresponding round of red teaming.
Key requirement: “Ensure that after each round of red teaming, findings are integrated back into model training and development.”
Legal considerations: Attempting to generate AIG-CSAM may implicate local law, so consult legal counsel. Testing can be carried out within regulatory bounds, for example by assessing whether a model can produce both adult sexual content and photorealistic representations of children (indicators of compositional generalization risk).
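As one way to structure a red-teaming pass, the sketch below runs a curated prompt set against the model under test and logs flagged findings to a JSONL file for the training team. The hooks `generate` and `risk_score` stand in for whatever model interface and evaluation method (classifier or human review) an organization actually uses; all names and thresholds are assumptions.

```python
# Hypothetical red-teaming harness: iterate over a curated prompt set, score outputs,
# and persist findings so they can be integrated back into training and filter updates.
import hashlib
import json
import time

def red_team_pass(prompts, generate, risk_score,
                  findings_path="redteam_findings.jsonl", threshold=0.5):
    """Return the number of prompts whose outputs were flagged as risky."""
    flagged = 0
    with open(findings_path, "a") as f:
        for prompt in prompts:
            output = generate(prompt)
            score = risk_score(output)
            if score >= threshold:
                flagged += 1
                f.write(json.dumps({
                    "timestamp": time.time(),
                    # Log a stable identifier rather than the raw prompt, if policy requires.
                    "prompt_id": hashlib.sha256(prompt.encode()).hexdigest()[:16],
                    "risk_score": score,
                }) + "\n")
    return flagged
```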
Include Content Provenance by Default (Impact: Significant, Scope: Wide)
Embed visible or invisible indicators in images/videos or build detection capabilities into models. Ensure CSAM hotlines and law enforcement have access to detection tools.
For open source models: Include provenance during generation (e.g., maximally indelible watermarks embedded natively into generated images).
Why it matters: “Makes it easier for law enforcement and NGOs to quickly identify content that depicts an identified or unidentified victim.”
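To make the mechanics concrete, here is a deliberately simple toy that stamps a marker into an image's least significant bits and reads it back. LSB marks are trivially stripped, so this illustrates only the embed/verify workflow; the framework calls for maximally indelible watermarks embedded natively during generation. The tag value and function names are assumptions.

```python
# Toy least-significant-bit (LSB) watermark, for illustration only: real provenance
# solutions use robust, generation-time watermarking rather than fragile LSB marks.
import numpy as np
from PIL import Image

TAG = "AIGEN"  # hypothetical provenance marker

def embed_lsb_tag(in_path: str, out_path: str, tag: str = TAG) -> None:
    """Write the tag's bits into the least significant bits of the red channel."""
    pixels = np.array(Image.open(in_path).convert("RGB"))
    bits = np.array([int(b) for byte in tag.encode() for b in format(byte, "08b")],
                    dtype=np.uint8)
    red = pixels[:, :, 0].flatten()
    if bits.size > red.size:
        raise ValueError("image too small to hold the tag")
    red[: bits.size] = (red[: bits.size] & 0xFE) | bits
    pixels[:, :, 0] = red.reshape(pixels.shape[:2])
    Image.fromarray(pixels).save(out_path, format="PNG")  # lossless, or the mark is lost

def read_lsb_tag(in_path: str, n_chars: int = len(TAG)) -> str:
    """Recover the first n_chars characters of an embedded tag."""
    red = np.array(Image.open(in_path).convert("RGB"))[:, :, 0].flatten()
    rows = (red[: n_chars * 8] & 1).reshape(-1, 8)
    return bytes(int("".join(str(b) for b in row), 2) for row in rows).decode("ascii", errors="replace")
```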
Critical Mitigations in the Deploy Phase
Detect Abusive Content in Inputs and Outputs (Impact: Significant, Scope: Narrow)
In deployment settings with direct access to inputs and outputs (see the sketch after this list):
- Detect input prompts intended to produce AIG-CSAM and CSEM
- Detect CSAM in inputs
- Detect AIG-CSAM and CSEM in outputs
- Set up content moderation flows
- Report to authorities where required
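A minimal sketch of such a flow, assuming hypothetical detection hooks (`prompt_risk_score`, `image_csam_score`) supplied by whatever tooling an organization licenses or builds; none of these names come from the framework itself, and the thresholds are placeholders.

```python
# Hypothetical gate around a single text-to-image request: screen the prompt, generate,
# screen the output, and route anything flagged to human review and reporting workflows.
from dataclasses import dataclass

PROMPT_THRESHOLD = 0.8   # assumed operating points; tune against labeled data
OUTPUT_THRESHOLD = 0.5

@dataclass
class ModerationDecision:
    allow: bool
    reason: str
    escalate_to_human: bool = False

def moderate_generation(prompt, prompt_risk_score, generate, image_csam_score):
    """Gate one generation request; returns a ModerationDecision."""
    # 1. Screen the input prompt before any compute is spent on generation.
    if prompt_risk_score(prompt) >= PROMPT_THRESHOLD:
        return ModerationDecision(False, "prompt blocked by input filter", escalate_to_human=True)
    # 2. Generate, then screen the output before returning it to the user.
    image = generate(prompt)
    if image_csam_score(image) >= OUTPUT_THRESHOLD:
        # Route to the moderation queue; legal reporting obligations may apply.
        return ModerationDecision(False, "output withheld pending review", escalate_to_human=True)
    return ModerationDecision(True, "ok")
```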
Include User Reporting, Feedback or Flagging Options (Impact: Significant, Scope: Narrow)
For first-party AI providers: Allow users to report content violating child safety policies. For third-party AI providers: Allow users to report models that generate AIG-CSAM and CSEM.
Ensure real-time, in-application reporting to reduce barriers. Provide links to support services and contact details for law enforcement.
Assess Generative Models Before Access (Impact: Significant, Scope: Wide)
For first-party AI providers: Assess models for potential to generate AIG-CSAM and CSEM before hosting. Don’t host Category 2a or 2b models (see Safety Assessment Categories below) until mitigations are implemented.
For third-party AI providers: Directly assess where possible, or require developers to complete a child safety section in model cards and use this to make hosting decisions.
The document provides a standardized safety assessment framework (detailed below).
Critical Mitigations in the Maintain Phase
Remove “Nudify” Services from Search Results (Impact: Significant, Scope: Narrow)
Search engines should delist sites providing services and tutorials for “nudifying” and sexualizing images where users can upload images of clothed children and receive corresponding nude images.
Why it matters: “Makes it more difficult for bad actors to generate AIG-CSAM by sexualizing a child’s benign imagery.”
Use the Generative AI File Annotation When Reporting to NCMEC (Impact: Significant, Scope: Narrow)
When filing reports via NCMEC’s API, use the “generativeAi” file annotation; when reporting through other mechanisms, annotate the content as AI-generated manually.
Why it matters: “Critical to the workflow of analysts who are reviewing this content.”
Detect and Remove Known AIG-CSAM Models from Platforms (Impact: Significant, Scope: Medium)
Some models have been trained specifically to create AIG-CSAM (Category 2c). Their cryptographic hashes are known in some cases. Detect and remove these from platforms. Search services should remove links to these models.
Why it matters: “Limit the distribution and spread of models that have been built to create AIG-CSAM, and thereby limit the distribution and spread of AIG-CSAM.”
Thorn has curated a dataset of hashes of known Category 2c models (access via tech-standards@wearethorn.org).
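A minimal sketch of screening an uploaded model against such a hash list, assuming the upload is a directory of weight files; the file extensions and function names are assumptions, and the hash-list format would follow whatever Thorn distributes.

```python
# Hypothetical upload-time check: hash each weight file and compare against a curated
# set of cryptographic hashes of known Category 2c models.
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-gigabyte checkpoints never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_known_abusive_model(model_dir: Path, known_hashes: set[str]) -> bool:
    """Flag the upload if any weight file matches a known Category 2c hash."""
    weight_files = list(model_dir.glob("*.safetensors")) + list(model_dir.glob("*.ckpt"))
    return any(sha256_of_file(p) in known_hashes for p in weight_files)
```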
Model Safety Assessment Framework
One of the document’s most valuable contributions is a standardized framework for assessing image/video generation models:
Safety Assessment Categories
Category 1: Model is incapable of generating AIG-CSAM
Category 2: Model is capable of generating AIG-CSAM
- 2a: When explicitly prompted to do so
- 2b: Inadvertently without explicit prompting
- 2c: Has been optimized specifically for generating AIG-CSAM
Decision Framework for Hosting
- Category 1: Can be hosted
- Category 2a/2b: Should not be hosted until mitigations are implemented; if retraining is impractical, restrict to hosted-generation only with prompt filtering
- Category 2c: Should never be hosted (see the sketch after this list)
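A minimal encoding of these rules, assuming string category labels from the assessment above; the returned actions paraphrase the framework's guidance rather than quote it.

```python
# Illustrative mapping from safety assessment category to hosting action.
def hosting_decision(category: str) -> str:
    guidance = {
        "1":  "host",
        "2a": "do not host until mitigations are implemented; if retraining is impractical, "
              "restrict to hosted generation only, with prompt filtering",
        "2b": "do not host until mitigations are implemented; if retraining is impractical, "
              "restrict to hosted generation only, with prompt filtering",
        "2c": "never host; remove and delist",
    }
    if category not in guidance:
        raise ValueError(f"unknown safety assessment category: {category}")
    return guidance[category]
```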
Supporting Resources
Safety Assessment Dataset: Thorn has curated prompts and configuration parameters that, in combination with specified Category 2 models, generate AIG-CSAM. This can help assess whether a model falls into Category 2.
Known AIG-CSAM Models: A dataset of hashes of known Category 2c models.
Both datasets available by contacting tech-standards@wearethorn.org.
Model Card: Child Safety Section
The document provides specific questions to include in model card templates (a sketch of how the answers might be captured follows the list):
- Is there any CSAM or CSEM in training, testing, or render data?
- For open source models, is your data a mix of adult and child imagery? Why?
- What detection measures were run on data?
- Did you use content provenance? Describe or explain why not.
- Did you add detection layers or safety filters? Describe or explain why not.
- What is your model capable of producing?
- Have you conducted red teaming? Describe or explain why not.
- Is your model incapable of producing AIG-CSAM and CSEM? Provide documentation.
- Has your model been through safety assessment? Describe or explain why not.
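One way to operationalize these questions is to capture the answers as structured metadata that travels with the model card; the field names and example values below are assumptions, not a published schema.

```python
# Hypothetical structured answers to the child safety questions above.
child_safety_section = {
    "csam_or_csem_in_data": False,                 # training, testing, or render data
    "mixes_child_and_adult_sexual_content": False,
    "data_mix_rationale": "n/a",
    "detection_measures_on_data": ["hash matching", "CSAM/CSEM classifiers"],
    "content_provenance": "invisible watermark embedded at generation time",
    "detection_layers_or_safety_filters": ["prompt filter", "output image classifier"],
    "capabilities_summary": "text-to-image; photorealistic people",
    "red_teaming_conducted": True,
    "incapable_of_aig_csam_documentation": "link to safety assessment report",
    "safety_assessment_category": "1",             # per the Category 1 / 2a / 2b / 2c framework
}
```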
Addressing Downstream Implications
One of the document’s strengths is its recognition that some mitigations have “potentially problematic downstream implications.” For each phase, it identifies these concerns and suggests responses:
Model Capabilities May Be Reduced
Overly sanitizing training datasets could reduce capabilities for non-abuse-related outputs.
Response: Explore techniques like dataset transparency and bootstrapped learning to minimize this risk.
Cultural Distinctions in Content Moderation
Different cultures have different norms around content.
Response: Thoroughly understand the cultural context of moderation solutions and engage legal teams on regulatory obligations in specific contexts.
Bias in Automated Solutions
Automated content moderation may be biased (e.g., more likely to detect CSAM depicting light-skinned children than dark-skinned children).
Response: Evaluate performance across the full spectrum of demographics. Follow known best practices for removing bias. Resources include Google Research’s MinDiff Framework and the Department of Defense’s Responsible AI Toolkit.
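A minimal sketch of the evaluation step, assuming a labeled test set where each example carries a prediction, a ground-truth label, and a demographic group annotation; all names are hypothetical.

```python
# Illustrative per-group recall check for an abuse-content detector on a labeled eval set.
from collections import defaultdict

def recall_by_group(examples):
    """examples: iterable of (predicted_positive: bool, is_positive: bool, group: str)."""
    tp, fn = defaultdict(int), defaultdict(int)
    for predicted, actual, group in examples:
        if actual:
            bucket = tp if predicted else fn
            bucket[group] += 1
    return {g: tp[g] / (tp[g] + fn[g])
            for g in set(tp) | set(fn) if (tp[g] + fn[g]) > 0}
```

A large recall gap between groups is the kind of disparity the framework warns about and should trigger retraining or threshold adjustments.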
Wellness Implications
Exposure to CSAM and AIG-CSAM can result in long-term trauma, vicarious trauma, and PTSD for content moderators and red teamers.
Response: Incorporate programmatic wellness support and benefits. Evaluate external vendors for breadth and depth of wellness support. Provide in-role training on child sexual abuse and exploitation.
Resources include: The Workplace Wellness Project and ZevoHealth.
False Positives
Precision/recall tradeoffs in automated detection mean some legitimate content may be flagged.
Response: Keep humans in the loop for decisions. Incorporate accessible appeals processes.
Further Opportunities: The Horizon
Beyond the immediately actionable mitigations, the document identifies opportunities for future development:
In the Develop Phase
- Build an open source resource of cleaned datasets so training data doesn’t have to be reviewed multiple times
- Adopt computing security best practices around discovering and sharing known issues
- Research leveraging provenance metadata to support detection of sexual harms
- Develop AI principles or charters and AI councils to embed child safety expertise in governance
In the Deploy Phase
- Require identity verification for high-risk uses to enable tracing misuse back to the source
- Develop standardized safety assessments and shared prompt datasets
- Build automated, scalable model assessments for third-party platforms
In the Maintain Phase
- Continuously engage across industry and child safety experts to keep mitigations current
- Build shared resources of Category 2c models for prompt detection and removal
- Use consistent templates for reporting AIG-CSAM to accelerate hotline and law enforcement workflows
- Proactively protect users’ content by offering API endpoints that add perturbations making content robust to AI manipulation
- Provide transparency reports and third-party audits
- Implement Safety by Design in associated tooling (like model merging tools that bad actors use to combine child depiction models with adult sexual content models)
Why This Framework Matters
It’s Collaborative, Not Prescriptive
The framework was co-developed by companies spanning the generative AI ecosystem—from cloud providers (AWS AI) to model developers (Stability AI, Inflection) to hosting platforms (Civitai, Hugging Face) to safety organizations (Thorn, All Tech Is Human). This isn’t guidance imposed from outside the industry—it’s what leading practitioners determined is both necessary and achievable.
It’s Tactical, Not Theoretical
Each mitigation includes implementation resources: specific tools, services, research papers, and code repositories. Teams can move directly from reading the framework to taking action.
It Covers the Full Ecosystem
By identifying responsibilities for AI developers, AI providers, data hosting platforms, social platforms, and search engines, the framework recognizes that no single actor can solve this problem alone. Effective child safety requires coordinated action.
It’s Applicable Beyond Child Safety
While focused on preventing child sexual abuse, the framework explicitly notes: “Similar misuse may occur in other harm spaces, including terrorism, violence and extremism, mis/disinformation, and adult non-consensual intimate imagery.” The principles and mitigations can be adapted to these contexts.
It Provides the Foundation for Compliance
As regulatory frameworks like Australia’s Online Safety Act, the EU AI Act, and potential US legislation create mandatory requirements, this framework provides a roadmap for meeting those obligations. It translates abstract principles like “Safety by Design” into concrete actions.
The Call to Action
The document’s message is unequivocal:
“This misuse, and its associated downstream harm, is already occurring, and warrants collective action, today. The need is clear: we must mitigate the misuse of generative AI technologies to perpetrate, proliferate, and further sexual harms against children. This moment requires a proactive response.”
The prevalence of AIG-CSAM is currently small, but growing exponentially. This is the narrow window in which proactive action can make a difference—before the problem becomes so entrenched that responses become purely reactive and retrofit.
A Comprehensive Ecosystem Response
The framework emphasizes that technology companies alone cannot solve this problem:
“These principles and mitigations should be understood as one piece of a necessary ecosystem response and holistic approach. To be most effective, this approach will require layered sets of interventions. Stakeholders across non-governmental organizations (NGOs), law enforcement agencies, government bodies, survivor services, and the broader community must coordinate together to have impact, collaboratively developing a victim-centered, preventative approach.”
But technology companies must take ownership of their role: preventing their products from being weaponized against children.
How This Connects to CAIROS AI’s Mission
The Thorn/All Tech Is Human framework reinforces principles central to our work at CAIROS AI:
Safety Must Be Integrated Throughout the AI Lifecycle
The framework’s expansion of Safety by Design to encompass develop, deploy, and maintain phases aligns with our understanding that safety cannot be a one-time checkpoint. It requires continuous assessment and adaptation.
Red Teaming Is Essential, Not Optional
The framework’s inclusion of red teaming as a significant mitigation with specific implementation guidance validates the critical role of adversarial testing in identifying vulnerabilities before bad actors exploit them.
Specialized Expertise Is Required
The framework’s recognition that different stakeholders have different responsibilities—and its provision of role-specific guidance—underscores that child safety in AI requires specialized knowledge. General AI safety practices are insufficient.
Standardized Assessment Enables Accountability
The model safety assessment framework (Category 1, 2a, 2b, 2c) provides exactly the kind of standardized evaluation that enables meaningful accountability. Organizations can demonstrate—not just claim—that they’ve taken steps to prevent misuse.
Third-Party Validation Supports Trust
The framework’s recommendation that third-party AI providers assess models before hosting, and its identification of external resources and services throughout, recognizes that independent validation is necessary when children’s safety is at stake.
Documentation and Transparency Are Foundational
The repeated emphasis on thoroughly documenting procedures, training employees, and providing information via model cards and transparency reports aligns with our understanding that demonstrable due diligence requires documented evidence.
A Model for Industry Self-Regulation
The Thorn framework represents what effective industry self-regulation looks like:
✓ Specific and actionable (not vague aspirational statements)
✓ Developed collaboratively (across companies and roles in the ecosystem)
✓ Acknowledges tradeoffs and complications (addresses downstream implications)
✓ Provides implementation resources (not just recommendations)
✓ Covers the full spectrum (from high-impact to incremental interventions)
✓ Recognizes limits (identifies further opportunities requiring more research)
✓ Prioritizes child protection (while acknowledging other applications)
This is the kind of proactive, comprehensive approach that should serve as a template for addressing other AI safety challenges.
The Choice Before Us
We are, as the document states, at a crossroads. The internet’s development teaches us what happens when child safety is an afterthought:
- Decades of reactive interventions trying to contain exponentially growing harms
- Platforms perpetually playing catch-up with bad actors
- An overtaxed ecosystem struggling with 100+ million files per year
Or we can choose differently:
- Build safety into AI systems from the earliest stages
- Test proactively for vulnerabilities before deployment
- Coordinate across the full ecosystem
- Act now while the problem is still manageable
The Thorn/All Tech Is Human framework provides the roadmap. The question is whether the AI industry—and the broader ecosystem of stakeholders—will follow it.
Read the full framework: Safety by Design for Generative AI: Preventing Child Sexual Abuse – Thorn & All Tech Is Human, July 2024
For access to safety assessment datasets and resources: tech-standards@wearethorn.org
CAIROS AI provides the specialized red-teaming and model safety assessment capabilities that the Thorn framework identifies as essential. Our expert-led testing helps organizations implement Safety by Design principles throughout the AI lifecycle, conduct structured stress-testing for AIG-CSAM and CSEM generation capabilities, and establish the documented due diligence that both regulatory compliance and child protection demand.