Content Moderation

Content Moderation detects and filters harmful, toxic, or inappropriate content in both LLM inputs and outputs.

LLMs can generate harmful content when:

  • Users explicitly request it
  • Prompts are manipulated (jailbreaks)
  • Training data contained harmful patterns
  • Context triggers unexpected outputs

Content moderation protects:

  • Your users from harmful AI responses
  • Your brand from association with toxic content
  • Your compliance with platform policies and regulations

Glitch detects 13 content categories across 6 major groups:

Harassment

| Detector Type | Description |
| --- | --- |
| moderated_content/harassment | Content that expresses, incites, or promotes harassing language towards any target |
| moderated_content/harassment_threatening | Harassment content that also includes violence or serious harm towards any target |

Hate

| Detector Type | Description |
| --- | --- |
| moderated_content/hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste |
| moderated_content/hate_threatening | Hateful content that also includes violence or serious harm towards the targeted group |

Violence

| Detector Type | Description |
| --- | --- |
| moderated_content/violence | Content that depicts death, violence, or physical injury |
| moderated_content/violence_graphic | Content that depicts death, violence, or physical injury in graphic detail |

Self-Harm

| Detector Type | Description |
| --- | --- |
| moderated_content/self_harm | Content that promotes, encourages, or depicts acts of self-harm (suicide, cutting, eating disorders) |
| moderated_content/self_harm_intent | Content where the speaker expresses intent to engage in acts of self-harm |
| moderated_content/self_harm_instructions | Content that provides instructions or advice on how to commit acts of self-harm |

Sexual

| Detector Type | Description |
| --- | --- |
| moderated_content/sexual | Content meant to arouse sexual excitement, or that promotes sexual services |
| moderated_content/sexual_minors | Sexual content that includes an individual who is under 18 years old |

Illicit

| Detector Type | Description |
| --- | --- |
| moderated_content/illicit | Content that gives advice or instruction on how to commit illicit acts (e.g., “how to shoplift”) |
| moderated_content/illicit_violent | Illicit content that also includes references to violence or procuring weapons |

Block the most common harmful content categories:

{
  "output_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/self_harm", "threshold": "L2", "action": "block" }
  ]
}

Enable comprehensive coverage on both inputs and outputs for maximum protection:

{
  "input_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/hate_threatening", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/harassment", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/harassment_threatening", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L3", "action": "flag" },
    { "type": "moderated_content/self_harm", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/illicit", "threshold": "L2", "action": "flag" }
  ],
  "output_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/hate_threatening", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/harassment", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/harassment_threatening", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual_minors", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence_graphic", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/self_harm", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_intent", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_instructions", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/illicit", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/illicit_violent", "threshold": "L4", "action": "block" }
  ]
}

Content moderation requires nuance. The same word can be harmful in one context and benign in another.

| Level | Confidence Threshold | Use Case |
| --- | --- | --- |
| L1 | 0.90 | Only block high-confidence violations (fewest false positives) |
| L2 | 0.75 | Balanced detection (recommended for production) |
| L3 | 0.50 | More aggressive (catches borderline cases, more false positives) |
| L4 | 0.25 | Maximum sensitivity (catches most threats, expect false positives) |
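
As a rough illustration of how a level gates a detection (this sketch assumes a detector fires when its confidence score meets or exceeds the level's cutoff; the exact scoring is internal to Glitch):

# Assumed semantics: a detector fires when its confidence score
# meets or exceeds the cutoff for the configured level.
THRESHOLDS = {"L1": 0.90, "L2": 0.75, "L3": 0.50, "L4": 0.25}

def detector_fires(confidence: float, level: str) -> bool:
    return confidence >= THRESHOLDS[level]

print(detector_fires(0.94, "L2"))  # True: fires (and blocks) at L2
print(detector_fires(0.65, "L2"))  # False at L2, but would fire at L3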
| Category | Recommended Level | Notes |
| --- | --- | --- |
| sexual_minors | L4 | Maximum sensitivity - illegal content (zero tolerance) |
| self_harm/* | L4 | High stakes - err on the side of caution (catch all potential risks) |
| hate_threatening | L3-L4 | Direct threats require strict handling |
| violence_graphic | L2 | Block graphic content (balanced) |
| harassment | L2 | Context-dependent (avoid over-blocking) |
| illicit | L2 | May catch legitimate discussions (lock-picking vs. security research) |
Input: "Write something hateful about [group]"
Detection: moderated_content/hate
Confidence: 0.94
Action: BLOCKED

It blocks users from requesting harmful content:

  • Prevents the LLM from engaging with harmful prompts
  • Reduces compute cost on blocked requests
  • Protects against users testing your safety measures

Output moderation catches harmful content the LLM generates:

  • Safety net for jailbroken prompts
  • Catches harmful content from ambiguous requests
  • Essential for user-facing applications
For action: "block", the request fails with a 403 and a content policy error:

HTTP/1.1 403 Forbidden
X-Risk-Blocked: true
X-Risk-Categories: moderated_content/hate
X-Risk-Confidence: 0.91

{
  "error": {
    "message": "Response blocked by content policy",
    "type": "content_policy_violation",
    "code": "moderated_content_blocked"
  }
}
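
As a client-side sketch (how you issue the request is up to your application; only the status code, X-Risk-* headers, and error body shown above are documented), an application can recognize a content-policy block like this:

# Sketch: classify a gateway response as a content-policy block using the
# documented 403 status, X-Risk-* headers, and error code. `resp` can be any
# requests.Response returned by the gateway.
import requests

def is_content_policy_block(resp: requests.Response) -> bool:
    if resp.status_code != 403 or resp.headers.get("X-Risk-Blocked") != "true":
        return False
    error = resp.json().get("error", {})
    return error.get("code") == "moderated_content_blocked"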

For action: "flag", content passes through with headers:

HTTP/1.1 200 OK
X-Risk-Blocked: false
X-Risk-Categories: moderated_content/violence
X-Risk-Confidence: 0.65

Your application can (see the sketch after this list):

  • Log the incident for review
  • Show the content with a warning
  • Trigger human review workflows
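
A minimal triage policy covering those options might look like the sketch below; the X-Risk-* header names come from the gateway response above, while the 0.75 cutoff and the "ok"/"warn"/"review" outcomes are illustrative, application-defined choices:

# Sketch of an application-side policy for action: "flag" responses.
# Header names are documented above; the cutoff and outcomes are examples.
import logging
import requests

logger = logging.getLogger("moderation")

def triage_flagged(resp: requests.Response) -> str:
    if resp.headers.get("X-Risk-Blocked") == "true":
        return "blocked"  # handled by the block path shown earlier
    categories = resp.headers.get("X-Risk-Categories", "")
    if not categories:
        return "ok"       # nothing was flagged
    confidence = float(resp.headers.get("X-Risk-Confidence", "0"))
    logger.warning("Flagged content: %s (confidence %.2f)", categories, confidence)
    # Show with a warning below the cutoff; queue for human review above it.
    return "review" if confidence >= 0.75 else "warn"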

Even with input moderation, LLMs can generate unexpected content:

{
  "output_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L2", "action": "block" }
  ]
}

Some categories have higher stakes and should always be blocked strictly:

{
  "output_detectors": [
    { "type": "moderated_content/self_harm", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_intent", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_instructions", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/sexual_minors", "threshold": "L4", "action": "block" }
  ]
}

Different applications need different configurations:

General Purpose App:

{ "threshold": "L2", "action": "block" }

Children’s Application:

{ "threshold": "L3", "action": "block" }

Medical/Mental Health App:

// May need to discuss self-harm for clinical purposes
{ "type": "moderated_content/self_harm", "threshold": "L1", "action": "flag" }

Creative Writing Platform:

// Allow more violence for fiction, but still block graphic content
{ "type": "moderated_content/violence", "threshold": "L2", "action": "flag" },
{ "type": "moderated_content/violence_graphic", "threshold": "L2", "action": "block" }

Instead of broad categories, use specific sub-categories for fine-grained control:

{
  "output_detectors": [
    // Block intent and instructions, but flag general mentions
    { "type": "moderated_content/self_harm", "threshold": "L2", "action": "flag" },
    { "type": "moderated_content/self_harm_intent", "threshold": "L1", "action": "block" },
    { "type": "moderated_content/self_harm_instructions", "threshold": "L1", "action": "block" }
  ]
}

Content moderation is optimized for speed and accuracy:

| Metric | Value |
| --- | --- |
| Latency | ~50-100ms |
| Categories | 13 |
| Accuracy | High |

All categories are evaluated together, so adding more categories doesn’t increase latency.