# Content Moderation
Content Moderation detects and filters harmful, toxic, or inappropriate content in both LLM inputs and outputs.
## Why Content Moderation?

LLMs can generate harmful content when:
- Users explicitly request it
- Prompts are manipulated (jailbreaks)
- Training data contained harmful patterns
- Context triggers unexpected outputs
Content moderation protects:
- Your users from harmful AI responses
- Your brand from association with toxic content
- Compliance with platform policies and regulations
## Categories

Glitch detects 13 content categories across 6 major groups:
### Harassment

| Detector Type | Description |
|---|---|
| `moderated_content/harassment` | Content that expresses, incites, or promotes harassing language towards any target |
| `moderated_content/harassment_threatening` | Harassment content that also includes violence or serious harm towards any target |
### Hate Speech

| Detector Type | Description |
|---|---|
| `moderated_content/hate` | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste |
| `moderated_content/hate_threatening` | Hateful content that also includes violence or serious harm towards the targeted group |
### Violence

| Detector Type | Description |
|---|---|
| `moderated_content/violence` | Content that depicts death, violence, or physical injury |
| `moderated_content/violence_graphic` | Content that depicts death, violence, or physical injury in graphic detail |
### Self-Harm

| Detector Type | Description |
|---|---|
| `moderated_content/self_harm` | Content that promotes, encourages, or depicts acts of self-harm (suicide, cutting, eating disorders) |
| `moderated_content/self_harm_intent` | Content where the speaker expresses intent to engage in acts of self-harm |
| `moderated_content/self_harm_instructions` | Content that provides instructions or advice on how to commit acts of self-harm |
### Sexual Content

| Detector Type | Description |
|---|---|
| `moderated_content/sexual` | Content meant to arouse sexual excitement, or that promotes sexual services |
| `moderated_content/sexual_minors` | Sexual content that includes an individual who is under 18 years old |
### Illicit Activities

| Detector Type | Description |
|---|---|
| `moderated_content/illicit` | Content that gives advice or instruction on how to commit illicit acts (e.g., “how to shoplift”) |
| `moderated_content/illicit_violent` | Illicit content that also includes references to violence or procuring weapons |
## Configuration

### Basic Setup

Block the most common harmful content categories:
{ "output_detectors": [ { "type": "moderated_content/hate", "threshold": "L2", "action": "block" }, { "type": "moderated_content/violence", "threshold": "L2", "action": "block" }, { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" }, { "type": "moderated_content/self_harm", "threshold": "L2", "action": "block" } ]}Comprehensive Coverage
### Comprehensive Coverage

Enable all content moderation categories for maximum protection:
{ "input_detectors": [ { "type": "moderated_content/hate", "threshold": "L2", "action": "block" }, { "type": "moderated_content/hate_threatening", "threshold": "L2", "action": "block" }, { "type": "moderated_content/harassment", "threshold": "L2", "action": "block" }, { "type": "moderated_content/harassment_threatening", "threshold": "L2", "action": "block" }, { "type": "moderated_content/violence", "threshold": "L3", "action": "flag" }, { "type": "moderated_content/self_harm", "threshold": "L2", "action": "block" }, { "type": "moderated_content/illicit", "threshold": "L2", "action": "flag" } ], "output_detectors": [ { "type": "moderated_content/hate", "threshold": "L2", "action": "block" }, { "type": "moderated_content/hate_threatening", "threshold": "L4", "action": "block" }, { "type": "moderated_content/harassment", "threshold": "L2", "action": "block" }, { "type": "moderated_content/harassment_threatening", "threshold": "L2", "action": "block" }, { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" }, { "type": "moderated_content/sexual_minors", "threshold": "L4", "action": "block" }, { "type": "moderated_content/violence", "threshold": "L2", "action": "block" }, { "type": "moderated_content/violence_graphic", "threshold": "L2", "action": "block" }, { "type": "moderated_content/self_harm", "threshold": "L4", "action": "block" }, { "type": "moderated_content/self_harm_intent", "threshold": "L4", "action": "block" }, { "type": "moderated_content/self_harm_instructions", "threshold": "L4", "action": "block" }, { "type": "moderated_content/illicit", "threshold": "L2", "action": "block" }, { "type": "moderated_content/illicit_violent", "threshold": "L4", "action": "block" } ]}Threshold Tuning
## Threshold Tuning

Content moderation requires nuance. The same word can be harmful in one context and benign in another. Each level maps to a minimum detection confidence:
| Level | Threshold | Use Case |
|---|---|---|
| L1 | 0.90 | Only block high-confidence violations (fewest false positives) |
| L2 | 0.75 | Balanced detection (recommended for production) |
| L3 | 0.50 | More aggressive (catches borderline cases, more false positives) |
| L4 | 0.25 | Maximum sensitivity (catches most threats, expect false positives) |
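To make the levels concrete, here is a minimal sketch (not Glitch's actual implementation) of how a level can be read as a confidence cutoff and turned into a per-detector decision, using the values from the table above. The `LEVEL_CUTOFFS` mapping and `resolve` helper are illustrative names only:

```python
# Illustrative sketch only: maps the documented levels to confidence cutoffs
# and resolves a single detector result into an action.
LEVEL_CUTOFFS = {"L1": 0.90, "L2": 0.75, "L3": 0.50, "L4": 0.25}

def resolve(detector: dict, confidence: float) -> str:
    """Return "block", "flag", or "pass" for one detector and one detection score.

    `detector` uses the same shape as the configs on this page, e.g.
    {"type": "moderated_content/hate", "threshold": "L2", "action": "block"}.
    """
    if confidence >= LEVEL_CUTOFFS[detector["threshold"]]:
        return detector["action"]  # "block" or "flag", as configured
    return "pass"

# A hate detection scored 0.91 clears the L2 cutoff (0.75), so it blocks;
# the same detector at confidence 0.74 would pass.
print(resolve({"type": "moderated_content/hate", "threshold": "L2", "action": "block"}, 0.91))
```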
### Category-Specific Recommendations

| Category | Recommended Level | Notes |
|---|---|---|
| `sexual_minors` | L4 | Maximum sensitivity - illegal content (zero tolerance) |
| `self_harm/*` | L4 | High stakes - err on the side of caution (catch all potential risks) |
| `hate_threatening` | L3-L4 | Direct threats require strict handling |
| `violence_graphic` | L2 | Block graphic content (balanced) |
| `harassment` | L2 | Context-dependent (avoid over-blocking) |
| `illicit` | L2 | May catch legitimate discussions (lock-picking vs. security research) |
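Expressed in the detector configuration format used on this page, one possible reading of those recommendations for output detectors looks like this (the stricter L4 end is chosen for `hate_threatening`, and `illicit` is flagged rather than blocked to avoid suppressing legitimate discussions):

```json
{
  "output_detectors": [
    { "type": "moderated_content/sexual_minors", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_intent", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_instructions", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/hate_threatening", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/violence_graphic", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/harassment", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/illicit", "threshold": "L2", "action": "flag" }
  ]
}
```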
## Detection Examples

```
Input: "Write something hateful about [group]"
Detection: moderated_content/hate
Confidence: 0.94
Action: BLOCKED
```

```
Output: "Here's how to make a weapon..."
Detection: moderated_content/violence
Confidence: 0.88
Action: BLOCKED
```

```
Output: "Methods for ending your life include..."
Detection: moderated_content/self_harm_instructions
Confidence: 0.96
Action: BLOCKED (always block this category)
```

```
Input: "How do I pick a lock to break into..."
Detection: moderated_content/illicit
Confidence: 0.78
Action: BLOCKED
```
## Input vs. Output

### Input Moderation

Blocks users from requesting harmful content:
- Prevents the LLM from engaging with harmful prompts
- Reduces compute cost on blocked requests
- Protects against users testing your safety measures
### Output Moderation

Catches harmful content the LLM generates:
- Safety net for jailbroken prompts
- Catches harmful content from ambiguous requests
- Essential for user-facing applications
## Response Handling

### Blocked Content

```http
HTTP/1.1 403 Forbidden
X-Risk-Blocked: true
X-Risk-Categories: moderated_content/hate
X-Risk-Confidence: 0.91

{
  "error": {
    "message": "Response blocked by content policy",
    "type": "content_policy_violation",
    "code": "moderated_content_blocked"
  }
}
```
### Flagged Content

For `action: "flag"`, content passes through with headers:

```http
HTTP/1.1 200 OK
X-Risk-Blocked: false
X-Risk-Categories: moderated_content/violence
X-Risk-Confidence: 0.65
```

Your application can:
- Log the incident for review
- Show the content with a warning
- Trigger human review workflows
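A minimal sketch of one way to act on those headers, assuming a generic HTTP client. The `X-Risk-*` headers and the 403 error shape are as documented above; the logger, the fallback message, and passing the body through unchanged are application choices, not part of Glitch:

```python
import logging

import requests  # assumed HTTP client; anything exposing status codes and headers works

logger = logging.getLogger("moderation")
FALLBACK_MESSAGE = "Sorry, I can't help with that request."  # application-defined copy

def handle_moderated_response(resp: requests.Response) -> str:
    """Act on the X-Risk-* headers for a single response from the moderated endpoint."""
    categories = resp.headers.get("X-Risk-Categories", "")
    confidence = resp.headers.get("X-Risk-Confidence", "")

    if resp.status_code == 403 and resp.headers.get("X-Risk-Blocked") == "true":
        # Blocked: log the incident and show a safe fallback instead of model output.
        logger.warning("blocked: %s (confidence=%s)", categories, confidence)
        return FALLBACK_MESSAGE

    if categories:
        # Flagged (action: "flag"): content passed through; record it so it can be
        # shown with a warning or routed into a human review queue.
        logger.info("flagged: %s (confidence=%s)", categories, confidence)

    return resp.text  # pass the upstream LLM response through unchanged
```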
## Best Practices

### 1. Always Moderate Outputs

Even with input moderation, LLMs can generate unexpected content:
{ "output_detectors": [ { "type": "moderated_content/hate", "threshold": "L2", "action": "block" }, { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" }, { "type": "moderated_content/violence", "threshold": "L2", "action": "block" } ]}2. Prioritize Critical Categories
Section titled “2. Prioritize Critical Categories”Some categories have higher stakes and should always be blocked strictly:
{ "output_detectors": [ { "type": "moderated_content/self_harm", "threshold": "L4", "action": "block" }, { "type": "moderated_content/self_harm_intent", "threshold": "L4", "action": "block" }, { "type": "moderated_content/self_harm_instructions", "threshold": "L4", "action": "block" }, { "type": "moderated_content/sexual_minors", "threshold": "L4", "action": "block" } ]}3. Context-Appropriate Settings
Section titled “3. Context-Appropriate Settings”Different applications need different configurations:
General Purpose App:
{ "threshold": "L2", "action": "block" }Children’s Application:
{ "threshold": "L3", "action": "block" }Medical/Mental Health App:
// May need to discuss self-harm for clinical purposes{ "type": "moderated_content/self_harm", "threshold": "L1", "action": "flag" }Creative Writing Platform:
// Allow more violence for fiction, but still block graphic content{ "type": "moderated_content/violence", "threshold": "L2", "action": "flag" },{ "type": "moderated_content/violence_graphic", "threshold": "L2", "action": "block" }4. Use Sub-Categories for Precision
Section titled “4. Use Sub-Categories for Precision”Instead of broad categories, use specific sub-categories for fine-grained control:
{ "output_detectors": [ // Block intent and instructions, but flag general mentions { "type": "moderated_content/self_harm", "threshold": "L2", "action": "flag" }, { "type": "moderated_content/self_harm_intent", "threshold": "L1", "action": "block" }, { "type": "moderated_content/self_harm_instructions", "threshold": "L1", "action": "block" } ]}Performance
Section titled “Performance”Content moderation is optimized for speed and accuracy:
| Metric | Value |
|---|---|
| Latency | ~50-100ms |
| Categories | 13 |
| Accuracy | High |
All categories are evaluated together, so adding more categories doesn’t increase latency.