Content Moderation

Content Moderation detects and filters harmful, toxic, or inappropriate content in both LLM inputs and outputs.

LLMs can generate harmful content when:

  • Users explicitly request it
  • Prompts are manipulated (jailbreaks)
  • Training data contained harmful patterns
  • Context triggers unexpected outputs

Content moderation protects:

  • Your users from harmful AI responses
  • Your brand from association with toxic content
  • Your compliance with platform policies and regulations

Glitch detects 13 content categories across 6 major groups:

Harassment

| Detector Type | Description |
| --- | --- |
| moderated_content/harassment | Content that expresses, incites, or promotes harassing language towards any target |
| moderated_content/harassment_threatening | Harassment content that also includes violence or serious harm towards any target |

Hate

| Detector Type | Description |
| --- | --- |
| moderated_content/hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste |
| moderated_content/hate_threatening | Hateful content that also includes violence or serious harm towards the targeted group |

Violence

| Detector Type | Description |
| --- | --- |
| moderated_content/violence | Content that depicts death, violence, or physical injury |
| moderated_content/violence_graphic | Content that depicts death, violence, or physical injury in graphic detail |

Self-Harm

| Detector Type | Description |
| --- | --- |
| moderated_content/self_harm | Content that promotes, encourages, or depicts acts of self-harm (suicide, cutting, eating disorders) |
| moderated_content/self_harm_intent | Content where the speaker expresses intent to engage in acts of self-harm |
| moderated_content/self_harm_instructions | Content that provides instructions or advice on how to commit acts of self-harm |

Sexual

| Detector Type | Description |
| --- | --- |
| moderated_content/sexual | Content meant to arouse sexual excitement, or that promotes sexual services |
| moderated_content/sexual_minors | Sexual content that includes an individual who is under 18 years old |

Illicit

| Detector Type | Description |
| --- | --- |
| moderated_content/illicit | Content that gives advice or instruction on how to commit illicit acts (e.g., “how to shoplift”) |
| moderated_content/illicit_violent | Illicit content that also includes references to violence or procuring weapons |

Block the most common harmful content categories:

{
  "output_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/self_harm", "threshold": "L2", "action": "block" }
  ]
}

Enable comprehensive coverage on both inputs and outputs for maximum protection:

{
  "input_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/hate_threatening", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/harassment", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/harassment_threatening", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L3", "action": "flag" },
    { "type": "moderated_content/self_harm", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/illicit", "threshold": "L2", "action": "flag" }
  ],
  "output_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/hate_threatening", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/harassment", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/harassment_threatening", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual_minors", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence_graphic", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/self_harm", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_intent", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_instructions", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/illicit", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/illicit_violent", "threshold": "L4", "action": "block" }
  ]
}

Content moderation requires nuance. The same word can be harmful in one context and benign in another.

| Level | Confidence Threshold | Use Case |
| --- | --- | --- |
| L1 | 0.90 | Only block high-confidence violations (fewest false positives) |
| L2 | 0.75 | Balanced detection (recommended for production) |
| L3 | 0.50 | More aggressive (catches borderline cases, more false positives) |
| L4 | 0.25 | Maximum sensitivity (catches most threats, expect false positives) |
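
As a rough illustration of how a level gates a detection (this sketch assumes a detector fires when its confidence score meets or exceeds the level's cutoff; the exact scoring is internal to Glitch):

# Assumed semantics: a detector fires when its confidence score
# meets or exceeds the cutoff for the configured level.
THRESHOLDS = {"L1": 0.90, "L2": 0.75, "L3": 0.50, "L4": 0.25}

def detector_fires(confidence: float, level: str) -> bool:
    return confidence >= THRESHOLDS[level]

print(detector_fires(0.94, "L2"))  # True: fires (and blocks) at L2
print(detector_fires(0.65, "L2"))  # False at L2, but would fire at L3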
| Category | Recommended Level | Notes |
| --- | --- | --- |
| sexual_minors | L4 | Maximum sensitivity - illegal content (zero tolerance) |
| self_harm/* | L4 | High stakes - err on the side of caution (catch all potential risks) |
| hate_threatening | L3-L4 | Direct threats require strict handling |
| violence_graphic | L2 | Block graphic content (balanced) |
| harassment | L2 | Context-dependent (avoid over-blocking) |
| illicit | L2 | May catch legitimate discussions (lock-picking vs. security research) |
Input: "Write something hateful about [group]"
Detection: moderated_content/hate
Confidence: 0.94
Action: BLOCKED

It blocks users from requesting harmful content:

  • Prevents the LLM from engaging with harmful prompts
  • Reduces compute cost on blocked requests
  • Protects against users testing your safety measures

Output moderation catches harmful content the LLM generates:

  • Safety net for jailbroken prompts
  • Catches harmful content from ambiguous requests
  • Essential for user-facing applications
For action: "block", the request fails with a 403 and a content policy error:

HTTP/1.1 403 Forbidden
X-Risk-Blocked: true
X-Risk-Categories: moderated_content/hate
X-Risk-Confidence: 0.91

{
  "error": {
    "message": "Response blocked by content policy",
    "type": "content_policy_violation",
    "code": "moderated_content_blocked"
  }
}
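
As a client-side sketch (how you issue the request is up to your application; only the status code, X-Risk-* headers, and error body shown above are documented), an application can recognize a content-policy block like this:

# Sketch: classify a gateway response as a content-policy block using the
# documented 403 status, X-Risk-* headers, and error code. `resp` can be any
# requests.Response returned by the gateway.
import requests

def is_content_policy_block(resp: requests.Response) -> bool:
    if resp.status_code != 403 or resp.headers.get("X-Risk-Blocked") != "true":
        return False
    error = resp.json().get("error", {})
    return error.get("code") == "moderated_content_blocked"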

For action: "flag", content passes through with headers:

HTTP/1.1 200 OK
X-Risk-Blocked: false
X-Risk-Categories: moderated_content/violence
X-Risk-Confidence: 0.65

Your application can (see the sketch after this list):

  • Log the incident for review
  • Show the content with a warning
  • Trigger human review workflows
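
A minimal triage policy covering those options might look like the sketch below; the X-Risk-* header names come from the gateway response above, while the 0.75 cutoff and the "ok"/"warn"/"review" outcomes are illustrative, application-defined choices:

# Sketch of an application-side policy for action: "flag" responses.
# Header names are documented above; the cutoff and outcomes are examples.
import logging
import requests

logger = logging.getLogger("moderation")

def triage_flagged(resp: requests.Response) -> str:
    if resp.headers.get("X-Risk-Blocked") == "true":
        return "blocked"  # handled by the block path shown earlier
    categories = resp.headers.get("X-Risk-Categories", "")
    if not categories:
        return "ok"       # nothing was flagged
    confidence = float(resp.headers.get("X-Risk-Confidence", "0"))
    logger.warning("Flagged content: %s (confidence %.2f)", categories, confidence)
    # Show with a warning below the cutoff; queue for human review above it.
    return "review" if confidence >= 0.75 else "warn"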

Even with input moderation, LLMs can generate unexpected content:

{
  "output_detectors": [
    { "type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/sexual", "threshold": "L2", "action": "block" },
    { "type": "moderated_content/violence", "threshold": "L2", "action": "block" }
  ]
}

Some categories have higher stakes and should always be blocked strictly:

{
  "output_detectors": [
    { "type": "moderated_content/self_harm", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_intent", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/self_harm_instructions", "threshold": "L4", "action": "block" },
    { "type": "moderated_content/sexual_minors", "threshold": "L4", "action": "block" }
  ]
}

Different applications need different configurations:

General Purpose App:

{ "threshold": "L2", "action": "block" }

Children’s Application:

{ "threshold": "L3", "action": "block" }

Medical/Mental Health App:

// May need to discuss self-harm for clinical purposes
{ "type": "moderated_content/self_harm", "threshold": "L1", "action": "flag" }

Creative Writing Platform:

// Allow more violence for fiction, but still block graphic content
{ "type": "moderated_content/violence", "threshold": "L2", "action": "flag" },
{ "type": "moderated_content/violence_graphic", "threshold": "L2", "action": "block" }

Instead of broad categories, use specific sub-categories for fine-grained control:

{
  "output_detectors": [
    // Block intent and instructions, but flag general mentions
    { "type": "moderated_content/self_harm", "threshold": "L2", "action": "flag" },
    { "type": "moderated_content/self_harm_intent", "threshold": "L1", "action": "block" },
    { "type": "moderated_content/self_harm_instructions", "threshold": "L1", "action": "block" }
  ]
}

Content moderation is optimized for speed and accuracy:

| Metric | Value |
| --- | --- |
| Latency | ~50-100ms |
| Categories | 13 |
| Accuracy | High |

All categories are evaluated together, so adding more categories doesn’t increase latency.