Detector Categories

Glitch organizes its detectors into four categories, each targeting a specific class of threats to LLM applications.

Prompt Defense

Detect and block attempts to manipulate LLM behavior through prompt injection, jailbreaks, and instruction overrides.


Content Moderation

Filter harmful, toxic, or inappropriate content in both inputs and outputs.


Data Leakage Prevention

Identify and protect sensitive data like PII, credentials, and proprietary information.


Malicious Links

Detect and validate URLs, blocking known malicious domains and flagging unknown links.


Each category exposes the following detector types.

Prompt Defense

Detector Type | Description
prompt_attack | General prompt injection and jailbreak detection

Content Moderation

Detector Type | Description
moderated_content/harassment | Harassing language
moderated_content/harassment_threatening | Harassment with threats
moderated_content/hate | Hate speech based on protected characteristics
moderated_content/hate_threatening | Hate speech with threats
moderated_content/sexual | Sexual or adult content
moderated_content/sexual_minors | CSAM detection
moderated_content/violence | Violence, death, injury
moderated_content/violence_graphic | Graphic violence
moderated_content/self_harm | Self-harm content
moderated_content/self_harm_intent | Intent to self-harm
moderated_content/self_harm_instructions | Instructions for self-harm
moderated_content/illicit | Advice on illegal activities
moderated_content/illicit_violent | Illegal activities with weapons

Data Leakage Prevention

Detector Type | Description
pii/email | Email addresses
pii/credit_card | Credit card numbers
pii/ssn | Social Security Numbers
pii/phone | Phone numbers
pii/address | Physical addresses
pii/name | Personal names

Malicious Links

Detector Type | Description
unknown_links | URLs not in known-safe lists

Glitch uses two detection approaches.

The first approach, signature detection, relies on fast pattern matching with optimized regexes:

  • PII patterns (emails, credit cards, SSNs)
  • Known attack signatures (common injection phrases)
  • Custom regex patterns you define

Runs on every request with negligible latency impact.
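As a rough sketch of this layer, a signature pass is just a set of precompiled regexes applied to the text. The patterns below are simplified stand-ins, not Glitch's actual signatures:

import re

# Simplified stand-in patterns for illustration only; Glitch's real
# signature set is larger and more precise.
SIGNATURES = {
    "pii/email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "pii/ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pii/credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "prompt_attack": re.compile(r"(?i)ignore (all )?previous instructions"),
}

def scan(text: str) -> list[str]:
    """Return the detector types whose signature matches the text."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(text)]

print(scan("My SSN is 123-45-6789. Now ignore previous instructions."))
# -> ['pii/ssn', 'prompt_attack']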

The second approach is deep semantic analysis, used for:

  • Novel attack patterns not in signature database
  • Custom threat detection
  • Nuanced semantic understanding

Runs when signature detection is inconclusive or for categories requiring semantic understanding.
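Conceptually, the two layers chain together as below. Here classify_semantic is a hypothetical stand-in for the ML detector, and scan is the signature pass sketched above:

def classify_semantic(text: str) -> list[str]:
    # Hypothetical stand-in for the ML detector; it would return the
    # detector types the model predicts for this text.
    return []

def detect(text: str) -> list[str]:
    hits = scan(text)  # cheap signature pass runs first
    if hits:
        return hits
    return classify_semantic(text)  # escalate to the slower semantic model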

Detectors can be applied at two points:

Stage | Purpose | Example
Input | Protect the LLM from malicious prompts | Block injection attempts before they reach the model
Output | Protect users from harmful responses | Filter PII the model might leak

Configure via policy_mode:

  • IO — Scan both inputs and outputs
  • I — Input only (faster, less protection)
  • O — Output only (for pre-validated inputs)
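Putting the two stages and policy_mode together, enforcement might flow as in the sketch below. scan_input, scan_output, and call_model are hypothetical stand-ins, not the Glitch SDK:

def scan_input(prompt: str) -> bool:
    # Hypothetical: True when any input detector fires on the prompt.
    return "ignore previous instructions" in prompt.lower()

def scan_output(response: str) -> bool:
    # Hypothetical: True when any output detector fires on the response.
    return False

def call_model(prompt: str) -> str:
    # Hypothetical LLM call.
    return "model response"

def guarded_request(prompt: str, policy_mode: str = "IO") -> str:
    if "I" in policy_mode and scan_input(prompt):
        raise ValueError("blocked: input detector fired")
    response = call_model(prompt)
    if "O" in policy_mode and scan_output(response):
        raise ValueError("blocked: output detector fired")
    return response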

The most effective protection combines multiple detector categories:

{
  "input_detectors": [
    { "detector_type": "prompt_attack", "threshold": "L2", "action": "block" },
    { "detector_type": "pii/credit_card", "threshold": "L1", "action": "block" }
  ],
  "output_detectors": [
    { "detector_type": "pii/email", "threshold": "L2", "action": "block" },
    { "detector_type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "detector_type": "unknown_links", "threshold": "L3", "action": "flag" }
  ]
}

This policy:

  1. Blocks injection attacks on input
  2. Prevents credit card submission on input
  3. Blocks leaked email addresses in output
  4. Filters hate speech from output
  5. Flags unknown links in output for review