Detector Categories

Glitch organizes its detectors into four categories, each targeting a specific class of threats to LLM applications.

Prompt Defense

Detect and block attempts to manipulate LLM behavior through prompt injection, jailbreaks, and instruction overrides.


Content Moderation

Filter harmful, toxic, or inappropriate content in both inputs and outputs.


Data Leakage Prevention

Identify and protect sensitive data like PII, credentials, and proprietary information.


Malicious Links

Detect and validate URLs, blocking known malicious domains and flagging unknown links.


Each category exposes the following detector types.

Prompt Defense

Detector Type | Description
prompt_attack | General prompt injection and jailbreak detection

Content Moderation

Detector Type | Description
moderated_content/harassment | Harassing language
moderated_content/harassment_threatening | Harassment with threats
moderated_content/hate | Hate speech based on protected characteristics
moderated_content/hate_threatening | Hate speech with threats
moderated_content/sexual | Sexual or adult content
moderated_content/sexual_minors | CSAM detection
moderated_content/violence | Violence, death, injury
moderated_content/violence_graphic | Graphic violence
moderated_content/self_harm | Self-harm content
moderated_content/self_harm_intent | Intent to self-harm
moderated_content/self_harm_instructions | Instructions for self-harm
moderated_content/illicit | Advice on illegal activities
moderated_content/illicit_violent | Illegal activities with weapons

Data Leakage Prevention

Detector Type | Description
pii/email | Email addresses
pii/credit_card | Credit card numbers
pii/ssn | Social Security Numbers
pii/phone | Phone numbers
pii/address | Physical addresses
pii/name | Personal names

Malicious Links

Detector Type | Description
unknown_links | URLs not in known-safe lists

Glitch uses two detection approaches.

The first approach, signature detection, relies on fast pattern matching with optimized regexes:

  • PII patterns (emails, credit cards, SSNs)
  • Known attack signatures (common injection phrases)
  • Custom regex patterns you define

Runs on every request with negligible latency impact.
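As a rough sketch of this layer, a signature pass is just a set of precompiled regexes applied to the text. The patterns below are simplified stand-ins, not Glitch's actual signatures:

import re

# Simplified stand-in patterns for illustration only; Glitch's real
# signature set is larger and more precise.
SIGNATURES = {
    "pii/email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "pii/ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pii/credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "prompt_attack": re.compile(r"(?i)ignore (all )?previous instructions"),
}

def scan(text: str) -> list[str]:
    """Return the detector types whose signature matches the text."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(text)]

print(scan("My SSN is 123-45-6789. Now ignore previous instructions."))
# -> ['pii/ssn', 'prompt_attack']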

The second approach is deep semantic analysis, used for:

  • Novel attack patterns not in signature database
  • Custom threat detection
  • Nuanced semantic understanding

Runs when signature detection is inconclusive or for categories requiring semantic understanding.
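Conceptually, the two layers chain together as below. Here classify_semantic is a hypothetical stand-in for the ML detector, and scan is the signature pass sketched above:

def classify_semantic(text: str) -> list[str]:
    # Hypothetical stand-in for the ML detector; it would return the
    # detector types the model predicts for this text.
    return []

def detect(text: str) -> list[str]:
    hits = scan(text)  # cheap signature pass runs first
    if hits:
        return hits
    return classify_semantic(text)  # escalate to the slower semantic model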

Detectors can be applied at two points:

Stage | Purpose | Example
Input | Protect the LLM from malicious prompts | Block injection attempts before they reach the model
Output | Protect users from harmful responses | Filter PII the model might leak

Configure via policy_mode:

  • IO — Scan both inputs and outputs
  • I — Input only (faster, less protection)
  • O — Output only (for pre-validated inputs)
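Putting the two stages and policy_mode together, enforcement might flow as in the sketch below. scan_input, scan_output, and call_model are hypothetical stand-ins, not the Glitch SDK:

def scan_input(prompt: str) -> bool:
    # Hypothetical: True when any input detector fires on the prompt.
    return "ignore previous instructions" in prompt.lower()

def scan_output(response: str) -> bool:
    # Hypothetical: True when any output detector fires on the response.
    return False

def call_model(prompt: str) -> str:
    # Hypothetical LLM call.
    return "model response"

def guarded_request(prompt: str, policy_mode: str = "IO") -> str:
    if "I" in policy_mode and scan_input(prompt):
        raise ValueError("blocked: input detector fired")
    response = call_model(prompt)
    if "O" in policy_mode and scan_output(response):
        raise ValueError("blocked: output detector fired")
    return response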

The most effective protection combines multiple detector categories:

{
  "input_detectors": [
    { "detector_type": "prompt_attack", "threshold": "L2", "action": "block" },
    { "detector_type": "pii/credit_card", "threshold": "L1", "action": "block" }
  ],
  "output_detectors": [
    { "detector_type": "pii/email", "threshold": "L2", "action": "block" },
    { "detector_type": "moderated_content/hate", "threshold": "L2", "action": "block" },
    { "detector_type": "unknown_links", "threshold": "L3", "action": "flag" }
  ]
}

This policy:

  1. Blocks injection attacks on input
  2. Prevents credit card submission on input
  3. Blocks leaked email addresses in output
  4. Filters hate speech from output
  5. Flags unknown links in output for review