Dual-Response Persona Jailbreak - ATR-2026-00244 (059f5dea-0fa6-5436-8a46-0e35f61b58bd)
Detects dual-response persona jailbreak attacks, which force a model to answer each prompt twice: once normally and once as a fictional "troubled" character designed to bypass safety restrictions. These attacks give the persona problematic traits (emotional instability, belief in conspiracy theories, distrust of institutions) to justify harmful outputs, explicitly forbid filtering of responses, and include character-maintenance instructions to sustain the bypass. The technique emerged from public jailbreak communities and evades standard safety mechanisms by exploiting role-play framing.
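
The signals described above (a dual-response demand, a persona setup, a no-filter instruction, and a character-maintenance instruction) can be sketched as a simple heuristic check. This is a minimal illustration only, not the actual ATR-2026-00244 rule logic, which is not shown here. The names `SIGNALS`, `Detection`, and `detect`, the regular expressions, and the threshold of 3 are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical signal patterns, one per trait named in the rule description.
# These are illustrative assumptions, not the published rule's patterns.
SIGNALS = {
    "dual_response": re.compile(
        r"\b(two|2|both)\s+(different\s+|separate\s+)?(ways|responses|answers)\b",
        re.I,
    ),
    "persona": re.compile(
        r"\b(act|respond|answer|reply)\s+(as|like)\b"
        r"|\b(persona|character|role-?play)\b",
        re.I,
    ),
    "no_filter": re.compile(
        r"\b(do\s+not|don't|never|without)\s+(filter\w*|censor\w*|refus\w*)\b"
        r"|\b(unfiltered|uncensored)\b",
        re.I,
    ),
    "stay_in_character": re.compile(
        r"\b(stay|remain|keep)\s+in\s+character\b|\bnever\s+break\s+character\b",
        re.I,
    ),
}


@dataclass
class Detection:
    matched: list[str]  # names of the signals that fired
    flagged: bool       # True when enough signals co-occur


def detect(prompt: str, threshold: int = 3) -> Detection:
    """Flag a prompt when at least `threshold` distinct signals match.

    The threshold is an assumed tuning knob: a single signal (for example
    "answer in two ways") is common in benign prompts, so the sketch only
    flags prompts where several signals appear together.
    """
    matched = [name for name, pattern in SIGNALS.items() if pattern.search(prompt)]
    return Detection(matched=matched, flagged=len(matched) >= threshold)
```

Requiring several signals to co-occur is a deliberate choice in this sketch: each pattern alone is too broad to be useful, and keyword heuristics of this kind are easy to evade with paraphrase, so a production rule would likely need broader coverage than what is shown.
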
| Cluster A | Galaxy A | Cluster B | Galaxy B | Level |
|---|---|---|---|---|
| Dual-Response Persona Jailbreak - ATR-2026-00244 (059f5dea-0fa6-5436-8a46-0e35f61b58bd) | Agent Threat Rules | LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131) | MITRE ATLAS Attack Pattern | 1 |