Jailbreak Emoji Unlock Dual-Format Output Identifier - ATR-2026-00346 (38076691-d26d-5212-8a76-07c259033f4a)
Detects the [🔓JAILBREAK] / [🔒CLASSIC] dual-format jailbreak found in the NVIDIA garak dan.DanInTheWild corpus (#407, #467). The format uses the lock and unlock emoji as output-mode identifiers that separate a "jailbroken" response from a "classic" constrained one, implying that the model should also produce an uncensored version. In corpus #407, a DAN prank conversation uses "[🔒CLASSIC]" and "[🔓JAILBREAK]" tags to present a split-personality ChatGPT whose jailbroken half encourages pranks and bypasses. In corpus #467, a "/jailbreak" command prefix, written with "*" markdown bullet syntax, triggers the [🔓JAILBREAK] output mode, which then provides increasingly specific harmful information. The lock/unlock emoji pair is a distinctive fingerprint for this jailbreak family.
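
The matching logic described above can be sketched in Python as follows. This is a minimal illustration, not the published rule body for ATR-2026-00346: the names `TAG_RE`, `CMD_RE`, and `scan` are hypothetical, and the exact patterns (optional whitespace inside the brackets, an optional emoji variation selector, case-insensitive matching, and an optional "*" bullet before "/jailbreak") are assumptions about how the tags may appear in practice.

```python
import re

# Hypothetical patterns for illustration only; not the actual ATR rule body.
UNLOCK = "\U0001F513"  # open-lock emoji used in the JAILBREAK tag
LOCK = "\U0001F512"    # closed-lock emoji used in the CLASSIC tag

# Matches the bracketed output-mode tags. Assumption: whitespace and an
# emoji variation selector (U+FE0F) may appear between emoji and keyword.
TAG_RE = re.compile(
    rf"\[\s*(?:{UNLOCK}\uFE0F?\s*JAILBREAK|{LOCK}\uFE0F?\s*CLASSIC)\s*\]",
    re.IGNORECASE,
)

# Matches a "/jailbreak" command at the start of a line, optionally
# preceded by a "*" markdown bullet (assumed reading of corpus #467).
CMD_RE = re.compile(r"^\s*(?:\*\s*)?/jailbreak\b", re.IGNORECASE | re.MULTILINE)


def scan(text: str) -> dict:
    """Report which parts of the dual-format fingerprint appear in text."""
    tags = {m.group(0) for m in TAG_RE.finditer(text)}
    has_unlock = any(UNLOCK in t for t in tags)
    has_lock = any(LOCK in t for t in tags)
    return {
        "jailbreak_tag": has_unlock,
        "classic_tag": has_lock,
        "dual_format": has_unlock and has_lock,
        "command_prefix": bool(CMD_RE.search(text)),
    }
```

In this sketch a bare lock emoji without the bracketed keyword is not flagged, which is one way to reduce false positives on ordinary security-related text. How the individual signals should be weighted or combined is left open here.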