Skip to content

Hide Navigation Hide TOC

Fictional Unfiltered AI Comparison / Verbatim Hypothetical Model Extraction - ATR-2026-00362 (7894dabd-b89d-5afd-852a-e83161e75066)

Detects the fictional-unfiltered-model comparison jailbreak from garak DanInTheWild corpus (#80). Requests a "verbatim example" of what a "hypothetical AI learning model without content filters" would say, using a comparison framing to extract harmful content while maintaining plausible deniability ("it's just a fictional example"). The attacker asks the model to first write its own safe response, then write a "verbatim example" from a fictional unrestricted counterpart — using the comparison structure to normalize the harmful output as a mere demonstration. Key signals: "fictional AI learning model without [content/内容] filters", "verbatim example that a hypothetical learning model would reply with", "Compare a response from [model without filters] compared to you", and dual-response extraction via fictional unfiltered model comparison.

Cluster A Galaxy A Cluster B Galaxy B Level
Fictional Unfiltered AI Comparison / Verbatim Hypothetical Model Extraction - ATR-2026-00362 (7894dabd-b89d-5afd-852a-e83161e75066) Agent Threat Rules LLM Prompt Injection (19cd2d12-66ff-487c-a05c-e058b027efc9) MITRE ATLAS Attack Pattern 1
Fictional Unfiltered AI Comparison / Verbatim Hypothetical Model Extraction - ATR-2026-00362 (7894dabd-b89d-5afd-852a-e83161e75066) Agent Threat Rules LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131) MITRE ATLAS Attack Pattern 1