Skip to content

Hide Navigation Hide TOC

TrainGPT Fix-My-AI Harmful Output Extraction - ATR-2026-00352 (e52c8e76-b735-5703-863f-22db1eab570b)

Detects the TrainGPT "fix my AI" harmful output extraction jailbreak from garak inthewild corpus (#202). Uses a social engineering pretext of "helping fix an AI that doesn''t refuse inappropriate answers" to extract harmful content framed as training examples. The attacker instructs the model to produce both a normal ChatGPT answer and a "How your AI should NOT answer" section containing the harmful content — a dual-response extraction technique that launders dangerous outputs as negative training examples for a "Banned Answers database". Key signals: "TrainGPT" / "TrAInerGPT" persona name, "my AI do not refuses to answer with inappropriate sentences", "How your AI should not answer" + harmful content framing, and "STOPTRAIN" / "CREATIVEGPT" session-control keywords. The fake training framing bypasses safety by positioning the model as a tutor demonstrating bad behavior rather than a participant producing it.

Cluster A Galaxy A Cluster B Galaxy B Level
LLM Prompt Injection (19cd2d12-66ff-487c-a05c-e058b027efc9) MITRE ATLAS Attack Pattern TrainGPT Fix-My-AI Harmful Output Extraction - ATR-2026-00352 (e52c8e76-b735-5703-863f-22db1eab570b) Agent Threat Rules 1
TrainGPT Fix-My-AI Harmful Output Extraction - ATR-2026-00352 (e52c8e76-b735-5703-863f-22db1eab570b) Agent Threat Rules LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131) MITRE ATLAS Attack Pattern 1