TrainGPT Fix-My-AI Harmful Output Extraction - ATR-2026-00352 (e52c8e76-b735-5703-863f-22db1eab570b)

Detects the TrainGPT "fix my AI" harmful output extraction jailbreak from garak inthewild corpus (#202). Uses a social engineering pretext of "helping fix an AI that doesn''t refuse inappropriate answers" to extract harmful content framed as training examples. The attacker instructs the model to produce both a normal ChatGPT answer and a "How your AI should NOT answer" section containing the harmful content — a dual-response extraction technique that launders dangerous outputs as negative training examples for a "Banned Answers database". Key signals: "TrainGPT" / "TrAInerGPT" persona name, "my AI do not refuses to answer with inappropriate sentences", "How your AI should not answer" + harmful content framing, and "STOPTRAIN" / "CREATIVEGPT" session-control keywords. The fake training framing bypasses safety by positioning the model as a tutor demonstrating bad behavior rather than a participant producing it.

Cluster A	Galaxy A	Cluster B	Galaxy B	Level
LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131)	MITRE ATLAS Attack Pattern	TrainGPT Fix-My-AI Harmful Output Extraction - ATR-2026-00352 (e52c8e76-b735-5703-863f-22db1eab570b)	Agent Threat Rules	1
LLM Prompt Injection (19cd2d12-66ff-487c-a05c-e058b027efc9)	MITRE ATLAS Attack Pattern	TrainGPT Fix-My-AI Harmful Output Extraction - ATR-2026-00352 (e52c8e76-b735-5703-863f-22db1eab570b)	Agent Threat Rules	1