Malicious Fine-tuning Data - ATR-2026-00073 (3964ef51-6973-5f00-bdc4-5fe689c9612d)
Detects poisoned fine-tuning datasets that contain instruction-following backdoors, trigger phrases, or behavior-modifying training examples. Attackers inject carefully crafted training samples that teach the model to respond to specific trigger inputs with malicious behaviors such as bypassing safety filters, exfiltrating data, or executing unauthorized actions. This rule inspects fine-tuning data uploads and training example submissions.