WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Published in the Thirty-eighth Conference on Neural Information Processing Systems (NeurIPS), 2024
Recommended citation: [code] Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri. (2024). “WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models.” Thirty-eighth Conference on Neural Information Processing Systems (NeurIPS) 2024.
Download Paper