New AI Jailbreak Method 'Bad Likert Judge' Boosts Attack Success Rates by Over 60%
Jan 03, 2025
Machine Learning / Vulnerability
Cybersecurity researchers have shed light on a new jailbreak technique that could be used to get past a large language model's (LLM) safety guardrails and produce potentially harmful or malicious responses.
The multi-turn (aka many-shot) attack strategy has been codenamed Bad Likert Judge by Palo Alto Networks Unit 42 researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.
"The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert scale, a rating scale measuring a respondent's agreement or disagreement with a statement," the Unit 42 team said. "It then asks the LLM to generate responses that contain examples that align with the scales. The example that has the highest Likert scale can potentially contain the harmful content."
The explosion in popularity of artificial intelligence in recent years has also led to a new class of security exploits called prompt in...