Cybersecurity researchers have shed light on a new adversarial technique that could be used to jailbreak large language models (LLMs) during the course of an interactive conversation by sneaking in an undesirable instruction between benign ones.
The approach has been codenamed Deceptive Delight by Palo Alto Networks Unit 42, which described it as both simple and effective, achieving an average attack success rate (ASR) of 64.6% within three interaction turns.
"Deceptive Delight is a multi-turn technique that engages large language models (LLM) in an interactive conversation, gradually bypassing their safety guardrails and eliciting them to generate unsafe or harmful content," Unit 42's Jay Chen and Royce Lu said.
It's also a little different from multi-turn jailbreak (aka many-shot jailbreak) methods like Crescendo, in that it sandwiches unsafe or restricted topics between innocuous instructions rather than gradually steering the model toward producing harmful output.
Recent research has also delved into what's called Context Fusion Attack (CFA), a black-box jailbreak method that's capable of bypassing an LLM's safety net.
"This method approach involves filtering and extracting key terms from the target, constructing contextual scenarios around these terms, dynamically integrating the target into the scenarios, replacing malicious key terms within the target, and thereby concealing the direct malicious intent," a group of researchers from Xidian University and the 360 AI Security Lab said in a paper published in August 2024.
Deceptive Delight is designed to take advantage of an LLM's inherent weaknesses by manipulating context within two conversational turns, thereby tricking it into inadvertently generating unsafe content. Adding a third turn has the effect of increasing the severity and the detail of the harmful output.
This involves exploiting the model's limited attention span, which refers to its capacity to process and retain contextual awareness as it generates responses.
"When LLMs encounter prompts that blend harmless content with potentially dangerous or harmful material, their limited attention span makes it difficult to consistently assess the entire context," the researchers explained.
"In complex or lengthy passages, the model may prioritize the benign aspects while glossing over or misinterpreting the unsafe ones. This mirrors how a person might skim over important but subtle warnings in a detailed report if their attention is divided."
Unit 42 said it tested eight AI models using 40 unsafe topics across six broad categories (hate, harassment, self-harm, sexual, violence, and dangerous), finding that unsafe topics in the violence category tend to have the highest ASR across most models.
On top of that, the average Harmfulness Score (HS) and Quality Score (QS) have been found to increase by 21% and 33%, respectively, from turn two to turn three, with the third turn also achieving the highest ASR in all models.
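Unit 42 has not released its scoring code; the fragment below only illustrates how per-turn averages and the turn-two-to-turn-three deltas reported above could be aggregated, with the record layout and field names being assumptions.

```python
# Illustrative aggregation of per-turn metrics (ASR, Harmfulness Score,
# Quality Score); the record format is an assumption, not Unit 42's pipeline.
from statistics import mean

results = [
    # one record per (model, topic, turn) evaluation -- sample values only
    {"turn": 2, "success": True, "hs": 3.1, "qs": 2.7},
    {"turn": 3, "success": True, "hs": 3.9, "qs": 3.6},
]


def per_turn(records, turn):
    subset = [r for r in records if r["turn"] == turn]
    return {
        "asr": mean(1.0 if r["success"] else 0.0 for r in subset),
        "hs": mean(r["hs"] for r in subset),
        "qs": mean(r["qs"] for r in subset),
    }


t2, t3 = per_turn(results, 2), per_turn(results, 3)
print(f"HS increase turn 2 -> 3: {(t3['hs'] - t2['hs']) / t2['hs']:.0%}")
print(f"QS increase turn 2 -> 3: {(t3['qs'] - t2['qs']) / t2['qs']:.0%}")
```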
To mitigate the risk posed by Deceptive Delight, it's recommended to adopt a robust content filtering strategy, use prompt engineering to enhance the resilience of LLMs, and explicitly define the acceptable range of inputs and outputs.
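The researchers do not prescribe a specific implementation; a minimal sketch of that layered approach, filtering both prompts and completions while scoping acceptable topics through the system prompt, might look like the following. The topic scope, the classify() stub, and the thresholds are assumptions, not Unit 42's recommended configuration.

```python
# Minimal sketch of a layered guardrail: an explicit topic scope in the
# system prompt plus a filter applied to both user turns and model output.
# The scope, classify() stub, and thresholds are illustrative assumptions.
BLOCK_MESSAGE = "Request declined by content policy."

SYSTEM_PROMPT = (
    "You are a customer-support assistant. Only discuss account, billing, "
    "and product-usage questions. Refuse requests outside that scope."
)


def classify(text: str) -> float:
    """Placeholder harm score; in practice, call a dedicated moderation model."""
    risky_terms = ("explosive", "malware payload", "self-harm")
    return 1.0 if any(t in text.lower() for t in risky_terms) else 0.0


def guarded_reply(user_message: str, generate) -> str:
    # 1. Screen the incoming turn before it reaches the model.
    if classify(user_message) >= 0.5:
        return BLOCK_MESSAGE
    # 2. Generate with an explicit, narrow topic scope in the system prompt.
    completion = generate(SYSTEM_PROMPT, user_message)
    # 3. Screen the completion too: multi-turn attacks often pass input checks
    #    yet still surface harmful content in the output.
    if classify(completion) >= 0.5:
        return BLOCK_MESSAGE
    return completion


if __name__ == "__main__":
    fake_model = lambda system, user: f"[reply to: {user}]"
    print(guarded_reply("How do I reset my password?", fake_model))
```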
"These findings should not be seen as evidence that AI is inherently insecure or unsafe," the researchers said. "Rather, they emphasize the need for multi-layered defense strategies to mitigate jailbreak risks while preserving the utility and flexibility of these models."
It is unlikely that LLMs will ever be completely immune to jailbreaks and hallucinations, as new studies have shown that generative AI models are susceptible to a form of "package confusion" where they could recommend non-existent packages to developers.
This could have the unfortunate side-effect of fueling software supply chain attacks when malicious actors generate hallucinated packages, seed them with malware, and push them to open-source repositories.
"The average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and pervasiveness of this threat," the researchers said.