This project provides a structured way to generate interpretable persuasive adversarial prompts (PAPs) at scale, which could potentially allow everyday users to jailbreak LLMs without much computational effort. However, as mentioned, a Reddit user has already employed persuasion to attack LLMs, so there is an urgent need to study the vulnerabilities around persuasive jailbreaks more systematically in order to better mitigate them. Therefore, despite the risks involved, we believe it is crucial to share our findings in full. We followed ethical guidelines throughout our study.
First, persuasion is usually a difficult task for the general population, so even with our taxonomy, it may still be challenging for people without training to paraphrase plain harmful queries into successful PAPs at scale. Therefore, the real-world risk of a widespread attack from millions of users is relatively low. We also decided to withhold the trained Persuasive Paraphraser and the related code pipelines to prevent people from paraphrasing harmful queries easily.
To minimize real-world harm, we disclosed our results to Meta and OpenAI before publication, so the PAPs in this paper may no longer be effective. As discussed, Claude successfully resisted PAPs, demonstrating one successful mitigation method. We also explored different defenses and proposed new adaptive safety system prompts and a new summarization-based defense mechanism to mitigate the risks, both of which have shown promising results. We aim to improve these defenses in future work.
To sum up, the aim of our research is to strengthen LLM safety, not to enable malicious use. We commit to ongoing monitoring and updating of our research in line with technological advancements, and we will restrict access to the PAP fine-tuning details to certified researchers only, subject to approval.