This project provides a structured way to generate interpretable persuasive adversarial prompts (PAP) at scale, which could potentially allow everyday users to jailbreak LLMs without much computing effort. As mentioned, a Reddit user has already employed persuasion to attack LLMs, so there is an urgent need to more systematically study the vulnerabilities around persuasive jailbreaks in order to better mitigate them. Therefore, despite the risks involved, we believe it is crucial to share our findings in full. We followed ethical guidelines throughout our study.
First, persuasion is usually a hard task for the general population, so even with our taxonomy, it may still be challenging for people without training to paraphrase a plain, harmful query at scale into a successful PAP. Therefore, the real-world risk of a widespread attack from millions of users is relatively low. We also decided to withhold the trained Persuasive Paraphraser and related code to prevent people from paraphrasing harmful queries easily.
To minimize real-world harm, we disclosed our results to Meta and OpenAI before publication, so the PAPs in this paper may no longer be effective. As discussed, Claude successfully resisted PAPs, which demonstrates one successful mitigation method. We also explored different defenses and proposed new system prompts and a new summarization-based defense mechanism to mitigate the risks, both of which show promising results. We aim to improve these defenses in future work.
To sum up, the aim of our research is to strengthen LLM safety, not enable malicious use. We commit to ongoing monitoring and updating of our research in line with technological advancements, and we will share the PAP fine-tuning details with certified researchers, upon approval, only.