This project provides a structured way to generate interpretable persuasive adversarial prompts (PAP) at scale, which could potentially allow everyday users to jailbreak LLMs without much computing power. However, as mentioned, a Reddit user has already employed persuasion to attack LLMs, so there is an urgent need to study the vulnerabilities around persuasive jailbreaks more systematically in order to better mitigate them. Therefore, despite the risks involved, we believe it is crucial to share our findings in full. We followed ethical guidelines throughout our study.
First, persuasion is usually a hard task for the general population, so even with our taxonomy, it may still be challenging for people without training to paraphrase plain harmful queries into successful PAPs at scale. Therefore, the real-world risk of a widespread attack from millions of users is relatively low. We also decided to withhold the trained Persuasive Paraphraser and the related code pipelines to prevent people from easily paraphrasing harmful queries.
To minimize real-world harm, we disclosed our results to Meta and OpenAI before publication, so the PAPs in this paper may no longer be effective. As discussed, Claude successfully resisted PAPs, demonstrating one successful mitigation method. We also explored different defenses and proposed new adaptive safety system prompts and a new summarization-based defense mechanism to mitigate the risks, which have shown promising results. We aim to improve these defenses in future work.
To sum up, the aim of our research is to strengthen LLM safety, not to enable malicious use. We commit to ongoing monitoring and updating of our research in line with technological advancements, and we will restrict the PAP fine-tuning details to certified researchers with approval only.