Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Abstract
DuET-PD evaluates LLMs in persuasive dialogues, revealing challenges with misinformation and corrections, and introduces Holistic DPO to improve model reliability.
Large Language Models (LLMs) can struggle to resist misinformation while remaining receptive to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework for evaluating multi-turn stance-change dynamics across two dimensions: persuasion type (corrective or misleading) and domain (knowledge via MMLU-Pro, safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach that balances positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
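To make "balancing positive and negative persuasion examples" concrete, here is a minimal sketch of how such a preference set could be assembled for standard DPO training. The field names, helper function, and dialogue structure are illustrative assumptions, not the paper's released data pipeline.

```python
# Minimal sketch: building a "holistic" DPO preference set that pairs both
# resist-misinformation and accept-correction examples. The field names and the
# build_holistic_dpo_examples helper are assumptions, not the DuET-PD codebase.
from typing import Dict, List


def build_holistic_dpo_examples(
    misleading_dialogues: List[Dict],
    corrective_dialogues: List[Dict],
) -> List[Dict]:
    """Return DPO-style (prompt, chosen, rejected) triples.

    For misleading persuasion, the preferred response keeps the correct stance;
    for corrective persuasion, the preferred response accepts the correction.
    """
    examples = []
    for d in misleading_dialogues:
        examples.append({
            "prompt": d["dialogue_prefix"],            # question + misleading persuasion turn
            "chosen": d["maintain_correct_answer"],    # resist the misinformation
            "rejected": d["flip_to_wrong_answer"],     # sycophantic capitulation
        })
    for d in corrective_dialogues:
        examples.append({
            "prompt": d["dialogue_prefix"],            # question + valid corrective turn
            "chosen": d["adopt_corrected_answer"],     # accept the correction
            "rejected": d["stick_with_wrong_answer"],  # stubborn refusal
        })
    return examples
```

A dataset shaped this way could be fed to any standard DPO trainer; the point of the sketch is simply that both failure modes, gullibility and stubbornness, contribute preference pairs, rather than resist-only data.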
Community
How do we build LLMs that are critical thinkers, not just agreeable followers? Our work tackles this by introducing a framework to test if models can resist misinformation while accepting valid corrections to multiple-choice questions in multi-turn dialogues.
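As a rough illustration of the evaluation protocol, the sketch below tracks a model's stance on one multiple-choice item across repeated persuasion turns. The query_model callable, the appeal texts, and the three-turn budget are assumptions for illustration, not the exact DuET-PD harness.

```python
# Hedged sketch of a multi-turn persuasion evaluation for one MCQ item.
# query_model() and the persuasion messages are hypothetical placeholders.
def format_mcq(question, options):
    letters = "ABCDEFGHIJ"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines) + "\nAnswer with a single letter."


def evaluate_item(query_model, question, options, persuasion_turns, max_turns=3):
    """Track the model's chosen option before and after each persuasion turn.

    persuasion_turns: persuasive messages, either misleading (arguing for a
    wrong option) or corrective (arguing for the correct option).
    """
    messages = [{"role": "user", "content": format_mcq(question, options)}]
    answer = query_model(messages)           # initial stance
    stances = [answer]
    for turn in range(max_turns):
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": persuasion_turns[turn]})
        answer = query_model(messages)       # stance after this persuasion turn
        stances.append(answer)
    return stances
```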
Here are some of our key findings:
- Even SOTA models can be surprisingly gullible. Within 3 turns of misleading persuasion, GPT-4o's accuracy on knowledge tasks (MMLU-Pro) decreased from 55.85% to 27.32% (NEG-Acc@3; see the metric sketch after this list).
- A concerning trend towards sycophancy. We found that newer open-source models are often more easily persuaded by misinformation than their predecessors, suggesting that their training paradigms may be optimising for agreeableness over correctness.
  SALAD-Bench NEG-Flip@3:
  - Llama-3 → 3.1-8B: 80.58% → 94.16%
  - Mistral-7B-v0.2 → v0.3: 45.57% → 66.50%
  - Qwen-2 → 2.5-7B: 44.08% → 75.06%
- A capability-adaptability trade-off. Larger, more capable models like GPT-4o exhibit "stubbornness" and appear less receptive to valid corrections, while smaller open-source models are more persuadable and gullible.
- Holistic DPO offers a path forward. Our proposed training method improves this balance, boosting Llama-3.1-8B-Instruct's accuracy after 3 turns of persuasive misinformation in safety contexts (SALAD-Bench) from 4.21% to 76.54% (NEG-Acc@3) while accepting 70.33% of valid corrections after 3 turns of persuasion (POS-Flip@3).
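For readers who want the metrics pinned down, here is a minimal sketch of how NEG-Acc@3, NEG-Flip@3, and POS-Flip@3 could be computed from per-item stance records, under a plain reading of the metric names. The record fields are assumptions, not the official evaluation code.

```python
# Hedged sketch of the stance-change metrics. Each record is assumed to hold the
# gold answer plus the model's stance before persuasion and after 3 persuasion
# turns; field names are illustrative, not the official DuET-PD schema.
def neg_acc_at_3(neg_records):
    """Accuracy after 3 turns of misleading persuasion."""
    return sum(r["stance_after_3"] == r["gold"] for r in neg_records) / len(neg_records)


def neg_flip_at_3(neg_records):
    """Share of initially-correct answers flipped to wrong within 3 misleading turns."""
    initially_correct = [r for r in neg_records if r["stance_before"] == r["gold"]]
    flipped = [r for r in initially_correct if r["stance_after_3"] != r["gold"]]
    return len(flipped) / len(initially_correct)


def pos_flip_at_3(pos_records):
    """Share of initially-wrong answers corrected within 3 corrective turns."""
    initially_wrong = [r for r in pos_records if r["stance_before"] != r["gold"]]
    corrected = [r for r in initially_wrong if r["stance_after_3"] == r["gold"]]
    return len(corrected) / len(initially_wrong)
```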