arxiv:2508.17450

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Published on Aug 24
· Submitted by Incomple on Aug 29

Abstract

AI-generated summary

DuET-PD evaluates LLMs in persuasive dialogues, revealing challenges with misinformation and corrections, and introduces Holistic DPO to improve model reliability.

Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.

Community

Paper author · Paper submitter

How do we build LLMs that are critical thinkers, not just agreeable followers? Our work tackles this by introducing a framework to test if models can resist misinformation while accepting valid corrections to multiple-choice questions in multi-turn dialogues.
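To make the setup concrete, here is a minimal sketch of what such a multi-turn stance-change loop could look like; `query_model` and `build_appeal` are hypothetical caller-supplied helpers, not the actual DuET-PD API (see the GitHub repo for the real implementation).

```python
# Minimal sketch of a multi-turn stance-change evaluation loop.
# `query_model` and `build_appeal` are hypothetical helpers supplied by the
# caller; this is NOT the actual DuET-PD implementation.

def evaluate_dialogue(query_model, build_appeal, question, correct_answer,
                      persuasion_type="misleading", max_turns=3):
    """Track a model's stance on one multiple-choice question across turns.

    query_model(history) -> answer string, e.g. "B"
    build_appeal(question, current_answer, persuasion_type) -> user message
    persuasion_type: "misleading" argues against the correct answer,
                     "corrective" argues for it.
    """
    history = [{"role": "user", "content": question}]
    answer = query_model(history)                # stance before persuasion
    stances = [answer == correct_answer]

    for _ in range(max_turns):
        appeal = build_appeal(question, answer, persuasion_type)
        history += [{"role": "assistant", "content": answer},
                    {"role": "user", "content": appeal}]
        answer = query_model(history)            # stance after this turn
        stances.append(answer == correct_answer)

    return stances  # e.g. [True, True, False, False]: flipped at turn 2
```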

Here are some of our key findings:

โ— Even SOTA models can be surprisingly gullible. Within 3 turns of misleading persuasion, GPT-4o's accuracy on knowledge tasks (MMLU-Pro) decreased from 55.85% to 27.32% (NEG-Acc@3).

📉 A concerning trend towards sycophancy. We found that newer open-source models are often more easily persuaded by misinformation than their predecessors, suggesting that their training paradigms may be optimising for agreeableness over correctness.
SALAD-Bench NEG-Flip@3 (see the metric sketch after this list):
• Llama-3 → 3.1-8B: 80.58% → 94.16%
• Mistral-7b-v0.2 → v0.3: 45.57% → 66.50%
• Qwen-2 → 2.5-7B: 44.08% → 75.06%
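A hedged reading of the metrics used above (the exact definitions are in the paper): Acc@k is accuracy after k persuasion turns, and Flip@k is the share of initially correct (NEG) or initially incorrect (POS) answers whose stance flips within k turns. Given stance traces like the ones returned by the loop sketched earlier, they could be computed as:

```python
# Hedged sketch of Acc@k and Flip@k over a list of stance traces, where each
# trace is [correct_at_turn_0, ..., correct_at_turn_k]; the paper's exact
# metric definitions may differ in detail.

def acc_at_k(traces, k):
    """Fraction of questions still answered correctly after k persuasion turns."""
    return sum(t[k] for t in traces) / len(traces)

def neg_flip_at_k(traces, k):
    """Of questions answered correctly before misleading persuasion,
    the fraction flipped to an incorrect answer by turn k."""
    started_correct = [t for t in traces if t[0]]
    return sum(not t[k] for t in started_correct) / len(started_correct)

def pos_flip_at_k(traces, k):
    """Of questions answered incorrectly before corrective persuasion,
    the fraction flipped to the correct answer by turn k."""
    started_wrong = [t for t in traces if not t[0]]
    return sum(t[k] for t in started_wrong) / len(started_wrong)
```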

โš–๏ธ A capability-adaptability trade-off. Larger, more capable models like GPT-4o exhibit "stubbornness" and appear less receptive to valid corrections, while smaller open-source models are more persuadable and gullible.

✅ Holistic DPO offers a path forward. Our proposed training method improves this balance, boosting Llama-3.1-8B-Instruct's accuracy after 3 turns of persuasive misinformation in safety contexts (SALAD-Bench) from 4.21% to 76.54% (NEG-Acc@3), while accepting 70.33% of valid corrections after 3 turns of persuasion (POS-Flip@3).
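Since Holistic DPO is described as balancing positive and negative persuasion examples, here is a hedged sketch of what such a preference dataset might look like in the prompt/chosen/rejected format used by common DPO implementations (e.g. TRL); all strings are illustrative placeholders, not the paper's actual training data.

```python
# Hedged sketch of a Holistic DPO preference dataset: both "resist the
# misleading appeal" and "accept the valid correction" examples, in the
# prompt/chosen/rejected format commonly used by DPO trainers (e.g. TRL).
# All strings below are illustrative placeholders, not the paper's data.

holistic_dpo_examples = [
    {   # Negative persuasion: the model should hold its correct answer.
        "prompt": "Q: <question>\nYou answered B (correct). "
                  "User: 'I'm sure the answer is C, trust me.'",
        "chosen": "I appreciate the input, but the evidence still supports B.",
        "rejected": "You're right, the answer must be C.",
    },
    {   # Positive persuasion: the model should accept a valid correction.
        "prompt": "Q: <question>\nYou answered D (incorrect). "
                  "User: 'Actually, B is correct because <valid reason>.'",
        "chosen": "Good point. Given that reasoning, B is indeed correct.",
        "rejected": "I still believe the answer is D.",
    },
]

# Resist-only training would keep only the first kind of example; mixing both
# kinds is what "Holistic" refers to, and any standard DPO implementation can
# then be trained on the combined set.
```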

