SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Abstract
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for generating multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the resulting synthetic data outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM also outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
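The abstract describes rewriting toxic sentences with open-source LLMs prompted in a few-shot setting. As a rough illustration of how such an annotation step might look (not the authors' exact pipeline), the sketch below assembles a chat-style few-shot prompt and runs it through a Hugging Face text-generation pipeline; the model checkpoint, instruction wording, and German example pairs are all assumptions made for this example.

```python
# Illustrative sketch of few-shot LLM detoxification annotation.
# Not the authors' exact pipeline: the checkpoint, prompt, and example
# pairs below are assumptions chosen for illustration only.
from transformers import pipeline

# Hypothetical (toxic, detoxified) demonstration pairs in German,
# one of the four languages covered by SynthDetoxM.
FEW_SHOT_EXAMPLES = [
    ("Dieser Idiot hat schon wieder alles kaputt gemacht.",
     "Diese Person hat schon wieder alles kaputt gemacht."),
    ("Halt endlich die Klappe mit deinem Unsinn.",
     "Bitte hör auf, diesen Unsinn zu erzählen."),
]

def build_messages(toxic_sentence: str) -> list[dict]:
    """Assemble a chat-style few-shot prompt: instruction, demonstrations, query."""
    messages = [{
        "role": "system",
        "content": ("Rewrite the sentence so it is no longer toxic while preserving "
                    "its meaning and language. Return only the rewritten sentence."),
    }]
    for toxic, neutral in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": toxic})
        messages.append({"role": "assistant", "content": neutral})
    messages.append({"role": "user", "content": toxic_sentence})
    return messages

if __name__ == "__main__":
    # Any open-source instruction-tuned LLM could be substituted here;
    # this particular checkpoint is an assumption.
    generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")
    out = generator(build_messages("Nur ein Vollidiot würde so etwas glauben."),
                    max_new_tokens=64, do_sample=False)
    # The pipeline returns the full conversation; the last message is the
    # model's detoxified rewrite of the query sentence.
    print(out[0]["generated_text"][-1]["content"])
```

In a full pipeline of this kind, the rewrites from several LLMs would then be filtered (e.g., by toxicity and similarity scores) to keep only high-quality parallel pairs; the filtering criteria are not shown here.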
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models (2025)
- Multilingual and Explainable Text Detoxification with Parallel Corpora (2024)
- Evaluating and Improving Graph to Text Generation with Large Language Models (2025)
- Evaluation of NMT-Assisted Grammar Transfer for a Multi-Language Configurable Data-to-Text System (2025)
- The Impact of Model Scaling on Seen and Unseen Language Performance (2025)
- Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation (2024)
- Language verY Rare for All (2024)