---
library_name: transformers
tags:
- detoxification
- text_style_transfer
license: openrail++
datasets:
- textdetox/multilingual_paradetox
language:
- de
- es
- fr
- ru
base_model:
- bigscience/mt0-xl
pipeline_tag: text2text-generation
---

# mT0-XL (MultiParaDetox)


![image/png](https://cdn-uploads.huggingface.co/production/uploads/61ade264f602880813dbe10b/V-_UsUgqXy1BStg2G9SfS.png)


This is a fine-tune of the [`bigscience/mt0-xl`](https://huggingface.co/bigscience/mt0-xl) model on the multilingual text detoxification dataset [MultiParaDetox](https://huggingface.co/datasets/textdetox/multilingual_paradetox), from the NAACL 2025 Main Track paper *SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators* by Daniil Moskovskiy et al.

## Usage

Usage is similar to that of the original [`bigscience/mt0-xl`](https://huggingface.co/bigscience/mt0-xl) model:

```python
from transformers import pipeline

# Load the detoxification model as a text2text-generation pipeline.
pipe = pipeline("text2text-generation", model="s-nlp/mt0-xl-detox-mpd")

# Inputs are prefixed with "Detoxify: ".
toxic_text = "Your toxic text goes here."
result = pipe(f"Detoxify: {toxic_text}")
print(result[0]["generated_text"])
```
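
If you need finer control over decoding, the model can also be loaded through the lower-level API. The sketch below is equivalent to the pipeline call above; the generation parameters (`num_beams`, `max_new_tokens`) are illustrative assumptions, not the settings used in the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("s-nlp/mt0-xl-detox-mpd")
model = AutoModelForSeq2SeqLM.from_pretrained("s-nlp/mt0-xl-detox-mpd")

# Tokenize the prefixed input and generate a detoxified rewrite.
# Beam search and the length cap are illustrative, not paper settings.
inputs = tokenizer("Detoxify: Your toxic text goes here.", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```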

## Training Details

The model was fine-tuned for 2 epochs on the [`textdetox/multilingual_paradetox`](https://huggingface.co/datasets/textdetox/multilingual_paradetox) dataset in full precision (FP32), using the Adafactor optimizer with a learning rate of `1e-4` and a batch size of `4`, with gradient checkpointing enabled. The full training configuration is given below:

```json
{
    "do_train": true,
    "do_eval": true,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "learning_rate": 1e-4,
    "weight_decay": 0,
    "num_train_epochs": 2,
    "gradient_accumulation_steps": 1,
    "logging_strategy": "steps",
    "logging_steps": 1,
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "warmup_steps": 1,
    "report_to": "wandb",
    "optim": "adafactor",
    "lr_scheduler_type": "linear",
    "predict_with_generate": true,
    "bf16": false,
    "gradient_checkpointing": true,
    "output_dir": "/path/",
    "seed": 42,
}

```
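
For reference, here is a minimal sketch of how this configuration maps onto `transformers`' `Seq2SeqTrainingArguments`. The trainer wiring and data loading are omitted; this mirrors the JSON above rather than reproducing the exact training script.

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the JSON configuration above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="/path/",
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-4,
    weight_decay=0.0,
    num_train_epochs=2,
    gradient_accumulation_steps=1,
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="epoch",
    save_total_limit=1,
    warmup_steps=1,
    report_to="wandb",
    optim="adafactor",
    lr_scheduler_type="linear",
    predict_with_generate=True,
    bf16=False,
    gradient_checkpointing=True,
    seed=42,
)
```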

#### Metrics


We use the multilingual detoxification evaluation setup from the [TextDetox 2024 Multilingual Text Detoxification Shared Task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html).
Specifically, we use the following metrics:

- **Style Transfer Accuracy** (**STA**) is computed with the [`textdetox/xlmr-large-toxicity-classifier`](https://huggingface.co/textdetox/xlmr-large-toxicity-classifier).
- **Text Similarity** (**SIM**) is computed as the similarity of text embeddings produced by the [`sentence-transformers/LaBSE`](https://huggingface.co/sentence-transformers/LaBSE) encoder.
- **Fluency** (**FL**) is computed as the character n-gram F-score, [$\text{ChrF}_1$](https://github.com/m-popovic/chrF).

These metrics are aggregated into a final **Joint** metric (**J**):
 
$$\textbf{J} = \frac{1}{n}\sum\limits_{i=1}^{n}\textbf{STA}(y_i) \cdot \textbf{SIM}(x_i,y_i) \cdot \textbf{FL}(x_i, y_i)$$
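
As a concrete illustration, here is a minimal sketch of the aggregation step, assuming per-sample **STA**, **SIM**, and **FL** scores in $[0, 1]$ have already been computed with the models listed above (the function name and dummy scores are hypothetical):

```python
def joint_score(sta, sim, fl):
    """Mean of the per-sample products STA(y_i) * SIM(x_i, y_i) * FL(x_i, y_i)."""
    assert len(sta) == len(sim) == len(fl) and len(sta) > 0
    return sum(s * m * f for s, m, f in zip(sta, sim, fl)) / len(sta)

# Dummy per-sample scores for three detoxified outputs:
print(joint_score([0.9, 0.8, 1.0], [0.85, 0.9, 0.7], [0.95, 0.9, 0.8]))
```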

### Evaluation Results

This model was evaluated on the test set of the [`textdetox/multilingual_paradetox`](https://huggingface.co/datasets/textdetox/multilingual_paradetox) dataset from the [TextDetox 2024 Multilingual Text Detoxification Shared Task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html).
The evaluation results (**J** scores) are presented below.

| **Approach**   | **German** | **Spanish** | **Russian** |
|----------------|------------|-------------|-------------|
| **Human References** | 0.733 | 0.709 | 0.732 |
| **Baselines**  |            |             |             |
| Duplicate      | 0.287      | 0.090       | 0.048       |
| Delete         | 0.362      | 0.319       | 0.255       |
| Backtranslation | 0.233     | 0.275       | 0.223       |
| **mT0-XL supervised fine-tuning** | | | |
| [MultiParaDetox](https://huggingface.co/datasets/textdetox/multilingual_paradetox) (this model) | 0.446 | 0.344 | 0.472 |
| [SynthDetoxM](https://huggingface.co/datasets/s-nlp/synthdetoxm) (subset average) | 0.460 | 0.402 | 0.475 |
| [SynthDetoxM](https://huggingface.co/datasets/s-nlp/synthdetoxm), [`s-nlp/mt0-xl-detox-sdm-full`](https://huggingface.co/s-nlp/mt0-xl-detox-sdm-full) | **0.482** | **0.470** | **0.546** |


#### Software

Code for replicating the results from the paper can be found on [GitHub](https://github.com/s-nlp/synthdetoxm).

## Citation

**BibTeX:**

```bibtex
@misc{moskovskiy2025synthdetoxmmodernllmsfewshot,
      title={SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators}, 
      author={Daniil Moskovskiy and Nikita Sushko and Sergey Pletenev and Elena Tutubalina and Alexander Panchenko},
      year={2025},
      eprint={2502.06394},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.06394}, 
}
```

## License

This model is licensed under the OpenRAIL++ License, which supports the development of various technologies—both industrial and academic—that serve the public good.

## Model Card Authors

[Daniil Moskovskiy](https://huggingface.co/etomoscow)

## Model Card Contact

For any questions, please contact: [Daniil Moskovskiy]([email protected])