---
license: mit
datasets:
- astromis/presuicidal_signals
language:
- ru
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
tags:
- russian
- suicide
---

# Presuicidal RuBERT base

A fine-tuned [ruBert](https://huggingface.co/ai-forever/ruBert-base) trained on the presuicidal signals dataset. It aims to help psychologists find texts containing useful information about a person's suicidal behavior.

The model has two categories:
* category 1 - texts with useful information about a person's suicidal behavior, such as suicide attempts, facts of rape, problems with parents, the fact of being in a psychiatric hospital, facts of self-harm, etc. This category also includes messages expressing a subjective negative attitude towards oneself or others, including a desire to die, a feeling of pressure from the past, self-hatred, aggressiveness, and rage directed at oneself or others.
* category 0 - normal texts that don't contain the information described above.

# How to use

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("astromis/presuisidal_rubert")
model = BertForSequenceClassification.from_pretrained("astromis/presuisidal_rubert")
model.eval()

# "I feel so bad, I want to die" / "hung out with friends yesterday, it was really cool"
text = ["мне так плохо я хочу умереть", "вчера была на сходке с друзьями было оч клево"]

tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    prediction = model(**tokenized_text).logits
print(prediction.argmax(dim=1).numpy())
# >>> [1, 0]
```
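Note that the training data was preprocessed (lowercased, with punctuation and emojis removed; see the preprocessing section below), so applying the same transformations to inference inputs will likely match the training distribution better.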

# Training procedure

## Data preprocessing

Before training, the text was transformed as follows:
* all emojis were removed (in the dataset they are marked as `<emoji>emoji_name</emoji>`);
* punctuation was removed;
* text was lowercased;
* all newlines were replaced with spaces;
* consecutive spaces were collapsed into one.
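
For reference, here is a minimal sketch of such a preprocessing function. It is an assumption of how the steps above could be implemented, not the authors' exact code:

```python
import re
import string

def preprocess(text: str) -> str:
    # Remove emoji markers of the form <emoji>emoji_name</emoji>
    text = re.sub(r"<emoji>.*?</emoji>", " ", text)
    # Remove punctuation (ASCII punctuation only in this sketch)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase
    text = text.lower()
    # Replace newlines with spaces and collapse repeated whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```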

As the dataset is heavily imbalanced, the normal texts in the train split were randomly downsampled to 22% of their original volume.
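
A sketch of this downsampling step, assuming the train split is a pandas DataFrame with a `label` column (the names here are hypothetical):

```python
import pandas as pd

def downsample_normal(train_df: pd.DataFrame, frac: float = 0.22, seed: int = 42) -> pd.DataFrame:
    # Keep all category-1 texts; randomly sample 22% of the category-0 (normal) texts
    positives = train_df[train_df["label"] == 1]
    negatives = train_df[train_df["label"] == 0].sample(frac=frac, random_state=seed)
    # Recombine and shuffle
    return pd.concat([positives, negatives]).sample(frac=1.0, random_state=seed)
```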

## Training

The training was done with the `Trainer` class, using the following parameters:
```python
TrainingArguments(evaluation_strategy="epoch",
                  per_device_train_batch_size=16,
                  per_device_eval_batch_size=32,
                  learning_rate=1e-5,
                  num_train_epochs=5,
                  weight_decay=1e-3,
                  load_best_model_at_end=True,
                  save_strategy="epoch")
```
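
For context, a minimal sketch of how these arguments could be wired into `Trainer`; the `model`, `train_dataset`, and `eval_dataset` variables, as well as `output_dir`, are hypothetical and not stated in the card:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",            # assumed; not stated in the card
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=1e-3,
    load_best_model_at_end=True,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,                  # e.g. BertForSequenceClassification
    args=training_args,
    train_dataset=train_dataset,  # hypothetical tokenized train split
    eval_dataset=eval_dataset,    # hypothetical tokenized eval split
)
trainer.train()
```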

# Metrics

| F1-micro | F1-macro | F1-weighted |
|----------|----------|-------------|
| 0.811926 | 0.726722 | 0.831000 |

# Citation

```bibtex
@article{Buyanov2022TheDF,
  title={The dataset for presuicidal signals detection in text and its analysis},
  author={Igor Buyanov and Ilya Sochenkov},
  journal={Computational Linguistics and Intellectual Technologies},
  year={2022},
  month={June},
  number={21},
  pages={81--92},
  url={https://api.semanticscholar.org/CorpusID:253195162},
}
```