---
license: mit
datasets:
- astromis/presuicidal_signals
language:
- ru
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
tags:
- russian
- suicide
---

# Presuicidal RuBERT base

A [ruBert](https://huggingface.co/ai-forever/ruBert-base) model fine-tuned on the [presuicidal signals dataset](https://huggingface.co/datasets/astromis/presuicidal_signals). It aims to help psychologists find texts that carry useful information about a person's suicidal behavior. The model distinguishes two categories:

* category 1 - texts with useful information about a person's suicidal behavior, such as attempts and facts of rape, problems with parents, the fact of being in a psychiatric hospital, facts of self-harm, etc. This category also includes messages that display a subjectively negative attitude towards oneself and others, including a desire to die, a feeling of pressure from the past, self-hatred, aggressiveness, and rage directed at oneself or others.
* category 0 - normal texts that don't contain the above information.

# How to use

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("astromis/presuisidal_rubert")
model = BertForSequenceClassification.from_pretrained("astromis/presuisidal_rubert")
model.eval()

# "I feel so bad, I want to die" / "hung out with friends yesterday, it was really cool"
text = ["мне так плохо я хочу умереть", "вчера была на сходке с друзьями было оч клево"]

tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    prediction = model(**tokenized_text).logits
print(prediction.argmax(dim=1).numpy())
# >>> [1, 0]
```

# Training procedure

## Data preprocessing

Before training, the text was transformed as follows (a code sketch of the cleanup follows below):

* all emojis were removed; in the dataset they are marked as `emoja_name` tokens;
* punctuation was removed;
* text was lowercased;
* all line breaks were replaced with spaces;
* runs of multiple spaces were collapsed into one.

As the dataset is heavily imbalanced, the normal texts in the train split were randomly downsampled to 22% of their source volume.
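For reference, here is a minimal Python sketch of this cleanup. It is not the authors' exact code; in particular, the `emoja_...` marker pattern and the punctuation set are assumptions:

```python
import re
import string

def preprocess(text: str) -> str:
    """Approximate the cleanup steps listed above (a sketch, not the original code)."""
    # Remove emoji placeholders; the dataset marks them as tokens like `emoja_name`
    # (the exact pattern is an assumption here).
    text = re.sub(r"emoja_\w+", " ", text)
    # Strip punctuation (ASCII set; the original set is not specified).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase (works for Cyrillic in Python 3).
    text = text.lower()
    # Replace line breaks with spaces.
    text = text.replace("\n", " ").replace("\r", " ")
    # Collapse runs of spaces into a single one.
    text = re.sub(r" +", " ", text).strip()
    return text

print(preprocess("Мне ТАК плохо...\nemoja_crying я хочу умереть"))
# >>> мне так плохо я хочу умереть
```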
## Training

Training was done with the `Trainer` class from `transformers` using the following arguments:

```python
TrainingArguments(
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=1e-3,
    load_best_model_at_end=True,
    save_strategy="epoch",
)
```
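Below is a minimal sketch of how these arguments might be wired into a `Trainer`, with the reported F1 variants as evaluation metrics. The `output_dir`, the dataset column and split names, and the `compute_metrics` body are assumptions, not the authors' published training code:

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")
model = BertForSequenceClassification.from_pretrained("ai-forever/ruBert-base", num_labels=2)

# Assumed column name "text" and label column "label" in the dataset.
dataset = load_dataset("astromis/presuicidal_signals")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], padding="max_length",
                            truncation=True, max_length=512),
    batched=True,
)

def compute_metrics(eval_pred):
    # Report the same F1 variants as in the Metrics table below.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "f1_micro": f1_score(labels, preds, average="micro"),
        "f1_macro": f1_score(labels, preds, average="macro"),
        "f1_weighted": f1_score(labels, preds, average="weighted"),
    }

training_args = TrainingArguments(
    output_dir="presuicidal_rubert",  # assumed; not stated in the card
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=1e-3,
    load_best_model_at_end=True,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # split name is an assumption
    compute_metrics=compute_metrics,
)
trainer.train()
```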
# Metrics

| F1-micro | F1-macro | F1-weighted |
|----------|----------|-------------|
| 0.811926 | 0.726722 | 0.831000 |

# Citation

```bibtex
@article{Buyanov2022TheDF,
  title={The dataset for presuicidal signals detection in text and its analysis},
  author={Igor Buyanov and Ilya Sochenkov},
  journal={Computational Linguistics and Intellectual Technologies},
  year={2022},
  month={June},
  number={21},
  pages={81--92},
  url={https://api.semanticscholar.org/CorpusID:253195162},
}
```