---
language:
- bg
metrics:
- f1
- accuracy
- precision
- recall
base_model:
- rmihaylov/bert-base-bg
pipeline_tag: text-classification
license: apache-2.0
datasets:
- sofia-uni/toxic-data-bg
- wikimedia/wikipedia
- oscar-corpus/oscar
- petkopetkov/chitanka
tags:
- bert
- not-for-all-audiences
- medical
---

A toxic-language classification model for Bulgarian, based on the [bert-base-bg](https://huggingface.co/rmihaylov/bert-base-bg) model. The model distinguishes four classes: Toxic, MedicalTerminology, NonToxic, and MinorityGroup.

Classification report:

| Accuracy | Precision | Recall | F1 Score | Loss |
|----------|-----------|--------|----------|------|
| 0.85     | 0.86      | 0.85   | 0.85     | 0.43 |

More information is available [in the paper](https://www.researchgate.net/publication/388842558_Detecting_Toxic_Language_Ontology_and_BERT-based_Approaches_for_Bulgarian_Text).

# Code and usage

For training files and information on how to use the model, refer to the [GitHub repository of the project](https://github.com/TsvetoslavVasev/toxic-language-classification). A minimal loading sketch is given at the end of this card.

# Reference

If you use this model in your academic project, please cite as:

```bibtex
@article{berbatova2025detecting,
  doi={10.13140/RG.2.2.34963.18723},
  title={Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text},
  author={Berbatova, Melania and Vasev, Tsvetoslav},
  year={2025}
}
```
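
# Example usage

For quick experimentation, the model can be loaded with the Hugging Face `transformers` text-classification pipeline. The sketch below is illustrative only: `MODEL_ID` is a placeholder for this model's Hub repository id, and the printed output is an assumed example; see the GitHub repository above for the authoritative usage.

```python
# Minimal sketch: classify a Bulgarian sentence with this model.
from transformers import pipeline

# Placeholder, not a real repository id; substitute this model's Hub id.
MODEL_ID = "<this-model-hub-id>"

classifier = pipeline("text-classification", model=MODEL_ID)

# The card lists four classes (Toxic, MedicalTerminology, NonToxic,
# MinorityGroup); the exact label strings returned depend on the
# model's config.
text = "Примерен текст на български."  # "Example text in Bulgarian."
print(classifier(text))
# e.g. [{'label': 'NonToxic', 'score': 0.97}]  (illustrative output)
```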