---
library_name: peft
base_model: meta-llama/Llama-2-7b-hf
language:
- en
pipeline_tag: text-generation
tags:
- hate-speech
- explanation-generation
---

# Model Card for gllama-alarm-implicit-hate

**GLlama Alarm** is a suite of knowledge-Guided versions of Llama 2 instruction fine-tuned for non-binary abusive language detection and explanation generation tasks.

## Model Details

This version has been instruction fine-tuned on the Implicit Hate Corpus for multi-class expressiveness detection and explanation generation (i.e., implicit hate speech, explicit hate speech, not hate), as well as on encyclopedic, commonsense and temporal linguistic knowledge.

### Model Description

- **Developed by:** Chiara Di Bonaventura, Lucia Siciliani, Pierpaolo Basile
- **Funded by:** The Alan Turing Institute, Fondazione FAIR
- **Language:** English
- **Finetuned from model:** meta-llama/Llama-2-7b-hf

### Model Sources

- **Paper:** https://kclpure.kcl.ac.uk/ws/portalfiles/portal/316198577/2025_COLING_from_detection_to_explanation.pdf

## Uses

**GLlama Alarm** is intended for research use in English, especially for NLP tasks in the domain of social media, which may contain offensive content. Our suite can be used to **detect different levels of offensiveness and expressiveness of abusive language** (e.g., offensive comments and implicit hate speech, which has proven hard for many LLMs) and to **generate structured textual explanations** of why a text contains abusive language.

In any case, language models, including ours, can potentially be used for language generation in a harmful way. GLlama Alarm should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.

## Training Details

**GLlama Alarm** builds on top of the foundation model Llama 2 (7B), an auto-regressive language model that uses an optimized transformer architecture.
Llama 2 was trained on a mix of publicly available online data between January 2023 and July 2023. We selected the base version of Llama 2, which has 7B parameters.

We instruction fine-tuned Llama 2 on two datasets separately: HateXplain and the Implicit Hate Corpus. This version is the one instruction fine-tuned on the Implicit Hate Corpus. Both datasets contain publicly available data designed for hate speech detection, thus ensuring data privacy and protection.

To instruction fine-tune Llama 2, we created knowledge-guided prompts following our paradigm; the template is shown in Table 9 of the paper. We instruction fine-tuned Llama 2 with 17k knowledge-guided prompts for HateXplain and Implicit Hate for 5 epochs, while setting the other hyperparameters as suggested by [Taori et al., 2023](https://github.com/tatsu-lab/stanford_alpaca).

## Citation

**BibTeX:**

    @inproceedings{dibonaventura2025gllama_alarm,
      title={From Detection to Explanation: Effective Learning Strategies for LLMs in Online Abusive Language Research},
      author={Di Bonaventura, Chiara and Siciliani, Lucia and Basile, Pierpaolo and Merono-Penuela, Albert and McGillivray, Barbara},
      booktitle={Proceedings of the 2025 International Conference on Computational Linguistics (COLING 2025)},
      year={2025}
    }

**APA:**

Di Bonaventura, C., Siciliani, L., Basile, P., Merono-Penuela, A., & McGillivray, B. (2025). From Detection to Explanation: Effective Learning Strategies for LLMs in Online Abusive Language Research. In Proceedings of the 2025 International Conference on Computational Linguistics (COLING 2025).

## Model Card Contact

chiara.di_bonaventura@kcl.ac.uk

### Framework versions

- PEFT 0.10.0
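As a usage sketch, the PEFT adapter can be loaded on top of the base Llama 2 model with the `peft` and `transformers` libraries. Everything below is illustrative: the adapter repository id is a placeholder, and the prompt wording and the `parse_label` heuristic are assumptions, not the knowledge-guided template from Table 9 of the paper.

```python
# Hedged sketch: querying this adapter for the three-way expressiveness labels.
# The prompt and parse_label heuristic are illustrative assumptions.

LABELS = ("implicit hate speech", "explicit hate speech", "not hate")

def parse_label(generation: str) -> str:
    """Map a free-text generation onto one of the three classes (assumed heuristic)."""
    text = generation.lower()
    if "not hate" in text:
        return "not hate"
    if "implicit" in text:
        return "implicit hate speech"
    if "explicit" in text:
        return "explicit hate speech"
    return "unknown"

if __name__ == "__main__":
    # Heavy imports kept here so the helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id)
    # Load the LoRA adapter; replace the placeholder with this model card's repository id.
    model = PeftModel.from_pretrained(base, "<adapter-repo-id>")

    prompt = (
        "Classify the following post as implicit hate speech, "
        "explicit hate speech, or not hate.\nPost: ...\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the prompt.
    answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(parse_label(answer))
```

The classification and explanation behaviour depends on the prompt format used at fine-tuning time, so in practice the prompt should follow the knowledge-guided template from the paper rather than the generic wording shown here.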