---
license: cc-by-nc-sa-4.0
datasets:
- QCRI/LlamaLens-English
- QCRI/LlamaLens-Arabic
- QCRI/LlamaLens-Hindi
language:
- ar
- en
- hi
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- Social-Media
- Hate-Speech
- Summarization
- offensive-language
- News-Genre
---
# LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content
## Overview
LlamaLens is a specialized multilingual LLM designed for analyzing news and social media content. It focuses on 19 NLP tasks, leveraging 52 datasets across Arabic, English, and Hindi.
<p align="center">
<picture>
<img width="352" alt="capablities_tasks_datasets" src="./llamalens-avatar.png">
</picture>
</p>
## Dataset
The model was trained on the [LlamaLens dataset](https://huggingface.co/collections/QCRI/llamalens-672f7e0604a0498c6a2f0fe9).
## To Replicate the Experiments
The code to replicate the experiments is available on [GitHub](https://github.com/firojalam/LlamaLens).
## Model Inference
To utilize the LlamaLens model for inference, follow these steps:
1. **Install the Required Libraries**:
Install `transformers` and `torch` with pip:
```bash
pip install transformers torch
```
2. **Load the Model and Tokenizer**:
Use the `pipeline` API from the transformers library, which loads the LlamaLens model together with its tokenizer:
```python
from transformers import pipeline
model_name = "QCRI/LlamaLens"
pipe = pipeline("text-generation", model=model_name)
```
3. **Prepare the Input**:
Build a chat-style message list containing a system message and your input text:
```python
input_text = "Your input text here"
system_message = "Your system message text here"
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": input_text},
]
```
4. **Generate the Output**:
Generate a response using the model:
```python
generated_text = pipe(messages, num_return_sequences=1)
print(generated_text)
```
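The steps above leave the reply nested inside the pipeline result. The sketch below shows one way to build the messages and pull out only the assistant's text. The helper names `build_messages` and `extract_reply` are hypothetical, and the output shape (a list of dicts whose `"generated_text"` holds the conversation with the new assistant turn last) is an assumption about recent `transformers` versions when the pipeline is called with chat messages. A mock result stands in for the real `pipe(...)` call so the snippet runs without downloading the model.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(system_message: str, input_text: str) -> List[Message]:
    """Build the chat-style input from step 3 (hypothetical helper)."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": input_text},
    ]


def extract_reply(outputs: list) -> str:
    """Return the model's text from a text-generation pipeline result.

    Assumption: for chat input, outputs[0]["generated_text"] is the list of
    messages with the new assistant turn appended last. If it is a plain
    string instead, that string is returned unchanged.
    """
    generated = outputs[0]["generated_text"]
    if isinstance(generated, str):
        return generated
    return generated[-1]["content"]


messages = build_messages("Your system message text here", "Your input text here")

# Stand-in for `pipe(messages, num_return_sequences=1)`; the reply text is made up.
mock_outputs = [
    {"generated_text": messages + [{"role": "assistant", "content": "example reply"}]}
]
print(extract_reply(mock_outputs))
```

With the real pipeline, `extract_reply(pipe(messages, num_return_sequences=1))` would replace the mock call, provided the output has the assumed shape.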
## Results
Below, we present the performance of **LlamaLens** compared to the existing SOTA (where available) and the Llama-Instruct baseline. The "Δ" (Delta) column is calculated as **(LlamaLens - SOTA)**, so a positive value means LlamaLens outperforms the SOTA.
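As a quick check of how the Delta column is derived, the sketch below recomputes it for three rows copied from the Arabic table. The function name `delta` and the use of `None` for a missing SOTA are illustrative choices, not part of the LlamaLens code.

```python
from typing import Optional


def delta(llamalens: float, sota: Optional[float]) -> Optional[float]:
    """Return LlamaLens - SOTA rounded to 3 decimals, or None if no SOTA is reported."""
    if sota is None:
        return None
    return round(llamalens - sota, 3)


# (dataset, SOTA, LlamaLens) values copied from the Arabic results table
rows = [
    ("ASND", 0.770, 0.938),
    ("xlsum", 0.137, 0.075),
    ("ar_reviews_100k", None, 0.665),  # no SOTA reported for this dataset
]

for dataset, sota, score in rows:
    print(dataset, delta(score, sota))
```

Rows without a reported SOTA have no Delta, which is why those cells are left blank in the tables.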
---
## Arabic
| **Task** | **Dataset** | **Metric** | **SOTA** | **Llama-Instruct** | **LlamaLens** | **Δ** (LlamaLens - SOTA) |
|------------------------|---------------------------|-----------:|--------:|--------------------:|--------------:|------------------------------:|
| News Summarization | xlsum | R-2 | 0.137 | 0.034 | 0.075 | -0.062 |
| News Genre | ASND | Ma-F1 | 0.770 | 0.587 | 0.938 | 0.168 |
| News Genre | SANADAkhbarona | Acc | 0.940 | 0.784 | 0.922 | -0.018 |
| News Genre | SANADAlArabiya | Acc | 0.974 | 0.893 | 0.986 | 0.012 |
| News Genre | SANADAlkhaleej | Acc | 0.986 | 0.865 | 0.967 | -0.019 |
| News Genre | UltimateDataset | Ma-F1 | 0.970 | 0.376 | 0.883 | -0.087 |
| News Credibility | NewsCredibility | Acc | 0.899 | 0.455 | 0.494 | -0.405 |
| Emotion | Emotional-Tone | W-F1 | 0.658 | 0.358 | 0.748 | 0.090 |
| Emotion | NewsHeadline | Acc | 1.000 | 0.406 | 0.551 | -0.449 |
| Sarcasm | ArSarcasm-v2 | F1_Pos | 0.584 | 0.477 | 0.307 | -0.277 |
| Sentiment | ar_reviews_100k | F1_Pos | – | 0.343 | 0.665 | – |
| Sentiment | ArSAS | Acc | 0.920 | 0.603 | 0.795 | -0.125 |
| Stance | stance | Ma-F1 | 0.767 | 0.608 | 0.936 | 0.169 |
| Stance | Mawqif-Arabic-Stance | Ma-F1 | 0.789 | 0.764 | 0.867 | 0.078 |
| Att.worthiness | CT22Attentionworthy | W-F1 | 0.412 | 0.158 | 0.544 | 0.132 |
| Checkworthiness | CT24_T1 | F1_Pos | 0.569 | 0.404 | 0.877 | 0.308 |
| Claim | CT22Claim | Acc | 0.703 | 0.581 | 0.778 | 0.075 |
| Factuality | Arafacts | Mi-F1 | 0.850 | 0.210 | 0.534 | -0.316 |
| Factuality | COVID19Factuality | W-F1 | 0.831 | 0.492 | 0.781 | -0.050 |
| Propaganda | ArPro | Mi-F1 | 0.767 | 0.597 | 0.762 | -0.005 |
| Cyberbullying | ArCyc_CB | Acc | 0.863 | 0.766 | 0.753 | -0.110 |
| Harmfulness | CT22Harmful | F1_Pos | 0.557 | 0.507 | 0.508 | -0.049 |
| Hate Speech | annotated-hatetweets-4 | W-F1 | 0.630 | 0.257 | 0.549 | -0.081 |
| Hate Speech | OSACT4SubtaskB | Mi-F1 | 0.950 | 0.819 | 0.802 | -0.148 |
| Offensive | ArCyc_OFF | Ma-F1 | 0.878 | 0.489 | 0.652 | -0.226 |
| Offensive | OSACT4SubtaskA | Ma-F1 | 0.905 | 0.782 | 0.899 | -0.006 |
---
## English
| **Task** | **Dataset** | **Metric** | **SOTA** | **Llama-Instruct** | **LlamaLens** | **Δ** (LlamaLens - SOTA) |
|----------------------|---------------------------|-----------:|--------:|--------------------:|--------------:|------------------------------:|
| News Summarization | xlsum | R-2 | 0.152 | 0.074 | 0.141 | -0.011 |
| News Genre | CNN_News_Articles | Acc | 0.940 | 0.644 | 0.915 | -0.025 |
| News Genre | News_Category | Ma-F1 | 0.769 | 0.970 | 0.505 | -0.264 |
| News Genre | SemEval23T3-ST1 | Mi-F1 | 0.815 | 0.687 | 0.241 | -0.574 |
| Subjectivity | CT24_T2 | Ma-F1 | 0.744 | 0.535 | 0.508 | -0.236 |
| Emotion | emotion | Ma-F1 | 0.790 | 0.353 | 0.878 | 0.088 |
| Sarcasm | News-Headlines | Acc | 0.897 | 0.668 | 0.956 | 0.059 |
| Sentiment | NewsMTSC | Ma-F1 | 0.817 | 0.628 | 0.627 | -0.190 |
| Checkworthiness | CT24_T1 | F1_Pos | 0.753 | 0.404 | 0.877 | 0.124 |
| Claim | claim-detection | Mi-F1 | – | 0.545 | 0.915 | – |
| Factuality | News_dataset | Acc | 0.920 | 0.654 | 0.946 | 0.026 |
| Factuality | Politifact | W-F1 | 0.490 | 0.121 | 0.290 | -0.200 |
| Propaganda | QProp | Ma-F1 | 0.667 | 0.759 | 0.851 | 0.184 |
| Cyberbullying | Cyberbullying | Acc | 0.907 | 0.175 | 0.847 | -0.060 |
| Offensive | Offensive_Hateful | Mi-F1 | – | 0.692 | 0.805 | – |
| Offensive | offensive_language | Mi-F1 | 0.994 | 0.646 | 0.884 | -0.110 |
| Offensive & Hate | hate-offensive-speech | Acc | 0.945 | 0.602 | 0.924 | -0.021 |
---
## Hindi
| **Task** | **Dataset** | **Metric** | **SOTA** | **Llama-Instruct** | **LlamaLens** | **Δ** (LlamaLens - SOTA) |
|------------------------|------------------------|-----------:|--------:|--------------------:|--------------:|------------------------------:|
| NLI | NLI_dataset | W-F1 | 0.646 | 0.633 | 0.655 | 0.009 |
| News Summarization | xlsum | R-2 | 0.136 | 0.078 | 0.117 | -0.019 |
| Sentiment | Sentiment Analysis | Acc | 0.697 | 0.552 | 0.669 | -0.028 |
| Factuality | fake-news | Mi-F1 | – | 0.759 | 0.713 | – |
| Hate Speech | hate-speech-detection | Mi-F1 | 0.639 | 0.750 | 0.994 | 0.355 |
| Hate Speech | Hindi-Hostility | W-F1 | 0.841 | 0.469 | 0.720 | -0.121 |
| Offensive | Offensive Speech | Mi-F1 | 0.723 | 0.621 | 0.847 | 0.124 |
| Cyberbullying | MC_Hinglish1 | Acc | 0.609 | 0.233 | 0.587 | -0.022 |
## Paper
For an in-depth understanding, refer to our paper: [**LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content**](https://arxiv.org/pdf/2410.15308).
## License
This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
## Citation
Please cite [our paper](https://arxiv.org/pdf/2410.15308) when using this model:
```
@article{kmainasi2024llamalensspecializedmultilingualllm,
title={LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Maram Hasanain and Sahinur Rahman Laskar and Naeemul Hassan and Firoj Alam},
year={2024},
journal={arXiv preprint arXiv:2410.15308},
volume={},
number={},
pages={},
url={https://arxiv.org/abs/2410.15308},
eprint={2410.15308},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```