Add README.md
README.md
CHANGED
@@ -1,3 +1,94 @@
---
license: mit
datasets:
- Superar/Puntuguese
language:
- pt
base_model:
- neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
tags:
- humor
- pun
- pun-recognition
---

# Pun Recognition in Portuguese

This is a Pun Recognition model for texts in Portuguese, as reported in two of our publications:

- **Exploring Multimodal Models for Humor Recognition in Portuguese** ([PROPOR 2024 Paper](https://aclanthology.org/2024.propor-1.62/))
- **Puntuguese: A Corpus of Puns in Portuguese with Micro-Edits** ([LREC-COLING 2024 Paper](https://aclanthology.org/2024.lrec-main.1167/))

The model has been fine-tuned on the [Puntuguese](https://huggingface.co/datasets/Superar/Puntuguese) dataset, a collection of puns and corresponding non-pun texts in Portuguese.

With this model, we achieved a maximum of **69% F1-Score** in the task of Pun Recognition with Puntuguese.
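
For readers less familiar with the metric, the sketch below shows how an F1-score can be computed from binary predictions. The labels are toy values chosen for illustration, not results from the papers, and `f1_score_binary` is a hypothetical helper. It assumes that label `1` marks the pun class and that the score is computed for that class only; the source does not state which averaging was used.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, for illustration only (assumed: 1 = pun, 0 = non-pun).
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_f1 = f1_score_binary(toy_true, toy_pred)
print(f"{toy_f1:.2f}")
```

In practice, `sklearn.metrics.f1_score` or the `classification_report` call used later in this README would normally replace a hand-written helper like this one.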

## Installation and Setup

To use this model, ensure you have the following dependencies installed:
```bash
pip install accelerate datasets scikit-learn torch transformers
```
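
As an optional sanity check after installation, a short script along the following lines can confirm that the packages are importable. This is a minimal sketch: `missing_modules` is a hypothetical helper, and the list uses import names, which differ from the pip names in one case (`scikit-learn` is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages in the pip command above
# (scikit-learn is imported as `sklearn`).
REQUIRED_MODULES = ["accelerate", "datasets", "sklearn", "torch", "transformers"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The check only looks up each module without importing it, so it runs quickly even when `torch` is installed.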

## How to Use

To load the Puntuguese corpus and use the model for pun classification, run the following script:

```python
import pandas as pd
import torch
from datasets import load_dataset
from sklearn.metrics import classification_report
from transformers import pipeline

# Load the corpus and the fine-tuned classifier (GPU 0 if available, else CPU).
dataset = load_dataset('Superar/Puntuguese')
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline('text-classification', model='Superar/pun-recognition-pt', device=device)

# Classify every text in the test split.
prediction = classifier(list(dataset['test']['text']))
pred_df = pd.DataFrame(prediction)
# Keep only the trailing class digit of each predicted label string.
pred_df['label'] = pred_df['label'].str[-1].astype(int)

# Compare the predictions against the gold labels.
y_true = dataset['test']['label']
y_pred = pred_df['label']
print(classification_report(y_true, y_pred))
```
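
The `.str[-1].astype(int)` step in the script turns the pipeline's string labels into integer class ids. A plain-Python sketch of the same conversion follows. It assumes the pipeline returns dictionaries whose `label` values look like `LABEL_0` and `LABEL_1` (the default naming in `transformers` when a model defines no custom label names); the source does not confirm the exact strings. `labels_to_ints` and the sample output are hypothetical.

```python
def labels_to_ints(predictions):
    """Map pipeline-style outputs such as {'label': 'LABEL_1', ...} to integer ids."""
    # The class id is taken from the last character of the label string.
    return [int(pred["label"][-1]) for pred in predictions]


# Hypothetical pipeline output, for illustration only.
sample_output = [
    {"label": "LABEL_1", "score": 0.91},
    {"label": "LABEL_0", "score": 0.78},
]
sample_ids = labels_to_ints(sample_output)
print(sample_ids)
```

Reading only the last character is enough for a binary task; a model with ten or more classes would need the full numeric suffix parsed instead.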

## Hyperparameters

We ran a random search with [Weights and Biases](https://wandb.ai/) to minimize the evaluation loss, using the following sweep configuration:

```python
{
    'method': 'random',
    'metric': {'name': 'loss', 'goal': 'minimize'},
    'parameters': {
        'optim': {'values': ['adamw_torch', 'sgd']},
        'learning_rate': {'distribution': 'uniform', 'min': 1e-6, 'max': 1e-4},
        'per_device_train_batch_size': {'values': [16, 32, 64, 128]},
        'num_train_epochs': {'distribution': 'uniform', 'min': 1, 'max': 5}
    }
}
```
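
To make the search space concrete, the sketch below draws one random configuration from it using only the standard library. It is not the Weights and Biases implementation, only an illustration of what a single random-search trial samples; `sample_config` is a hypothetical helper, and treating every `uniform` entry as a continuous draw is an assumption.

```python
import random

# Same search space as the 'parameters' entry of the sweep configuration above.
search_space = {
    "optim": {"values": ["adamw_torch", "sgd"]},
    "learning_rate": {"distribution": "uniform", "min": 1e-6, "max": 1e-4},
    "per_device_train_batch_size": {"values": [16, 32, 64, 128]},
    "num_train_epochs": {"distribution": "uniform", "min": 1, "max": 5},
}


def sample_config(space, rng):
    """Draw one configuration: choose from `values`, or sample uniformly in [min, max]."""
    config = {}
    for name, spec in space.items():
        if "values" in spec:
            config[name] = rng.choice(spec["values"])
        else:
            config[name] = rng.uniform(spec["min"], spec["max"])
    return config


rng = random.Random(0)  # fixed seed so the sketch is reproducible
trial = sample_config(search_space, rng)
print(trial)
```

Each trial in a random search is an independent draw like this one; the sweep then keeps the configuration with the lowest value of the chosen metric.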

The best hyperparameters found were:

- **Learning Rate**: 8.47e-5
- **Optimizer**: AdamW
- **Training Batch Size**: 128
- **Epochs**: 2
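
For anyone reproducing the fine-tuning, the reported values can be collected under the same parameter names used in the sweep. This is a sketch under one assumption: that the reported "AdamW" corresponds to the sweep's `adamw_torch` option, since `sgd` was the only alternative.

```python
# Best values reported above, keyed by the parameter names used in the sweep.
best_hyperparameters = {
    "learning_rate": 8.47e-5,
    "optim": "adamw_torch",  # assumed mapping of the reported "AdamW"
    "per_device_train_batch_size": 128,
    "num_train_epochs": 2,
}

for name, value in best_hyperparameters.items():
    print(f"{name} = {value}")
```

These keys match keyword arguments of `transformers.TrainingArguments`, so the dictionary could be unpacked into it together with an output directory of your choosing. The rest of the training setup is not described in this README.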

## Citation

```bibtex
@inproceedings{InacioEtAl2024,
  title = {Puntuguese: A Corpus of Puns in {{Portuguese}} with Micro-Edits},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation ({{LREC-COLING}} 2024)},
  author = {In{\'a}cio, Marcio Lima and {Wick-Pedro}, Gabriela and Ramisch, Renata and Esp{\'{\i}}rito Santo, Lu{\'{\i}}s and Chacon, Xiomara S. Q. and Santos, Roney and Sousa, Rog{\'e}rio and Anchi{\^e}ta, Rafael and Gon{\c{c}}alo Oliveira, Hugo},
  editor = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen},
  year = {2024},
  month = may,
  pages = {13332--13343},
  publisher = {{ELRA and ICCL}},
  address = {Torino, Italia},
  url = {https://aclanthology.org/2024.lrec-main.1167}
}
```