---
license: mit
datasets:
- Superar/Puntuguese
language:
- pt
base_model:
- neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
tags:
- humor
- pun
- pun-recognition
---

# Pun Recognition in Portuguese

This model recognizes puns in Portuguese text. It is described in two of our publications:

- **Exploring Multimodal Models for Humor Recognition in Portuguese** ([PROPOR 2024 Paper](https://aclanthology.org/2024.propor-1.62/))
- **Puntuguese: A Corpus of Puns in Portuguese with Micro-Edits** ([LREC-COLING 2024 Paper](https://aclanthology.org/2024.lrec-main.1167/))

The model has been fine-tuned on the [Puntuguese](https://huggingface.co/datasets/Superar/Puntuguese) dataset, a collection of puns and corresponding non-pun texts in Portuguese.

With this model, we reached a maximum **F1-score of 69%** on the Pun Recognition task with Puntuguese.

## Installation and Setup

To use this model, install the following dependencies (`pandas` is needed by the example script below):
```bash
pip install accelerate datasets pandas scikit-learn torch transformers
```

## How to Use
To load the Puntuguese corpus and use the model for pun classification, run the following script:

```python
from datasets import load_dataset
from transformers import pipeline
import pandas as pd
from sklearn.metrics import classification_report

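# Load the Puntuguese corpus and the fine-tuned classifier.
dataset = load_dataset('Superar/Puntuguese')
# device=0 selects the first GPU; use device=-1 to run on CPU.
classifier = pipeline('text-classification', model='Superar/pun-recognition-pt', device=0)

# Classify every text in the test split.
prediction = classifier(list(dataset['test']['text']))
pred_df = pd.DataFrame(prediction)
# Labels are expected to end in the class digit (e.g. 'LABEL_1'); keep it as an integer.
pred_df['label'] = pred_df['label'].str[-1].astype(int)

# Compare the predictions against the gold labels.
y_true = dataset['test']['label']
y_pred = pred_df['label']
print(classification_report(y_true, y_pred))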
```
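
The script above turns each predicted label string into an integer by taking its last character. The following is a minimal sketch of that conversion on hand-written sample output, assuming the labels follow a `LABEL_<n>` pattern. The sample predictions and the `label_to_int` helper are hypothetical and only illustrate the step; they are not real model outputs.

```python
# Hypothetical pipeline output; the scores are made up for illustration.
sample_predictions = [
    {'label': 'LABEL_1', 'score': 0.91},
    {'label': 'LABEL_0', 'score': 0.78},
]


def label_to_int(label: str) -> int:
    """Convert a 'LABEL_<n>' string to the integer <n>."""
    # Splitting on the last underscore also handles multi-digit class ids.
    return int(label.rsplit('_', 1)[-1])


y_pred = [label_to_int(p['label']) for p in sample_predictions]
print(y_pred)  # [1, 0]
```

Splitting on the underscore rather than taking the last character would keep working if a model had ten or more classes, which is not the case for this binary task.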

## Hyperparameters

We used [Weights and Biases](https://wandb.ai/) to run a random search that minimizes the evaluation loss, with the following configuration:

```python
{
  'method': 'random',
  'metric': {'name': 'loss', 'goal': 'minimize'},
  'parameters': {
    'optim': {'values': ['adamw_torch', 'sgd']},
    'learning_rate': {'distribution': 'uniform', 'min': 1e-6, 'max': 1e-4},
    'per_device_train_batch_size': {'values': [16, 32, 64, 128]},
    'num_train_epochs': {'distribution': 'uniform', 'min': 1, 'max': 5}
  }
}
```
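
To make the search space in the configuration above concrete, the sketch below draws one trial configuration from it using only the Python standard library. It illustrates what a single random-search trial could look like; `sample_config` is a hypothetical helper, not part of the Weights and Biases API, and it does not reproduce how the sweep itself samples values.

```python
import random

# Search space copied from the sweep configuration above.
search_space = {
    'optim': {'values': ['adamw_torch', 'sgd']},
    'learning_rate': {'distribution': 'uniform', 'min': 1e-6, 'max': 1e-4},
    'per_device_train_batch_size': {'values': [16, 32, 64, 128]},
    'num_train_epochs': {'distribution': 'uniform', 'min': 1, 'max': 5},
}


def sample_config(space, rng):
    """Draw one trial configuration (hypothetical helper, not a W&B API)."""
    config = {}
    for name, spec in space.items():
        if 'values' in spec:
            # Categorical parameter: pick one of the listed values.
            config[name] = rng.choice(spec['values'])
        else:
            # Continuous parameter: sample uniformly between min and max.
            config[name] = rng.uniform(spec['min'], spec['max'])
    return config


rng = random.Random(0)  # fixed seed so the draw is repeatable
trial = sample_config(search_space, rng)
print(trial)
```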

The best hyperparameters found were:

- **Learning Rate**: 8.47e-5
- **Optimizer**: AdamW
- **Training Batch Size**: 128
- **Epochs**: 2
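
For anyone reproducing the fine-tuning, the values above can be collected under the same parameter names used in the sweep configuration. This is a sketch under assumptions: `adamw_torch` is taken to be the AdamW variant meant here because it is the only one in the search space, and the commented-out `output_dir` is a hypothetical path.

```python
# Best values reported above, keyed by the sweep's parameter names.
best_hyperparameters = {
    'optim': 'adamw_torch',  # assumed to be the reported "AdamW"
    'learning_rate': 8.47e-5,
    'per_device_train_batch_size': 128,
    'num_train_epochs': 2,
}

# Possible usage with Hugging Face Transformers (requires `transformers`;
# the output directory name is a hypothetical placeholder):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir='pun-recognition-pt', **best_hyperparameters)
print(best_hyperparameters)
```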

## Citation

```bibtex
@inproceedings{InacioEtAl2024,
  title = {Puntuguese: A Corpus of Puns in {{Portuguese}} with Micro-Edits},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation ({{LREC-COLING}} 2024)},
  author = {In{\'a}cio, Marcio Lima and {Wick-Pedro}, Gabriela and Ramisch, Renata and Esp{\'{\i}}rito Santo, Lu{\'{\i}}s and Chacon, Xiomara S. Q. and Santos, Roney and Sousa, Rog{\'e}rio and Anchi{\^e}ta, Rafael and Gon{\c{c}}alo Oliveira, Hugo},
  editor = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen},
  year = {2024},
  month = may,
  pages = {13332--13343},
  publisher = {{ELRA and ICCL}},
  address = {Torino, Italia},
  url = {https://aclanthology.org/2024.lrec-main.1167}
}
```