Superar committed on
Commit 7075284 · verified · 1 Parent(s): b8ae726

Add README.md

Files changed (1)
  1. README.md +94 -3
README.md CHANGED
@@ -1,3 +1,94 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - Superar/Puntuguese
+ language:
+ - pt
+ base_model:
+ - neuralmind/bert-base-portuguese-cased
+ pipeline_tag: text-classification
+ tags:
+ - humor
+ - pun
+ - pun-recognition
+ ---
+
+ # Pun Recognition in Portuguese
+
+ This is a pun recognition model for Portuguese texts, as reported in two of our publications:
+
+ - **Exploring Multimodal Models for Humor Recognition in Portuguese** ([PROPOR 2024 Paper](https://aclanthology.org/2024.propor-1.62/))
+ - **Puntuguese: A Corpus of Puns in Portuguese with Micro-Edits** ([LREC-COLING 2024 Paper](https://aclanthology.org/2024.lrec-main.1167/))
+
+ The model was fine-tuned on the [Puntuguese](https://huggingface.co/datasets/Superar/Puntuguese) dataset, a collection of puns and corresponding non-pun texts in Portuguese.
+
+ With this model, we achieved a maximum **F1-score of 69%** on the pun recognition task with Puntuguese.
+
+ ## Installation and Setup
+
+ To use this model, install the following dependencies:
+ ```bash
+ pip install accelerate datasets pandas scikit-learn torch transformers
+ ```
+
+ ## How to Use
+ The following script loads the Puntuguese corpus, classifies its test split with the model, and prints a classification report:
+
+ ```python
+ from datasets import load_dataset
+ from transformers import pipeline
+ import pandas as pd
+ from sklearn.metrics import classification_report
+
+ dataset = load_dataset('Superar/Puntuguese')
+ classifier = pipeline('text-classification', model='Superar/pun-recognition-pt', device=0)  # device=0 is the first GPU; use device=-1 for CPU
+
+ prediction = classifier(dataset['test']['text'])  # list of {'label': ..., 'score': ...} dicts
+ pred_df = pd.DataFrame(prediction)
+ pred_df['label'] = pred_df['label'].str[-1].astype(int)  # last character of the label string is the class id
+
+ y_true = dataset['test']['label']
+ y_pred = pred_df['label']
+ print(classification_report(y_true, y_pred))
+ ```
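
The script above converts each predicted label to an integer by reading its final character. A minimal sketch of that step in plain Python follows. It assumes the labels end in a single digit (for example `LABEL_0` and `LABEL_1`, which is not confirmed by the README); the helper name `label_to_class_id` and the sample predictions are hypothetical.

```python
def label_to_class_id(label: str) -> int:
    """Return the integer class id encoded as the final character of a label."""
    if not label or not label[-1].isdigit():
        raise ValueError(f"label {label!r} does not end in a digit")
    return int(label[-1])


# Hypothetical predictions, shaped like the output of a text-classification
# pipeline (one dict per input text). The label names are an assumption.
sample_predictions = [
    {"label": "LABEL_1", "score": 0.91},
    {"label": "LABEL_0", "score": 0.78},
]

class_ids = [label_to_class_id(p["label"]) for p in sample_predictions]
print(class_ids)
```

Raising an error on labels without a trailing digit makes a mismatch with the model's actual label names visible early, rather than surfacing as a cast failure later.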
+
+ ## Hyperparameters
+
+ We ran a random search with [Weights and Biases](https://wandb.ai/), minimizing the evaluation loss, using the following configuration:
+
+ ```python
+ {
+     'method': 'random',
+     'metric': {'name': 'loss', 'goal': 'minimize'},
+     'parameters': {
+         'optim': {'values': ['adamw_torch', 'sgd']},
+         'learning_rate': {'distribution': 'uniform', 'min': 1e-6, 'max': 1e-4},
+         'per_device_train_batch_size': {'values': [16, 32, 64, 128]},
+         'num_train_epochs': {'distribution': 'uniform', 'min': 1, 'max': 5}
+     }
+ }
+ ```
+
+ The best hyperparameters found were:
+
+ - **Learning Rate**: 8.47e-5
+ - **Optimizer**: AdamW
+ - **Training Batch Size**: 128
+ - **Epochs**: 2
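
As a rough sanity check, the reported best values can be written with the same parameter names as the sweep configuration and compared against the search space. This is only a sketch: mapping "AdamW" to `'adamw_torch'` is an assumption (it is the only AdamW option in the sweep), and the helper `in_search_space` is hypothetical.

```python
# Search space copied from the sweep configuration in the README.
search_space = {
    'optim': {'values': ['adamw_torch', 'sgd']},
    'learning_rate': {'distribution': 'uniform', 'min': 1e-6, 'max': 1e-4},
    'per_device_train_batch_size': {'values': [16, 32, 64, 128]},
    'num_train_epochs': {'distribution': 'uniform', 'min': 1, 'max': 5},
}

# Reported best values. 'adamw_torch' for "AdamW" is an assumption.
best = {
    'optim': 'adamw_torch',
    'learning_rate': 8.47e-5,
    'per_device_train_batch_size': 128,
    'num_train_epochs': 2,
}


def in_search_space(name, value, space):
    """Check a value against a categorical ('values') or range ('min'/'max') spec."""
    spec = space[name]
    if 'values' in spec:
        return value in spec['values']
    return spec['min'] <= value <= spec['max']


checks = {name: in_search_space(name, value, search_space)
          for name, value in best.items()}
print(checks)
```

The parameter names match fields of `transformers.TrainingArguments`, so a dictionary like `best` could presumably be unpacked into it together with an output directory of your choice.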
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{InacioEtAl2024,
+   title = {Puntuguese: A Corpus of Puns in {{Portuguese}} with Micro-Edits},
+   booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation ({{LREC-COLING}} 2024)},
+   author = {In{\'a}cio, Marcio Lima and {Wick-Pedro}, Gabriela and Ramisch, Renata and Esp{\'{\i}}rito Santo, Lu{\'{\i}}s and Chacon, Xiomara S. Q. and Santos, Roney and Sousa, Rog{\'e}rio and Anchi{\^e}ta, Rafael and Gon{\c c}alo Oliveira, Hugo},
+   editor = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen},
+   year = {2024},
+   month = may,
+   pages = {13332--13343},
+   publisher = {{ELRA and ICCL}},
+   address = {Torino, Italia},
+   url = {https://aclanthology.org/2024.lrec-main.1167}
+ }
+ ```