---
license: mit
datasets:
- Superar/Puntuguese
language:
- pt
base_model:
- neuralmind/bert-base-portuguese-cased
pipeline_tag: text-classification
tags:
- humor
- pun
- pun-recognition
---

# Pun Recognition in Portuguese

This is a Pun Recognition model for texts in Portuguese, as reported in two of our publications:

- **Exploring Multimodal Models for Humor Recognition in Portuguese** ([PROPOR 2024 Paper](https://aclanthology.org/2024.propor-1.62/))
- **Puntuguese: A Corpus of Puns in Portuguese with Micro-Edits** ([LREC-COLING 2024 Paper](https://aclanthology.org/2024.lrec-main.1167/))

The model has been fine-tuned on the [Puntuguese](https://huggingface.co/datasets/Superar/Puntuguese) dataset, a collection of puns and corresponding non-pun texts in Portuguese. With this model, we achieved a maximum of **69% F1-score** on the task of pun recognition with Puntuguese.

## Installation and Setup

To use this model, ensure you have the following dependencies installed:

```bash
pip install accelerate datasets scikit-learn torch transformers
```

## How to Use

To load the Puntuguese corpus and use the model for pun classification, run the following script:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.metrics import classification_report
from transformers import pipeline

dataset = load_dataset('Superar/Puntuguese')
classifier = pipeline('text-classification', model='Superar/pun-recognition-pt', device=0)

# Classify every text in the test split
prediction = classifier(dataset['test']['text'])

# Predicted labels come back as strings such as 'LABEL_0' / 'LABEL_1';
# keep only the trailing digit to compare against the gold labels
pred_df = pd.DataFrame(prediction)
pred_df['label'] = pred_df['label'].str[-1].astype(int)

y_true = dataset['test']['label']
y_pred = pred_df['label']
print(classification_report(y_true, y_pred))
```

## Hyperparameters

We used [Weights and Biases](https://wandb.ai/) to run a random search, optimizing for the lowest evaluation loss, with the following configuration:

```python
{
    'method': 'random',
    'metric': {'name': 'loss', 'goal': 'minimize'},
    'parameters': {
        'optim': {'values': ['adamw_torch', 'sgd']},
        'learning_rate': {'distribution': 'uniform', 'min': 1e-6, 'max': 1e-4},
        'per_device_train_batch_size': {'values': [16, 32, 64, 128]},
        'num_train_epochs': {'distribution': 'uniform', 'min': 1, 'max': 5}
    }
}
```

The best hyperparameters found were:

- **Learning Rate**: 8.47e-5
- **Optimizer**: AdamW
- **Training Batch Size**: 128
- **Epochs**: 2

## Citation

```bibtex
@inproceedings{InacioEtAl2024,
  title = {Puntuguese: A Corpus of Puns in {{Portuguese}} with Micro-Edits},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation ({{LREC-COLING}} 2024)},
  author = {In{\'a}cio, Marcio Lima and {Wick-Pedro}, Gabriela and Ramisch, Renata and Esp{\'{\i}}rito Santo, Lu{\'{\i}}s and Chacon, Xiomara S. Q. and Santos, Roney and Sousa, Rog{\'e}rio and Anchi{\^e}ta, Rafael and Goncalo Oliveira, Hugo},
  editor = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen},
  year = {2024},
  month = may,
  pages = {13332--13343},
  publisher = {{ELRA and ICCL}},
  address = {Torino, Italia},
  url = {https://aclanthology.org/2024.lrec-main.1167}
}
```
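If you want to reproduce the fine-tuning setup, the best hyperparameters listed above map directly onto a `transformers.TrainingArguments` configuration. This is a minimal sketch, not our exact training script: the `output_dir` name is hypothetical, and dataset preprocessing and the `Trainer` call itself are omitted.

```python
from transformers import TrainingArguments

# Best hyperparameters from the random search reported above;
# 'pun-recognition-pt' is a hypothetical output directory
training_args = TrainingArguments(
    output_dir='pun-recognition-pt',
    optim='adamw_torch',
    learning_rate=8.47e-5,
    per_device_train_batch_size=128,
    num_train_epochs=2,
)
```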