---
title: ViHSD
emoji: 💻
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 3.5
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  ViHSD is a Vietnamese Hate Speech Detection dataset. This space implements
  accuracy and f1 to evaluate models on ViHSD.
---

# Metric Card for ViHSD

## Metric Description

This metric computes the accuracy and F1 score of models on the ViHSD dataset, introduced in *A Large-Scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts* by Luu et al. (2021). ViHSD is a large-scale dataset for hate speech detection on Vietnamese social media texts: it contains over 30,000 comments, each labeled as CLEAN, OFFENSIVE, or HATE, and is used to evaluate hate speech detection models, including deep learning and transformer models.

## How to Use

```python
from evaluate import load

vihsd_metric = load("phucdev/vihsd")
references = [0, 1]
predictions = [0, 1]
results = vihsd_metric.compute(predictions=predictions, references=references)
```

## Output Values

The `compute` function returns a dictionary containing the accuracy and F1 score of the model on the ViHSD dataset:

- `accuracy`: the proportion of correct predictions among the total number of cases processed. Its range is 0 to 1.
- `f1`: the harmonic mean of precision and recall. Its range is 0 to 1: its lowest possible value is 0 (when either precision or recall is 0), and its highest possible value is 1.0, which means perfect precision and recall.
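For intuition, both outputs can be reproduced with a small pure-Python sketch. This is illustrative only (the space itself delegates to the `evaluate` implementations of `accuracy` and `f1`); `accuracy` and `macro_f1` below are hypothetical helper names, and macro averaging is assumed for the multi-class case:

```python
def accuracy(predictions, references):
    """Proportion of exact matches between predictions and references."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(references) | set(predictions))
    f1_scores = []
    for label in labels:
        tp = sum(p == label and r == label for p, r in zip(predictions, references))
        fp = sum(p == label and r != label for p, r in zip(predictions, references))
        fn = sum(p != label and r == label for p, r in zip(predictions, references))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Labels follow ViHSD's scheme: 0 = CLEAN, 1 = OFFENSIVE, 2 = HATE
references = [0, 0, 1, 2, 2]
predictions = [0, 1, 1, 2, 0]
print(accuracy(predictions, references))  # 0.6
```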

## Values from Popular Papers

The authors of the dataset reported the following values in their paper:

| Model | Pre-trained Model | Accuracy (%) | F1-macro (%) |
|---|---|---|---|
| **DNN Models** | | | |
| Text CNN | fastText | 86.69 | 61.11 |
| GRU | fastText | 85.41 | 60.47 |
| **Transformer Models** | | | |
| BERT | bert-base-multilingual-uncased | 86.60 | 62.38 |
| BERT | bert-base-multilingual-cased | 86.88 | 62.69 |
| XLM-R | xlm-roberta-base | 86.12 | 61.28 |
| DistilBERT | distilbert-base-multilingual-cased | 86.22 | 62.42 |

Details regarding the training process and hyperparameters can be found in the original paper:

- **Pre-processing**: segmenting words with pyvi, removing stopwords, converting text to lowercase, and removing special characters (e.g., hashtags, URLs).
- **Text-CNN model**: trained for 50 epochs with a batch size of 256, sequence length of 100, dropout of 0.5, and a 2D convolution layer with 32 filters of sizes 2, 3, and 5.
- **GRU model**: also trained for 50 epochs with sequence length 100, dropout of 0.5, and a bidirectional GRU layer, using the Adam optimizer.
- **Transformer models (BERT, XLM-R, DistilBERT)**: trained and evaluated with a batch size of 16, 4 epochs, sequence length 100, and a manual seed of 4.
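The pre-processing steps above can be sketched with standard-library regexes. This is a rough illustration, not the paper's exact pipeline: the stopword list here is a tiny hypothetical placeholder (the paper does not publish its list), and pyvi word segmentation (`ViTokenizer.tokenize`) is omitted since it requires the `pyvi` package:

```python
import re

# Tiny illustrative stopword set -- NOT the list used by Luu et al.
STOPWORDS = {"là", "và", "của"}

def preprocess(text: str) -> str:
    """Lowercase, strip URLs, hashtags, and special characters, drop stopwords.
    Word segmentation (pyvi's ViTokenizer.tokenize) is intentionally omitted."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"#\w+", " ", text)           # remove hashtags
    text = re.sub(r"[^\w\s]", " ", text)        # remove remaining special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("Đây là ví dụ #hatespeech https://example.com !!!"))  # đây ví dụ
```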

Nguyen et al. (2023) trained a transformer model specialized for Vietnamese social media texts and reported the following results in their paper:

| Model | Accuracy (%) | Weighted F1 (%) | Micro F1 (%) |
|---|---|---|---|
| viBERT | 85.34 | 85.01 | 62.07 |
| viELECTRA | 86.96 | 86.37 | 63.95 |
| PhoBERT Base | 87.12 | 86.81 | 65.01 |
| PhoBERT Large | 87.32 | 86.98 | 65.14 |
| mBERT (cased) | 83.55 | 83.99 | 60.62 |
| mBERT (uncased) | 83.38 | 81.27 | 58.92 |
| XLM-R Base | 86.36 | 86.08 | 63.39 |
| XLM-R Large | 87.15 | 86.86 | 65.13 |
| XLM-T | 86.22 | 86.12 | 63.48 |
| TwHIN-BERT Base | 86.63 | 86.23 | 63.67 |
| TwHIN-BERT Large | 87.23 | 86.78 | 65.23 |
| Bernice | 86.12 | 86.48 | 64.32 |
| ViSoBERT | 88.51 | 88.31 | 68.77 |

Nguyen et al. (2023) used a batch size of 40, a maximum token length of 128, a learning rate of 2e-5, and the AdamW optimizer with an epsilon of 1e-8. They trained for 10 epochs and evaluated downstream tasks with the best-performing model from those epochs. They used the raw texts of the ViHSD dataset for training and testing, without any additional preprocessing.
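For reference, those reported hyperparameters can be collected into a plain config sketch. The key names below are my own choices, not taken from Nguyen et al.'s code:

```python
# Hypothetical config mirroring the hyperparameters reported by Nguyen et al. (2023).
FINETUNE_CONFIG = {
    "batch_size": 40,
    "max_token_length": 128,
    "learning_rate": 2e-5,
    "optimizer": "AdamW",
    "adam_epsilon": 1e-8,
    "num_epochs": 10,        # best-performing epoch's model is used for evaluation
    "preprocessing": None,   # raw ViHSD text, no additional preprocessing
}
print(FINETUNE_CONFIG["learning_rate"])  # 2e-05
```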

## Examples

```python
from evaluate import load

vihsd_metric = load("phucdev/vihsd")
references = [0, 1]
predictions = [0, 1]
results = vihsd_metric.compute(predictions=predictions, references=references)
```

## Limitations and Bias

- **Data imbalance**: The dataset is heavily imbalanced across categories, with a high concentration in the non-hate-speech class.
- **Coverage of hate speech variants**: Due to the diversity of Vietnamese slang, regional dialects, and evolving language use in online spaces, the dataset may not fully capture the range of expressions used in hate speech, potentially limiting model generalizability.
- **Annotation bias**: Annotator subjectivity in identifying hate speech could lead to inconsistencies. Personal interpretations of what constitutes hate speech, especially for subtle or context-dependent language, may result in labeling bias.
- **Linguistic complexity**: Vietnamese language complexities, like diacritics and slang (teencode), can impact accurate labeling and model training, as hate speech often includes informal or deliberately misspelled words that are harder to classify correctly.
- **Domain-specific limitations**: Data gathered from specific social media platforms may not generalize well to other platforms, limiting the dataset's applicability in varied online environments.

## Citation

```bibtex
@InProceedings{10.1007/978-3-030-79457-6_35,
    author="Luu, Son T.
    and Nguyen, Kiet Van
    and Nguyen, Ngan Luu-Thuy",
    editor="Fujita, Hamido
    and Selamat, Ali
    and Lin, Jerry Chun-Wei
    and Ali, Moonis",
    title="A Large-Scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts",
    booktitle="Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices",
    year="2021",
    publisher="Springer International Publishing",
    address="Cham",
    pages="415--426",
    abstract="In recent years, Vietnam witnesses the mass development of social network users on different social platforms such as Facebook, Youtube, Instagram, and Tiktok. On social media, hate speech has become a critical problem for social network users. To solve this problem, we introduce the ViHSD - a human-annotated dataset for automatically detecting hate speech on the social network. This dataset contains over 30,000 comments, each comment in the dataset has one of three labels: CLEAN, OFFENSIVE, or HATE. Besides, we introduce the data creation process for annotating and evaluating the quality of the dataset. Finally, we evaluate the dataset by deep learning and transformer models.",
    isbn="978-3-030-79457-6"
}
@misc{nguyen2023visobertpretrainedlanguagemodel,
    title={ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing},
    author={Quoc-Nam Nguyen and Thang Chau Phan and Duc-Vu Nguyen and Kiet Van Nguyen},
    year={2023},
    eprint={2310.11166},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2310.11166},
}
```

## Further References