Wav2Vec2-Large-Robust fine-tuned on the revised ETRI Korean-English data for pronunciation modeling

This repository contains a Wav2Vec2-Large-Robust model fine-tuned for phoneme recognition. The model was trained and evaluated on our in-house dataset of English speech produced by Korean learners, which was created with ETRI and revised by the SNU SLP lab.

Data Information

  • Dataset Name: English Pronunciation of Korean Learners (made with ETRI) revised by SNU SLP lab

  • Data Type: Speech recordings of Korean learners speaking English, annotated with phoneme sequences.

  • Annotation: Each utterance is transcribed at the phoneme level, with mispronounced phonemes marked by an _err suffix (for example, ih_err). The marked errors cover phoneme substitutions, insertions, and deletions that arise from the influence of Korean on English pronunciation.

  • Train Set: 14,305 samples

  • Valid Set: 1,590 samples

  • Test Set: 3,974 samples
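
To make the _err annotation convention concrete, here is a minimal sketch of how a phoneme transcript could be split into (phoneme, is_error) pairs. It assumes that transcripts are space-separated and that the marker is attached directly to the phoneme as a suffix, as in the sample shown in this card. The function name parse_phonemes is hypothetical and not part of this repository.

```python
ERR_SUFFIX = "_err"  # error marker used in the annotations


def parse_phonemes(transcript: str) -> list[tuple[str, bool]]:
    """Split a space-separated transcript into (phoneme, is_error) pairs.

    Assumption: the error marker is a plain suffix on the phoneme token.
    """
    parsed = []
    for token in transcript.split():
        is_error = token.endswith(ERR_SUFFIX)
        # Strip the suffix so the base phoneme can be compared directly.
        base = token[: -len(ERR_SUFFIX)] if is_error else token
        parsed.append((base, is_error))
    return parsed


# Fragment of the sample transcript for the word LOOKING.
pairs = parse_phonemes("l uh k ih_err ng")
error_count = sum(1 for _, is_error in pairs if is_error)
print(pairs, error_count)
```

Keeping the base phoneme and the error flag separate makes it possible to score phoneme identity and error detection independently, if that is useful for a downstream analysis.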

Training Procedure

The model was fine-tuned for phoneme recognition using the Hugging Face transformers library. Below are the training steps:

  1. Data preprocessing to align audio with phoneme labels.
  2. Wav2Vec2-Large-Robust model fine-tuning with CTC loss.
  3. Evaluation on validation and test sets.
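
For readers unfamiliar with CTC, the sketch below illustrates how frame-level predictions are typically turned into a phoneme sequence at evaluation time: consecutive repeats are merged, then blank symbols are dropped. This is a generic greedy CTC decoding illustration, not the decoding code used for this model, which the card does not describe. The blank symbol "<pad>" and the frame labels are assumptions made for the example.

```python
from itertools import groupby

BLANK = "<pad>"  # assumption: symbol used as the CTC blank


def ctc_greedy_collapse(frame_labels: list[str], blank: str = BLANK) -> list[str]:
    """Collapse per-frame labels into a phoneme sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank symbols.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Hypothetical per-frame argmax labels for the start of LOOKING.
frames = ["<pad>", "l", "l", "<pad>", "uh", "uh", "k", "k", "<pad>"]
print(" ".join(ctc_greedy_collapse(frames)))
```

The blank is what allows a genuinely repeated phoneme to survive the merge step: ["p", "<pad>", "p"] decodes to two separate p tokens, while ["p", "p"] decodes to one.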

Training Hyperparameters

  • Epochs: 50
  • Learning Rate: 0.0001
  • Warmup Ratio: 0.1
  • Scheduler: Linear
  • Batch Size: 8
  • Loss Reduction: Mean
  • Feature Extractor Freeze: Enabled
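
As a rough illustration of what these settings imply, the sketch below computes the learning rate under a linear schedule with warmup. It assumes one optimizer step per batch (single device, no gradient accumulation) and that the last partial batch of each epoch is kept; neither is stated in this card, so the step counts are estimates. The function linear_lr is a hypothetical helper written for this example.

```python
import math

# Values taken from this card.
TRAIN_SAMPLES = 14_305
BATCH_SIZE = 8
EPOCHS = 50
BASE_LR = 1e-4
WARMUP_RATIO = 0.1

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to BASE_LR during warmup,
    then decays linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = total_steps - step
    return BASE_LR * max(0.0, remaining / max(1, total_steps - warmup_steps))


for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at 0.0001 once warmup ends, after roughly the first five epochs, and reaches zero at the final step.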

Training Results

The following metrics were achieved during training:

  • Final Training Loss: 0.2415
  • Phoneme Error Rate (PER) on Training Set: 0.0508
  • Validation Loss: 0.3999
  • Phoneme Error Rate (PER) on Validation Set: 0.1622

Test Results

The model was evaluated on the test dataset with the following performance:

  • Phoneme Error Rate (PER): 0.0905

Phoneme Data Example

Below is an example of how the dataset is structured for phoneme recognition tasks:

Sample:

  • Provided Sentence: I'M LOOKING FOR MY PUPPY LUKE HE RAN AWAY THIS MORNING
  • Reference Phonemes (as pronounced by the Korean learner): ay m l uh k ih_err ng f er m ay p ah p iy l uw k hh iy l ah n ah w ey dh ih s m ao r n ih_err n
  • Predicted Phonemes: ay m l uh k ih ng f er m ay p ah p iy l uh k hh iy l ah n ah w ey dh ih s m ao r n ih_err ng
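
The sample above can be used to show how a phoneme error rate is commonly computed: the Levenshtein (edit) distance between the reference and predicted token sequences, divided by the number of reference tokens. The sketch below follows that common definition. It assumes that tokens with and without the _err suffix count as different symbols; the exact scoring script behind the reported PER values is not given in this card and may differ.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of reference phonemes."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(ref, hyp) / len(ref)


reference = ("ay m l uh k ih_err ng f er m ay p ah p iy l uw k hh iy "
             "l ah n ah w ey dh ih s m ao r n ih_err n")
predicted = ("ay m l uh k ih ng f er m ay p ah p iy l uh k hh iy "
             "l ah n ah w ey dh ih s m ao r n ih_err ng")

print(phoneme_error_rate(reference, predicted))
```

For this utterance the two sequences differ by three substitutions (ih_err vs. ih, uw vs. uh, n vs. ng) over 35 reference phonemes, giving a per-utterance PER of 3/35, or about 0.086.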

Training Logs

TensorBoard logs are available for detailed training analysis:

  • events.out.tfevents.1737043507.oem-WS-C621E-SAGE-Series.1534005.0
  • events.out.tfevents.1737088179.oem-WS-C621E-SAGE-Series.1534005.1

Use the following command to visualize logs:

tensorboard --logdir=./logs/