# CREMA-D Emotion Classification using HuBERT
This project fine-tunes the HuBERT model for emotion classification on the CREMA-D dataset. The model classifies audio clips into one of six emotion categories: Anger (ANG), Disgust (DIS), Fear (FEA), Happiness (HAP), Neutral (NEU), and Sadness (SAD).
## Model Overview

- **Architecture:** HuBERT for Sequence Classification
- **Number of Labels:** 6 (ANG, DIS, FEA, HAP, NEU, SAD)
- **Input Type:** audio waveform (`.wav` format)
- **Sample Rate:** 16,000 Hz
- **Model Layers:** 24 hidden layers, 16 attention heads per layer
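As a sanity check, these architecture details can be confirmed from the Hugging Face config of the base checkpoint used in the loading example below (a minimal sketch):

```python
from transformers import HubertConfig

# Inspect the HuBERT-Large architecture used here.
config = HubertConfig.from_pretrained("facebook/hubert-large-ls960-ft")
print(config.num_hidden_layers)    # 24
print(config.num_attention_heads)  # 16
```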
## Dataset

The model is trained on the CREMA-D dataset, a collection of emotional audio clips with six distinct labels:

- Anger (ANG)
- Disgust (DIS)
- Fear (FEA)
- Happiness (HAP)
- Neutral (NEU)
- Sadness (SAD)
### Data Structure

- **Training Data Path:** `/workspace/dataset/CREMA-D/train_only_augmented`
- **Validation Data Path:** `/workspace/dataset/CREMA-D/val_only`

The audio files are in `.wav` format, and each filename encodes the emotion label.
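As an illustration, the label can be parsed from a filename like this (a minimal sketch assuming the standard CREMA-D naming convention `ActorID_SentenceID_Emotion_Intensity.wav`, e.g. `1001_DFA_ANG_XX.wav`):

```python
from pathlib import Path

EMOTION_LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

def label_from_filename(path: str) -> int:
    """Extract the emotion label index from a CREMA-D filename.

    Assumes the standard convention ActorID_SentenceID_Emotion_Intensity.wav,
    where the third underscore-separated field is the emotion code.
    """
    emotion_code = Path(path).stem.split("_")[2]
    return EMOTION_LABELS.index(emotion_code)

print(label_from_filename("1001_DFA_ANG_XX.wav"))  # 0 (ANG)
```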
## Training Procedure

- **Optimizer:** AdamW with learning rate 1e-5
- **Loss Function:** CrossEntropyLoss
- **Scheduler:** learning rate halved every 10 epochs
- **Batch Size:** 32
- **Early Stopping Patience:** 15 epochs

The model was trained for up to 50 epochs, with early stopping triggered if the validation loss did not improve for 15 consecutive epochs.
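For reference, here is a minimal training-loop sketch matching these settings (the actual training script is not included here; `model`, `train_loader`, and `val_loader` are assumptions):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# Assumed: `model` is the classifier from the loading example below, and
# `train_loader` / `val_loader` are DataLoaders of (input_values, label)
# batches with batch_size=32.
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # halve the LR every 10 epochs
criterion = torch.nn.CrossEntropyLoss()

best_val_loss = float("inf")
patience, stale_epochs = 15, 0

for epoch in range(50):
    model.train()
    for input_values, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(input_values).logits, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

    # Validation loss drives checkpointing and early stopping.
    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(x).logits, y).item() for x, y in val_loader
        ) / len(val_loader)

    if val_loss < best_val_loss:
        best_val_loss, stale_epochs = val_loss, 0
        torch.save(model.state_dict(), "checkpoints/best_model_epoch.pth")
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break
```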
## Checkpoints

The best model checkpoint is saved at:

`/workspace/UndergraduateResearchAssistant/GraduateProject/jihan/HuBERT-Crema-D/checkpoints/best_model_epoch.pth`
## Evaluation Metrics

The model is evaluated on:

- **Accuracy:** overall classification accuracy
- **Precision, Recall, F1-score:** computed per class for a detailed view of model performance
- **Confusion Matrix:** shows model predictions versus actual labels

Visualizations include:

- **Loss and Accuracy Plots:** training vs. validation metrics across epochs, saved as `loss_plot_agmented.png` and `accuracy_plot_agmented.png` in the checkpoint directory
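A minimal sketch of how these metrics can be computed with scikit-learn, assuming `y_true` and `y_pred` hold integer class indices collected during a validation pass:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

EMOTION_LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

# y_true / y_pred: lists of integer class indices gathered during validation.
print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, and F1-score.
print(classification_report(y_true, y_pred, target_names=EMOTION_LABELS))
# Rows = actual labels, columns = predicted labels.
print(confusion_matrix(y_true, y_pred))
```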
## How to Use

### Loading the Model

To load the fine-tuned model:

```python
import torch
from transformers import HubertForSequenceClassification

model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-large-ls960-ft", num_labels=6
)
model.load_state_dict(
    torch.load(
        "/workspace/UndergraduateResearchAssistant/GraduateProject/jihan/HuBERT-Crema-D/checkpoints/best_model_epoch.pth"
    )
)
model.eval()
```
### Inference

To perform inference on an audio file:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor

# Label order matches the model's six output classes.
EMOTION_LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
waveform, sample_rate = torchaudio.load("path/to/audio.wav")

# Resample to the 16 kHz rate the model expects.
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)

input_values = processor(
    waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
).input_values
with torch.no_grad():
    outputs = model(input_values)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(probs, dim=-1).item()
print(f"Predicted Emotion: {EMOTION_LABELS[predicted_label]}")
```
## Requirements

- Python 3.8+
- PyTorch 1.9+
- Transformers (Hugging Face) 4.12+
- torchaudio 0.9+
- scikit-learn (for evaluation metrics)
- matplotlib (for visualization)

Install the required packages:

```bash
pip install torch torchaudio transformers scikit-learn matplotlib
```
## Citation

If you use this model in your work, please cite the original CREMA-D dataset paper and the HuBERT paper.

## License

This project is licensed under the MIT License.