# CREMA-D Emotion Classification using HuBERT
This project fine-tunes the HuBERT model for emotion classification on the CREMA-D dataset. The model classifies audio clips into one of six emotion categories: Anger (ANG), Disgust (DIS), Fear (FEA), Happiness (HAP), Neutral (NEU), and Sadness (SAD).
## Model Overview

- **Architecture:** HuBERT for Sequence Classification
- **Number of Labels:** 6 (ANG, DIS, FEA, HAP, NEU, SAD)
- **Input Type:** audio waveform (`.wav` format)
- **Sample Rate:** 16,000 Hz
- **Model Layers:** 24 hidden layers, 16 attention heads per layer
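As a sanity check, these architecture details can be confirmed from the Hugging Face config of the base checkpoint used in the loading example below (a minimal sketch):

```python
from transformers import HubertConfig

# Inspect the HuBERT-Large architecture used here.
config = HubertConfig.from_pretrained("facebook/hubert-large-ls960-ft")
print(config.num_hidden_layers)    # 24
print(config.num_attention_heads)  # 16
```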
## Dataset

The model is trained on the CREMA-D dataset, a collection of emotional audio clips with six distinct labels:

- Anger (ANG)
- Disgust (DIS)
- Fear (FEA)
- Happiness (HAP)
- Neutral (NEU)
- Sadness (SAD)
### Data Structure

- **Training Data Path:** `/workspace/dataset/CREMA-D/train_only_augmented`
- **Validation Data Path:** `/workspace/dataset/CREMA-D/val_only`

The audio files are in `.wav` format, and each filename encodes the emotion label.
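As an illustration, the label can be parsed from a filename like this (a minimal sketch assuming the standard CREMA-D naming convention `ActorID_SentenceID_Emotion_Intensity.wav`, e.g. `1001_DFA_ANG_XX.wav`):

```python
from pathlib import Path

EMOTION_LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

def label_from_filename(path: str) -> int:
    """Extract the emotion label index from a CREMA-D filename.

    Assumes the standard convention ActorID_SentenceID_Emotion_Intensity.wav,
    where the third underscore-separated field is the emotion code.
    """
    emotion_code = Path(path).stem.split("_")[2]
    return EMOTION_LABELS.index(emotion_code)

print(label_from_filename("1001_DFA_ANG_XX.wav"))  # 0 (ANG)
```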
## Training Procedure

- **Optimizer:** AdamW with learning rate 1e-5
- **Loss Function:** CrossEntropyLoss
- **Scheduler:** learning rate halved every 10 epochs
- **Batch Size:** 32
- **Early Stopping Patience:** 15 epochs

The model was trained for up to 50 epochs, with early stopping triggered if the validation loss did not improve for 15 consecutive epochs.
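For reference, here is a minimal training-loop sketch matching these settings (the actual training script is not included here; `model`, `train_loader`, and `val_loader` are assumptions):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# Assumed: `model` is the classifier from the loading example below, and
# `train_loader` / `val_loader` are DataLoaders of (input_values, label)
# batches with batch_size=32.
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # halve the LR every 10 epochs
criterion = torch.nn.CrossEntropyLoss()

best_val_loss = float("inf")
patience, stale_epochs = 15, 0

for epoch in range(50):
    model.train()
    for input_values, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(input_values).logits, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

    # Validation loss drives checkpointing and early stopping.
    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(x).logits, y).item() for x, y in val_loader
        ) / len(val_loader)

    if val_loss < best_val_loss:
        best_val_loss, stale_epochs = val_loss, 0
        torch.save(model.state_dict(), "checkpoints/best_model_epoch.pth")
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break
```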
## Checkpoints

The best model checkpoint is saved at:

`/workspace/UndergraduateResearchAssistant/GraduateProject/jihan/HuBERT-Crema-D/checkpoints/best_model_epoch.pth`
## Evaluation Metrics

The model is evaluated on:

- **Accuracy:** overall classification accuracy
- **Precision, Recall, F1-score:** computed per class for a detailed view of model performance
- **Confusion Matrix:** shows model predictions versus actual labels

Visualizations include:

- **Loss and Accuracy Plots:** training vs. validation metrics across epochs, saved as `loss_plot_agmented.png` and `accuracy_plot_agmented.png` in the checkpoint directory
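A minimal sketch of how these metrics can be computed with scikit-learn, assuming `y_true` and `y_pred` hold integer class indices collected during a validation pass:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

EMOTION_LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

# y_true / y_pred: lists of integer class indices gathered during validation.
print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, and F1-score.
print(classification_report(y_true, y_pred, target_names=EMOTION_LABELS))
# Rows = actual labels, columns = predicted labels.
print(confusion_matrix(y_true, y_pred))
```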
## How to Use

### Loading the Model

To load the fine-tuned model:

```python
import torch
from transformers import HubertForSequenceClassification

model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-large-ls960-ft", num_labels=6
)
model.load_state_dict(
    torch.load(
        "/workspace/UndergraduateResearchAssistant/GraduateProject/jihan/HuBERT-Crema-D/checkpoints/best_model_epoch.pth"
    )
)
model.eval()
```
### Inference

To perform inference on an audio file:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor

# Label order matches the model's six output classes.
EMOTION_LABELS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
waveform, sample_rate = torchaudio.load("path/to/audio.wav")

# Resample to the 16 kHz rate the model expects.
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)

input_values = processor(
    waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
).input_values
with torch.no_grad():
    outputs = model(input_values)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(probs, dim=-1).item()
print(f"Predicted Emotion: {EMOTION_LABELS[predicted_label]}")
```
## Requirements

- Python 3.8+
- PyTorch 1.9+
- Transformers (Hugging Face) 4.12+
- torchaudio 0.9+
- scikit-learn (for evaluation metrics)
- matplotlib (for visualization)

Install the required packages:

```bash
pip install torch torchaudio transformers scikit-learn matplotlib
```
## Citation

If you use this model in your work, please cite the original CREMA-D dataset paper and the HuBERT paper.

## License

This project is licensed under the MIT License.