Model Card: Wav2vec_Classroom_FT
Model Overview
Model Name: Wav2vec_Classroom_FT
Version: 1.0
Developed By: Ahmed Adel Attia (University of Maryland and Stanford University)
Date: 2025
Description:
Wav2vec_Classroom_FT is an automatic speech recognition (ASR) model for classroom speech transcription, trained by direct fine-tuning on a small set of human-verified, gold-standard transcriptions. Unlike NCTE-WSP-ASR, it does not use weak transcriptions for intermediate training; it is trained solely on high-quality annotations.
This model is adapted from Wav2vec-Classroom, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves direct fine-tuning on a limited transcribed dataset.
This model was originally trained with the fairseq library and later ported to Hugging Face.
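Since the port targets Hugging Face, the checkpoint should load with the standard `transformers` Wav2Vec2 classes. A minimal inference sketch, assuming the usual Wav2Vec2 CTC interface (`classroom_clip.wav` is a placeholder file name; check the repository files for the exact processor configuration):

```python
# Minimal inference sketch; assumes the standard Wav2Vec2 CTC interface.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("aadel4/Wav2vec_Classroom_FT")
model = Wav2Vec2ForCTC.from_pretrained("aadel4/Wav2vec_Classroom_FT")

# Load a classroom recording and resample to the 16 kHz rate wav2vec 2.0 expects.
# "classroom_clip.wav" is a placeholder for your own audio file.
waveform, sample_rate = torchaudio.load("classroom_clip.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```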
Use Case:
- Speech-to-text transcription for classroom environments.
- ASR applications where only a small amount of labeled training data is available.
- Benchmarking ASR performance without weakly supervised pretraining.
Model Details
Architecture: wav2vec 2.0-based model fine-tuned with fairseq
Training Data:
- NCTE-Gold: 13 hours of manually transcribed classroom recordings.
Training Strategy:
- Direct Fine-tuning: The model is fine-tuned directly on NCTE-Gold, with no intermediate training stage on weak transcripts (see the sketch after this list).
- Evaluation: The model is tested on classroom ASR tasks to compare its performance with WSP-based models.
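For illustration only, a condensed single-step sketch of this direct CTC fine-tuning in `transformers` (the original training used fairseq, so this is not the authors' pipeline). Here `processor`, `audio`, and `text` are hypothetical stand-ins for a Wav2Vec2Processor prepared for the NCTE-Gold vocabulary and one 16 kHz example with its gold transcript:

```python
# Condensed sketch of direct CTC fine-tuning on gold transcripts.
# Assumes `processor` is a Wav2Vec2Processor prepared for the NCTE-Gold
# vocabulary, and `audio` / `text` are one 16 kHz waveform and its transcript.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "aadel4/Wav2vec_Classroom",  # start from the CPT checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()  # common practice when fine-tuning on small datasets
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text=text, return_tensors="pt").input_ids

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```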
Evaluation Results
Word Error Rate (WER) comparison on NCTE and MPT test sets:
| Training Data | NCTE WER | MPT WER |
|---|---|---|
| Baseline (TEDLIUM-trained ASR) | 55.82 / 50.56 | 55.11 / 50.50 |
| NCTE-Gold only (NCTE-Baseline-ASR, this model) | 21.12 / 16.47 | 31.52 / 27.93 |
| NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold) | 16.54 / 13.51 | 25.07 / 23.70 |
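For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A toy computation with the jiwer package (the strings below are hypothetical, not drawn from the test sets):

```python
# Toy WER computation with jiwer; the example strings are hypothetical.
import jiwer

reference = ["the students worked through the problem together"]
hypothesis = ["the students work through problem together"]

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```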
Limitations
- The model is trained on a small dataset (13 hours), which limits its ability to generalize beyond classroom speech.
- Its WER is higher than that of NCTE-WSP-ASR, which benefits from intermediate training on weak transcripts.
- Background noise, overlapping speech, and speaker variations may still impact transcription quality.
Usage Request
If you use the NCTE-Baseline-ASR model (Wav2vec_Classroom_FT) in your research, please acknowledge this work and cite the original paper submitted to Interspeech 2025.
For inquiries or collaborations, please contact the authors of the original paper.
Base Model: aadel4/Wav2vec_Classroom