license: mit
base_model:
- facebook/wav2vec2-large-robust
- aadel4/Wav2vec_Classroom
pipeline_tag: automatic-speech-recognition
library_name: transformers
language: en
tags:
- audio
- automatic-speech-recognition
- wav2vec2
Model Card: Wav2vec_Classroom_WSP_FT
Model Overview
Model Name: Wav2vec_Classroom_WSP_FT
Version: 1.0
Developed By: Ahmed Adel Attia (University of Maryland & Stanford University)
Date: 2025
Description:
Wav2vec_Classroom_WSP_FT is an automatic speech recognition (ASR) model trained specifically for classroom speech transcription using a weakly supervised pretraining (WSP) approach. The model first undergoes supervised pretraining on weakly transcribed classroom data (NCTE-Weak) and is then fine-tuned using a small amount of human-verified gold-standard data (NCTE-Gold). This methodology allows the model to generalize well despite the scarcity of precisely transcribed classroom speech.
This model is adapted from Wav2vec-Classroom, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves further fine-tuning to leverage weak transcriptions before final refinement on high-quality annotations.
This model was originally trained with the fairseq library and then ported to Hugging Face Transformers.
Use Case:
- Speech-to-text transcription for classroom environments.
- Educational research and analysis of classroom discourse.
- Low-resource ASR applications where gold-standard labels are limited.
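
A minimal inference sketch with Hugging Face Transformers follows. It rests on several assumptions that are not confirmed in this card: that the checkpoint is published under the repository ID `aadel4/Wav2vec_Classroom_WSP_FT` (inferred from the model name and the `aadel4/Wav2vec_Classroom` base model), that the repository includes a CTC head and processor loadable with `Wav2Vec2ForCTC` and `Wav2Vec2Processor`, and that `librosa` is available for audio loading. The 30-second chunk length and the helper names are illustrative choices.

```python
from typing import List, Tuple

MODEL_ID = "aadel4/Wav2vec_Classroom_WSP_FT"  # assumed repository ID, not confirmed
SAMPLE_RATE = 16_000  # wav2vec 2.0 checkpoints expect 16 kHz mono input
MIN_SAMPLES = 400     # receptive field of the wav2vec 2.0 feature encoder (25 ms)


def chunk_bounds(n_samples: int, chunk_seconds: float = 30.0,
                 sample_rate: int = SAMPLE_RATE) -> List[Tuple[int, int]]:
    """Split a long recording into consecutive [start, end) sample ranges."""
    step = int(chunk_seconds * sample_rate)
    if step <= 0:
        raise ValueError("chunk_seconds * sample_rate must be at least 1")
    return [(s, min(s + step, n_samples)) for s in range(0, n_samples, step)]


def transcribe(path: str, model_id: str = MODEL_ID) -> str:
    """Greedy CTC transcription of one audio file, processed chunk by chunk."""
    # Heavy dependencies are imported lazily so chunk_bounds stays usable alone.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

    # librosa resamples to the requested rate and downmixes to mono.
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)

    pieces = []
    for start, end in chunk_bounds(len(audio)):
        if end - start < MIN_SAMPLES:
            continue  # too short for the convolutional feature encoder
        inputs = processor(audio[start:end], sampling_rate=SAMPLE_RATE,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        ids = torch.argmax(logits, dim=-1)  # greedy decoding, no language model
        pieces.append(processor.batch_decode(ids)[0])
    return " ".join(p for p in pieces if p)
```

Under these assumptions, `transcribe("lesson.wav")` (a placeholder file name) would return the transcript as a single string. Fixed-length chunking can cut words at chunk boundaries, so overlapping windows or voice-activity-based segmentation may work better on long classroom recordings.
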
Model Details
Architecture: wav2vec 2.0-based model, fine-tuned with fairseq
Training Data:
- NCTE-Weak: 5,000 hours of weakly transcribed classroom recordings from the NCTE dataset.
- NCTE-Gold: 13 hours of manually transcribed classroom recordings.
Training Strategy:
- Weakly Supervised Pretraining (WSP): The model is first trained using NCTE-Weak transcripts, which contain alignment errors and omissions but provide useful weak supervision.
- Precise Fine-tuning: The pretrained model is then fine-tuned on NCTE-Gold so that it adapts to high-quality transcriptions.
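
The stage order described in this card can be written down as a small, library-agnostic plan. This only illustrates the sequence and the data imbalance. The `Stage` class and the helper functions are hypothetical names, the hour counts are taken from the Training Data list, and no hyperparameters appear because the card does not state them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    hours: Optional[float]  # None where the card gives no figure
    labels: str             # "none", "weak", or "gold"


# Stage order as described in the card; this is not a fairseq configuration.
PLAN: List[Stage] = [
    Stage("continued pretraining (Wav2vec-Classroom)",
          "unlabeled classroom speech", None, "none"),
    Stage("weakly supervised pretraining", "NCTE-Weak", 5000.0, "weak"),
    Stage("fine-tuning", "NCTE-Gold", 13.0, "gold"),
]


def describe(plan: List[Stage]) -> List[str]:
    """Render each stage as one numbered, human-readable line."""
    lines = []
    for i, s in enumerate(plan, start=1):
        size = f"{s.hours:g} h" if s.hours is not None else "size not stated"
        lines.append(f"{i}. {s.name}: {s.data} ({size}, {s.labels} labels)")
    return lines


def weak_to_gold_ratio(plan: List[Stage]) -> float:
    """Hours of weakly labeled audio per hour of gold-labeled audio."""
    weak = sum(s.hours for s in plan if s.labels == "weak" and s.hours)
    gold = sum(s.hours for s in plan if s.labels == "gold" and s.hours)
    return weak / gold


if __name__ == "__main__":
    print("\n".join(describe(PLAN)))
    print(f"weak-to-gold ratio: about {weak_to_gold_ratio(PLAN):.0f}x")
```

The ratio works out to roughly 385 hours of weakly transcribed audio for every hour of gold-standard audio, which is why the noisy labels are used first and the small gold set is reserved for the final stage.
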
Evaluation Results
Word Error Rate (WER, %) comparison on the NCTE and MPT test sets (lower is better):
Training Data | NCTE WER | MPT WER |
---|---|---|
Baseline (TEDLIUM-trained ASR) | 55.82 / 50.56 | 55.11 / 50.50 |
NCTE-Weak only | 36.23 / 32.30 | 50.84 / 46.09 |
NCTE-Gold only | 21.12 / 16.47 | 31.52 / 27.93 |
Self-training | 17.45 / 15.09 | 27.42 / 26.24 |
NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold) | 16.54 / 13.51 | 25.07 / 23.70 |
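
For reference, WER counts word-level substitutions (S), deletions (D), and insertions (I) against the number of reference words (N): WER = (S + D + I) / N. The sketch below is a plain-Python implementation via edit distance, plus a helper for relative WER reduction. It is not the scoring script behind the table (text normalization in particular is not specified in this card), and the function names are illustrative.

```python
from typing import List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer
```

For example, `wer("what is three plus four", "what is tree plus")` gives 0.4 (one substitution and one deletion over five reference words). Taking the first figure in each NCTE cell, `relative_reduction(21.12, 16.54)` is about 0.217, so moving from NCTE-Gold only to NCTE-WSP-ASR removes roughly 21.7% of the errors.
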
Limitations
- The model relies on weak supervision, and transcription quality depends on the balance between weak and gold-standard data.
- Classroom noise, overlapping speech, and spontaneous interactions may still lead to recognition errors.
- The model was trained specifically on elementary math classrooms and may not generalize well to other educational settings without further adaptation.
Usage Request
If you use the NCTE-WSP-ASR model in your research, please acknowledge this work and cite the original paper, submitted to Interspeech 2025.
For inquiries or collaborations, don't hesitate to contact me at [email protected] or [email protected]