Model Card: Wav2vec_Classroom_WSP_FT

Model Overview

Model Name: Wav2vec_Classroom_WSP_FT
Version: 1.0
Developed By: Ahmed Adel Attia (University of Maryland & Stanford University)
Date: 2025

Description:
Wav2vec_Classroom_WSP_FT is an automatic speech recognition (ASR) model trained specifically for classroom speech transcription using a weakly supervised pretraining (WSP) approach. The model first undergoes supervised pretraining on weakly transcribed classroom data (NCTE-Weak) and is then fine-tuned using a small amount of human-verified gold-standard data (NCTE-Gold). This methodology allows the model to generalize well despite the scarcity of precisely transcribed classroom speech.

This model is adapted from Wav2vec-Classroom, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves further fine-tuning to leverage weak transcriptions before final refinement on high-quality annotations.

This model was originally trained with the fairseq library and then ported to Hugging Face.
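Since the port follows the standard transformers wav2vec 2.0 interface, transcription can be run as in the minimal sketch below. This assumes the repository exposes a Wav2Vec2ForCTC checkpoint with a matching processor; `classroom_clip.wav` is a placeholder for your own audio file.

```python
# Minimal inference sketch, assuming a standard Wav2Vec2ForCTC checkpoint
# with a matching processor; "classroom_clip.wav" is a placeholder file.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "aadel4/Wav2vec_Classroom_WSP_FT"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

# wav2vec 2.0 expects mono 16 kHz input; resample if necessary.
waveform, sr = torchaudio.load("classroom_clip.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: frame-wise argmax, then collapse repeats/blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```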

Use Case:

  • Speech-to-text transcription for classroom environments.
  • Educational research and analysis of classroom discourse.
  • Low-resource ASR applications where gold-standard labels are limited.

Model Details

Architecture: wav2vec 2.0-based model (~315M parameters, F32 weights), fine-tuned with fairseq

Training Data:

  • NCTE-Weak: 5000 hours of weak transcriptions from the NCTE dataset.
  • NCTE-Gold: 13 hours of manually transcribed classroom recordings.

Training Strategy:

  1. Weakly Supervised Pretraining (WSP): The model is first trained using NCTE-Weak transcripts, which contain alignment errors and omissions but provide useful weak supervision.
  2. Precise Fine-tuning: The pretrained model is fine-tuned on NCTE-Gold, ensuring it adapts to high-quality transcriptions.
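As a rough illustration of this two-stage recipe, the sketch below chains both stages with the transformers Trainer. It is not the authors' training code (the released models were trained with fairseq): the base checkpoint id, the dataset objects `ncte_weak` and `ncte_gold`, and all hyperparameters are hypothetical placeholders.

```python
# Illustrative sketch of the two-stage WSP recipe using the transformers
# Trainer; the published models were trained with fairseq, so treat every
# name and hyperparameter here as an assumption, not the authors' setup.
from dataclasses import dataclass
from transformers import (Trainer, TrainingArguments, Wav2Vec2ForCTC,
                          Wav2Vec2Processor)

@dataclass
class CTCCollator:
    """Pads variable-length audio and transcripts into a batch for CTC."""
    processor: Wav2Vec2Processor

    def __call__(self, features):
        audio = [{"input_values": f["input_values"]} for f in features]
        labels = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(
            audio, padding=True, return_tensors="pt")
        label_batch = self.processor.tokenizer.pad(
            labels, padding=True, return_tensors="pt")
        # CTC loss ignores label positions set to -100 (padding).
        batch["labels"] = label_batch["input_ids"].masked_fill(
            label_batch["attention_mask"].ne(1), -100)
        return batch

def fine_tune(model, dataset, processor, output_dir, lr):
    args = TrainingArguments(output_dir=output_dir, learning_rate=lr,
                             per_device_train_batch_size=8,
                             num_train_epochs=3)  # placeholder schedule
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=CTCCollator(processor)).train()
    return model

# `base_id`, `ncte_weak`, and `ncte_gold` are placeholders: the CPT base
# checkpoint and the preprocessed weak/gold datasets are not public here.
processor = Wav2Vec2Processor.from_pretrained(base_id)
model = Wav2Vec2ForCTC.from_pretrained(base_id)

# Stage 1: weakly supervised pretraining on ~5,000 h of weak transcripts.
model = fine_tune(model, ncte_weak, processor, "wsp_stage1", lr=3e-5)
# Stage 2: precise fine-tuning on the 13 h gold-standard set.
model = fine_tune(model, ncte_gold, processor, "wsp_stage2", lr=1e-5)
model.save_pretrained("Wav2vec_Classroom_WSP_FT")
```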

Evaluation Results

Word error rate (WER) comparison on the NCTE and MPT test sets (each cell lists the two WER values reported in the paper):

| Training Data | NCTE WER | MPT WER |
|---|---|---|
| Baseline (TEDLIUM-trained ASR) | 55.82 / 50.56 | 55.11 / 50.50 |
| NCTE-Weak only | 36.23 / 32.30 | 50.84 / 46.09 |
| NCTE-Gold only | 21.12 / 16.47 | 31.52 / 27.93 |
| Self-training | 17.45 / 15.09 | 27.42 / 26.24 |
| NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold) | 16.54 / 13.51 | 25.07 / 23.70 |
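For reference, WER values like those in the table can be computed with the jiwer package. This is a common scoring choice, not necessarily the pipeline used in the paper; text normalization alone can noticeably shift WER.

```python
# Word error rate with jiwer -- a common scoring tool; the paper's exact
# normalization and scoring setup may differ.
import jiwer

reference = "the teacher asked the class to solve the problem"  # gold transcript
hypothesis = "the teacher asked class to solve a problem"       # ASR output

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```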

Limitations

  • The model relies on weak supervision, and transcription quality is dependent on the balance between weak and gold-standard data.
  • Classroom noise, overlapping speech, and spontaneous interactions may still lead to recognition errors.
  • The model was trained specifically on elementary math classrooms and may not generalize well to other educational settings without further adaptation.

Usage Request

If you use the NCTE-WSP-ASR model in your research, please acknowledge this work and refer to the original paper submitted to Interspeech 2025.

For inquiries or collaborations, don't hesitate to contact me at [email protected] or [email protected].
