README.md · aadel4/Wav2vec_Classroom_WSP

metadata

license: mit
base_model:
  - facebook/wav2vec2-large-robust
  - aadel4/Wav2vec_Classroom
pipeline_tag: automatic-speech-recognition
library_name: transformers
language: en
tags:
  - audio
  - automatic-speech-recognition
  - wav2vec2

Model Card: Wav2vec_Classroom_WSP_FT

Model Overview

Model Name: Wav2vec_Classroom_WSP_FT
Version: 1.0
Developed By: Ahmed Adel Attia (University of Maryland & Stanford University)
Date: 2025

Description:
Wav2vec_Classroom_WSP_FT is an automatic speech recognition (ASR) model trained specifically for classroom speech transcription using a weakly supervised pretraining (WSP) approach. The model first undergoes supervised pretraining on weakly transcribed classroom data (NCTE-Weak) and is then fine-tuned using a small amount of human-verified gold-standard data (NCTE-Gold). This methodology allows the model to generalize well despite the scarcity of precisely transcribed classroom speech.

This model is adapted from Wav2vec-Classroom, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves further fine-tuning to leverage weak transcriptions before final refinement on high-quality annotations.

This model was originally trained with the fairseq library and then ported to Hugging Face Transformers.
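Since the checkpoint is published for the Transformers library, inference could look roughly like the sketch below. It rests on several assumptions that the card does not confirm: that the repository id is `aadel4/Wav2vec_Classroom_WSP` (taken from the page title), that the checkpoint has a CTC head and loads with `AutoModelForCTC`, that the repository ships processor and tokenizer files, and that input audio is 16 kHz mono, as is usual for wav2vec 2.0 models. The helper names `validate_sampling_rate` and `transcribe` are illustrative only.

```python
# Minimal inference sketch (assumptions: repo id, CTC head, 16 kHz mono input).
MODEL_ID = "aadel4/Wav2vec_Classroom_WSP"  # assumed from the page title
EXPECTED_SAMPLING_RATE = 16_000            # assumed; typical for wav2vec 2.0


def validate_sampling_rate(sampling_rate: int) -> None:
    """Raise ValueError if the audio is not at the assumed 16 kHz rate."""
    if sampling_rate != EXPECTED_SAMPLING_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLING_RATE} Hz audio, got {sampling_rate} Hz; "
            "resample before transcribing"
        )


def transcribe(waveform, sampling_rate: int, model_id: str = MODEL_ID) -> str:
    """Greedy CTC transcription of a 1-D float waveform (hypothetical helper)."""
    validate_sampling_rate(sampling_rate)

    # Heavy dependencies are imported lazily so the module loads without them.
    import torch
    from transformers import AutoModelForCTC, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCTC.from_pretrained(model_id)
    model.eval()

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: pick the most likely token per frame, then let the
    # tokenizer collapse repeats and remove CTC blank tokens.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe(waveform, 16_000)` with a 1-D float array downloads the checkpoint on first use. Audio recorded at another rate would need to be resampled first.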

Use Case:

Speech-to-text transcription for classroom environments.
Educational research and analysis of classroom discourse.
Low-resource ASR applications where gold-standard labels are limited.

Model Details

Architecture: Wav2vec2.0-based model fine-tuned with Fairseq

Training Data:

NCTE-Weak: 5000 hours of weak transcriptions from the NCTE dataset.
NCTE-Gold: 13 hours of manually transcribed classroom recordings.

Training Strategy:

Weakly Supervised Pretraining (WSP): The model is first trained using NCTE-Weak transcripts, which contain alignment errors and omissions but provide useful weak supervision.
Precise Fine-tuning: The pretrained model is fine-tuned on NCTE-Gold, ensuring it adapts to high-quality transcriptions.

Evaluation Results

Word Error Rate (WER) comparison on NCTE and MPT test sets:

Training Data	NCTE WER (%)	MPT WER (%)
Baseline (TEDLIUM-trained ASR)	55.82 / 50.56	55.11 / 50.50
NCTE-Weak only	36.23 / 32.30	50.84 / 46.09
NCTE-Gold only	21.12 / 16.47	31.52 / 27.93
Self-training	17.45 / 15.09	27.42 / 26.24
NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold)	16.54 / 13.51	25.07 / 23.70
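For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: word-level edit distance divided by the number of reference words, expressed as a percentage. It is not the evaluation script used for the table above. Text normalization is ignored, and the function names `word_error_rate` and `relative_reduction` are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, given two WER values."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# One substitution and one deletion against a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

# First NCTE value of the "NCTE-Gold only" row versus the NCTE-WSP-ASR row.
print(relative_reduction(21.12, 16.54))
```

The second call compares only the first value in each NCTE cell, since the card does not say what the two values per cell represent.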

Limitations

The model relies on weak supervision, and transcription quality depends on the balance between weak and gold-standard data.
Classroom noise, overlapping speech, and spontaneous interactions may still lead to recognition errors.
The model was trained specifically on elementary math classrooms and may not generalize well to other educational settings without further adaptation.

Usage Request

If you use the NCTE-WSP-ASR model in your research, please acknowledge this work and cite the original paper submitted to Interspeech 2025.

For inquiries or collaborations, don't hesitate to contact me at [email protected] or [email protected]