Model Card: Wav2vec-Classroom

Model Overview

Model Name: Wav2vec-Classroom
Version: 1.0
Developed By: Ahmed Adel Attia (University of Maryland & Stanford University)
Date: 2025

Description:
Wav2vec-Classroom is an automatic speech recognition (ASR) model designed for robust performance in classroom environments. The model is adapted from Wav2vec2.0 using Continued Pretraining (CPT) on large-scale unlabeled classroom audio data, followed by fine-tuning on a small set of transcribed classroom recordings. This approach enhances the model’s ability to handle classroom noise, overlapping speech, and diverse microphone setups.

Use Case:

  • Speech-to-text transcription for classroom recordings (see the inference sketch after this list).
  • Automatic feedback generation for educational AI tools.
  • ASR research in low-resource, noisy environments.
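
For the transcription use case, the sketch below shows one way to run the model with the Hugging Face transformers library. It assumes a transformers-compatible export of the checkpoint (training and fine-tuning were done with Fairseq, so a conversion step may be required), and classroom_clip.wav is a placeholder file name.

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "aadel4/Wav2vec_Classroom"  # assumes processor/tokenizer files are available in the repo
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load a classroom recording and resample to the 16 kHz rate Wav2vec2.0 expects.
speech, _ = librosa.load("classroom_clip.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame, then collapse repeats/blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```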

Model Details

Architecture: Wav2vec2.0-based self-supervised model (~317M parameters, FP32 weights), fine-tuned with Fairseq

Training Data:

  • Unlabeled Classroom Audio (NCTE dataset): 5235 hours of classroom recordings used for self-supervised CPT.
  • NCTE-Gold: 5.15 hours of human-verified classroom transcriptions for supervised fine-tuning.

Training Strategy:

  1. Continued Pretraining (CPT): The model is initialized with a pre-trained Wav2vec2.0 checkpoint and further pre-trained on 5235 hours of unlabeled classroom speech data. This step allows the model to learn domain-specific acoustic representations.
  2. Supervised Fine-tuning: The CPT-pretrained model is then fine-tuned using the NCTE-Gold dataset for better alignment with transcriptions.
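
As a concrete illustration of the second stage, the sketch below fine-tunes a CTC head on labeled classroom speech with the Hugging Face transformers API. The original recipe used Fairseq; the stand-in checkpoint, dataset fields, and hyperparameters here are assumptions, not the authors' exact configuration.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Stand-in for the CPT checkpoint produced in step 1 (assumption).
base = "facebook/wav2vec2-large-robust-ft-libri-960h"
processor = Wav2Vec2Processor.from_pretrained(base)
model = Wav2Vec2ForCTC.from_pretrained(base, ctc_loss_reduction="mean")
model.freeze_feature_encoder()  # common practice when only a few hours of labels are available

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(batch):
    # batch["audio"]: list of 16 kHz waveforms; batch["text"]: reference transcripts
    inputs = processor(batch["audio"], sampling_rate=16_000,
                       return_tensors="pt", padding=True)
    with processor.as_target_processor():
        labels = processor(batch["text"], return_tensors="pt", padding=True).input_ids
    # Padding positions are ignored by the CTC loss when set to -100.
    labels = labels.masked_fill(labels == processor.tokenizer.pad_token_id, -100)

    loss = model(input_values=inputs.input_values,
                 attention_mask=inputs.get("attention_mask"),
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```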

Evaluation Results

Word Error Rate (WER, %) on the NCTE and MPT test sets (lower is better):

| Training Data                        | NCTE WER      | MPT WER       |
|--------------------------------------|---------------|---------------|
| Pretraining from Scratch (W2V-SCR)   | 30.25 / 38.59 | 51.39 / 38.59 |
| Wav2vec2.0-LV60K (No CPT)            | 30.39 / 33.56 | 39.11 / 37.82 |
| Wav2vec2.0-Robust (No CPT)           | 27.99 / 31.49 | 35.07 / 36.36 |
| Wav2vec2.0-Robust (CPT)              | 17.71 / 26.50 | 25.04 / 30.97 |
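
The WER figures above can be reproduced from reference/hypothesis transcript pairs; the toy example below uses the jiwer package as an illustrative scorer (the card does not specify the scoring tool actually used).

```python
import jiwer

# Toy reference/hypothesis pairs; real evaluation uses the NCTE and MPT test sets.
references = ["the teacher asked a question", "please open your books to page ten"]
hypotheses = ["the teacher asked the question", "please open your books to page ten"]

print(f"WER: {100 * jiwer.wer(references, hypotheses):.2f}%")
```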

Limitations

  • The model is optimized for classroom speech and may not generalize well to other domains.
  • Background noise, overlapping speech, and speaker variations may still impact performance.
  • The amount of labeled training data remains limited, which may affect ASR accuracy in extreme cases.

Usage Request

If you use the Wav2vec-Classroom model in your research, please acknowledge this work and cite the following paper:

CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments
Ahmed Adel Attia, Dorottya Demszky, Tolulopé Ògúnrẹ̀mí, Jing Liu, Carol Espy-Wilson
ICASSP 2025

@article{attia2024cpt_wav2vec,
  title={CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments},
  author={Ahmed Adel Attia and Dorottya Demszky and Tolulopé Ògúnrẹ̀mí and Jing Liu and Carol Espy-Wilson},
  journal={ICASSP 2025},
  year={2024}
}