Model Card: Wav2vec_Classroom_FT

Model Overview

Model Name: Wav2vec_Classroom_FT
Version: 1.0
Developed By: Ahmed Adel Attia (University of Maryland and Stanford University)
Date: 2025

Description:
Wav2vec_Classroom_FT is an automatic speech recognition (ASR) model trained for classroom speech transcription by direct fine-tuning on a small set of human-verified, gold-standard transcriptions. Unlike NCTE-WSP-ASR, this model does not leverage weak transcriptions in an intermediate training stage and is trained solely on high-quality annotations.

This model is adapted from Wav2vec-Classroom, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves direct fine-tuning on a limited transcribed dataset.

This model was originally trained with the fairseq library and then ported to Hugging Face.
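
Since the checkpoint is hosted on the Hub, it can in principle be loaded with the transformers library. The following is a minimal inference sketch, assuming the repo (aadel4/Wav2vec_Classroom_FT) ships standard Wav2Vec2 processor and model files; the audio path is a placeholder.

```python
# Minimal inference sketch (an assumption: the repo ships standard
# Wav2Vec2 processor and model files). The audio path is a placeholder.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "aadel4/Wav2vec_Classroom_FT"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# wav2vec 2.0 expects 16 kHz mono audio; downmix and resample if needed.
waveform, sample_rate = torchaudio.load("classroom_clip.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(
    waveform.mean(dim=0).numpy(), sampling_rate=16_000, return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```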

Use Case:

  • Speech-to-text transcription for classroom environments.
  • ASR applications requiring high precision with limited data.
  • Benchmarking ASR performance without weakly supervised pretraining.

Model Details

Architecture: wav2vec 2.0-based model (~315M parameters, F32 weights in safetensors format), fine-tuned with fairseq

Training Data:

  • NCTE-Gold: 13 hours of manually transcribed classroom recordings.

Training Strategy:

  1. Direct Fine-tuning: The model is fine-tuned directly on NCTE-Gold without any pretraining on weak transcripts; a hypothetical sketch of such a fine-tuning setup follows this list.
  2. Evaluation: The model is tested on classroom ASR tasks to compare its performance with WSP-based models.
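
This card does not ship a training script, so the following is only a minimal sketch of direct CTC fine-tuning using the Hugging Face Trainer rather than the original fairseq recipe. The base checkpoint, data directory, metadata layout, and hyperparameters are illustrative placeholders, not the values used for this model.

```python
# Minimal CTC fine-tuning sketch (NOT the original fairseq recipe;
# checkpoint, paths, and hyperparameters are illustrative placeholders).
from dataclasses import dataclass

import torch
from datasets import Audio, load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Hypothetical stand-in for the continued-pretrained Wav2vec-Classroom checkpoint.
base_ckpt = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(base_ckpt)
model = Wav2Vec2ForCTC.from_pretrained(base_ckpt)
model.freeze_feature_encoder()  # common practice with small fine-tuning sets

# Hypothetical layout: audio files plus a metadata.csv with a "text" column.
ds = load_dataset("audiofolder", data_dir="ncte_gold/")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    # Raw waveform -> model inputs; transcript -> CTC label ids.
    batch["input_values"] = processor(
        batch["audio"]["array"], sampling_rate=16_000
    ).input_values[0]
    batch["labels"] = processor(text=batch["text"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds.column_names)

@dataclass
class DataCollatorCTC:
    processor: Wav2Vec2Processor

    def __call__(self, features):
        # Pad audio and labels separately; mask label padding with -100
        # so the CTC loss ignores it.
        audio = [{"input_values": f["input_values"]} for f in features]
        labels = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.pad(audio, padding=True, return_tensors="pt")
        label_batch = self.processor.pad(
            labels=labels, padding=True, return_tensors="pt"
        )
        batch["labels"] = label_batch["input_ids"].masked_fill(
            label_batch["attention_mask"].ne(1), -100
        )
        return batch

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="wav2vec_classroom_ft",
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=30,
        fp16=torch.cuda.is_available(),
    ),
    train_dataset=ds,
    data_collator=DataCollatorCTC(processor),
)
trainer.train()
```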

Evaluation Results

Word Error Rate (WER) comparison on NCTE and MPT test sets:

| Training Data | NCTE WER | MPT WER |
| --- | --- | --- |
| Baseline (TEDLIUM-trained ASR) | 55.82 / 50.56 | 55.11 / 50.50 |
| NCTE-Gold only (NCTE-Baseline-ASR, this model) | 21.12 / 16.47 | 31.52 / 27.93 |
| NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold) | 16.54 / 13.51 | 25.07 / 23.70 |
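
For context, WER counts word-level substitutions, deletions, and insertions against a reference transcript. The snippet below shows a generic WER computation with the jiwer library (an assumption; the paper's exact scoring setup is not described on this card), using placeholder strings rather than real test-set examples.

```python
# Hypothetical WER computation with the jiwer library; the strings are
# placeholders, not examples from the NCTE or MPT test sets.
import jiwer

reference = "the teacher asked the students to open their books"
hypothesis = "the teacher asked the students open their books"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # word errors divided by reference length
```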

Limitations

  • The model is trained on a small dataset (13 hours), which limits its ability to generalize beyond classroom speech.
  • Performance is lower than NCTE-WSP-ASR, which benefits from weak transcripts for pretraining.
  • Background noise, overlapping speech, and speaker variations may still impact transcription quality.

Usage Request

If you use this model (referred to as NCTE-Baseline-ASR in the paper) in your research, please acknowledge this work and cite the original paper submitted to Interspeech 2025.

For inquiries or collaborations, please contact the authors of the original paper.
