OLMoASR

OLMoASR is a series of English automatic speech recognition (ASR) models proposed in the paper "OLMoASR: Open Models and Data for Training Robust Speech Recognition Models" by Huong Ngo et al. from Ai2. Trained on 440K hours of weakly-supervised audio-text pairs collected from the public internet, OLMoASR demonstrates strong robustness and zero-shot capabilities. See the OLMoASR repository for the data processing, training, and evaluation code.

Model Details

OLMoASR uses a Transformer-based encoder-decoder architecture: an audio encoder paired with a language decoder, which together form an audio language model (LM). OLMoASR comes in 5 model sizes, and all checkpoints are trained on English-only data. The table below lists the checkpoints and their parameter counts.

Size	Parameters
tiny	39 M
base	74 M
small	244 M
medium	769 M
large	1.5 B
large-v2	1.5 B
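As a rough illustration of how the table can guide checkpoint choice, the sketch below picks the largest checkpoint that fits a parameter budget. The parameter counts are copied from the table; the helper `largest_checkpoint_within` is hypothetical and not part of the `olmoasr` package. Only "medium" appears in the usage example, so treating the other names as valid `load_model` arguments is an assumption.

```python
from typing import Optional

# Parameter counts in millions, copied from the table above.
CHECKPOINT_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1500,
    "large-v2": 1500,
}


def largest_checkpoint_within(budget_m: float) -> Optional[str]:
    """Return the largest checkpoint with at most `budget_m` million parameters.

    Hypothetical helper for illustration; returns None if nothing fits.
    """
    fitting = [
        (params, name)
        for name, params in CHECKPOINT_PARAMS_M.items()
        if params <= budget_m
    ]
    if not fitting:
        return None
    # Tuples compare by parameter count first, then by name, so a tie
    # between "large" and "large-v2" resolves to "large-v2".
    return max(fitting)[1]
```

For example, a budget of 800 million parameters selects "medium", while a budget below 39 million returns None.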

Training Data

OLMoASR is trained on 440K hours of weakly-supervised data subsampled from OLMoASR-Mix, a filtered version of OLMoASR-Pool. OLMoASR-Mix is a collection of 1M hours of audio-text pairs, curated from the 3M hours of OLMoASR-Pool.
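To make the data funnel concrete, the short sketch below computes what share of each stage survives into the next, using only the hour counts stated above. The constant and function names are illustrative.

```python
# Hour counts as stated in the Training Data section.
POOL_HOURS = 3_000_000   # OLMoASR-Pool
MIX_HOURS = 1_000_000    # OLMoASR-Mix (filtered from the pool)
TRAIN_HOURS = 440_000    # subsample of the mix used for training


def share(part: float, whole: float) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


print(f"Mix / Pool:   {share(MIX_HOURS, POOL_HOURS):.1%}")    # 33.3%
print(f"Train / Mix:  {share(TRAIN_HOURS, MIX_HOURS):.1%}")   # 44.0%
print(f"Train / Pool: {share(TRAIN_HOURS, POOL_HOURS):.1%}")  # 14.7%
```

In other words, roughly one third of the pool passes filtering, and under half of the filtered mix is used for training.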

Usage

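To transcribe an audio file, run:

import olmoasr

# Load the "medium" checkpoint in inference mode.
model = olmoasr.load_model("medium", inference=True)
# Transcribe a local audio file and print the result.
result = model.transcribe("audio.mp3")
print(result)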
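To transcribe several files, the single call above can be wrapped in a loop. The sketch below is a minimal, hypothetical helper (`transcribe_all` is not part of the `olmoasr` package). It relies only on the `model.transcribe(path)` call shown above and makes no assumption about the structure of the returned result.

```python
from typing import Any, Dict, Iterable


def transcribe_all(model: Any, audio_paths: Iterable[str]) -> Dict[str, Any]:
    """Call model.transcribe on each path and collect the results by path.

    `model` is any object exposing a `transcribe(path)` method, such as the
    object returned by olmoasr.load_model in the example above.
    """
    results: Dict[str, Any] = {}
    for path in audio_paths:
        results[path] = model.transcribe(path)
    return results
```

With a model loaded as in the example above, `transcribe_all(model, ["first.mp3", "second.mp3"])` would return a dictionary keyed by file path; the file names here are placeholders.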

Evaluation

For evaluation instructions and code, see the OLMoASR repository.

License

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.

BibTeX entry and citation info
