---
license: apache-2.0
language:
- en
metrics:
- wer
base_model:
- facebook/wav2vec2-base-960h
tags:
- pytorch
- Transformers
- speech
- audio
---

# Model Description

This model is a fine-tuned version of facebook/wav2vec2-base-960h for automatic speech recognition (ASR). It was trained on the [LibriSpeech dataset](https://paperswithcode.com/dataset/librispeech) and is designed to improve transcription accuracy over the base model.

The fine-tuning process involved:

- Selecting a subset of speakers from the `dev-clean` and `test-clean` datasets.
- Preprocessing audio files and their corresponding transcriptions.
- Training with gradient accumulation, mixed precision (if available), and periodic evaluation.
- Saving the fine-tuned model for inference.

*[GitHub](https://github.com/LucasTramonte/SpeechRecognition)*

*Authors*: Lucas Tramonte, Kiyoshi Araki

# Usage

To transcribe audio files, the model can be used as follows:

```python
from transformers import AutoProcessor, AutoModelForCTC
import torch
import librosa

# Load model and processor
processor = AutoProcessor.from_pretrained("deepl-project/conformer-finetunning")
model = AutoModelForCTC.from_pretrained("deepl-project/conformer-finetunning")

# Load an audio file, resampled to the 16 kHz rate the model expects
file_path = "path/to/audio/file.wav"
speech, sr = librosa.load(file_path, sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)

# Perform inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token for each frame
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print("Transcription:", transcription[0])
```

# References

- [LibriSpeech Dataset](https://paperswithcode.com/dataset/librispeech)
- [Conformer Model Paper](https://paperswithcode.com/paper/conformer-based-target-speaker-automatic)
- [Whisper Model Paper](https://arxiv.org/abs/2212.04356)
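
# Computing WER

The metadata lists `wer` (word error rate) as the evaluation metric. As a minimal sketch of how a transcription could be scored against a reference, the function below computes the word-level edit distance divided by the number of reference words. The name `word_error_rate` is hypothetical, and lowercasing both strings before comparison is an assumption; this is not necessarily the scoring code used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: compare case-insensitively and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Calling `word_error_rate(reference_text, transcription[0])` with the output of the usage snippet gives a per-utterance score. A corpus-level WER is usually computed by summing edit distances and reference lengths over all utterances, not by averaging per-utterance scores.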