---
license: apache-2.0
tags:
- non-verbal-vocalization
- audio-classification
- baby-crying
model-index:
- name: voc2vec
  results: []
language:
- en
pipeline_tag: audio-classification
library_name: transformers
---
# voc2vec
voc2vec is a foundation model specifically designed for non-verbal human vocalizations.
We pre-trained a [Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)-like model on a collection of 10 datasets covering around 125 hours of non-verbal audio.
## Model description
voc2vec is built upon the wav2vec 2.0 framework and follows its pre-training setup.
The pre-training datasets are: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, and VocalSound.
## Task and datasets description
We evaluate voc2vec on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, and VIVAE.
## Available Models
| Model | Description | Link |
|--------|-------------|------|
| **voc2vec** | Pre-trained model on **125 hours of non-verbal audio**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec) |
| **voc2vec-as-pt** | Continues pre-training from a model that was **initially trained on the AudioSet dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-as-pt) |
| **voc2vec-ls-pt** | Continues pre-training from a model that was **initially trained on the LibriSpeech dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-ls-pt) |
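
The three checkpoints live under the same Hub namespace, so switching between them only means changing the repository ID. A minimal sketch of that lookup follows; the `CHECKPOINTS` mapping and the `hub_id` helper are hypothetical names introduced here for illustration, while the IDs themselves are taken from the links in the table.

```python
# Hypothetical helper: variant names mirror the table, IDs come from its links
CHECKPOINTS = {
    "voc2vec": "alkiskoudounas/voc2vec",
    "voc2vec-as-pt": "alkiskoudounas/voc2vec-as-pt",
    "voc2vec-ls-pt": "alkiskoudounas/voc2vec-ls-pt",
}


def hub_id(variant="voc2vec"):
    """Return the Hub repository ID for a model variant, or raise ValueError."""
    try:
        return CHECKPOINTS[variant]
    except KeyError:
        raise ValueError(
            f"Unknown variant {variant!r}; choose from {sorted(CHECKPOINTS)}"
        ) from None
```

The returned string can be passed to `from_pretrained` in place of a hard-coded ID.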
## Usage examples
You can use the model directly as follows:
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
# Load an audio file, resampled to 16 kHz
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)
# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec")
# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")
# Compute logits (gradients are not needed at inference time)
with torch.no_grad():
    logits = model(**inputs).logits
```
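
The logits returned above are unnormalized class scores. One way to turn them into ranked predictions is a softmax followed by a top-k selection, sketched below in plain Python. The `top_k_labels` helper is a hypothetical name introduced here, and the input values are toy numbers rather than real model output.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs.

    logits: list of floats, one score per class (a single example).
    id2label: dict mapping class index to a label name.
    """
    # Numerically stable softmax: subtract the max before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy values for illustration only (not real model output)
toy_labels = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}
example = top_k_labels([2.0, 0.5, -1.0], toy_labels, k=2)
```

With the snippet above, this could be called as `top_k_labels(logits[0].tolist(), model.config.id2label)`. The resulting labels are only meaningful if the checkpoint's classification head has been fine-tuned on a downstream task, which is an assumption here and not something this model card states.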
## BibTeX entry and citation info
```bibtex
@INPROCEEDINGS{koudounas2025icassp,
author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
year={2025},
volume={},
number={},
pages={},
keywords={},
doi={}}
```