	---
	language: multilingual
	license: apache-2.0
	datasets:
	- voxceleb2
	libraries:
	- speechbrain
	tags:
	- age-estimation
	- speaker-characteristics
	- speaker-recognition
	- audio-regression
	- voice-analysis
	---

	# Age Estimation Model

	This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with a Support Vector Regression (SVR) model to predict speaker age from audio input. The model was trained on the VoxCeleb2 dataset.

	## Model Performance Comparison

	Several pre-trained models are available, with different architectures and feature sets. The table below compares their test performance:

	| Model | Architecture | Features | Training Data | Test MAE | Best For |
	|-------|--------------|----------|---------------|----------|----------|
	| VoxCeleb2 SVR (223) | SVR | ECAPA + Librosa (223-dim) | VoxCeleb2 | 7.88 years | Best performance on VoxCeleb2 |
	| VoxCeleb2 SVR (192) | SVR | ECAPA only (192-dim) | VoxCeleb2 | 7.89 years | Lightweight deployment |
	| TIMIT ANN (192) | ANN | ECAPA only (192-dim) | TIMIT | 4.95 years | Clean studio recordings |
	| Combined ANN (223) | ANN | ECAPA + Librosa (223-dim) | VoxCeleb2 + TIMIT | 6.93 years | Best general performance |

	You may find other models [here](https://huggingface.co/griko).

	## Model Details
	- Input: Audio file (will be converted to 16 kHz mono)
	- Output: Predicted age in years (continuous value)
	- Features: SpeechBrain ECAPA-TDNN embedding [192 features]
	- Regressor: Support Vector Regression optimized through Optuna
	- Performance:
	  - VoxCeleb2 test set: 7.89 years Mean Absolute Error (MAE)
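
For reference, MAE is the average absolute difference between predicted and true ages. A minimal sketch in plain Python follows; the ages in it are made up purely to show the calculation and do not come from the model or the dataset.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average of |true - predicted| over all samples, in years."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must have the same length")
    if not true_ages:
        raise ValueError("inputs must not be empty")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical values, chosen only to illustrate the arithmetic
true_ages = [30.0, 45.0, 60.0]
predicted_ages = [35.0, 41.0, 66.0]
print(mean_absolute_error(true_ages, predicted_ages))  # (5 + 4 + 6) / 3 = 5.0
```

An MAE of 7.89 years therefore means that, on the VoxCeleb2 test set, predictions were off by about eight years on average.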

	## Features
	1. SpeechBrain ECAPA-TDNN embeddings (192 dimensions)

	## Training Data
	The model was trained on the VoxCeleb2 dataset:
	- Audio preprocessing:
	  - Converted to WAV format, single channel, 16 kHz sampling rate
	  - Applied SileroVAD for voice activity detection, taking the first voiced segment
	- Age data was collected from Wikidata and public sources
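
The preprocessing steps above can be sketched roughly as follows. This is an illustration, not the author's training code. It assumes the audio has already been decoded into per-channel lists of float samples, and that speech timestamps come in the form Silero VAD's `get_speech_timestamps` returns (a list of dicts with `start` and `end` sample indices). The helper names `to_mono` and `first_voiced_segment` are hypothetical. Resampling to 16 kHz is left out here; an audio library would normally handle it.

```python
def to_mono(channels):
    """Average equal-length channels sample by sample into one channel."""
    if not channels:
        raise ValueError("need at least one channel")
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def first_voiced_segment(samples, speech_timestamps):
    """Return the samples of the earliest voiced segment, or [] if none."""
    if not speech_timestamps:
        return []
    first = min(speech_timestamps, key=lambda ts: ts["start"])
    return samples[first["start"]:first["end"]]


# Tiny stereo signal and hand-written timestamps, for illustration only
stereo = [[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
          [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
mono = to_mono(stereo)
timestamps = [{"start": 1, "end": 3}, {"start": 4, "end": 6}]
segment = first_voiced_segment(mono, timestamps)
print(segment)
```

Only the first voiced segment is kept, so any later speech in the same recording does not contribute to the embedding.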

	## Installation

	```bash
	pip install "git+https://github.com/griko/voice-age-regression.git#egg=voice-age-regressor[svr-ecapa-voxceleb2]"
	```

	## Usage

	```python
	from age_regressor import AgeRegressionPipeline

	# Load the pipeline
	regressor = AgeRegressionPipeline.from_pretrained(
	    "griko/age_reg_svr_ecapa_voxceleb2"
	)

	# Single file prediction
	result = regressor("path/to/audio.wav")
	print(f"Predicted age: {result[0]:.1f} years")

	# Batch prediction
	results = regressor(["audio1.wav", "audio2.wav"])
	print(f"Predicted ages: {[f'{age:.1f}' for age in results]} years")
	```

	## Limitations
	- The model was trained on celebrity voices from YouTube interview recordings
	- Performance may vary on different audio qualities or recording conditions
	- Age predictions are estimates and should not be used for medical or legal purposes
	- Age estimations should be treated as approximate values, not exact measurements
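
One way to keep downstream code from treating a prediction as exact is to report it together with the model's typical error. The sketch below uses the 7.89-year VoxCeleb2 test MAE stated in this card as a rough error band. The function name `describe_age` is hypothetical, and the band is only an indication of average error, not a confidence interval.

```python
REPORTED_MAE_YEARS = 7.89  # VoxCeleb2 test MAE from this model card


def describe_age(predicted_age, mae=REPORTED_MAE_YEARS):
    """Format a prediction as an approximate range of +/- one MAE."""
    low = max(0.0, predicted_age - mae)  # ages cannot be negative
    high = predicted_age + mae
    return f"about {predicted_age:.0f} years (roughly {low:.0f} to {high:.0f})"


print(describe_age(42.3))
```

Individual errors can be considerably larger than the MAE, especially on audio that differs from the training recordings.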

	## Citation
	If you use this model in your research, please cite:
	```bibtex
	@misc{koushnir2025vanpyvoiceanalysisframework,
	  title={VANPY: Voice Analysis Framework},
	  author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
	  year={2025},
	  eprint={2502.17579},
	  archivePrefix={arXiv},
	  primaryClass={cs.SD},
	  url={https://arxiv.org/abs/2502.17579},
	}
	```