---

language: multilingual
license: apache-2.0
datasets:
- voxceleb2
libraries:
- speechbrain
tags:
- age-estimation
- speaker-characteristics
- speaker-recognition
- audio-regression
- voice-analysis
---


# Age Estimation Model

This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with an SVR regressor to predict speaker age from audio input. The model was trained on the VoxCeleb2 dataset.

## Model Performance Comparison

Several pre-trained models are available, differing in architecture, feature set, and training data. This card describes the VoxCeleb2 SVR (192) variant. The table below compares their mean absolute error (MAE):

| Model | Architecture | Features | Training Data | Test MAE | Best For |
|-------|-------------|----------|---------------|-----------|----------|
| VoxCeleb2 SVR (223) | SVR | ECAPA + Librosa (223-dim) | VoxCeleb2 | 7.88 years | Best performance on VoxCeleb2 |
| VoxCeleb2 SVR (192) | SVR | ECAPA only (192-dim) | VoxCeleb2 | 7.89 years | Lightweight deployment |
| TIMIT ANN (192) | ANN | ECAPA only (192-dim) | TIMIT | 4.95 years | Clean studio recordings |
| Combined ANN (223) | ANN | ECAPA + Librosa (223-dim) | VoxCeleb2 + TIMIT | 6.93 years | Best general performance |

Other models are available on the [griko Hugging Face profile](https://huggingface.co/griko).

## Model Details
- Input: Audio file (converted to 16 kHz mono)
- Output: Predicted age in years (continuous value)
- Features: SpeechBrain ECAPA-TDNN embedding (192 dimensions)
- Regressor: Support Vector Regression (SVR) with hyperparameters tuned using Optuna
- Performance:
  - VoxCeleb2 test set: 7.89 years Mean Absolute Error (MAE)

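For readers unfamiliar with the metric, the MAE reported above is the average absolute difference between true and predicted ages. A minimal sketch follows; the function name and the sample ages are illustrative assumptions, not values from the VoxCeleb2 evaluation.

```python
from typing import Sequence


def mean_absolute_error(
    true_ages: Sequence[float], predicted_ages: Sequence[float]
) -> float:
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("true_ages and predicted_ages must have the same length")
    if len(true_ages) == 0:
        raise ValueError("at least one sample is required")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical ages, for illustration only (not taken from VoxCeleb2).
true_ages = [30.0, 45.0, 62.0]
predicted_ages = [36.0, 41.0, 50.0]

# Absolute errors are 6, 4 and 12 years, so the MAE is 22 / 3 years.
example_mae = mean_absolute_error(true_ages, predicted_ages)
print(f"MAE: {example_mae:.2f} years")
```
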
## Features
1. SpeechBrain ECAPA-TDNN embeddings (192 dimensions)

## Training Data
The model was trained on the VoxCeleb2 dataset:
- Audio preprocessing:
  - Converted to WAV format, single channel, 16 kHz sampling rate
  - Applied SileroVAD for voice activity detection, taking the first voiced segment
- Age data was collected from Wikidata and public sources
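
Two of the preprocessing steps above, downmixing to a single channel and keeping the first voiced segment, can be sketched in plain Python. This is an illustration, not the code used to build the training set. It assumes the VAD step yields a list of dicts with `start` and `end` sample indices, and both helper names are hypothetical. Resampling to 16 kHz and the VAD itself are not shown.

```python
from typing import Dict, List, Sequence


def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels, sample by sample, into one mono channel."""
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same length")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def first_voiced_segment(
    samples: Sequence[float], timestamps: Sequence[Dict[str, int]]
) -> List[float]:
    """Return the samples of the earliest voiced segment, or [] if none was found.

    Assumes each timestamp is {"start": int, "end": int} in sample indices.
    """
    if not timestamps:
        return []
    first = min(timestamps, key=lambda segment: segment["start"])
    return list(samples[first["start"]:first["end"]])


# Toy stereo signal and hand-written timestamps, for illustration only.
stereo = [[0.0, 0.5, 1.0, 0.5], [0.0, 0.0, 0.5, 0.5]]
mono = downmix_to_mono(stereo)
voiced = first_voiced_segment(mono, [{"start": 1, "end": 3}, {"start": 3, "end": 4}])
```

In a real pipeline the timestamps would come from the VAD model rather than being written by hand.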

## Installation

```bash
pip install git+https://github.com/griko/voice-age-regression.git#egg=voice-age-regressor[svr-ecapa-voxceleb2]
```

## Usage

```python
from age_regressor import AgeRegressionPipeline

# Load the pipeline
regressor = AgeRegressionPipeline.from_pretrained(
    "griko/age_reg_svr_ecapa_voxceleb2"
)

# Single file prediction
result = regressor("path/to/audio.wav")
print(f"Predicted age: {result[0]:.1f} years")

# Batch prediction
results = regressor(["audio1.wav", "audio2.wav"])
print(f"Predicted ages: {[f'{age:.1f}' for age in results]} years")
```
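
For larger collections it can help to call the pipeline in fixed-size chunks and keep each prediction paired with its file. The sketch below assumes, as the batch example above suggests, that the pipeline accepts a list of paths and returns one age per path in the same order. The helper name `predict_in_batches` and its `batch_size` parameter are hypothetical and not part of the package.

```python
from typing import Callable, Dict, List, Sequence


def predict_in_batches(
    regressor: Callable[[List[str]], Sequence[float]],
    paths: Sequence[str],
    batch_size: int = 16,
) -> Dict[str, float]:
    """Map each audio path to its predicted age, calling the regressor in chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ages_by_path: Dict[str, float] = {}
    for start in range(0, len(paths), batch_size):
        batch = [str(path) for path in paths[start:start + batch_size]]
        ages = regressor(batch)
        # Guard against a silent misalignment between inputs and outputs.
        if len(ages) != len(batch):
            raise ValueError("regressor returned a different number of ages than paths")
        for path, age in zip(batch, ages):
            ages_by_path[path] = float(age)
    return ages_by_path
```

With the pipeline loaded as above, `predict_in_batches(regressor, ["audio1.wav", "audio2.wav"])` would return a dictionary keyed by path.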

## Limitations
- The model was trained on celebrity voices from YouTube interview recordings
- Performance may vary on different audio qualities or recording conditions
- Age predictions are estimates and should not be used for medical or legal purposes
- Age estimations should be treated as approximate values, not exact measurements
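
One way to present predictions as approximate is to report them with a band of plus or minus the test MAE. The sketch below uses the 7.89-year VoxCeleb2 MAE reported in this card as the default. The function name is hypothetical, and the band is only a rough guide: MAE is an average, so individual errors can exceed it.

```python
def describe_age_estimate(predicted_age: float, mae: float = 7.89) -> str:
    """Format a predicted age together with a +/- MAE band, clipped at zero."""
    low = max(0.0, predicted_age - mae)
    high = predicted_age + mae
    return f"about {predicted_age:.0f} years (roughly {low:.0f} to {high:.0f})"


# Hypothetical prediction, for illustration only.
print(describe_age_estimate(34.2))
```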

## Citation
If you use this model in your research, please cite:
```bibtex
@misc{koushnir2025vanpyvoiceanalysisframework,
      title={VANPY: Voice Analysis Framework},
      author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
      year={2025},
      eprint={2502.17579},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2502.17579},
}
```