Upload folder using huggingface_hub

Files added:
- README.md (+71)
- config.json (+1)
- requirements.txt (+6)
- scaler.joblib (+3)
- svr_model.joblib (+3)

README.md (ADDED)
---
language: multilingual
license: apache-2.0
datasets:
- voxceleb2
libraries:
- speechbrain
tags:
- height-estimation
- speaker-characteristics
- speaker-recognition
- audio-classification
- voice-analysis
---

# Height Estimation Model

This model combines the SpeechBrain ECAPA-TDNN speaker embedding model with an SVR regressor to predict speaker height from audio input. It was trained on VoxCeleb2 and evaluated on the VoxCeleb2 and TIMIT test sets.

## Model Details
- Input: Audio file (converted to 16 kHz, mono)
- Output: Predicted height in centimeters (continuous value)
- Speaker embedding: 192-dimensional ECAPA-TDNN embedding from SpeechBrain
- Regressor: Support Vector Regression (SVR) with hyperparameters tuned via Optuna
- Performance:
  - VoxCeleb2 test set: 6.01 cm mean absolute error (MAE)
  - TIMIT test set: 6.02 cm mean absolute error (MAE)

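The reported numbers are mean absolute error: the average absolute difference, in centimeters, between predicted and true heights. A minimal illustration with made-up values:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted heights (cm)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative values only: errors of 6, 2, and 5 cm average to about 4.33 cm
print(mean_absolute_error([170, 182, 165], [176, 180, 160]))
```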
## Training Data
The model was trained on the VoxCeleb2 dataset:
- Audio preprocessing:
  - Converted to WAV format, single channel, 16 kHz sampling rate, 256 kbps bitrate
  - Applied Silero VAD for voice activity detection, keeping the first voiced segment

## Installation

You can install the package directly from GitHub:

```bash
pip install git+https://github.com/griko/voice-height-regression.git
```

## Usage

```python
from height_regressor import HeightRegressionPipeline

# Load the pipeline
regressor = HeightRegressionPipeline.from_pretrained(
    "griko/height_reg_svr_ecapa_voxceleb"
)

# Single file prediction
result = regressor("path/to/audio.wav")
print(f"Predicted height: {result[0]:.1f} cm")

# Batch prediction
results = regressor(["audio1.wav", "audio2.wav"])
print(f"Predicted heights: {[f'{h:.1f}' for h in results]} cm")
```

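Under the hood, the pipeline standardizes the 192-dimensional ECAPA embedding with the saved scaler and feeds it to the SVR. A minimal sketch of that step (not the package's actual code; the commented paths refer to the `scaler.joblib` and `svr_model.joblib` files in this repository):

```python
import joblib
import numpy as np

def predict_height_cm(embedding: np.ndarray, scaler, svr) -> float:
    """Scale a 192-dimensional speaker embedding and predict height in cm."""
    scaled = scaler.transform(np.asarray(embedding).reshape(1, -1))
    return float(svr.predict(scaled)[0])

# Illustrative: load the artifacts shipped with this repository
# scaler = joblib.load("scaler.joblib")
# svr = joblib.load("svr_model.joblib")
# print(predict_height_cm(embedding, scaler, svr))
```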
## Limitations
- The model was trained on celebrity voices from YouTube interviews
- Performance may vary across audio qualities and recording conditions
- Height predictions are estimates and should not be used for medical or legal purposes

## Citation
If you use this model in your research, please cite:
```bibtex
TBD
```
config.json (ADDED)

{"feature_names": ["0_speechbrain_embedding", "1_speechbrain_embedding", "2_speechbrain_embedding", "3_speechbrain_embedding", "4_speechbrain_embedding", "5_speechbrain_embedding", "6_speechbrain_embedding", "7_speechbrain_embedding", "8_speechbrain_embedding", "9_speechbrain_embedding", "10_speechbrain_embedding", "11_speechbrain_embedding", "12_speechbrain_embedding", "13_speechbrain_embedding", "14_speechbrain_embedding", "15_speechbrain_embedding", "16_speechbrain_embedding", "17_speechbrain_embedding", "18_speechbrain_embedding", "19_speechbrain_embedding", "20_speechbrain_embedding", "21_speechbrain_embedding", "22_speechbrain_embedding", "23_speechbrain_embedding", "24_speechbrain_embedding", "25_speechbrain_embedding", "26_speechbrain_embedding", "27_speechbrain_embedding", "28_speechbrain_embedding", "29_speechbrain_embedding", "30_speechbrain_embedding", "31_speechbrain_embedding", "32_speechbrain_embedding", "33_speechbrain_embedding", "34_speechbrain_embedding", "35_speechbrain_embedding", "36_speechbrain_embedding", "37_speechbrain_embedding", "38_speechbrain_embedding", "39_speechbrain_embedding", "40_speechbrain_embedding", "41_speechbrain_embedding", "42_speechbrain_embedding", "43_speechbrain_embedding", "44_speechbrain_embedding", "45_speechbrain_embedding", "46_speechbrain_embedding", "47_speechbrain_embedding", "48_speechbrain_embedding", "49_speechbrain_embedding", "50_speechbrain_embedding", "51_speechbrain_embedding", "52_speechbrain_embedding", "53_speechbrain_embedding", "54_speechbrain_embedding", "55_speechbrain_embedding", "56_speechbrain_embedding", "57_speechbrain_embedding", "58_speechbrain_embedding", "59_speechbrain_embedding", "60_speechbrain_embedding", "61_speechbrain_embedding", "62_speechbrain_embedding", "63_speechbrain_embedding", "64_speechbrain_embedding", "65_speechbrain_embedding", "66_speechbrain_embedding", "67_speechbrain_embedding", "68_speechbrain_embedding", "69_speechbrain_embedding", "70_speechbrain_embedding", 
"71_speechbrain_embedding", "72_speechbrain_embedding", "73_speechbrain_embedding", "74_speechbrain_embedding", "75_speechbrain_embedding", "76_speechbrain_embedding", "77_speechbrain_embedding", "78_speechbrain_embedding", "79_speechbrain_embedding", "80_speechbrain_embedding", "81_speechbrain_embedding", "82_speechbrain_embedding", "83_speechbrain_embedding", "84_speechbrain_embedding", "85_speechbrain_embedding", "86_speechbrain_embedding", "87_speechbrain_embedding", "88_speechbrain_embedding", "89_speechbrain_embedding", "90_speechbrain_embedding", "91_speechbrain_embedding", "92_speechbrain_embedding", "93_speechbrain_embedding", "94_speechbrain_embedding", "95_speechbrain_embedding", "96_speechbrain_embedding", "97_speechbrain_embedding", "98_speechbrain_embedding", "99_speechbrain_embedding", "100_speechbrain_embedding", "101_speechbrain_embedding", "102_speechbrain_embedding", "103_speechbrain_embedding", "104_speechbrain_embedding", "105_speechbrain_embedding", "106_speechbrain_embedding", "107_speechbrain_embedding", "108_speechbrain_embedding", "109_speechbrain_embedding", "110_speechbrain_embedding", "111_speechbrain_embedding", "112_speechbrain_embedding", "113_speechbrain_embedding", "114_speechbrain_embedding", "115_speechbrain_embedding", "116_speechbrain_embedding", "117_speechbrain_embedding", "118_speechbrain_embedding", "119_speechbrain_embedding", "120_speechbrain_embedding", "121_speechbrain_embedding", "122_speechbrain_embedding", "123_speechbrain_embedding", "124_speechbrain_embedding", "125_speechbrain_embedding", "126_speechbrain_embedding", "127_speechbrain_embedding", "128_speechbrain_embedding", "129_speechbrain_embedding", "130_speechbrain_embedding", "131_speechbrain_embedding", "132_speechbrain_embedding", "133_speechbrain_embedding", "134_speechbrain_embedding", "135_speechbrain_embedding", "136_speechbrain_embedding", "137_speechbrain_embedding", "138_speechbrain_embedding", "139_speechbrain_embedding", 
"140_speechbrain_embedding", "141_speechbrain_embedding", "142_speechbrain_embedding", "143_speechbrain_embedding", "144_speechbrain_embedding", "145_speechbrain_embedding", "146_speechbrain_embedding", "147_speechbrain_embedding", "148_speechbrain_embedding", "149_speechbrain_embedding", "150_speechbrain_embedding", "151_speechbrain_embedding", "152_speechbrain_embedding", "153_speechbrain_embedding", "154_speechbrain_embedding", "155_speechbrain_embedding", "156_speechbrain_embedding", "157_speechbrain_embedding", "158_speechbrain_embedding", "159_speechbrain_embedding", "160_speechbrain_embedding", "161_speechbrain_embedding", "162_speechbrain_embedding", "163_speechbrain_embedding", "164_speechbrain_embedding", "165_speechbrain_embedding", "166_speechbrain_embedding", "167_speechbrain_embedding", "168_speechbrain_embedding", "169_speechbrain_embedding", "170_speechbrain_embedding", "171_speechbrain_embedding", "172_speechbrain_embedding", "173_speechbrain_embedding", "174_speechbrain_embedding", "175_speechbrain_embedding", "176_speechbrain_embedding", "177_speechbrain_embedding", "178_speechbrain_embedding", "179_speechbrain_embedding", "180_speechbrain_embedding", "181_speechbrain_embedding", "182_speechbrain_embedding", "183_speechbrain_embedding", "184_speechbrain_embedding", "185_speechbrain_embedding", "186_speechbrain_embedding", "187_speechbrain_embedding", "188_speechbrain_embedding", "189_speechbrain_embedding", "190_speechbrain_embedding", "191_speechbrain_embedding"]}
requirements.txt (ADDED)

scikit-learn
pandas
soundfile
speechbrain
torch
torchaudio
scaler.joblib (ADDED)

version https://git-lfs.github.com/spec/v1
oid sha256:f2d7b9497213c91acc99e733a887de9a15959b9b7af412638f85ac9e57c11337
size 11559
svr_model.joblib (ADDED)

version https://git-lfs.github.com/spec/v1
oid sha256:8dc5204d15b3682d705ffcd25a81db4e185e161c55e10f63ee07db59636443a6
size 38068815