NeMo
rlangman committed on
Commit
ef634ec
·
verified ·
1 Parent(s): b0e8aa9

Update README.md

  1. README.md +4 -2
README.md CHANGED
@@ -18,7 +18,9 @@ padding: 0;
  | [![Model size](https://img.shields.io/badge/Params-61.8M-lightgrey#model-badge)](#model-architecture)
  | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)
 
- The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis. The model works with full-bandwidth 22.05kHz speech. It might have lower performance with low-bandwidth speech (e.g. 16kHz speech upsampled to 22.05kHz) or with non-speech audio.
+ The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis.
+
+ The model works with full-bandwidth 22.05kHz speech. It might have lower performance with low-bandwidth speech (e.g. 16kHz speech upsampled to 22.05kHz) or with non-speech audio.
 
  | Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
  |:-----------:|:----------:|:----------:|:-----------:|:-------------:|:-----------:|:------------:|
@@ -102,7 +104,7 @@ The NeMo Audio Codec is trained on a total of 28.7k hrs of speech data from 105
 
  ## Performance
 
- We evaluate our codec using several objective audio quality metrics. We evaluate [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perception quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligbility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and SI-SDR [7] for phase reconstruction accuracy. Metrics are reported on the test set for both the MLS English and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
+ We evaluate our codec using several objective audio quality metrics. We evaluate [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perception quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligbility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and [SI-SDR](https://arxiv.org/abs/1811.02508) for phase reconstruction accuracy. Metrics are reported on the test set for both the MLS English and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
 
  | Dataset | ViSQOL |PESQ |ESTOI |Mel Distance |STFT Distance|SI-SDR|
  |:-----------:|:----------:|:----------:|:----------:|:-----------:|:-----------:|:-----------:|