# Pretrained models
The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
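If you download a checkpoint manually (see the table below), you can verify it against the digests in `mmaudio/utils/download_utils.py` with a plain `hashlib` check. This is only a sketch; the expected digest and the local path are placeholders to fill in from that file and from wherever you saved the download.

```python
# Sketch: verify a manually downloaded checkpoint against an expected MD5 digest.
# The digest below is a placeholder -- copy the real one from mmaudio/utils/download_utils.py.
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "<md5 from download_utils.py>"        # placeholder, not a real checksum
ckpt = Path("weights/mmaudio_large_44k_v2.pth")  # adjust to your download location
assert md5_of(ckpt) == expected, f"MD5 mismatch for {ckpt}"
```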
| Model | Filename | File size |
| -------- | ------- | ------- |
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| Flow prediction network, large 44.1kHz, v2 **(recommended)** | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
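If you would rather fetch a single checkpoint yourself instead of letting the demo script do it, `huggingface_hub` can pull files from the repository linked above. The snippet below is only a sketch: it assumes the files live in the `hkchengrex/MMAudio` repository under a `weights/` folder matching the table, so confirm the actual layout on the repository page and adjust `filename` accordingly.

```python
# Sketch: manual download with huggingface_hub (pip install huggingface_hub).
# The in-repo path "weights/mmaudio_large_44k_v2.pth" is an assumption; check the
# repository file listing and change `filename` if the layout differs.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="hkchengrex/MMAudio",
    filename="weights/mmaudio_large_44k_v2.pth",
    local_dir=".",  # writes to ./weights/mmaudio_large_44k_v2.pth
)
print(local_path)
```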
To run a model you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP is downloaded automatically), a VAE, and a vocoder. The VAE and vocoder are tied to the sampling rate (16kHz or 44.1kHz), not to the model size, and the 44.1kHz vocoder is downloaded automatically.
The `_v2` model scores worse in benchmarking (e.g., in Fréchet distance) but, in my experience, generalizes better to new data.
The expected directory structure (full):
```bash
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
```
The expected directory structure (minimal, for the recommended model only):
```bash
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
```
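As a quick sanity check before running inference with the recommended model, you can confirm that the minimal set of files above is in place. A small sketch, with paths relative to the repository root as in the tree above:

```python
# Sketch: check that the minimal 44.1kHz v2 setup from the tree above is present.
from pathlib import Path

required = [
    Path("weights/mmaudio_large_44k_v2.pth"),
    Path("ext_weights/v1-44.pth"),
    Path("ext_weights/synchformer_state_dict.pth"),
]
missing = [p for p in required if not p.exists()]
if missing:
    raise FileNotFoundError("Missing checkpoints: " + ", ".join(map(str, missing)))
print("All required checkpoints found.")
```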