# Pretrained models
The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
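If you download a checkpoint manually (see the table below), you can verify it against the digests in `mmaudio/utils/download_utils.py` with a plain `hashlib` check. This is only a sketch; the expected digest and the local path are placeholders to fill in from that file and from wherever you saved the download.

```python
# Sketch: verify a manually downloaded checkpoint against an expected MD5 digest.
# The digest below is a placeholder -- copy the real one from mmaudio/utils/download_utils.py.
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "<md5 from download_utils.py>"        # placeholder, not a real checksum
ckpt = Path("weights/mmaudio_large_44k_v2.pth")  # adjust to your download location
assert md5_of(ckpt) == expected, f"MD5 mismatch for {ckpt}"
```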
| Model | Filename | File size |
| -------- | ------- | ------- |
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| Flow prediction network, large 44.1kHz, v2 **(recommended)** | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
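If you would rather fetch a single checkpoint yourself instead of letting the demo script do it, `huggingface_hub` can pull files from the repository linked above. The snippet below is only a sketch: it assumes the files live in the `hkchengrex/MMAudio` repository under a `weights/` folder matching the table, so confirm the actual layout on the repository page and adjust `filename` accordingly.

```python
# Sketch: manual download with huggingface_hub (pip install huggingface_hub).
# The in-repo path "weights/mmaudio_large_44k_v2.pth" is an assumption; check the
# repository file listing and change `filename` if the layout differs.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="hkchengrex/MMAudio",
    filename="weights/mmaudio_large_44k_v2.pth",
    local_dir=".",  # writes to ./weights/mmaudio_large_44k_v2.pth
)
print(local_path)
```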
To run a model you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP is downloaded automatically), a VAE, and a vocoder. The VAE and vocoder are tied to the sampling rate (16kHz or 44.1kHz), not to the model size, and the 44.1kHz vocoder is downloaded automatically.
The `_v2` model scores worse in benchmarking (e.g., in Fréchet distance) but, in my experience, generalizes better to new data.
The expected directory structure (full):
```bash
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
```
The expected directory structure (minimal, for the recommended model only):
```bash
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
```
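As a quick sanity check before running inference with the recommended model, you can confirm that the minimal set of files above is in place. A small sketch, with paths relative to the repository root as in the tree above:

```python
# Sketch: check that the minimal 44.1kHz v2 setup from the tree above is present.
from pathlib import Path

required = [
    Path("weights/mmaudio_large_44k_v2.pth"),
    Path("ext_weights/v1-44.pth"),
    Path("ext_weights/synchformer_state_dict.pth"),
]
missing = [p for p in required if not p.exists()]
if missing:
    raise FileNotFoundError("Missing checkpoints: " + ", ".join(map(str, missing)))
print("All required checkpoints found.")
```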