# Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
Ant Group
## 📌 Updates
* [2025.01.10] 🔥 We release our inference [code](https://github.com/antgroup/ditto-talkinghead) and [models](https://huggingface.co/digital-avatar/ditto-talkinghead).
* [2024.11.29] 🔥 Our [paper](https://arxiv.org/abs/2411.19509) is publicly available on arXiv.
## 🛠️ Installation
Tested environment:
- System: CentOS 7.2
- GPU: NVIDIA A100
- Python: 3.10
- TensorRT: 8.6.1
Clone the code from [GitHub](https://github.com/antgroup/ditto-talkinghead):
```bash
git clone https://github.com/antgroup/ditto-talkinghead
cd ditto-talkinghead
```
Create `conda` environment:
```bash
conda env create -f environment.yaml
conda activate ditto
```
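As an optional sanity check (not part of the official instructions), you can confirm that the activated environment matches the tested versions above. This is a minimal sketch, assuming the `ditto` environment provides the `tensorrt` Python bindings:
```bash
# Optional sanity check inside the activated `ditto` environment.
# Assumes the environment ships the `tensorrt` Python bindings (tested with 8.6.1).
python --version                      # expect Python 3.10.x
python -c "import tensorrt; print('TensorRT', tensorrt.__version__)"
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```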
## 📥 Download Checkpoints
Download the checkpoints from [HuggingFace](https://huggingface.co/digital-avatar/ditto-talkinghead) and put them in the `checkpoints` directory:
```bash
git lfs install
git clone https://huggingface.co/digital-avatar/ditto-talkinghead checkpoints
```
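If `git lfs` is unavailable, the same repository can also be fetched with the Hugging Face CLI. This is an optional alternative and assumes a recent `huggingface_hub` release that ships the `huggingface-cli download` command:
```bash
# Optional alternative to git-lfs (assumes a recent huggingface_hub release)
pip install -U "huggingface_hub[cli]"
huggingface-cli download digital-avatar/ditto-talkinghead --local-dir checkpoints
```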
The `checkpoints` directory should look like this:
```text
./checkpoints/
├── ditto_cfg
│   ├── v0.4_hubert_cfg_trt.pkl
│   └── v0.4_hubert_cfg_trt_online.pkl
├── ditto_onnx
│   ├── appearance_extractor.onnx
│   ├── blaze_face.onnx
│   ├── decoder.onnx
│   ├── face_mesh.onnx
│   ├── hubert.onnx
│   ├── insightface_det.onnx
│   ├── landmark106.onnx
│   ├── landmark203.onnx
│   ├── libgrid_sample_3d_plugin.so
│   ├── lmdm_v0.4_hubert.onnx
│   ├── motion_extractor.onnx
│   ├── stitch_network.onnx
│   └── warp_network.onnx
└── ditto_trt_Ampere_Plus
    ├── appearance_extractor_fp16.engine
    ├── blaze_face_fp16.engine
    ├── decoder_fp16.engine
    ├── face_mesh_fp16.engine
    ├── hubert_fp32.engine
    ├── insightface_det_fp16.engine
    ├── landmark106_fp16.engine
    ├── landmark203_fp16.engine
    ├── lmdm_v0.4_hubert_fp32.engine
    ├── motion_extractor_fp32.engine
    ├── stitch_network_fp16.engine
    └── warp_network_fp16.engine
```
- `ditto_cfg/v0.4_hubert_cfg_trt_online.pkl` is the online config.
- `ditto_cfg/v0.4_hubert_cfg_trt.pkl` is the offline config.
## 🚀 Inference
Run `inference.py`:
```shell
python inference.py \
    --data_root "<dir of TensorRT models>" \
    --cfg_pkl "<path to config pkl>" \
    --audio_path "<path to driving audio>" \
    --source_path "<path to source image>" \
    --output_path "<path to output mp4>"
```
For example:
```shell
python inference.py \
    --data_root "./checkpoints/ditto_trt_Ampere_Plus" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_trt.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result.mp4"
```
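The example above uses the offline config. To try the online config listed under `ditto_cfg`, a reasonable sketch is to swap only `--cfg_pkl` and keep the other arguments unchanged (assumption: the online config is selected the same way as the offline one):
```shell
# Sketch: same command with the online config (only --cfg_pkl changes)
python inference.py \
    --data_root "./checkpoints/ditto_trt_Ampere_Plus" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_trt_online.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result_online.mp4"
```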
❗Note:
We provide TensorRT engines built with `hardware-compatibility-level=Ampere_Plus` (`checkpoints/ditto_trt_Ampere_Plus/`). If your GPU does not support them, run the `cvt_onnx_to_trt.py` script to convert the general ONNX models (`checkpoints/ditto_onnx/`) into TensorRT engines for your hardware:
```bash
python script/cvt_onnx_to_trt.py --onnx_dir "./checkpoints/ditto_onnx" --trt_dir "./checkpoints/ditto_trt_custom"
```
Then run `inference.py` with `--data_root=./checkpoints/ditto_trt_custom`.
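To check whether the prebuilt engines apply to your GPU, you can query its compute capability; the TensorRT `Ampere_Plus` compatibility level targets Ampere (compute capability 8.0) and newer GPUs. This is a sketch that assumes a reasonably recent NVIDIA driver supporting the `compute_cap` query field:
```bash
# Compute capability 8.0+ (Ampere or newer) can use the prebuilt Ampere_Plus engines;
# older GPUs need the ONNX -> TensorRT conversion above.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```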
## 📧 Acknowledgement
Our implementation is based on [S2G-MDDiffusion](https://github.com/thuhcsi/S2G-MDDiffusion) and [LivePortrait](https://github.com/KwaiVGI/LivePortrait). Thanks for their remarkable contributions and released code! If we have missed any open-source projects or related articles, we will gladly add them to the acknowledgements.
## ⚖️ License
This repository is released under the Apache-2.0 license as found in the [LICENSE](LICENSE) file.
## 📚 Citation
If you find this codebase useful for your research, please cite our paper using the following BibTeX entry.
```BibTeX
@article{li2024ditto,
  title={Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis},
  author={Li, Tianqi and Zheng, Ruobing and Yang, Minghui and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2411.19509},
  year={2024}
}
```