---
license: mit
pipeline_tag: image-to-video
tags:
- human-animation
---
<h1 align='center'>Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks</h1>

<div align='center'>
    <a href='https://github.com/cuijh26' target='_blank'>Jiahao Cui</a><sup>1</sup>&emsp;
    <a href='https://github.com/crystallee-ai' target='_blank'>Hui Li</a><sup>1</sup>&emsp;
    <a href='https://github.com/subazinga' target='_blank'>Yun Zhan</a><sup>1</sup>&emsp;
    <a href='https://github.com/NinoNeumann' target='_blank'>Hanlin Shang</a><sup>1</sup>&emsp;
    <a href='https://github.com/Kaihui-Cheng' target='_blank'>Kaihui Cheng</a><sup>1</sup>&emsp;
    <a href='https://github.com/mayuqi7777' target='_blank'>Yuqi Ma</a><sup>1</sup>&emsp;
    <a href='https://github.com/AricGamma' target='_blank'>Shan Mu</a><sup>1</sup>&emsp;
</div>
<div align='center'>
    <a href='https://hangz-nju-cuhk.github.io/' target='_blank'>Hang Zhou</a><sup>2</sup>&emsp;
    <a href='https://jingdongwang2017.github.io/' target='_blank'>Jingdong Wang</a><sup>2</sup>&emsp;
    <a href='https://sites.google.com/site/zhusiyucs/home' target='_blank'>Siyu Zhu</a><sup>1✉️</sup>&emsp;
</div>

<div align='center'>
    <sup>1</sup>Fudan University&emsp; <sup>2</sup>Baidu Inc&emsp;
</div>

<br>
<div align='center'>
    <a href='https://github.com/fudan-generative-vision/hallo3'><img src='https://img.shields.io/github/stars/fudan-generative-vision/hallo3?style=social'></a>
</div>
<br>

## 📸 Showcase

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
      <td>
          <video src="https://github.com/user-attachments/assets/3fc44086-bdbf-4a54-bfe3-62cfd9dfb191" width="100%" controls autoplay loop></video>
      </td>
      <td>
          <video src="https://github.com/user-attachments/assets/ad5a87cf-b50e-48d6-af35-774e3b1713e7" width="100%" controls autoplay loop></video>
      </td>
       <td>
          <video src="https://github.com/user-attachments/assets/78c7acc3-4fa2-447e-b77d-3462d411c81c" width="100%" controls autoplay loop></video>
     </td>
  </tr>
  <tr>
      <td>
          <video src="https://github.com/user-attachments/assets/f62f2b6d-9846-40be-a976-56cc7d5a8a5b" width="100%" controls autoplay loop></video>
      </td>
       <td>
          <video src="https://github.com/user-attachments/assets/42b6968e-c68a-4473-b773-406ccf5d90b1" width="100%" controls autoplay loop></video>
     </td>
      <td>
          <video src="https://github.com/user-attachments/assets/015f1d6d-31a8-4454-b51a-5431d3c953c2" width="100%" controls autoplay loop></video>
     </td>
  </tr>
</table>

Visit our [project page](https://fudan-generative-vision.github.io/hallo3/#/) to view more cases.

## ⚙️ Installation

- System requirements: Ubuntu 20.04 / Ubuntu 22.04, CUDA 12.1
- Tested GPUs: H100

Download the code:

```bash
  git clone https://github.com/fudan-generative-vision/hallo3
  cd hallo3
```

Create conda environment:

```bash
  conda create -n hallo python=3.10
  conda activate hallo
```

Install the required packages with `pip`:

```bash
  pip install -r requirements.txt
```

In addition, `ffmpeg` is required:

```bash
  apt-get install ffmpeg
```
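
As an optional sanity check, you can verify that the toolchain is visible inside the activated environment; the version numbers below follow the system requirements above, and exact paths may differ on your machine:

```bash
# Optional sanity check inside the activated `hallo` environment.
python --version              # expect Python 3.10.x
nvidia-smi                    # confirm the GPU and driver are visible
nvcc --version                # expect CUDA 12.1 if the toolkit is on PATH
ffmpeg -version | head -n 1   # confirm ffmpeg is installed
```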

### 📥 Download Pretrained Models

You can easily get all pretrained models required for inference from our [HuggingFace repo](https://huggingface.co/fudan-generative-ai/hallo3).

Use `huggingface-cli` to download the models:

```shell
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models
```

Or you can download them separately from their source repos (see the download sketch after this list):

- [hallo3](https://huggingface.co/fudan-generative-ai/hallo3/tree/main/hallo3): Our checkpoints.
- [CogVideoX](https://github.com/THUDM/CogVideo): CogVideoX-5b-i2v pretrained model, consisting of a transformer and a 3D VAE.
- [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl): text encoder; you can download it from [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder) and [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer).
- [audio_separator](https://huggingface.co/huangjackson/Kim_Vocal_2): Kim_Vocal_2 MDX-Net vocal removal model.
- [wav2vec](https://huggingface.co/facebook/wav2vec2-base-960h): wav2vec2 audio feature extraction model from [Facebook](https://huggingface.co/facebook/wav2vec2-base-960h).
- [insightface](https://github.com/deepinsight/insightface/tree/master/python-package#model-zoo): 2D and 3D Face Analysis placed into `pretrained_models/face_analysis/models/`. (_Thanks to deepinsight_)
- [face landmarker](https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task): Face detection & mesh model from [mediapipe](https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker#models) placed into `pretrained_models/face_analysis/models`.
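
For the Hugging Face hosted models in the list above, a minimal download sketch with `huggingface-cli` looks like the following; the target directories simply mirror the layout shown in the tree below, so adjust them if your layout differs:

```shell
# Sketch only: fetch two of the individually listed models into the
# ./pretrained_models layout shown in the directory tree below.
huggingface-cli download huangjackson/Kim_Vocal_2 \
  --local-dir ./pretrained_models/audio_separator
huggingface-cli download facebook/wav2vec2-base-960h \
  --local-dir ./pretrained_models/wav2vec/wav2vec2-base-960h
```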

Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- cogvideox-5b-i2v-sat/
|   |-- transformer/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- vae/
|       `-- 3d-vae.pt
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- hallo3/
|   |-- 1/
|   |   `-- mp_rank_00_model_states.pt
|   `-- latest
|-- t5-v1_1-xxl/
|   |-- added_tokens.json
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   |-- tokenizer_config.json
|   
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
```
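
As an optional check (a sketch assuming you kept the default `./pretrained_models` location), you can confirm a few key checkpoint files from the tree above are in place before running inference:

```bash
# Optional: confirm key checkpoints from the tree above exist.
ls pretrained_models/hallo3/1/mp_rank_00_model_states.pt \
   pretrained_models/cogvideox-5b-i2v-sat/transformer/1/mp_rank_00_model_states.pt \
   pretrained_models/cogvideox-5b-i2v-sat/vae/3d-vae.pt \
   pretrained_models/audio_separator/Kim_Vocal_2.onnx
```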

### 🛠️ Prepare Inference Data

Hallo3 has a few simple requirements for its inference input data (see the audio-conversion sketch after this list):

1. The reference image must have a 1:1 or 3:2 aspect ratio.
2. The driving audio must be in WAV format.
3. The audio must be in English, since our training datasets contain only English speech.
4. Ensure the vocals in the audio are clear; background music is acceptable.
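
For example, `ffmpeg` (installed above) can convert a driving audio clip to WAV. The mono/16 kHz settings below are assumptions based on the wav2vec2-base-960h audio encoder, not requirements stated elsewhere in this document:

```bash
# Illustrative only: convert an audio clip to mono 16 kHz WAV.
# (wav2vec2-base-960h is trained on 16 kHz speech; adjust if needed.)
ffmpeg -i driving_audio.mp3 -ac 1 -ar 16000 driving_audio.wav
```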

### 🎮 Run Inference

Simply run `scripts/inference_long_batch.sh`:

```bash
bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output
```

Animation results will be saved to `./output`. You can find more inference examples in the [examples folder](https://github.com/fudan-generative-vision/hallo3/tree/main/examples).
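
To run inference on your own data, point the script at your own input list (formatted like the provided `./examples/inference/input.txt`) and an output directory of your choice; the file names here are placeholders:

```bash
# Placeholder paths: reuse the same script with your own input list.
bash scripts/inference_long_batch.sh ./my_inputs.txt ./my_output
```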


## Training

#### Prepare Data for Training
Organize your raw videos into the following directory structure:
```text
dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   `-- 0003.mp4
|-- caption/
|   |-- 0001.txt
|   |-- 0002.txt
|   `-- 0003.txt
```
You can use any `dataset_name`, but ensure the `videos` and `caption` directories are named as shown above.

Next, process the videos with the following command:
```bash
bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
```
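
As a hypothetical example (the dataset name, worker count, and output name below are placeholders, assuming `{rank}` selects which of the `{parallelism}` slices a given run handles), preprocessing could be split across four workers like this:

```bash
# Placeholder values: process slice 0 of 4 for `my_dataset`,
# writing preprocessed metadata under the name `my_dataset_meta`.
bash scripts/data_preprocess.sh my_dataset 4 0 my_dataset_meta
```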

#### Training

Update the data meta path settings in the configuration YAML files, `configs/sft_s1.yaml` and `configs/sft_s2.yaml`:

```yaml
#sft_s1.yaml
train_data: [
    "./data/output_name.json"
]

#sft_s2.yaml
train_data: [
    "./data/output_name.json"
]
```

Start training with the following command:
```bash
# stage1
bash scripts/finetune_multi_gpus_s1.sh

# stage2
bash scripts/finetune_multi_gpus_s2.sh
```

## 📝 Citation

If you find our work useful for your research, please consider citing the paper:

```bibtex
@misc{cui2024hallo3,
	title={Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks}, 
	author={Jiahao Cui and Hui Li and Yun Zhang and Hanlin Shang and Kaihui Cheng and Yuqi Ma and Shan Mu and Hang Zhou and Jingdong Wang and Siyu Zhu},
	year={2024},
	eprint={2412.00733},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
}
```

## ⚠️ Social Risks and Mitigations

The development of portrait image animation technologies driven by audio inputs poses social risks, such as the ethical implications of creating realistic portraits that could be misused for deepfakes. To mitigate these risks, it is crucial to establish ethical guidelines and responsible use practices. Privacy and consent concerns also arise from using individuals' images and voices. Addressing these involves transparent data usage policies, informed consent, and safeguarding privacy rights. By addressing these risks and implementing mitigations, the research aims to ensure the responsible and ethical development of this technology.

## 🤗 Acknowledgements

This model is a fine-tuned derivative of the **CogVideoX-5B I2V** model. CogVideoX-5B is an open-source text-to-video generation model developed by the CogVideoX team. Its original code and model parameters are governed by the [CogVideoX-5B LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).

As a derivative work of CogVideoX-5B, this model must be used, distributed, and modified in compliance with the CogVideoX-5B license terms.

## 👏 Community Contributors

Thank you to all the contributors who have helped to make this project better!

<a href="https://github.com/fudan-generative-vision/hallo3/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=fudan-generative-vision/hallo3" />
</a>