multimodalart (HF staff) committed
Commit 38e20ed · verified · 1 Parent(s): e3cc724

Upload 83 files

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. .gitattributes +35 -0
  2. LICENSE.txt +38 -0
  3. README.md +194 -14
  4. assets/.DS_Store +0 -0
  5. assets/demo.gif +3 -0
  6. assets/driving_audio/1.wav +3 -0
  7. assets/driving_audio/2.wav +3 -0
  8. assets/driving_audio/3.wav +3 -0
  9. assets/driving_audio/4.wav +3 -0
  10. assets/driving_audio/5.wav +3 -0
  11. assets/driving_audio/6.wav +3 -0
  12. assets/driving_video/.DS_Store +0 -0
  13. assets/driving_video/1.mp4 +3 -0
  14. assets/driving_video/2.mp4 +3 -0
  15. assets/driving_video/3.mp4 +3 -0
  16. assets/driving_video/4.mp4 +3 -0
  17. assets/driving_video/5.mp4 +3 -0
  18. assets/driving_video/6.mp4 +3 -0
  19. assets/driving_video/7.mp4 +3 -0
  20. assets/driving_video/8.mp4 +3 -0
  21. assets/logo.png +0 -0
  22. assets/ref_images/1.png +3 -0
  23. assets/ref_images/10.png +3 -0
  24. assets/ref_images/11.png +3 -0
  25. assets/ref_images/12.png +3 -0
  26. assets/ref_images/13.png +3 -0
  27. assets/ref_images/14.png +3 -0
  28. assets/ref_images/15.png +3 -0
  29. assets/ref_images/16.png +3 -0
  30. assets/ref_images/17.png +3 -0
  31. assets/ref_images/18.png +3 -0
  32. assets/ref_images/19.png +3 -0
  33. assets/ref_images/2.png +3 -0
  34. assets/ref_images/20.png +0 -0
  35. assets/ref_images/3.png +3 -0
  36. assets/ref_images/4.png +3 -0
  37. assets/ref_images/5.png +3 -0
  38. assets/ref_images/6.png +3 -0
  39. assets/ref_images/7.png +3 -0
  40. assets/ref_images/8.png +3 -0
  41. diffposetalk/common.py +46 -0
  42. diffposetalk/diff_talking_head.py +536 -0
  43. diffposetalk/diffposetalk.py +228 -0
  44. diffposetalk/hubert.py +51 -0
  45. diffposetalk/utils/__init__.py +1 -0
  46. diffposetalk/utils/common.py +378 -0
  47. diffposetalk/utils/media.py +35 -0
  48. diffposetalk/utils/renderer.py +147 -0
  49. diffposetalk/utils/rotation_conversions.py +569 -0
  50. diffposetalk/wav2vec2.py +119 -0
.gitattributes CHANGED
@@ -33,3 +33,38 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/demo.gif filter=lfs diff=lfs merge=lfs -text
37
+ assets/driving_audio/1.wav filter=lfs diff=lfs merge=lfs -text
38
+ assets/driving_audio/2.wav filter=lfs diff=lfs merge=lfs -text
39
+ assets/driving_audio/3.wav filter=lfs diff=lfs merge=lfs -text
40
+ assets/driving_audio/4.wav filter=lfs diff=lfs merge=lfs -text
41
+ assets/driving_audio/5.wav filter=lfs diff=lfs merge=lfs -text
42
+ assets/driving_audio/6.wav filter=lfs diff=lfs merge=lfs -text
43
+ assets/driving_video/1.mp4 filter=lfs diff=lfs merge=lfs -text
44
+ assets/driving_video/2.mp4 filter=lfs diff=lfs merge=lfs -text
45
+ assets/driving_video/3.mp4 filter=lfs diff=lfs merge=lfs -text
46
+ assets/driving_video/4.mp4 filter=lfs diff=lfs merge=lfs -text
47
+ assets/driving_video/5.mp4 filter=lfs diff=lfs merge=lfs -text
48
+ assets/driving_video/6.mp4 filter=lfs diff=lfs merge=lfs -text
49
+ assets/driving_video/7.mp4 filter=lfs diff=lfs merge=lfs -text
50
+ assets/driving_video/8.mp4 filter=lfs diff=lfs merge=lfs -text
51
+ assets/ref_images/1.png filter=lfs diff=lfs merge=lfs -text
52
+ assets/ref_images/10.png filter=lfs diff=lfs merge=lfs -text
53
+ assets/ref_images/11.png filter=lfs diff=lfs merge=lfs -text
54
+ assets/ref_images/12.png filter=lfs diff=lfs merge=lfs -text
55
+ assets/ref_images/13.png filter=lfs diff=lfs merge=lfs -text
56
+ assets/ref_images/14.png filter=lfs diff=lfs merge=lfs -text
57
+ assets/ref_images/15.png filter=lfs diff=lfs merge=lfs -text
58
+ assets/ref_images/16.png filter=lfs diff=lfs merge=lfs -text
59
+ assets/ref_images/17.png filter=lfs diff=lfs merge=lfs -text
60
+ assets/ref_images/18.png filter=lfs diff=lfs merge=lfs -text
61
+ assets/ref_images/19.png filter=lfs diff=lfs merge=lfs -text
62
+ assets/ref_images/2.png filter=lfs diff=lfs merge=lfs -text
63
+ assets/ref_images/3.png filter=lfs diff=lfs merge=lfs -text
64
+ assets/ref_images/4.png filter=lfs diff=lfs merge=lfs -text
65
+ assets/ref_images/5.png filter=lfs diff=lfs merge=lfs -text
66
+ assets/ref_images/6.png filter=lfs diff=lfs merge=lfs -text
67
+ assets/ref_images/7.png filter=lfs diff=lfs merge=lfs -text
68
+ assets/ref_images/8.png filter=lfs diff=lfs merge=lfs -text
69
+ skyreels_a1/src/media_pipe/mp_models/face_landmarker_v2_with_blendshapes.task filter=lfs diff=lfs merge=lfs -text
70
+ skyreels_a1/src/media_pipe/mp_models/pose_landmarker_heavy.task filter=lfs diff=lfs merge=lfs -text
LICENSE.txt ADDED
@@ -0,0 +1,38 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - zh
5
+ license: other
6
+ tasks:
7
+ - text-generation
8
+
9
+ ---
10
+
11
+ <!-- markdownlint-disable first-line-h1 -->
12
+ <!-- markdownlint-disable html -->
13
+
14
+ # <span id="Terms">声明与协议/Terms and Conditions</span>
15
+
16
+ ## 声明
17
+
18
+ 我们在此声明,不要利用Skywork模型进行任何危害国家社会安全或违法的活动。另外,我们也要求使用者不要将 Skywork 模型用于未经适当安全审查和备案的互联网服务。我们希望所有的使用者都能遵守这个原则,确保科技的发展能在规范和合法的环境下进行。
19
+
20
+ 我们已经尽我们所能,来确保模型训练过程中使用的数据的合规性。然而,尽管我们已经做出了巨大的努力,但由于模型和数据的复杂性,仍有可能存在一些无法预见的问题。因此,如果由于使用skywork开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。
21
+
22
+ We hereby declare that the Skywork model should not be used for any activities that pose a threat to national or societal security or engage in unlawful actions. Additionally, we request users not to deploy the Skywork model for internet services without appropriate security reviews and records. We hope that all users will adhere to this principle to ensure that technological advancements occur in a regulated and lawful environment.
23
+
24
+ We have done our utmost to ensure the compliance of the data used during the model's training process. However, despite our extensive efforts, due to the complexity of the model and data, there may still be unpredictable risks and issues. Therefore, if any problems arise as a result of using the Skywork open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly utilized, we will not assume any responsibility.
25
+
26
+ ## 协议
27
+
28
+ 社区使用Skywork模型需要遵循[《Skywork 模型社区许可协议》](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20模型社区许可协议.pdf)。Skywork模型支持商业用途,如果您计划将Skywork模型或其衍生品用于商业目的,无需再次申请, 但请您仔细阅读[《Skywork 模型社区许可协议》](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20模型社区许可协议.pdf)并严格遵守相关条款。
29
+
30
+
31
+ The community usage of Skywork model requires [Skywork Community License](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20Community%20License.pdf). The Skywork model supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, you must abide by terms and conditions within [Skywork Community License](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20Community%20License.pdf).
32
+
33
+
34
+
35
+ [《Skywork 模型社区许可协议》]: https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20模型社区许可协议.pdf
36
+
37
+
38
README.md CHANGED
@@ -1,14 +1,194 @@
1
- ---
2
- title: Skyreels Talking Head
3
- emoji: 😻
4
- colorFrom: yellow
5
- colorTo: green
6
- sdk: gradio
7
- sdk_version: 5.20.0
8
- app_file: app.py
9
- pinned: false
10
- license: mit
11
- short_description: audio to talking face
12
- ---
13
-
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ <p align="center">
2
+ <img src="assets/logo.png" alt="Skyreels Logo" width="50%">
3
+ </p>
4
+
5
+
6
+ <h1 align="center">SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers</h1>
7
+ <div align='center'>
8
+ <a href='https://scholar.google.com/citations?user=6D_nzucAAAAJ&hl=en' target='_blank'>Di Qiu</a>&emsp;
9
+ <a href='https://scholar.google.com/citations?user=_43YnBcAAAAJ&hl=zh-CN' target='_blank'>Zhengcong Fei</a>&emsp;
10
+ <a href='' target='_blank'>Rui Wang</a>&emsp;
11
+ <a href='' target='_blank'>Jialin Bai</a>&emsp;
12
+ <a href='https://scholar.google.com/citations?user=Hv-vj2sAAAAJ&hl=en' target='_blank'>Changqian Yu</a>&emsp;
13
+ </div>
14
+
15
+ <div align='center'>
16
+ <a href='https://scholar.google.com.au/citations?user=ePIeVuUAAAAJ&hl=en' target='_blank'>Mingyuan Fan</a>&emsp;
17
+ <a href='https://scholar.google.com/citations?user=HukWSw4AAAAJ&hl=en' target='_blank'>Guibin Chen</a>&emsp;
18
+ <a href='https://scholar.google.com.tw/citations?user=RvAuMk0AAAAJ&hl=zh-CN' target='_blank'>Xiang Wen</a>&emsp;
19
+ </div>
20
+
21
+ <div align='center'>
22
+ <small><strong>Skywork AI</strong></small>
23
+ </div>
24
+
25
+ <br>
26
+
27
+ <div align="center">
28
+ <!-- <a href='LICENSE'><img src='https://img.shields.io/badge/license-MIT-yellow'></a> -->
29
+ <a href='https://arxiv.org/abs/2502.10841'><img src='https://img.shields.io/badge/arXiv-SkyReels A1-red'></a>
30
+ <a href='https://skyworkai.github.io/skyreels-a1.github.io/'><img src='https://img.shields.io/badge/Project-SkyReels A1-green'></a>
31
+ <a href='https://huggingface.co/Skywork/SkyReels-A1'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
32
+ <a href='https://www.skyreels.ai/home?utm_campaign=github_A1'><img src='https://img.shields.io/badge/Playground-Spaces-yellow'></a>
33
+ <br>
34
+ </div>
35
+ <br>
36
+
37
+
38
+ <p align="center">
39
+ <img src="./assets/demo.gif" alt="showcase">
40
+ <br>
41
+ 🔥 For more results, visit our <a href="https://skyworkai.github.io/skyreels-a1.github.io/"><strong>homepage</strong></a> 🔥
42
+ </p>
43
+
44
+ <p align="center">
45
+ 👋 Join our <a href="https://discord.gg/PwM6NYtccQ" target="_blank"><strong>Discord</strong></a>
46
+ </p>
47
+
48
+
49
+ This repo, named **SkyReels-A1**, contains the official PyTorch implementation of our paper [SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers](https://arxiv.org/abs/2502.10841).
50
+
51
+
52
+ ## 🔥🔥🔥 News!!
53
+ * Mar 4, 2025: 🔥 We release the audio-driven portrait image animation pipeline.
54
+ * Feb 18, 2025: 👋 We release the inference code and model weights of SkyReels-A1. [Download](https://huggingface.co/Skywork/SkyReels-A1)
55
+ * Feb 18, 2025: 🎉 We have made our technical report available as open source. [Read](https://skyworkai.github.io/skyreels-a1.github.io/report.pdf)
56
+ * Feb 18, 2025: 🔥 Our online demo of LipSync is available on SkyReels now! Try out [LipSync](https://www.skyreels.ai/home/tools/lip-sync?refer=navbar).
57
+ * Feb 18, 2025: 🔥 We have open-sourced I2V video generation model [SkyReels-V1](https://github.com/SkyworkAI/SkyReels-V1). This is the first and most advanced open-source human-centric video foundation model.
58
+
59
+ ## 📑 TODO List
60
+ - [x] Checkpoints
61
+ - [x] Inference Code
62
+ - [x] Web Demo (Gradio)
63
+ - [x] Audio-driven Portrait Image Animation Pipeline
64
+ - [ ] Inference Code for Long Videos
65
+ - [ ] User-Level GPU Inference on RTX4090
66
+ - [ ] ComfyUI
67
+
68
+
69
+ ## Getting Started 🏁
70
+
71
+ ### 1. Clone the code and prepare the environment 🛠️
72
+ First, clone the repository and create the conda environment:
73
+ ```bash
74
+ git clone https://github.com/SkyworkAI/SkyReels-A1.git
75
+ cd SkyReels-A1
76
+
77
+ # create env using conda
78
+ conda create -n skyreels-a1 python=3.10
79
+ conda activate skyreels-a1
80
+ ```
81
+ Then, install the remaining dependencies:
82
+ ```bash
83
+ pip install -r requirements.txt
84
+ ```
85
+
86
+
87
+ ### 2. Download pretrained weights 📥
88
+ You can download the pretrained weights from Hugging Face:
89
+ ```bash
90
+ # !pip install -U "huggingface_hub[cli]"
91
+ huggingface-cli download Skywork/SkyReels-A1 --local-dir local_path --exclude "*.git*" "README.md" "docs"
92
+ ```
93
+
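If you prefer to script the download in Python, the same can be done with `huggingface_hub` (a minimal sketch; the repo id follows the Hugging Face badge above, and `local_dir` is set to `pretrained_models` to match the directory tree below; adjust both to your setup):

```python
# Sketch: download the SkyReels-A1 weights via the huggingface_hub Python API.
# Assumes the repo id "Skywork/SkyReels-A1" and the pretrained_models layout below.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Skywork/SkyReels-A1",
    local_dir="pretrained_models",
    ignore_patterns=["*.git*", "README.md", "docs/*"],
)
```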
94
+ The FLAME, mediapipe, and smirk models are located in the SkyReels-A1/extra_models folder.
95
+
96
+ The pretrained models should be organized in the following directory structure:
97
+ ```text
98
+ pretrained_models
99
+ ├── FLAME
100
+ ├── SkyReels-A1-5B
101
+ │ ├── pose_guider
102
+ │ ├── scheduler
103
+ │ ├── tokenizer
104
+ │ ├── siglip-so400m-patch14-384
105
+ │ ├── transformer
106
+ │ ├── vae
107
+ │ └── text_encoder
108
+ ├── mediapipe
109
+ └── smirk
110
+
111
+ ```
112
+
113
+ #### Download DiffPoseTalk assets and pretrained weights (for audio-driven animation)
114
+
115
+ - We use [diffposetalk](https://github.com/DiffPoseTalk/DiffPoseTalk/tree/main) to generate FLAME coefficients from audio, which serve as the motion signals (see the usage sketch at the end of this subsection).
116
+
117
+ - Download the diffposetalk code and follow its README to download the weights and related data.
118
+
119
+ - Then place them in the specified directory.
120
+
121
+ ```bash
122
+ cp -r ${diffposetalk_root}/style pretrained_models/diffposetalk
123
+ cp ${diffposetalk_root}/experiments/DPT/head-SA-hubert-WM/checkpoints/iter_0110000.pt pretrained_models/diffposetalk
124
+ cp ${diffposetalk_root}/datasets/HDTF_TFHP/lmdb/stats_train.npz pretrained_models/diffposetalk
125
+ ```
126
+
127
+ ```text
128
+ pretrained_models
129
+ ├── FLAME
130
+ ├── SkyReels-A1-5B
131
+ ├── mediapipe
132
+ ├── diffposetalk
133
+ │ ├── style
134
+ │ ├── iter_0110000.pt
135
+ │ ├── stats_train.npz
136
+ └── smirk
137
+
138
+ ```
139
+
140
+
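For reference, here is a minimal usage sketch of the `DiffPoseTalk` wrapper shipped in this commit (`diffposetalk/diffposetalk.py`). It assumes the weights above are in place under `pretrained_models/diffposetalk`; the zero-valued shape vector is only a placeholder for the FLAME shape coefficients of the reference identity.

```python
# Sketch: generate per-frame FLAME coefficients from a driving audio clip with
# the DiffPoseTalk wrapper from this repo. The zero shape vector is a placeholder.
import numpy as np
from diffposetalk.diffposetalk import DiffPoseTalk, DiffPoseTalkConfig

dpt = DiffPoseTalk(DiffPoseTalkConfig(), device="cuda")   # loads pretrained_models/diffposetalk/*
shape_coef = np.zeros(100, dtype=np.float32)              # placeholder 100-dim FLAME shape vector
coef_list = dpt.infer_from_file("assets/driving_audio/1.wav", shape_coef)

# One dict per generated frame with expression, jaw, eye and head-pose parameters.
print(len(coef_list), coef_list[0]["expression_params"].shape)
```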
141
+ ### 3. Inference 🚀
142
+ You can simply run the inference scripts as:
143
+ ```bash
144
+ python inference.py
145
+
146
+ # inference audio to video
147
+ python inference_audio.py
148
+ ```
149
+
150
+ If the script runs successfully, it produces an output MP4 file containing the driving video, the input image or video, and the generated result.
151
+
152
+
153
+ ## Gradio Interface 🤗
154
+
155
+ We provide a [Gradio](https://huggingface.co/docs/hub/spaces-sdks-gradio) interface for a better experience. Launch it with:
156
+
157
+ ```bash
158
+ python app.py
159
+ ```
160
+
161
+ The graphical interactive interface is shown below:
162
+
163
+ ![gradio](https://github.com/user-attachments/assets/ed56f08c-f31c-4fbe-ac1d-c4d4e87a8719)
164
+
165
+
166
+ ## Metric Evaluation 👓
167
+
168
+ We also provide all the scripts for automatically calculating the metrics reported in the paper, including SimFace, FID, and the L1 distances of expression and motion.
169
+
170
+ All code can be found in the `eval` folder. After setting the video result path, run the following commands in sequence:
171
+
172
+ ```bash
173
+ python arc_score.py
174
+ python expression_score.py
175
+ python pose_score.py
176
+ ```
177
+
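For illustration only, a minimal sketch of an L1 coefficient-distance metric is shown below. It assumes predicted and ground-truth coefficients saved as `(T, D)` NumPy arrays at hypothetical paths; the actual `expression_score.py` and `pose_score.py` scripts may differ in detail.

```python
# Illustrative sketch: mean L1 distance between two FLAME coefficient sequences
# (e.g. expression or pose), assuming both are stored as (T, D) NumPy arrays.
import numpy as np

def l1_coef_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    T = min(len(pred), len(gt))                    # align sequence lengths
    return float(np.abs(pred[:T] - gt[:T]).mean())

pred = np.load("results/pred_expression.npy")      # hypothetical result paths
gt = np.load("results/gt_expression.npy")
print(f"L1 expression distance: {l1_coef_distance(pred, gt):.4f}")
```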
178
+
179
+ ## Acknowledgements 💐
180
+ We would like to thank the contributors of the [CogVideoX](https://github.com/THUDM/CogVideo), [finetrainers](https://github.com/a-r-r-o-w/finetrainers), and [DiffPoseTalk](https://github.com/DiffPoseTalk/DiffPoseTalk) repositories for their open research and contributions.
181
+
182
+ ## Citation 💖
183
+ If you find SkyReels-A1 useful for your research, please 🌟 this repo and cite our work using the following BibTeX:
184
+ ```bibtex
185
+ @article{qiu2025skyreels,
186
+ title={SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers},
187
+ author={Qiu, Di and Fei, Zhengcong and Wang, Rui and Bai, Jialin and Yu, Changqian and Fan, Mingyuan and Chen, Guibin and Wen, Xiang},
188
+ journal={arXiv preprint arXiv:2502.10841},
189
+ year={2025}
190
+ }
191
+ ```
192
+
193
+
194
+
assets/.DS_Store ADDED
Binary file (6.15 kB)
 
assets/demo.gif ADDED

Git LFS Details

  • SHA256: 1b8c13b7c718a9e2645dd4490dfe645a880781121b7d207ed43cc7cd3d0a35e4
  • Pointer size: 132 Bytes
  • Size of remote file: 3.08 MB
assets/driving_audio/1.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:38dab65a002455f4c186d4b0bde848c964415441d9636d94a48d5b32f23b0f6f
3
+ size 575850
assets/driving_audio/2.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:15b998fd3fbabf22e9fde210f93df4f7b1647e19fe81b2d33b2b74470fea32b5
3
+ size 3891278
assets/driving_audio/3.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:448576f545e18c4cb33cb934a00c0bc3f331896eba4d1cb6a077c1b9382d0628
3
+ size 910770
assets/driving_audio/4.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a3e0d37fc6e235b5a09eb4b7e3b0b5d5f566e2204c5c815c23ca2215dcbf9c93
3
+ size 553038
assets/driving_audio/5.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95275c9299919e38e52789cb3af17ddc4691b7afea82f26c7edc640addce057d
3
+ size 856142
assets/driving_audio/6.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7558a976b8b64d214f33c65e503c77271d3be0cd116a00ddadcb2b2fc53a6396
3
+ size 2641742
assets/driving_video/.DS_Store ADDED
Binary file (6.15 kB)
 
assets/driving_video/1.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b7da4f10cf9e692ba8c75848bacceb3c4d30ee8d3b07719435560c44a8da6544
3
+ size 306996
assets/driving_video/2.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e795b7be655c4b5ae8cac0733a32e8d321ccebd13f2cac07cc15dfc8f61a547
3
+ size 2875843
assets/driving_video/3.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:02f5ee85c1028c9673c70682b533a4f22e203173eddd40de42bad0cb57f18abb
3
+ size 1020948
assets/driving_video/4.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f7ddbb17b198a580f658d57f4d83bee7489aa4d8a677f2c45b76b1ec01ae461
3
+ size 215144
assets/driving_video/5.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9637fea5ef83b494a0aa8b7c526ae1efc6ec94d79dfa94381de8d6f38eec238e
3
+ size 556047
assets/driving_video/6.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ac7ee3c2419046f11dc230b6db33c2391a98334eba2b1d773e7eb9627992622f
3
+ size 1064930
assets/driving_video/7.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1dc94c1fec7ef7dc831c8a49f0e1788ae568812cb68e62f6875d9070f573d02a
3
+ size 187263
assets/driving_video/8.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3047ba66296d96b8a4584e412e61493d7bc0fa5149c77b130e7feea375e698bd
3
+ size 232859
assets/logo.png ADDED
assets/ref_images/1.png ADDED

Git LFS Details

  • SHA256: 93429c6e7408723b04f3681cc06ac98072f8ce4fd69476ee612466a335ca152c
  • Pointer size: 132 Bytes
  • Size of remote file: 1.07 MB
assets/ref_images/10.png ADDED

Git LFS Details

  • SHA256: ef7456fd5eb3b31584f0933d1b71c25f92a8a9cb428466c1c4daf4eede2db9d3
  • Pointer size: 131 Bytes
  • Size of remote file: 508 kB
assets/ref_images/11.png ADDED

Git LFS Details

  • SHA256: 99bfeeecbefa2bf408d0f15688ae89fa4c71f881d78baced1591bef128367efc
  • Pointer size: 131 Bytes
  • Size of remote file: 634 kB
assets/ref_images/12.png ADDED

Git LFS Details

  • SHA256: c258cba0979585f3fac4d63d9ca0fc3e51604afde62a272f069146ae43d1a996
  • Pointer size: 131 Bytes
  • Size of remote file: 793 kB
assets/ref_images/13.png ADDED

Git LFS Details

  • SHA256: c31191bc70144def9c0de388483d0a9257b0e4eb72128474232bbaa234f5a0a5
  • Pointer size: 131 Bytes
  • Size of remote file: 633 kB
assets/ref_images/14.png ADDED

Git LFS Details

  • SHA256: 8058fc784284c59f1954269638f1ad937ac35cf58563b935736d3f34e6355045
  • Pointer size: 131 Bytes
  • Size of remote file: 517 kB
assets/ref_images/15.png ADDED

Git LFS Details

  • SHA256: 4c3e49512a2253b2a7291ad6b1636521e66b10050dba37a0b9d47c9a5666fb61
  • Pointer size: 131 Bytes
  • Size of remote file: 641 kB
assets/ref_images/16.png ADDED

Git LFS Details

  • SHA256: 5e65a2f40f5f971b0e91023e774ce8aff56a1da723c1f8ffdfc5ec616690cde2
  • Pointer size: 131 Bytes
  • Size of remote file: 392 kB
assets/ref_images/17.png ADDED

Git LFS Details

  • SHA256: 202b2a66e87de425c55e223554942da71a4b0a27757bc2f90ec4c8d51133934b
  • Pointer size: 131 Bytes
  • Size of remote file: 750 kB
assets/ref_images/18.png ADDED

Git LFS Details

  • SHA256: 06a756c3e0a0b5d786428b0968126281c292e1df2c286cb683bac059821c0122
  • Pointer size: 131 Bytes
  • Size of remote file: 183 kB
assets/ref_images/19.png ADDED

Git LFS Details

  • SHA256: 277fc3ecf3c0299f87cb59d056c5d484feb2fa7897c9d0f80ee0854eba2c3487
  • Pointer size: 131 Bytes
  • Size of remote file: 283 kB
assets/ref_images/2.png ADDED

Git LFS Details

  • SHA256: 5c972790d52fc6adf7e5bcb4611720570e260f56c52f063acfea5e4d2f52c07f
  • Pointer size: 131 Bytes
  • Size of remote file: 762 kB
assets/ref_images/20.png ADDED
assets/ref_images/3.png ADDED

Git LFS Details

  • SHA256: bce73675d41349d0792e9903d08ad12280d0e1b3af21e686720a7dac5dcaa649
  • Pointer size: 131 Bytes
  • Size of remote file: 737 kB
assets/ref_images/4.png ADDED

Git LFS Details

  • SHA256: 03ff23c5be3ff225969ddd97a26971bab40af4cc6012f0f859971a12cd8e9003
  • Pointer size: 131 Bytes
  • Size of remote file: 348 kB
assets/ref_images/5.png ADDED

Git LFS Details

  • SHA256: 6b9c2279c99ef4f354fa9e2ea8f1751e8f35ed2ed937e5a2b0b3c918fb49f947
  • Pointer size: 131 Bytes
  • Size of remote file: 375 kB
assets/ref_images/6.png ADDED

Git LFS Details

  • SHA256: d127961dece864d4000351c1c14a71d3c1bc54c51c2cce6d9dd1c74bdea0ec4c
  • Pointer size: 131 Bytes
  • Size of remote file: 370 kB
assets/ref_images/7.png ADDED

Git LFS Details

  • SHA256: c1e2c11b7f9832b2acbf454065b2beebf95f6817f623ee1fe56ff2fafc0caf1d
  • Pointer size: 131 Bytes
  • Size of remote file: 542 kB
assets/ref_images/8.png ADDED

Git LFS Details

  • SHA256: 8c8aa92c1bea3f5f0b1b3859b35ed801fc4022f064b3ebba09e621157a2ac4c6
  • Pointer size: 131 Bytes
  • Size of remote file: 358 kB
diffposetalk/common.py ADDED
@@ -0,0 +1,46 @@
1
+ import math
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+
7
+
8
+ class PositionalEncoding(nn.Module):
9
+ def __init__(self, d_model, dropout=0.1, max_len=600):
10
+ super().__init__()
11
+ self.dropout = nn.Dropout(p=dropout)
12
+ # vanilla sinusoidal encoding
13
+ pe = torch.zeros(max_len, d_model)
14
+ position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
15
+ div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
16
+ pe[:, 0::2] = torch.sin(position * div_term)
17
+ pe[:, 1::2] = torch.cos(position * div_term)
18
+ pe = pe.unsqueeze(0)
19
+ self.register_buffer('pe', pe)
20
+
21
+ def forward(self, x):
22
+ x = x + self.pe[:, :x.shape[1], :]  # add the encoding for each of the first x.shape[1] positions
23
+ return self.dropout(x)
24
+
25
+
26
+ def enc_dec_mask(T, S, frame_width=2, expansion=0, device='cuda'):
27
+ mask = torch.ones(T, S)
28
+ for i in range(T):
29
+ mask[i, max(0, (i - expansion) * frame_width):(i + expansion + 1) * frame_width] = 0
30
+ return (mask == 1).to(device=device)
31
+
32
+
33
+ def pad_audio(audio, audio_unit=320, pad_threshold=80):
34
+ batch_size, audio_len = audio.shape
35
+ n_units = audio_len // audio_unit
36
+ side_len = math.ceil((audio_unit * n_units + pad_threshold - audio_len) / 2)
37
+ if side_len >= 0:
38
+ reflect_len = side_len // 2
39
+ replicate_len = side_len % 2
40
+ if reflect_len > 0:
41
+ audio = F.pad(audio, (reflect_len, reflect_len), mode='reflect')
42
+ audio = F.pad(audio, (reflect_len, reflect_len), mode='reflect')
43
+ if replicate_len > 0:
44
+ audio = F.pad(audio, (1, 1), mode='replicate')
45
+
46
+ return audio
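A quick sanity check of the helpers above could look like the following (a sketch; it assumes the `diffposetalk` folder is importable as a package and that PyTorch is installed):

```python
# Sketch: exercise PositionalEncoding and enc_dec_mask from diffposetalk/common.py.
import torch
from diffposetalk.common import PositionalEncoding, enc_dec_mask

pe = PositionalEncoding(d_model=512, max_len=600)
x = torch.zeros(2, 100, 512)                               # (batch, seq_len, d_model)
print(pe(x).shape)                                         # torch.Size([2, 100, 512])

mask = enc_dec_mask(T=25, S=50, frame_width=2, expansion=0, device='cpu')
print(mask.shape, mask.dtype)                              # torch.Size([25, 50]) torch.bool
```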
diffposetalk/diff_talking_head.py ADDED
@@ -0,0 +1,536 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+
5
+ from .common import PositionalEncoding, enc_dec_mask, pad_audio
6
+
7
+
8
+ class DiffusionSchedule(nn.Module):
9
+ def __init__(self, num_steps, mode='linear', beta_1=1e-4, beta_T=0.02, s=0.008):
10
+ super().__init__()
11
+
12
+ if mode == 'linear':
13
+ betas = torch.linspace(beta_1, beta_T, num_steps)
14
+ elif mode == 'quadratic':
15
+ betas = torch.linspace(beta_1 ** 0.5, beta_T ** 0.5, num_steps) ** 2
16
+ elif mode == 'sigmoid':
17
+ betas = torch.sigmoid(torch.linspace(-5, 5, num_steps)) * (beta_T - beta_1) + beta_1
18
+ elif mode == 'cosine':
19
+ steps = num_steps + 1
20
+ x = torch.linspace(0, num_steps, steps)
21
+ alpha_bars = torch.cos(((x / num_steps) + s) / (1 + s) * torch.pi * 0.5) ** 2
22
+ alpha_bars = alpha_bars / alpha_bars[0]
23
+ betas = 1 - (alpha_bars[1:] / alpha_bars[:-1])
24
+ betas = torch.clip(betas, 0.0001, 0.999)
25
+ else:
26
+ raise ValueError(f'Unknown diffusion schedule {mode}!')
27
+ betas = torch.cat([torch.zeros(1), betas], dim=0) # Padding beta_0 = 0
28
+
29
+ alphas = 1 - betas
30
+ log_alphas = torch.log(alphas)
31
+ for i in range(1, log_alphas.shape[0]): # 1 to T
32
+ log_alphas[i] += log_alphas[i - 1]
33
+ alpha_bars = log_alphas.exp()
34
+
35
+ sigmas_flex = torch.sqrt(betas)
36
+ sigmas_inflex = torch.zeros_like(sigmas_flex)
37
+ for i in range(1, sigmas_flex.shape[0]):
38
+ sigmas_inflex[i] = ((1 - alpha_bars[i - 1]) / (1 - alpha_bars[i])) * betas[i]
39
+ sigmas_inflex = torch.sqrt(sigmas_inflex)
40
+
41
+ self.num_steps = num_steps
42
+ self.register_buffer('betas', betas)
43
+ self.register_buffer('alphas', alphas)
44
+ self.register_buffer('alpha_bars', alpha_bars)
45
+ self.register_buffer('sigmas_flex', sigmas_flex)
46
+ self.register_buffer('sigmas_inflex', sigmas_inflex)
47
+
48
+ def uniform_sample_t(self, batch_size):
49
+ ts = torch.randint(1, self.num_steps + 1, (batch_size,))
50
+ return ts.tolist()
51
+
52
+ def get_sigmas(self, t, flexibility=0):
53
+ assert 0 <= flexibility <= 1
54
+ sigmas = self.sigmas_flex[t] * flexibility + self.sigmas_inflex[t] * (1 - flexibility)
55
+ return sigmas
56
+
57
+
58
+ class DiffTalkingHead(nn.Module):
59
+ def __init__(self, args, device='cuda'):
60
+ super().__init__()
61
+
62
+ # Model parameters
63
+ self.target = args.target
64
+ self.architecture = args.architecture
65
+ self.use_style = args.style_enc_ckpt is not None
66
+
67
+ self.motion_feat_dim = 50
68
+ if args.rot_repr == 'aa':
69
+ self.motion_feat_dim += 1 if args.no_head_pose else 4
70
+ else:
71
+ raise ValueError(f'Unknown rotation representation {args.rot_repr}!')
72
+
73
+ self.fps = args.fps
74
+ self.n_motions = args.n_motions
75
+ self.n_prev_motions = args.n_prev_motions
76
+ if self.use_style:
77
+ self.style_feat_dim = args.d_style
78
+
79
+ # Audio encoder
80
+ self.audio_model = args.audio_model
81
+ if self.audio_model == 'wav2vec2':
82
+ from .wav2vec2 import Wav2Vec2Model
83
+ self.audio_encoder = Wav2Vec2Model.from_pretrained('facebook/wav2vec2-base-960h')
84
+ # wav2vec 2.0 weights initialization
85
+ self.audio_encoder.feature_extractor._freeze_parameters()
86
+ elif self.audio_model == 'hubert':
87
+ from .hubert import HubertModel
88
+ self.audio_encoder = HubertModel.from_pretrained('facebook/hubert-base-ls960')
89
+ self.audio_encoder.feature_extractor._freeze_parameters()
90
+
91
+ frozen_layers = [0, 1]
92
+ for name, param in self.audio_encoder.named_parameters():
93
+ if name.startswith("feature_projection"):
94
+ param.requires_grad = False
95
+ if name.startswith("encoder.layers"):
96
+ layer = int(name.split(".")[2])
97
+ if layer in frozen_layers:
98
+ param.requires_grad = False
99
+ else:
100
+ raise ValueError(f'Unknown audio model {self.audio_model}!')
101
+
102
+ if args.architecture == 'decoder':
103
+ self.audio_feature_map = nn.Linear(768, args.feature_dim)
104
+ self.start_audio_feat = nn.Parameter(torch.randn(1, self.n_prev_motions, args.feature_dim))
105
+ else:
106
+ raise ValueError(f'Unknown architecture {args.architecture}!')
107
+
108
+ self.start_motion_feat = nn.Parameter(torch.randn(1, self.n_prev_motions, self.motion_feat_dim))
109
+
110
+ # Diffusion model
111
+ self.denoising_net = DenoisingNetwork(args, device)
112
+ # diffusion schedule
113
+ self.diffusion_sched = DiffusionSchedule(args.n_diff_steps, args.diff_schedule)
114
+
115
+ # Classifier-free settings
116
+ self.cfg_mode = args.cfg_mode
117
+ guiding_conditions = args.guiding_conditions.split(',') if args.guiding_conditions else []
118
+ self.guiding_conditions = [cond for cond in guiding_conditions if cond in ['style', 'audio']]
119
+ if 'style' in self.guiding_conditions:
120
+ if not self.use_style:
121
+ raise ValueError('Cannot use style guiding without enabling it!')
122
+ self.null_style_feat = nn.Parameter(torch.randn(1, 1, self.style_feat_dim))
123
+ if 'audio' in self.guiding_conditions:
124
+ audio_feat_dim = args.feature_dim
125
+ self.null_audio_feat = nn.Parameter(torch.randn(1, 1, audio_feat_dim))
126
+
127
+ self.to(device)
128
+
129
+ @property
130
+ def device(self):
131
+ return next(self.parameters()).device
132
+
133
+ def forward(self, motion_feat, audio_or_feat, shape_feat, style_feat=None,
134
+ prev_motion_feat=None, prev_audio_feat=None, time_step=None, indicator=None):
135
+ """
136
+ Args:
137
+ motion_feat: (N, L, d_coef) motion coefficients or features
138
+ audio_or_feat: (N, L_audio) raw audio or audio feature
139
+ shape_feat: (N, d_shape) or (N, 1, d_shape)
140
+ style_feat: (N, d_style)
141
+ prev_motion_feat: (N, n_prev_motions, d_motion) previous motion coefficients or feature
142
+ prev_audio_feat: (N, n_prev_motions, d_audio) previous audio features
143
+ time_step: (N,)
144
+ indicator: (N, L) 0/1 indicator of real (unpadded) motion coefficients
145
+
146
+ Returns:
147
+ motion_feat_noise: (N, L, d_motion)
148
+ """
149
+ if self.use_style:
150
+ assert style_feat is not None, 'Missing style features!'
151
+
152
+ batch_size = motion_feat.shape[0]
153
+
154
+ if audio_or_feat.ndim == 2:
155
+ # Extract audio features
156
+ assert audio_or_feat.shape[1] == 16000 * self.n_motions / self.fps, \
157
+ f'Incorrect audio length {audio_or_feat.shape[1]}'
158
+ audio_feat_saved = self.extract_audio_feature(audio_or_feat) # (N, L, feature_dim)
159
+ elif audio_or_feat.ndim == 3:
160
+ assert audio_or_feat.shape[1] == self.n_motions, f'Incorrect audio feature length {audio_or_feat.shape[1]}'
161
+ audio_feat_saved = audio_or_feat
162
+ else:
163
+ raise ValueError(f'Incorrect audio input shape {audio_or_feat.shape}')
164
+ audio_feat = audio_feat_saved.clone()
165
+
166
+ if shape_feat.ndim == 2:
167
+ shape_feat = shape_feat.unsqueeze(1) # (N, 1, d_shape)
168
+ if style_feat is not None and style_feat.ndim == 2:
169
+ style_feat = style_feat.unsqueeze(1) # (N, 1, d_style)
170
+
171
+ if prev_motion_feat is None:
172
+ prev_motion_feat = self.start_motion_feat.expand(batch_size, -1, -1) # (N, n_prev_motions, d_motion)
173
+ if prev_audio_feat is None:
174
+ # (N, n_prev_motions, feature_dim)
175
+ prev_audio_feat = self.start_audio_feat.expand(batch_size, -1, -1)
176
+
177
+ # Classifier-free guidance
178
+ if len(self.guiding_conditions) > 0:
179
+ assert len(self.guiding_conditions) <= 2, 'Only support 1 or 2 CFG conditions!'
180
+ if len(self.guiding_conditions) == 1 or self.cfg_mode == 'independent':
181
+ null_cond_prob = 0.5 if len(self.guiding_conditions) >= 2 else 0.1
182
+ if 'style' in self.guiding_conditions:
183
+ mask_style = torch.rand(batch_size, device=self.device) < null_cond_prob
184
+ style_feat = torch.where(mask_style.view(-1, 1, 1),
185
+ self.null_style_feat.expand(batch_size, -1, -1),
186
+ style_feat)
187
+ if 'audio' in self.guiding_conditions:
188
+ mask_audio = torch.rand(batch_size, device=self.device) < null_cond_prob
189
+ audio_feat = torch.where(mask_audio.view(-1, 1, 1),
190
+ self.null_audio_feat.expand(batch_size, self.n_motions, -1),
191
+ audio_feat)
192
+ else:
193
+ # len(self.guiding_conditions) > 1 and self.cfg_mode == 'incremental'
194
+ # full (0.45), w/o style (0.45), w/o style or audio (0.1)
195
+ mask_flag = torch.rand(batch_size, device=self.device)
196
+ if 'style' in self.guiding_conditions:
197
+ mask_style = mask_flag > 0.55
198
+ style_feat = torch.where(mask_style.view(-1, 1, 1),
199
+ self.null_style_feat.expand(batch_size, -1, -1),
200
+ style_feat)
201
+ if 'audio' in self.guiding_conditions:
202
+ mask_audio = mask_flag > 0.9
203
+ audio_feat = torch.where(mask_audio.view(-1, 1, 1),
204
+ self.null_audio_feat.expand(batch_size, self.n_motions, -1),
205
+ audio_feat)
206
+
207
+ if style_feat is None:
208
+ # The model only accepts audio and shape features, i.e., self.use_style = False
209
+ person_feat = shape_feat
210
+ else:
211
+ person_feat = torch.cat([shape_feat, style_feat], dim=-1)
212
+
213
+ if time_step is None:
214
+ # Sample time step
215
+ time_step = self.diffusion_sched.uniform_sample_t(batch_size) # (N,)
216
+
217
+ # The forward diffusion process
218
+ alpha_bar = self.diffusion_sched.alpha_bars[time_step] # (N,)
219
+ c0 = torch.sqrt(alpha_bar).view(-1, 1, 1) # (N, 1, 1)
220
+ c1 = torch.sqrt(1 - alpha_bar).view(-1, 1, 1) # (N, 1, 1)
221
+
222
+ eps = torch.randn_like(motion_feat) # (N, L, d_motion)
223
+ motion_feat_noisy = c0 * motion_feat + c1 * eps
224
+
225
+ # The reverse diffusion process
226
+ motion_feat_target = self.denoising_net(motion_feat_noisy, audio_feat, person_feat,
227
+ prev_motion_feat, prev_audio_feat, time_step, indicator)
228
+
229
+ return eps, motion_feat_target, motion_feat.detach(), audio_feat_saved.detach()
230
+
231
+ def extract_audio_feature(self, audio, frame_num=None):
232
+ frame_num = frame_num or self.n_motions
233
+
234
+ # # Strategy 1: resample during audio feature extraction
235
+ # hidden_states = self.audio_encoder(pad_audio(audio), self.fps, frame_num=frame_num).last_hidden_state # (N, L, 768)
236
+
237
+ # Strategy 2: resample after audio feature extraction (BackResample)
238
+ hidden_states = self.audio_encoder(pad_audio(audio), self.fps,
239
+ frame_num=frame_num * 2).last_hidden_state # (N, 2L, 768)
240
+ hidden_states = hidden_states.transpose(1, 2) # (N, 768, 2L)
241
+ hidden_states = F.interpolate(hidden_states, size=frame_num, align_corners=False, mode='linear') # (N, 768, L)
242
+ hidden_states = hidden_states.transpose(1, 2) # (N, L, 768)
243
+
244
+ audio_feat = self.audio_feature_map(hidden_states) # (N, L, feature_dim)
245
+ return audio_feat
246
+
247
+ @torch.no_grad()
248
+ def sample(self, audio_or_feat, shape_feat, style_feat=None, prev_motion_feat=None, prev_audio_feat=None,
249
+ motion_at_T=None, indicator=None, cfg_mode=None, cfg_cond=None, cfg_scale=1.15, flexibility=0,
250
+ dynamic_threshold=None, ret_traj=False):
251
+ # Check and convert inputs
252
+ batch_size = audio_or_feat.shape[0]
253
+
254
+ # Check CFG conditions
255
+ if cfg_mode is None: # Use default CFG mode
256
+ cfg_mode = self.cfg_mode
257
+ if cfg_cond is None: # Use default CFG conditions
258
+ cfg_cond = self.guiding_conditions
259
+ cfg_cond = [c for c in cfg_cond if c in ['audio', 'style']]
260
+
261
+ if not isinstance(cfg_scale, list):
262
+ cfg_scale = [cfg_scale] * len(cfg_cond)
263
+
264
+ # sort cfg_cond and cfg_scale
265
+ if len(cfg_cond) > 0:
266
+ cfg_cond, cfg_scale = zip(*sorted(zip(cfg_cond, cfg_scale), key=lambda x: ['audio', 'style'].index(x[0])))
267
+ else:
268
+ cfg_cond, cfg_scale = [], []
269
+
270
+ if 'style' in cfg_cond:
271
+ assert self.use_style and style_feat is not None
272
+
273
+ if self.use_style:
274
+ if style_feat is None: # use null style feature
275
+ style_feat = self.null_style_feat.expand(batch_size, -1, -1)
276
+ else:
277
+ assert style_feat is None, 'This model does not support style feature input!'
278
+
279
+ if audio_or_feat.ndim == 2:
280
+ # Extract audio features
281
+ assert audio_or_feat.shape[1] == 16000 * self.n_motions / self.fps, \
282
+ f'Incorrect audio length {audio_or_feat.shape[1]}'
283
+ audio_feat = self.extract_audio_feature(audio_or_feat) # (N, L, feature_dim)
284
+ elif audio_or_feat.ndim == 3:
285
+ assert audio_or_feat.shape[1] == self.n_motions, f'Incorrect audio feature length {audio_or_feat.shape[1]}'
286
+ audio_feat = audio_or_feat
287
+ else:
288
+ raise ValueError(f'Incorrect audio input shape {audio_or_feat.shape}')
289
+
290
+ if shape_feat.ndim == 2:
291
+ shape_feat = shape_feat.unsqueeze(1) # (N, 1, d_shape)
292
+ if style_feat is not None and style_feat.ndim == 2:
293
+ style_feat = style_feat.unsqueeze(1) # (N, 1, d_style)
294
+
295
+ if prev_motion_feat is None:
296
+ prev_motion_feat = self.start_motion_feat.expand(batch_size, -1, -1) # (N, n_prev_motions, d_motion)
297
+ if prev_audio_feat is None:
298
+ # (N, n_prev_motions, feature_dim)
299
+ prev_audio_feat = self.start_audio_feat.expand(batch_size, -1, -1)
300
+
301
+ if motion_at_T is None:
302
+ motion_at_T = torch.randn((batch_size, self.n_motions, self.motion_feat_dim)).to(self.device)
303
+
304
+ # Prepare input for the reverse diffusion process (including optional classifier-free guidance)
305
+ if 'audio' in cfg_cond:
306
+ audio_feat_null = self.null_audio_feat.expand(batch_size, self.n_motions, -1)
307
+ else:
308
+ audio_feat_null = audio_feat
309
+
310
+ if 'style' in cfg_cond:
311
+ person_feat_null = torch.cat([shape_feat, self.null_style_feat.expand(batch_size, -1, -1)], dim=-1)
312
+ else:
313
+ if self.use_style:
314
+ person_feat_null = torch.cat([shape_feat, style_feat], dim=-1)
315
+ else:
316
+ person_feat_null = shape_feat
317
+
318
+ audio_feat_in = [audio_feat_null]
319
+ person_feat_in = [person_feat_null]
320
+ for cond in cfg_cond:
321
+ if cond == 'audio':
322
+ audio_feat_in.append(audio_feat)
323
+ person_feat_in.append(person_feat_null)
324
+ elif cond == 'style':
325
+ if cfg_mode == 'independent':
326
+ audio_feat_in.append(audio_feat_null)
327
+ elif cfg_mode == 'incremental':
328
+ audio_feat_in.append(audio_feat)
329
+ else:
330
+ raise NotImplementedError(f'Unknown cfg_mode {cfg_mode}')
331
+ person_feat_in.append(torch.cat([shape_feat, style_feat], dim=-1))
332
+
333
+ n_entries = len(audio_feat_in)
334
+ audio_feat_in = torch.cat(audio_feat_in, dim=0)
335
+ person_feat_in = torch.cat(person_feat_in, dim=0)
336
+ prev_motion_feat_in = torch.cat([prev_motion_feat] * n_entries, dim=0)
337
+ prev_audio_feat_in = torch.cat([prev_audio_feat] * n_entries, dim=0)
338
+ indicator_in = torch.cat([indicator] * n_entries, dim=0) if indicator is not None else None
339
+
340
+ traj = {self.diffusion_sched.num_steps: motion_at_T}
341
+ for t in range(self.diffusion_sched.num_steps, 0, -1):
342
+ if t > 1:
343
+ z = torch.randn_like(motion_at_T)
344
+ else:
345
+ z = torch.zeros_like(motion_at_T)
346
+
347
+ alpha = self.diffusion_sched.alphas[t]
348
+ alpha_bar = self.diffusion_sched.alpha_bars[t]
349
+ alpha_bar_prev = self.diffusion_sched.alpha_bars[t - 1]
350
+ sigma = self.diffusion_sched.get_sigmas(t, flexibility)
351
+
352
+ motion_at_t = traj[t]
353
+ motion_in = torch.cat([motion_at_t] * n_entries, dim=0)
354
+ step_in = torch.tensor([t] * batch_size, device=self.device)
355
+ step_in = torch.cat([step_in] * n_entries, dim=0)
356
+
357
+ results = self.denoising_net(motion_in, audio_feat_in, person_feat_in, prev_motion_feat_in,
358
+ prev_audio_feat_in, step_in, indicator_in)
359
+
360
+ # Apply thresholding if specified
361
+ if dynamic_threshold:
362
+ dt_ratio, dt_min, dt_max = dynamic_threshold
363
+ abs_results = results[:, -self.n_motions:].reshape(batch_size * n_entries, -1).abs()
364
+ s = torch.quantile(abs_results, dt_ratio, dim=1)
365
+ s = torch.clamp(s, min=dt_min, max=dt_max)
366
+ s = s[..., None, None]
367
+ results = torch.clamp(results, min=-s, max=s)
368
+
369
+ results = results.chunk(n_entries)
370
+
371
+ # Unconditional target (CFG) or the conditional target (non-CFG)
372
+ target_theta = results[0][:, -self.n_motions:]
373
+ # Classifier-free Guidance (optional)
374
+ for i in range(0, n_entries - 1):
375
+ if cfg_mode == 'independent':
376
+ target_theta += cfg_scale[i] * (
377
+ results[i + 1][:, -self.n_motions:] - results[0][:, -self.n_motions:])
378
+ elif cfg_mode == 'incremental':
379
+ target_theta += cfg_scale[i] * (
380
+ results[i + 1][:, -self.n_motions:] - results[i][:, -self.n_motions:])
381
+ else:
382
+ raise NotImplementedError(f'Unknown cfg_mode {cfg_mode}')
383
+
384
+ if self.target == 'noise':
385
+ c0 = 1 / torch.sqrt(alpha)
386
+ c1 = (1 - alpha) / torch.sqrt(1 - alpha_bar)
387
+ motion_next = c0 * (motion_at_t - c1 * target_theta) + sigma * z
388
+ elif self.target == 'sample':
389
+ c0 = (1 - alpha_bar_prev) * torch.sqrt(alpha) / (1 - alpha_bar)
390
+ c1 = (1 - alpha) * torch.sqrt(alpha_bar_prev) / (1 - alpha_bar)
391
+ motion_next = c0 * motion_at_t + c1 * target_theta + sigma * z
392
+ else:
393
+ raise ValueError('Unknown target type: {}'.format(self.target))
394
+
395
+ traj[t - 1] = motion_next.detach() # Stop gradient and save trajectory.
396
+ traj[t] = traj[t].cpu() # Move previous output to CPU memory.
397
+ if not ret_traj:
398
+ del traj[t]
399
+
400
+ if ret_traj:
401
+ return traj, motion_at_T, audio_feat
402
+ else:
403
+ return traj[0], motion_at_T, audio_feat
404
+
405
+
406
+ class DenoisingNetwork(nn.Module):
407
+ def __init__(self, args, device='cuda'):
408
+ super().__init__()
409
+
410
+ # Model parameters
411
+ self.use_style = args.style_enc_ckpt is not None
412
+ self.motion_feat_dim = 50
413
+ if args.rot_repr == 'aa':
414
+ self.motion_feat_dim += 1 if args.no_head_pose else 4
415
+ else:
416
+ raise ValueError(f'Unknown rotation representation {args.rot_repr}!')
417
+ self.shape_feat_dim = 100
418
+ if self.use_style:
419
+ self.style_feat_dim = args.d_style
420
+ self.person_feat_dim = self.shape_feat_dim + self.style_feat_dim
421
+ else:
422
+ self.person_feat_dim = self.shape_feat_dim
423
+ self.use_indicator = args.use_indicator
424
+
425
+ # Transformer
426
+ self.architecture = args.architecture
427
+ self.feature_dim = args.feature_dim
428
+ self.n_heads = args.n_heads
429
+ self.n_layers = args.n_layers
430
+ self.mlp_ratio = args.mlp_ratio
431
+ self.align_mask_width = args.align_mask_width
432
+ self.use_learnable_pe = not args.no_use_learnable_pe
433
+ # sequence length
434
+ self.n_prev_motions = args.n_prev_motions
435
+ self.n_motions = args.n_motions
436
+
437
+ # Temporal embedding for the diffusion time step
438
+ self.TE = PositionalEncoding(self.feature_dim, max_len=args.n_diff_steps + 1)
439
+ self.diff_step_map = nn.Sequential(
440
+ nn.Linear(self.feature_dim, self.feature_dim),
441
+ nn.GELU(),
442
+ nn.Linear(self.feature_dim, self.feature_dim)
443
+ )
444
+
445
+ if self.use_learnable_pe:
446
+ # Learnable positional encoding
447
+ self.PE = nn.Parameter(torch.randn(1, 1 + self.n_prev_motions + self.n_motions, self.feature_dim))
448
+ else:
449
+ self.PE = PositionalEncoding(self.feature_dim)
450
+
451
+ self.person_proj = nn.Linear(self.person_feat_dim, self.feature_dim)
452
+
453
+ # Transformer decoder
454
+ if self.architecture == 'decoder':
455
+ self.feature_proj = nn.Linear(self.motion_feat_dim + (1 if self.use_indicator else 0),
456
+ self.feature_dim)
457
+ decoder_layer = nn.TransformerDecoderLayer(
458
+ d_model=self.feature_dim, nhead=self.n_heads, dim_feedforward=self.mlp_ratio * self.feature_dim,
459
+ activation='gelu', batch_first=True
460
+ )
461
+ self.transformer = nn.TransformerDecoder(decoder_layer, num_layers=self.n_layers)
462
+ if self.align_mask_width > 0:
463
+ motion_len = self.n_prev_motions + self.n_motions
464
+ alignment_mask = enc_dec_mask(motion_len, motion_len, 1, self.align_mask_width - 1)
465
+ alignment_mask = F.pad(alignment_mask, (0, 0, 1, 0), value=False)
466
+ self.register_buffer('alignment_mask', alignment_mask)
467
+ else:
468
+ self.alignment_mask = None
469
+ else:
470
+ raise ValueError(f'Unknown architecture: {self.architecture}')
471
+
472
+ # Motion decoder
473
+ self.motion_dec = nn.Sequential(
474
+ nn.Linear(self.feature_dim, self.feature_dim // 2),
475
+ nn.GELU(),
476
+ nn.Linear(self.feature_dim // 2, self.motion_feat_dim)
477
+ )
478
+
479
+ self.to(device)
480
+
481
+ @property
482
+ def device(self):
483
+ return next(self.parameters()).device
484
+
485
+ def forward(self, motion_feat, audio_feat, person_feat, prev_motion_feat, prev_audio_feat, step, indicator=None):
486
+ """
487
+ Args:
488
+ motion_feat: (N, L, d_motion). Noisy motion feature
489
+ audio_feat: (N, L, feature_dim)
490
+ person_feat: (N, 1, d_person)
491
+ prev_motion_feat: (N, L_p, d_motion). Padded previous motion coefficients or feature
492
+ prev_audio_feat: (N, L_p, d_audio). Padded previous motion coefficients or feature
493
+ step: (N,)
494
+ indicator: (N, L). 0/1 indicator for the real (unpadded) motion feature
495
+
496
+ Returns:
497
+ motion_feat_target: (N, L_p + L, d_motion)
498
+ """
499
+ # Diffusion time step embedding
500
+ diff_step_embedding = self.diff_step_map(self.TE.pe[0, step]).unsqueeze(1) # (N, 1, diff_step_dim)
501
+
502
+ person_feat = self.person_proj(person_feat) # (N, 1, feature_dim)
503
+ person_feat = person_feat + diff_step_embedding
504
+
505
+ if indicator is not None:
506
+ indicator = torch.cat([torch.zeros((indicator.shape[0], self.n_prev_motions), device=indicator.device),
507
+ indicator], dim=1) # (N, L_p + L)
508
+ indicator = indicator.unsqueeze(-1) # (N, L_p + L, 1)
509
+
510
+ # Concat features and embeddings
511
+ if self.architecture == 'decoder':
512
+ feats_in = torch.cat([prev_motion_feat, motion_feat], dim=1) # (N, L_p + L, d_motion)
513
+ else:
514
+ raise ValueError(f'Unknown architecture: {self.architecture}')
515
+ if self.use_indicator:
516
+ feats_in = torch.cat([feats_in, indicator], dim=-1) # (N, L_p + L, d_motion + d_audio + 1)
517
+
518
+ feats_in = self.feature_proj(feats_in) # (N, L_p + L, feature_dim)
519
+ feats_in = torch.cat([person_feat, feats_in], dim=1) # (N, 1 + L_p + L, feature_dim)
520
+
521
+ if self.use_learnable_pe:
522
+ feats_in = feats_in + self.PE
523
+ else:
524
+ feats_in = self.PE(feats_in)
525
+
526
+ # Transformer
527
+ if self.architecture == 'decoder':
528
+ audio_feat_in = torch.cat([prev_audio_feat, audio_feat], dim=1) # (N, L_p + L, d_audio)
529
+ feat_out = self.transformer(feats_in, audio_feat_in, memory_mask=self.alignment_mask)
530
+ else:
531
+ raise ValueError(f'Unknown architecture: {self.architecture}')
532
+
533
+ # Decode predicted motion feature noise / sample
534
+ motion_feat_target = self.motion_dec(feat_out[:, 1:]) # (N, L_p + L, d_motion)
535
+
536
+ return motion_feat_target
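A small standalone sketch of the `DiffusionSchedule` defined above is given below (it assumes the `diffposetalk` package is importable; `DiffTalkingHead` itself additionally needs the full training-args object and the pretrained audio encoder):

```python
# Sketch: inspect the DiffusionSchedule buffers without building the full model.
import torch
from diffposetalk.diff_talking_head import DiffusionSchedule

sched = DiffusionSchedule(num_steps=500, mode='cosine')
print(sched.betas.shape, sched.alpha_bars.shape)   # torch.Size([501]); beta_0 = 0 is prepended
t = sched.uniform_sample_t(batch_size=4)           # e.g. [137, 402, 88, 251], random steps in [1, 500]
print(sched.get_sigmas(torch.tensor(t)))           # per-step posterior sigmas
```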
diffposetalk/diffposetalk.py ADDED
@@ -0,0 +1,228 @@
1
+ import math
2
+ import tempfile
3
+ import warnings
4
+ from pathlib import Path
5
+
6
+ import cv2
7
+ import librosa
8
+ import numpy as np
9
+ import torch
10
+ import torch.nn.functional as F
11
+ from tqdm import tqdm
12
+ from pydantic import BaseModel
13
+
14
+ from .diff_talking_head import DiffTalkingHead
15
+ from .utils import NullableArgs, coef_dict_to_vertices, get_coef_dict
16
+ from .utils.media import combine_video_and_audio, convert_video, reencode_audio
17
+
18
+ warnings.filterwarnings('ignore', message='PySoundFile failed. Trying audioread instead.')
19
+
20
+ class DiffPoseTalkConfig(BaseModel):
21
+ no_context_audio_feat: bool = False
22
+ model_path: str = "pretrained_models/diffposetalk/iter_0110000.pt" # DPT/head-SA-hubert-WM
23
+ coef_stats: str = "pretrained_models/diffposetalk/stats_train.npz"
24
+ style_path: str = "pretrained_models/diffposetalk/style/L4H4-T0.1-BS32/iter_0034000/normal.npy"
25
+ dynamic_threshold_ratio: float = 0.99
26
+ dynamic_threshold_min: float = 1.0
27
+ dynamic_threshold_max: float = 4.0
28
+ scale_audio: float = 1.15
29
+ scale_style: float = 3.0
30
+
31
+ class DiffPoseTalk:
32
+ def __init__(self, config: DiffPoseTalkConfig = DiffPoseTalkConfig(), device="cuda"):
33
+ self.cfg = config
34
+ self.device = device
35
+
36
+ self.no_context_audio_feat = self.cfg.no_context_audio_feat
37
+ model_data = torch.load(self.cfg.model_path, map_location=self.device)
38
+
39
+ self.model_args = NullableArgs(model_data['args'])
40
+ self.model = DiffTalkingHead(self.model_args, self.device)
41
+ model_data['model'].pop('denoising_net.TE.pe')
42
+ self.model.load_state_dict(model_data['model'], strict=False)
43
+ self.model.to(self.device)
44
+ self.model.eval()
45
+
46
+ self.use_indicator = self.model_args.use_indicator
47
+ self.rot_repr = self.model_args.rot_repr
48
+ self.predict_head_pose = not self.model_args.no_head_pose
49
+ if self.model.use_style:
50
+ style_dir = Path(self.model_args.style_enc_ckpt)
51
+ style_dir = Path(*style_dir.with_suffix('').parts[-3::2])
52
+ self.style_dir = style_dir
53
+
54
+ # sequence
55
+ self.n_motions = self.model_args.n_motions
56
+ self.n_prev_motions = self.model_args.n_prev_motions
57
+ self.fps = self.model_args.fps
58
+ self.audio_unit = 16000. / self.fps # num of samples per frame
59
+ self.n_audio_samples = round(self.audio_unit * self.n_motions)
60
+ self.pad_mode = self.model_args.pad_mode
61
+
62
+ self.coef_stats = dict(np.load(self.cfg.coef_stats))
63
+ self.coef_stats = {k: torch.from_numpy(v).to(self.device) for k, v in self.coef_stats.items()}
64
+
65
+ if self.cfg.dynamic_threshold_ratio > 0:
66
+ self.dynamic_threshold = (self.cfg.dynamic_threshold_ratio, self.cfg.dynamic_threshold_min,
67
+ self.cfg.dynamic_threshold_max)
68
+ else:
69
+ self.dynamic_threshold = None
70
+
71
+
72
+ def infer_from_file(self, audio_path, shape_coef):
73
+ n_repetitions = 1
74
+ cfg_mode = None
75
+ cfg_cond = self.model.guiding_conditions
76
+ cfg_scale = []
77
+ for cond in cfg_cond:
78
+ if cond == 'audio':
79
+ cfg_scale.append(self.cfg.scale_audio)
80
+ elif cond == 'style':
81
+ cfg_scale.append(self.cfg.scale_style)
82
+
83
+ coef_dict = self.infer_coeffs(audio_path, shape_coef, self.cfg.style_path, n_repetitions,
84
+ cfg_mode, cfg_cond, cfg_scale, include_shape=True)
85
+ return coef_dict
86
+
87
+ @torch.no_grad()
88
+ def infer_coeffs(self, audio, shape_coef, style_feat=None, n_repetitions=1,
89
+ cfg_mode=None, cfg_cond=None, cfg_scale=1.15, include_shape=False):
90
+ # Returns dict[str, (n_repetitions, L, *)]
91
+ # Step 1: Preprocessing
92
+ # Preprocess audio
93
+ if isinstance(audio, (str, Path)):
94
+ audio, _ = librosa.load(audio, sr=16000, mono=True)
95
+ if isinstance(audio, np.ndarray):
96
+ audio = torch.from_numpy(audio).to(self.device)
97
+ assert audio.ndim == 1, 'Audio must be 1D tensor.'
98
+ audio_mean, audio_std = torch.mean(audio), torch.std(audio)
99
+ audio = (audio - audio_mean) / (audio_std + 1e-5)
100
+
101
+ # Preprocess shape coefficient
102
+ if isinstance(shape_coef, (str, Path)):
103
+ shape_coef = np.load(shape_coef)
104
+ if not isinstance(shape_coef, np.ndarray):
105
+ shape_coef = shape_coef['shape']
106
+ if isinstance(shape_coef, np.ndarray):
107
+ shape_coef = torch.from_numpy(shape_coef).float().to(self.device)
108
+ assert shape_coef.ndim <= 2, 'Shape coefficient must be 1D or 2D tensor.'
109
+ if shape_coef.ndim > 1:
110
+ # use the first frame as the shape coefficient
111
+ shape_coef = shape_coef[0]
112
+ original_shape_coef = shape_coef.clone()
113
+ if self.coef_stats is not None:
114
+ shape_coef = (shape_coef - self.coef_stats['shape_mean']) / self.coef_stats['shape_std']
115
+ shape_coef = shape_coef.unsqueeze(0).expand(n_repetitions, -1)
116
+
117
+ # Preprocess style feature if given
118
+ if style_feat is not None:
119
+ assert self.model.use_style
120
+ if isinstance(style_feat, (str, Path)):
121
+ style_feat = Path(style_feat)
122
+ if not style_feat.exists() and not style_feat.is_absolute():
123
+ style_feat = style_feat.parent / self.style_dir / style_feat.name
124
+ style_feat = np.load(style_feat)
125
+ if not isinstance(style_feat, np.ndarray):
126
+ style_feat = style_feat['style']
127
+ if isinstance(style_feat, np.ndarray):
128
+ style_feat = torch.from_numpy(style_feat).float().to(self.device)
129
+ assert style_feat.ndim == 1, 'Style feature must be 1D tensor.'
130
+ style_feat = style_feat.unsqueeze(0).expand(n_repetitions, -1)
131
+
132
+ # Step 2: Predict motion coef
133
+ # divide into synthesize units and do synthesize
134
+ clip_len = int(len(audio) / 16000 * self.fps)
135
+ stride = self.n_motions
136
+ if clip_len <= self.n_motions:
137
+ n_subdivision = 1
138
+ else:
139
+ n_subdivision = math.ceil(clip_len / stride)
140
+
141
+ # Prepare audio input
142
+ n_padding_audio_samples = self.n_audio_samples * n_subdivision - len(audio)
143
+ n_padding_frames = math.ceil(n_padding_audio_samples / self.audio_unit)
144
+ if n_padding_audio_samples > 0:
145
+ if self.pad_mode == 'zero':
146
+ padding_value = 0
147
+ elif self.pad_mode == 'replicate':
148
+ padding_value = audio[-1]
149
+ else:
150
+ raise ValueError(f'Unknown pad mode: {self.pad_mode}')
151
+ audio = F.pad(audio, (0, n_padding_audio_samples), value=padding_value)
152
+
153
+ if not self.no_context_audio_feat:
154
+ audio_feat = self.model.extract_audio_feature(audio.unsqueeze(0), self.n_motions * n_subdivision)
155
+
156
+ # Generate `self.n_motions` new frames at one time, and use the last `self.n_prev_motions` frames
157
+ # from the previous generation as the initial motion condition
158
+ coef_list = []
159
+ for i in range(0, n_subdivision):
160
+ start_idx = i * stride
161
+ end_idx = start_idx + self.n_motions
162
+ indicator = torch.ones((n_repetitions, self.n_motions)).to(self.device) if self.use_indicator else None
163
+ if indicator is not None and i == n_subdivision - 1 and n_padding_frames > 0:
164
+ indicator[:, -n_padding_frames:] = 0
165
+ if not self.no_context_audio_feat:
166
+ audio_in = audio_feat[:, start_idx:end_idx].expand(n_repetitions, -1, -1)
167
+ else:
168
+ audio_in = audio[round(start_idx * self.audio_unit):round(end_idx * self.audio_unit)].unsqueeze(0)
169
+
170
+ # generate motion coefficients
171
+ if i == 0:
172
+ # -> (N, L, d_motion=n_code_per_frame * code_dim)
173
+ motion_feat, noise, prev_audio_feat = self.model.sample(audio_in, shape_coef, style_feat,
174
+ indicator=indicator, cfg_mode=cfg_mode,
175
+ cfg_cond=cfg_cond, cfg_scale=cfg_scale,
176
+ dynamic_threshold=self.dynamic_threshold)
177
+ else:
178
+ motion_feat, noise, prev_audio_feat = self.model.sample(audio_in, shape_coef, style_feat,
179
+ prev_motion_feat, prev_audio_feat, noise,
180
+ indicator=indicator, cfg_mode=cfg_mode,
181
+ cfg_cond=cfg_cond, cfg_scale=cfg_scale,
182
+ dynamic_threshold=self.dynamic_threshold)
183
+ prev_motion_feat = motion_feat[:, -self.n_prev_motions:].clone()
184
+ prev_audio_feat = prev_audio_feat[:, -self.n_prev_motions:]
185
+
186
+ motion_coef = motion_feat
187
+ if i == n_subdivision - 1 and n_padding_frames > 0:
188
+ motion_coef = motion_coef[:, :-n_padding_frames] # delete padded frames
189
+ coef_list.append(motion_coef)
190
+
191
+ motion_coef = torch.cat(coef_list, dim=1)
192
+
193
+ # Step 3: restore to coef dict
194
+ coef_dict = get_coef_dict(motion_coef, None, self.coef_stats, self.predict_head_pose, self.rot_repr)
195
+ if include_shape:
196
+ coef_dict['shape'] = original_shape_coef[None, None].expand(n_repetitions, motion_coef.shape[1], -1)
197
+ return self.coef_to_a1_format(coef_dict)
198
+
199
+ def coef_to_a1_format(self, coef_dict):
200
+ n_frames = coef_dict['exp'].shape[1]
201
+ new_coef_dict = []
202
+ for i in range(n_frames):
203
+
204
+ new_coef_dict.append({
205
+ "expression_params": coef_dict["exp"][0, i:i+1],
206
+ "jaw_params": coef_dict["pose"][0, i:i+1, 3:],
207
+ "eye_pose_params": torch.zeros(1, 6).type_as(coef_dict["pose"]),
208
+ "pose_params": coef_dict["pose"][0, i:i+1, :3],
209
+ "eyelid_params": None
210
+ })
211
+ return new_coef_dict
212
+
213
+
214
+
215
+
216
+
217
+ @staticmethod
218
+ def _pad_coef(coef, n_frames, elem_ndim=1):
219
+ if coef.ndim == elem_ndim:
220
+ coef = coef[None]
221
+ elem_shape = coef.shape[1:]
222
+ if coef.shape[0] >= n_frames:
223
+ new_coef = coef[:n_frames]
224
+ else:
225
+ # repeat the last coef frame
226
+ new_coef = torch.cat([coef, coef[[-1]].expand(n_frames - coef.shape[0], *elem_shape)], dim=0)
227
+ return new_coef # (n_frames, *elem_shape)
228
+
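For reference, a minimal sketch of the windowing/padding arithmetic performed in Step 2 above, assuming 16 kHz audio, fps = 25 and n_motions = 100 (these values are assumptions here; on the real class they come from the config):

    import math

    sr, fps, n_motions = 16000, 25, 100        # assumed defaults
    audio_unit = sr / fps                      # 640 audio samples per motion frame
    n_audio_samples = round(audio_unit * n_motions)

    audio_len = 88000                          # e.g. a 5.5 s clip
    clip_len = int(audio_len / sr * fps)       # 137 motion frames
    n_subdivision = max(1, math.ceil(clip_len / n_motions))        # 2 windows
    n_pad_samples = n_audio_samples * n_subdivision - audio_len    # 40000 padded samples
    n_pad_frames = math.ceil(n_pad_samples / audio_unit)           # 63 padded frames
    print(clip_len, n_subdivision, n_pad_frames)  # 137 2 63 -> the last 63 frames are trimmed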
diffposetalk/hubert.py ADDED
@@ -0,0 +1,51 @@
1
+ from transformers import HubertModel
2
+ from transformers.modeling_outputs import BaseModelOutput
3
+
4
+ from .wav2vec2 import linear_interpolation
5
+
6
+ _CONFIG_FOR_DOC = 'HubertConfig'
7
+
8
+
9
+ class HubertModel(HubertModel):
10
+ def __init__(self, config):
11
+ super().__init__(config)
12
+
13
+ def forward(self, input_values, output_fps=25, attention_mask=None, output_attentions=None,
14
+ output_hidden_states=None, return_dict=None, frame_num=None):
15
+ self.config.output_attentions = True
16
+
17
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
18
+ output_hidden_states = (
19
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states)
20
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
21
+
22
+ extract_features = self.feature_extractor(input_values) # (N, C, L)
23
+ # Resample the audio feature @ 50 fps to `output_fps`.
24
+ if frame_num is not None:
25
+ extract_features_len = round(frame_num * 50 / output_fps)
26
+ extract_features = extract_features[:, :, :extract_features_len]
27
+ extract_features = linear_interpolation(extract_features, 50, output_fps, output_len=frame_num)
28
+ extract_features = extract_features.transpose(1, 2) # (N, L, C)
29
+
30
+ if attention_mask is not None:
31
+ # compute reduced attention_mask corresponding to feature vectors
32
+ attention_mask = self._get_feature_vector_attention_mask(extract_features.shape[1], attention_mask)
33
+
34
+ hidden_states = self.feature_projection(extract_features)
35
+ hidden_states = self._mask_hidden_states(hidden_states)
36
+
37
+ encoder_outputs = self.encoder(
38
+ hidden_states,
39
+ attention_mask=attention_mask,
40
+ output_attentions=output_attentions,
41
+ output_hidden_states=output_hidden_states,
42
+ return_dict=return_dict,
43
+ )
44
+
45
+ hidden_states = encoder_outputs[0]
46
+
47
+ if not return_dict:
48
+ return (hidden_states,) + encoder_outputs[1:]
49
+
50
+ return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_outputs.hidden_states,
51
+ attentions=encoder_outputs.attentions, )
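To make the resampling step concrete, here is a hedged sketch using only the linear_interpolation helper; the random tensor stands in for the real feature extractor output (roughly 50 feature frames per second of 16 kHz audio):

    import torch
    from diffposetalk.wav2vec2 import linear_interpolation

    feat_50hz = torch.randn(1, 512, 100)   # stand-in for feature_extractor output of a 2 s clip
    feat_25fps = linear_interpolation(feat_50hz, input_fps=50, output_fps=25, output_len=50)
    print(feat_25fps.shape)                # torch.Size([1, 512, 50])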
diffposetalk/utils/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .common import *
diffposetalk/utils/common.py ADDED
@@ -0,0 +1,378 @@
1
+ from functools import reduce
2
+ from pathlib import Path
3
+
4
+ import torch
5
+ import torch.nn.functional as F
6
+
7
+
8
+ class NullableArgs:
9
+ def __init__(self, namespace):
10
+ for key, value in namespace.__dict__.items():
11
+ setattr(self, key, value)
12
+
13
+ def __getattr__(self, key):
14
+ # when an attribute lookup has not found the attribute
15
+ if key == 'align_mask_width':
16
+ if 'use_alignment_mask' in self.__dict__:
17
+ return 1 if self.use_alignment_mask else 0
18
+ else:
19
+ return 0
20
+ if key == 'no_head_pose':
21
+ return not self.predict_head_pose
22
+ if key == 'no_use_learnable_pe':
23
+ return not self.use_learnable_pe
24
+
25
+ return None
26
+
27
+
28
+ def count_parameters(model):
29
+ return sum(p.numel() for p in model.parameters() if p.requires_grad)
30
+
31
+
32
+ def get_option_text(args, parser):
33
+ message = ''
34
+ for k, v in sorted(vars(args).items()):
35
+ comment = ''
36
+ default = parser.get_default(k)
37
+ if v != default:
38
+ comment = f'\t[default: {str(default)}]'
39
+ message += f'{str(k):>30}: {str(v):<30}{comment}\n'
40
+ return message
41
+
42
+
43
+ def get_model_path(exp_name, iteration, model_type='DPT'):
44
+ exp_root_dir = Path(__file__).parent.parent / 'experiments' / model_type
45
+ exp_dir = exp_root_dir / exp_name
46
+ if not exp_dir.exists():
47
+ exp_dir = next(exp_root_dir.glob(f'{exp_name}*'))
48
+ model_path = exp_dir / f'checkpoints/iter_{iteration:07}.pt'
49
+ return model_path, exp_dir.relative_to(exp_root_dir)
50
+
51
+
52
+ def get_pose_input(coef_dict, rot_repr, with_global_pose):
53
+ if rot_repr == 'aa':
54
+ pose_input = coef_dict['pose'] if with_global_pose else coef_dict['pose'][..., -3:]
55
+ # Remove mouth rotation round y, z axis
56
+ pose_input = pose_input[..., :-2]
57
+ else:
58
+ raise ValueError(f'Unknown rotation representation: {rot_repr}')
59
+ return pose_input
60
+
61
+
62
+ def get_motion_coef(coef_dict, rot_repr, with_global_pose=False, norm_stats=None):
63
+ if norm_stats is not None:
64
+ if rot_repr == 'aa':
65
+ keys = ['exp', 'pose']
66
+ else:
67
+ raise ValueError(f'Unknown rotation representation {rot_repr}!')
68
+
69
+ coef_dict = {k: (coef_dict[k] - norm_stats[f'{k}_mean']) / norm_stats[f'{k}_std'] for k in keys}
70
+ pose_coef = get_pose_input(coef_dict, rot_repr, with_global_pose)
71
+ return torch.cat([coef_dict['exp'], pose_coef], dim=-1)
72
+
73
+
74
+ def get_coef_dict(motion_coef, shape_coef=None, denorm_stats=None, with_global_pose=False, rot_repr='aa'):
75
+ coef_dict = {
76
+ 'exp': motion_coef[..., :50]
77
+ }
78
+ if rot_repr == 'aa':
79
+ if with_global_pose:
80
+ coef_dict['pose'] = motion_coef[..., 50:]
81
+ else:
82
+ placeholder = torch.zeros_like(motion_coef[..., :3])
83
+ coef_dict['pose'] = torch.cat([placeholder, motion_coef[..., -1:]], dim=-1)
84
+ # Add back rotation around y, z axis
85
+ coef_dict['pose'] = torch.cat([coef_dict['pose'], torch.zeros_like(motion_coef[..., :2])], dim=-1)
86
+ else:
87
+ raise ValueError(f'Unknown rotation representation {rot_repr}!')
88
+
89
+ if shape_coef is not None:
90
+ if motion_coef.ndim == 3:
91
+ if shape_coef.ndim == 2:
92
+ shape_coef = shape_coef.unsqueeze(1)
93
+ if shape_coef.shape[1] == 1:
94
+ shape_coef = shape_coef.expand(-1, motion_coef.shape[1], -1)
95
+
96
+ coef_dict['shape'] = shape_coef
97
+
98
+ if denorm_stats is not None:
99
+ coef_dict = {k: coef_dict[k] * denorm_stats[f'{k}_std'] + denorm_stats[f'{k}_mean'] for k in coef_dict}
100
+
101
+ if not with_global_pose:
102
+ if rot_repr == 'aa':
103
+ coef_dict['pose'][..., :3] = 0
104
+ else:
105
+ raise ValueError(f'Unknown rotation representation {rot_repr}!')
106
+
107
+ return coef_dict
108
+
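A hedged sanity check of the channel layout this function assumes (50 expression coefficients followed by the pose channels); the 51-channel input corresponds to the no-global-pose case, where only the jaw rotation channel is kept:

    import torch
    from diffposetalk.utils.common import get_coef_dict

    motion_coef = torch.randn(2, 10, 51)   # (batch, frames, 50 exp + 1 jaw channel)
    coefs = get_coef_dict(motion_coef, shape_coef=None, denorm_stats=None,
                          with_global_pose=False, rot_repr='aa')
    print(coefs['exp'].shape, coefs['pose'].shape)  # (2, 10, 50) and (2, 10, 6)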
109
+
110
+ def coef_dict_to_vertices(coef_dict, flame, rot_repr='aa', ignore_global_rot=False, flame_batch_size=512):
111
+ shape = coef_dict['exp'].shape[:-1]
112
+ coef_dict = {k: v.view(-1, v.shape[-1]) for k, v in coef_dict.items()}
113
+ n_samples = reduce(lambda x, y: x * y, shape, 1)
114
+
115
+ # Convert to vertices
116
+ vert_list = []
117
+ for i in range(0, n_samples, flame_batch_size):
118
+ batch_coef_dict = {k: v[i:i + flame_batch_size] for k, v in coef_dict.items()}
119
+ if rot_repr == 'aa':
120
+ vert, _, _ = flame(
121
+ batch_coef_dict['shape'], batch_coef_dict['exp'], batch_coef_dict['pose'],
122
+ pose2rot=True, ignore_global_rot=ignore_global_rot, return_lm2d=False, return_lm3d=False)
123
+ else:
124
+ raise ValueError(f'Unknown rot_repr: {rot_repr}')
125
+ vert_list.append(vert)
126
+
127
+ vert_list = torch.cat(vert_list, dim=0) # (n_samples, 5023, 3)
128
+ vert_list = vert_list.view(*shape, -1, 3) # (..., 5023, 3)
129
+
130
+ return vert_list
131
+
132
+
133
+ def compute_loss(args, is_starting_sample, shape_coef, motion_coef_gt, noise, target, prev_motion_coef, coef_stats,
134
+ flame, end_idx=None):
135
+ if args.criterion.lower() == 'l2':
136
+ criterion_func = F.mse_loss
137
+ elif args.criterion.lower() == 'l1':
138
+ criterion_func = F.l1_loss
139
+ else:
140
+ raise NotImplementedError(f'Criterion {args.criterion} not implemented.')
141
+
142
+ loss_vert = None
143
+ loss_vel = None
144
+ loss_smooth = None
145
+ loss_head_angle = None
146
+ loss_head_vel = None
147
+ loss_head_smooth = None
148
+ loss_head_trans_vel = None
149
+ loss_head_trans_accel = None
150
+ loss_head_trans = None
151
+ if args.target == 'noise':
152
+ loss_noise = criterion_func(noise, target[:, args.n_prev_motions:], reduction='none')
153
+ elif args.target == 'sample':
154
+ if is_starting_sample:
155
+ target = target[:, args.n_prev_motions:]
156
+ else:
157
+ motion_coef_gt = torch.cat([prev_motion_coef, motion_coef_gt], dim=1)
158
+ if args.no_constrain_prev:
159
+ target = torch.cat([prev_motion_coef, target[:, args.n_prev_motions:]], dim=1)
160
+
161
+ loss_noise = criterion_func(motion_coef_gt, target, reduction='none')
162
+
163
+ if args.l_vert > 0 or args.l_vel > 0:
164
+ coef_gt = get_coef_dict(motion_coef_gt, shape_coef, coef_stats, with_global_pose=False,
165
+ rot_repr=args.rot_repr)
166
+ coef_pred = get_coef_dict(target, shape_coef, coef_stats, with_global_pose=False,
167
+ rot_repr=args.rot_repr)
168
+ seq_len = target.shape[1]
169
+
170
+ if args.rot_repr == 'aa':
171
+ verts_gt, _, _ = flame(coef_gt['shape'].view(-1, 100), coef_gt['exp'].view(-1, 50),
172
+ coef_gt['pose'].view(-1, 6), return_lm2d=False, return_lm3d=False)
173
+ verts_pred, _, _ = flame(coef_pred['shape'].view(-1, 100), coef_pred['exp'].view(-1, 50),
174
+ coef_pred['pose'].view(-1, 6), return_lm2d=False, return_lm3d=False)
175
+ else:
176
+ raise ValueError(f'Unknown rotation representation {args.rot_repr}!')
177
+ verts_gt = verts_gt.view(-1, seq_len, 5023, 3)
178
+ verts_pred = verts_pred.view(-1, seq_len, 5023, 3)
179
+
180
+ if args.l_vert > 0:
181
+ loss_vert = criterion_func(verts_gt, verts_pred, reduction='none')
182
+
183
+ if args.l_vel > 0:
184
+ vel_gt = verts_gt[:, 1:] - verts_gt[:, :-1]
185
+ vel_pred = verts_pred[:, 1:] - verts_pred[:, :-1]
186
+ loss_vel = criterion_func(vel_gt, vel_pred, reduction='none')
187
+
188
+ if args.l_smooth > 0:
189
+ vel_pred = verts_pred[:, 1:] - verts_pred[:, :-1]
190
+ loss_smooth = criterion_func(vel_pred[:, 1:], vel_pred[:, :-1], reduction='none')
191
+
192
+ # head pose
193
+ if not args.no_head_pose:
194
+ if args.rot_repr == 'aa':
195
+ head_pose_gt = motion_coef_gt[:, :, 50:53]
196
+ head_pose_pred = target[:, :, 50:53]
197
+ else:
198
+ raise ValueError(f'Unknown rotation representation {args.rot_repr}!')
199
+
200
+ if args.l_head_angle > 0:
201
+ loss_head_angle = criterion_func(head_pose_gt, head_pose_pred, reduction='none')
202
+
203
+ if args.l_head_vel > 0:
204
+ head_vel_gt = head_pose_gt[:, 1:] - head_pose_gt[:, :-1]
205
+ head_vel_pred = head_pose_pred[:, 1:] - head_pose_pred[:, :-1]
206
+ loss_head_vel = criterion_func(head_vel_gt, head_vel_pred, reduction='none')
207
+
208
+ if args.l_head_smooth > 0:
209
+ head_vel_pred = head_pose_pred[:, 1:] - head_pose_pred[:, :-1]
210
+ loss_head_smooth = criterion_func(head_vel_pred[:, 1:], head_vel_pred[:, :-1], reduction='none')
211
+
212
+ if not is_starting_sample and args.l_head_trans > 0:
213
+ # # version 1: constrain both the predicted previous and current motions (x_{-3} ~ x_{2})
214
+ # head_pose_trans = head_pose_pred[:, args.n_prev_motions - 3:args.n_prev_motions + 3]
215
+ # head_vel_pred = head_pose_trans[:, 1:] - head_pose_trans[:, :-1]
216
+ # head_accel_pred = head_vel_pred[:, 1:] - head_vel_pred[:, :-1]
217
+
218
+ # version 2: constrain only the predicted current motions (x_{0} ~ x_{2})
219
+ head_pose_trans = torch.cat([head_pose_gt[:, args.n_prev_motions - 3:args.n_prev_motions],
220
+ head_pose_pred[:, args.n_prev_motions:args.n_prev_motions + 3]], dim=1)
221
+ head_vel_pred = head_pose_trans[:, 1:] - head_pose_trans[:, :-1]
222
+ head_accel_pred = head_vel_pred[:, 1:] - head_vel_pred[:, :-1]
223
+
224
+ # will constrain x_{-2|0} ~ x_{1}
225
+ loss_head_trans_vel = criterion_func(head_vel_pred[:, 2:4], head_vel_pred[:, 1:3], reduction='none')
226
+ # will constrain x_{-3|0} ~ x_{2}
227
+ loss_head_trans_accel = criterion_func(head_accel_pred[:, 1:], head_accel_pred[:, :-1],
228
+ reduction='none')
229
+ else:
230
+ raise ValueError(f'Unknown diffusion target: {args.target}')
231
+
232
+ if end_idx is None:
233
+ mask = torch.ones((target.shape[0], args.n_motions), dtype=torch.bool, device=target.device)
234
+ else:
235
+ mask = torch.arange(args.n_motions, device=target.device).expand(target.shape[0], -1) < end_idx.unsqueeze(1)
236
+
237
+ if args.target == 'sample' and not is_starting_sample:
238
+ if args.no_constrain_prev:
239
+ # Warning: this option will be deprecated in the future
240
+ mask = torch.cat([torch.zeros_like(mask[:, :args.n_prev_motions]), mask], dim=1)
241
+ else:
242
+ mask = torch.cat([torch.ones_like(mask[:, :args.n_prev_motions]), mask], dim=1)
243
+
244
+ loss_noise = loss_noise[mask].mean()
245
+ if loss_vert is not None:
246
+ loss_vert = loss_vert[mask].mean()
247
+ if loss_vel is not None:
248
+ loss_vel = loss_vel[mask[:, 1:]]
249
+ loss_vel = loss_vel.mean() if torch.numel(loss_vel) > 0 else None
250
+ if loss_smooth is not None:
251
+ loss_smooth = loss_smooth[mask[:, 2:]]
252
+ loss_smooth = loss_smooth.mean() if torch.numel(loss_smooth) > 0 else None
253
+ if loss_head_angle is not None:
254
+ loss_head_angle = loss_head_angle[mask].mean()
255
+ if loss_head_vel is not None:
256
+ loss_head_vel = loss_head_vel[mask[:, 1:]]
257
+ loss_head_vel = loss_head_vel.mean() if torch.numel(loss_head_vel) > 0 else None
258
+ if loss_head_smooth is not None:
259
+ loss_head_smooth = loss_head_smooth[mask[:, 2:]]
260
+ loss_head_smooth = loss_head_smooth.mean() if torch.numel(loss_head_smooth) > 0 else None
261
+ if loss_head_trans_vel is not None:
262
+ vel_mask = mask[:, args.n_prev_motions:args.n_prev_motions + 2]
263
+ accel_mask = mask[:, args.n_prev_motions:args.n_prev_motions + 3]
264
+ loss_head_trans_vel = loss_head_trans_vel[vel_mask].mean()
265
+ loss_head_trans_accel = loss_head_trans_accel[accel_mask].mean()
266
+ loss_head_trans = loss_head_trans_vel + loss_head_trans_accel
267
+
268
+ return loss_noise, loss_vert, loss_vel, loss_smooth, loss_head_angle, loss_head_vel, loss_head_smooth, \
269
+ loss_head_trans
270
+
271
+
272
+ def _truncate_audio(audio, end_idx, pad_mode='zero'):
273
+ batch_size = audio.shape[0]
274
+ audio_trunc = audio.clone()
275
+ if pad_mode == 'replicate':
276
+ for i in range(batch_size):
277
+ audio_trunc[i, end_idx[i]:] = audio_trunc[i, end_idx[i] - 1]
278
+ elif pad_mode == 'zero':
279
+ for i in range(batch_size):
280
+ audio_trunc[i, end_idx[i]:] = 0
281
+ else:
282
+ raise ValueError(f'Unknown pad mode {pad_mode}!')
283
+
284
+ return audio_trunc
285
+
286
+
287
+ def _truncate_coef_dict(coef_dict, end_idx, pad_mode='zero'):
288
+ batch_size = coef_dict['exp'].shape[0]
289
+ coef_dict_trunc = {k: v.clone() for k, v in coef_dict.items()}
290
+ if pad_mode == 'replicate':
291
+ for i in range(batch_size):
292
+ for k in coef_dict_trunc:
293
+ coef_dict_trunc[k][i, end_idx[i]:] = coef_dict_trunc[k][i, end_idx[i] - 1]
294
+ elif pad_mode == 'zero':
295
+ for i in range(batch_size):
296
+ for k in coef_dict:
297
+ coef_dict_trunc[k][i, end_idx[i]:] = 0
298
+ else:
299
+ raise ValueError(f'Unknown pad mode: {pad_mode}!')
300
+
301
+ return coef_dict_trunc
302
+
303
+
304
+ def truncate_coef_dict_and_audio(audio, coef_dict, n_motions, audio_unit=640, pad_mode='zero'):
305
+ batch_size = audio.shape[0]
306
+ end_idx = torch.randint(1, n_motions, (batch_size,), device=audio.device)
307
+ audio_end_idx = (end_idx * audio_unit).long()
308
+ # mask = torch.arange(n_motions, device=audio.device).expand(batch_size, -1) < end_idx.unsqueeze(1)
309
+
310
+ # truncate audio
311
+ audio_trunc = _truncate_audio(audio, audio_end_idx, pad_mode=pad_mode)
312
+
313
+ # truncate coef dict
314
+ coef_dict_trunc = _truncate_coef_dict(coef_dict, end_idx, pad_mode=pad_mode)
315
+
316
+ return audio_trunc, coef_dict_trunc, end_idx
317
+
318
+
319
+ def truncate_motion_coef_and_audio(audio, motion_coef, n_motions, audio_unit=640, pad_mode='zero'):
320
+ batch_size = audio.shape[0]
321
+ end_idx = torch.randint(1, n_motions, (batch_size,), device=audio.device)
322
+ audio_end_idx = (end_idx * audio_unit).long()
323
+ # mask = torch.arange(n_motions, device=audio.device).expand(batch_size, -1) < end_idx.unsqueeze(1)
324
+
325
+ # truncate audio
326
+ audio_trunc = _truncate_audio(audio, audio_end_idx, pad_mode=pad_mode)
327
+
328
+ # prepare coef dict and stats
329
+ coef_dict = {'exp': motion_coef[..., :50], 'pose_any': motion_coef[..., 50:]}
330
+
331
+ # truncate coef dict
332
+ coef_dict_trunc = _truncate_coef_dict(coef_dict, end_idx, pad_mode=pad_mode)
333
+ motion_coef_trunc = torch.cat([coef_dict_trunc['exp'], coef_dict_trunc['pose_any']], dim=-1)
334
+
335
+ return audio_trunc, motion_coef_trunc, end_idx
336
+
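A short usage sketch of the truncation helper above, with random stand-ins for a training batch (n_motions = 100 and audio_unit = 640 are the values assumed elsewhere in this file):

    import torch
    from diffposetalk.utils.common import truncate_motion_coef_and_audio

    audio = torch.randn(4, 100 * 640)     # 4 windows of 4 s of 16 kHz audio each
    motion = torch.randn(4, 100, 51)      # matching motion coefficients
    audio_t, motion_t, end_idx = truncate_motion_coef_and_audio(audio, motion, n_motions=100)
    # Shapes are unchanged; everything past the random cut point end_idx[i] is zero-padded.
    print(audio_t.shape, motion_t.shape, end_idx)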
337
+
338
+ def nt_xent_loss(feature_a, feature_b, temperature):
339
+ """
340
+ Normalized temperature-scaled cross entropy loss.
341
+
342
+ (Adapted from https://github.com/sthalles/SimCLR/blob/master/simclr.py)
343
+
344
+ Args:
345
+ feature_a (torch.Tensor): shape (batch_size, feature_dim)
346
+ feature_b (torch.Tensor): shape (batch_size, feature_dim)
347
+ temperature (float): temperature scaling factor
348
+
349
+ Returns:
350
+ torch.Tensor: scalar
351
+ """
352
+ batch_size = feature_a.shape[0]
353
+ device = feature_a.device
354
+
355
+ features = torch.cat([feature_a, feature_b], dim=0)
356
+
357
+ labels = torch.cat([torch.arange(batch_size), torch.arange(batch_size)], dim=0)
358
+ labels = (labels.unsqueeze(0) == labels.unsqueeze(1))
359
+ labels = labels.to(device)
360
+
361
+ features = F.normalize(features, dim=1)
362
+ similarity_matrix = torch.matmul(features, features.T)
363
+
364
+ # discard the main diagonal from both: labels and similarities matrix
365
+ mask = torch.eye(labels.shape[0], dtype=torch.bool).to(device)
366
+ labels = labels[~mask].view(labels.shape[0], -1)
367
+ similarity_matrix = similarity_matrix[~mask].view(labels.shape[0], -1)
368
+
369
+ # select the positives and negatives
370
+ positives = similarity_matrix[labels].view(labels.shape[0], -1)
371
+ negatives = similarity_matrix[~labels].view(labels.shape[0], -1)
372
+
373
+ logits = torch.cat([positives, negatives], dim=1)
374
+ logits = logits / temperature
375
+ labels = torch.zeros(labels.shape[0], dtype=torch.long).to(device)
376
+
377
+ loss = F.cross_entropy(logits, labels)
378
+ return loss
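And a minimal usage sketch of the contrastive loss above, with random features standing in for the two style embeddings of a positive pair:

    import torch
    from diffposetalk.utils.common import nt_xent_loss

    feat_a = torch.randn(8, 128)          # e.g. style features of clips from the same speaker
    feat_b = torch.randn(8, 128)
    loss = nt_xent_loss(feat_a, feat_b, temperature=0.1)
    print(loss.item())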
diffposetalk/utils/media.py ADDED
@@ -0,0 +1,35 @@
1
+ import shlex
2
+ import subprocess
3
+ from pathlib import Path
4
+
5
+
6
+ def combine_video_and_audio(video_file, audio_file, output, quality=17, copy_audio=True):
7
+ audio_codec = '-c:a copy' if copy_audio else ''
8
+ cmd = f'ffmpeg -i {video_file} -i {audio_file} -c:v libx264 -crf {quality} -pix_fmt yuv420p ' \
9
+ f'{audio_codec} -fflags +shortest -y -hide_banner -loglevel error {output}'
10
+ assert subprocess.run(shlex.split(cmd)).returncode == 0
11
+
12
+
13
+ def combine_frames_and_audio(frame_files, audio_file, fps, output, quality=17):
14
+ cmd = f'ffmpeg -framerate {fps} -i {frame_files} -i {audio_file} -c:v libx264 -crf {quality} -pix_fmt yuv420p ' \
15
+ f'-c:a copy -fflags +shortest -y -hide_banner -loglevel error {output}'
16
+ assert subprocess.run(shlex.split(cmd)).returncode == 0
17
+
18
+
19
+ def convert_video(video_file, output, quality=17):
20
+ cmd = f'ffmpeg -i {video_file} -c:v libx264 -crf {quality} -pix_fmt yuv420p ' \
21
+ f'-fflags +shortest -y -hide_banner -loglevel error {output}'
22
+ assert subprocess.run(shlex.split(cmd)).returncode == 0
23
+
24
+
25
+ def reencode_audio(audio_file, output):
26
+ cmd = f'ffmpeg -i {audio_file} -y -hide_banner -loglevel error {output}'
27
+ assert subprocess.run(shlex.split(cmd)).returncode == 0
28
+
29
+
30
+ def extract_frames(filename, output_dir, quality=1):
31
+ output_dir = Path(output_dir)
32
+ output_dir.mkdir(parents=True, exist_ok=True)
33
+ cmd = f'ffmpeg -i {filename} -qmin 1 -qscale:v {quality} -y -start_number 0 -hide_banner -loglevel error ' \
34
+ f'{output_dir / "%06d.jpg"}'
35
+ assert subprocess.run(shlex.split(cmd)).returncode == 0
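These helpers simply shell out to ffmpeg, which must be available on PATH; a hedged example with hypothetical file names (the calls only succeed when the inputs actually exist):

    from diffposetalk.utils.media import combine_video_and_audio, extract_frames

    # Hypothetical paths for illustration only.
    combine_video_and_audio('render.mp4', 'speech.wav', 'result.mp4', quality=17, copy_audio=False)
    extract_frames('result.mp4', 'frames/', quality=1)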
diffposetalk/utils/renderer.py ADDED
@@ -0,0 +1,147 @@
1
+ import os
2
+ import tempfile
3
+
4
+ import cv2
5
+ import kiui.mesh
6
+ import numpy as np
7
+
8
+ # os.environ['PYOPENGL_PLATFORM'] = 'osmesa' # osmesa or egl
9
+ os.environ['PYOPENGL_PLATFORM'] = 'egl'
10
+ import pyrender
11
+ import trimesh
12
+ # from psbody.mesh import Mesh
13
+
14
+
15
+ class MeshRenderer:
16
+ def __init__(self, size, fov=16 / 180 * np.pi, camera_pose=None, light_pose=None, black_bg=False):
17
+ # Camera
18
+ self.frustum = {'near': 0.01, 'far': 3.0}
19
+ self.camera = pyrender.PerspectiveCamera(yfov=fov, znear=self.frustum['near'],
20
+ zfar=self.frustum['far'], aspectRatio=1.0)
21
+
22
+ # Material
23
+ self.primitive_material = pyrender.material.MetallicRoughnessMaterial(
24
+ alphaMode='BLEND',
25
+ baseColorFactor=[0.3, 0.3, 0.3, 1.0],
26
+ metallicFactor=0.8,
27
+ roughnessFactor=0.8
28
+ )
29
+
30
+ # Lighting
31
+ light_color = np.array([1., 1., 1.])
32
+ self.light = pyrender.DirectionalLight(color=light_color, intensity=2)
33
+ self.light_angle = np.pi / 6.0
34
+
35
+ # Scene
36
+ self.scene = None
37
+ self._init_scene(black_bg)
38
+
39
+ # add camera and lighting
40
+ self._init_camera(camera_pose)
41
+ self._init_lighting(light_pose)
42
+
43
+ # Renderer
44
+ self.renderer = pyrender.OffscreenRenderer(*size, point_size=1.0)
45
+
46
+ def _init_scene(self, black_bg=False):
47
+ if black_bg:
48
+ self.scene = pyrender.Scene(ambient_light=[.2, .2, .2], bg_color=[0, 0, 0])
49
+ else:
50
+ self.scene = pyrender.Scene(ambient_light=[.2, .2, .2], bg_color=[255, 255, 255])
51
+
52
+ def _init_camera(self, camera_pose=None):
53
+ if camera_pose is None:
54
+ camera_pose = np.eye(4)
55
+ camera_pose[:3, 3] = np.array([0, 0, 1])
56
+ self.camera_pose = camera_pose.copy()
57
+ self.camera_node = self.scene.add(self.camera, pose=camera_pose)
58
+
59
+ def _init_lighting(self, light_pose=None):
60
+ if light_pose is None:
61
+ light_pose = np.eye(4)
62
+ light_pose[:3, 3] = np.array([0, 0, 1])
63
+ self.light_pose = light_pose.copy()
64
+
65
+ light_poses = self._get_light_poses(self.light_angle, light_pose)
66
+ self.light_nodes = [self.scene.add(self.light, pose=light_pose) for light_pose in light_poses]
67
+
68
+ def set_camera_pose(self, camera_pose):
69
+ self.camera_pose = camera_pose.copy()
70
+ self.scene.set_pose(self.camera_node, pose=camera_pose)
71
+
72
+ def set_lighting_pose(self, light_pose):
73
+ self.light_pose = light_pose.copy()
74
+
75
+ light_poses = self._get_light_poses(self.light_angle, light_pose)
76
+ for light_node, light_pose in zip(self.light_nodes, light_poses):
77
+ self.scene.set_pose(light_node, pose=light_pose)
78
+
79
+ def render_mesh(self, v, f, t_center, rot=np.zeros(3), tex_img=None, tex_uv=None,
80
+ camera_pose=None, light_pose=None):
81
+ # Prepare mesh
82
+ v[:] = cv2.Rodrigues(rot)[0].dot((v - t_center).T).T + t_center
83
+ if tex_img is not None:
84
+ tex = pyrender.Texture(source=tex_img, source_channels='RGB')
85
+ tex_material = pyrender.material.MetallicRoughnessMaterial(baseColorTexture=tex)
86
+ from kiui.mesh import Mesh
87
+ import torch
88
+ mesh = Mesh(
89
+ v=torch.from_numpy(v),
90
+ f=torch.from_numpy(f),
91
+ vt=tex_uv['vt'],
92
+ ft=tex_uv['ft']
93
+ )
94
+ with tempfile.NamedTemporaryFile(suffix='.obj') as f:
95
+ mesh.write_obj(f.name)
96
+ tri_mesh = trimesh.load(f.name, process=False)
97
+ return tri_mesh
98
+ # tri_mesh = self._pyrender_mesh_workaround(mesh)
99
+ render_mesh = pyrender.Mesh.from_trimesh(tri_mesh, material=tex_material)
100
+ else:
101
+ tri_mesh = trimesh.Trimesh(vertices=v, faces=f)
102
+ render_mesh = pyrender.Mesh.from_trimesh(tri_mesh, material=self.primitive_material, smooth=True)
103
+ mesh_node = self.scene.add(render_mesh, pose=np.eye(4))
104
+
105
+ # Change camera and lighting pose if necessary
106
+ if camera_pose is not None:
107
+ self.set_camera_pose(camera_pose)
108
+ if light_pose is not None:
109
+ self.set_lighting_pose(light_pose)
110
+
111
+ # Render
112
+ flags = pyrender.RenderFlags.SKIP_CULL_FACES
113
+ color, depth = self.renderer.render(self.scene, flags=flags)
114
+
115
+ # Remove mesh
116
+ self.scene.remove_node(mesh_node)
117
+
118
+ return color, depth
119
+
120
+ @staticmethod
121
+ def _get_light_poses(light_angle, light_pose):
122
+ light_poses = []
123
+ init_pos = light_pose[:3, 3].copy()
124
+
125
+ light_poses.append(light_pose.copy())
126
+
127
+ light_pose[:3, 3] = cv2.Rodrigues(np.array([light_angle, 0, 0]))[0].dot(init_pos)
128
+ light_poses.append(light_pose.copy())
129
+
130
+ light_pose[:3, 3] = cv2.Rodrigues(np.array([-light_angle, 0, 0]))[0].dot(init_pos)
131
+ light_poses.append(light_pose.copy())
132
+
133
+ light_pose[:3, 3] = cv2.Rodrigues(np.array([0, -light_angle, 0]))[0].dot(init_pos)
134
+ light_poses.append(light_pose.copy())
135
+
136
+ light_pose[:3, 3] = cv2.Rodrigues(np.array([0, light_angle, 0]))[0].dot(init_pos)
137
+ light_poses.append(light_pose.copy())
138
+
139
+ return light_poses
140
+
141
+ @staticmethod
142
+ def _pyrender_mesh_workaround(mesh):
143
+ # Workaround as pyrender requires number of vertices and uv coordinates to be the same
144
+ with tempfile.NamedTemporaryFile(suffix='.obj') as f:
145
+ mesh.write_obj(f.name)
146
+ tri_mesh = trimesh.load(f.name, process=False)
147
+ return tri_mesh
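A hedged sketch of the untextured rendering path above; it needs an EGL-capable environment, and the tiny tetrahedron here is only a stand-in for the FLAME mesh (5023 vertices) used by the rest of the code:

    import numpy as np
    from diffposetalk.utils.renderer import MeshRenderer

    renderer = MeshRenderer(size=(512, 512))
    v = np.array([[0., 0., 0.], [.1, 0., 0.], [0., .1, 0.], [0., 0., .1]])  # stand-in geometry
    f = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
    # render_mesh modifies the vertex array in place, so pass a copy.
    color, depth = renderer.render_mesh(v.copy(), f, t_center=v.mean(0))
    print(color.shape)   # (512, 512, 3) RGB frame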
diffposetalk/utils/rotation_conversions.py ADDED
@@ -0,0 +1,569 @@
1
+ # This code is based on https://github.com/Mathux/ACTOR.git
2
+ # Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
3
+ # Check PYTORCH3D_LICENCE before use
4
+
5
+ import functools
6
+ from typing import Optional
7
+
8
+ import torch
9
+ import torch.nn.functional as F
10
+
11
+
12
+ """
13
+ The transformation matrices returned from the functions in this file assume
14
+ the points on which the transformation will be applied are column vectors.
15
+ i.e. the R matrix is structured as
16
+
17
+ R = [
18
+ [Rxx, Rxy, Rxz],
19
+ [Ryx, Ryy, Ryz],
20
+ [Rzx, Rzy, Rzz],
21
+ ] # (3, 3)
22
+
23
+ This matrix can be applied to column vectors by post multiplication
24
+ by the points e.g.
25
+
26
+ points = [[0], [1], [2]] # (3 x 1) xyz coordinates of a point
27
+ transformed_points = R * points
28
+
29
+ To apply the same matrix to points which are row vectors, the R matrix
30
+ can be transposed and pre multiplied by the points:
31
+
32
+ e.g.
33
+ points = [[0, 1, 2]] # (1 x 3) xyz coordinates of a point
34
+ transformed_points = points * R.transpose(1, 0)
35
+ """
36
+
37
+
38
+ def quaternion_to_matrix(quaternions):
39
+ """
40
+ Convert rotations given as quaternions to rotation matrices.
41
+
42
+ Args:
43
+ quaternions: quaternions with real part first,
44
+ as tensor of shape (..., 4).
45
+
46
+ Returns:
47
+ Rotation matrices as tensor of shape (..., 3, 3).
48
+ """
49
+ r, i, j, k = torch.unbind(quaternions, -1)
50
+ two_s = 2.0 / (quaternions * quaternions).sum(-1)
51
+
52
+ o = torch.stack(
53
+ (
54
+ 1 - two_s * (j * j + k * k),
55
+ two_s * (i * j - k * r),
56
+ two_s * (i * k + j * r),
57
+ two_s * (i * j + k * r),
58
+ 1 - two_s * (i * i + k * k),
59
+ two_s * (j * k - i * r),
60
+ two_s * (i * k - j * r),
61
+ two_s * (j * k + i * r),
62
+ 1 - two_s * (i * i + j * j),
63
+ ),
64
+ -1,
65
+ )
66
+ return o.reshape(quaternions.shape[:-1] + (3, 3))
67
+
68
+
69
+ def _copysign(a, b):
70
+ """
71
+ Return a tensor where each element has the absolute value taken from the
72
+ corresponding element of a, with sign taken from the corresponding
73
+ element of b. This is like the standard copysign floating-point operation,
74
+ but is not careful about negative 0 and NaN.
75
+
76
+ Args:
77
+ a: source tensor.
78
+ b: tensor whose signs will be used, of the same shape as a.
79
+
80
+ Returns:
81
+ Tensor of the same shape as a with the signs of b.
82
+ """
83
+ signs_differ = (a < 0) != (b < 0)
84
+ return torch.where(signs_differ, -a, a)
85
+
86
+
87
+ def _sqrt_positive_part(x):
88
+ """
89
+ Returns torch.sqrt(torch.max(0, x))
90
+ but with a zero subgradient where x is 0.
91
+ """
92
+ ret = torch.zeros_like(x)
93
+ positive_mask = x > 0
94
+ ret[positive_mask] = torch.sqrt(x[positive_mask])
95
+ return ret
96
+
97
+
98
+ def matrix_to_quaternion(matrix):
99
+ """
100
+ Convert rotations given as rotation matrices to quaternions.
101
+
102
+ Args:
103
+ matrix: Rotation matrices as tensor of shape (..., 3, 3).
104
+
105
+ Returns:
106
+ quaternions with real part first, as tensor of shape (..., 4).
107
+ """
108
+ if matrix.size(-1) != 3 or matrix.size(-2) != 3:
109
+ raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.")
110
+ m00 = matrix[..., 0, 0]
111
+ m11 = matrix[..., 1, 1]
112
+ m22 = matrix[..., 2, 2]
113
+ o0 = 0.5 * _sqrt_positive_part(1 + m00 + m11 + m22)
114
+ x = 0.5 * _sqrt_positive_part(1 + m00 - m11 - m22)
115
+ y = 0.5 * _sqrt_positive_part(1 - m00 + m11 - m22)
116
+ z = 0.5 * _sqrt_positive_part(1 - m00 - m11 + m22)
117
+ o1 = _copysign(x, matrix[..., 2, 1] - matrix[..., 1, 2])
118
+ o2 = _copysign(y, matrix[..., 0, 2] - matrix[..., 2, 0])
119
+ o3 = _copysign(z, matrix[..., 1, 0] - matrix[..., 0, 1])
120
+ return torch.stack((o0, o1, o2, o3), -1)
121
+
122
+
123
+ def _axis_angle_rotation(axis: str, angle):
124
+ """
125
+ Return the rotation matrices for a rotation about one of the coordinate axes
126
+ used by an Euler-angle convention, for each value of the angle given.
127
+
128
+ Args:
129
+ axis: Axis label "X", "Y", or "Z".
130
+ angle: any shape tensor of Euler angles in radians
131
+
132
+ Returns:
133
+ Rotation matrices as tensor of shape (..., 3, 3).
134
+ """
135
+
136
+ cos = torch.cos(angle)
137
+ sin = torch.sin(angle)
138
+ one = torch.ones_like(angle)
139
+ zero = torch.zeros_like(angle)
140
+
141
+ if axis == "X":
142
+ R_flat = (one, zero, zero, zero, cos, -sin, zero, sin, cos)
143
+ if axis == "Y":
144
+ R_flat = (cos, zero, sin, zero, one, zero, -sin, zero, cos)
145
+ if axis == "Z":
146
+ R_flat = (cos, -sin, zero, sin, cos, zero, zero, zero, one)
147
+
148
+ return torch.stack(R_flat, -1).reshape(angle.shape + (3, 3))
149
+
150
+
151
+ def euler_angles_to_matrix(euler_angles, convention: str):
152
+ """
153
+ Convert rotations given as Euler angles in radians to rotation matrices.
154
+
155
+ Args:
156
+ euler_angles: Euler angles in radians as tensor of shape (..., 3).
157
+ convention: Convention string of three uppercase letters from
158
+ {"X", "Y", and "Z"}.
159
+
160
+ Returns:
161
+ Rotation matrices as tensor of shape (..., 3, 3).
162
+ """
163
+ if euler_angles.dim() == 0 or euler_angles.shape[-1] != 3:
164
+ raise ValueError("Invalid input euler angles.")
165
+ if len(convention) != 3:
166
+ raise ValueError("Convention must have 3 letters.")
167
+ if convention[1] in (convention[0], convention[2]):
168
+ raise ValueError(f"Invalid convention {convention}.")
169
+ for letter in convention:
170
+ if letter not in ("X", "Y", "Z"):
171
+ raise ValueError(f"Invalid letter {letter} in convention string.")
172
+ matrices = map(_axis_angle_rotation, convention, torch.unbind(euler_angles, -1))
173
+ return functools.reduce(torch.matmul, matrices)
174
+
175
+
176
+ def _angle_from_tan(
177
+ axis: str, other_axis: str, data, horizontal: bool, tait_bryan: bool
178
+ ):
179
+ """
180
+ Extract the first or third Euler angle from the two members of
181
+ the matrix which are positive constant times its sine and cosine.
182
+
183
+ Args:
184
+ axis: Axis label "X", "Y", or "Z" for the angle we are finding.
185
+ other_axis: Axis label "X", "Y", or "Z" for the middle axis in the
186
+ convention.
187
+ data: Rotation matrices as tensor of shape (..., 3, 3).
188
+ horizontal: Whether we are looking for the angle for the third axis,
189
+ which means the relevant entries are in the same row of the
190
+ rotation matrix. If not, they are in the same column.
191
+ tait_bryan: Whether the first and third axes in the convention differ.
192
+
193
+ Returns:
194
+ Euler Angles in radians for each matrix in dataset as a tensor
195
+ of shape (...).
196
+ """
197
+
198
+ i1, i2 = {"X": (2, 1), "Y": (0, 2), "Z": (1, 0)}[axis]
199
+ if horizontal:
200
+ i2, i1 = i1, i2
201
+ even = (axis + other_axis) in ["XY", "YZ", "ZX"]
202
+ if horizontal == even:
203
+ return torch.atan2(data[..., i1], data[..., i2])
204
+ if tait_bryan:
205
+ return torch.atan2(-data[..., i2], data[..., i1])
206
+ return torch.atan2(data[..., i2], -data[..., i1])
207
+
208
+
209
+ def _index_from_letter(letter: str):
210
+ if letter == "X":
211
+ return 0
212
+ if letter == "Y":
213
+ return 1
214
+ if letter == "Z":
215
+ return 2
216
+
217
+
218
+ def matrix_to_euler_angles(matrix, convention: str):
219
+ """
220
+ Convert rotations given as rotation matrices to Euler angles in radians.
221
+
222
+ Args:
223
+ matrix: Rotation matrices as tensor of shape (..., 3, 3).
224
+ convention: Convention string of three uppercase letters.
225
+
226
+ Returns:
227
+ Euler angles in radians as tensor of shape (..., 3).
228
+ """
229
+ if len(convention) != 3:
230
+ raise ValueError("Convention must have 3 letters.")
231
+ if convention[1] in (convention[0], convention[2]):
232
+ raise ValueError(f"Invalid convention {convention}.")
233
+ for letter in convention:
234
+ if letter not in ("X", "Y", "Z"):
235
+ raise ValueError(f"Invalid letter {letter} in convention string.")
236
+ if matrix.size(-1) != 3 or matrix.size(-2) != 3:
237
+ raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.")
238
+ i0 = _index_from_letter(convention[0])
239
+ i2 = _index_from_letter(convention[2])
240
+ tait_bryan = i0 != i2
241
+ if tait_bryan:
242
+ central_angle = torch.asin(
243
+ matrix[..., i0, i2] * (-1.0 if i0 - i2 in [-1, 2] else 1.0)
244
+ )
245
+ else:
246
+ central_angle = torch.acos(matrix[..., i0, i0])
247
+
248
+ o = (
249
+ _angle_from_tan(
250
+ convention[0], convention[1], matrix[..., i2], False, tait_bryan
251
+ ),
252
+ central_angle,
253
+ _angle_from_tan(
254
+ convention[2], convention[1], matrix[..., i0, :], True, tait_bryan
255
+ ),
256
+ )
257
+ return torch.stack(o, -1)
258
+
259
+
260
+ def random_quaternions(
261
+ n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
262
+ ):
263
+ """
264
+ Generate random quaternions representing rotations,
265
+ i.e. versors with nonnegative real part.
266
+
267
+ Args:
268
+ n: Number of quaternions in a batch to return.
269
+ dtype: Type to return.
270
+ device: Desired device of returned tensor. Default:
271
+ uses the current device for the default tensor type.
272
+ requires_grad: Whether the resulting tensor should have the gradient
273
+ flag set.
274
+
275
+ Returns:
276
+ Quaternions as tensor of shape (N, 4).
277
+ """
278
+ o = torch.randn((n, 4), dtype=dtype, device=device, requires_grad=requires_grad)
279
+ s = (o * o).sum(1)
280
+ o = o / _copysign(torch.sqrt(s), o[:, 0])[:, None]
281
+ return o
282
+
283
+
284
+ def random_rotations(
285
+ n: int, dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
286
+ ):
287
+ """
288
+ Generate random rotations as 3x3 rotation matrices.
289
+
290
+ Args:
291
+ n: Number of rotation matrices in a batch to return.
292
+ dtype: Type to return.
293
+ device: Device of returned tensor. Default: if None,
294
+ uses the current device for the default tensor type.
295
+ requires_grad: Whether the resulting tensor should have the gradient
296
+ flag set.
297
+
298
+ Returns:
299
+ Rotation matrices as tensor of shape (n, 3, 3).
300
+ """
301
+ quaternions = random_quaternions(
302
+ n, dtype=dtype, device=device, requires_grad=requires_grad
303
+ )
304
+ return quaternion_to_matrix(quaternions)
305
+
306
+
307
+ def random_rotation(
308
+ dtype: Optional[torch.dtype] = None, device=None, requires_grad=False
309
+ ):
310
+ """
311
+ Generate a single random 3x3 rotation matrix.
312
+
313
+ Args:
314
+ dtype: Type to return
315
+ device: Device of returned tensor. Default: if None,
316
+ uses the current device for the default tensor type
317
+ requires_grad: Whether the resulting tensor should have the gradient
318
+ flag set
319
+
320
+ Returns:
321
+ Rotation matrix as tensor of shape (3, 3).
322
+ """
323
+ return random_rotations(1, dtype, device, requires_grad)[0]
324
+
325
+
326
+ def standardize_quaternion(quaternions):
327
+ """
328
+ Convert a unit quaternion to a standard form: one in which the real
329
+ part is non negative.
330
+
331
+ Args:
332
+ quaternions: Quaternions with real part first,
333
+ as tensor of shape (..., 4).
334
+
335
+ Returns:
336
+ Standardized quaternions as tensor of shape (..., 4).
337
+ """
338
+ return torch.where(quaternions[..., 0:1] < 0, -quaternions, quaternions)
339
+
340
+
341
+ def quaternion_raw_multiply(a, b):
342
+ """
343
+ Multiply two quaternions.
344
+ Usual torch rules for broadcasting apply.
345
+
346
+ Args:
347
+ a: Quaternions as tensor of shape (..., 4), real part first.
348
+ b: Quaternions as tensor of shape (..., 4), real part first.
349
+
350
+ Returns:
351
+ The product of a and b, a tensor of quaternions shape (..., 4).
352
+ """
353
+ aw, ax, ay, az = torch.unbind(a, -1)
354
+ bw, bx, by, bz = torch.unbind(b, -1)
355
+ ow = aw * bw - ax * bx - ay * by - az * bz
356
+ ox = aw * bx + ax * bw + ay * bz - az * by
357
+ oy = aw * by - ax * bz + ay * bw + az * bx
358
+ oz = aw * bz + ax * by - ay * bx + az * bw
359
+ return torch.stack((ow, ox, oy, oz), -1)
360
+
361
+
362
+ def quaternion_multiply(a, b):
363
+ """
364
+ Multiply two quaternions representing rotations, returning the quaternion
365
+ representing their composition, i.e. the versor with nonnegative real part.
366
+ Usual torch rules for broadcasting apply.
367
+
368
+ Args:
369
+ a: Quaternions as tensor of shape (..., 4), real part first.
370
+ b: Quaternions as tensor of shape (..., 4), real part first.
371
+
372
+ Returns:
373
+ The product of a and b, a tensor of quaternions of shape (..., 4).
374
+ """
375
+ ab = quaternion_raw_multiply(a, b)
376
+ return standardize_quaternion(ab)
377
+
378
+
379
+ def quaternion_invert(quaternion):
380
+ """
381
+ Given a quaternion representing rotation, get the quaternion representing
382
+ its inverse.
383
+
384
+ Args:
385
+ quaternion: Quaternions as tensor of shape (..., 4), with real part
386
+ first, which must be versors (unit quaternions).
387
+
388
+ Returns:
389
+ The inverse, a tensor of quaternions of shape (..., 4).
390
+ """
391
+
392
+ return quaternion * quaternion.new_tensor([1, -1, -1, -1])
393
+
394
+
395
+ def quaternion_apply(quaternion, point):
396
+ """
397
+ Apply the rotation given by a quaternion to a 3D point.
398
+ Usual torch rules for broadcasting apply.
399
+
400
+ Args:
401
+ quaternion: Tensor of quaternions, real part first, of shape (..., 4).
402
+ point: Tensor of 3D points of shape (..., 3).
403
+
404
+ Returns:
405
+ Tensor of rotated points of shape (..., 3).
406
+ """
407
+ if point.size(-1) != 3:
408
+ raise ValueError(f"Points are not in 3D, {point.shape}.")
409
+ real_parts = point.new_zeros(point.shape[:-1] + (1,))
410
+ point_as_quaternion = torch.cat((real_parts, point), -1)
411
+ out = quaternion_raw_multiply(
412
+ quaternion_raw_multiply(quaternion, point_as_quaternion),
413
+ quaternion_invert(quaternion),
414
+ )
415
+ return out[..., 1:]
416
+
417
+
418
+ def axis_angle_to_matrix(axis_angle):
419
+ """
420
+ Convert rotations given as axis/angle to rotation matrices.
421
+
422
+ Args:
423
+ axis_angle: Rotations given as a vector in axis angle form,
424
+ as a tensor of shape (..., 3), where the magnitude is
425
+ the angle turned anticlockwise in radians around the
426
+ vector's direction.
427
+
428
+ Returns:
429
+ Rotation matrices as tensor of shape (..., 3, 3).
430
+ """
431
+ return quaternion_to_matrix(axis_angle_to_quaternion(axis_angle))
432
+
433
+
434
+ def matrix_to_axis_angle(matrix):
435
+ """
436
+ Convert rotations given as rotation matrices to axis/angle.
437
+
438
+ Args:
439
+ matrix: Rotation matrices as tensor of shape (..., 3, 3).
440
+
441
+ Returns:
442
+ Rotations given as a vector in axis angle form, as a tensor
443
+ of shape (..., 3), where the magnitude is the angle
444
+ turned anticlockwise in radians around the vector's
445
+ direction.
446
+ """
447
+ return quaternion_to_axis_angle(matrix_to_quaternion(matrix))
448
+
449
+
450
+ def axis_angle_to_quaternion(axis_angle):
451
+ """
452
+ Convert rotations given as axis/angle to quaternions.
453
+
454
+ Args:
455
+ axis_angle: Rotations given as a vector in axis angle form,
456
+ as a tensor of shape (..., 3), where the magnitude is
457
+ the angle turned anticlockwise in radians around the
458
+ vector's direction.
459
+
460
+ Returns:
461
+ quaternions with real part first, as tensor of shape (..., 4).
462
+ """
463
+ angles = torch.norm(axis_angle, p=2, dim=-1, keepdim=True)
464
+ half_angles = 0.5 * angles
465
+ eps = 1e-6
466
+ small_angles = angles.abs() < eps
467
+ sin_half_angles_over_angles = torch.empty_like(angles)
468
+ sin_half_angles_over_angles[~small_angles] = (
469
+ torch.sin(half_angles[~small_angles]) / angles[~small_angles]
470
+ )
471
+ # for x small, sin(x/2) is about x/2 - (x/2)^3/6
472
+ # so sin(x/2)/x is about 1/2 - (x*x)/48
473
+ sin_half_angles_over_angles[small_angles] = (
474
+ 0.5 - (angles[small_angles] * angles[small_angles]) / 48
475
+ )
476
+ quaternions = torch.cat(
477
+ [torch.cos(half_angles), axis_angle * sin_half_angles_over_angles], dim=-1
478
+ )
479
+ return quaternions
480
+
481
+
482
+ def quaternion_to_axis_angle(quaternions):
483
+ """
484
+ Convert rotations given as quaternions to axis/angle.
485
+
486
+ Args:
487
+ quaternions: quaternions with real part first,
488
+ as tensor of shape (..., 4).
489
+
490
+ Returns:
491
+ Rotations given as a vector in axis angle form, as a tensor
492
+ of shape (..., 3), where the magnitude is the angle
493
+ turned anticlockwise in radians around the vector's
494
+ direction.
495
+ """
496
+ norms = torch.norm(quaternions[..., 1:], p=2, dim=-1, keepdim=True)
497
+ half_angles = torch.atan2(norms, quaternions[..., :1])
498
+ angles = 2 * half_angles
499
+ eps = 1e-6
500
+ small_angles = angles.abs() < eps
501
+ sin_half_angles_over_angles = torch.empty_like(angles)
502
+ sin_half_angles_over_angles[~small_angles] = (
503
+ torch.sin(half_angles[~small_angles]) / angles[~small_angles]
504
+ )
505
+ # for x small, sin(x/2) is about x/2 - (x/2)^3/6
506
+ # so sin(x/2)/x is about 1/2 - (x*x)/48
507
+ sin_half_angles_over_angles[small_angles] = (
508
+ 0.5 - (angles[small_angles] * angles[small_angles]) / 48
509
+ )
510
+ return quaternions[..., 1:] / sin_half_angles_over_angles
511
+
512
+
513
+ def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
514
+ """
515
+ Converts 6D rotation representation by Zhou et al. [1] to rotation matrix
516
+ using Gram--Schmidt orthogonalisation per Section B of [1].
517
+ Args:
518
+ d6: 6D rotation representation, of size (*, 6)
519
+
520
+ Returns:
521
+ batch of rotation matrices of size (*, 3, 3)
522
+
523
+ [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
524
+ On the Continuity of Rotation Representations in Neural Networks.
525
+ IEEE Conference on Computer Vision and Pattern Recognition, 2019.
526
+ Retrieved from http://arxiv.org/abs/1812.07035
527
+ """
528
+
529
+ a1, a2 = d6[..., :3], d6[..., 3:]
530
+ b1 = F.normalize(a1, dim=-1)
531
+ b2 = a2 - (b1 * a2).sum(-1, keepdim=True) * b1
532
+ b2 = F.normalize(b2, dim=-1)
533
+ b3 = torch.cross(b1, b2, dim=-1)
534
+ return torch.stack((b1, b2, b3), dim=-2)
535
+
536
+
537
+ def matrix_to_rotation_6d(matrix: torch.Tensor) -> torch.Tensor:
538
+ """
539
+ Converts rotation matrices to 6D rotation representation by Zhou et al. [1]
540
+ by dropping the last row. Note that 6D representation is not unique.
541
+ Args:
542
+ matrix: batch of rotation matrices of size (*, 3, 3)
543
+
544
+ Returns:
545
+ 6D rotation representation, of size (*, 6)
546
+
547
+ [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H.
548
+ On the Continuity of Rotation Representations in Neural Networks.
549
+ IEEE Conference on Computer Vision and Pattern Recognition, 2019.
550
+ Retrieved from http://arxiv.org/abs/1812.07035
551
+ """
552
+ return matrix[..., :2, :].clone().reshape(*matrix.size()[:-2], 6)
553
+
554
+
555
+ def axis_angle_to_rotation_6d(axis_angle):
556
+ """
557
+ Convert rotations given as axis/angle to 6D rotation representation by Zhou
558
+ et al. [1].
559
+
560
+ Args:
561
+ axis_angle: Rotations given as a vector in axis angle form,
562
+ as a tensor of shape (..., 3), where the magnitude is
563
+ the angle turned anticlockwise in radians around the
564
+ vector's direction.
565
+
566
+ Returns:
567
+ 6D rotation representation, of size (*, 6)
568
+ """
569
+ return matrix_to_rotation_6d(axis_angle_to_matrix(axis_angle))
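A quick consistency check of the conversions above (random rotations with angle below pi; tolerances are loose to allow for float32 rounding):

    import torch
    from diffposetalk.utils.rotation_conversions import (
        axis_angle_to_matrix, matrix_to_axis_angle, axis_angle_to_rotation_6d, rotation_6d_to_matrix)

    aa = torch.randn(4, 3) * 0.5                 # batch of axis-angle rotations
    R = axis_angle_to_matrix(aa)                 # (4, 3, 3)
    d6 = axis_angle_to_rotation_6d(aa)           # (4, 6)
    assert torch.allclose(matrix_to_axis_angle(R), aa, atol=1e-4)
    assert torch.allclose(rotation_6d_to_matrix(d6), R, atol=1e-4)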
diffposetalk/wav2vec2.py ADDED
@@ -0,0 +1,119 @@
1
+ from packaging import version
2
+ from typing import Optional, Tuple
3
+
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn.functional as F
7
+ import transformers
8
+ from transformers import Wav2Vec2Model
9
+ from transformers.modeling_outputs import BaseModelOutput
10
+
11
+ _CONFIG_FOR_DOC = 'Wav2Vec2Config'
12
+
13
+
14
+ # the implementation of Wav2Vec2Model is borrowed from
15
+ # https://huggingface.co/transformers/_modules/transformers/models/wav2vec2/modeling_wav2vec2.html#Wav2Vec2Model
16
+ # so that we can initialize our encoder with the pre-trained wav2vec 2.0 weights.
17
+ def _compute_mask_indices(shape: Tuple[int, int], mask_prob: float, mask_length: int,
18
+ attention_mask: Optional[torch.Tensor] = None, min_masks: int = 0, ) -> np.ndarray:
19
+ bsz, all_sz = shape
20
+ mask = np.full((bsz, all_sz), False)
21
+
22
+ all_num_mask = int(mask_prob * all_sz / float(mask_length) + np.random.rand())
23
+ all_num_mask = max(min_masks, all_num_mask)
24
+ mask_idcs = []
25
+ padding_mask = attention_mask.ne(1) if attention_mask is not None else None
26
+ for i in range(bsz):
27
+ if padding_mask is not None:
28
+ sz = all_sz - padding_mask[i].long().sum().item()
29
+ num_mask = int(mask_prob * sz / float(mask_length) + np.random.rand())
30
+ num_mask = max(min_masks, num_mask)
31
+ else:
32
+ sz = all_sz
33
+ num_mask = all_num_mask
34
+
35
+ lengths = np.full(num_mask, mask_length)
36
+
37
+ if sum(lengths) == 0:
38
+ lengths[0] = min(mask_length, sz - 1)
39
+
40
+ min_len = min(lengths)
41
+ if sz - min_len <= num_mask:
42
+ min_len = sz - num_mask - 1
43
+
44
+ mask_idc = np.random.choice(sz - min_len, num_mask, replace=False)
45
+ mask_idc = np.asarray([mask_idc[j] + offset for j in range(len(mask_idc)) for offset in range(lengths[j])])
46
+ mask_idcs.append(np.unique(mask_idc[mask_idc < sz]))
47
+
48
+ min_len = min([len(m) for m in mask_idcs])
49
+ for i, mask_idc in enumerate(mask_idcs):
50
+ if len(mask_idc) > min_len:
51
+ mask_idc = np.random.choice(mask_idc, min_len, replace=False)
52
+ mask[i, mask_idc] = True
53
+ return mask
54
+
55
+
56
+ # linear interpolation layer
57
+ def linear_interpolation(features, input_fps, output_fps, output_len=None):
58
+ # features: (N, C, L)
59
+ seq_len = features.shape[2] / float(input_fps)
60
+ if output_len is None:
61
+ output_len = int(seq_len * output_fps)
62
+ output_features = F.interpolate(features, size=output_len, align_corners=False, mode='linear')
63
+ return output_features
64
+
65
+
66
+ class Wav2Vec2Model(Wav2Vec2Model):
67
+ def __init__(self, config):
68
+ super().__init__(config)
69
+ self.is_old_version = version.parse(transformers.__version__) < version.parse('4.7.0')
70
+
71
+ def forward(self, input_values, output_fps=25, attention_mask=None, output_attentions=None,
72
+ output_hidden_states=None, return_dict=None, frame_num=None):
73
+ self.config.output_attentions = True
74
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
75
+ output_hidden_states = (
76
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states)
77
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
78
+
79
+ hidden_states = self.feature_extractor(input_values) # (N, C, L)
80
+ # Resample the audio feature @ 50 fps to `output_fps`.
81
+ if frame_num is not None:
82
+ hidden_states_len = round(frame_num * 50 / output_fps)
83
+ hidden_states = hidden_states[:, :, :hidden_states_len]
84
+ hidden_states = linear_interpolation(hidden_states, 50, output_fps, output_len=frame_num)
85
+ hidden_states = hidden_states.transpose(1, 2) # (N, L, C)
86
+
87
+ if attention_mask is not None:
88
+ output_lengths = self._get_feat_extract_output_lengths(attention_mask.sum(-1))
89
+ attention_mask = torch.zeros(hidden_states.shape[:2], dtype=hidden_states.dtype,
90
+ device=hidden_states.device)
91
+ attention_mask[(torch.arange(attention_mask.shape[0], device=hidden_states.device), output_lengths - 1)] = 1
92
+ attention_mask = attention_mask.flip([-1]).cumsum(-1).flip([-1]).bool()
93
+
94
+ if self.is_old_version:
95
+ hidden_states = self.feature_projection(hidden_states)
96
+ else:
97
+ hidden_states = self.feature_projection(hidden_states)[0]
98
+
99
+ if self.config.apply_spec_augment and self.training:
100
+ batch_size, sequence_length, hidden_size = hidden_states.size()
101
+ if self.config.mask_time_prob > 0:
102
+ mask_time_indices = _compute_mask_indices((batch_size, sequence_length), self.config.mask_time_prob,
103
+ self.config.mask_time_length, attention_mask=attention_mask,
104
+ min_masks=2, )
105
+ hidden_states[torch.from_numpy(mask_time_indices)] = self.masked_spec_embed.to(hidden_states.dtype)
106
+ if self.config.mask_feature_prob > 0:
107
+ mask_feature_indices = _compute_mask_indices((batch_size, hidden_size), self.config.mask_feature_prob,
108
+ self.config.mask_feature_length, )
109
+ mask_feature_indices = torch.from_numpy(mask_feature_indices).to(hidden_states.device)
110
+ hidden_states[mask_feature_indices[:, None].expand(-1, sequence_length, -1)] = 0
111
+ encoder_outputs = self.encoder(hidden_states, attention_mask=attention_mask,
112
+ output_attentions=output_attentions, output_hidden_states=output_hidden_states,
113
+ return_dict=return_dict, )
114
+ hidden_states = encoder_outputs[0]
115
+ if not return_dict:
116
+ return (hidden_states,) + encoder_outputs[1:]
117
+
118
+ return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_outputs.hidden_states,
119
+ attentions=encoder_outputs.attentions, )
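Finally, a hedged end-to-end sketch of how this audio encoder maps a waveform to per-frame features at the video frame rate; the checkpoint name below is only an example of a compatible wav2vec 2.0 base model:

    import torch
    from diffposetalk.wav2vec2 import Wav2Vec2Model

    model = Wav2Vec2Model.from_pretrained('facebook/wav2vec2-base-960h').eval()
    audio = torch.randn(1, 32000)                       # 2 s of 16 kHz audio (placeholder)
    with torch.no_grad():
        out = model(audio, output_fps=25, frame_num=50)
    print(out.last_hidden_state.shape)                  # torch.Size([1, 50, 768])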