diff --git a/.gitattributes b/.gitattributes
index a6344aac8c09253b3b630fb776ae94478aa0275b..94c22a8ce9b689713cf92cdc9fea1025e3d3dc30 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -33,3 +33,22 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/camera_control.png filter=lfs diff=lfs merge=lfs -text
+assets/cases/dog.png filter=lfs diff=lfs merge=lfs -text
+assets/cases/dog_pure_camera_motion_1.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/cases/dog_pure_camera_motion_2.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/cases/dog_pure_obj_motion.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/func_1.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/logo.png filter=lfs diff=lfs merge=lfs -text
+assets/logo_generated2.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/logo_generated2_single.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/pose_files/complex_1.png filter=lfs diff=lfs merge=lfs -text
+assets/pose_files/complex_2.png filter=lfs diff=lfs merge=lfs -text
+assets/pose_files/complex_3.png filter=lfs diff=lfs merge=lfs -text
+assets/pose_files/complex_4.png filter=lfs diff=lfs merge=lfs -text
+assets/sea.jpg filter=lfs diff=lfs merge=lfs -text
+assets/syn_video_control1.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/syn_video_control2.mp4 filter=lfs diff=lfs merge=lfs -text
+data/folders/007401_007450_1018898026/video.mp4 filter=lfs diff=lfs merge=lfs -text
+data/folders/046001_046050_1011035429/video.mp4 filter=lfs diff=lfs merge=lfs -text
+data/folders/188701_188750_1026109505/video.mp4 filter=lfs diff=lfs merge=lfs -text
diff --git a/READEME.md b/READEME.md
new file mode 100644
index 0000000000000000000000000000000000000000..f1e33d4334326d6dc74d3f22f4969a03abe0793f
--- /dev/null
+++ b/READEME.md
@@ -0,0 +1,159 @@
+# MotionPro
+
+
+
+
+
+
+ 🖥️ GitHub    |    🌐 Project Page    |   🤗 Hugging Face   |    📑 Paper    |    📖 PDF   
+
+
+[**MotionPro: A Precise Motion Controller for Image-to-Video Generation**](https://zhw-zhang.github.io/MotionPro-page/)
+
+🔆 If you find MotionPro useful, please give this repo a ⭐; stars matter a lot to open-source projects. Thanks!
+
+In this repository, we introduce **MotionPro**, an image-to-video generation model built on SVD. MotionPro learns object and camera motion control from **in-the-wild** video datasets (e.g., WebVid-10M) without applying special data filtering. The model offers the following key features:
+
+- **User-friendly interaction.** Our model requires only simple conditional inputs, allowing users to control image-to-video generation through brushing and dragging.
+- **Simultaneous control of object and camera motion.** MotionPro supports simultaneous object and camera motion control. Moreover, it achieves precise pose-driven camera control without being trained on a dedicated camera-pose paired dataset. [More Details](assets/camera_control.png)
+- **Synchronized video generation.** This is an extension of our model. By combining MotionPro and MotionPro-Dense, we can achieve synchronized video generation. [More Details](assets/README_syn.md)
+
+
+Additionally, our repository provides tools to support the research community's development:
+
+- **Memory optimization for training.** We provide a training framework based on PyTorch Lightning, optimized for memory efficiency, enabling SVD fine-tuning with a batch size of 8 per NVIDIA A100 GPU.
+- **Data construction tools.** We offer scripts for constructing training data, along with dataset-loading code in two formats, supporting video input from both folders (Dataset) and tar files (WebDataset).
+- **MC-Bench and evaluation code.** We constructed MC-Bench with 1.1K user-annotated image-trajectory pairs, along with evaluation scripts for comprehensive assessments. All the images showcased on the project page can be found here.
+
+## Video Demos
+
+
+
+
+Examples of different motion control types by our MotionPro.
+
+
+
+
+## 🔥 Updates
+- [x] **\[2025.03.26\]** Release inference and training code.
+- [ ] **\[2025.03.27\]** Upload gradio demo usage video.
+- [ ] **\[2025.03.29\]** Release MC-Bench and evaluation code.
+- [ ] **\[2025.03.30\]** Upload annotation tool for image-trajectory pair construction.
+
+## 🏃🏼 Inference
+
+### Environment Requirements
+
+Clone the repo:
+```
+git clone https://github.com/HiDream-ai/MotionPro.git
+```
+
+Install dependencies:
+```
+conda create -n motionpro python=3.10.0
+conda activate motionpro
+pip install -r requirements.txt
+```
+
+
+
+### Model Download
+
+
+| Models | Download Link | Notes |
+|-------------------|-------------------------------------------------------------------------------|--------------------------------------------|
+| MotionPro | 🤗[Huggingface](https://huggingface.co/zzwustc/MotionPro/blob/main/MotionPro-gs_16k.pt) | Supports both object and camera control. This is the default model mentioned in the paper. |
+| MotionPro-Dense | 🤗[Huggingface](https://huggingface.co/zzwustc/MotionPro/blob/main/MotionPro_Dense-gs_14k.pt) | Supports synchronized video generation when combined with MotionPro. MotionPro-Dense shares the same architecture as MotionPro, but its input conditions are modified to include dense optical flow and per-frame visibility masks relative to the first frame. |
+
+
+Download the models from Hugging Face at high speed (30-80 MB/s):
+```
+cd tools/huggingface_down
+bash download_hfd.sh
+```
+
+
+
+
+### Run Motion Control
+
+This section of the code supports simultaneous object and camera motion control. We provide a user-friendly Gradio demo that lets users control motion with simple brushing and dragging operations. An instructional video is available at `assets/demo.mp4` (please note the Gradio version it uses).
+
+```
+python demo_sparse_flex_wh.py
+```
+When you expect all pixels to move (e.g., for camera control), use the brush to cover the entire image. You can also test the demo with `assets/logo.png`.
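+
+For intuition, the sketch below shows one plausible way a brush-and-drag input could be turned into a sparse flow map plus mask, i.e., the kind of simple conditional input described above. It is an illustration only, not the demo's actual code; the function name and array layout are assumptions.
+
+```python
+# Illustrative sketch (not the demo's implementation): rasterize a brushed region and a
+# drag vector into a sparse flow map and a mask.
+import numpy as np
+
+def sparse_flow_condition(brush_mask: np.ndarray, start_xy, end_xy):
+    """brush_mask: (H, W) boolean array painted by the user; start_xy/end_xy: drag endpoints."""
+    h, w = brush_mask.shape
+    flow = np.zeros((h, w, 2), dtype=np.float32)
+    dx, dy = end_xy[0] - start_xy[0], end_xy[1] - start_xy[1]
+    flow[brush_mask] = (dx, dy)  # every brushed pixel is assigned the dragged offset
+    return flow, brush_mask.astype(np.float32)
+```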
+
+Users can also generate controllable image-to-video results using pre-defined camera trajectories. Note that our model has not been trained on a specific camera-control dataset. Test the demo with `assets/sea.jpg`.
+
+```
+python demo_sparse_flex_wh_pure_camera.py
+```
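+
+The pre-defined trajectories live in `assets/pose_files` as paired `.png`/`.txt` (or `.pth`) files. Below is a hedged reading aid for the `.txt` layout, assuming the common convention of a source URL on the first line followed by per-frame rows of `timestamp, fx, fy, cx, cy, 0, 0` plus a flattened 3x4 world-to-camera matrix; it is not necessarily the project's own loader.
+
+```python
+# Hedged sketch: parse a pose .txt from assets/pose_files into intrinsics and 3x4 extrinsics.
+import numpy as np
+
+def load_camera_trajectory(path):
+    with open(path) as f:
+        url = f.readline().strip()                    # first line: source video URL
+        frames = []
+        for line in f:
+            vals = [float(v) for v in line.split()]
+            timestamp = vals[0]
+            fx, fy, cx, cy = vals[1:5]                # normalized intrinsics (assumed)
+            w2c = np.array(vals[7:19]).reshape(3, 4)  # flattened 3x4 extrinsic matrix
+            frames.append((timestamp, (fx, fy, cx, cy), w2c))
+    return url, frames
+```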
+
+
+
+
+### Run Synchronized Video Generation and Video Recapture
+
+By combining MotionPro and MotionPro-Dense, we can achieve the following functionalities:
+- Synchronized video generation. We assume that two videos, `pure_obj_motion.mp4` and `pure_camera_motion.mp4`, have been generated using the respective demos. By combining their motion flows and using the result as a condition for MotionPro-Dense, we obtain `final_video`. By pairing the same object motion with different camera motions, we can generate `synchronized videos` where the object motion remains consistent while the camera motion varies. [More Details](assets/README_syn.md)
+
+First, download the CoTracker [model weights](https://huggingface.co/zzwustc/MotionPro/tree/main/tools/co-tracker/checkpoints) and place them in the `tools/co-tracker/checkpoints` directory.
+
+```
+python inference_dense.py --ori_video 'assets/cases/dog_pure_obj_motion.mp4' --camera_video 'assets/cases/dog_pure_camera_motion_1.mp4' --save_name 'syn_video.mp4' --ckpt_path 'MotionPro-Dense CKPT-PATH'
+```
+
+
+
+## 🚀 Training
+
+
+### Data Preparation
+
+We have packaged several demo videos to help users debug the training code. Simply 🤗[download](https://huggingface.co/zzwustc/MotionPro/tree/main/data), extract the files, and place them in the `./data` directory.
+
+Additionally, `./data/dot_single_video` contains code for processing raw videos using [DOT](https://github.com/16lemoing/dot) to generate the necessary conditions for training, making it easier for the community to create training datasets.
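+
+For reference, the demo data unpacks into per-clip folders such as `data/folders/<clip_id>/video.mp4`. Below is a minimal sketch of a folder-style loader over that layout; the class name and the choice to return file paths rather than decoded frames are illustrative assumptions.
+
+```python
+# Hedged sketch: enumerate the folder-style demo data, one clip directory per sample.
+import os
+from torch.utils.data import Dataset
+
+class FolderVideoDataset(Dataset):
+    def __init__(self, root="data/folders"):
+        self.clips = sorted(
+            os.path.join(root, d) for d in os.listdir(root)
+            if os.path.isfile(os.path.join(root, d, "video.mp4"))
+        )
+
+    def __len__(self):
+        return len(self.clips)
+
+    def __getitem__(self, idx):
+        return os.path.join(self.clips[idx], "video.mp4")  # decode with your video reader
+```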
+
+
+
+
+
+### Train
+
+Simply run the following command to train MotionPro:
+```
+bash train_server_1.sh
+```
+In addition to loading video data from folders, we also support [WebDataset](https://rom1504.github.io/webdataset/), allowing videos to be read directly from tar files for training. This can be enabled by modifying the config file:
+```
+train_debug_from_folder.yaml -> train_debug_from_tar.yaml
+```
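+
+As a rough illustration of the tar-based path, the snippet below iterates video samples from tar shards with the WebDataset API; the shard pattern and sample keys are assumptions, not this repo's exact configuration.
+
+```python
+# Hedged sketch (not the repo's loader): stream video samples out of tar shards.
+import webdataset as wds
+
+# Files sharing a basename inside a tar form one sample dict keyed by extension.
+dataset = wds.WebDataset("data/shards/{000000..000009}.tar").to_tuple("mp4", "json")
+
+for video_bytes, meta_bytes in dataset:
+    ...  # decode video_bytes with your preferred video reader
+```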
+
+Furthermore, to train the **MotionPro-Dense** model, simply modify the `train_debug_from_tar.yaml` file by changing `VidTar` to `VidTar_all_flow` and updating the `ckpt_path`.
+
+
+
+## 🌟 Star and Citation
+If you find our work helpful for your research, please consider giving this repository a star ⭐ and citing our work 📝.
+```
+@inproceedings{2025motionpro,
+ title={MotionPro: A Precise Motion Controller for Image-to-Video Generation},
+ author={Zhongwei Zhang and Fuchen Long and Zhaofan Qiu and Yingwei Pan and Wu Liu and Ting Yao and Tao Mei},
+ booktitle={CVPR},
+ year={2025}
+}
+```
+
+
+## 💖 Acknowledgement
+
+
+Our code is inspired by several works, including [SVD](https://github.com/Stability-AI/generative-models), [DragNUWA](https://github.com/ProjectNUWA/DragNUWA), [DOT](https://github.com/16lemoing/dot), and [CoTracker](https://github.com/facebookresearch/co-tracker). Thanks to all the contributors!
+
diff --git a/assets/README_syn.md b/assets/README_syn.md
new file mode 100644
index 0000000000000000000000000000000000000000..ffb3d37fe0c0bdeac545d8ddfe03e537a72d9f80
--- /dev/null
+++ b/assets/README_syn.md
@@ -0,0 +1,38 @@
+## MotionPro-Dense Video Generation Pipeline
+
+
+This document provides an introduction to the MotionPro-Dense video generation pipeline, detailing its functionality and workflow. The pipeline is illustrated in the diagram below.
+
+### Pipeline Description
+
+1. **Video Generation with Base Motion Control**
+ - First, MotionPro can be used to generate videos with controllable object motion and camera motion.
+
+2. **Optical Flow and Visibility Mask Extraction and Merging**
+   - The generated videos are processed with CoTracker to extract per-frame optical flow and visibility masks.
+   - The extracted optical flows are accumulated through summation.
+   - The per-frame visibility masks from both sequences are intersected to obtain the final visibility mask (see the sketch after this list).
+
+3. **Final Video Generation with Combined Motions**
+ - The aggregated motion conditions are used as input for **MotionPro-Dense**, which generates the final video with seamlessly integrated object and camera motions.
+
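+Concretely, step 2 above amounts to something like the following minimal sketch; the function name and tensor shapes are assumptions, and the arrays may be NumPy or torch tensors:
+
+```python
+# Hedged sketch: merge two motion conditions before feeding MotionPro-Dense.
+# flow_*: (T, H, W, 2) optical flow of each frame relative to the first frame
+# vis_* : (T, H, W) boolean visibility masks relative to the first frame
+def combine_motion_conditions(flow_obj, vis_obj, flow_cam, vis_cam):
+    combined_flow = flow_obj + flow_cam   # accumulate the two flows by summation
+    combined_vis = vis_obj & vis_cam      # intersect the per-frame visibility masks
+    return combined_flow, combined_vis
+```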
+
+
+
+
+Figure 1: Illustration of video generation with combined motions.
+
+
+
+### Synchronized Video Generation
+
+Additionally, the pipeline enables the generation of **synchronized videos**, where a consistent object motion is paired with different camera motions.
+
+
+
+
+Figure 2: Illustration of synchronized video generation.
+
+
+
+
diff --git a/assets/camera_control.png b/assets/camera_control.png
new file mode 100644
index 0000000000000000000000000000000000000000..17a204aa230202a9b5c0eb69bb4d1fbd715754ad
--- /dev/null
+++ b/assets/camera_control.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c3f1accf32a6a665324a33993433d16ae33ac74d7e8329b137c86d8e2e05b884
+size 853223
diff --git a/assets/cases/dog.png b/assets/cases/dog.png
new file mode 100644
index 0000000000000000000000000000000000000000..3138c5f2cf4b2e53337d3b04a94fef1f0a70d31b
--- /dev/null
+++ b/assets/cases/dog.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:12475a897c6b7daed611edbf74cdfdef4149deab15aded2087e5f0782dd3df20
+size 187872
diff --git a/assets/cases/dog_pure_camera_motion_1.mp4 b/assets/cases/dog_pure_camera_motion_1.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..aea58e0af372837886db0d87addae476597c4658
--- /dev/null
+++ b/assets/cases/dog_pure_camera_motion_1.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8137d4babcd1ad94e988a0d6768320e8ca924d8754e1a42f2a2dfda9c0f5e761
+size 208964
diff --git a/assets/cases/dog_pure_camera_motion_2.mp4 b/assets/cases/dog_pure_camera_motion_2.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..a116bc0fec38c5c515992fcfb49843105069d355
--- /dev/null
+++ b/assets/cases/dog_pure_camera_motion_2.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ddde37b57e4d6b3d48d18ef7f0ef814e0993a5a51443189841045bc62fadc0c2
+size 454983
diff --git a/assets/cases/dog_pure_obj_motion.mp4 b/assets/cases/dog_pure_obj_motion.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..d57b10afb0fe99af665138bb409a823eb4727b46
--- /dev/null
+++ b/assets/cases/dog_pure_obj_motion.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7c59c97ac7a59b79ae5146ee90ff684f97fed220b30cde646321965e43ffc187
+size 136490
diff --git a/assets/func_1.mp4 b/assets/func_1.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..8d1acffab25ac3bb48adc3479ff0d0137434f73b
--- /dev/null
+++ b/assets/func_1.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:862c9a0b58b9536d8479122154bade05077e2ee0ed71a2fe6c8c8c87d553d207
+size 4606221
diff --git a/assets/logo.png b/assets/logo.png
new file mode 100644
index 0000000000000000000000000000000000000000..4f4502e14c60f470bff7849dc0d3f0dabc9e7496
--- /dev/null
+++ b/assets/logo.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e9b0c2525d869a606061f80a3f25fd7237bb61465a77cc7799a365fc63ffcf15
+size 2398622
diff --git a/assets/logo_generated2.mp4 b/assets/logo_generated2.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..10a03c0d15c8157e439f58e00e81e4bc00ef79d8
--- /dev/null
+++ b/assets/logo_generated2.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:30a5f7d9ad8a4ba8f6389a16acf024783d4be507d10578db542fba37e258837d
+size 333258
diff --git a/assets/logo_generated2_single.mp4 b/assets/logo_generated2_single.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..fdb3b697eb42f21464f763d28729a87948969987
--- /dev/null
+++ b/assets/logo_generated2_single.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a2e58a7533f59b2c490833bbf264c3029d93350e5f7205ff6e25aead38276b95
+size 104022
diff --git a/assets/pose_files/0bf152ef84195293.png b/assets/pose_files/0bf152ef84195293.png
new file mode 100644
index 0000000000000000000000000000000000000000..fc50953854077793f19d90b8c69e141047f2f1e2
Binary files /dev/null and b/assets/pose_files/0bf152ef84195293.png differ
diff --git a/assets/pose_files/0bf152ef84195293.txt b/assets/pose_files/0bf152ef84195293.txt
new file mode 100644
index 0000000000000000000000000000000000000000..e87e03c57fefdae8e277555762b056c0f701bb9b
--- /dev/null
+++ b/assets/pose_files/0bf152ef84195293.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=QShWPZxTDoE
+158692025 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.780003667 0.059620168 -0.622928321 0.726968666 -0.062449891 0.997897983 0.017311305 0.217967188 0.622651041 0.025398925 0.782087326 -1.002211444
+158958959 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.743836701 0.064830206 -0.665209770 0.951841944 -0.068305343 0.997446954 0.020830527 0.206496789 0.664861917 0.029942872 0.746365905 -1.084913992
+159225893 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.697046876 0.070604131 -0.713540971 1.208789672 -0.074218854 0.996899366 0.026138915 0.196421447 0.713174045 0.034738146 0.700125754 -1.130142078
+159526193 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.635762572 0.077846259 -0.767949164 1.465161122 -0.080595709 0.996158004 0.034256749 0.157107229 0.767665446 0.040114246 0.639594078 -1.136893070
+159793126 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.593250692 0.083153486 -0.800711632 1.635091834 -0.085384794 0.995539784 0.040124334 0.135863998 0.800476789 0.044564810 0.597704709 -1.166997229
+160093427 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.555486798 0.087166689 -0.826943994 1.803789619 -0.089439675 0.994984210 0.044799786 0.145490422 0.826701283 0.049075913 0.560496747 -1.243827350
+160360360 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.523399472 0.090266660 -0.847292721 1.945815368 -0.093254104 0.994468153 0.048340045 0.174777447 0.846969128 0.053712368 0.528921843 -1.336914479
+160660661 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.491546303 0.092127070 -0.865964711 2.093852892 -0.095617607 0.994085968 0.051482171 0.196702533 0.865586221 0.057495601 0.497448236 -1.439709380
+160927594 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.475284129 0.093297184 -0.874871790 2.200792438 -0.096743606 0.993874133 0.053430639 0.209217395 0.874497354 0.059243519 0.481398523 -1.547068315
+161227895 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.464444131 0.093880348 -0.880612373 2.324141986 -0.097857766 0.993716478 0.054326952 0.220651207 0.880179226 0.060942926 0.470712721 -1.712512928
+161494828 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.458157241 0.093640216 -0.883925021 2.443100890 -0.098046601 0.993691206 0.054448847 0.257385043 0.883447111 0.061719712 0.464447916 -1.885672329
+161795128 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.457354397 0.093508720 -0.884354591 2.543246338 -0.097820736 0.993711591 0.054482624 0.281562244 0.883888066 0.061590351 0.463625461 -2.094829165
+162062062 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.465170115 0.093944497 -0.880222261 2.606377358 -0.097235762 0.993758380 0.054675922 0.277376127 0.879864752 0.060155477 0.471401453 -2.299280675
+162362362 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.511845231 0.090872414 -0.854257941 2.576774100 -0.093636356 0.994366586 0.049672548 0.270516319 0.853959382 0.054564942 0.517470777 -2.624374352
+162629296 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.590568483 0.083218277 -0.802685261 2.398318316 -0.085610889 0.995516419 0.040222570 0.282138215 0.802433550 0.044964414 0.595045030 -3.012309268
+162929596 0.474812461 0.844111024 0.500000000 0.500000000 0.000000000 0.000000000 0.684302032 0.072693504 -0.725566208 2.086323553 -0.074529484 0.996780157 0.029575195 0.310959312 0.725379944 0.033837710 0.687516510 -3.456740526
diff --git a/assets/pose_files/0c11dbe781b1c11c.png b/assets/pose_files/0c11dbe781b1c11c.png
new file mode 100644
index 0000000000000000000000000000000000000000..ee9cf43d38f0d8d5ca88ae088f936679b892f470
Binary files /dev/null and b/assets/pose_files/0c11dbe781b1c11c.png differ
diff --git a/assets/pose_files/0c11dbe781b1c11c.txt b/assets/pose_files/0c11dbe781b1c11c.txt
new file mode 100644
index 0000000000000000000000000000000000000000..ae116b187cb39c26269b8585cddbfc2de318405f
--- /dev/null
+++ b/assets/pose_files/0c11dbe781b1c11c.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=a-Unpcomk5k
+89889800 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.959632158 -0.051068146 0.276583046 0.339363991 0.046715312 0.998659134 0.022308502 0.111317310 -0.277351439 -0.008487292 0.960731030 -0.353512177
+90156733 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.939171016 -0.057914909 0.338531673 0.380727498 0.052699961 0.998307705 0.024584483 0.134404073 -0.339382589 -0.005248427 0.940633774 -0.477942109
+90423667 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.913449824 -0.063028678 0.402040780 0.393354042 0.056629892 0.998008251 0.027794635 0.151535333 -0.402991891 -0.002621480 0.915199816 -0.622810637
+90723967 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.879072070 -0.069992281 0.471522361 0.381271678 0.062575974 0.997545719 0.031412520 0.175549569 -0.472563744 0.001892101 0.881294429 -0.821022008
+90990900 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.846152365 -0.078372896 0.527146876 0.360267421 0.071291871 0.996883452 0.033775900 0.212440374 -0.528151155 0.009001731 0.849102676 -1.013792538
+91291200 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.806246638 -0.086898506 0.585162342 0.297888150 0.078344196 0.996124208 0.039983708 0.243578507 -0.586368918 0.013607344 0.809929788 -1.248063630
+91558133 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.771091938 -0.093814306 0.629774630 0.223948432 0.087357447 0.995320201 0.041307874 0.293608807 -0.630702674 0.023163332 0.775678813 -1.459775674
+91858433 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.737968326 -0.099363215 0.667480111 0.145501271 0.093257703 0.994626462 0.044957232 0.329381977 -0.668360531 0.029070651 0.743269205 -1.688460978
+92125367 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.716826320 -0.101809755 0.689778805 0.086545731 0.098867603 0.994127929 0.043986596 0.379651732 -0.690206647 0.036666028 0.722682774 -1.885393814
+92425667 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.703360021 -0.101482928 0.703552365 0.039205180 0.098760851 0.994108558 0.044659954 0.417778776 -0.703939617 0.038071405 0.709238708 -2.106152155
+92692600 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.699525177 -0.101035394 0.707429409 0.029387371 0.096523918 0.994241416 0.046552572 0.439027166 -0.708059072 0.035719164 0.705249250 -2.314481674
+92992900 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.698582709 -0.101620331 0.708276451 0.018437890 0.096638583 0.994193733 0.047326516 0.478349552 -0.708973348 0.035385344 0.704347014 -2.540820022
+93259833 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.704948425 -0.098988213 0.702316940 0.047566428 0.095107265 0.994462848 0.044701166 0.517456396 -0.702853024 0.035283424 0.710459530 -2.724204596
+93560133 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.714113414 -0.104848787 0.692133486 0.107161588 0.100486010 0.993833601 0.046875130 0.568063228 -0.692780316 0.036075566 0.720245779 -2.948379150
+93827067 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.717699587 -0.112323314 0.687234104 0.118765931 0.105546549 0.993049562 0.052081093 0.593900230 -0.688307464 0.035156611 0.724566638 -3.140363331
+94127367 0.485388169 0.862912326 0.500000000 0.500000000 0.000000000 0.000000000 0.715531290 -0.122954883 0.687675118 0.089455249 0.115526602 0.991661787 0.057100743 0.643643035 -0.688961923 0.038587399 0.723769605 -3.310401931
diff --git a/assets/pose_files/0c9b371cc6225682.png b/assets/pose_files/0c9b371cc6225682.png
new file mode 100644
index 0000000000000000000000000000000000000000..a45f42678035d0d0c8c906a3732b5e57ba9b7830
Binary files /dev/null and b/assets/pose_files/0c9b371cc6225682.png differ
diff --git a/assets/pose_files/0c9b371cc6225682.txt b/assets/pose_files/0c9b371cc6225682.txt
new file mode 100644
index 0000000000000000000000000000000000000000..bc2f45f0ec995e31e83d824f8fc212d99775cbb1
--- /dev/null
+++ b/assets/pose_files/0c9b371cc6225682.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=_ca03xP_KUU
+211244000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.984322786 0.006958477 -0.176239252 0.004217217 -0.005594095 0.999950409 0.008237306 -0.107944544 0.176287830 -0.007122268 0.984312892 -0.571743822
+211511000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.981951714 0.008860772 -0.188924149 0.000856103 -0.007234470 0.999930620 0.009296093 -0.149397579 0.188993424 -0.007761548 0.981947660 -0.776566486
+211778000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.981318414 0.010160952 -0.192122281 -0.005546933 -0.008323869 0.999911606 0.010366773 -0.170816348 0.192210630 -0.008573905 0.981316268 -0.981924227
+212078000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.981108844 0.010863926 -0.193151161 0.019480142 -0.008781361 0.999893725 0.011634931 -0.185801323 0.193257034 -0.009719004 0.981100023 -1.207220396
+212345000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.981263518 0.010073495 -0.192407012 0.069708411 -0.008015377 0.999902070 0.011472094 -0.203594876 0.192503735 -0.009714933 0.981248140 -1.408936391
+212646000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.980964184 0.009669405 -0.193947718 0.166020848 -0.007467276 0.999899149 0.012082115 -0.219176122 0.194044977 -0.010403861 0.980937481 -1.602649833
+212913000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.980841637 0.009196524 -0.194589555 0.262465567 -0.006609587 0.999880970 0.013939449 -0.224018296 0.194694594 -0.012386235 0.980785728 -1.740759996
+213212000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.980620921 0.008805701 -0.195716679 0.389752858 -0.006055873 0.999874413 0.014644019 -0.230312701 0.195821062 -0.013174997 0.980551124 -1.890949759
+213479000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.980327129 0.009317402 -0.197159693 0.505632551 -0.006113928 0.999839306 0.016850581 -0.230702867 0.197285011 -0.015313662 0.980226576 -2.016199670
+213779000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.980493963 0.009960363 -0.196296573 0.623893674 -0.006936011 0.999846518 0.016088497 -0.223079036 0.196426690 -0.014413159 0.980412602 -2.137999468
+214046000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.980032921 0.010318150 -0.198567480 0.754726451 -0.007264129 0.999843955 0.016102606 -0.222246314 0.198702648 -0.014338664 0.979954958 -2.230292399
+214347000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.976653159 0.010179597 -0.214580998 0.946523963 -0.006709154 0.999834776 0.016895246 -0.210005171 0.214717537 -0.015061138 0.976560056 -2.305666573
+214614000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.971478105 0.011535713 -0.236848563 1.096604956 -0.007706031 0.999824286 0.017088750 -0.192895049 0.237004071 -0.014776184 0.971396267 -2.365701917
+214914000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.965282261 0.014877280 -0.260785013 1.237534109 -0.014124592 0.999888897 0.004760279 -0.136261458 0.260826856 -0.000911531 0.965385139 -2.458136272
+215181000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.961933076 0.016891202 -0.272762626 1.331672110 -0.022902885 0.999559581 -0.018870916 -0.076291319 0.272323757 0.024399608 0.961896241 -2.579417067
+215481000 0.479272232 0.852039479 0.500000000 0.500000000 0.000000000 0.000000000 0.959357142 0.017509742 -0.281651050 1.417338469 -0.039949402 0.996448219 -0.074127860 0.083949011 0.279352754 0.082366876 0.956649244 -2.712094466
diff --git a/assets/pose_files/0f47577ab3441480.png b/assets/pose_files/0f47577ab3441480.png
new file mode 100644
index 0000000000000000000000000000000000000000..8bb402ea77289870f77d271f76fd0d54c5545a61
Binary files /dev/null and b/assets/pose_files/0f47577ab3441480.png differ
diff --git a/assets/pose_files/0f47577ab3441480.txt b/assets/pose_files/0f47577ab3441480.txt
new file mode 100644
index 0000000000000000000000000000000000000000..8b3ff33ec400ca157b4a66c5983a79ac41ae8d6f
--- /dev/null
+++ b/assets/pose_files/0f47577ab3441480.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=in69BD2eZqg
+195562033 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.999749303 -0.004518872 0.021929268 0.038810557 0.004613766 0.999980211 -0.004278630 0.328177052 -0.021909500 0.004378735 0.999750376 -0.278403591
+195828967 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.999336481 -0.006239665 0.035883281 0.034735125 0.006456365 0.999961615 -0.005926326 0.417233500 -0.035844926 0.006154070 0.999338388 -0.270773664
+196095900 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.998902142 -0.007417044 0.046254709 0.033849936 0.007582225 0.999965489 -0.003396692 0.504852301 -0.046227921 0.003743677 0.998923898 -0.256677740
+196396200 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.998096347 -0.008753631 0.061049398 0.026475959 0.009088391 0.999945164 -0.005207890 0.583593760 -0.061000463 0.005752816 0.998121142 -0.236166024
+196663133 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.997463286 -0.009214416 0.070583619 0.014842158 0.009590282 0.999941587 -0.004988078 0.634675512 -0.070533529 0.005652342 0.997493386 -0.198663134
+196963433 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.996699810 -0.009558053 0.080611102 0.003250557 0.009986609 0.999938071 -0.004914839 0.670145924 -0.080559134 0.005703651 0.996733487 -0.141256339
+197230367 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.996102691 -0.010129508 0.087617576 -0.013035317 0.010638822 0.999929130 -0.005347892 0.673139255 -0.087557197 0.006259197 0.996139824 -0.073934910
+197530667 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.995880842 -0.009925503 0.090126604 -0.036202423 0.010367444 0.999936402 -0.004436717 0.655632681 -0.090076834 0.005352824 0.995920420 0.017267095
+197797600 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.995802402 -0.010077500 0.090972595 -0.060858524 0.010445373 0.999939084 -0.003568561 0.618604505 -0.090931088 0.004503824 0.995846987 0.133592270
+198097900 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.995846093 -0.010148350 0.090484887 -0.077962281 0.010412642 0.999942780 -0.002449236 0.561822755 -0.090454854 0.003381249 0.995894790 0.274195378
+198364833 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.995989919 -0.009936163 0.088912196 -0.082315587 0.010200773 0.999944806 -0.002522171 0.520613290 -0.088882230 0.003419030 0.996036291 0.395169547
+198665133 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.997159958 -0.009351323 0.074730076 -0.068472873 0.009822783 0.999934077 -0.005943770 0.466061412 -0.074669570 0.006660947 0.997186065 0.549834051
+198932067 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.998626053 -0.008290987 0.051742285 -0.037270541 0.008407482 0.999962568 -0.002034174 0.410440195 -0.051723484 0.002466401 0.998658419 0.690111645
+199232367 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.999980092 -0.004756952 0.004140501 -0.005957613 0.004773445 0.999980688 -0.003982662 0.354437092 -0.004121476 0.004002347 0.999983490 0.842797271
+199499300 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.998872638 0.001147069 -0.047456335 0.002603018 -0.001435435 0.999980688 -0.006042828 0.295339877 0.047448486 0.006104136 0.998855054 0.988644188
+199799600 0.507650910 0.902490531 0.500000000 0.500000000 0.000000000 0.000000000 0.992951691 0.008710741 -0.118199304 -0.030798243 -0.009495872 0.999936402 -0.006080875 0.208803899 0.118138820 0.007160421 0.992971301 1.161643267
diff --git a/assets/pose_files/0f68374b76390082.png b/assets/pose_files/0f68374b76390082.png
new file mode 100644
index 0000000000000000000000000000000000000000..632ae498002c03cfcb07fb407fa10c3da267c8eb
Binary files /dev/null and b/assets/pose_files/0f68374b76390082.png differ
diff --git a/assets/pose_files/0f68374b76390082.txt b/assets/pose_files/0f68374b76390082.txt
new file mode 100644
index 0000000000000000000000000000000000000000..d6fc78bb4f4bc95aeb5e9a7b5fc2e1272e296637
--- /dev/null
+++ b/assets/pose_files/0f68374b76390082.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=-aldZQifF2U
+103736967 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.804089785 -0.073792785 0.589910388 -2.686968354 0.081914566 0.996554494 0.013005137 0.128970374 -0.588837504 0.037864953 0.807363987 -1.789505608
+104003900 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.772824645 -0.077280566 0.629896700 -2.856354365 0.084460691 0.996253133 0.018602582 0.115028772 -0.628974140 0.038824979 0.776456118 -1.799931844
+104270833 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.740043461 -0.078656308 0.667943776 -3.017167990 0.086847030 0.995998919 0.021066183 0.116867188 -0.666928232 0.042419042 0.743913531 -1.815074499
+104571133 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.696879685 -0.073477358 0.713414192 -3.221640235 0.086792909 0.996067226 0.017807571 0.133618379 -0.711916924 0.049509555 0.700516284 -1.784051774
+104838067 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.654997289 -0.066671766 0.752684176 -3.418233112 0.086666502 0.996154904 0.012819566 0.161623584 -0.750644684 0.056835718 0.658256948 -1.733288907
+105138367 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.603833497 -0.059696361 0.794871926 -3.619566170 0.087576874 0.996123314 0.008281946 0.184519895 -0.792284906 0.064611480 0.606720686 -1.643568460
+105405300 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.555575073 -0.055402864 0.829618514 -3.768244320 0.089813948 0.995938241 0.006363695 0.197587954 -0.826601386 0.070975810 0.558294415 -1.559717271
+105705600 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.501615226 -0.052979972 0.863467038 -3.914896511 0.093892507 0.995560884 0.006539768 0.201989601 -0.859980464 0.077792637 0.504362881 -1.476983336
+105972533 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.454045177 -0.052372806 0.889438093 -4.034987790 0.099656843 0.994991958 0.007714771 0.211683202 -0.885387778 0.085135736 0.456990600 -1.405070279
+106272833 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.397668689 -0.051785514 0.916066527 -4.178181130 0.105599925 0.994354606 0.010369749 0.208751884 -0.911431968 0.092612833 0.400892258 -1.295093582
+106539767 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.345666498 -0.052948993 0.936862350 -4.285116664 0.110631727 0.993743002 0.015344846 0.195070069 -0.931812882 0.098342501 0.349361509 -1.182773054
+106840067 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.284817457 -0.055293880 0.956985712 -4.392320606 0.115495987 0.993041575 0.023003323 0.168523273 -0.951598525 0.103976257 0.289221793 -1.053514096
+107107000 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.228878200 -0.056077410 0.971838534 -4.485196000 0.120451130 0.992298782 0.028890507 0.159180748 -0.965974271 0.110446639 0.233870149 -0.923927626
+107407300 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.162932962 -0.053445265 0.985188544 -4.601126217 0.124115810 0.991709769 0.033272449 0.152041098 -0.978799343 0.116856292 0.168215603 -0.758111250
+107674233 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.102818660 -0.051196381 0.993381739 -4.691710857 0.127722457 0.991087139 0.037858382 0.141352300 -0.986466050 0.122984610 0.108441174 -0.599244073
+107974533 0.474175212 0.842978122 0.500000000 0.500000000 0.000000000 0.000000000 0.034108389 -0.050325166 0.998150289 -4.758242879 0.132215530 0.990180492 0.045405328 0.118994547 -0.990633965 0.130422264 0.040427230 -0.433560831
diff --git a/assets/pose_files/2c80f9eb0d3b2bb4.png b/assets/pose_files/2c80f9eb0d3b2bb4.png
new file mode 100644
index 0000000000000000000000000000000000000000..258aef1a9006c4334a602608837d822fea73248c
Binary files /dev/null and b/assets/pose_files/2c80f9eb0d3b2bb4.png differ
diff --git a/assets/pose_files/2c80f9eb0d3b2bb4.txt b/assets/pose_files/2c80f9eb0d3b2bb4.txt
new file mode 100644
index 0000000000000000000000000000000000000000..b7873f812fa8cd65bbfde387c6ebb7f34d44c94d
--- /dev/null
+++ b/assets/pose_files/2c80f9eb0d3b2bb4.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=sLIFyXD2ujI
+77444033 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.980310440 0.030424286 -0.195104495 -0.195846403 -0.034550700 0.999244750 -0.017780757 0.034309913 0.194416180 0.024171660 0.980621278 -0.178639121
+77610867 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.973806441 0.034138829 -0.224801064 -0.221452338 -0.039088678 0.999080658 -0.017603843 0.038706263 0.223993421 0.025929911 0.974245667 -0.219951444
+77777700 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.965910375 0.037603889 -0.256131083 -0.242696017 -0.043735024 0.998875856 -0.018281631 0.046505467 0.255155712 0.028860316 0.966469169 -0.265310453
+77944533 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.956261098 0.040829532 -0.289650917 -0.252766079 -0.048421524 0.998644531 -0.019089982 0.054620904 0.288478881 0.032280345 0.956941962 -0.321621308
+78144733 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.941536248 0.043692805 -0.334066480 -0.250198162 -0.053955212 0.998311937 -0.021497937 0.069548726 0.332563221 0.038265716 0.942304313 -0.401964240
+78311567 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.926658213 0.046350952 -0.373036444 -0.239336491 -0.058738846 0.998033047 -0.021904159 0.077439241 0.371287435 0.042209402 0.927558064 -0.474019461
+78478400 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.909880757 0.048629351 -0.412009954 -0.218247042 -0.063676558 0.997708619 -0.022863906 0.088967126 0.409954011 0.047038805 0.910892427 -0.543114491
+78645233 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.891359746 0.050869841 -0.450433195 -0.185763327 -0.067926541 0.997452736 -0.021771761 0.093745158 0.448178291 0.050002839 0.892544627 -0.611223637
+78845433 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.870108902 0.054302387 -0.489858925 -0.153515269 -0.074510135 0.996981323 -0.021829695 0.107765162 0.487194777 0.055493668 0.871528387 -0.691303250
+79012267 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.852772951 0.056338910 -0.519234240 -0.128052677 -0.078825951 0.996660411 -0.021319628 0.116291007 0.516299069 0.059109934 0.854366004 -0.760654136
+79179100 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.835254073 0.059146367 -0.546673834 -0.101344556 -0.084243484 0.996225357 -0.020929486 0.126763936 0.543372452 0.063535146 0.837083995 -0.832841061
+79345933 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.818755865 0.062443536 -0.570736051 -0.077325807 -0.089739971 0.995768547 -0.019791666 0.136091605 0.567085147 0.067422375 0.820895016 -0.908256727
+79546133 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.798365474 0.066207208 -0.598522484 -0.043774887 -0.096616283 0.995144248 -0.018795265 0.150808225 0.594371796 0.072832510 0.800885499 -0.994657638
+79712967 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.781648815 0.069040783 -0.619885862 -0.013285614 -0.101820730 0.994646847 -0.017611075 0.161173621 0.615351617 0.076882906 0.784494340 -1.070102980
+79879800 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.765168309 0.072694756 -0.639713168 0.019850080 -0.108554602 0.993946910 -0.016894773 0.177612448 0.634612799 0.082371153 0.768428028 -1.147576811
+80080000 0.483930168 0.860320329 0.500000000 0.500000000 0.000000000 0.000000000 0.745406330 0.077463314 -0.662094295 0.062107046 -0.117075674 0.993000031 -0.015629012 0.200140798 0.656248927 0.089165099 0.749257565 -1.238600776
diff --git a/assets/pose_files/2f25826f0d0ef09a.png b/assets/pose_files/2f25826f0d0ef09a.png
new file mode 100644
index 0000000000000000000000000000000000000000..b17a8f1369b58d514dcc718c5db486c7f3b85854
Binary files /dev/null and b/assets/pose_files/2f25826f0d0ef09a.png differ
diff --git a/assets/pose_files/2f25826f0d0ef09a.txt b/assets/pose_files/2f25826f0d0ef09a.txt
new file mode 100644
index 0000000000000000000000000000000000000000..316703a9f682933e388f5e2e52dd825dd9921a80
--- /dev/null
+++ b/assets/pose_files/2f25826f0d0ef09a.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=t-mlAKnESzQ
+167200000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.991872013 -0.011311784 0.126735851 0.400533760 0.012037775 0.999915242 -0.004963919 -0.047488550 -0.126668960 0.006449190 0.991924107 -0.414499612
+167467000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.991945148 -0.011409644 0.126153216 0.506974565 0.012122569 0.999914587 -0.004884966 -0.069421149 -0.126086697 0.006374919 0.991998732 -0.517325825
+167734000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.992271781 -0.010751382 0.123616949 0.590358341 0.011312660 0.999928653 -0.003839425 -0.085158661 -0.123566844 0.005208189 0.992322564 -0.599035085
+168034000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.993287027 -0.009973313 0.115245141 0.673577580 0.010455138 0.999938965 -0.003577147 -0.104263255 -0.115202427 0.004758038 0.993330657 -0.691557669
+168301000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.993988216 -0.009955904 0.109033749 0.753843765 0.010435819 0.999938190 -0.003831771 -0.106670354 -0.108988866 0.004946592 0.994030654 -0.805538867
+168602000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.994774222 -0.010583352 0.101549298 0.846176230 0.011122120 0.999926925 -0.004740742 -0.089426372 -0.101491705 0.005845411 0.994819224 -0.933449460
+168869000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.995415390 -0.010595482 0.095057681 0.913119395 0.011053002 0.999929726 -0.004287821 -0.072756893 -0.095005572 0.005318835 0.995462537 -1.037255409
+169169000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.996029556 -0.009414902 0.088523701 0.977259045 0.009879347 0.999939620 -0.004809874 -0.042104006 -0.088473074 0.005665333 0.996062458 -1.127427189
+169436000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.996695757 -0.008890423 0.080737323 1.025351476 0.009221899 0.999950528 -0.003733651 -0.007486727 -0.080700137 0.004465866 0.996728420 -1.188659636
+169736000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.997404695 -0.008404067 0.071506783 1.073562767 0.008649707 0.999957681 -0.003126226 0.054879890 -0.071477488 0.003736625 0.997435212 -1.216979926
+170003000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.997887254 -0.008228444 0.064446673 1.110116903 0.008409287 0.999961436 -0.002535321 0.124372514 -0.064423330 0.003071915 0.997917950 -1.231904045
+170303000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.998332024 -0.007790270 0.057205603 1.136173895 0.007975516 0.999963641 -0.003010646 0.212542522 -0.057180069 0.003461868 0.998357892 -1.242942079
+170570000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.998471320 -0.007715963 0.054730706 1.159189486 0.007868989 0.999965727 -0.002581036 0.310163907 -0.054708913 0.003007766 0.998497844 -1.245661417
+170871000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.998552144 -0.007742116 0.053231847 1.173763753 0.007991423 0.999958038 -0.004472161 0.412779543 -0.053194992 0.004891084 0.998572171 -1.229165757
+171137000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.998553872 -0.007909958 0.053175092 1.179029258 0.008138723 0.999958515 -0.004086919 0.509089997 -0.053140558 0.004513786 0.998576820 -1.196146494
+171438000 0.470983989 0.837304886 0.500000000 0.500000000 0.000000000 0.000000000 0.998469293 -0.008281939 0.054685175 1.181414517 0.008542870 0.999953210 -0.004539483 0.618089736 -0.054645021 0.004999703 0.998493314 -1.159911786
diff --git a/assets/pose_files/3c35b868a8ec3433.png b/assets/pose_files/3c35b868a8ec3433.png
new file mode 100644
index 0000000000000000000000000000000000000000..630b1531dc75b26ce41936b6a74172ac20d291c8
Binary files /dev/null and b/assets/pose_files/3c35b868a8ec3433.png differ
diff --git a/assets/pose_files/3c35b868a8ec3433.txt b/assets/pose_files/3c35b868a8ec3433.txt
new file mode 100644
index 0000000000000000000000000000000000000000..671dc924f15cfe4228560b9e72103f5d3ab1e2f5
--- /dev/null
+++ b/assets/pose_files/3c35b868a8ec3433.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=bJyPo9pESu0
+189622767 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.966956913 0.041186374 -0.251590967 0.235831829 -0.037132759 0.999092996 0.020840336 0.069818943 0.252221137 -0.010809440 0.967609227 -0.850289525
+189789600 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.967445135 0.041703269 -0.249621317 0.217678822 -0.037349533 0.999056637 0.022154763 0.078295447 0.250309765 -0.012110277 0.968090057 -0.818677483
+189956433 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.967587769 0.043305319 -0.248794302 0.196350216 -0.038503598 0.998966932 0.024136283 0.085749990 0.249582499 -0.013774496 0.968255579 -0.778043636
+190123267 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.967742383 0.044170257 -0.248039767 0.170234078 -0.039154600 0.998917341 0.025120445 0.090556068 0.248880804 -0.014598221 0.968424082 -0.733500964
+190323467 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.973553717 0.043272153 -0.224322766 0.120337922 -0.038196862 0.998907626 0.026917407 0.091227451 0.225242496 -0.017637115 0.974143088 -0.680520640
+190490300 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.984184802 0.039637893 -0.172653258 0.065019106 -0.035194401 0.998967648 0.028723357 0.090969669 0.173613548 -0.022192664 0.984563768 -0.638603728
+190657133 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.993078411 0.035358477 -0.112004772 0.011571313 -0.032207530 0.999036312 0.029818388 0.092482656 0.112951167 -0.026004599 0.993260205 -0.588118143
+190823967 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.997807920 0.031166473 -0.058378015 -0.027908508 -0.029339414 0.999060452 0.031897116 0.092538838 0.059317287 -0.030114418 0.997784853 -0.529325066
+191024167 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.999654651 0.026247790 -0.001263706 -0.064570799 -0.026190240 0.999087334 0.033742432 0.091922841 0.002148218 -0.033697683 0.999429762 -0.448626929
+191191000 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.998773992 0.022079065 0.044305529 -0.084478169 -0.023622099 0.999121666 0.034611158 0.087434649 -0.043502431 -0.035615314 0.998418272 -0.371306296
+191357833 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.995725632 0.017435640 0.090699598 -0.094868572 -0.020876031 0.999092638 0.037122324 0.082208324 -0.089970052 -0.038857099 0.995186150 -0.290596011
+191524667 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.989503622 0.013347236 0.143890470 -0.096537122 -0.019140780 0.999057651 0.038954727 0.079283141 -0.143234938 -0.041300017 0.988826632 -0.207477308
+191724867 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.975660741 0.006981443 0.219174415 -0.085240259 -0.016479453 0.999001026 0.041537181 0.072219148 -0.218665481 -0.044138070 0.974801123 -0.112100139
+191891700 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.955792487 -0.000511726 0.294041574 -0.064476318 -0.012924311 0.998958945 0.043749433 0.061688334 -0.293757826 -0.045615666 0.954790831 -0.034724173
+192058533 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.925362229 -0.009678445 0.378960580 -0.029417786 -0.008889219 0.998845160 0.047216032 0.058476640 -0.378979892 -0.047060598 0.924207509 0.042010383
+192258733 0.474122545 0.842884498 0.500000000 0.500000000 0.000000000 0.000000000 0.872846186 -0.021581186 0.487517983 0.038433307 -0.004890797 0.998584569 0.052961230 0.057516307 -0.487970918 -0.048611358 0.871505201 0.124675285
diff --git a/assets/pose_files/3f79dc32d575bcdc.png b/assets/pose_files/3f79dc32d575bcdc.png
new file mode 100644
index 0000000000000000000000000000000000000000..9dc6cc6eebe5029ee4cf3cc28c7e97030e2af38b
Binary files /dev/null and b/assets/pose_files/3f79dc32d575bcdc.png differ
diff --git a/assets/pose_files/3f79dc32d575bcdc.txt b/assets/pose_files/3f79dc32d575bcdc.txt
new file mode 100644
index 0000000000000000000000000000000000000000..9ece1affc7fd92eb5986758c84b2f19d3a8edefd
--- /dev/null
+++ b/assets/pose_files/3f79dc32d575bcdc.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=1qVpRlWxam4
+86319567 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.999183893 0.038032386 -0.013605987 -0.249154748 -0.038085770 0.999267697 -0.003686040 0.047875167 0.013455833 0.004201226 0.999900639 -0.566803149
+86586500 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.999392629 0.034676589 -0.003445767 -0.282371175 -0.034685481 0.999395013 -0.002555777 0.057086778 0.003355056 0.002673743 0.999990821 -0.624021456
+86853433 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.999498725 0.028301919 0.014187563 -0.320995587 -0.028301118 0.999599397 -0.000257314 0.061367205 -0.014189162 -0.000144339 0.999899328 -0.706664680
+87153733 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.999064565 0.022049030 0.037200645 -0.371910835 -0.022201553 0.999746680 0.003691827 0.063911726 -0.037109818 -0.004514286 0.999301016 -0.799748814
+87420667 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.998171926 0.018552339 0.057520505 -0.440220060 -0.018887693 0.999807596 0.005291941 0.070160264 -0.057411261 -0.006368696 0.998330295 -0.853433007
+87720967 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.997675776 0.016262729 0.066170901 -0.486385324 -0.016915560 0.999813497 0.009317505 0.069230577 -0.066007033 -0.010415167 0.997764826 -0.912234761
+87987900 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.998019218 0.015867118 0.060876362 -0.497549423 -0.016505934 0.999813735 0.010005167 0.076295227 -0.060706269 -0.010990170 0.998095155 -0.980435972
+88288200 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.999152124 0.018131699 0.036962789 -0.468507446 -0.018461898 0.999792457 0.008611582 0.087696066 -0.036798976 -0.009286684 0.999279559 -1.074633197
+88555133 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.999717414 0.022977378 0.006097841 -0.420528982 -0.023013741 0.999717355 0.005961678 0.101216630 -0.005959134 -0.006100327 0.999963641 -1.169004730
+88855433 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.999106526 0.030726369 -0.029017152 -0.374249594 -0.030677194 0.999527037 0.002138488 0.120936030 0.029069137 -0.001246413 0.999576628 -1.251082317
+89122367 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.997359693 0.039784521 -0.060752310 -0.335843098 -0.039773725 0.999207735 0.001387495 0.132824955 0.060759377 0.001032514 0.998151898 -1.312258423
+89422667 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.992973983 0.050480653 -0.107025139 -0.253623964 -0.050627887 0.998716712 0.001342622 0.144421611 0.106955573 0.004085269 0.994255424 -1.394020432
+89689600 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.986886561 0.059628733 -0.149997801 -0.173418608 -0.059660275 0.998209476 0.004293700 0.142984494 0.149985254 0.004711515 0.988677025 -1.462588413
+89989900 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.978200734 0.067550205 -0.196367815 -0.089199207 -0.067402542 0.997698128 0.007442682 0.141665403 0.196418539 0.005955252 0.980502069 -1.524381413
+90256833 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.967793405 0.073765829 -0.240695804 0.013635864 -0.073441446 0.997246027 0.010330606 0.134276795 0.240794986 0.007679154 0.970545650 -1.588498428
+90557133 0.487278048 0.866272132 0.500000000 0.500000000 0.000000000 0.000000000 0.953711152 0.081056722 -0.289594263 0.148156165 -0.081249826 0.996628821 0.011376631 0.129987979 0.289540142 0.012679463 0.957081914 -1.633951355
diff --git a/assets/pose_files/4a2d6753676df096.png b/assets/pose_files/4a2d6753676df096.png
new file mode 100644
index 0000000000000000000000000000000000000000..20481d7ea0c8e3de101ab6d30e87b984cce03547
Binary files /dev/null and b/assets/pose_files/4a2d6753676df096.png differ
diff --git a/assets/pose_files/4a2d6753676df096.txt b/assets/pose_files/4a2d6753676df096.txt
new file mode 100644
index 0000000000000000000000000000000000000000..74b9d4583c1ba540796caa2ad01edaf887b27d51
--- /dev/null
+++ b/assets/pose_files/4a2d6753676df096.txt
@@ -0,0 +1,17 @@
+https://www.youtube.com/watch?v=mGFQkgadzRQ
+123665000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.996869564 0.002875770 -0.079011612 -0.427841466 -0.002861131 0.999995887 0.000298484 -0.005788880 0.079012141 -0.000071487 0.996873677 0.132732609
+123999000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.993462563 0.003229393 -0.114112593 -0.472377562 -0.003208589 0.999994814 0.000365978 -0.005932507 0.114113182 0.000002555 0.993467748 0.123959606
+124332000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.988605380 0.003602870 -0.150487319 -0.517270184 -0.003599323 0.999993503 0.000295953 -0.005751638 0.150487408 0.000249071 0.988611877 0.113156366
+124708000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.981692851 0.004048047 -0.190427750 -0.566330350 -0.004096349 0.999991596 0.000139980 -0.007622665 0.190426722 0.000642641 0.981701195 0.098572887
+125041000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.974759340 0.004326052 -0.223216295 -0.606091424 -0.004403458 0.999990284 0.000150970 -0.009427620 0.223214790 0.000835764 0.974768937 0.084984909
+125417000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.965238512 0.004419941 -0.261333257 -0.651601078 -0.004571608 0.999989569 0.000027561 -0.007437027 0.261330664 0.001168111 0.965248644 0.068577736
+125750000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.953956902 0.004390486 -0.299911648 -0.697081969 -0.004806366 0.999988258 -0.000648964 -0.003676960 0.299905270 0.002060569 0.953966737 0.050264043
+126126000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.940579295 0.004839818 -0.339539677 -0.744385684 -0.005527717 0.999984145 -0.001058831 -0.001820489 0.339529186 0.002872794 0.940591156 0.028560147
+126459000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.928297341 0.004980532 -0.371805429 -0.781716025 -0.005848793 0.999982178 -0.001207554 -0.001832299 0.371792793 0.003295582 0.928309917 0.009470658
+126835000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.913324535 0.005156573 -0.407199889 -0.824074795 -0.006227055 0.999979734 -0.001303667 -0.001894351 0.407184929 0.003726327 0.913338125 -0.013179829
+127168000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.898822486 0.005400294 -0.438279599 -0.860775204 -0.006702366 0.999976516 -0.001423908 -0.001209170 0.438261628 0.004217350 0.898837566 -0.034594674
+127544000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.880397439 0.005455900 -0.474205226 -0.903308447 -0.007032821 0.999974072 -0.001551900 -0.000798134 0.474184483 0.004701289 0.880412936 -0.061250069
+127877000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.862660766 0.005402398 -0.505754173 -0.939888304 -0.007276668 0.999972045 -0.001730187 -0.000489221 0.505730629 0.005172769 0.862675905 -0.086411685
+128253000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.841714203 0.005442667 -0.539895892 -0.978630821 -0.007698633 0.999968529 -0.001921765 0.000975953 0.539868414 0.005774037 0.841729641 -0.115983579
+128587000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.823229551 0.005282366 -0.567684054 -1.010071242 -0.007977336 0.999965608 -0.002263572 0.002284809 0.567652583 0.006392045 0.823243380 -0.141444392
+128962000 0.591609280 1.051749871 0.500000000 0.500000000 0.000000000 0.000000000 0.802855015 0.005112482 -0.596152425 -1.042319682 -0.008217614 0.999963105 -0.002491409 0.003637235 0.596117735 0.006899191 0.802867413 -0.169369454
diff --git a/assets/pose_files/color_bar.png b/assets/pose_files/color_bar.png
new file mode 100644
index 0000000000000000000000000000000000000000..6c73f221640b043844ab2f7b9c504ddba25e5b0f
Binary files /dev/null and b/assets/pose_files/color_bar.png differ
diff --git a/assets/pose_files/complex_1.png b/assets/pose_files/complex_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..3ae49c98f2eb8ceace2e86ff04dd6c7b4c712e84
--- /dev/null
+++ b/assets/pose_files/complex_1.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:12e6411790a6d0d21b9ee25cf6b540016d58ddc9dfe11943c02b23d33cb36c40
+size 184778
diff --git a/assets/pose_files/complex_1.pth b/assets/pose_files/complex_1.pth
new file mode 100644
index 0000000000000000000000000000000000000000..d421750a27a93d51f18b175578f6d93d8ab42087
--- /dev/null
+++ b/assets/pose_files/complex_1.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fa358d46a1d516f4a9186928b697af3167a57c8996a050bca61d3ae62316c2e2
+size 1521
diff --git a/assets/pose_files/complex_2.png b/assets/pose_files/complex_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..09868a1dde24a3735648dc161d14a46332369e87
--- /dev/null
+++ b/assets/pose_files/complex_2.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c2efb207f514382d5d8927e853e94c3707727f576687284cc338dd6e3fdc58f1
+size 183569
diff --git a/assets/pose_files/complex_2.pth b/assets/pose_files/complex_2.pth
new file mode 100644
index 0000000000000000000000000000000000000000..0cd94f6ae58e09d2c923a8a30e8fd031e9274016
--- /dev/null
+++ b/assets/pose_files/complex_2.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4643626ae50680c8ff1c14b9507ffb840e6bbd21a73d7845bc45c454591073ba
+size 1521
diff --git a/assets/pose_files/complex_3.png b/assets/pose_files/complex_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..f40c9e800fb6aefd3e35444abca9f0c0de1ff926
--- /dev/null
+++ b/assets/pose_files/complex_3.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6342bd2532e81cc5257891667c0c76e1eec7d09bcfafee41098306169d8e2e63
+size 199701
diff --git a/assets/pose_files/complex_3.pth b/assets/pose_files/complex_3.pth
new file mode 100644
index 0000000000000000000000000000000000000000..4040a9cb930ba6c3935cea5bd220677b240c9ab1
--- /dev/null
+++ b/assets/pose_files/complex_3.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:82af5c5211af04db50521b4e5ca64196a4965ee8f3074ab18d52023c488834d0
+size 1521
diff --git a/assets/pose_files/complex_4.png b/assets/pose_files/complex_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..ab7a1594000f9cbef74ed0f7ca246d90b4131a02
--- /dev/null
+++ b/assets/pose_files/complex_4.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d36a39d30bdec7a43029683f34107ff9b645801eed0fe9b9d823689ff71227b6
+size 185156
diff --git a/assets/pose_files/complex_4.pth b/assets/pose_files/complex_4.pth
new file mode 100644
index 0000000000000000000000000000000000000000..77074ddac59dc6f2d02c9fe2fb77f8d2b5bd36f2
--- /dev/null
+++ b/assets/pose_files/complex_4.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c045a7176f6f3ee95c517706e8fabf6c548d1a6fe0d8cf139a4a46955196d50d
+size 1521
diff --git a/assets/sea.jpg b/assets/sea.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..deece29c1be91ec9d48d9ff1d8f3dc2b0ccc7dd6
--- /dev/null
+++ b/assets/sea.jpg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bef05aeada45bbd68ca6e39f267dcf29a9fd223deb6c23aca03ca4af063a74fd
+size 1500066
diff --git a/assets/syn_video_control1.mp4 b/assets/syn_video_control1.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..b2cd75f20fc87064f7274dc1161f4fa60f533cd7
--- /dev/null
+++ b/assets/syn_video_control1.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:62a819a54311b19c9187d0561f404ffe1a37bea0d1cea065ef742ec880a91dce
+size 597937
diff --git a/assets/syn_video_control2.mp4 b/assets/syn_video_control2.mp4
new file mode 100644
index 0000000000000000000000000000000000000000..d034c947c5abcdf41c5cd5ebcaa18d096e180af9
--- /dev/null
+++ b/assets/syn_video_control2.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:efee5c51d2db1bd3f075c4269ba62482f454d735b54760d91a34358893696cb0
+size 952736
diff --git a/data/dot_single_video/checkpoints/cvo_raft_patch_8.pth b/data/dot_single_video/checkpoints/cvo_raft_patch_8.pth
new file mode 100644
index 0000000000000000000000000000000000000000..c0f920d7c2fabb676de4325c77fd4cc7b5bd84b2
--- /dev/null
+++ b/data/dot_single_video/checkpoints/cvo_raft_patch_8.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c1422383b0e61f3c8543c4bedf6c64087675421cc37616a60faa580bbadddc51
+size 21093617
diff --git a/data/dot_single_video/checkpoints/movi_f_cotracker2_patch_4_wind_8.pth b/data/dot_single_video/checkpoints/movi_f_cotracker2_patch_4_wind_8.pth
new file mode 100644
index 0000000000000000000000000000000000000000..2113d57ff1b4ccae0645ee86f39f91f84ed65209
--- /dev/null
+++ b/data/dot_single_video/checkpoints/movi_f_cotracker2_patch_4_wind_8.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:89e7585d3e95d6e2bc4ff74ce072a98f70377047e669d44fa0b5c01311d4f54c
+size 204379466
diff --git a/data/dot_single_video/checkpoints/movi_f_cotracker_patch_4_wind_8.pth b/data/dot_single_video/checkpoints/movi_f_cotracker_patch_4_wind_8.pth
new file mode 100644
index 0000000000000000000000000000000000000000..2bd5a940c61fd3bb20c21fdb3942f4df658e3374
--- /dev/null
+++ b/data/dot_single_video/checkpoints/movi_f_cotracker_patch_4_wind_8.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8788efe1d7c462757200605a2bcfd357a76d385ae57769be605016d7f9bbb1d5
+size 96650477
diff --git a/data/dot_single_video/checkpoints/movi_f_raft_patch_4_alpha.pth b/data/dot_single_video/checkpoints/movi_f_raft_patch_4_alpha.pth
new file mode 100644
index 0000000000000000000000000000000000000000..7841f89df64c6278eb5cfd57cc44e0a6bb80db02
--- /dev/null
+++ b/data/dot_single_video/checkpoints/movi_f_raft_patch_4_alpha.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ae2b478a99f1948384c237f0935eff2a61be4f7b736493c922b2fb5223117fb9
+size 23173365
diff --git a/data/dot_single_video/configs/cotracker2_patch_4_wind_8.json b/data/dot_single_video/configs/cotracker2_patch_4_wind_8.json
new file mode 100644
index 0000000000000000000000000000000000000000..c777e3bc25ee578ccb183bfef0c922b6f9e221aa
--- /dev/null
+++ b/data/dot_single_video/configs/cotracker2_patch_4_wind_8.json
@@ -0,0 +1,5 @@
+{
+ "name": "cotracker2",
+ "patch_size": 4,
+ "wind_size": 8
+}
\ No newline at end of file
diff --git a/data/dot_single_video/configs/cotracker_patch_4_wind_8.json b/data/dot_single_video/configs/cotracker_patch_4_wind_8.json
new file mode 100644
index 0000000000000000000000000000000000000000..2d24aabd72ca712de22263f0116c7900509ebdde
--- /dev/null
+++ b/data/dot_single_video/configs/cotracker_patch_4_wind_8.json
@@ -0,0 +1,5 @@
+{
+ "name": "cotracker",
+ "patch_size": 4,
+ "wind_size": 8
+}
\ No newline at end of file
diff --git a/data/dot_single_video/configs/dot_single_video_1105.yaml b/data/dot_single_video/configs/dot_single_video_1105.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..0185107f11f1f2a0533692694b8c233c2678c8d7
--- /dev/null
+++ b/data/dot_single_video/configs/dot_single_video_1105.yaml
@@ -0,0 +1,23 @@
+dot_model:
+ height: 320
+ width: 512
+ tracker_config: data/dot_single_video/configs/cotracker2_patch_4_wind_8.json
+ tracker_path: data/dot_single_video/checkpoints/movi_f_cotracker2_patch_4_wind_8.pth
+ estimator_config: data/dot_single_video/configs/raft_patch_8.json
+ estimator_path: data/dot_single_video/checkpoints/cvo_raft_patch_8.pth
+ refiner_config: data/dot_single_video/configs/raft_patch_4_alpha.json
+ refiner_path: data/dot_single_video/checkpoints/movi_f_raft_patch_4_alpha.pth
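+ # Note: the checkpoint (.pth) and config (.json) paths above refer to the files
+ # added under data/dot_single_video/checkpoints and data/dot_single_video/configs in this repo.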
+
+inference_config:
+ mode: tracks_from_first_to_every_other_frame
+ return_flow: true # ! important params
+
+ num_tracks: 8192
+ sim_tracks: 2048
+ sample_mode: all
+
+ is_train: false
+ interpolation_version: torch3d
+ alpha_thresh: 0.8
+
+
diff --git a/data/dot_single_video/configs/raft_patch_4_alpha.json b/data/dot_single_video/configs/raft_patch_4_alpha.json
new file mode 100644
index 0000000000000000000000000000000000000000..13706299bc40b4b73b4c9fc93690bc8cf1fa488e
--- /dev/null
+++ b/data/dot_single_video/configs/raft_patch_4_alpha.json
@@ -0,0 +1,8 @@
+{
+ "name": "raft",
+ "patch_size": 4,
+ "num_iter": 4,
+ "refine_alpha": true,
+ "norm_fnet": "instance",
+ "norm_cnet": "instance"
+}
\ No newline at end of file
diff --git a/data/dot_single_video/configs/raft_patch_8.json b/data/dot_single_video/configs/raft_patch_8.json
new file mode 100644
index 0000000000000000000000000000000000000000..5fc0080c4af486337b916e772b9cbb93c967eedf
--- /dev/null
+++ b/data/dot_single_video/configs/raft_patch_8.json
@@ -0,0 +1,8 @@
+{
+ "name": "raft",
+ "patch_size": 8,
+ "num_iter": 12,
+ "refine_alpha": false,
+ "norm_fnet": "instance",
+ "norm_cnet": "batch"
+}
\ No newline at end of file
diff --git a/data/dot_single_video/dot/__init__.py b/data/dot_single_video/dot/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/data/dot_single_video/dot/data/cvo_dataset.py b/data/dot_single_video/dot/data/cvo_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..58e1fc8fe359eb67093734bf79026f1791eabc64
--- /dev/null
+++ b/data/dot_single_video/dot/data/cvo_dataset.py
@@ -0,0 +1,129 @@
+import os
+import os.path as osp
+from collections import OrderedDict
+
+import lmdb
+import torch
+import numpy as np
+import pickle as pkl
+from einops import rearrange
+from torch.utils.data import Dataset, DataLoader
+
+from dot.utils.torch import get_alpha_consistency
+
+
+class CVO_sampler_lmdb:
+ """Data sampling"""
+
+ all_keys = ["imgs", "imgs_blur", "fflows", "bflows", "delta_fflows", "delta_bflows"]
+
+ def __init__(self, data_root, keys=None, split=None):
+ if split == "extended":
+ self.db_path = osp.join(data_root, "cvo_test_extended.lmdb")
+ else:
+ self.db_path = osp.join(data_root, "cvo_test.lmdb")
+ self.split = split
+
+ self.env = lmdb.open(
+ self.db_path,
+ subdir=os.path.isdir(self.db_path),
+ readonly=True,
+ lock=False,
+ readahead=False,
+ meminit=False,
+ )
+ with self.env.begin(write=False) as txn:
+ self.samples = pkl.loads(txn.get(b"__samples__"))
+ self.length = len(self.samples)
+
+ self.keys = self.all_keys if keys is None else [x.lower() for x in keys]
+ self._check_keys(self.keys)
+
+ def _check_keys(self, keys):
+ # check keys are supported:
+ for k in keys:
+ assert k in self.all_keys, f"Invalid key value: {k}"
+
+ def __len__(self):
+ return self.length
+
+ def sample(self, index):
+ sample = OrderedDict()
+ with self.env.begin(write=False) as txn:
+ for k in self.keys:
+ key = "{:05d}_{:s}".format(index, k)
+ value = pkl.loads(txn.get(key.encode()))
+ if "flow" in key and self.split in ["clean", "final"]: # Convert Int to Floating
+ value = value.astype(np.float32)
+ value = (value - 2 ** 15) / 128.0
+ if "imgs" in k:
+ k = "imgs"
+ sample[k] = value
+ return sample
+
+
+class CVO(Dataset):
+ all_keys = ["fflows", "bflows"]
+
+ def __init__(self, data_root, keys=None, split="clean", crop_size=256):
+ keys = self.all_keys if keys is None else [x.lower() for x in keys]
+ self._check_keys(keys)
+ if split == "final":
+ keys.append("imgs_blur")
+ else:
+ keys.append("imgs")
+ self.split = split
+ self.sampler = CVO_sampler_lmdb(data_root, keys, split)
+
+ def __getitem__(self, index):
+ sample = self.sampler.sample(index)
+
+ video = torch.from_numpy(sample["imgs"].copy())
+ video = video / 255.0
+ video = rearrange(video, "h w (t c) -> t c h w", c=3)
+
+ fflow = torch.from_numpy(sample["fflows"].copy())
+ fflow = rearrange(fflow, "h w (t c) -> t h w c", c=2)[-1]
+
+ bflow = torch.from_numpy(sample["bflows"].copy())
+ bflow = rearrange(bflow, "h w (t c) -> t h w c", c=2)[-1]
+
+ if self.split in ["clean", "final"]:
+ thresh_1 = 0.01
+ thresh_2 = 0.5
+ elif self.split == "extended":
+ thresh_1 = 0.1
+ thresh_2 = 0.5
+ else:
+ raise ValueError(f"Unknown split {self.split}")
+
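+ # Forward/backward flow consistency yields a per-pixel alpha (visibility) mask.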
+ alpha = get_alpha_consistency(bflow[None], fflow[None], thresh_1=thresh_1, thresh_2=thresh_2)[0]
+
+ data = {
+ "video": video,
+ "alpha": alpha,
+ "flow": bflow
+ }
+
+ return data
+
+ def _check_keys(self, keys):
+ # check keys are supported:
+ for k in keys:
+ assert k in self.all_keys, f"Invalid key value: {k}"
+
+ def __len__(self):
+ return len(self.sampler)
+
+
+def create_optical_flow_dataset(args):
+ dataset = CVO(args.data_root, split=args.split)
+ dataloader = DataLoader(
+ dataset,
+ batch_size=args.batch_size,
+ pin_memory=True,
+ shuffle=False,
+ num_workers=0,
+ drop_last=False,
+ )
+ return dataloader
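+
+# Minimal usage sketch (assumes an argparse-style namespace and a local CVO lmdb
+# export; adjust the paths and fields to your setup):
+#
+# from argparse import Namespace
+# args = Namespace(data_root="./cvo_data", split="clean", batch_size=4)
+# loader = create_optical_flow_dataset(args)
+# batch = next(iter(loader)) # dict with "video", "alpha" and "flow" tensors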
\ No newline at end of file
diff --git a/data/dot_single_video/dot/data/movi_f_dataset.py b/data/dot_single_video/dot/data/movi_f_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..3a7e3a3ad17a62dfde3ee49b04eb7e3747e5503d
--- /dev/null
+++ b/data/dot_single_video/dot/data/movi_f_dataset.py
@@ -0,0 +1,124 @@
+import os
+from glob import glob
+import random
+import numpy as np
+import torch
+from torch.utils import data
+
+from dot.utils.io import read_video, read_tracks
+
+
+def create_point_tracking_dataset(args, batch_size=1, split="train", num_workers=None, verbose=False):
+ dataset = Dataset(args, split, verbose)
+ dataloader = DataLoader(args, dataset, batch_size, split, num_workers)
+ return dataloader
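+
+# Minimal usage sketch (field names inferred from the Dataset/DataLoader classes
+# below; values are placeholders):
+#
+# args must provide: data_root, in_track_name, out_track_name, num_in_tracks,
+# num_out_tracks, valid_ratio, num_workers, world_size, rank
+#
+# loader = create_point_tracking_dataset(args, batch_size=2, split="train")
+# batch = loader.next() # src/tgt frames plus sampled point correspondences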
+
+
+class DataLoader:
+ def __init__(self, args, dataset, batch_size=1, split="train", num_workers=None):
+ num_workers = args.num_workers if num_workers is None else num_workers
+ is_train = split == "train"
+ self.sampler = data.distributed.DistributedSampler(dataset, args.world_size, args.rank) if is_train else None
+ self.loader = data.DataLoader(
+ dataset,
+ batch_size=batch_size,
+ num_workers=num_workers,
+ sampler=self.sampler,
+ )
+ self.epoch = -1
+ self.reinit()
+
+ def reinit(self):
+ self.epoch += 1
+ if self.sampler:
+ self.sampler.set_epoch(self.epoch)
+ self.iter = iter(self.loader)
+
+ def next(self):
+ try:
+ return next(self.iter)
+ except StopIteration:
+ self.reinit()
+ return next(self.iter)
+
+
+def get_correspondences(track_path, src_step, tgt_step, num_tracks, height, width, vis_src_only):
+ H, W = height, width
+ tracks = torch.from_numpy(read_tracks(track_path))
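+ # tracks has shape [num_tracks, num_steps, 3] with (x, y, visibility);
+ # x and y are normalized to [0, 1] below.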
+ tracks[..., 0] = tracks[..., 0] / (W - 1)
+ tracks[..., 1] = tracks[..., 1] / (H - 1)
+ src_points = tracks[:, src_step]
+ tgt_points = tracks[:, tgt_step]
+ if vis_src_only:
+ src_alpha = src_points[..., 2]
+ vis_idx = torch.nonzero(src_alpha, as_tuple=True)[0]
+ num_vis = vis_idx.shape[0]
+ if num_vis == 0:
+ return False, None
+ samples = np.random.choice(num_vis, num_tracks, replace=num_tracks > num_vis)
+ idx = vis_idx[samples]
+ else:
+ idx = np.random.choice(tracks.size(0), num_tracks, replace=num_tracks > tracks.size(0))
+ return True, (src_points[idx], tgt_points[idx])
+
+
+class Dataset(data.Dataset):
+ def __init__(self, args, split="train", verbose=False):
+ super().__init__()
+ self.video_folder = os.path.join(args.data_root, "video")
+ self.in_track_folder = os.path.join(args.data_root, args.in_track_name)
+ self.out_track_folder = os.path.join(args.data_root, args.out_track_name)
+ self.num_in_tracks = args.num_in_tracks
+ self.num_out_tracks = args.num_out_tracks
+ num_videos = len(glob(os.path.join(self.video_folder, "*")))
+ self.video_steps = [
+ len(glob(os.path.join(self.video_folder, str(video_idx), "*"))) for video_idx in range(num_videos)
+ ]
+ video_indices = list(range(num_videos))
+ if split == "valid":
+ video_indices = video_indices[:int(num_videos * args.valid_ratio)]
+ elif split == "train":
+ video_indices = video_indices[int(num_videos * args.valid_ratio):]
+ self.video_indices = video_indices
+ self.num_videos = len(video_indices)
+ if verbose:
+ print(f"Created {split} dataset of length {self.num_videos}")
+
+ def __len__(self):
+ return self.num_videos
+
+ def __getitem__(self, idx):
+ idx = idx % self.num_videos
+ video_idx = self.video_indices[idx]
+ time_steps = self.video_steps[video_idx]
+ src_step = random.randrange(time_steps)
+ tgt_step = random.randrange(time_steps - 1)
+ tgt_step = (src_step + tgt_step) % time_steps
+
+ video_path = os.path.join(self.video_folder, str(video_idx))
+ src_frame = read_video(video_path, start_step=src_step, time_steps=1)[0]
+ tgt_frame = read_video(video_path, start_step=tgt_step, time_steps=1)[0]
+ _, H, W = src_frame.shape
+
+ in_track_path = os.path.join(self.in_track_folder, f"{video_idx}.npy")
+ out_track_path = os.path.join(self.out_track_folder, f"{video_idx}.npy")
+ vis_src_only = False
+ _, corr = get_correspondences(in_track_path, src_step, tgt_step, self.num_in_tracks, H, W, vis_src_only)
+ src_points, tgt_points = corr
+
+ vis_src_only = True
+ success, corr = get_correspondences(out_track_path, src_step, tgt_step, self.num_out_tracks, H, W, vis_src_only)
+ if not success:
+ return self[idx + 1]
+ out_src_points, out_tgt_points = corr
+
+ data = {
+ "src_frame": src_frame,
+ "tgt_frame": tgt_frame,
+ "src_points": src_points,
+ "tgt_points": tgt_points,
+ "out_src_points": out_src_points,
+ "out_tgt_points": out_tgt_points,
+ }
+
+ return data
diff --git a/data/dot_single_video/dot/data/movi_f_tf_dataset.py b/data/dot_single_video/dot/data/movi_f_tf_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..2347e6daffb5daab282a72a6f885b0a5c6702cd4
--- /dev/null
+++ b/data/dot_single_video/dot/data/movi_f_tf_dataset.py
@@ -0,0 +1,1005 @@
+# Copyright 2023 The Kubric Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Kubric dataset with point tracking."""
+
+import functools
+import itertools
+
+import matplotlib.pyplot as plt
+import mediapy as media
+import numpy as np
+import tensorflow.compat.v1 as tf
+import tensorflow_datasets as tfds
+from tensorflow_graphics.geometry.transformation import rotation_matrix_3d
+
+
+def project_point(cam, point3d, num_frames):
+ """Compute the image space coordinates [0, 1] for a set of points.
+
+ Args:
+ cam: The camera parameters, as returned by kubric. 'matrix_world' and
+ 'intrinsics' have a leading axis num_frames.
+ point3d: Points in 3D world coordinates. It has shape [num_frames,
+ num_points, 3].
+ num_frames: The number of frames in the video.
+
+ Returns:
+ Image coordinates in 2D. The last coordinate is an indicator of whether
+ the point is behind the camera.
+ """
+
+ homo_transform = tf.linalg.inv(cam['matrix_world'])
+ homo_intrinsics = tf.zeros((num_frames, 3, 1), dtype=tf.float32)
+ homo_intrinsics = tf.concat([cam['intrinsics'], homo_intrinsics], axis=2)
+
+ point4d = tf.concat([point3d, tf.ones_like(point3d[:, :, 0:1])], axis=2)
+ projected = tf.matmul(point4d, tf.transpose(homo_transform, (0, 2, 1)))
+ projected = tf.matmul(projected, tf.transpose(homo_intrinsics, (0, 2, 1)))
+ image_coords = projected / projected[:, :, 2:3]
+ image_coords = tf.concat(
+ [image_coords[:, :, :2],
+ tf.sign(projected[:, :, 2:])], axis=2)
+ return image_coords
+
+
+def unproject(coord, cam, depth):
+ """Unproject points.
+
+ Args:
+ coord: Points in 2D coordinates. It has shape [num_points, 2]. Coord is in
+ integer (y, x) order because of the way meshgrid is constructed.
+ cam: The camera parameters, as returned by kubric. 'matrix_world' and
+ 'intrinsics' have a leading axis num_frames.
+ depth: Depth map for the scene.
+
+ Returns:
+ Image coordinates in 3D.
+ """
+ shp = tf.convert_to_tensor(tf.shape(depth))
+ idx = coord[:, 0] * shp[1] + coord[:, 1]
+ coord = tf.cast(coord[..., ::-1], tf.float32)
+ shp = tf.cast(shp[1::-1], tf.float32)[tf.newaxis, ...]
+
+ # Need to convert from pixel to raster coordinate.
+ projected_pt = (coord + 0.5) / shp
+
+ projected_pt = tf.concat(
+ [
+ projected_pt,
+ tf.ones_like(projected_pt[:, -1:]),
+ ],
+ axis=-1,
+ )
+
+ camera_plane = projected_pt @ tf.linalg.inv(tf.transpose(cam['intrinsics']))
+ camera_ball = camera_plane / tf.sqrt(
+ tf.reduce_sum(
+ tf.square(camera_plane),
+ axis=1,
+ keepdims=True,
+ ), )
+ camera_ball *= tf.gather(tf.reshape(depth, [-1]), idx)[:, tf.newaxis]
+
+ camera_ball = tf.concat(
+ [
+ camera_ball,
+ tf.ones_like(camera_plane[:, 2:]),
+ ],
+ axis=1,
+ )
+ points_3d = camera_ball @ tf.transpose(cam['matrix_world'])
+ return points_3d[:, :3] / points_3d[:, 3:]
+
+
+def reproject(coords, camera, camera_pos, num_frames, bbox=None):
+ """Reconstruct points in 3D and reproject them to pixels.
+
+ Args:
+ coords: Points in 3D. It has shape [num_points, 3]. If bbox is specified,
+ these are assumed to be in local box coordinates (as specified by kubric),
+ and bbox will be used to put them into world coordinates; otherwise they
+ are assumed to be in world coordinates.
+ camera: the camera intrinsic parameters, as returned by kubric.
+ 'matrix_world' and 'intrinsics' have a leading axis num_frames.
+ camera_pos: the camera positions. It has shape [num_frames, 3]
+ num_frames: the number of frames in the video.
+ bbox: The kubric bounding box for the object. Its first axis is num_frames.
+
+ Returns:
+ Image coordinates in 2D and their respective depths. For the points,
+ the last coordinate is an indicator of whether the point is behind the
+ camera. They are of shape [num_points, num_frames, 3] and
+ [num_points, num_frames] respectively.
+ """
+ # First, reconstruct points in the local object coordinate system.
+ if bbox is not None:
+ coord_box = list(itertools.product([-.5, .5], [-.5, .5], [-.5, .5]))
+ coord_box = np.array([np.array(x) for x in coord_box])
+ coord_box = np.concatenate(
+ [coord_box, np.ones_like(coord_box[:, 0:1])], axis=1)
+ coord_box = tf.tile(coord_box[tf.newaxis, ...], [num_frames, 1, 1])
+ bbox_homo = tf.concat([bbox, tf.ones_like(bbox[:, :, 0:1])], axis=2)
+
+ local_to_world = tf.linalg.lstsq(tf.cast(coord_box, tf.float32), bbox_homo)
+ world_coords = tf.matmul(
+ tf.cast(
+ tf.concat([coords, tf.ones_like(coords[:, 0:1])], axis=1),
+ tf.float32)[tf.newaxis, :, :], local_to_world)
+ world_coords = world_coords[:, :, 0:3] / world_coords[:, :, 3:]
+ else:
+ world_coords = tf.tile(coords[tf.newaxis, :, :], [num_frames, 1, 1])
+
+ # Compute depths by taking the distance between the points and the camera
+ # center.
+ depths = tf.sqrt(
+ tf.reduce_sum(
+ tf.square(world_coords - camera_pos[:, np.newaxis, :]),
+ axis=2,
+ ), )
+
+ # Project each point back to the image using the camera.
+ projections = project_point(camera, world_coords, num_frames)
+
+ return (
+ tf.transpose(projections, (1, 0, 2)),
+ tf.transpose(depths),
+ tf.transpose(world_coords, (1, 0, 2)),
+ )
+
+
+def estimate_occlusion_by_depth_and_segment(
+ data,
+ segments,
+ x,
+ y,
+ num_frames,
+ thresh,
+ seg_id,
+):
+ """Estimate depth at a (floating point) x,y position.
+
+ We prefer overestimating depth at the point, so we take the max over the 4
+ neighboring pixels.
+
+ Args:
+ data: depth map. First axis is num_frames.
+ segments: segmentation map. First axis is num_frames.
+ x: x coordinate. First axis is num_frames.
+ y: y coordinate. First axis is num_frames.
+ num_frames: number of frames.
+ thresh: Depth threshold at which we consider the point occluded.
+ seg_id: Original segment id. Assume occlusion if there's a mismatch.
+
+ Returns:
+ Depth for each point.
+ """
+
+ # need to convert from raster to pixel coordinates
+ x = x - 0.5
+ y = y - 0.5
+
+ x0 = tf.cast(tf.floor(x), tf.int32)
+ x1 = x0 + 1
+ y0 = tf.cast(tf.floor(y), tf.int32)
+ y1 = y0 + 1
+
+ shp = tf.shape(data)
+ assert len(data.shape) == 3
+ x0 = tf.clip_by_value(x0, 0, shp[2] - 1)
+ x1 = tf.clip_by_value(x1, 0, shp[2] - 1)
+ y0 = tf.clip_by_value(y0, 0, shp[1] - 1)
+ y1 = tf.clip_by_value(y1, 0, shp[1] - 1)
+
+ data = tf.reshape(data, [-1])
+ rng = tf.range(num_frames)[:, tf.newaxis]
+ i1 = tf.gather(data, rng * shp[1] * shp[2] + y0 * shp[2] + x0)
+ i2 = tf.gather(data, rng * shp[1] * shp[2] + y1 * shp[2] + x0)
+ i3 = tf.gather(data, rng * shp[1] * shp[2] + y0 * shp[2] + x1)
+ i4 = tf.gather(data, rng * shp[1] * shp[2] + y1 * shp[2] + x1)
+
+ depth = tf.maximum(tf.maximum(tf.maximum(i1, i2), i3), i4)
+
+ segments = tf.reshape(segments, [-1])
+ i1 = tf.gather(segments, rng * shp[1] * shp[2] + y0 * shp[2] + x0)
+ i2 = tf.gather(segments, rng * shp[1] * shp[2] + y1 * shp[2] + x0)
+ i3 = tf.gather(segments, rng * shp[1] * shp[2] + y0 * shp[2] + x1)
+ i4 = tf.gather(segments, rng * shp[1] * shp[2] + y1 * shp[2] + x1)
+
+ depth_occluded = tf.less(tf.transpose(depth), thresh)
+ seg_occluded = True
+ for i in [i1, i2, i3, i4]:
+ i = tf.cast(i, tf.int32)
+ seg_occluded = tf.logical_and(seg_occluded, tf.not_equal(seg_id, i))
+
+ return tf.logical_or(depth_occluded, tf.transpose(seg_occluded))
+
+
+def get_camera_matrices(
+ cam_focal_length,
+ cam_positions,
+ cam_quaternions,
+ cam_sensor_width,
+ input_size,
+ num_frames=None,
+):
+ """Tf function that converts camera positions into projection matrices."""
+ intrinsics = []
+ matrix_world = []
+ assert cam_quaternions.shape[0] == num_frames
+ for frame_idx in range(cam_quaternions.shape[0]):
+ focal_length = tf.cast(cam_focal_length, tf.float32)
+ sensor_width = tf.cast(cam_sensor_width, tf.float32)
+ f_x = focal_length / sensor_width
+ f_y = focal_length / sensor_width * input_size[0] / input_size[1]
+ p_x = 0.5
+ p_y = 0.5
+ intrinsics.append(
+ tf.stack([
+ tf.stack([f_x, 0., -p_x]),
+ tf.stack([0., -f_y, -p_y]),
+ tf.stack([0., 0., -1.]),
+ ]))
+
+ position = cam_positions[frame_idx]
+ quat = cam_quaternions[frame_idx]
+ rotation_matrix = rotation_matrix_3d.from_quaternion(
+ tf.concat([quat[1:], quat[0:1]], axis=0))
+ transformation = tf.concat(
+ [rotation_matrix, position[:, tf.newaxis]],
+ axis=1,
+ )
+ transformation = tf.concat(
+ [transformation,
+ tf.constant([0.0, 0.0, 0.0, 1.0])[tf.newaxis, :]],
+ axis=0,
+ )
+ matrix_world.append(transformation)
+
+ return (
+ tf.cast(tf.stack(intrinsics), tf.float32),
+ tf.cast(tf.stack(matrix_world), tf.float32),
+ )
+
+
+def quat2rot(quats):
+ """Convert a list of quaternions to rotation matrices."""
+ rotation_matrices = []
+ for frame_idx in range(quats.shape[0]):
+ quat = quats[frame_idx]
+ rotation_matrix = rotation_matrix_3d.from_quaternion(
+ tf.concat([quat[1:], quat[0:1]], axis=0))
+ rotation_matrices.append(rotation_matrix)
+ return tf.cast(tf.stack(rotation_matrices), tf.float32)
+
+
+def rotate_surface_normals(
+ world_frame_normals,
+ point_3d,
+ cam_pos,
+ obj_rot_mats,
+ frame_for_query,
+):
+ """Points are occluded if the surface normal points away from the camera."""
+ query_obj_rot_mat = tf.gather(obj_rot_mats, frame_for_query)
+ obj_frame_normals = tf.einsum(
+ 'boi,bi->bo',
+ tf.linalg.inv(query_obj_rot_mat),
+ world_frame_normals,
+ )
+ world_frame_normals_frames = tf.einsum(
+ 'foi,bi->bfo',
+ obj_rot_mats,
+ obj_frame_normals,
+ )
+ cam_to_pt = point_3d - cam_pos[tf.newaxis, :, :]
+ dots = tf.reduce_sum(world_frame_normals_frames * cam_to_pt, axis=-1)
+ faces_away = dots > 0
+
+ # If the query point also faces away, it's probably a bug in the meshes, so
+ # ignore the result of the test.
+ faces_away_query = tf.reduce_sum(
+ tf.cast(faces_away, tf.int32)
+ * tf.one_hot(frame_for_query, tf.shape(faces_away)[1], dtype=tf.int32),
+ axis=1,
+ keepdims=True,
+ )
+ faces_away = tf.logical_and(faces_away, tf.logical_not(faces_away_query > 0))
+ return faces_away
+
+
+def single_object_reproject(
+ bbox_3d=None,
+ pt=None,
+ pt_segments=None,
+ camera=None,
+ cam_positions=None,
+ num_frames=None,
+ depth_map=None,
+ segments=None,
+ window=None,
+ input_size=None,
+ quat=None,
+ normals=None,
+ frame_for_pt=None,
+ trust_normals=None,
+):
+ """Reproject points for a single object.
+
+ Args:
+ bbox_3d: The object bounding box from Kubric. If none, assume it's
+ background.
+ pt: The set of points in 3D, with shape [num_points, 3]
+ pt_segments: The segment each point came from, with shape [num_points]
+ camera: Camera intrinsic parameters
+ cam_positions: Camera positions, with shape [num_frames, 3]
+ num_frames: Number of frames
+ depth_map: Depth map video for the camera
+ segments: Segmentation map video for the camera
+ window: the window inside which we're sampling points
+ input_size: [height, width] of the input images.
+ quat: Object quaternion [num_frames, 4]
+ normals: Point normals on the query frame [num_points, 3]
+ frame_for_pt: Integer frame where the query point came from [num_points]
+ trust_normals: Boolean flag for whether the surface normals for each query
+ are trustworthy [num_points]
+
+ Returns:
+ Position for each point, of shape [num_points, num_frames, 2], in pixel
+ coordinates, and an occlusion flag for each point, of shape
+ [num_points, num_frames]. These are with respect to the image frame, not the
+ window.
+
+ """
+ # Finally, reproject
+ reproj, depth_proj, world_pos = reproject(
+ pt,
+ camera,
+ cam_positions,
+ num_frames,
+ bbox=bbox_3d,
+ )
+
+ occluded = tf.less(reproj[:, :, 2], 0)
+ reproj = reproj[:, :, 0:2] * np.array(input_size[::-1])[np.newaxis,
+ np.newaxis, :]
+ occluded = tf.logical_or(
+ occluded,
+ estimate_occlusion_by_depth_and_segment(
+ depth_map[:, :, :, 0],
+ segments[:, :, :, 0],
+ tf.transpose(reproj[:, :, 0]),
+ tf.transpose(reproj[:, :, 1]),
+ num_frames,
+ depth_proj * .99,
+ pt_segments,
+ ),
+ )
+ obj_occ = occluded
+ obj_reproj = reproj
+
+ obj_occ = tf.logical_or(obj_occ, tf.less(obj_reproj[:, :, 1], window[0]))
+ obj_occ = tf.logical_or(obj_occ, tf.less(obj_reproj[:, :, 0], window[1]))
+ obj_occ = tf.logical_or(obj_occ, tf.greater(obj_reproj[:, :, 1], window[2]))
+ obj_occ = tf.logical_or(obj_occ, tf.greater(obj_reproj[:, :, 0], window[3]))
+
+ if quat is not None:
+ faces_away = rotate_surface_normals(
+ normals,
+ world_pos,
+ cam_positions,
+ quat2rot(quat),
+ frame_for_pt,
+ )
+ faces_away = tf.logical_and(faces_away, trust_normals)
+ else:
+ # world is convex; can't face away from cam.
+ faces_away = tf.zeros([tf.shape(pt)[0], num_frames], dtype=tf.bool)
+
+ return obj_reproj, tf.logical_or(faces_away, obj_occ)
+
+
+def get_num_to_sample(counts, max_seg_id, max_sampled_frac, tracks_to_sample):
+ """Computes the number of points to sample for each object.
+
+ Args:
+ counts: The number of points available per object. An int array of length
+ n, where n is the number of objects.
+ max_seg_id: The maximum number of segment id's in the video.
+ max_sampled_frac: The maximum fraction of points to sample from each
+ object, out of all points that lie on the sampling grid.
+ tracks_to_sample: Total number of tracks to sample per video.
+
+ Returns:
+ The number of points to sample for each object. An int array of length n.
+ """
+ seg_order = tf.argsort(counts)
+ sorted_counts = tf.gather(counts, seg_order)
+ initializer = (0, tracks_to_sample, 0)
+
+ def scan_fn(prev_output, count_seg):
+ index = prev_output[0]
+ remaining_needed = prev_output[1]
+ desired_frac = 1 / (tf.shape(seg_order)[0] - index)
+ want_to_sample = (
+ tf.cast(remaining_needed, tf.float32) *
+ tf.cast(desired_frac, tf.float32))
+ want_to_sample = tf.cast(tf.round(want_to_sample), tf.int32)
+ max_to_sample = (
+ tf.cast(count_seg, tf.float32) * tf.cast(max_sampled_frac, tf.float32))
+ max_to_sample = tf.cast(tf.round(max_to_sample), tf.int32)
+ num_to_sample = tf.minimum(want_to_sample, max_to_sample)
+
+ remaining_needed = remaining_needed - num_to_sample
+ return (index + 1, remaining_needed, num_to_sample)
+
+ # outputs 0 and 1 are just bookkeeping; output 2 is the actual number of
+ # points to sample per object.
+ res = tf.scan(scan_fn, sorted_counts, initializer)[2]
+ invert = tf.argsort(seg_order)
+ num_to_sample = tf.gather(res, invert)
+ num_to_sample = tf.concat(
+ [
+ num_to_sample,
+ tf.zeros([max_seg_id - tf.shape(num_to_sample)[0]], dtype=tf.int32),
+ ],
+ axis=0,
+ )
+ return num_to_sample
+
+
+# pylint: disable=cell-var-from-loop
+
+
+def track_points(
+ object_coordinates,
+ depth,
+ depth_range,
+ segmentations,
+ surface_normals,
+ bboxes_3d,
+ obj_quat,
+ cam_focal_length,
+ cam_positions,
+ cam_quaternions,
+ cam_sensor_width,
+ window,
+ tracks_to_sample=256,
+ sampling_stride=4,
+ max_seg_id=25,
+ max_sampled_frac=0.1,
+):
+ """Track points in 2D using Kubric data.
+
+ Args:
+ object_coordinates: Video of coordinates for each pixel in the object's
+ local coordinate frame. Shape [num_frames, height, width, 3]
+ depth: uint16 depth video from Kubric. Shape [num_frames, height, width]
+ depth_range: Values needed to normalize Kubric's int16 depth values into
+ metric depth.
+ segmentations: Integer object id for each pixel. Shape
+ [num_frames, height, width]
+ surface_normals: uint16 surface normal map. Shape
+ [num_frames, height, width, 3]
+ bboxes_3d: The set of all object bounding boxes from Kubric
+ obj_quat: Quaternion rotation for each object. Shape
+ [num_objects, num_frames, 4]
+ cam_focal_length: Camera focal length
+ cam_positions: Camera positions, with shape [num_frames, 3]
+ cam_quaternions: Camera orientations, with shape [num_frames, 4]
+ cam_sensor_width: Camera sensor width parameter
+ window: the window inside which we're sampling points. Integer valued
+ in the format [x_min, y_min, x_max, y_max], where min is inclusive and
+ max is exclusive.
+ tracks_to_sample: Total number of tracks to sample per video.
+ sampling_stride: For efficiency, query points are sampled from a random grid
+ of this stride.
+ max_seg_id: The maximum segment id in the video.
+ max_sampled_frac: The maximum fraction of points to sample from each
+ object, out of all points that lie on the sampling grid.
+
+ Returns:
+ A set of queries, randomly sampled from the video (with a bias toward
+ objects), of shape [num_points, 3]. Each point is [t, y, x], where
+ t is time. All points are in pixel/frame coordinates.
+ The trajectory for each query point, of shape [num_points, num_frames, 3].
+ Each point is [x, y]. Points are in pixel coordinates
+ Occlusion flag for each point, of shape [num_points, num_frames]. This is
+ a boolean, where True means the point is occluded.
+
+ """
+ chosen_points = []
+ all_reproj = []
+ all_occ = []
+
+ # Convert to metric depth
+
+ depth_range_f32 = tf.cast(depth_range, tf.float32)
+ depth_min = depth_range_f32[0]
+ depth_max = depth_range_f32[1]
+ depth_f32 = tf.cast(depth, tf.float32)
+ depth_map = depth_min + depth_f32 * (depth_max - depth_min) / 65535
+
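+ # Surface normals are stored as uint16; rescale from [0, 65535] to [-1, 1].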
+ surface_normal_map = surface_normals / 65535 * 2. - 1.
+
+ input_size = object_coordinates.shape.as_list()[1:3]
+ num_frames = object_coordinates.shape.as_list()[0]
+
+ # We first sample query points within the given window. That means first
+ # extracting the window from the segmentation tensor, because we want to have
+ # a bias toward moving objects.
+ # Note: for speed we sample points on a grid. The grid start position is
+ # randomized within the window.
+ start_vec = [
+ tf.random.uniform([], minval=0, maxval=sampling_stride, dtype=tf.int32)
+ for _ in range(3)
+ ]
+ start_vec[1] += window[0]
+ start_vec[2] += window[1]
+ end_vec = [num_frames, window[2], window[3]]
+
+ def extract_box(x):
+ x = x[start_vec[0]::sampling_stride, start_vec[1]:window[2]:sampling_stride,
+ start_vec[2]:window[3]:sampling_stride]
+ return x
+
+ segmentations_box = extract_box(segmentations)
+ object_coordinates_box = extract_box(object_coordinates)
+
+ # Next, get the number of points to sample from each object. First count
+ # how many points are available for each object.
+
+ cnt = tf.math.bincount(tf.cast(tf.reshape(segmentations_box, [-1]), tf.int32))
+ num_to_sample = get_num_to_sample(
+ cnt,
+ max_seg_id,
+ max_sampled_frac,
+ tracks_to_sample,
+ )
+ num_to_sample.set_shape([max_seg_id])
+ intrinsics, matrix_world = get_camera_matrices(
+ cam_focal_length,
+ cam_positions,
+ cam_quaternions,
+ cam_sensor_width,
+ input_size,
+ num_frames=num_frames,
+ )
+
+ # If the normal map is very rough, it's often because the normals come from a normal
+ # map rather than the mesh. These aren't trustworthy, and the normal test
+ # may fail (i.e. the normal is pointing away from the camera even though the
+ # point is still visible). So don't use the normal test when inferring
+ # occlusion.
+ trust_sn = True
+ sn_pad = tf.pad(surface_normal_map, [(0, 0), (1, 1), (1, 1), (0, 0)])
+ shp = surface_normal_map.shape
+ sum_thresh = 0
+ for i in [0, 2]:
+ for j in [0, 2]:
+ diff = sn_pad[:, i: shp[1] + i, j: shp[2] + j, :] - surface_normal_map
+ diff = tf.reduce_sum(tf.square(diff), axis=-1)
+ sum_thresh += tf.cast(diff > 0.05 * 0.05, tf.int32)
+ trust_sn = tf.logical_and(trust_sn, (sum_thresh <= 2))[..., tf.newaxis]
+ surface_normals_box = extract_box(surface_normal_map)
+ trust_sn_box = extract_box(trust_sn)
+
+ def get_camera(fr=None):
+ if fr is None:
+ return {'intrinsics': intrinsics, 'matrix_world': matrix_world}
+ return {'intrinsics': intrinsics[fr], 'matrix_world': matrix_world[fr]}
+
+ # Construct pixel coordinates for each pixel within the window.
+ window = tf.cast(window, tf.float32)
+ z, y, x = tf.meshgrid(
+ *[
+ tf.range(st, ed, sampling_stride)
+ for st, ed in zip(start_vec, end_vec)
+ ],
+ indexing='ij')
+ pix_coords = tf.reshape(tf.stack([z, y, x], axis=-1), [-1, 3])
+
+ for i in range(max_seg_id):
+ # sample points on object i in the first frame. obj_id is the position
+ # within the object_coordinates array, which is one lower than the value
+ # in the segmentation mask (0 in the segmentation mask is the background
+ # object, which has no bounding box).
+ obj_id = i - 1
+ mask = tf.equal(tf.reshape(segmentations_box, [-1]), i)
+ pt = tf.boolean_mask(tf.reshape(object_coordinates_box, [-1, 3]), mask)
+ normals = tf.boolean_mask(tf.reshape(surface_normals_box, [-1, 3]), mask)
+ trust_sn_mask = tf.boolean_mask(tf.reshape(trust_sn_box, [-1, 1]), mask)
+ idx = tf.cond(
+ tf.shape(pt)[0] > 0,
+ lambda: tf.multinomial( # pylint: disable=g-long-lambda
+ tf.zeros(tf.shape(pt)[0:1])[tf.newaxis, :],
+ tf.gather(num_to_sample, i))[0],
+ lambda: tf.zeros([0], dtype=tf.int64))
+ # note: pt_coords is pixel coordinates, not raster coordinates.
+ pt_coords = tf.gather(tf.boolean_mask(pix_coords, mask), idx)
+ normals = tf.gather(normals, idx)
+ trust_sn_gather = tf.gather(trust_sn_mask, idx)
+
+ pixel_to_raster = tf.constant([0.0, 0.5, 0.5])[tf.newaxis, :]
+
+ if obj_id == -1:
+ # For the background object, no bounding box is available. However,
+ # this doesn't move, so we use the depth map to backproject these points
+ # into 3D and use those positions throughout the video.
+ pt_3d = []
+ pt_coords_reorder = []
+ for fr in range(num_frames):
+ # We need to loop over frames because we need to use the correct depth
+ # map for each frame.
+ pt_coords_chunk = tf.boolean_mask(pt_coords,
+ tf.equal(pt_coords[:, 0], fr))
+ pt_coords_reorder.append(pt_coords_chunk)
+
+ pt_3d.append(
+ unproject(pt_coords_chunk[:, 1:], get_camera(fr), depth_map[fr]))
+ pt = tf.concat(pt_3d, axis=0)
+ chosen_points.append(
+ tf.cast(tf.concat(pt_coords_reorder, axis=0), tf.float32) +
+ pixel_to_raster)
+ bbox = None
+ quat = None
+ frame_for_pt = None
+ else:
+ # For any other object, we just use the point coordinates supplied by
+ # kubric.
+ pt = tf.gather(pt, idx)
+ pt = pt / np.iinfo(np.uint16).max - .5
+ chosen_points.append(tf.cast(pt_coords, tf.float32) + pixel_to_raster)
+ # if obj_id>num_objects, then we won't have a box. We also won't have
+ # points, so just use a dummy to prevent tf from crashing.
+ bbox = tf.cond(obj_id >= tf.shape(bboxes_3d)[0], lambda: bboxes_3d[0, :],
+ lambda: bboxes_3d[obj_id, :])
+ quat = tf.cond(obj_id >= tf.shape(obj_quat)[0], lambda: obj_quat[0, :],
+ lambda: obj_quat[obj_id, :])
+ frame_for_pt = pt_coords[..., 0]
+
+ # Finally, compute the reprojections for this particular object.
+ obj_reproj, obj_occ = tf.cond(
+ tf.shape(pt)[0] > 0,
+ functools.partial(
+ single_object_reproject,
+ bbox_3d=bbox,
+ pt=pt,
+ pt_segments=i,
+ camera=get_camera(),
+ cam_positions=cam_positions,
+ num_frames=num_frames,
+ depth_map=depth_map,
+ segments=segmentations,
+ window=window,
+ input_size=input_size,
+ quat=quat,
+ normals=normals,
+ frame_for_pt=frame_for_pt,
+ trust_normals=trust_sn_gather,
+ ),
+ lambda: # pylint: disable=g-long-lambda
+ (tf.zeros([0, num_frames, 2], dtype=tf.float32),
+ tf.zeros([0, num_frames], dtype=tf.bool)))
+ all_reproj.append(obj_reproj)
+ all_occ.append(obj_occ)
+
+ # Points are currently in pixel coordinates of the original video. We now
+ # convert them to coordinates within the window frame, and rescale to
+ # pixel coordinates. Note that this produces the pixel coordinates after
+ # the window gets cropped and rescaled to the full image size.
+ wd = tf.concat(
+ [np.array([0.0]), window[0:2],
+ np.array([num_frames]), window[2:4]],
+ axis=0)
+ wd = wd[tf.newaxis, tf.newaxis, :]
+ coord_multiplier = [num_frames, input_size[0], input_size[1]]
+ all_reproj = tf.concat(all_reproj, axis=0)
+ # We need to extract x,y, but the format of the window is [t1,y1,x1,t2,y2,x2]
+ window_size = wd[:, :, 5:3:-1] - wd[:, :, 2:0:-1]
+ window_top_left = wd[:, :, 2:0:-1]
+ all_reproj = (all_reproj - window_top_left) / window_size
+ all_reproj = all_reproj * coord_multiplier[2:0:-1]
+ all_occ = tf.concat(all_occ, axis=0)
+
+ # chosen_points is [num_points, (z,y,x)]
+ chosen_points = tf.concat(chosen_points, axis=0)
+
+ chosen_points = tf.cast(chosen_points, tf.float32)
+
+ # renormalize so the box corners are at [-1,1]
+ chosen_points = (chosen_points - wd[:, 0, :3]) / (wd[:, 0, 3:] - wd[:, 0, :3])
+ chosen_points = chosen_points * coord_multiplier
+ # Note: all_reproj is in (x,y) format, but chosen_points is in (z,y,x) format
+
+ return tf.cast(chosen_points, tf.float32), tf.cast(all_reproj,
+ tf.float32), all_occ
+
+
+def _get_distorted_bounding_box(
+ jpeg_shape,
+ bbox,
+ min_object_covered,
+ aspect_ratio_range,
+ area_range,
+ max_attempts,
+):
+ """Sample a crop window to be used for cropping."""
+ bbox_begin, bbox_size, _ = tf.image.sample_distorted_bounding_box(
+ jpeg_shape,
+ bounding_boxes=bbox,
+ min_object_covered=min_object_covered,
+ aspect_ratio_range=aspect_ratio_range,
+ area_range=area_range,
+ max_attempts=max_attempts,
+ use_image_if_no_bounding_boxes=True)
+
+ # Crop the image to the specified bounding box.
+ offset_y, offset_x, _ = tf.unstack(bbox_begin)
+ target_height, target_width, _ = tf.unstack(bbox_size)
+ crop_window = tf.stack(
+ [offset_y, offset_x, offset_y + target_height, offset_x + target_width])
+ return crop_window
+
+
+def add_tracks(data,
+ train_size=(256, 256),
+ vflip=False,
+ random_crop=True,
+ tracks_to_sample=256,
+ sampling_stride=4,
+ max_seg_id=25,
+ max_sampled_frac=0.1):
+ """Track points in 2D using Kubric data.
+
+ Args:
+ data: Kubric data, including RGB/depth/object coordinate/segmentation
+ videos and camera parameters.
+ train_size: Cropped output will be at this resolution. Ignored if
+ random_crop is False.
+ vflip: whether to vertically flip images and tracks (to test generalization)
+ random_crop: Whether to randomly crop videos
+ tracks_to_sample: Total number of tracks to sample per video.
+ sampling_stride: For efficiency, query points are sampled from a random grid
+ of this stride.
+ max_seg_id: The maximum segment id in the video.
+ max_sampled_frac: The maximum fraction of points to sample from each
+ object, out of all points that lie on the sampling grid.
+
+ Returns:
+ A dict with the following keys:
+ query_points:
+ A set of queries, randomly sampled from the video (with a bias toward
+ objects), of shape [num_points, 3]. Each point is [t, y, x], where
+ t is time. Points are in pixel/frame coordinates, i.e. within
+ [num_frames, height, width].
+ target_points:
+ The trajectory for each query point, of shape [num_points, num_frames, 3].
+ Each point is [x, y]. Points are in pixel/frame coordinates.
+ occlusion:
+ Occlusion flag for each point, of shape [num_points, num_frames]. This is
+ a boolean, where True means the point is occluded.
+ video:
+ The cropped video, normalized into the range [-1, 1]
+
+ """
+ shp = data['video'].shape.as_list()
+ num_frames = shp[0]
+ if any([s % sampling_stride != 0 for s in shp[:-1]]):
+ raise ValueError('All video dims must be a multiple of sampling_stride.')
+
+ bbox = tf.constant([0.0, 0.0, 1.0, 1.0], dtype=tf.float32, shape=[1, 1, 4])
+ min_area = 0.3
+ max_area = 1.0
+ min_aspect_ratio = 0.5
+ max_aspect_ratio = 2.0
+ if random_crop:
+ crop_window = _get_distorted_bounding_box(
+ jpeg_shape=shp[1:4],
+ bbox=bbox,
+ min_object_covered=min_area,
+ aspect_ratio_range=(min_aspect_ratio, max_aspect_ratio),
+ area_range=(min_area, max_area),
+ max_attempts=20)
+ else:
+ crop_window = tf.constant([0, 0, shp[1], shp[2]],
+ dtype=tf.int32,
+ shape=[4])
+
+ query_points, target_points, occluded = track_points(
+ data['object_coordinates'], data['depth'],
+ data['metadata']['depth_range'], data['segmentations'],
+ data['normal'],
+ data['instances']['bboxes_3d'], data['instances']['quaternions'],
+ data['camera']['focal_length'],
+ data['camera']['positions'], data['camera']['quaternions'],
+ data['camera']['sensor_width'], crop_window, tracks_to_sample,
+ sampling_stride, max_seg_id, max_sampled_frac)
+ video = data['video']
+
+ shp = video.shape.as_list()
+ query_points.set_shape([tracks_to_sample, 3])
+ target_points.set_shape([tracks_to_sample, num_frames, 2])
+ occluded.set_shape([tracks_to_sample, num_frames])
+
+ # Crop the video to the sampled window, in a way that matches the coordinate
+ # frame produced by the track_points function.
+ crop_window = crop_window / (
+ np.array(shp[1:3] + shp[1:3]).astype(np.float32) - 1)
+ crop_window = tf.tile(crop_window[tf.newaxis, :], [num_frames, 1])
+ video = tf.image.crop_and_resize(
+ video,
+ tf.cast(crop_window, tf.float32),
+ tf.range(num_frames),
+ train_size,
+ )
+ if vflip:
+ video = video[:, ::-1, :, :]
+ target_points = target_points * np.array([1, -1])
+ query_points = query_points * np.array([1, -1, 1])
+ res = {
+ 'query_points': query_points,
+ 'target_points': target_points,
+ 'occluded': occluded,
+ 'video': video / (255. / 2.) - 1.,
+ }
+ return res
+
+
+def create_point_tracking_dataset(
+ data_dir="gs://kubric-public/tfds",
+ train_size=(512, 512),
+ shuffle=True,
+ shuffle_buffer_size=None,
+ split='train',
+ batch_dims=tuple(),
+ repeat=True,
+ vflip=False,
+ random_crop=True,
+ tracks_to_sample=2048,
+ sampling_stride=4,
+ max_seg_id=25,
+ max_sampled_frac=0.1,
+ num_parallel_point_extraction_calls=16,
+ **kwargs):
+ """Construct a dataset for point tracking using Kubric.
+
+ Args:
+ train_size: Tuple of 2 ints. Cropped output will be at this resolution
+ shuffle_buffer_size: Int. Size of the shuffle buffer
+ split: Which split to construct from Kubric. Can be 'train' or
+ 'validation'.
+ batch_dims: Sequence of ints. Add multiple examples into a batch of this
+ shape.
+ repeat: Bool. whether to repeat the dataset.
+ vflip: Bool. whether to vertically flip the dataset to test generalization.
+ random_crop: Bool. whether to randomly crop videos
+ tracks_to_sample: Int. Total number of tracks to sample per video.
+ sampling_stride: Int. For efficiency, query points are sampled from a
+ random grid of this stride.
+ max_seg_id: Int. The maximum segment id in the video. Note the size of
+ the TensorFlow graph is proportional to this number, so prefer small values.
+ max_sampled_frac: Float. The maximum fraction of points to sample from each
+ object, out of all points that lie on the sampling grid.
+ num_parallel_point_extraction_calls: Int. The num_parallel_calls for the
+ map function for point extraction.
+ **kwargs: additional args to pass to tfds.load.
+
+ Returns:
+ The dataset generator.
+ """
+ ds = tfds.load(
+ 'movi_f/512x512',
+ data_dir=data_dir,
+ shuffle_files=shuffle,
+ **kwargs)
+
+ ds = ds[split]
+ if repeat:
+ ds = ds.repeat()
+ ds = ds.map(
+ functools.partial(
+ add_tracks,
+ train_size=train_size,
+ vflip=vflip,
+ random_crop=random_crop,
+ tracks_to_sample=tracks_to_sample,
+ sampling_stride=sampling_stride,
+ max_seg_id=max_seg_id,
+ max_sampled_frac=max_sampled_frac),
+ num_parallel_calls=num_parallel_point_extraction_calls)
+ if shuffle_buffer_size is not None:
+ ds = ds.shuffle(shuffle_buffer_size)
+
+ for bs in batch_dims[::-1]:
+ ds = ds.batch(bs)
+
+ return ds
+
+
+def plot_tracks(rgb, points, occluded, trackgroup=None):
+ """Plot tracks with matplotlib."""
+ disp = []
+ cmap = plt.cm.hsv
+
+ z_list = np.arange(
+ points.shape[0]) if trackgroup is None else np.array(trackgroup)
+ # random permutation of the colors so nearby points in the list can get
+ # different colors
+ z_list = np.random.permutation(np.max(z_list) + 1)[z_list]
+ colors = cmap(z_list / (np.max(z_list) + 1))
+ figure_dpi = 64
+
+ for i in range(rgb.shape[0]):
+ fig = plt.figure(
+ figsize=(256 / figure_dpi, 256 / figure_dpi),
+ dpi=figure_dpi,
+ frameon=False,
+ facecolor='w')
+ ax = fig.add_subplot()
+ ax.axis('off')
+ ax.imshow(rgb[i])
+
+ valid = points[:, i, 0] > 0
+ valid = np.logical_and(valid, points[:, i, 0] < rgb.shape[2] - 1)
+ valid = np.logical_and(valid, points[:, i, 1] > 0)
+ valid = np.logical_and(valid, points[:, i, 1] < rgb.shape[1] - 1)
+
+ colalpha = np.concatenate([colors[:, :-1], 1 - occluded[:, i:i + 1]],
+ axis=1)
+ # Note: matplotlib uses pixel coordinates, not raster.
+ plt.scatter(
+ points[valid, i, 0] - 0.5,
+ points[valid, i, 1] - 0.5,
+ s=3,
+ c=colalpha[valid],
+ )
+
+ occ2 = occluded[:, i:i + 1]
+
+ colalpha = np.concatenate([colors[:, :-1], occ2], axis=1)
+
+ plt.scatter(
+ points[valid, i, 0],
+ points[valid, i, 1],
+ s=20,
+ facecolors='none',
+ edgecolors=colalpha[valid],
+ )
+
+ plt.subplots_adjust(top=1, bottom=0, right=1, left=0, hspace=0, wspace=0)
+ plt.margins(0, 0)
+ fig.canvas.draw()
+ width, height = fig.get_size_inches() * fig.get_dpi()
+ img = np.frombuffer(
+ fig.canvas.tostring_rgb(),
+ dtype='uint8').reshape(int(height), int(width), 3)
+ disp.append(np.copy(img))
+ plt.close(fig)
+
+ return np.stack(disp, axis=0)
+
+
+def main():
+ ds = tfds.as_numpy(create_point_tracking_dataset(shuffle_buffer_size=None))
+ for i, data in enumerate(ds):
+ disp = plot_tracks(data['video'] * .5 + .5, data['target_points'],
+ data['occluded'])
+ media.write_video(f'{i}.mp4', disp, fps=10)
+ if i > 10:
+ break
+
+
+if __name__ == '__main__':
+ main()
diff --git a/data/dot_single_video/dot/data/tap_dataset.py b/data/dot_single_video/dot/data/tap_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..9f6f3188128615125439a607b9f1a85706a8d8a2
--- /dev/null
+++ b/data/dot_single_video/dot/data/tap_dataset.py
@@ -0,0 +1,230 @@
+# Copyright 2023 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import io
+import glob
+import torch
+import pickle as pkl
+import numpy as np
+import os.path as osp
+import mediapy as media
+from torch.utils.data import Dataset, DataLoader
+
+from PIL import Image
+from typing import Mapping, Tuple
+
+
+def resize_video(video: np.ndarray, output_size: Tuple[int, int]) -> np.ndarray:
+ """Resize a video to output_size."""
+ # If you have a GPU, consider replacing this with a GPU-enabled resize op,
+ # such as a jitted jax.image.resize. It will make things faster.
+ return media.resize_video(video, output_size)
+
+
+def sample_queries_first(
+ target_occluded: np.ndarray,
+ target_points: np.ndarray,
+ frames: np.ndarray,
+) -> Mapping[str, np.ndarray]:
+ """Package a set of frames and tracks for use in TAPNet evaluations.
+ Given a set of frames and tracks with no query points, use the first
+ visible point in each track as the query.
+ Args:
+ target_occluded: Boolean occlusion flag, of shape [n_tracks, n_frames],
+ where True indicates occluded.
+ target_points: Position, of shape [n_tracks, n_frames, 2], where each point
+ is [x,y] scaled between 0 and 1.
+ frames: Video tensor, of shape [n_frames, height, width, 3]. Scaled between
+ -1 and 1.
+ Returns:
+ A dict with the keys:
+ video: Video tensor of shape [1, n_frames, height, width, 3]
+ query_points: Query points of shape [1, n_queries, 3] where
+ each point is [t, y, x] scaled to the range [-1, 1]
+ target_points: Target points of shape [1, n_queries, n_frames, 2] where
+ each point is [x, y] scaled to the range [-1, 1]
+ """
+ valid = np.sum(~target_occluded, axis=1) > 0
+ target_points = target_points[valid, :]
+ target_occluded = target_occluded[valid, :]
+
+ query_points = []
+ for i in range(target_points.shape[0]):
+ index = np.where(target_occluded[i] == 0)[0][0]
+ x, y = target_points[i, index, 0], target_points[i, index, 1]
+ query_points.append(np.array([index, y, x])) # [t, y, x]
+ query_points = np.stack(query_points, axis=0)
+
+ return {
+ "video": frames[np.newaxis, ...],
+ "query_points": query_points[np.newaxis, ...],
+ "target_points": target_points[np.newaxis, ...],
+ "occluded": target_occluded[np.newaxis, ...],
+ }
+
+
+def sample_queries_strided(
+ target_occluded: np.ndarray,
+ target_points: np.ndarray,
+ frames: np.ndarray,
+ query_stride: int = 5,
+) -> Mapping[str, np.ndarray]:
+ """Package a set of frames and tracks for use in TAPNet evaluations.
+
+ Given a set of frames and tracks with no query points, sample queries
+ strided every query_stride frames, ignoring points that are not visible
+ at the selected frames.
+
+ Args:
+ target_occluded: Boolean occlusion flag, of shape [n_tracks, n_frames],
+ where True indicates occluded.
+ target_points: Position, of shape [n_tracks, n_frames, 2], where each point
+ is [x,y] scaled between 0 and 1.
+ frames: Video tensor, of shape [n_frames, height, width, 3]. Scaled between
+ -1 and 1.
+ query_stride: When sampling query points, search for un-occluded points
+ every query_stride frames and convert each one into a query.
+
+ Returns:
+ A dict with the keys:
+ video: Video tensor of shape [1, n_frames, height, width, 3]. The video
+ has floats scaled to the range [-1, 1].
+ query_points: Query points of shape [1, n_queries, 3] where
+ each point is [t, y, x] scaled to the range [-1, 1].
+ target_points: Target points of shape [1, n_queries, n_frames, 2] where
+ each point is [x, y] scaled to the range [-1, 1].
+ trackgroup: Index of the original track that each query point was
+ sampled from. This is useful for visualization.
+ """
+ tracks = []
+ occs = []
+ queries = []
+ trackgroups = []
+ total = 0
+ trackgroup = np.arange(target_occluded.shape[0])
+ for i in range(0, target_occluded.shape[1], query_stride):
+ mask = target_occluded[:, i] == 0
+ query = np.stack(
+ [
+ i * np.ones(target_occluded.shape[0:1]),
+ target_points[:, i, 1],
+ target_points[:, i, 0],
+ ],
+ axis=-1,
+ )
+ queries.append(query[mask])
+ tracks.append(target_points[mask])
+ occs.append(target_occluded[mask])
+ trackgroups.append(trackgroup[mask])
+ total += np.array(np.sum(target_occluded[:, i] == 0))
+
+ return {
+ "video": frames[np.newaxis, ...],
+ "query_points": np.concatenate(queries, axis=0)[np.newaxis, ...],
+ "target_points": np.concatenate(tracks, axis=0)[np.newaxis, ...],
+ "occluded": np.concatenate(occs, axis=0)[np.newaxis, ...],
+ "trackgroup": np.concatenate(trackgroups, axis=0)[np.newaxis, ...],
+ }
+
+
+class TapVid(Dataset):
+ def __init__(
+ self,
+ data_root,
+ split="davis",
+ query_mode="first",
+ resize_to_256=True
+ ):
+ self.split = split
+ self.resize_to_256 = resize_to_256
+ self.query_mode = query_mode
+ if self.split == "kinetics":
+ all_paths = glob.glob(osp.join(data_root, "*_of_0010.pkl"))
+ points_dataset = []
+ for pickle_path in all_paths:
+ with open(pickle_path, "rb") as f:
+ data = pkl.load(f)
+ points_dataset = points_dataset + data
+ self.points_dataset = points_dataset
+ else:
+ with open(data_root, "rb") as f:
+ self.points_dataset = pkl.load(f)
+ if self.split == "davis":
+ self.video_names = list(self.points_dataset.keys())
+ print("found %d unique videos in %s" % (len(self.points_dataset), data_root))
+
+ def __getitem__(self, index):
+ if self.split == "davis":
+ video_name = self.video_names[index]
+ else:
+ video_name = index
+ video = self.points_dataset[video_name]
+ frames = video["video"]
+
+ if isinstance(frames[0], bytes):
+ # TAP-Vid frames are stored as JPEG bytes rather than `np.ndarray`s.
+ def decode(frame):
+ byteio = io.BytesIO(frame)
+ img = Image.open(byteio)
+ return np.array(img)
+
+ frames = np.array([decode(frame) for frame in frames])
+
+ target_points = self.points_dataset[video_name]["points"]
+ if self.resize_to_256:
+ frames = resize_video(frames, [256, 256])
+ target_points *= np.array([256, 256])
+ else:
+ target_points *= np.array([frames.shape[2], frames.shape[1]])
+
+ target_occ = self.points_dataset[video_name]["occluded"]
+ if self.query_mode == "first":
+ converted = sample_queries_first(target_occ, target_points, frames)
+ else:
+ converted = sample_queries_strided(target_occ, target_points, frames)
+ assert converted["target_points"].shape[1] == converted["query_points"].shape[1]
+
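+ # Convert to torch: trajs is [T, N, 2] in pixel coordinates, visibles is [T, N]; tracks stacks them into [T, N, 3] with visibility as the last channel.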
+ trajs = torch.from_numpy(converted["target_points"])[0].permute(1, 0, 2).float() # T, N, D
+
+ rgbs = torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.
+ visibles = torch.logical_not(torch.from_numpy(converted["occluded"]))[0].permute(1, 0) # T, N
+ query_points = torch.from_numpy(converted["query_points"])[0].float() # N, 3 as [t, y, x]
+ tracks = torch.cat([trajs, visibles[..., None]], dim=-1)
+
+ data = {
+ "video": rgbs,
+ "query_points": query_points,
+ "tracks": tracks
+ }
+
+ return data
+
+ def __len__(self):
+ return len(self.points_dataset)
+
+
+def create_point_tracking_dataset(args):
+ data_root = osp.join(args.data_root, f"tapvid_{args.split}")
+ if args.split in ["davis", "rgb_stacking"]:
+ data_root = osp.join(data_root, f"tapvid_{args.split}.pkl")
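+ # davis / rgb_stacking ship as a single pickle file; kinetics is sharded into *_of_0010.pkl chunks handled inside TapVid.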
+ dataset = TapVid(data_root, args.split, args.query_mode)
+ dataloader = DataLoader(
+ dataset,
+ batch_size=args.batch_size,
+ pin_memory=True,
+ shuffle=False,
+ num_workers=0,
+ drop_last=False,
+ )
+ return dataloader
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/__init__.py b/data/dot_single_video/dot/models/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..258216371af34cbf47fbbe36b49b566f858ecbd7
--- /dev/null
+++ b/data/dot_single_video/dot/models/__init__.py
@@ -0,0 +1,42 @@
+from .dense_optical_tracking import DenseOpticalTracker
+from .optical_flow import OpticalFlow
+from .point_tracking import PointTracker
+
+def create_model(args):
+ if args.model == "dot":
+ model = DenseOpticalTracker(
+ height=args.height,
+ width=args.width,
+ tracker_config=args.tracker_config,
+ tracker_path=args.tracker_path,
+ estimator_config=args.estimator_config,
+ estimator_path=args.estimator_path,
+ refiner_config=args.refiner_config,
+ refiner_path=args.refiner_path,
+ )
+ elif args.model == "pt":
+ model = PointTracker(
+ height=args.height,
+ width=args.width,
+ tracker_config=args.tracker_config,
+ tracker_path=args.tracker_path,
+ estimator_config=args.estimator_config,
+ estimator_path=args.estimator_path,
+ )
+ elif args.model == "ofe":
+ model = OpticalFlow(
+ height=args.height,
+ width=args.width,
+ config=args.estimator_config,
+ load_path=args.estimator_path,
+ )
+ elif args.model == "ofr":
+ model = OpticalFlow(
+ height=args.height,
+ width=args.width,
+ config=args.refiner_config,
+ load_path=args.refiner_path,
+ )
+ else:
+ raise ValueError(f"Unknown model name {args.model}")
+ return model
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/dense_optical_tracking.py b/data/dot_single_video/dot/models/dense_optical_tracking.py
new file mode 100644
index 0000000000000000000000000000000000000000..7fce799ea0e2ea45830be36ee9809165596aade3
--- /dev/null
+++ b/data/dot_single_video/dot/models/dense_optical_tracking.py
@@ -0,0 +1,241 @@
+import torch
+from torch import nn
+import torch.nn.functional as F
+from tqdm import tqdm
+from einops import rearrange, repeat
+
+from .optical_flow import OpticalFlow
+from .point_tracking import PointTracker
+from dot.utils.torch import get_grid
+
+
+class DenseOpticalTracker(nn.Module):
+ def __init__(self,
+ height=512,
+ width=512,
+ tracker_config="configs/cotracker2_patch_4_wind_8.json",
+ tracker_path="checkpoints/movi_f_cotracker2_patch_4_wind_8.pth",
+ estimator_config="configs/raft_patch_8.json",
+ estimator_path="checkpoints/cvo_raft_patch_8.pth",
+ refiner_config="configs/raft_patch_4_alpha.json",
+ refiner_path="checkpoints/movi_f_raft_patch_4_alpha.pth"):
+ super().__init__()
+ self.point_tracker = PointTracker(height, width, tracker_config, tracker_path, estimator_config, estimator_path)
+ self.optical_flow_refiner = OpticalFlow(height, width, refiner_config, refiner_path)
+ self.name = self.point_tracker.name + "_" + self.optical_flow_refiner.name
+ self.resolution = [height, width]
+
+ def forward(self, data, mode, **kwargs):
+ if mode == "flow_from_last_to_first_frame":
+ return self.get_flow_from_last_to_first_frame(data, **kwargs)
+ elif mode == "tracks_for_queries":
+ return self.get_tracks_for_queries(data, **kwargs)
+ elif mode == "tracks_from_first_to_every_other_frame":
+ return self.get_tracks_from_first_to_every_other_frame(data, **kwargs)
+ elif mode == "tracks_from_every_cell_in_every_frame":
+ return self.get_tracks_from_every_cell_in_every_frame(data, **kwargs)
+ else:
+ raise ValueError(f"Unknown mode {mode}")
+
+ def get_flow_from_last_to_first_frame(self, data, **kwargs):
+ B, T, C, h, w = data["video"].shape
+ init = self.point_tracker(data, mode="tracks_at_motion_boundaries", **kwargs)["tracks"]
+ init = torch.stack([init[..., 0] / (w - 1), init[..., 1] / (h - 1), init[..., 2]], dim=-1)
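+ # init holds coarse tracks sampled at motion boundaries, normalized to [0, 1]; the refiner then predicts dense flow from the last frame to the first given these sparse correspondences.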
+ data = {
+ "src_frame": data["video"][:, -1],
+ "tgt_frame": data["video"][:, 0],
+ "src_points": init[:, -1],
+ "tgt_points": init[:, 0]
+ }
+ pred = self.optical_flow_refiner(data, mode="flow_with_tracks_init", **kwargs)
+ pred["src_points"] = data["src_points"]
+ pred["tgt_points"] = data["tgt_points"]
+ return pred
+
+ def get_tracks_for_queries(self, data, **kwargs):
+ time_steps = data["video"].size(1)
+ query_points = data["query_points"]
+ video = data["video"]
+ S = query_points.size(1)
+ B, T, C, h, w = video.shape
+ H, W = self.resolution
+
+ init = self.point_tracker(data, mode="tracks_at_motion_boundaries", **kwargs)["tracks"]
+ init = torch.stack([init[..., 0] / (w - 1), init[..., 1] / (h - 1), init[..., 2]], dim=-1)
+
+ if h != H or w != W:
+ video = video.reshape(B * T, C, h, w)
+ video = F.interpolate(video, size=(H, W), mode="bilinear")
+ video = video.reshape(B, T, C, H, W)
+
+ feats = self.optical_flow_refiner({"video": video}, mode="feats", **kwargs)["feats"]
+
+ grid = get_grid(H, W, device=video.device)
+ src_steps = [int(v) for v in torch.unique(query_points[..., 0])]
+ tracks = torch.zeros(B, T, S, 3, device=video.device)
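+ # For every source frame that hosts a query, refine dense flow to all other frames, then read off the dense tracks at the (normalized) query locations.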
+ for src_step in tqdm(src_steps, desc="Refine source step", leave=False):
+ src_points = init[:, src_step]
+ src_feats = feats[:, src_step]
+ tracks_from_src = []
+ for tgt_step in tqdm(range(time_steps), desc="Refine target step", leave=False):
+ if src_step == tgt_step:
+ flow = torch.zeros(B, H, W, 2, device=video.device)
+ alpha = torch.ones(B, H, W, device=video.device)
+ else:
+ tgt_points = init[:, tgt_step]
+ tgt_feats = feats[:, tgt_step]
+ data = {
+ "src_feats": src_feats,
+ "tgt_feats": tgt_feats,
+ "src_points": src_points,
+ "tgt_points": tgt_points
+ }
+ pred = self.optical_flow_refiner(data, mode="flow_with_tracks_init", **kwargs)
+ flow, alpha = pred["flow"], pred["alpha"]
+ flow[..., 0] = flow[..., 0] / (W - 1)
+ flow[..., 1] = flow[..., 1] / (H - 1)
+ tracks_from_src.append(torch.cat([flow + grid, alpha[..., None]], dim=-1))
+ tracks_from_src = torch.stack(tracks_from_src, dim=1)
+ for b in range(B):
+ cur = query_points[b, :, 0] == src_step
+ if torch.any(cur):
+ cur_points = query_points[b, cur]
+ cur_x = cur_points[..., 2] / (w - 1)
+ cur_y = cur_points[..., 1] / (h - 1)
+ cur_tracks = dense_to_sparse_tracks(cur_x, cur_y, tracks_from_src[b], h, w)
+ tracks[b, :, cur] = cur_tracks
+ return {"tracks": tracks}
+
+ def get_tracks_from_first_to_every_other_frame(self, data, return_flow=False, **kwargs):
+ video = data["video"]
+ B, T, C, h, w = video.shape
+ H, W = self.resolution
+
+ if h != H or w != W:
+ video = video.reshape(B * T, C, h, w)
+ video = F.interpolate(video, size=(H, W), mode="bilinear")
+ video = video.reshape(B, T, C, H, W)
+
+ init = self.point_tracker(data, mode="tracks_at_motion_boundaries", **kwargs)["tracks"]
+ init = torch.stack([init[..., 0] / (w - 1), init[..., 1] / (h - 1), init[..., 2]], dim=-1)
+
+ grid = get_grid(H, W, device=video.device)
+ grid[..., 0] *= (W - 1)
+ grid[..., 1] *= (H - 1)
+ src_step = 0
+ src_points = init[:, src_step]
+ src_frame = video[:, src_step]
+ tracks = []
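+ # Refine flow from frame 0 to every frame; with return_flow=False the source pixel grid is added so outputs are absolute track positions instead of displacements.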
+ for tgt_step in tqdm(range(T), desc="Refine target step", leave=False):
+ if src_step == tgt_step:
+ flow = torch.zeros(B, H, W, 2, device=video.device)
+ alpha = torch.ones(B, H, W, device=video.device)
+ else:
+ tgt_points = init[:, tgt_step]
+ tgt_frame = video[:, tgt_step]
+ data = {
+ "src_frame": src_frame,
+ "tgt_frame": tgt_frame,
+ "src_points": src_points,
+ "tgt_points": tgt_points
+ }
+ pred = self.optical_flow_refiner(data, mode="flow_with_tracks_init", **kwargs)
+ flow, alpha = pred["flow"], pred["alpha"]
+ if return_flow:
+ tracks.append(torch.cat([flow, alpha[..., None]], dim=-1))
+ else:
+ tracks.append(torch.cat([flow + grid, alpha[..., None]], dim=-1)) # flow holds the 1->i pixel displacements, grid is the first frame's original pixel coordinates, alpha is the confidence
+ tracks = torch.stack(tracks, dim=1)
+ return {"tracks": tracks}
+
+ def get_tracks_from_every_cell_in_every_frame(self, data, cell_size=1, cell_time_steps=20, **kwargs):
+ video = data["video"]
+ B, T, C, h, w = video.shape
+ H, W = self.resolution
+ ch, cw, ct = h // cell_size, w // cell_size, min(T, cell_time_steps)
+
+ if h != H or w != W:
+ video = video.reshape(B * T, C, h, w)
+ video = F.interpolate(video, size=(H, W), mode="bilinear")
+ video = video.reshape(B, T, C, H, W)
+
+ init = self.point_tracker(data, mode="tracks_at_motion_boundaries", **kwargs)["tracks"]
+ init = torch.stack([init[..., 0] / (w - 1), init[..., 1] / (h - 1), init[..., 2]], dim=-1)
+
+ feats = self.optical_flow_refiner({"video": video}, mode="feats", **kwargs)["feats"]
+
+ grid = get_grid(H, W, device=video.device)
+ visited_cells = torch.zeros(B, T, ch, cw, device=video.device)
+ src_steps = torch.linspace(0, T - 1, T // ct).long()
+ tracks = [[] for _ in range(B)]
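+ # Iterate over source frames spaced ct apart, skipping frames whose spatial cells have already been covered by earlier tracks.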
+ for k, src_step in enumerate(tqdm(src_steps, desc="Refine source step", leave=False)):
+ if visited_cells[:, src_step].all():
+ continue
+ src_points = init[:, src_step]
+ src_feats = feats[:, src_step]
+ tracks_from_src = []
+ for tgt_step in tqdm(range(T), desc="Refine target step", leave=False):
+ if src_step == tgt_step:
+ flow = torch.zeros(B, H, W, 2, device=video.device)
+ alpha = torch.ones(B, H, W, device=video.device)
+ else:
+ tgt_points = init[:, tgt_step]
+ tgt_feats = feats[:, tgt_step]
+ data = {
+ "src_feats": src_feats,
+ "tgt_feats": tgt_feats,
+ "src_points": src_points,
+ "tgt_points": tgt_points
+ }
+ pred = self.optical_flow_refiner(data, mode="flow_with_tracks_init", **kwargs)
+ flow, alpha = pred["flow"], pred["alpha"]
+ flow[..., 0] = flow[..., 0] / (W - 1)
+ flow[..., 1] = flow[..., 1] / (H - 1)
+ tracks_from_src.append(torch.cat([flow + grid, alpha[..., None]], dim=-1))
+ tracks_from_src = torch.stack(tracks_from_src, dim=1)
+ for b in range(B):
+ src_cell = visited_cells[b, src_step]
+ if src_cell.all():
+ continue
+ cur_y, cur_x = (1 - src_cell).nonzero(as_tuple=True)
+ cur_x = (cur_x + 0.5) / cw
+ cur_y = (cur_y + 0.5) / ch
+ cur_tracks = dense_to_sparse_tracks(cur_x, cur_y, tracks_from_src[b], h, w)
+ visited_cells[b] = update_visited(visited_cells[b], cur_tracks, h, w, ch, cw)
+ tracks[b].append(cur_tracks)
+ tracks = [torch.cat(t, dim=1) for t in tracks]
+ return {"tracks": tracks}
+
+def dense_to_sparse_tracks(x, y, tracks, height, width):
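+ # Sample the dense track volume [T, H, W, 3] at normalized query locations (x, y in [0, 1]) via grid_sample; returns per-query tracks in pixel coordinates with a binarized visibility channel.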
+ h, w = height, width
+ T = tracks.size(0)
+ grid = torch.stack([x, y], dim=-1) * 2 - 1
+ grid = repeat(grid, "s c -> t s r c", t=T, r=1)
+ tracks = rearrange(tracks, "t h w c -> t c h w")
+ tracks = F.grid_sample(tracks, grid, align_corners=True, mode="bilinear")
+ tracks = rearrange(tracks[..., 0], "t c s -> t s c")
+ tracks[..., 0] = tracks[..., 0] * (w - 1)
+ tracks[..., 1] = tracks[..., 1] * (h - 1)
+ tracks[..., 2] = (tracks[..., 2] > 0).float()
+ return tracks
+
+def update_visited(visited_cells, tracks, height, width, cell_height, cell_width):
+ T = tracks.size(0)
+ h, w = height, width
+ ch, cw = cell_height, cell_width
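+ # Mark a cell as visited whenever a track lands inside it while visible, so later source frames only spawn tracks for uncovered cells.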
+ for tgt_step in range(T):
+ tgt_points = tracks[tgt_step]
+ tgt_vis = tgt_points[:, 2]
+ visited = tgt_points[tgt_vis.bool()]
+ if len(visited) > 0:
+ visited_x, visited_y = visited[:, 0], visited[:, 1]
+ visited_x = (visited_x / (w - 1) * cw).floor().long()
+ visited_y = (visited_y / (h - 1) * ch).floor().long()
+ valid = (visited_x >= 0) & (visited_x < cw) & (visited_y >= 0) & (visited_y < ch)
+ visited_x = visited_x[valid]
+ visited_y = visited_y[valid]
+ tgt_cell = visited_cells[tgt_step].view(-1)
+ tgt_cell[visited_y * cw + visited_x] = 1.
+ tgt_cell = tgt_cell.view_as(visited_cells[tgt_step])
+ visited_cells[tgt_step] = tgt_cell
+ return visited_cells
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/interpolation.py b/data/dot_single_video/dot/models/interpolation.py
new file mode 100644
index 0000000000000000000000000000000000000000..2edb78a31e98d6cb050033d2b0da53dc9c89cf72
--- /dev/null
+++ b/data/dot_single_video/dot/models/interpolation.py
@@ -0,0 +1,53 @@
+import warnings
+import torch
+
+try:
+ from dot.utils import torch3d
+except ModuleNotFoundError:
+ torch3d = None
+
+if torch3d:
+ TORCH3D_AVAILABLE = True
+else:
+ TORCH3D_AVAILABLE = False
+
+
+def interpolate(src_points, tgt_points, grid, version="torch3d"):
+ B, S, _ = src_points.shape
+ H, W, _ = grid.shape
+
+ # For each point in a regular grid, find indices of nearest visible source point
+ grid = grid.view(1, H * W, 2).expand(B, -1, -1) # B HW 2
+ src_pos, src_alpha = src_points[..., :2], src_points[..., 2]
+ if version == "torch" or (version == "torch3d" and not TORCH3D_AVAILABLE):
+ if version == "torch3d":
+ warnings.warn(
+ "Torch3D is not available. For optimal speed and memory consumption, consider setting it up.",
+ stacklevel=2,
+ )
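+ # Pairwise squared distances between grid cells and source points via ||g||^2 + ||p||^2 - 2 p.g; occluded sources are masked with +inf before picking the nearest visible source for each cell.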
+ dis = (grid ** 2).sum(-1)[:, None] + (src_pos ** 2).sum(-1)[:, :, None] - 2 * src_pos @ grid.permute(0, 2, 1)
+ dis[src_alpha == 0] = float('inf')
+ _, idx = dis.min(dim=1)
+ idx = idx.view(B, H * W, 1)
+ elif version == "torch3d":
+ src_pos_packed = src_pos[src_alpha.bool()]
+ tgt_points_packed = tgt_points[src_alpha.bool()]
+ lengths = src_alpha.sum(dim=1).long()
+ max_length = int(lengths.max())
+ cum_lengths = lengths.cumsum(dim=0)
+ cum_lengths = torch.cat([torch.zeros_like(cum_lengths[:1]), cum_lengths[:-1]])
+ src_pos = torch3d.packed_to_padded(src_pos_packed, cum_lengths, max_length)
+ tgt_points = torch3d.packed_to_padded(tgt_points_packed, cum_lengths, max_length)
+ _, idx, _ = torch3d.knn_points(grid, src_pos, lengths2=lengths, return_nn=False)
+ idx = idx.view(B, H * W, 1)
+
+ # Use correspondences between source and target points to initialize the flow
+ tgt_pos, tgt_alpha = tgt_points[..., :2], tgt_points[..., 2]
+ flow = tgt_pos - src_pos
+ flow = torch.cat([flow, tgt_alpha[..., None]], dim=-1) # B S 3
+ flow = flow.gather(dim=1, index=idx.expand(-1, -1, flow.size(-1)))
+ flow = flow.view(B, H, W, -1)
+ flow, alpha = flow[..., :2], flow[..., 2]
+ flow[..., 0] = flow[..., 0] * (W - 1)
+ flow[..., 1] = flow[..., 1] * (H - 1)
+ return flow, alpha
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/optical_flow.py b/data/dot_single_video/dot/models/optical_flow.py
new file mode 100644
index 0000000000000000000000000000000000000000..69f7e186bf45daa98b28e16349fd85116d6125ff
--- /dev/null
+++ b/data/dot_single_video/dot/models/optical_flow.py
@@ -0,0 +1,91 @@
+import torch
+from torch import nn
+import torch.nn.functional as F
+from tqdm import tqdm
+
+from .shelf import RAFT
+from .interpolation import interpolate
+from dot.utils.io import read_config
+from dot.utils.torch import get_grid, get_sobel_kernel
+
+
+class OpticalFlow(nn.Module):
+ def __init__(self, height, width, config, load_path):
+ super().__init__()
+ model_args = read_config(config)
+ model_dict = {"raft": RAFT}
+ self.model = model_dict[model_args.name](model_args)
+ self.name = model_args.name
+ if load_path is not None:
+ device = next(self.model.parameters()).device
+ self.model.load_state_dict(torch.load(load_path, map_location=device))
+ coarse_height, coarse_width = height // model_args.patch_size, width // model_args.patch_size
+ self.register_buffer("coarse_grid", get_grid(coarse_height, coarse_width))
+
+ def forward(self, data, mode, **kwargs):
+ if mode == "flow_with_tracks_init":
+ return self.get_flow_with_tracks_init(data, **kwargs)
+ elif mode == "motion_boundaries":
+ return self.get_motion_boundaries(data, **kwargs)
+ elif mode == "feats":
+ return self.get_feats(data, **kwargs)
+ elif mode == "tracks_for_queries":
+ return self.get_tracks_for_queries(data, **kwargs)
+ elif mode == "tracks_from_first_to_every_other_frame":
+ return self.get_tracks_from_first_to_every_other_frame(data, **kwargs)
+ elif mode == "flow_from_last_to_first_frame":
+ return self.get_flow_from_last_to_first_frame(data, **kwargs)
+ else:
+ raise ValueError(f"Unknown mode {mode}")
+
+ def get_motion_boundaries(self, data, boundaries_size=1, boundaries_dilation=4, boundaries_thresh=0.025, **kwargs):
+ eps = 1e-12
+ src_frame, tgt_frame = data["src_frame"], data["tgt_frame"]
+ K = boundaries_size * 2 + 1
+ D = boundaries_dilation
+ B, _, H, W = src_frame.shape
+ reflect = torch.nn.ReflectionPad2d(K // 2)
+ sobel_kernel = get_sobel_kernel(K).to(src_frame.device)
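+ # Motion boundaries are detected as large spatial gradients of the normalized flow: Sobel filtering, gradient magnitude, optional max-pool dilation, per-sample min-max normalization, then thresholding.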
+ flow, _ = self.model(src_frame, tgt_frame)
+ norm_flow = torch.stack([flow[..., 0] / (W - 1), flow[..., 1] / (H - 1)], dim=-1)
+ norm_flow = norm_flow.permute(0, 3, 1, 2).reshape(-1, 1, H, W)
+ boundaries = F.conv2d(reflect(norm_flow), sobel_kernel)
+ boundaries = ((boundaries ** 2).sum(dim=1, keepdim=True) + eps).sqrt()
+ boundaries = boundaries.view(-1, 2, H, W).mean(dim=1, keepdim=True)
+ if boundaries_dilation > 1:
+ boundaries = torch.nn.functional.max_pool2d(boundaries, kernel_size=D * 2, stride=1, padding=D)
+ boundaries = boundaries[:, :, -H:, -W:]
+ boundaries = boundaries[:, 0]
+ boundaries = boundaries - boundaries.reshape(B, -1).min(dim=1)[0].reshape(B, 1, 1)
+ boundaries = boundaries / boundaries.reshape(B, -1).max(dim=1)[0].reshape(B, 1, 1)
+ boundaries = boundaries > boundaries_thresh
+ return {"motion_boundaries": boundaries, "flow": flow}
+
+ def get_feats(self, data, **kwargs):
+ video = data["video"]
+ feats = []
+ for step in tqdm(range(video.size(1)), desc="Extract feats for frame", leave=False):
+ feats.append(self.model.encode(video[:, step]))
+ feats = torch.stack(feats, dim=1)
+ return {"feats": feats}
+
+ def get_flow_with_tracks_init(self, data, is_train=False, interpolation_version="torch3d", alpha_thresh=0.8, **kwargs):
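+ # Interpolate a coarse flow/alpha field from the sparse track correspondences, then refine it with the flow model; at inference, alpha is binarized with alpha_thresh.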
+ coarse_flow, coarse_alpha = interpolate(data["src_points"], data["tgt_points"], self.coarse_grid,
+ version=interpolation_version)
+ flow, alpha = self.model(src_frame=data["src_frame"] if "src_feats" not in data else None,
+ tgt_frame=data["tgt_frame"] if "tgt_feats" not in data else None,
+ src_feats=data["src_feats"] if "src_feats" in data else None,
+ tgt_feats=data["tgt_feats"] if "tgt_feats" in data else None,
+ coarse_flow=coarse_flow,
+ coarse_alpha=coarse_alpha,
+ is_train=is_train)
+ if not is_train:
+ alpha = (alpha > alpha_thresh).float()
+ return {"flow": flow, "alpha": alpha, "coarse_flow": coarse_flow, "coarse_alpha": coarse_alpha}
+
+ def get_tracks_for_queries(self, data, **kwargs):
+ raise NotImplementedError
+
+
+
+
diff --git a/data/dot_single_video/dot/models/point_tracking.py b/data/dot_single_video/dot/models/point_tracking.py
new file mode 100644
index 0000000000000000000000000000000000000000..ea48ebd771ebd264c8e0897186bb8bbff4f4379f
--- /dev/null
+++ b/data/dot_single_video/dot/models/point_tracking.py
@@ -0,0 +1,132 @@
+from tqdm import tqdm
+import torch
+from torch import nn
+
+from .optical_flow import OpticalFlow
+from .shelf import CoTracker, CoTracker2, Tapir
+from dot.utils.io import read_config
+from dot.utils.torch import sample_points, sample_mask_points, get_grid
+
+
+class PointTracker(nn.Module):
+ def __init__(self, height, width, tracker_config, tracker_path, estimator_config, estimator_path):
+ super().__init__()
+ model_args = read_config(tracker_config)
+ model_dict = {
+ "cotracker": CoTracker,
+ "cotracker2": CoTracker2,
+ "tapir": Tapir,
+ "bootstapir": Tapir
+ }
+ self.name = model_args.name
+ self.model = model_dict[model_args.name](model_args)
+ if tracker_path is not None:
+ device = next(self.model.parameters()).device
+ self.model.load_state_dict(torch.load(tracker_path, map_location=device), strict=False)
+ self.optical_flow_estimator = OpticalFlow(height, width, estimator_config, estimator_path)
+
+ def forward(self, data, mode, **kwargs):
+ if mode == "tracks_at_motion_boundaries":
+ return self.get_tracks_at_motion_boundaries(data, **kwargs)
+ elif mode == "flow_from_last_to_first_frame":
+ return self.get_flow_from_last_to_first_frame(data, **kwargs)
+ else:
+ raise ValueError(f"Unknown mode {mode}")
+
+ def get_tracks_at_motion_boundaries(self, data, num_tracks=8192, sim_tracks=2048, sample_mode="all", **kwargs):
+ video = data["video"]
+ N, S = num_tracks, sim_tracks
+ B, T, _, H, W = video.shape
+ assert N % S == 0
+
+ # Define sampling strategy
+ if sample_mode == "all":
+ samples_per_step = [S // T for _ in range(T)]
+ samples_per_step[0] += S - sum(samples_per_step)
+ backward_tracking = True
+ flip = False
+ elif sample_mode == "first":
+ samples_per_step = [0 for _ in range(T)]
+ samples_per_step[0] += S
+ backward_tracking = False
+ flip = False
+ elif sample_mode == "last":
+ samples_per_step = [0 for _ in range(T)]
+ samples_per_step[0] += S
+ backward_tracking = False
+ flip = True
+ else:
+ raise ValueError(f"Unknown sample mode {sample_mode}")
+
+ if flip:
+ video = video.flip(dims=[1])
+
+ # Track batches of points
+ tracks = []
+ motion_boundaries = {}
+ cache_features = True
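+ # Each iteration samples S points near motion boundaries (estimated via optical flow between neighbouring frames) and tracks them; encoder features are cached after the first batch.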
+ for _ in tqdm(range(N // S), desc="Track batch of points", leave=False):
+ src_points = []
+ for src_step, src_samples in enumerate(samples_per_step):
+ if src_samples == 0:
+ continue
+ if src_step not in motion_boundaries:
+ tgt_step = src_step - 1 if src_step > 0 else src_step + 1
+ data = {"src_frame": video[:, src_step], "tgt_frame": video[:, tgt_step]}
+ pred = self.optical_flow_estimator(data, mode="motion_boundaries", **kwargs)
+ motion_boundaries[src_step] = pred["motion_boundaries"]
+ src_boundaries = motion_boundaries[src_step]
+ src_points.append(sample_points(src_step, src_boundaries, src_samples))
+ src_points = torch.cat(src_points, dim=1)
+ traj, vis = self.model(video, src_points, backward_tracking, cache_features)
+ tracks.append(torch.cat([traj, vis[..., None]], dim=-1))
+ cache_features = False
+ tracks = torch.cat(tracks, dim=2)
+
+ if flip:
+ tracks = tracks.flip(dims=[1])
+
+ return {"tracks": tracks}
+
+ def get_flow_from_last_to_first_frame(self, data, sim_tracks=2048, **kwargs):
+ video = data["video"]
+ video = video.flip(dims=[1])
+ src_step = 0 # The video has been flipped along the temporal axis, so the source frame is index 0
+ B, T, C, H, W = video.shape
+ S = sim_tracks
+ backward_tracking = False
+ cache_features = True
+ flow = get_grid(H, W, shape=[B]).cuda()
+ flow[..., 0] = flow[..., 0] * (W - 1)
+ flow[..., 1] = flow[..., 1] * (H - 1)
+ alpha = torch.zeros(B, H, W).cuda()
+ mask = torch.ones(H, W)
+ pbar = tqdm(total=H * W // S, desc="Track batch of points", leave=False)
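+ # Track every pixel of the (temporally flipped) video in batches of S query points until the whole H x W mask is exhausted, accumulating last-to-first flow and a visibility map alpha.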
+ while torch.any(mask):
+ points, (i, j) = sample_mask_points(src_step, mask, S)
+ idx = i * W + j
+ points = points.cuda()[None].expand(B, -1, -1)
+
+ traj, vis = self.model(video, points, backward_tracking, cache_features)
+ traj = traj[:, -1]
+ vis = vis[:, -1].float()
+
+ # Update mask
+ mask = mask.view(-1)
+ mask[idx] = 0
+ mask = mask.view(H, W)
+
+ # Update flow
+ flow = flow.view(B, -1, 2)
+ flow[:, idx] = traj - flow[:, idx]
+ flow = flow.view(B, H, W, 2)
+
+ # Update alpha
+ alpha = alpha.view(B, -1)
+ alpha[:, idx] = vis
+ alpha = alpha.view(B, H, W)
+
+ cache_features = False
+ pbar.update(1)
+ pbar.close()
+ return {"flow": flow, "alpha": alpha}
diff --git a/data/dot_single_video/dot/models/shelf/__init__.py b/data/dot_single_video/dot/models/shelf/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..5ce20921a243fec9437682fc8ec34ee3955dc87f
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/__init__.py
@@ -0,0 +1,4 @@
+from .raft import RAFT
+from .cotracker import CoTracker
+from .cotracker2 import CoTracker2
+from .tapir import Tapir
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/shelf/cotracker.py b/data/dot_single_video/dot/models/shelf/cotracker.py
new file mode 100644
index 0000000000000000000000000000000000000000..9e0a6d25a4ac32cbd2214cd37b955e2ab4cef99e
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker.py
@@ -0,0 +1,12 @@
+from torch import nn
+
+from .cotracker_utils.predictor import CoTrackerPredictor
+
+
+class CoTracker(nn.Module):
+ def __init__(self, args):
+ super().__init__()
+ self.model = CoTrackerPredictor(args.patch_size, args.wind_size)
+
+ def forward(self, video, queries, backward_tracking, cache_features=False):
+ return self.model(video, queries=queries, backward_tracking=backward_tracking, cache_features=cache_features)
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2.py b/data/dot_single_video/dot/models/shelf/cotracker2.py
new file mode 100644
index 0000000000000000000000000000000000000000..0547bfdf4e44f016ec77d60e5f13102a4f0505ad
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2.py
@@ -0,0 +1,14 @@
+from torch import nn
+
+from .cotracker2_utils.predictor import CoTrackerPredictor
+
+
+class CoTracker2(nn.Module):
+ def __init__(self, args):
+ super().__init__()
+ self.model = CoTrackerPredictor(args.patch_size, args.wind_size)
+
+ def forward(self, video, queries, backward_tracking, cache_features=False):
+ return self.model(video, queries=queries, backward_tracking=backward_tracking, cache_features=cache_features)
+
+
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/LICENSE.md b/data/dot_single_video/dot/models/shelf/cotracker2_utils/LICENSE.md
new file mode 100644
index 0000000000000000000000000000000000000000..ba959871dca0f9b6775570410879e637de44d7b4
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/LICENSE.md
@@ -0,0 +1,399 @@
+Attribution-NonCommercial 4.0 International
+
+=======================================================================
+
+Creative Commons Corporation ("Creative Commons") is not a law firm and
+does not provide legal services or legal advice. Distribution of
+Creative Commons public licenses does not create a lawyer-client or
+other relationship. Creative Commons makes its licenses and related
+information available on an "as-is" basis. Creative Commons gives no
+warranties regarding its licenses, any material licensed under their
+terms and conditions, or any related information. Creative Commons
+disclaims all liability for damages resulting from their use to the
+fullest extent possible.
+
+Using Creative Commons Public Licenses
+
+Creative Commons public licenses provide a standard set of terms and
+conditions that creators and other rights holders may use to share
+original works of authorship and other material subject to copyright
+and certain other rights specified in the public license below. The
+following considerations are for informational purposes only, are not
+exhaustive, and do not form part of our licenses.
+
+ Considerations for licensors: Our public licenses are
+ intended for use by those authorized to give the public
+ permission to use material in ways otherwise restricted by
+ copyright and certain other rights. Our licenses are
+ irrevocable. Licensors should read and understand the terms
+ and conditions of the license they choose before applying it.
+ Licensors should also secure all rights necessary before
+ applying our licenses so that the public can reuse the
+ material as expected. Licensors should clearly mark any
+ material not subject to the license. This includes other CC-
+ licensed material, or material used under an exception or
+ limitation to copyright. More considerations for licensors:
+ wiki.creativecommons.org/Considerations_for_licensors
+
+ Considerations for the public: By using one of our public
+ licenses, a licensor grants the public permission to use the
+ licensed material under specified terms and conditions. If
+ the licensor's permission is not necessary for any reason--for
+ example, because of any applicable exception or limitation to
+ copyright--then that use is not regulated by the license. Our
+ licenses grant only permissions under copyright and certain
+ other rights that a licensor has authority to grant. Use of
+ the licensed material may still be restricted for other
+ reasons, including because others have copyright or other
+ rights in the material. A licensor may make special requests,
+ such as asking that all changes be marked or described.
+ Although not required by our licenses, you are encouraged to
+ respect those requests where reasonable. More_considerations
+ for the public:
+ wiki.creativecommons.org/Considerations_for_licensees
+
+=======================================================================
+
+Creative Commons Attribution-NonCommercial 4.0 International Public
+License
+
+By exercising the Licensed Rights (defined below), You accept and agree
+to be bound by the terms and conditions of this Creative Commons
+Attribution-NonCommercial 4.0 International Public License ("Public
+License"). To the extent this Public License may be interpreted as a
+contract, You are granted the Licensed Rights in consideration of Your
+acceptance of these terms and conditions, and the Licensor grants You
+such rights in consideration of benefits the Licensor receives from
+making the Licensed Material available under these terms and
+conditions.
+
+Section 1 -- Definitions.
+
+ a. Adapted Material means material subject to Copyright and Similar
+ Rights that is derived from or based upon the Licensed Material
+ and in which the Licensed Material is translated, altered,
+ arranged, transformed, or otherwise modified in a manner requiring
+ permission under the Copyright and Similar Rights held by the
+ Licensor. For purposes of this Public License, where the Licensed
+ Material is a musical work, performance, or sound recording,
+ Adapted Material is always produced where the Licensed Material is
+ synched in timed relation with a moving image.
+
+ b. Adapter's License means the license You apply to Your Copyright
+ and Similar Rights in Your contributions to Adapted Material in
+ accordance with the terms and conditions of this Public License.
+
+ c. Copyright and Similar Rights means copyright and/or similar rights
+ closely related to copyright including, without limitation,
+ performance, broadcast, sound recording, and Sui Generis Database
+ Rights, without regard to how the rights are labeled or
+ categorized. For purposes of this Public License, the rights
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
+ Rights.
+ d. Effective Technological Measures means those measures that, in the
+ absence of proper authority, may not be circumvented under laws
+ fulfilling obligations under Article 11 of the WIPO Copyright
+ Treaty adopted on December 20, 1996, and/or similar international
+ agreements.
+
+ e. Exceptions and Limitations means fair use, fair dealing, and/or
+ any other exception or limitation to Copyright and Similar Rights
+ that applies to Your use of the Licensed Material.
+
+ f. Licensed Material means the artistic or literary work, database,
+ or other material to which the Licensor applied this Public
+ License.
+
+ g. Licensed Rights means the rights granted to You subject to the
+ terms and conditions of this Public License, which are limited to
+ all Copyright and Similar Rights that apply to Your use of the
+ Licensed Material and that the Licensor has authority to license.
+
+ h. Licensor means the individual(s) or entity(ies) granting rights
+ under this Public License.
+
+ i. NonCommercial means not primarily intended for or directed towards
+ commercial advantage or monetary compensation. For purposes of
+ this Public License, the exchange of the Licensed Material for
+ other material subject to Copyright and Similar Rights by digital
+ file-sharing or similar means is NonCommercial provided there is
+ no payment of monetary compensation in connection with the
+ exchange.
+
+ j. Share means to provide material to the public by any means or
+ process that requires permission under the Licensed Rights, such
+ as reproduction, public display, public performance, distribution,
+ dissemination, communication, or importation, and to make material
+ available to the public including in ways that members of the
+ public may access the material from a place and at a time
+ individually chosen by them.
+
+ k. Sui Generis Database Rights means rights other than copyright
+ resulting from Directive 96/9/EC of the European Parliament and of
+ the Council of 11 March 1996 on the legal protection of databases,
+ as amended and/or succeeded, as well as other essentially
+ equivalent rights anywhere in the world.
+
+ l. You means the individual or entity exercising the Licensed Rights
+ under this Public License. Your has a corresponding meaning.
+
+Section 2 -- Scope.
+
+ a. License grant.
+
+ 1. Subject to the terms and conditions of this Public License,
+ the Licensor hereby grants You a worldwide, royalty-free,
+ non-sublicensable, non-exclusive, irrevocable license to
+ exercise the Licensed Rights in the Licensed Material to:
+
+ a. reproduce and Share the Licensed Material, in whole or
+ in part, for NonCommercial purposes only; and
+
+ b. produce, reproduce, and Share Adapted Material for
+ NonCommercial purposes only.
+
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
+ Exceptions and Limitations apply to Your use, this Public
+ License does not apply, and You do not need to comply with
+ its terms and conditions.
+
+ 3. Term. The term of this Public License is specified in Section
+ 6(a).
+
+ 4. Media and formats; technical modifications allowed. The
+ Licensor authorizes You to exercise the Licensed Rights in
+ all media and formats whether now known or hereafter created,
+ and to make technical modifications necessary to do so. The
+ Licensor waives and/or agrees not to assert any right or
+ authority to forbid You from making technical modifications
+ necessary to exercise the Licensed Rights, including
+ technical modifications necessary to circumvent Effective
+ Technological Measures. For purposes of this Public License,
+ simply making modifications authorized by this Section 2(a)
+ (4) never produces Adapted Material.
+
+ 5. Downstream recipients.
+
+ a. Offer from the Licensor -- Licensed Material. Every
+ recipient of the Licensed Material automatically
+ receives an offer from the Licensor to exercise the
+ Licensed Rights under the terms and conditions of this
+ Public License.
+
+ b. No downstream restrictions. You may not offer or impose
+ any additional or different terms or conditions on, or
+ apply any Effective Technological Measures to, the
+ Licensed Material if doing so restricts exercise of the
+ Licensed Rights by any recipient of the Licensed
+ Material.
+
+ 6. No endorsement. Nothing in this Public License constitutes or
+ may be construed as permission to assert or imply that You
+ are, or that Your use of the Licensed Material is, connected
+ with, or sponsored, endorsed, or granted official status by,
+ the Licensor or others designated to receive attribution as
+ provided in Section 3(a)(1)(A)(i).
+
+ b. Other rights.
+
+ 1. Moral rights, such as the right of integrity, are not
+ licensed under this Public License, nor are publicity,
+ privacy, and/or other similar personality rights; however, to
+ the extent possible, the Licensor waives and/or agrees not to
+ assert any such rights held by the Licensor to the limited
+ extent necessary to allow You to exercise the Licensed
+ Rights, but not otherwise.
+
+ 2. Patent and trademark rights are not licensed under this
+ Public License.
+
+ 3. To the extent possible, the Licensor waives any right to
+ collect royalties from You for the exercise of the Licensed
+ Rights, whether directly or through a collecting society
+ under any voluntary or waivable statutory or compulsory
+ licensing scheme. In all other cases the Licensor expressly
+ reserves any right to collect such royalties, including when
+ the Licensed Material is used other than for NonCommercial
+ purposes.
+
+Section 3 -- License Conditions.
+
+Your exercise of the Licensed Rights is expressly made subject to the
+following conditions.
+
+ a. Attribution.
+
+ 1. If You Share the Licensed Material (including in modified
+ form), You must:
+
+ a. retain the following if it is supplied by the Licensor
+ with the Licensed Material:
+
+ i. identification of the creator(s) of the Licensed
+ Material and any others designated to receive
+ attribution, in any reasonable manner requested by
+ the Licensor (including by pseudonym if
+ designated);
+
+ ii. a copyright notice;
+
+ iii. a notice that refers to this Public License;
+
+ iv. a notice that refers to the disclaimer of
+ warranties;
+
+ v. a URI or hyperlink to the Licensed Material to the
+ extent reasonably practicable;
+
+ b. indicate if You modified the Licensed Material and
+ retain an indication of any previous modifications; and
+
+ c. indicate the Licensed Material is licensed under this
+ Public License, and include the text of, or the URI or
+ hyperlink to, this Public License.
+
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
+ reasonable manner based on the medium, means, and context in
+ which You Share the Licensed Material. For example, it may be
+ reasonable to satisfy the conditions by providing a URI or
+ hyperlink to a resource that includes the required
+ information.
+
+ 3. If requested by the Licensor, You must remove any of the
+ information required by Section 3(a)(1)(A) to the extent
+ reasonably practicable.
+
+ 4. If You Share Adapted Material You produce, the Adapter's
+ License You apply must not prevent recipients of the Adapted
+ Material from complying with this Public License.
+
+Section 4 -- Sui Generis Database Rights.
+
+Where the Licensed Rights include Sui Generis Database Rights that
+apply to Your use of the Licensed Material:
+
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
+ to extract, reuse, reproduce, and Share all or a substantial
+ portion of the contents of the database for NonCommercial purposes
+ only;
+
+ b. if You include all or a substantial portion of the database
+ contents in a database in which You have Sui Generis Database
+ Rights, then the database in which You have Sui Generis Database
+ Rights (but not its individual contents) is Adapted Material; and
+
+ c. You must comply with the conditions in Section 3(a) if You Share
+ all or a substantial portion of the contents of the database.
+
+For the avoidance of doubt, this Section 4 supplements and does not
+replace Your obligations under this Public License where the Licensed
+Rights include other Copyright and Similar Rights.
+
+Section 5 -- Disclaimer of Warranties and Limitation of Liability.
+
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
+
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
+
+ c. The disclaimer of warranties and limitation of liability provided
+ above shall be interpreted in a manner that, to the extent
+ possible, most closely approximates an absolute disclaimer and
+ waiver of all liability.
+
+Section 6 -- Term and Termination.
+
+ a. This Public License applies for the term of the Copyright and
+ Similar Rights licensed here. However, if You fail to comply with
+ this Public License, then Your rights under this Public License
+ terminate automatically.
+
+ b. Where Your right to use the Licensed Material has terminated under
+ Section 6(a), it reinstates:
+
+ 1. automatically as of the date the violation is cured, provided
+ it is cured within 30 days of Your discovery of the
+ violation; or
+
+ 2. upon express reinstatement by the Licensor.
+
+ For the avoidance of doubt, this Section 6(b) does not affect any
+ right the Licensor may have to seek remedies for Your violations
+ of this Public License.
+
+ c. For the avoidance of doubt, the Licensor may also offer the
+ Licensed Material under separate terms or conditions or stop
+ distributing the Licensed Material at any time; however, doing so
+ will not terminate this Public License.
+
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
+ License.
+
+Section 7 -- Other Terms and Conditions.
+
+ a. The Licensor shall not be bound by any additional or different
+ terms or conditions communicated by You unless expressly agreed.
+
+ b. Any arrangements, understandings, or agreements regarding the
+ Licensed Material not stated herein are separate from and
+ independent of the terms and conditions of this Public License.
+
+Section 8 -- Interpretation.
+
+ a. For the avoidance of doubt, this Public License does not, and
+ shall not be interpreted to, reduce, limit, restrict, or impose
+ conditions on any use of the Licensed Material that could lawfully
+ be made without permission under this Public License.
+
+ b. To the extent possible, if any provision of this Public License is
+ deemed unenforceable, it shall be automatically reformed to the
+ minimum extent necessary to make it enforceable. If the provision
+ cannot be reformed, it shall be severed from this Public License
+ without affecting the enforceability of the remaining terms and
+ conditions.
+
+ c. No term or condition of this Public License will be waived and no
+ failure to comply consented to unless expressly agreed to by the
+ Licensor.
+
+ d. Nothing in this Public License constitutes or may be interpreted
+ as a limitation upon, or waiver of, any privileges and immunities
+ that apply to the Licensor or You, including from the legal
+ processes of any jurisdiction or authority.
+
+=======================================================================
+
+Creative Commons is not a party to its public
+licenses. Notwithstanding, Creative Commons may elect to apply one of
+its public licenses to material it publishes and in those instances
+will be considered the “Licensor.” The text of the Creative Commons
+public licenses is dedicated to the public domain under the CC0 Public
+Domain Dedication. Except for the limited purpose of indicating that
+material is shared under a Creative Commons public license or as
+otherwise permitted by the Creative Commons policies published at
+creativecommons.org/policies, Creative Commons does not authorize the
+use of the trademark "Creative Commons" or any other trademark or logo
+of Creative Commons without its prior written consent including,
+without limitation, in connection with any unauthorized modifications
+to any of its public licenses or any other arrangements,
+understandings, or agreements concerning use of licensed material. For
+the avoidance of doubt, this paragraph does not form part of the
+public licenses.
+
+Creative Commons may be contacted at creativecommons.org.
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/build_cotracker.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/build_cotracker.py
new file mode 100644
index 0000000000000000000000000000000000000000..b26aa4b91d7b9e8ad1822f8f4d12a065ee7b7157
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/build_cotracker.py
@@ -0,0 +1,14 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+
+from dot.models.shelf.cotracker2_utils.models.core.cotracker.cotracker import CoTracker2
+
+
+def build_cotracker(patch_size, wind_size):
+ cotracker = CoTracker2(stride=patch_size, window_len=wind_size, add_space_attn=True)
+ return cotracker
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/blocks.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/blocks.py
new file mode 100644
index 0000000000000000000000000000000000000000..f64f945522b8f52876c7e86b4934f2fb1a949439
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/blocks.py
@@ -0,0 +1,367 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from functools import partial
+from typing import Callable
+import collections
+from torch import Tensor
+from itertools import repeat
+
+from dot.models.shelf.cotracker2_utils.models.core.model_utils import bilinear_sampler
+
+
+# From PyTorch internals
+def _ntuple(n):
+ def parse(x):
+ if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
+ return tuple(x)
+ return tuple(repeat(x, n))
+
+ return parse
+
+
+def exists(val):
+ return val is not None
+
+
+def default(val, d):
+ return val if exists(val) else d
+
+
+to_2tuple = _ntuple(2)
+
+
+class Mlp(nn.Module):
+ """MLP as used in Vision Transformer, MLP-Mixer and related networks"""
+
+ def __init__(
+ self,
+ in_features,
+ hidden_features=None,
+ out_features=None,
+ act_layer=nn.GELU,
+ norm_layer=None,
+ bias=True,
+ drop=0.0,
+ use_conv=False,
+ ):
+ super().__init__()
+ out_features = out_features or in_features
+ hidden_features = hidden_features or in_features
+ bias = to_2tuple(bias)
+ drop_probs = to_2tuple(drop)
+ linear_layer = partial(nn.Conv2d, kernel_size=1) if use_conv else nn.Linear
+
+ self.fc1 = linear_layer(in_features, hidden_features, bias=bias[0])
+ self.act = act_layer()
+ self.drop1 = nn.Dropout(drop_probs[0])
+ self.norm = norm_layer(hidden_features) if norm_layer is not None else nn.Identity()
+ self.fc2 = linear_layer(hidden_features, out_features, bias=bias[1])
+ self.drop2 = nn.Dropout(drop_probs[1])
+
+ def forward(self, x):
+ x = self.fc1(x)
+ x = self.act(x)
+ x = self.drop1(x)
+ x = self.fc2(x)
+ x = self.drop2(x)
+ return x
+
+
+class ResidualBlock(nn.Module):
+ def __init__(self, in_planes, planes, norm_fn="group", stride=1):
+ super(ResidualBlock, self).__init__()
+
+ self.conv1 = nn.Conv2d(
+ in_planes,
+ planes,
+ kernel_size=3,
+ padding=1,
+ stride=stride,
+ padding_mode="zeros",
+ )
+ self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, padding_mode="zeros")
+ self.relu = nn.ReLU(inplace=True)
+
+ num_groups = planes // 8
+
+ if norm_fn == "group":
+ self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+ self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+ if not stride == 1:
+ self.norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+
+ elif norm_fn == "batch":
+ self.norm1 = nn.BatchNorm2d(planes)
+ self.norm2 = nn.BatchNorm2d(planes)
+ if not stride == 1:
+ self.norm3 = nn.BatchNorm2d(planes)
+
+ elif norm_fn == "instance":
+ self.norm1 = nn.InstanceNorm2d(planes)
+ self.norm2 = nn.InstanceNorm2d(planes)
+ if not stride == 1:
+ self.norm3 = nn.InstanceNorm2d(planes)
+
+ elif norm_fn == "none":
+ self.norm1 = nn.Sequential()
+ self.norm2 = nn.Sequential()
+ if not stride == 1:
+ self.norm3 = nn.Sequential()
+
+ if stride == 1:
+ self.downsample = None
+
+ else:
+ self.downsample = nn.Sequential(
+ nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), self.norm3
+ )
+
+ def forward(self, x):
+ y = x
+ y = self.relu(self.norm1(self.conv1(y)))
+ y = self.relu(self.norm2(self.conv2(y)))
+
+ if self.downsample is not None:
+ x = self.downsample(x)
+
+ return self.relu(x + y)
+
+
+class BasicEncoder(nn.Module):
+ def __init__(self, input_dim=3, output_dim=128, stride=4):
+ super(BasicEncoder, self).__init__()
+ self.stride = stride
+ self.norm_fn = "instance"
+ self.in_planes = output_dim // 2
+
+ self.norm1 = nn.InstanceNorm2d(self.in_planes)
+ self.norm2 = nn.InstanceNorm2d(output_dim * 2)
+
+ self.conv1 = nn.Conv2d(
+ input_dim,
+ self.in_planes,
+ kernel_size=7,
+ stride=2,
+ padding=3,
+ padding_mode="zeros",
+ )
+ self.relu1 = nn.ReLU(inplace=True)
+ self.layer1 = self._make_layer(output_dim // 2, stride=1)
+ self.layer2 = self._make_layer(output_dim // 4 * 3, stride=2)
+ self.layer3 = self._make_layer(output_dim, stride=2)
+ self.layer4 = self._make_layer(output_dim, stride=2)
+
+ self.conv2 = nn.Conv2d(
+ output_dim * 3 + output_dim // 4,
+ output_dim * 2,
+ kernel_size=3,
+ padding=1,
+ padding_mode="zeros",
+ )
+ self.relu2 = nn.ReLU(inplace=True)
+ self.conv3 = nn.Conv2d(output_dim * 2, output_dim, kernel_size=1)
+ for m in self.modules():
+ if isinstance(m, nn.Conv2d):
+ nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
+ elif isinstance(m, (nn.InstanceNorm2d)):
+ if m.weight is not None:
+ nn.init.constant_(m.weight, 1)
+ if m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+
+ def _make_layer(self, dim, stride=1):
+ layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
+ layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
+ layers = (layer1, layer2)
+
+ self.in_planes = dim
+ return nn.Sequential(*layers)
+
+ def forward(self, x):
+ _, _, H, W = x.shape
+
+ x = self.conv1(x)
+ x = self.norm1(x)
+ x = self.relu1(x)
+
+ a = self.layer1(x)
+ b = self.layer2(a)
+ c = self.layer3(b)
+ d = self.layer4(c)
+
+ def _bilinear_interpolate(x):
+ return F.interpolate(
+ x,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+
+ a = _bilinear_interpolate(a)
+ b = _bilinear_interpolate(b)
+ c = _bilinear_interpolate(c)
+ d = _bilinear_interpolate(d)
+
+ x = self.conv2(torch.cat([a, b, c, d], dim=1))
+ x = self.norm2(x)
+ x = self.relu2(x)
+ x = self.conv3(x)
+ return x
+
+
+class CorrBlock:
+ def __init__(
+ self,
+ fmaps,
+ num_levels=4,
+ radius=4,
+ multiple_track_feats=False,
+ padding_mode="zeros",
+ ):
+ B, S, C, H, W = fmaps.shape
+ self.S, self.C, self.H, self.W = S, C, H, W
+ self.padding_mode = padding_mode
+ self.num_levels = num_levels
+ self.radius = radius
+ self.fmaps_pyramid = []
+ self.multiple_track_feats = multiple_track_feats
+
+ self.fmaps_pyramid.append(fmaps)
+ for i in range(self.num_levels - 1):
+ fmaps_ = fmaps.reshape(B * S, C, H, W)
+ fmaps_ = F.avg_pool2d(fmaps_, 2, stride=2)
+ _, _, H, W = fmaps_.shape
+ fmaps = fmaps_.reshape(B, S, C, H, W)
+ self.fmaps_pyramid.append(fmaps)
+
+ def sample(self, coords):
+ r = self.radius
+ B, S, N, D = coords.shape
+ assert D == 2
+
+ H, W = self.H, self.W
+ out_pyramid = []
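+ # At every pyramid level, bilinearly sample a (2r+1) x (2r+1) correlation patch around each track coordinate (scaled down by 2^i) and concatenate the patches across levels.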
+ for i in range(self.num_levels):
+ corrs = self.corrs_pyramid[i] # B, S, N, H, W
+ *_, H, W = corrs.shape
+
+ dx = torch.linspace(-r, r, 2 * r + 1)
+ dy = torch.linspace(-r, r, 2 * r + 1)
+ delta = torch.stack(torch.meshgrid(dy, dx, indexing="ij"), axis=-1).to(coords.device)
+
+ centroid_lvl = coords.reshape(B * S * N, 1, 1, 2) / 2**i
+ delta_lvl = delta.view(1, 2 * r + 1, 2 * r + 1, 2)
+ coords_lvl = centroid_lvl + delta_lvl
+
+ corrs = bilinear_sampler(
+ corrs.reshape(B * S * N, 1, H, W),
+ coords_lvl,
+ padding_mode=self.padding_mode,
+ )
+ corrs = corrs.view(B, S, N, -1)
+ out_pyramid.append(corrs)
+
+ out = torch.cat(out_pyramid, dim=-1) # B, S, N, LRR*2
+ out = out.permute(0, 2, 1, 3).contiguous().view(B * N, S, -1).float()
+ return out
+
+ def corr(self, targets):
+ B, S, N, C = targets.shape
+ if self.multiple_track_feats:
+ targets_split = targets.split(C // self.num_levels, dim=-1)
+ B, S, N, C = targets_split[0].shape
+
+ assert C == self.C
+ assert S == self.S
+
+ fmap1 = targets
+
+ self.corrs_pyramid = []
+ for i, fmaps in enumerate(self.fmaps_pyramid):
+ *_, H, W = fmaps.shape
+ fmap2s = fmaps.view(B, S, C, H * W) # B S C H W -> B S C (H W)
+ if self.multiple_track_feats:
+ fmap1 = targets_split[i]
+ corrs = torch.matmul(fmap1, fmap2s)
+ corrs = corrs.view(B, S, N, H, W) # B S N (H W) -> B S N H W
+ corrs = corrs / torch.sqrt(torch.tensor(C).float())
+ self.corrs_pyramid.append(corrs)
+
+
+class Attention(nn.Module):
+ def __init__(self, query_dim, context_dim=None, num_heads=8, dim_head=48, qkv_bias=False):
+ super().__init__()
+ inner_dim = dim_head * num_heads
+ context_dim = default(context_dim, query_dim)
+ self.scale = dim_head**-0.5
+ self.heads = num_heads
+
+ self.to_q = nn.Linear(query_dim, inner_dim, bias=qkv_bias)
+ self.to_kv = nn.Linear(context_dim, inner_dim * 2, bias=qkv_bias)
+ self.to_out = nn.Linear(inner_dim, query_dim)
+
+ def forward(self, x, context=None, attn_bias=None):
+ B, N1, C = x.shape
+ h = self.heads
+
+ q = self.to_q(x).reshape(B, N1, h, C // h).permute(0, 2, 1, 3)
+ context = default(context, x)
+ k, v = self.to_kv(context).chunk(2, dim=-1)
+
+ N2 = context.shape[1]
+ k = k.reshape(B, N2, h, C // h).permute(0, 2, 1, 3)
+ v = v.reshape(B, N2, h, C // h).permute(0, 2, 1, 3)
+
+ sim = (q @ k.transpose(-2, -1)) * self.scale
+
+ if attn_bias is not None:
+ sim = sim + attn_bias
+ attn = sim.softmax(dim=-1)
+
+ x = (attn @ v).transpose(1, 2).reshape(B, N1, C)
+ return self.to_out(x)
+
+
+class AttnBlock(nn.Module):
+ def __init__(
+ self,
+ hidden_size,
+ num_heads,
+ attn_class: Callable[..., nn.Module] = Attention,
+ mlp_ratio=4.0,
+ **block_kwargs
+ ):
+ super().__init__()
+ self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+ self.attn = attn_class(hidden_size, num_heads=num_heads, qkv_bias=True, **block_kwargs)
+
+ self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+ mlp_hidden_dim = int(hidden_size * mlp_ratio)
+ approx_gelu = lambda: nn.GELU(approximate="tanh")
+ self.mlp = Mlp(
+ in_features=hidden_size,
+ hidden_features=mlp_hidden_dim,
+ act_layer=approx_gelu,
+ drop=0,
+ )
+
+ def forward(self, x, mask=None):
+ attn_bias = mask
+ if mask is not None:
+ mask = (
+ (mask[:, None] * mask[:, :, None])
+ .unsqueeze(1)
+                .expand(-1, self.attn.heads, -1, -1)
+ )
+ max_neg_value = -torch.finfo(x.dtype).max
+ attn_bias = (~mask) * max_neg_value
+ x = x + self.attn(self.norm1(x), attn_bias=attn_bias)
+ x = x + self.mlp(self.norm2(x))
+ return x
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/cotracker.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/cotracker.py
new file mode 100644
index 0000000000000000000000000000000000000000..6f06ff96fb8814d93480d0a653e1fd157fadcfea
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/cotracker.py
@@ -0,0 +1,507 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from dot.models.shelf.cotracker2_utils.models.core.model_utils import sample_features4d, sample_features5d
+from dot.models.shelf.cotracker2_utils.models.core.embeddings import (
+ get_2d_embedding,
+ get_1d_sincos_pos_embed_from_grid,
+ get_2d_sincos_pos_embed,
+)
+
+from dot.models.shelf.cotracker2_utils.models.core.cotracker.blocks import (
+ Mlp,
+ BasicEncoder,
+ AttnBlock,
+ CorrBlock,
+ Attention,
+)
+
+torch.manual_seed(0)
+
+
+class CoTracker2(nn.Module):
+ def __init__(
+ self,
+ window_len=8,
+ stride=4,
+ add_space_attn=True,
+ num_virtual_tracks=64,
+ model_resolution=(384, 512),
+ ):
+ super(CoTracker2, self).__init__()
+ self.window_len = window_len
+ self.stride = stride
+ self.hidden_dim = 256
+ self.latent_dim = 128
+ self.add_space_attn = add_space_attn
+ self.fnet = BasicEncoder(output_dim=self.latent_dim)
+ self.num_virtual_tracks = num_virtual_tracks
+ self.model_resolution = model_resolution
+ self.input_dim = 456
+ self.updateformer = EfficientUpdateFormer(
+ space_depth=6,
+ time_depth=6,
+ input_dim=self.input_dim,
+ hidden_size=384,
+ output_dim=self.latent_dim + 2,
+ mlp_ratio=4.0,
+ add_space_attn=add_space_attn,
+ num_virtual_tracks=num_virtual_tracks,
+ )
+
+ time_grid = torch.linspace(0, window_len - 1, window_len).reshape(1, window_len, 1)
+
+ self.register_buffer(
+ "time_emb", get_1d_sincos_pos_embed_from_grid(self.input_dim, time_grid[0])
+ )
+
+ self.register_buffer(
+ "pos_emb",
+ get_2d_sincos_pos_embed(
+ embed_dim=self.input_dim,
+ grid_size=(
+ model_resolution[0] // stride,
+ model_resolution[1] // stride,
+ ),
+ ),
+ )
+ self.norm = nn.GroupNorm(1, self.latent_dim)
+ self.track_feat_updater = nn.Sequential(
+ nn.Linear(self.latent_dim, self.latent_dim),
+ nn.GELU(),
+ )
+ self.vis_predictor = nn.Sequential(
+ nn.Linear(self.latent_dim, 1),
+ )
+
+ def forward_window(
+ self,
+ fmaps,
+ coords,
+ track_feat=None,
+ vis=None,
+ track_mask=None,
+ attention_mask=None,
+ iters=4,
+ ):
+ # B = batch size
+        # S = number of frames in the window
+ # N = number of tracks
+ # C = channels of a point feature vector
+ # E = positional embedding size
+ # LRR = local receptive field radius
+ # D = dimension of the transformer input tokens
+
+ # track_feat = B S N C
+ # vis = B S N 1
+ # track_mask = B S N 1
+ # attention_mask = B S N
+
+ B, S_init, N, __ = track_mask.shape
+ B, S, *_ = fmaps.shape
+
+ track_mask = F.pad(track_mask, (0, 0, 0, 0, 0, S - S_init), "constant")
+ track_mask_vis = (
+ torch.cat([track_mask, vis], dim=-1).permute(0, 2, 1, 3).reshape(B * N, S, 2)
+ )
+
+ corr_block = CorrBlock(
+ fmaps,
+ num_levels=4,
+ radius=3,
+ padding_mode="border",
+ )
+
+ sampled_pos_emb = (
+ sample_features4d(self.pos_emb.repeat(B, 1, 1, 1), coords[:, 0])
+ .reshape(B * N, self.input_dim)
+ .unsqueeze(1)
+        )  # B N E -> (B N) 1 E
+
+ coord_preds = []
+ for __ in range(iters):
+ coords = coords.detach() # B S N 2
+ corr_block.corr(track_feat)
+
+ # Sample correlation features around each point
+ fcorrs = corr_block.sample(coords) # (B N) S LRR
+
+ # Get the flow embeddings
+ flows = (coords - coords[:, 0:1]).permute(0, 2, 1, 3).reshape(B * N, S, 2)
+ flow_emb = get_2d_embedding(flows, 64, cat_coords=True) # N S E
+
+ track_feat_ = track_feat.permute(0, 2, 1, 3).reshape(B * N, S, self.latent_dim)
+
+ transformer_input = torch.cat([flow_emb, fcorrs, track_feat_, track_mask_vis], dim=2)
+ x = transformer_input + sampled_pos_emb + self.time_emb
+ x = x.view(B, N, S, -1) # (B N) S D -> B N S D
+
+ delta = self.updateformer(
+ x,
+ attention_mask.reshape(B * S, N), # B S N -> (B S) N
+ )
+
+ delta_coords = delta[..., :2].permute(0, 2, 1, 3)
+ coords = coords + delta_coords
+ coord_preds.append(coords * self.stride)
+
+ delta_feats_ = delta[..., 2:].reshape(B * N * S, self.latent_dim)
+ track_feat_ = track_feat.permute(0, 2, 1, 3).reshape(B * N * S, self.latent_dim)
+ track_feat_ = self.track_feat_updater(self.norm(delta_feats_)) + track_feat_
+ track_feat = track_feat_.reshape(B, N, S, self.latent_dim).permute(
+ 0, 2, 1, 3
+ ) # (B N S) C -> B S N C
+
+ vis_pred = self.vis_predictor(track_feat).reshape(B, S, N)
+ return coord_preds, vis_pred
+
+ def get_track_feat(self, fmaps, queried_frames, queried_coords):
+ sample_frames = queried_frames[:, None, :, None]
+ sample_coords = torch.cat(
+ [
+ sample_frames,
+ queried_coords[:, None],
+ ],
+ dim=-1,
+ )
+ sample_track_feats = sample_features5d(fmaps, sample_coords)
+ return sample_track_feats
+
+ def init_video_online_processing(self):
+ self.online_ind = 0
+ self.online_track_feat = None
+ self.online_coords_predicted = None
+ self.online_vis_predicted = None
+
+ def forward(self, video, queries, iters=4, cached_feat=None, is_train=False, is_online=False):
+ """Predict tracks
+
+ Args:
+            video (FloatTensor[B, T, 3, H, W]): input videos.
+ queries (FloatTensor[B, N, 3]): point queries.
+ iters (int, optional): number of updates. Defaults to 4.
+ is_train (bool, optional): enables training mode. Defaults to False.
+ is_online (bool, optional): enables online mode. Defaults to False. Before enabling, call model.init_video_online_processing().
+
+ Returns:
+ - coords_predicted (FloatTensor[B, T, N, 2]):
+ - vis_predicted (FloatTensor[B, T, N]):
+ - train_data: `None` if `is_train` is false, otherwise:
+ - all_vis_predictions (List[FloatTensor[B, S, N, 1]]):
+ - all_coords_predictions (List[FloatTensor[B, S, N, 2]]):
+ - mask (BoolTensor[B, T, N]):
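+
+        Example (illustrative sketch; shapes follow the argument descriptions above)::
+
+            >>> model = CoTracker2()
+            >>> video = torch.randn(1, 16, 3, 384, 512)        # B T C H W
+            >>> queries = torch.tensor([[[0., 200., 150.]]])   # frame index, x, y
+            >>> coords, vis, _ = model(video, queries, iters=4)
+            >>> coords.shape, vis.shape
+            (torch.Size([1, 16, 1, 2]), torch.Size([1, 16, 1]))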
+ """
+ B, T, C, H, W = video.shape
+ B, N, __ = queries.shape
+ S = self.window_len
+ device = queries.device
+
+ # B = batch size
+ # S = number of frames in the window of the padded video
+ # S_trimmed = actual number of frames in the window
+ # N = number of tracks
+ # C = color channels (3 for RGB)
+ # E = positional embedding size
+ # LRR = local receptive field radius
+ # D = dimension of the transformer input tokens
+
+ # video = B T C H W
+ # queries = B N 3
+ # coords_init = B S N 2
+ # vis_init = B S N 1
+
+ assert S >= 2 # A tracker needs at least two frames to track something
+ if is_online:
+ assert T <= S, "Online mode: video chunk must be <= window size."
+ assert self.online_ind is not None, "Call model.init_video_online_processing() first."
+ assert not is_train, "Training not supported in online mode."
+ step = S // 2 # How much the sliding window moves at every step
+ video = 2 * video - 1.0
+
+ # The first channel is the frame number
+ # The rest are the coordinates of points we want to track
+ queried_frames = queries[:, :, 0].long()
+
+ queried_coords = queries[..., 1:]
+ queried_coords = queried_coords / self.stride
+
+ # We store our predictions here
+ coords_predicted = torch.zeros((B, T, N, 2), device=device)
+ vis_predicted = torch.zeros((B, T, N), device=device)
+ if is_online:
+ if self.online_coords_predicted is None:
+ # Init online predictions with zeros
+ self.online_coords_predicted = coords_predicted
+ self.online_vis_predicted = vis_predicted
+ else:
+ # Pad online predictions with zeros for the current window
+ pad = min(step, T - step)
+ coords_predicted = F.pad(
+ self.online_coords_predicted, (0, 0, 0, 0, 0, pad), "constant"
+ )
+ vis_predicted = F.pad(self.online_vis_predicted, (0, 0, 0, pad), "constant")
+ all_coords_predictions, all_vis_predictions = [], []
+
+ # Pad the video so that an integer number of sliding windows fit into it
+ # TODO: we may drop this requirement because the transformer should not care
+ # TODO: pad the features instead of the video
+ pad = S - T if is_online else (S - T % S) % S # We don't want to pad if T % S == 0
+ video = F.pad(video.reshape(B, 1, T, C * H * W), (0, 0, 0, pad), "replicate").reshape(
+ B, -1, C, H, W
+ )
+
+ # Compute convolutional features for the video or for the current chunk in case of online mode
+ if cached_feat is None:
+ fmaps = self.fnet(video.reshape(-1, C, H, W)).reshape(
+ B, -1, self.latent_dim, H // self.stride, W // self.stride
+ )
+ else:
+ _, _, c, h, w = cached_feat.shape
+ fmaps = F.pad(cached_feat.reshape(B, 1, T, c * h * w), (0, 0, 0, pad), "replicate").reshape(B, -1, c, h, w)
+
+ # We compute track features
+ track_feat = self.get_track_feat(
+ fmaps,
+ queried_frames - self.online_ind if is_online else queried_frames,
+ queried_coords,
+ ).repeat(1, S, 1, 1)
+ if is_online:
+ # We update track features for the current window
+ sample_frames = queried_frames[:, None, :, None] # B 1 N 1
+ left = 0 if self.online_ind == 0 else self.online_ind + step
+ right = self.online_ind + S
+ sample_mask = (sample_frames >= left) & (sample_frames < right)
+ if self.online_track_feat is None:
+ self.online_track_feat = torch.zeros_like(track_feat, device=device)
+ self.online_track_feat += track_feat * sample_mask
+ track_feat = self.online_track_feat.clone()
+ # We process ((num_windows - 1) * step + S) frames in total, so there are
+ # (ceil((T - S) / step) + 1) windows
+ num_windows = (T - S + step - 1) // step + 1
+ # We process only the current video chunk in the online mode
+ indices = [self.online_ind] if is_online else range(0, step * num_windows, step)
+
+ coords_init = queried_coords.reshape(B, 1, N, 2).expand(B, S, N, 2).float()
+ vis_init = torch.ones((B, S, N, 1), device=device).float() * 10
+ for ind in indices:
+ # We copy over coords and vis for tracks that are queried
+ # by the end of the previous window, which is ind + overlap
+ if ind > 0:
+ overlap = S - step
+ copy_over = (queried_frames < ind + overlap)[:, None, :, None] # B 1 N 1
+ coords_prev = torch.nn.functional.pad(
+ coords_predicted[:, ind : ind + overlap] / self.stride,
+ (0, 0, 0, 0, 0, step),
+ "replicate",
+ ) # B S N 2
+ vis_prev = torch.nn.functional.pad(
+ vis_predicted[:, ind : ind + overlap, :, None].clone(),
+ (0, 0, 0, 0, 0, step),
+ "replicate",
+ ) # B S N 1
+ coords_init = torch.where(
+ copy_over.expand_as(coords_init), coords_prev, coords_init
+ )
+ vis_init = torch.where(copy_over.expand_as(vis_init), vis_prev, vis_init)
+
+ # The attention mask is 1 for the spatio-temporal points within
+ # a track which is updated in the current window
+ attention_mask = (queried_frames < ind + S).reshape(B, 1, N).repeat(1, S, 1) # B S N
+
+ # The track mask is 1 for the spatio-temporal points that actually
+            # need updating: only after being queried, and not if contained
+ # in a previous window
+ track_mask = (
+ queried_frames[:, None, :, None]
+ <= torch.arange(ind, ind + S, device=device)[None, :, None, None]
+ ).contiguous() # B S N 1
+
+ if ind > 0:
+ track_mask[:, :overlap, :, :] = False
+
+ # Predict the coordinates and visibility for the current window
+ coords, vis = self.forward_window(
+ fmaps=fmaps if is_online else fmaps[:, ind : ind + S],
+ coords=coords_init,
+ track_feat=attention_mask.unsqueeze(-1) * track_feat,
+ vis=vis_init,
+ track_mask=track_mask,
+ attention_mask=attention_mask,
+ iters=iters,
+ )
+
+ S_trimmed = T if is_online else min(T - ind, S) # accounts for last window duration
+ coords_predicted[:, ind : ind + S] = coords[-1][:, :S_trimmed]
+ vis_predicted[:, ind : ind + S] = vis[:, :S_trimmed]
+ if is_train:
+ all_coords_predictions.append([coord[:, :S_trimmed] for coord in coords])
+ all_vis_predictions.append(torch.sigmoid(vis[:, :S_trimmed]))
+
+ if is_online:
+ self.online_ind += step
+ self.online_coords_predicted = coords_predicted
+ self.online_vis_predicted = vis_predicted
+ vis_predicted = torch.sigmoid(vis_predicted)
+
+ if is_train:
+ mask = queried_frames[:, None] <= torch.arange(0, T, device=device)[None, :, None]
+ train_data = (all_coords_predictions, all_vis_predictions, mask)
+ else:
+ train_data = None
+
+ return coords_predicted, vis_predicted, train_data
+
+
+class EfficientUpdateFormer(nn.Module):
+ """
+ Transformer model that updates track estimates.
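+
+    Tokens are arranged as (B, N, T, C): temporal attention runs over T for every
+    track, and, when `add_space_attn` is set, point tokens exchange information
+    with a small set of learned virtual tracks through cross-attention, which keeps
+    spatial mixing cheap when the number of tracks N is large.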
+ """
+
+ def __init__(
+ self,
+ space_depth=6,
+ time_depth=6,
+ input_dim=320,
+ hidden_size=384,
+ num_heads=8,
+ output_dim=130,
+ mlp_ratio=4.0,
+ add_space_attn=True,
+ num_virtual_tracks=64,
+ ):
+ super().__init__()
+ self.out_channels = 2
+ self.num_heads = num_heads
+ self.hidden_size = hidden_size
+ self.add_space_attn = add_space_attn
+ self.input_transform = torch.nn.Linear(input_dim, hidden_size, bias=True)
+ self.flow_head = torch.nn.Linear(hidden_size, output_dim, bias=True)
+ self.num_virtual_tracks = num_virtual_tracks
+ self.virual_tracks = nn.Parameter(torch.randn(1, num_virtual_tracks, 1, hidden_size))
+ self.time_blocks = nn.ModuleList(
+ [
+ AttnBlock(
+ hidden_size,
+ num_heads,
+ mlp_ratio=mlp_ratio,
+ attn_class=Attention,
+ )
+ for _ in range(time_depth)
+ ]
+ )
+
+ if add_space_attn:
+ self.space_virtual_blocks = nn.ModuleList(
+ [
+ AttnBlock(
+ hidden_size,
+ num_heads,
+ mlp_ratio=mlp_ratio,
+ attn_class=Attention,
+ )
+ for _ in range(space_depth)
+ ]
+ )
+ self.space_point2virtual_blocks = nn.ModuleList(
+ [
+ CrossAttnBlock(hidden_size, hidden_size, num_heads, mlp_ratio=mlp_ratio)
+ for _ in range(space_depth)
+ ]
+ )
+ self.space_virtual2point_blocks = nn.ModuleList(
+ [
+ CrossAttnBlock(hidden_size, hidden_size, num_heads, mlp_ratio=mlp_ratio)
+ for _ in range(space_depth)
+ ]
+ )
+ assert len(self.time_blocks) >= len(self.space_virtual2point_blocks)
+ self.initialize_weights()
+
+ def initialize_weights(self):
+ def _basic_init(module):
+ if isinstance(module, nn.Linear):
+ torch.nn.init.xavier_uniform_(module.weight)
+ if module.bias is not None:
+ nn.init.constant_(module.bias, 0)
+
+ self.apply(_basic_init)
+
+ def forward(self, input_tensor, mask=None):
+ tokens = self.input_transform(input_tensor)
+ B, _, T, _ = tokens.shape
+ virtual_tokens = self.virual_tracks.repeat(B, 1, T, 1)
+ tokens = torch.cat([tokens, virtual_tokens], dim=1)
+ _, N, _, _ = tokens.shape
+
+ j = 0
+ for i in range(len(self.time_blocks)):
+ time_tokens = tokens.contiguous().view(B * N, T, -1) # B N T C -> (B N) T C
+ time_tokens = self.time_blocks[i](time_tokens)
+
+ tokens = time_tokens.view(B, N, T, -1) # (B N) T C -> B N T C
+ if self.add_space_attn and (
+ i % (len(self.time_blocks) // len(self.space_virtual_blocks)) == 0
+ ):
+ space_tokens = (
+ tokens.permute(0, 2, 1, 3).contiguous().view(B * T, N, -1)
+ ) # B N T C -> (B T) N C
+ point_tokens = space_tokens[:, : N - self.num_virtual_tracks]
+ virtual_tokens = space_tokens[:, N - self.num_virtual_tracks :]
+
+ virtual_tokens = self.space_virtual2point_blocks[j](
+ virtual_tokens, point_tokens, mask=mask
+ )
+ virtual_tokens = self.space_virtual_blocks[j](virtual_tokens)
+ point_tokens = self.space_point2virtual_blocks[j](
+ point_tokens, virtual_tokens, mask=mask
+ )
+ space_tokens = torch.cat([point_tokens, virtual_tokens], dim=1)
+ tokens = space_tokens.view(B, T, N, -1).permute(0, 2, 1, 3) # (B T) N C -> B N T C
+ j += 1
+ tokens = tokens[:, : N - self.num_virtual_tracks]
+ flow = self.flow_head(tokens)
+ return flow
+
+
+class CrossAttnBlock(nn.Module):
+ def __init__(self, hidden_size, context_dim, num_heads=1, mlp_ratio=4.0, **block_kwargs):
+ super().__init__()
+ self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+ self.norm_context = nn.LayerNorm(hidden_size)
+ self.cross_attn = Attention(
+ hidden_size, context_dim=context_dim, num_heads=num_heads, qkv_bias=True, **block_kwargs
+ )
+
+ self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+ mlp_hidden_dim = int(hidden_size * mlp_ratio)
+ approx_gelu = lambda: nn.GELU(approximate="tanh")
+ self.mlp = Mlp(
+ in_features=hidden_size,
+ hidden_features=mlp_hidden_dim,
+ act_layer=approx_gelu,
+ drop=0,
+ )
+
+ def forward(self, x, context, mask=None):
+ if mask is not None:
+ if mask.shape[1] == x.shape[1]:
+ mask = mask[:, None, :, None].expand(
+ -1, self.cross_attn.heads, -1, context.shape[1]
+ )
+ else:
+ mask = mask[:, None, None].expand(-1, self.cross_attn.heads, x.shape[1], -1)
+
+ max_neg_value = -torch.finfo(x.dtype).max
+ attn_bias = (~mask) * max_neg_value
+ x = x + self.cross_attn(
+ self.norm1(x), context=self.norm_context(context), attn_bias=attn_bias
+ )
+ x = x + self.mlp(self.norm2(x))
+ return x
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/losses.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/losses.py
new file mode 100644
index 0000000000000000000000000000000000000000..ddfd8cbd4fea982b546ed82758ab75485316da43
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/cotracker/losses.py
@@ -0,0 +1,61 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn.functional as F
+from dot.models.shelf.cotracker2_utils.models.core.model_utils import reduce_masked_mean
+
+EPS = 1e-6
+
+
+def balanced_ce_loss(pred, gt, valid=None):
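+    # Numerically stable logistic loss on the visibility logits; the means over
+    # positive and negative ground-truth labels are computed separately and summed,
+    # so that both classes contribute equally.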
+ total_balanced_loss = 0.0
+ for j in range(len(gt)):
+ B, S, N = gt[j].shape
+ # pred and gt are the same shape
+ for (a, b) in zip(pred[j].size(), gt[j].size()):
+ assert a == b # some shape mismatch!
+ # if valid is not None:
+ for (a, b) in zip(pred[j].size(), valid[j].size()):
+ assert a == b # some shape mismatch!
+
+ pos = (gt[j] > 0.95).float()
+ neg = (gt[j] < 0.05).float()
+
+ label = pos * 2.0 - 1.0
+ a = -label * pred[j]
+ b = F.relu(a)
+ loss = b + torch.log(torch.exp(-b) + torch.exp(a - b))
+
+ pos_loss = reduce_masked_mean(loss, pos * valid[j])
+ neg_loss = reduce_masked_mean(loss, neg * valid[j])
+
+ balanced_loss = pos_loss + neg_loss
+ total_balanced_loss += balanced_loss / float(N)
+ return total_balanced_loss
+
+
+def sequence_loss(flow_preds, flow_gt, vis, valids, gamma=0.8):
+ """Loss function defined over sequence of flow predictions"""
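+    # Refinement iteration i is weighted by gamma ** (n_predictions - i - 1),
+    # so later (more refined) predictions contribute more to the loss.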
+ total_flow_loss = 0.0
+ for j in range(len(flow_gt)):
+ B, S, N, D = flow_gt[j].shape
+ assert D == 2
+ B, S1, N = vis[j].shape
+ B, S2, N = valids[j].shape
+ assert S == S1
+ assert S == S2
+ n_predictions = len(flow_preds[j])
+ flow_loss = 0.0
+ for i in range(n_predictions):
+ i_weight = gamma ** (n_predictions - i - 1)
+ flow_pred = flow_preds[j][i]
+ i_loss = (flow_pred - flow_gt[j]).abs() # B, S, N, 2
+ i_loss = torch.mean(i_loss, dim=3) # B, S, N
+ flow_loss += i_weight * reduce_masked_mean(i_loss, valids[j])
+ flow_loss = flow_loss / n_predictions
+ total_flow_loss += flow_loss / float(N)
+ return total_flow_loss
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/embeddings.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/embeddings.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ee4aeedeb68ef69d10667f4dc5b73e90335c34a
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/embeddings.py
@@ -0,0 +1,120 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+from typing import Tuple, Union
+import torch
+
+
+def get_2d_sincos_pos_embed(
+ embed_dim: int, grid_size: Union[int, Tuple[int, int]]
+) -> torch.Tensor:
+ """
+ This function initializes a grid and generates a 2D positional embedding using sine and cosine functions.
+ It is a wrapper of get_2d_sincos_pos_embed_from_grid.
+ Args:
+ - embed_dim: The embedding dimension.
+ - grid_size: The grid size.
+ Returns:
+ - pos_embed: The generated 2D positional embedding.
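+
+    Example (illustrative)::
+
+        >>> get_2d_sincos_pos_embed(64, (8, 12)).shape
+        torch.Size([1, 64, 8, 12])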
+ """
+ if isinstance(grid_size, tuple):
+ grid_size_h, grid_size_w = grid_size
+ else:
+ grid_size_h = grid_size_w = grid_size
+ grid_h = torch.arange(grid_size_h, dtype=torch.float)
+ grid_w = torch.arange(grid_size_w, dtype=torch.float)
+ grid = torch.meshgrid(grid_w, grid_h, indexing="xy")
+ grid = torch.stack(grid, dim=0)
+ grid = grid.reshape([2, 1, grid_size_h, grid_size_w])
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+ return pos_embed.reshape(1, grid_size_h, grid_size_w, -1).permute(0, 3, 1, 2)
+
+
+def get_2d_sincos_pos_embed_from_grid(
+ embed_dim: int, grid: torch.Tensor
+) -> torch.Tensor:
+ """
+ This function generates a 2D positional embedding from a given grid using sine and cosine functions.
+
+ Args:
+ - embed_dim: The embedding dimension.
+ - grid: The grid to generate the embedding from.
+
+ Returns:
+ - emb: The generated 2D positional embedding.
+ """
+ assert embed_dim % 2 == 0
+
+ # use half of dimensions to encode grid_h
+ emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
+ emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
+
+ emb = torch.cat([emb_h, emb_w], dim=2) # (H*W, D)
+ return emb
+
+
+def get_1d_sincos_pos_embed_from_grid(
+ embed_dim: int, pos: torch.Tensor
+) -> torch.Tensor:
+ """
+ This function generates a 1D positional embedding from a given grid using sine and cosine functions.
+
+ Args:
+ - embed_dim: The embedding dimension.
+ - pos: The position to generate the embedding from.
+
+ Returns:
+ - emb: The generated 1D positional embedding.
+ """
+ assert embed_dim % 2 == 0
+ omega = torch.arange(embed_dim // 2, dtype=torch.double)
+ omega /= embed_dim / 2.0
+ omega = 1.0 / 10000**omega # (D/2,)
+
+ pos = pos.reshape(-1) # (M,)
+ out = torch.einsum("m,d->md", pos, omega) # (M, D/2), outer product
+
+ emb_sin = torch.sin(out) # (M, D/2)
+ emb_cos = torch.cos(out) # (M, D/2)
+
+ emb = torch.cat([emb_sin, emb_cos], dim=1) # (M, D)
+ return emb[None].float()
+
+
+def get_2d_embedding(xy: torch.Tensor, C: int, cat_coords: bool = True) -> torch.Tensor:
+ """
+ This function generates a 2D positional embedding from given coordinates using sine and cosine functions.
+
+ Args:
+ - xy: The coordinates to generate the embedding from.
+ - C: The size of the embedding.
+ - cat_coords: A flag to indicate whether to concatenate the original coordinates to the embedding.
+
+ Returns:
+ - pe: The generated 2D positional embedding.
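+
+    Example (illustrative)::
+
+        >>> xy = torch.rand(1, 5, 2)
+        >>> get_2d_embedding(xy, 64, cat_coords=True).shape
+        torch.Size([1, 5, 130])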
+ """
+ B, N, D = xy.shape
+ assert D == 2
+
+ x = xy[:, :, 0:1]
+ y = xy[:, :, 1:2]
+ div_term = (
+ torch.arange(0, C, 2, device=xy.device, dtype=torch.float32) * (1000.0 / C)
+ ).reshape(1, 1, int(C / 2))
+
+ pe_x = torch.zeros(B, N, C, device=xy.device, dtype=torch.float32)
+ pe_y = torch.zeros(B, N, C, device=xy.device, dtype=torch.float32)
+
+ pe_x[:, :, 0::2] = torch.sin(x * div_term)
+ pe_x[:, :, 1::2] = torch.cos(x * div_term)
+
+ pe_y[:, :, 0::2] = torch.sin(y * div_term)
+ pe_y[:, :, 1::2] = torch.cos(y * div_term)
+
+    pe = torch.cat([pe_x, pe_y], dim=2)  # (B, N, C*2)
+ if cat_coords:
+        pe = torch.cat([xy, pe], dim=2)  # (B, N, C*2+2)
+ return pe
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/model_utils.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/model_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..12afd4e5e143f548a42350616456b614a88dfb2c
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/core/model_utils.py
@@ -0,0 +1,256 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn.functional as F
+from typing import Optional, Tuple
+
+EPS = 1e-6
+
+
+def smart_cat(tensor1, tensor2, dim):
+ if tensor1 is None:
+ return tensor2
+ return torch.cat([tensor1, tensor2], dim=dim)
+
+
+def get_points_on_a_grid(
+ size: int,
+ extent: Tuple[float, ...],
+ center: Optional[Tuple[float, ...]] = None,
+ device: Optional[torch.device] = torch.device("cpu"),
+):
+ r"""Get a grid of points covering a rectangular region
+
+ `get_points_on_a_grid(size, extent)` generates a :attr:`size` by
+    :attr:`size` grid of points distributed to cover a rectangular area
+ specified by `extent`.
+
+    The `extent` is a pair of integers :math:`(H,W)` specifying the height
+ and width of the rectangle.
+
+ Optionally, the :attr:`center` can be specified as a pair :math:`(c_y,c_x)`
+ specifying the vertical and horizontal center coordinates. The center
+ defaults to the middle of the extent.
+
+ Points are distributed uniformly within the rectangle leaving a margin
+ :math:`m=W/64` from the border.
+
+ It returns a :math:`(1, \text{size} \times \text{size}, 2)` tensor of
+ points :math:`P_{ij}=(x_i, y_i)` where
+
+ .. math::
+ P_{ij} = \left(
+ c_x + m -\frac{W}{2} + \frac{W - 2m}{\text{size} - 1}\, j,~
+ c_y + m -\frac{H}{2} + \frac{H - 2m}{\text{size} - 1}\, i
+ \right)
+
+ Points are returned in row-major order.
+
+ Args:
+ size (int): grid size.
+        extent (tuple): height and width of the grid extent.
+ center (tuple, optional): grid center.
+ device (str, optional): Defaults to `"cpu"`.
+
+ Returns:
+ Tensor: grid.
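+
+    Example (illustrative)::
+
+        >>> get_points_on_a_grid(3, (384, 512)).shape
+        torch.Size([1, 9, 2])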
+ """
+ if size == 1:
+ return torch.tensor([extent[1] / 2, extent[0] / 2], device=device)[None, None]
+
+ if center is None:
+ center = [extent[0] / 2, extent[1] / 2]
+
+ margin = extent[1] / 64
+ range_y = (margin - extent[0] / 2 + center[0], extent[0] / 2 + center[0] - margin)
+ range_x = (margin - extent[1] / 2 + center[1], extent[1] / 2 + center[1] - margin)
+ grid_y, grid_x = torch.meshgrid(
+ torch.linspace(*range_y, size, device=device),
+ torch.linspace(*range_x, size, device=device),
+ indexing="ij",
+ )
+ return torch.stack([grid_x, grid_y], dim=-1).reshape(1, -1, 2)
+
+
+def reduce_masked_mean(input, mask, dim=None, keepdim=False):
+ r"""Masked mean
+
+ `reduce_masked_mean(x, mask)` computes the mean of a tensor :attr:`input`
+ over a mask :attr:`mask`, returning
+
+ .. math::
+ \text{output} =
+ \frac
+ {\sum_{i=1}^N \text{input}_i \cdot \text{mask}_i}
+ {\epsilon + \sum_{i=1}^N \text{mask}_i}
+
+ where :math:`N` is the number of elements in :attr:`input` and
+ :attr:`mask`, and :math:`\epsilon` is a small constant to avoid
+ division by zero.
+
+    `reduce_masked_mean(x, mask, dim)` computes the mean of a tensor
+ :attr:`input` over a mask :attr:`mask` along a dimension :attr:`dim`.
+ Optionally, the dimension can be kept in the output by setting
+ :attr:`keepdim` to `True`. Tensor :attr:`mask` must be broadcastable to
+ the same dimension as :attr:`input`.
+
+ The interface is similar to `torch.mean()`.
+
+ Args:
+        input (Tensor): input tensor.
+ mask (Tensor): mask.
+ dim (int, optional): Dimension to sum over. Defaults to None.
+ keepdim (bool, optional): Keep the summed dimension. Defaults to False.
+
+ Returns:
+ Tensor: mean tensor.
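+
+    Example (illustrative)::
+
+        >>> x = torch.tensor([1., 2., 3., 4.])
+        >>> m = torch.tensor([1., 1., 0., 0.])
+        >>> reduce_masked_mean(x, m)  # (1 + 2) / 2
+        tensor(1.5000)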
+ """
+
+ mask = mask.expand_as(input)
+
+ prod = input * mask
+
+ if dim is None:
+ numer = torch.sum(prod)
+ denom = torch.sum(mask)
+ else:
+ numer = torch.sum(prod, dim=dim, keepdim=keepdim)
+ denom = torch.sum(mask, dim=dim, keepdim=keepdim)
+
+ mean = numer / (EPS + denom)
+ return mean
+
+
+def bilinear_sampler(input, coords, align_corners=True, padding_mode="border"):
+ r"""Sample a tensor using bilinear interpolation
+
+ `bilinear_sampler(input, coords)` samples a tensor :attr:`input` at
+ coordinates :attr:`coords` using bilinear interpolation. It is the same
+ as `torch.nn.functional.grid_sample()` but with a different coordinate
+ convention.
+
+ The input tensor is assumed to be of shape :math:`(B, C, H, W)`, where
+ :math:`B` is the batch size, :math:`C` is the number of channels,
+ :math:`H` is the height of the image, and :math:`W` is the width of the
+ image. The tensor :attr:`coords` of shape :math:`(B, H_o, W_o, 2)` is
+ interpreted as an array of 2D point coordinates :math:`(x_i,y_i)`.
+
+ Alternatively, the input tensor can be of size :math:`(B, C, T, H, W)`,
+ in which case sample points are triplets :math:`(t_i,x_i,y_i)`. Note
+ that in this case the order of the components is slightly different
+ from `grid_sample()`, which would expect :math:`(x_i,y_i,t_i)`.
+
+ If `align_corners` is `True`, the coordinate :math:`x` is assumed to be
+ in the range :math:`[0,W-1]`, with 0 corresponding to the center of the
+    left-most image pixel and :math:`W-1` to the center of the right-most
+ pixel.
+
+ If `align_corners` is `False`, the coordinate :math:`x` is assumed to
+ be in the range :math:`[0,W]`, with 0 corresponding to the left edge of
+    the left-most pixel and :math:`W` to the right edge of the right-most
+ pixel.
+
+ Similar conventions apply to the :math:`y` for the range
+ :math:`[0,H-1]` and :math:`[0,H]` and to :math:`t` for the range
+ :math:`[0,T-1]` and :math:`[0,T]`.
+
+ Args:
+ input (Tensor): batch of input images.
+ coords (Tensor): batch of coordinates.
+ align_corners (bool, optional): Coordinate convention. Defaults to `True`.
+ padding_mode (str, optional): Padding mode. Defaults to `"border"`.
+
+ Returns:
+ Tensor: sampled points.
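+
+    Example (illustrative)::
+
+        >>> feat = torch.arange(16.).view(1, 1, 4, 4)
+        >>> pts = torch.tensor([[[[1.0, 2.0]]]])  # one (x, y) sample in pixel units
+        >>> bilinear_sampler(feat, pts).shape     # sampled value ~ feat[0, 0, 2, 1]
+        torch.Size([1, 1, 1, 1])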
+ """
+
+ sizes = input.shape[2:]
+
+ assert len(sizes) in [2, 3]
+
+ if len(sizes) == 3:
+ # t x y -> x y t to match dimensions T H W in grid_sample
+ coords = coords[..., [1, 2, 0]]
+
+ if align_corners:
+ coords = coords * torch.tensor(
+ [2 / max(size - 1, 1) for size in reversed(sizes)], device=coords.device
+ )
+ else:
+ coords = coords * torch.tensor([2 / size for size in reversed(sizes)], device=coords.device)
+
+ coords -= 1
+
+ return F.grid_sample(input, coords, align_corners=align_corners, padding_mode=padding_mode)
+
+
+def sample_features4d(input, coords):
+ r"""Sample spatial features
+
+ `sample_features4d(input, coords)` samples the spatial features
+ :attr:`input` represented by a 4D tensor :math:`(B, C, H, W)`.
+
+ The field is sampled at coordinates :attr:`coords` using bilinear
+    interpolation. :attr:`coords` is assumed to be of shape :math:`(B, R,
+    2)`, where each sample has the format :math:`(x_i, y_i)`. This uses the
+ same convention as :func:`bilinear_sampler` with `align_corners=True`.
+
+ The output tensor has one feature per point, and has shape :math:`(B,
+ R, C)`.
+
+ Args:
+ input (Tensor): spatial features.
+ coords (Tensor): points.
+
+ Returns:
+ Tensor: sampled features.
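+
+    Example (illustrative)::
+
+        >>> feats = sample_features4d(torch.randn(1, 128, 96, 128), torch.rand(1, 10, 2) * 90)
+        >>> feats.shape
+        torch.Size([1, 10, 128])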
+ """
+
+ B, _, _, _ = input.shape
+
+ # B R 2 -> B R 1 2
+ coords = coords.unsqueeze(2)
+
+ # B C R 1
+ feats = bilinear_sampler(input, coords)
+
+ return feats.permute(0, 2, 1, 3).view(
+ B, -1, feats.shape[1] * feats.shape[3]
+ ) # B C R 1 -> B R C
+
+
+def sample_features5d(input, coords):
+ r"""Sample spatio-temporal features
+
+ `sample_features5d(input, coords)` works in the same way as
+ :func:`sample_features4d` but for spatio-temporal features and points:
+ :attr:`input` is a 5D tensor :math:`(B, T, C, H, W)`, :attr:`coords` is
+    a :math:`(B, R1, R2, 3)` tensor of spatio-temporal points :math:`(t_i,
+ x_i, y_i)`. The output tensor has shape :math:`(B, R1, R2, C)`.
+
+ Args:
+ input (Tensor): spatio-temporal features.
+ coords (Tensor): spatio-temporal points.
+
+ Returns:
+ Tensor: sampled features.
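+
+    Example (illustrative)::
+
+        >>> coords = torch.rand(1, 4, 16, 3) * 7  # (t, x, y) triplets
+        >>> sample_features5d(torch.randn(1, 8, 128, 96, 128), coords).shape
+        torch.Size([1, 4, 16, 128])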
+ """
+
+ B, T, _, _, _ = input.shape
+
+ # B T C H W -> B C T H W
+ input = input.permute(0, 2, 1, 3, 4)
+
+ # B R1 R2 3 -> B R1 R2 1 3
+ coords = coords.unsqueeze(3)
+
+ # B C R1 R2 1
+ feats = bilinear_sampler(input, coords)
+
+ return feats.permute(0, 2, 3, 1, 4).view(
+ B, feats.shape[2], feats.shape[3], feats.shape[1]
+ ) # B C R1 R2 1 -> B R1 R2 C
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/evaluation_predictor.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/evaluation_predictor.py
new file mode 100644
index 0000000000000000000000000000000000000000..752dc013206c3ca220ccd86e2594a4ba652cb121
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/models/evaluation_predictor.py
@@ -0,0 +1,104 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn.functional as F
+from typing import Tuple
+
+from dot.models.shelf.cotracker2_utils.models.core.cotracker.cotracker import CoTracker2
+from dot.models.shelf.cotracker2_utils.models.core.model_utils import get_points_on_a_grid
+
+
+class EvaluationPredictor(torch.nn.Module):
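+    """Evaluation wrapper around CoTracker2.
+
+    Resizes the input video to `interp_shape`, optionally processes one query point
+    at a time with a local support grid when `single_point=True`, and rescales the
+    predicted tracks back to the original resolution.
+    """
+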
+ def __init__(
+ self,
+ cotracker_model: CoTracker2,
+ interp_shape: Tuple[int, int] = (384, 512),
+ grid_size: int = 5,
+ local_grid_size: int = 8,
+ single_point: bool = True,
+ n_iters: int = 6,
+ ) -> None:
+ super(EvaluationPredictor, self).__init__()
+ self.grid_size = grid_size
+ self.local_grid_size = local_grid_size
+ self.single_point = single_point
+ self.interp_shape = interp_shape
+ self.n_iters = n_iters
+
+ self.model = cotracker_model
+ self.model.eval()
+
+ def forward(self, video, queries):
+ queries = queries.clone()
+ B, T, C, H, W = video.shape
+ B, N, D = queries.shape
+
+ assert D == 3
+
+ video = video.reshape(B * T, C, H, W)
+ video = F.interpolate(video, tuple(self.interp_shape), mode="bilinear", align_corners=True)
+ video = video.reshape(B, T, 3, self.interp_shape[0], self.interp_shape[1])
+
+ device = video.device
+
+ queries[:, :, 1] *= (self.interp_shape[1] - 1) / (W - 1)
+ queries[:, :, 2] *= (self.interp_shape[0] - 1) / (H - 1)
+
+ if self.single_point:
+ traj_e = torch.zeros((B, T, N, 2), device=device)
+ vis_e = torch.zeros((B, T, N), device=device)
+ for pind in range((N)):
+ query = queries[:, pind : pind + 1]
+
+ t = query[0, 0, 0].long()
+
+ traj_e_pind, vis_e_pind = self._process_one_point(video, query)
+ traj_e[:, t:, pind : pind + 1] = traj_e_pind[:, :, :1]
+ vis_e[:, t:, pind : pind + 1] = vis_e_pind[:, :, :1]
+ else:
+ if self.grid_size > 0:
+ xy = get_points_on_a_grid(self.grid_size, video.shape[3:])
+ xy = torch.cat([torch.zeros_like(xy[:, :, :1]), xy], dim=2).to(device) #
+ queries = torch.cat([queries, xy], dim=1) #
+
+ traj_e, vis_e, __ = self.model(
+ video=video,
+ queries=queries,
+ iters=self.n_iters,
+ )
+
+ traj_e[:, :, :, 0] *= (W - 1) / float(self.interp_shape[1] - 1)
+ traj_e[:, :, :, 1] *= (H - 1) / float(self.interp_shape[0] - 1)
+ return traj_e, vis_e
+
+ def _process_one_point(self, video, query):
+ t = query[0, 0, 0].long()
+
+ device = query.device
+ if self.local_grid_size > 0:
+ xy_target = get_points_on_a_grid(
+ self.local_grid_size,
+ (50, 50),
+ [query[0, 0, 2].item(), query[0, 0, 1].item()],
+ )
+
+ xy_target = torch.cat([torch.zeros_like(xy_target[:, :, :1]), xy_target], dim=2).to(
+ device
+ ) #
+ query = torch.cat([query, xy_target], dim=1) #
+
+ if self.grid_size > 0:
+ xy = get_points_on_a_grid(self.grid_size, video.shape[3:])
+ xy = torch.cat([torch.zeros_like(xy[:, :, :1]), xy], dim=2).to(device) #
+ query = torch.cat([query, xy], dim=1) #
+ # crop the video to start from the queried frame
+ query[0, 0, 0] = 0
+ traj_e_pind, vis_e_pind, __ = self.model(
+ video=video[:, t:], queries=query, iters=self.n_iters
+ )
+
+ return traj_e_pind, vis_e_pind
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/predictor.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/predictor.py
new file mode 100644
index 0000000000000000000000000000000000000000..a24c5e52bd8bb25ff662fe6b29a936759b958b5a
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/predictor.py
@@ -0,0 +1,284 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn.functional as F
+
+from .models.core.model_utils import smart_cat, get_points_on_a_grid
+from .models.build_cotracker import build_cotracker
+
+
+class CoTrackerPredictor(torch.nn.Module):
+ def __init__(self, patch_size, wind_size):
+ super().__init__()
+ self.support_grid_size = 6
+ model = build_cotracker(patch_size, wind_size)
+ self.interp_shape = model.model_resolution
+ self.model = model
+ self.model.eval()
+ self.cached_feat = None
+
+ @torch.no_grad()
+ def forward(
+ self,
+ video, # (1, T, 3, H, W)
+ # input prompt types:
+        # - None. Dense tracks are computed in this case. You can adjust *grid_query_frame* to compute tracks starting from a specific frame.
+ # *backward_tracking=True* will compute tracks in both directions.
+ # - queries. Queried points of shape (1, N, 3) in format (t, x, y) for frame index and pixel coordinates.
+ # - grid_size. Grid of N*N points from the first frame. if segm_mask is provided, then computed only for the mask.
+        #   You can adjust *grid_query_frame* and *backward_tracking* for the regular grid in the same way as for dense tracks.
+ queries: torch.Tensor = None,
+ segm_mask: torch.Tensor = None, # Segmentation mask of shape (B, 1, H, W)
+ grid_size: int = 0,
+ grid_query_frame: int = 0, # only for dense and regular grid tracks
+ backward_tracking: bool = False,
+ cache_features: bool = False,
+ ):
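+        # Illustrative call patterns (sketch):
+        #   tracks, vis = predictor(video)                   # dense tracks over the full frame
+        #   tracks, vis = predictor(video, grid_size=20)     # regular 20x20 grid from the first frame
+        #   tracks, vis = predictor(video, queries=queries)  # explicit (1, N, 3) queries in (t, x, y)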
+ if queries is None and grid_size == 0:
+ tracks, visibilities = self._compute_dense_tracks(
+ video,
+ grid_query_frame=grid_query_frame,
+ backward_tracking=backward_tracking,
+ )
+ else:
+ tracks, visibilities = self._compute_sparse_tracks(
+ video,
+ queries,
+ segm_mask,
+ grid_size,
+ add_support_grid=(grid_size == 0 or segm_mask is not None),
+ grid_query_frame=grid_query_frame,
+ backward_tracking=backward_tracking,
+ cache_features=cache_features
+ )
+
+ return tracks, visibilities
+
+ def _compute_dense_tracks(self, video, grid_query_frame, grid_size=80, backward_tracking=False):
+ *_, H, W = video.shape
+ grid_step = W // grid_size
+ grid_width = W // grid_step
+ grid_height = H // grid_step
+ tracks = visibilities = None
+ grid_pts = torch.zeros((1, grid_width * grid_height, 3)).to(video.device)
+ grid_pts[0, :, 0] = grid_query_frame
+ for offset in range(grid_step * grid_step):
+ print(f"step {offset} / {grid_step * grid_step}")
+ ox = offset % grid_step
+ oy = offset // grid_step
+ grid_pts[0, :, 1] = torch.arange(grid_width).repeat(grid_height) * grid_step + ox
+ grid_pts[0, :, 2] = (
+ torch.arange(grid_height).repeat_interleave(grid_width) * grid_step + oy
+ )
+ tracks_step, visibilities_step = self._compute_sparse_tracks(
+ video=video,
+ queries=grid_pts,
+ backward_tracking=backward_tracking,
+ )
+ tracks = smart_cat(tracks, tracks_step, dim=2)
+ visibilities = smart_cat(visibilities, visibilities_step, dim=2)
+
+ return tracks, visibilities
+
+ def _compute_sparse_tracks(
+ self,
+ video,
+ queries,
+ segm_mask=None,
+ grid_size=0,
+ add_support_grid=False,
+ grid_query_frame=0,
+ backward_tracking=False,
+ cache_features=False,
+ ):
+ B, T, C, H, W = video.shape
+
+ video = video.reshape(B * T, C, H, W)
+ video = F.interpolate(video, tuple(self.interp_shape), mode="bilinear", align_corners=True)
+ video = video.reshape(B, T, 3, self.interp_shape[0], self.interp_shape[1])
+
+ if cache_features:
+ h, w = self.interp_shape[0], self.interp_shape[1]
+ video_ = video.reshape(B * T, C, h, w)
+ video_ = 2 * video_ - 1.0
+ fmaps_ = self.model.fnet(video_)
+ fmaps_ = fmaps_.reshape(B, T, self.model.latent_dim, h // self.model.stride, w // self.model.stride)
+ self.cached_feat = fmaps_
+
+ if queries is not None:
+ B, N, D = queries.shape
+ assert D == 3
+ queries = queries.clone()
+ queries[:, :, 1:] *= queries.new_tensor(
+ [
+ (self.interp_shape[1] - 1) / (W - 1),
+ (self.interp_shape[0] - 1) / (H - 1),
+ ]
+ )
+ elif grid_size > 0:
+ grid_pts = get_points_on_a_grid(grid_size, self.interp_shape, device=video.device)
+ if segm_mask is not None:
+ segm_mask = F.interpolate(segm_mask, tuple(self.interp_shape), mode="nearest")
+ point_mask = segm_mask[0, 0][
+ (grid_pts[0, :, 1]).round().long().cpu(),
+ (grid_pts[0, :, 0]).round().long().cpu(),
+ ].bool()
+ grid_pts = grid_pts[:, point_mask]
+
+ queries = torch.cat(
+ [torch.ones_like(grid_pts[:, :, :1]) * grid_query_frame, grid_pts],
+ dim=2,
+ )
+
+ if add_support_grid:
+ grid_pts = get_points_on_a_grid(
+ self.support_grid_size, self.interp_shape, device=video.device
+ )
+ grid_pts = torch.cat([torch.zeros_like(grid_pts[:, :, :1]), grid_pts], dim=2)
+ grid_pts = grid_pts.repeat(B, 1, 1)
+ queries = torch.cat([queries, grid_pts], dim=1)
+
+ tracks, visibilities, __ = self.model.forward(
+ video=video,
+ queries=queries,
+ iters=6,
+ cached_feat=self.cached_feat
+ )
+
+ if backward_tracking:
+ tracks, visibilities = self._compute_backward_tracks(
+ video, queries, tracks, visibilities
+ )
+ if add_support_grid:
+ queries[:, -self.support_grid_size**2 :, 0] = T - 1
+ if add_support_grid:
+ tracks = tracks[:, :, : -self.support_grid_size**2]
+ visibilities = visibilities[:, :, : -self.support_grid_size**2]
+ thr = 0.9
+ visibilities = visibilities > thr
+
+ # correct query-point predictions
+ # see https://github.com/facebookresearch/co-tracker/issues/28
+
+ # TODO: batchify
+ for i in range(len(queries)):
+ queries_t = queries[i, : tracks.size(2), 0].to(torch.int64)
+ arange = torch.arange(0, len(queries_t))
+
+ # overwrite the predictions with the query points
+ tracks[i, queries_t, arange] = queries[i, : tracks.size(2), 1:]
+
+ # correct visibilities, the query points should be visible
+ visibilities[i, queries_t, arange] = True
+
+ tracks *= tracks.new_tensor(
+ [(W - 1) / (self.interp_shape[1] - 1), (H - 1) / (self.interp_shape[0] - 1)]
+ )
+ return tracks, visibilities
+
+ def _compute_backward_tracks(self, video, queries, tracks, visibilities):
+ inv_video = video.flip(1).clone()
+ inv_queries = queries.clone()
+ inv_queries[:, :, 0] = inv_video.shape[1] - inv_queries[:, :, 0] - 1
+
+ if self.cached_feat is not None:
+ inv_feat = self.cached_feat.flip(1)
+ else:
+ inv_feat = None
+
+ inv_tracks, inv_visibilities, __ = self.model(
+ video=inv_video,
+ queries=inv_queries,
+ iters=6,
+ cached_feat=inv_feat
+ )
+
+ inv_tracks = inv_tracks.flip(1)
+ inv_visibilities = inv_visibilities.flip(1)
+ arange = torch.arange(video.shape[1], device=queries.device)[None, :, None]
+
+ mask = (arange < queries[:, None, :, 0]).unsqueeze(-1).repeat(1, 1, 1, 2)
+
+ tracks[mask] = inv_tracks[mask]
+ visibilities[mask[:, :, :, 0]] = inv_visibilities[mask[:, :, :, 0]]
+ return tracks, visibilities
+
+
+class CoTrackerOnlinePredictor(torch.nn.Module):
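+    """Sliding-window (online) wrapper around CoTracker2.
+
+    Illustrative usage (sketch): call once with ``is_first_step=True`` to register
+    the queries (or a query grid), then feed the subsequent video chunks through
+    the same call; the underlying model advances its window by ``window_len // 2``
+    frames per step.
+    """
+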
+ def __init__(self, checkpoint="./checkpoints/cotracker2.pth"):
+ super().__init__()
+ self.support_grid_size = 6
+ model = build_cotracker(checkpoint)
+ self.interp_shape = model.model_resolution
+ self.step = model.window_len // 2
+ self.model = model
+ self.model.eval()
+
+ @torch.no_grad()
+ def forward(
+ self,
+ video_chunk,
+ is_first_step: bool = False,
+ queries: torch.Tensor = None,
+ grid_size: int = 10,
+ grid_query_frame: int = 0,
+ add_support_grid=False,
+ ):
+        B, T, C, H, W = video_chunk.shape
+        # Initialize online video processing and save queried points
+        # This needs to be done before processing *each new video*
+        if is_first_step:
+ self.model.init_video_online_processing()
+ if queries is not None:
+ B, N, D = queries.shape
+ assert D == 3
+ queries = queries.clone()
+ queries[:, :, 1:] *= queries.new_tensor(
+ [
+ (self.interp_shape[1] - 1) / (W - 1),
+ (self.interp_shape[0] - 1) / (H - 1),
+ ]
+ )
+ elif grid_size > 0:
+ grid_pts = get_points_on_a_grid(
+ grid_size, self.interp_shape, device=video_chunk.device
+ )
+ queries = torch.cat(
+ [torch.ones_like(grid_pts[:, :, :1]) * grid_query_frame, grid_pts],
+ dim=2,
+ )
+ if add_support_grid:
+ grid_pts = get_points_on_a_grid(
+ self.support_grid_size, self.interp_shape, device=video_chunk.device
+ )
+ grid_pts = torch.cat([torch.zeros_like(grid_pts[:, :, :1]), grid_pts], dim=2)
+ queries = torch.cat([queries, grid_pts], dim=1)
+ self.queries = queries
+ return (None, None)
+ video_chunk = video_chunk.reshape(B * T, C, H, W)
+ video_chunk = F.interpolate(
+ video_chunk, tuple(self.interp_shape), mode="bilinear", align_corners=True
+ )
+ video_chunk = video_chunk.reshape(B, T, 3, self.interp_shape[0], self.interp_shape[1])
+
+ tracks, visibilities, __ = self.model(
+ video=video_chunk,
+ queries=self.queries,
+ iters=6,
+ is_online=True,
+ )
+ thr = 0.9
+ return (
+ tracks
+ * tracks.new_tensor(
+ [
+ (W - 1) / (self.interp_shape[1] - 1),
+ (H - 1) / (self.interp_shape[0] - 1),
+ ]
+ ),
+ visibilities > thr,
+ )
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/utils/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/utils/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker2_utils/utils/visualizer.py b/data/dot_single_video/dot/models/shelf/cotracker2_utils/utils/visualizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..22ba43afe038e0829b9e2fd17cc670d0231c510c
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker2_utils/utils/visualizer.py
@@ -0,0 +1,343 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+import os
+import numpy as np
+import imageio
+import torch
+
+from matplotlib import cm
+import torch.nn.functional as F
+import torchvision.transforms as transforms
+import matplotlib.pyplot as plt
+from PIL import Image, ImageDraw
+
+
+def read_video_from_path(path):
+ try:
+ reader = imageio.get_reader(path)
+ except Exception as e:
+ print("Error opening video file: ", e)
+ return None
+ frames = []
+ for i, im in enumerate(reader):
+ frames.append(np.array(im))
+ return np.stack(frames)
+
+
+def draw_circle(rgb, coord, radius, color=(255, 0, 0), visible=True):
+ # Create a draw object
+ draw = ImageDraw.Draw(rgb)
+ # Calculate the bounding box of the circle
+ left_up_point = (coord[0] - radius, coord[1] - radius)
+ right_down_point = (coord[0] + radius, coord[1] + radius)
+ # Draw the circle
+ draw.ellipse(
+ [left_up_point, right_down_point],
+ fill=tuple(color) if visible else None,
+ outline=tuple(color),
+ )
+ return rgb
+
+
+def draw_line(rgb, coord_y, coord_x, color, linewidth):
+ draw = ImageDraw.Draw(rgb)
+ draw.line(
+ (coord_y[0], coord_y[1], coord_x[0], coord_x[1]),
+ fill=tuple(color),
+ width=linewidth,
+ )
+ return rgb
+
+
+def add_weighted(rgb, alpha, original, beta, gamma):
+ return (rgb * alpha + original * beta + gamma).astype("uint8")
+
+
+class Visualizer:
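+    """Draws point tracks on top of a video.
+
+    Illustrative usage (sketch)::
+
+        vis = Visualizer(save_dir="./results", fps=10)
+        vis.visualize(video, tracks, visibility, filename="demo")
+
+    The result is written to ``<save_dir>/<filename>.mp4`` or, if a TensorBoard
+    writer is passed, added to it via ``add_video``.
+    """
+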
+ def __init__(
+ self,
+ save_dir: str = "./results",
+ grayscale: bool = False,
+ pad_value: int = 0,
+ fps: int = 10,
+ mode: str = "rainbow", # 'cool', 'optical_flow'
+ linewidth: int = 2,
+ show_first_frame: int = 10,
+ tracks_leave_trace: int = 0, # -1 for infinite
+ ):
+ self.mode = mode
+ self.save_dir = save_dir
+ if mode == "rainbow":
+ self.color_map = cm.get_cmap("gist_rainbow")
+ elif mode == "cool":
+ self.color_map = cm.get_cmap(mode)
+ self.show_first_frame = show_first_frame
+ self.grayscale = grayscale
+ self.tracks_leave_trace = tracks_leave_trace
+ self.pad_value = pad_value
+ self.linewidth = linewidth
+ self.fps = fps
+
+ def visualize(
+ self,
+ video: torch.Tensor, # (B,T,C,H,W)
+ tracks: torch.Tensor, # (B,T,N,2)
+ visibility: torch.Tensor = None, # (B, T, N, 1) bool
+ gt_tracks: torch.Tensor = None, # (B,T,N,2)
+ segm_mask: torch.Tensor = None, # (B,1,H,W)
+ filename: str = "video",
+ writer=None, # tensorboard Summary Writer, used for visualization during training
+ step: int = 0,
+ query_frame: int = 0,
+ save_video: bool = True,
+ compensate_for_camera_motion: bool = False,
+ ):
+ if compensate_for_camera_motion:
+ assert segm_mask is not None
+ if segm_mask is not None:
+ coords = tracks[0, query_frame].round().long()
+ segm_mask = segm_mask[0, query_frame][coords[:, 1], coords[:, 0]].long()
+
+ video = F.pad(
+ video,
+ (self.pad_value, self.pad_value, self.pad_value, self.pad_value),
+ "constant",
+ 255,
+ )
+ tracks = tracks + self.pad_value
+
+ if self.grayscale:
+ transform = transforms.Grayscale()
+ video = transform(video)
+ video = video.repeat(1, 1, 3, 1, 1)
+
+ res_video = self.draw_tracks_on_video(
+ video=video,
+ tracks=tracks,
+ visibility=visibility,
+ segm_mask=segm_mask,
+ gt_tracks=gt_tracks,
+ query_frame=query_frame,
+ compensate_for_camera_motion=compensate_for_camera_motion,
+ )
+ if save_video:
+ self.save_video(res_video, filename=filename, writer=writer, step=step)
+ return res_video
+
+ def save_video(self, video, filename, writer=None, step=0):
+ if writer is not None:
+ writer.add_video(
+ filename,
+ video.to(torch.uint8),
+ global_step=step,
+ fps=self.fps,
+ )
+ else:
+ os.makedirs(self.save_dir, exist_ok=True)
+ wide_list = list(video.unbind(1))
+ wide_list = [wide[0].permute(1, 2, 0).cpu().numpy() for wide in wide_list]
+
+ # Prepare the video file path
+ save_path = os.path.join(self.save_dir, f"{filename}.mp4")
+
+ # Create a writer object
+ video_writer = imageio.get_writer(save_path, fps=self.fps)
+
+ # Write frames to the video file
+ for frame in wide_list[2:-1]:
+ video_writer.append_data(frame)
+
+ video_writer.close()
+
+ print(f"Video saved to {save_path}")
+
+ def draw_tracks_on_video(
+ self,
+ video: torch.Tensor,
+ tracks: torch.Tensor,
+ visibility: torch.Tensor = None,
+ segm_mask: torch.Tensor = None,
+ gt_tracks=None,
+ query_frame: int = 0,
+ compensate_for_camera_motion=False,
+ ):
+ B, T, C, H, W = video.shape
+ _, _, N, D = tracks.shape
+
+ assert D == 2
+ assert C == 3
+ video = video[0].permute(0, 2, 3, 1).byte().detach().cpu().numpy() # S, H, W, C
+ tracks = tracks[0].long().detach().cpu().numpy() # S, N, 2
+ if gt_tracks is not None:
+ gt_tracks = gt_tracks[0].detach().cpu().numpy()
+
+ res_video = []
+
+ # process input video
+ for rgb in video:
+ res_video.append(rgb.copy())
+ vector_colors = np.zeros((T, N, 3))
+
+ if self.mode == "optical_flow":
+ import flow_vis
+
+ vector_colors = flow_vis.flow_to_color(tracks - tracks[query_frame][None])
+ elif segm_mask is None:
+ if self.mode == "rainbow":
+ y_min, y_max = (
+ tracks[query_frame, :, 1].min(),
+ tracks[query_frame, :, 1].max(),
+ )
+ norm = plt.Normalize(y_min, y_max)
+ for n in range(N):
+ color = self.color_map(norm(tracks[query_frame, n, 1]))
+ color = np.array(color[:3])[None] * 255
+ vector_colors[:, n] = np.repeat(color, T, axis=0)
+ else:
+ # color changes with time
+ for t in range(T):
+ color = np.array(self.color_map(t / T)[:3])[None] * 255
+ vector_colors[t] = np.repeat(color, N, axis=0)
+ else:
+ if self.mode == "rainbow":
+ vector_colors[:, segm_mask <= 0, :] = 255
+
+ y_min, y_max = (
+ tracks[0, segm_mask > 0, 1].min(),
+ tracks[0, segm_mask > 0, 1].max(),
+ )
+ norm = plt.Normalize(y_min, y_max)
+ for n in range(N):
+ if segm_mask[n] > 0:
+ color = self.color_map(norm(tracks[0, n, 1]))
+ color = np.array(color[:3])[None] * 255
+ vector_colors[:, n] = np.repeat(color, T, axis=0)
+
+ else:
+ # color changes with segm class
+ segm_mask = segm_mask.cpu()
+ color = np.zeros((segm_mask.shape[0], 3), dtype=np.float32)
+ color[segm_mask > 0] = np.array(self.color_map(1.0)[:3]) * 255.0
+ color[segm_mask <= 0] = np.array(self.color_map(0.0)[:3]) * 255.0
+ vector_colors = np.repeat(color[None], T, axis=0)
+
+ # draw tracks
+ if self.tracks_leave_trace != 0:
+ for t in range(query_frame + 1, T):
+ first_ind = (
+ max(0, t - self.tracks_leave_trace) if self.tracks_leave_trace >= 0 else 0
+ )
+ curr_tracks = tracks[first_ind : t + 1]
+ curr_colors = vector_colors[first_ind : t + 1]
+ if compensate_for_camera_motion:
+ diff = (
+ tracks[first_ind : t + 1, segm_mask <= 0]
+ - tracks[t : t + 1, segm_mask <= 0]
+ ).mean(1)[:, None]
+
+ curr_tracks = curr_tracks - diff
+ curr_tracks = curr_tracks[:, segm_mask > 0]
+ curr_colors = curr_colors[:, segm_mask > 0]
+
+ res_video[t] = self._draw_pred_tracks(
+ res_video[t],
+ curr_tracks,
+ curr_colors,
+ )
+ if gt_tracks is not None:
+ res_video[t] = self._draw_gt_tracks(res_video[t], gt_tracks[first_ind : t + 1])
+
+ # draw points
+ for t in range(query_frame, T):
+ img = Image.fromarray(np.uint8(res_video[t]))
+ for i in range(N):
+ coord = (tracks[t, i, 0], tracks[t, i, 1])
+                visible = True
+                if visibility is not None:
+                    visible = visibility[0, t, i]
+ if coord[0] != 0 and coord[1] != 0:
+ if not compensate_for_camera_motion or (
+ compensate_for_camera_motion and segm_mask[i] > 0
+ ):
+ img = draw_circle(
+ img,
+ coord=coord,
+ radius=int(self.linewidth * 2),
+ color=vector_colors[t, i].astype(int),
+                            visible=visible,
+ )
+ res_video[t] = np.array(img)
+
+ # construct the final rgb sequence
+ if self.show_first_frame > 0:
+ res_video = [res_video[0]] * self.show_first_frame + res_video[1:]
+ return torch.from_numpy(np.stack(res_video)).permute(0, 3, 1, 2)[None].byte()
+
+ def _draw_pred_tracks(
+ self,
+ rgb: np.ndarray, # H x W x 3
+        tracks: np.ndarray,  # T x N x 2
+ vector_colors: np.ndarray,
+ alpha: float = 0.5,
+ ):
+ T, N, _ = tracks.shape
+ rgb = Image.fromarray(np.uint8(rgb))
+ for s in range(T - 1):
+ vector_color = vector_colors[s]
+ original = rgb.copy()
+ alpha = (s / T) ** 2
+ for i in range(N):
+ coord_y = (int(tracks[s, i, 0]), int(tracks[s, i, 1]))
+ coord_x = (int(tracks[s + 1, i, 0]), int(tracks[s + 1, i, 1]))
+ if coord_y[0] != 0 and coord_y[1] != 0:
+ rgb = draw_line(
+ rgb,
+ coord_y,
+ coord_x,
+ vector_color[i].astype(int),
+ self.linewidth,
+ )
+ if self.tracks_leave_trace > 0:
+ rgb = Image.fromarray(
+ np.uint8(add_weighted(np.array(rgb), alpha, np.array(original), 1 - alpha, 0))
+ )
+ rgb = np.array(rgb)
+ return rgb
+
+ def _draw_gt_tracks(
+ self,
+ rgb: np.ndarray, # H x W x 3,
+        gt_tracks: np.ndarray,  # T x N x 2
+ ):
+ T, N, _ = gt_tracks.shape
+ color = np.array((211, 0, 0))
+ rgb = Image.fromarray(np.uint8(rgb))
+ for t in range(T):
+ for i in range(N):
+                gt_track = gt_tracks[t][i]
+                # draw a red cross
+                if gt_track[0] > 0 and gt_track[1] > 0:
+                    length = self.linewidth * 3
+                    coord_y = (int(gt_track[0]) + length, int(gt_track[1]) + length)
+                    coord_x = (int(gt_track[0]) - length, int(gt_track[1]) - length)
+                    rgb = draw_line(
+                        rgb,
+                        coord_y,
+                        coord_x,
+                        color,
+                        self.linewidth,
+                    )
+                    coord_y = (int(gt_track[0]) - length, int(gt_track[1]) + length)
+                    coord_x = (int(gt_track[0]) + length, int(gt_track[1]) - length)
+ rgb = draw_line(
+ rgb,
+ coord_y,
+ coord_x,
+ color,
+ self.linewidth,
+ )
+ rgb = np.array(rgb)
+ return rgb
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/LICENSE.md b/data/dot_single_video/dot/models/shelf/cotracker_utils/LICENSE.md
new file mode 100644
index 0000000000000000000000000000000000000000..ba959871dca0f9b6775570410879e637de44d7b4
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/LICENSE.md
@@ -0,0 +1,399 @@
+Attribution-NonCommercial 4.0 International
+
+=======================================================================
+
+Creative Commons Corporation ("Creative Commons") is not a law firm and
+does not provide legal services or legal advice. Distribution of
+Creative Commons public licenses does not create a lawyer-client or
+other relationship. Creative Commons makes its licenses and related
+information available on an "as-is" basis. Creative Commons gives no
+warranties regarding its licenses, any material licensed under their
+terms and conditions, or any related information. Creative Commons
+disclaims all liability for damages resulting from their use to the
+fullest extent possible.
+
+Using Creative Commons Public Licenses
+
+Creative Commons public licenses provide a standard set of terms and
+conditions that creators and other rights holders may use to share
+original works of authorship and other material subject to copyright
+and certain other rights specified in the public license below. The
+following considerations are for informational purposes only, are not
+exhaustive, and do not form part of our licenses.
+
+ Considerations for licensors: Our public licenses are
+ intended for use by those authorized to give the public
+ permission to use material in ways otherwise restricted by
+ copyright and certain other rights. Our licenses are
+ irrevocable. Licensors should read and understand the terms
+ and conditions of the license they choose before applying it.
+ Licensors should also secure all rights necessary before
+ applying our licenses so that the public can reuse the
+ material as expected. Licensors should clearly mark any
+ material not subject to the license. This includes other CC-
+ licensed material, or material used under an exception or
+ limitation to copyright. More considerations for licensors:
+ wiki.creativecommons.org/Considerations_for_licensors
+
+ Considerations for the public: By using one of our public
+ licenses, a licensor grants the public permission to use the
+ licensed material under specified terms and conditions. If
+ the licensor's permission is not necessary for any reason--for
+ example, because of any applicable exception or limitation to
+ copyright--then that use is not regulated by the license. Our
+ licenses grant only permissions under copyright and certain
+ other rights that a licensor has authority to grant. Use of
+ the licensed material may still be restricted for other
+ reasons, including because others have copyright or other
+ rights in the material. A licensor may make special requests,
+ such as asking that all changes be marked or described.
+ Although not required by our licenses, you are encouraged to
+ respect those requests where reasonable. More_considerations
+ for the public:
+ wiki.creativecommons.org/Considerations_for_licensees
+
+=======================================================================
+
+Creative Commons Attribution-NonCommercial 4.0 International Public
+License
+
+By exercising the Licensed Rights (defined below), You accept and agree
+to be bound by the terms and conditions of this Creative Commons
+Attribution-NonCommercial 4.0 International Public License ("Public
+License"). To the extent this Public License may be interpreted as a
+contract, You are granted the Licensed Rights in consideration of Your
+acceptance of these terms and conditions, and the Licensor grants You
+such rights in consideration of benefits the Licensor receives from
+making the Licensed Material available under these terms and
+conditions.
+
+Section 1 -- Definitions.
+
+ a. Adapted Material means material subject to Copyright and Similar
+ Rights that is derived from or based upon the Licensed Material
+ and in which the Licensed Material is translated, altered,
+ arranged, transformed, or otherwise modified in a manner requiring
+ permission under the Copyright and Similar Rights held by the
+ Licensor. For purposes of this Public License, where the Licensed
+ Material is a musical work, performance, or sound recording,
+ Adapted Material is always produced where the Licensed Material is
+ synched in timed relation with a moving image.
+
+ b. Adapter's License means the license You apply to Your Copyright
+ and Similar Rights in Your contributions to Adapted Material in
+ accordance with the terms and conditions of this Public License.
+
+ c. Copyright and Similar Rights means copyright and/or similar rights
+ closely related to copyright including, without limitation,
+ performance, broadcast, sound recording, and Sui Generis Database
+ Rights, without regard to how the rights are labeled or
+ categorized. For purposes of this Public License, the rights
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
+ Rights.
+ d. Effective Technological Measures means those measures that, in the
+ absence of proper authority, may not be circumvented under laws
+ fulfilling obligations under Article 11 of the WIPO Copyright
+ Treaty adopted on December 20, 1996, and/or similar international
+ agreements.
+
+ e. Exceptions and Limitations means fair use, fair dealing, and/or
+ any other exception or limitation to Copyright and Similar Rights
+ that applies to Your use of the Licensed Material.
+
+ f. Licensed Material means the artistic or literary work, database,
+ or other material to which the Licensor applied this Public
+ License.
+
+ g. Licensed Rights means the rights granted to You subject to the
+ terms and conditions of this Public License, which are limited to
+ all Copyright and Similar Rights that apply to Your use of the
+ Licensed Material and that the Licensor has authority to license.
+
+ h. Licensor means the individual(s) or entity(ies) granting rights
+ under this Public License.
+
+ i. NonCommercial means not primarily intended for or directed towards
+ commercial advantage or monetary compensation. For purposes of
+ this Public License, the exchange of the Licensed Material for
+ other material subject to Copyright and Similar Rights by digital
+ file-sharing or similar means is NonCommercial provided there is
+ no payment of monetary compensation in connection with the
+ exchange.
+
+ j. Share means to provide material to the public by any means or
+ process that requires permission under the Licensed Rights, such
+ as reproduction, public display, public performance, distribution,
+ dissemination, communication, or importation, and to make material
+ available to the public including in ways that members of the
+ public may access the material from a place and at a time
+ individually chosen by them.
+
+ k. Sui Generis Database Rights means rights other than copyright
+ resulting from Directive 96/9/EC of the European Parliament and of
+ the Council of 11 March 1996 on the legal protection of databases,
+ as amended and/or succeeded, as well as other essentially
+ equivalent rights anywhere in the world.
+
+ l. You means the individual or entity exercising the Licensed Rights
+ under this Public License. Your has a corresponding meaning.
+
+Section 2 -- Scope.
+
+ a. License grant.
+
+ 1. Subject to the terms and conditions of this Public License,
+ the Licensor hereby grants You a worldwide, royalty-free,
+ non-sublicensable, non-exclusive, irrevocable license to
+ exercise the Licensed Rights in the Licensed Material to:
+
+ a. reproduce and Share the Licensed Material, in whole or
+ in part, for NonCommercial purposes only; and
+
+ b. produce, reproduce, and Share Adapted Material for
+ NonCommercial purposes only.
+
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
+ Exceptions and Limitations apply to Your use, this Public
+ License does not apply, and You do not need to comply with
+ its terms and conditions.
+
+ 3. Term. The term of this Public License is specified in Section
+ 6(a).
+
+ 4. Media and formats; technical modifications allowed. The
+ Licensor authorizes You to exercise the Licensed Rights in
+ all media and formats whether now known or hereafter created,
+ and to make technical modifications necessary to do so. The
+ Licensor waives and/or agrees not to assert any right or
+ authority to forbid You from making technical modifications
+ necessary to exercise the Licensed Rights, including
+ technical modifications necessary to circumvent Effective
+ Technological Measures. For purposes of this Public License,
+ simply making modifications authorized by this Section 2(a)
+ (4) never produces Adapted Material.
+
+ 5. Downstream recipients.
+
+ a. Offer from the Licensor -- Licensed Material. Every
+ recipient of the Licensed Material automatically
+ receives an offer from the Licensor to exercise the
+ Licensed Rights under the terms and conditions of this
+ Public License.
+
+ b. No downstream restrictions. You may not offer or impose
+ any additional or different terms or conditions on, or
+ apply any Effective Technological Measures to, the
+ Licensed Material if doing so restricts exercise of the
+ Licensed Rights by any recipient of the Licensed
+ Material.
+
+ 6. No endorsement. Nothing in this Public License constitutes or
+ may be construed as permission to assert or imply that You
+ are, or that Your use of the Licensed Material is, connected
+ with, or sponsored, endorsed, or granted official status by,
+ the Licensor or others designated to receive attribution as
+ provided in Section 3(a)(1)(A)(i).
+
+ b. Other rights.
+
+ 1. Moral rights, such as the right of integrity, are not
+ licensed under this Public License, nor are publicity,
+ privacy, and/or other similar personality rights; however, to
+ the extent possible, the Licensor waives and/or agrees not to
+ assert any such rights held by the Licensor to the limited
+ extent necessary to allow You to exercise the Licensed
+ Rights, but not otherwise.
+
+ 2. Patent and trademark rights are not licensed under this
+ Public License.
+
+ 3. To the extent possible, the Licensor waives any right to
+ collect royalties from You for the exercise of the Licensed
+ Rights, whether directly or through a collecting society
+ under any voluntary or waivable statutory or compulsory
+ licensing scheme. In all other cases the Licensor expressly
+ reserves any right to collect such royalties, including when
+ the Licensed Material is used other than for NonCommercial
+ purposes.
+
+Section 3 -- License Conditions.
+
+Your exercise of the Licensed Rights is expressly made subject to the
+following conditions.
+
+ a. Attribution.
+
+ 1. If You Share the Licensed Material (including in modified
+ form), You must:
+
+ a. retain the following if it is supplied by the Licensor
+ with the Licensed Material:
+
+ i. identification of the creator(s) of the Licensed
+ Material and any others designated to receive
+ attribution, in any reasonable manner requested by
+ the Licensor (including by pseudonym if
+ designated);
+
+ ii. a copyright notice;
+
+ iii. a notice that refers to this Public License;
+
+ iv. a notice that refers to the disclaimer of
+ warranties;
+
+ v. a URI or hyperlink to the Licensed Material to the
+ extent reasonably practicable;
+
+ b. indicate if You modified the Licensed Material and
+ retain an indication of any previous modifications; and
+
+ c. indicate the Licensed Material is licensed under this
+ Public License, and include the text of, or the URI or
+ hyperlink to, this Public License.
+
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
+ reasonable manner based on the medium, means, and context in
+ which You Share the Licensed Material. For example, it may be
+ reasonable to satisfy the conditions by providing a URI or
+ hyperlink to a resource that includes the required
+ information.
+
+ 3. If requested by the Licensor, You must remove any of the
+ information required by Section 3(a)(1)(A) to the extent
+ reasonably practicable.
+
+ 4. If You Share Adapted Material You produce, the Adapter's
+ License You apply must not prevent recipients of the Adapted
+ Material from complying with this Public License.
+
+Section 4 -- Sui Generis Database Rights.
+
+Where the Licensed Rights include Sui Generis Database Rights that
+apply to Your use of the Licensed Material:
+
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
+ to extract, reuse, reproduce, and Share all or a substantial
+ portion of the contents of the database for NonCommercial purposes
+ only;
+
+ b. if You include all or a substantial portion of the database
+ contents in a database in which You have Sui Generis Database
+ Rights, then the database in which You have Sui Generis Database
+ Rights (but not its individual contents) is Adapted Material; and
+
+ c. You must comply with the conditions in Section 3(a) if You Share
+ all or a substantial portion of the contents of the database.
+
+For the avoidance of doubt, this Section 4 supplements and does not
+replace Your obligations under this Public License where the Licensed
+Rights include other Copyright and Similar Rights.
+
+Section 5 -- Disclaimer of Warranties and Limitation of Liability.
+
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
+
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
+
+ c. The disclaimer of warranties and limitation of liability provided
+ above shall be interpreted in a manner that, to the extent
+ possible, most closely approximates an absolute disclaimer and
+ waiver of all liability.
+
+Section 6 -- Term and Termination.
+
+ a. This Public License applies for the term of the Copyright and
+ Similar Rights licensed here. However, if You fail to comply with
+ this Public License, then Your rights under this Public License
+ terminate automatically.
+
+ b. Where Your right to use the Licensed Material has terminated under
+ Section 6(a), it reinstates:
+
+ 1. automatically as of the date the violation is cured, provided
+ it is cured within 30 days of Your discovery of the
+ violation; or
+
+ 2. upon express reinstatement by the Licensor.
+
+ For the avoidance of doubt, this Section 6(b) does not affect any
+ right the Licensor may have to seek remedies for Your violations
+ of this Public License.
+
+ c. For the avoidance of doubt, the Licensor may also offer the
+ Licensed Material under separate terms or conditions or stop
+ distributing the Licensed Material at any time; however, doing so
+ will not terminate this Public License.
+
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
+ License.
+
+Section 7 -- Other Terms and Conditions.
+
+ a. The Licensor shall not be bound by any additional or different
+ terms or conditions communicated by You unless expressly agreed.
+
+ b. Any arrangements, understandings, or agreements regarding the
+ Licensed Material not stated herein are separate from and
+ independent of the terms and conditions of this Public License.
+
+Section 8 -- Interpretation.
+
+ a. For the avoidance of doubt, this Public License does not, and
+ shall not be interpreted to, reduce, limit, restrict, or impose
+ conditions on any use of the Licensed Material that could lawfully
+ be made without permission under this Public License.
+
+ b. To the extent possible, if any provision of this Public License is
+ deemed unenforceable, it shall be automatically reformed to the
+ minimum extent necessary to make it enforceable. If the provision
+ cannot be reformed, it shall be severed from this Public License
+ without affecting the enforceability of the remaining terms and
+ conditions.
+
+ c. No term or condition of this Public License will be waived and no
+ failure to comply consented to unless expressly agreed to by the
+ Licensor.
+
+ d. Nothing in this Public License constitutes or may be interpreted
+ as a limitation upon, or waiver of, any privileges and immunities
+ that apply to the Licensor or You, including from the legal
+ processes of any jurisdiction or authority.
+
+=======================================================================
+
+Creative Commons is not a party to its public
+licenses. Notwithstanding, Creative Commons may elect to apply one of
+its public licenses to material it publishes and in those instances
+will be considered the “Licensor.” The text of the Creative Commons
+public licenses is dedicated to the public domain under the CC0 Public
+Domain Dedication. Except for the limited purpose of indicating that
+material is shared under a Creative Commons public license or as
+otherwise permitted by the Creative Commons policies published at
+creativecommons.org/policies, Creative Commons does not authorize the
+use of the trademark "Creative Commons" or any other trademark or logo
+of Creative Commons without its prior written consent including,
+without limitation, in connection with any unauthorized modifications
+to any of its public licenses or any other arrangements,
+understandings, or agreements concerning use of licensed material. For
+the avoidance of doubt, this paragraph does not form part of the
+public licenses.
+
+Creative Commons may be contacted at creativecommons.org.
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/build_cotracker.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/build_cotracker.py
new file mode 100644
index 0000000000000000000000000000000000000000..530d07d1df9ad602cf46744aaae35ae71455af00
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/build_cotracker.py
@@ -0,0 +1,70 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+
+from dot.models.shelf.cotracker_utils.models.core.cotracker.cotracker import CoTracker
+
+
+def build_cotracker(
+ patch_size: int,
+ wind_size: int,
+):
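+    # Select one of the preset CoTracker configurations: `patch_size` is used as
+    # the feature-map stride of the encoder and `wind_size` as the length (in
+    # frames) of the sliding window processed by the model.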
+ if patch_size == 4 and wind_size == 8:
+ return build_cotracker_stride_4_wind_8()
+ elif patch_size == 4 and wind_size == 12:
+ return build_cotracker_stride_4_wind_12()
+ elif patch_size == 8 and wind_size == 16:
+ return build_cotracker_stride_8_wind_16()
+ else:
+        raise ValueError(f"Unknown model for patch size {patch_size} and window size {wind_size}")
+
+
+# model used to produce the results in the paper
+def build_cotracker_stride_4_wind_8(checkpoint=None):
+ return _build_cotracker(
+ stride=4,
+ sequence_len=8,
+ checkpoint=checkpoint,
+ )
+
+
+def build_cotracker_stride_4_wind_12(checkpoint=None):
+ return _build_cotracker(
+ stride=4,
+ sequence_len=12,
+ checkpoint=checkpoint,
+ )
+
+
+# the fastest model
+def build_cotracker_stride_8_wind_16(checkpoint=None):
+ return _build_cotracker(
+ stride=8,
+ sequence_len=16,
+ checkpoint=checkpoint,
+ )
+
+
+def _build_cotracker(
+ stride,
+ sequence_len,
+ checkpoint=None,
+):
+ cotracker = CoTracker(
+ stride=stride,
+ S=sequence_len,
+ add_space_attn=True,
+ space_depth=6,
+ time_depth=6,
+ )
+ if checkpoint is not None:
+ with open(checkpoint, "rb") as f:
+ state_dict = torch.load(f, map_location="cpu")
+ if "model" in state_dict:
+ state_dict = state_dict["model"]
+ cotracker.load_state_dict(state_dict)
+ return cotracker
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/blocks.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/blocks.py
new file mode 100644
index 0000000000000000000000000000000000000000..8880b679aae33325222339fa1e618fd010ea84c4
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/blocks.py
@@ -0,0 +1,400 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from einops import rearrange
+from timm.models.vision_transformer import Attention, Mlp
+
+
+class ResidualBlock(nn.Module):
+ def __init__(self, in_planes, planes, norm_fn="group", stride=1):
+ super(ResidualBlock, self).__init__()
+
+ self.conv1 = nn.Conv2d(
+ in_planes,
+ planes,
+ kernel_size=3,
+ padding=1,
+ stride=stride,
+ padding_mode="zeros",
+ )
+ self.conv2 = nn.Conv2d(
+ planes, planes, kernel_size=3, padding=1, padding_mode="zeros"
+ )
+ self.relu = nn.ReLU(inplace=True)
+
+ num_groups = planes // 8
+
+ if norm_fn == "group":
+ self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+ self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+ if not stride == 1:
+ self.norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+
+ elif norm_fn == "batch":
+ self.norm1 = nn.BatchNorm2d(planes)
+ self.norm2 = nn.BatchNorm2d(planes)
+ if not stride == 1:
+ self.norm3 = nn.BatchNorm2d(planes)
+
+ elif norm_fn == "instance":
+ self.norm1 = nn.InstanceNorm2d(planes)
+ self.norm2 = nn.InstanceNorm2d(planes)
+ if not stride == 1:
+ self.norm3 = nn.InstanceNorm2d(planes)
+
+ elif norm_fn == "none":
+ self.norm1 = nn.Sequential()
+ self.norm2 = nn.Sequential()
+ if not stride == 1:
+ self.norm3 = nn.Sequential()
+
+ if stride == 1:
+ self.downsample = None
+
+ else:
+ self.downsample = nn.Sequential(
+ nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), self.norm3
+ )
+
+ def forward(self, x):
+ y = x
+ y = self.relu(self.norm1(self.conv1(y)))
+ y = self.relu(self.norm2(self.conv2(y)))
+
+ if self.downsample is not None:
+ x = self.downsample(x)
+
+ return self.relu(x + y)
+
+
+class BasicEncoder(nn.Module):
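+    # Residual CNN feature encoder: stacked ResidualBlock stages are interpolated
+    # to a common resolution of 1/stride, concatenated along channels, and
+    # projected to `output_dim` feature channels.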
+ def __init__(
+ self, input_dim=3, output_dim=128, stride=8, norm_fn="batch", dropout=0.0
+ ):
+ super(BasicEncoder, self).__init__()
+ self.stride = stride
+ self.norm_fn = norm_fn
+ self.in_planes = 64
+
+ if self.norm_fn == "group":
+ self.norm1 = nn.GroupNorm(num_groups=8, num_channels=self.in_planes)
+ self.norm2 = nn.GroupNorm(num_groups=8, num_channels=output_dim * 2)
+
+ elif self.norm_fn == "batch":
+ self.norm1 = nn.BatchNorm2d(self.in_planes)
+ self.norm2 = nn.BatchNorm2d(output_dim * 2)
+
+ elif self.norm_fn == "instance":
+ self.norm1 = nn.InstanceNorm2d(self.in_planes)
+ self.norm2 = nn.InstanceNorm2d(output_dim * 2)
+
+ elif self.norm_fn == "none":
+ self.norm1 = nn.Sequential()
+
+ self.conv1 = nn.Conv2d(
+ input_dim,
+ self.in_planes,
+ kernel_size=7,
+ stride=2,
+ padding=3,
+ padding_mode="zeros",
+ )
+ self.relu1 = nn.ReLU(inplace=True)
+
+ self.shallow = False
+ if self.shallow:
+ self.layer1 = self._make_layer(64, stride=1)
+ self.layer2 = self._make_layer(96, stride=2)
+ self.layer3 = self._make_layer(128, stride=2)
+ self.conv2 = nn.Conv2d(128 + 96 + 64, output_dim, kernel_size=1)
+ else:
+ self.layer1 = self._make_layer(64, stride=1)
+ self.layer2 = self._make_layer(96, stride=2)
+ self.layer3 = self._make_layer(128, stride=2)
+ self.layer4 = self._make_layer(128, stride=2)
+
+ self.conv2 = nn.Conv2d(
+ 128 + 128 + 96 + 64,
+ output_dim * 2,
+ kernel_size=3,
+ padding=1,
+ padding_mode="zeros",
+ )
+ self.relu2 = nn.ReLU(inplace=True)
+ self.conv3 = nn.Conv2d(output_dim * 2, output_dim, kernel_size=1)
+
+ self.dropout = None
+ if dropout > 0:
+ self.dropout = nn.Dropout2d(p=dropout)
+
+ for m in self.modules():
+ if isinstance(m, nn.Conv2d):
+ nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
+ elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
+ if m.weight is not None:
+ nn.init.constant_(m.weight, 1)
+ if m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+
+ def _make_layer(self, dim, stride=1):
+ layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
+ layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
+ layers = (layer1, layer2)
+
+ self.in_planes = dim
+ return nn.Sequential(*layers)
+
+ def forward(self, x):
+ _, _, H, W = x.shape
+
+ x = self.conv1(x)
+ x = self.norm1(x)
+ x = self.relu1(x)
+
+ if self.shallow:
+ a = self.layer1(x)
+ b = self.layer2(a)
+ c = self.layer3(b)
+ a = F.interpolate(
+ a,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+ b = F.interpolate(
+ b,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+ c = F.interpolate(
+ c,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+ x = self.conv2(torch.cat([a, b, c], dim=1))
+ else:
+ a = self.layer1(x)
+ b = self.layer2(a)
+ c = self.layer3(b)
+ d = self.layer4(c)
+ a = F.interpolate(
+ a,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+ b = F.interpolate(
+ b,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+ c = F.interpolate(
+ c,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+ d = F.interpolate(
+ d,
+ (H // self.stride, W // self.stride),
+ mode="bilinear",
+ align_corners=True,
+ )
+ x = self.conv2(torch.cat([a, b, c, d], dim=1))
+ x = self.norm2(x)
+ x = self.relu2(x)
+ x = self.conv3(x)
+
+ if self.training and self.dropout is not None:
+ x = self.dropout(x)
+ return x
+
+
+class AttnBlock(nn.Module):
+ """
+    A pre-norm Transformer block: multi-head self-attention followed by an MLP.
+    (Adapted from the DiT codebase, without adaLN-Zero conditioning.)
+ """
+
+ def __init__(self, hidden_size, num_heads, mlp_ratio=4.0, **block_kwargs):
+ super().__init__()
+ self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+ self.attn = Attention(
+ hidden_size, num_heads=num_heads, qkv_bias=True, **block_kwargs
+ )
+
+ self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+ mlp_hidden_dim = int(hidden_size * mlp_ratio)
+ approx_gelu = lambda: nn.GELU(approximate="tanh")
+ self.mlp = Mlp(
+ in_features=hidden_size,
+ hidden_features=mlp_hidden_dim,
+ act_layer=approx_gelu,
+ drop=0,
+ )
+
+ def forward(self, x):
+ x = x + self.attn(self.norm1(x))
+ x = x + self.mlp(self.norm2(x))
+ return x
+
+
+def bilinear_sampler(img, coords, mode="bilinear", mask=False):
+ """Wrapper for grid_sample, uses pixel coordinates"""
+ H, W = img.shape[-2:]
+ xgrid, ygrid = coords.split([1, 1], dim=-1)
+ # go to 0,1 then 0,2 then -1,1
+ xgrid = 2 * xgrid / (W - 1) - 1
+ ygrid = 2 * ygrid / (H - 1) - 1
+
+ grid = torch.cat([xgrid, ygrid], dim=-1)
+ img = F.grid_sample(img, grid, align_corners=True)
+
+ if mask:
+ mask = (xgrid > -1) & (ygrid > -1) & (xgrid < 1) & (ygrid < 1)
+ return img, mask.float()
+
+ return img
+
+
+class CorrBlock:
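+    # Multi-scale correlation volume: `corr` computes per-level correlation maps
+    # between the current point features and an average-pooled feature pyramid,
+    # and `sample` bilinearly reads out a (2r+1) x (2r+1) neighborhood around each
+    # track position at every pyramid level.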
+ def __init__(self, fmaps, num_levels=4, radius=4):
+ B, S, C, H, W = fmaps.shape
+ self.S, self.C, self.H, self.W = S, C, H, W
+
+ self.num_levels = num_levels
+ self.radius = radius
+ self.fmaps_pyramid = []
+
+ self.fmaps_pyramid.append(fmaps)
+ for i in range(self.num_levels - 1):
+ fmaps_ = fmaps.reshape(B * S, C, H, W)
+ fmaps_ = F.avg_pool2d(fmaps_, 2, stride=2)
+ _, _, H, W = fmaps_.shape
+ fmaps = fmaps_.reshape(B, S, C, H, W)
+ self.fmaps_pyramid.append(fmaps)
+
+ def sample(self, coords):
+ r = self.radius
+ B, S, N, D = coords.shape
+ assert D == 2
+
+ H, W = self.H, self.W
+ out_pyramid = []
+ for i in range(self.num_levels):
+ corrs = self.corrs_pyramid[i] # B, S, N, H, W
+ _, _, _, H, W = corrs.shape
+
+ dx = torch.linspace(-r, r, 2 * r + 1)
+ dy = torch.linspace(-r, r, 2 * r + 1)
+ delta = torch.stack(torch.meshgrid(dy, dx, indexing="ij"), axis=-1).to(
+ coords.device
+ )
+
+ centroid_lvl = coords.reshape(B * S * N, 1, 1, 2) / 2 ** i
+ delta_lvl = delta.view(1, 2 * r + 1, 2 * r + 1, 2)
+ coords_lvl = centroid_lvl + delta_lvl
+
+ corrs = bilinear_sampler(corrs.reshape(B * S * N, 1, H, W), coords_lvl)
+ corrs = corrs.view(B, S, N, -1)
+ out_pyramid.append(corrs)
+
+ out = torch.cat(out_pyramid, dim=-1) # B, S, N, LRR*2
+ return out.contiguous().float()
+
+ def corr(self, targets):
+ B, S, N, C = targets.shape
+ assert C == self.C
+ assert S == self.S
+
+ fmap1 = targets
+
+ self.corrs_pyramid = []
+ for fmaps in self.fmaps_pyramid:
+ _, _, _, H, W = fmaps.shape
+ fmap2s = fmaps.view(B, S, C, H * W)
+ corrs = torch.matmul(fmap1, fmap2s)
+ corrs = corrs.view(B, S, N, H, W)
+ corrs = corrs / torch.sqrt(torch.tensor(C).float())
+ self.corrs_pyramid.append(corrs)
+
+
+class UpdateFormer(nn.Module):
+ """
+ Transformer model that updates track estimates.
+ """
+
+ def __init__(
+ self,
+ space_depth=12,
+ time_depth=12,
+ input_dim=320,
+ hidden_size=384,
+ num_heads=8,
+ output_dim=130,
+ mlp_ratio=4.0,
+ add_space_attn=True,
+ ):
+ super().__init__()
+ self.out_channels = 2
+ self.num_heads = num_heads
+ self.hidden_size = hidden_size
+ self.add_space_attn = add_space_attn
+ self.input_transform = torch.nn.Linear(input_dim, hidden_size, bias=True)
+ self.flow_head = torch.nn.Linear(hidden_size, output_dim, bias=True)
+
+ self.time_blocks = nn.ModuleList(
+ [
+ AttnBlock(hidden_size, num_heads, mlp_ratio=mlp_ratio)
+ for _ in range(time_depth)
+ ]
+ )
+
+ if add_space_attn:
+ self.space_blocks = nn.ModuleList(
+ [
+ AttnBlock(hidden_size, num_heads, mlp_ratio=mlp_ratio)
+ for _ in range(space_depth)
+ ]
+ )
+ assert len(self.time_blocks) >= len(self.space_blocks)
+ self.initialize_weights()
+
+ def initialize_weights(self):
+ def _basic_init(module):
+ if isinstance(module, nn.Linear):
+ torch.nn.init.xavier_uniform_(module.weight)
+ if module.bias is not None:
+ nn.init.constant_(module.bias, 0)
+
+ self.apply(_basic_init)
+
+ def forward(self, input_tensor):
+ x = self.input_transform(input_tensor)
+
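+        # Alternate attention along the time axis (within each track) with
+        # attention across tracks; a space block is applied every
+        # len(time_blocks) // len(space_blocks) time iterations.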
+ j = 0
+ for i in range(len(self.time_blocks)):
+ B, N, T, _ = x.shape
+ x_time = rearrange(x, "b n t c -> (b n) t c", b=B, t=T, n=N)
+ x_time = self.time_blocks[i](x_time)
+
+ x = rearrange(x_time, "(b n) t c -> b n t c ", b=B, t=T, n=N)
+ if self.add_space_attn and (
+ i % (len(self.time_blocks) // len(self.space_blocks)) == 0
+ ):
+ x_space = rearrange(x, "b n t c -> (b t) n c ", b=B, t=T, n=N)
+ x_space = self.space_blocks[j](x_space)
+ x = rearrange(x_space, "(b t) n c -> b n t c ", b=B, t=T, n=N)
+ j += 1
+
+ flow = self.flow_head(x)
+ return flow
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/cotracker.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/cotracker.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6ebea54d619fea0d0638cd7a8cc1d3071278325
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/cotracker.py
@@ -0,0 +1,360 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn as nn
+from einops import rearrange
+
+from dot.models.shelf.cotracker_utils.models.core.cotracker.blocks import (
+ BasicEncoder,
+ CorrBlock,
+ UpdateFormer,
+)
+
+from dot.models.shelf.cotracker_utils.models.core.model_utils import meshgrid2d, bilinear_sample2d, smart_cat
+from dot.models.shelf.cotracker_utils.models.core.embeddings import (
+ get_2d_embedding,
+ get_1d_sincos_pos_embed_from_grid,
+ get_2d_sincos_pos_embed,
+)
+
+
+torch.manual_seed(0)
+
+
+def get_points_on_a_grid(grid_size, interp_shape, grid_center=(0, 0), device="cpu"):
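+    # Return a grid_size x grid_size set of (x, y) query points spread over
+    # `interp_shape`, inset from the image border by one step and optionally
+    # shifted so that the grid is centred on `grid_center`.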
+ if grid_size == 1:
+ return torch.tensor([interp_shape[1] / 2, interp_shape[0] / 2], device=device)[
+ None, None
+ ]
+
+ grid_y, grid_x = meshgrid2d(
+ 1, grid_size, grid_size, stack=False, norm=False, device=device
+ )
+ step = interp_shape[1] // 64
+ if grid_center[0] != 0 or grid_center[1] != 0:
+ grid_y = grid_y - grid_size / 2.0
+ grid_x = grid_x - grid_size / 2.0
+ grid_y = step + grid_y.reshape(1, -1) / float(grid_size - 1) * (
+ interp_shape[0] - step * 2
+ )
+ grid_x = step + grid_x.reshape(1, -1) / float(grid_size - 1) * (
+ interp_shape[1] - step * 2
+ )
+
+ grid_y = grid_y + grid_center[0]
+ grid_x = grid_x + grid_center[1]
+ xy = torch.stack([grid_x, grid_y], dim=-1).to(device)
+ return xy
+
+
+def sample_pos_embed(grid_size, embed_dim, coords):
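+    # Build a 2-D sin/cos positional embedding over the feature grid and
+    # bilinearly sample it at the initial coordinates of each track.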
+ pos_embed = get_2d_sincos_pos_embed(embed_dim=embed_dim, grid_size=grid_size)
+ pos_embed = (
+ torch.from_numpy(pos_embed)
+ .reshape(grid_size[0], grid_size[1], embed_dim)
+ .float()
+ .unsqueeze(0)
+ .to(coords.device)
+ )
+ sampled_pos_embed = bilinear_sample2d(
+ pos_embed.permute(0, 3, 1, 2), coords[:, 0, :, 0], coords[:, 0, :, 1]
+ )
+ return sampled_pos_embed
+
+
+class CoTracker(nn.Module):
+ def __init__(
+ self,
+ S=8,
+ stride=8,
+ add_space_attn=True,
+ num_heads=8,
+ hidden_size=384,
+ space_depth=12,
+ time_depth=12,
+ ):
+ super(CoTracker, self).__init__()
+ self.S = S
+ self.stride = stride
+ self.hidden_dim = 256
+ self.latent_dim = latent_dim = 128
+ self.corr_levels = 4
+ self.corr_radius = 3
+ self.add_space_attn = add_space_attn
+ self.fnet = BasicEncoder(
+ output_dim=self.latent_dim, norm_fn="instance", dropout=0, stride=stride
+ )
+
+ self.updateformer = UpdateFormer(
+ space_depth=space_depth,
+ time_depth=time_depth,
+ input_dim=456,
+ hidden_size=hidden_size,
+ num_heads=num_heads,
+ output_dim=latent_dim + 2,
+ mlp_ratio=4.0,
+ add_space_attn=add_space_attn,
+ )
+
+ self.norm = nn.GroupNorm(1, self.latent_dim)
+ self.ffeat_updater = nn.Sequential(
+ nn.Linear(self.latent_dim, self.latent_dim),
+ nn.GELU(),
+ )
+ self.vis_predictor = nn.Sequential(
+ nn.Linear(self.latent_dim, 1),
+ )
+
+ def forward_iteration(
+ self,
+ fmaps,
+ coords_init,
+ feat_init=None,
+ vis_init=None,
+ track_mask=None,
+ iters=4,
+ ):
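+        # Refine the tracks of one window: at every iteration, sample the
+        # correlation pyramid at the current coordinates, feed flows, correlations
+        # and point features through the transformer, and apply the predicted
+        # coordinate and feature deltas.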
+ B, S_init, N, D = coords_init.shape
+ assert D == 2
+ assert B == 1
+
+ B, S, __, H8, W8 = fmaps.shape
+
+ device = fmaps.device
+
+ if S_init < S:
+ coords = torch.cat(
+ [coords_init, coords_init[:, -1].repeat(1, S - S_init, 1, 1)], dim=1
+ )
+ vis_init = torch.cat(
+ [vis_init, vis_init[:, -1].repeat(1, S - S_init, 1, 1)], dim=1
+ )
+ else:
+ coords = coords_init.clone()
+
+ fcorr_fn = CorrBlock(
+ fmaps, num_levels=self.corr_levels, radius=self.corr_radius
+ )
+
+ ffeats = feat_init.clone()
+
+ times_ = torch.linspace(0, S - 1, S).reshape(1, S, 1)
+
+ pos_embed = sample_pos_embed(
+ grid_size=(H8, W8),
+ embed_dim=456,
+ coords=coords,
+ )
+ pos_embed = rearrange(pos_embed, "b e n -> (b n) e").unsqueeze(1)
+ times_embed = (
+ torch.from_numpy(get_1d_sincos_pos_embed_from_grid(456, times_[0]))[None]
+ .repeat(B, 1, 1)
+ .float()
+ .to(device)
+ )
+ coord_predictions = []
+
+ for __ in range(iters):
+ coords = coords.detach()
+ fcorr_fn.corr(ffeats)
+
+ fcorrs = fcorr_fn.sample(coords) # B, S, N, LRR
+ LRR = fcorrs.shape[3]
+
+ fcorrs_ = fcorrs.permute(0, 2, 1, 3).reshape(B * N, S, LRR)
+ flows_ = (coords - coords[:, 0:1]).permute(0, 2, 1, 3).reshape(B * N, S, 2)
+
+ flows_cat = get_2d_embedding(flows_, 64, cat_coords=True)
+ ffeats_ = ffeats.permute(0, 2, 1, 3).reshape(B * N, S, self.latent_dim)
+
+ if track_mask.shape[1] < vis_init.shape[1]:
+ track_mask = torch.cat(
+ [
+ track_mask,
+ torch.zeros_like(track_mask[:, 0]).repeat(
+ 1, vis_init.shape[1] - track_mask.shape[1], 1, 1
+ ),
+ ],
+ dim=1,
+ )
+ concat = (
+ torch.cat([track_mask, vis_init], dim=2)
+ .permute(0, 2, 1, 3)
+ .reshape(B * N, S, 2)
+ )
+
+ transformer_input = torch.cat([flows_cat, fcorrs_, ffeats_, concat], dim=2)
+ x = transformer_input + pos_embed + times_embed
+
+ x = rearrange(x, "(b n) t d -> b n t d", b=B)
+
+ delta = self.updateformer(x)
+
+ delta = rearrange(delta, " b n t d -> (b n) t d")
+
+ delta_coords_ = delta[:, :, :2]
+ delta_feats_ = delta[:, :, 2:]
+
+ delta_feats_ = delta_feats_.reshape(B * N * S, self.latent_dim)
+ ffeats_ = ffeats.permute(0, 2, 1, 3).reshape(B * N * S, self.latent_dim)
+
+ ffeats_ = self.ffeat_updater(self.norm(delta_feats_)) + ffeats_
+
+ ffeats = ffeats_.reshape(B, N, S, self.latent_dim).permute(
+ 0, 2, 1, 3
+ ) # B,S,N,C
+
+ coords = coords + delta_coords_.reshape(B, N, S, 2).permute(0, 2, 1, 3)
+ coord_predictions.append(coords * self.stride)
+
+ vis_e = self.vis_predictor(ffeats.reshape(B * S * N, self.latent_dim)).reshape(
+ B, S, N
+ )
+ return coord_predictions, vis_e, feat_init
+
+ def forward(self, rgbs, queries, iters=4, cached_feat=None, feat_init=None, is_train=False):
+ B, T, C, H, W = rgbs.shape
+ B, N, __ = queries.shape
+
+ device = rgbs.device
+ assert B == 1
+ # INIT for the first sequence
+        # We want to sort points by the first frame in which they are visible, so that they can be added to the tensor of tracked points consecutively
+ first_positive_inds = queries[:, :, 0].long()
+
+ __, sort_inds = torch.sort(first_positive_inds[0], dim=0, descending=False)
+ inv_sort_inds = torch.argsort(sort_inds, dim=0)
+ first_positive_sorted_inds = first_positive_inds[0][sort_inds]
+
+ assert torch.allclose(
+ first_positive_inds[0], first_positive_inds[0][sort_inds][inv_sort_inds]
+ )
+
+ coords_init = queries[:, :, 1:].reshape(B, 1, N, 2).repeat(
+ 1, self.S, 1, 1
+ ) / float(self.stride)
+
+ rgbs = 2 * rgbs - 1.0
+
+ traj_e = torch.zeros((B, T, N, 2), device=device)
+ vis_e = torch.zeros((B, T, N), device=device)
+
+ ind_array = torch.arange(T, device=device)
+ ind_array = ind_array[None, :, None].repeat(B, 1, N)
+
+ track_mask = (ind_array >= first_positive_inds[:, None, :]).unsqueeze(-1)
+ # these are logits, so we initialize visibility with something that would give a value close to 1 after softmax
+ vis_init = torch.ones((B, self.S, N, 1), device=device).float() * 10
+
+ ind = 0
+
+ track_mask_ = track_mask[:, :, sort_inds].clone()
+ coords_init_ = coords_init[:, :, sort_inds].clone()
+ vis_init_ = vis_init[:, :, sort_inds].clone()
+
+ prev_wind_idx = 0
+ fmaps_ = None
+ vis_predictions = []
+ coord_predictions = []
+ wind_inds = []
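+        # Slide a window of S frames over the video with an overlap of S // 2;
+        # estimates from the previous window initialise the overlapping half, and
+        # queries are appended once their start frame falls inside the window.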
+ while ind < T - self.S // 2:
+ rgbs_seq = rgbs[:, ind : ind + self.S]
+
+ S = S_local = rgbs_seq.shape[1]
+
+ if cached_feat is None:
+ if S < self.S:
+ rgbs_seq = torch.cat(
+ [rgbs_seq, rgbs_seq[:, -1, None].repeat(1, self.S - S, 1, 1, 1)],
+ dim=1,
+ )
+ S = rgbs_seq.shape[1]
+ rgbs_ = rgbs_seq.reshape(B * S, C, H, W)
+
+ if fmaps_ is None:
+ fmaps_ = self.fnet(rgbs_)
+ else:
+ fmaps_ = torch.cat(
+ [fmaps_[self.S // 2 :], self.fnet(rgbs_[self.S // 2 :])], dim=0
+ )
+ fmaps = fmaps_.reshape(
+ B, S, self.latent_dim, H // self.stride, W // self.stride
+ )
+ else:
+ fmaps = cached_feat[:, ind : ind + self.S]
+ if S < self.S:
+ fmaps = torch.cat(
+ [fmaps, fmaps[:, -1, None].repeat(1, self.S - S, 1, 1, 1)],
+ dim=1,
+ )
+
+ curr_wind_points = torch.nonzero(first_positive_sorted_inds < ind + self.S)
+ if curr_wind_points.shape[0] == 0:
+ ind = ind + self.S // 2
+ continue
+ wind_idx = curr_wind_points[-1] + 1
+
+ if wind_idx - prev_wind_idx > 0:
+ fmaps_sample = fmaps[
+ :, first_positive_sorted_inds[prev_wind_idx:wind_idx] - ind
+ ]
+
+ feat_init_ = bilinear_sample2d(
+ fmaps_sample,
+ coords_init_[:, 0, prev_wind_idx:wind_idx, 0],
+ coords_init_[:, 0, prev_wind_idx:wind_idx, 1],
+ ).permute(0, 2, 1)
+
+ feat_init_ = feat_init_.unsqueeze(1).repeat(1, self.S, 1, 1)
+ feat_init = smart_cat(feat_init, feat_init_, dim=2)
+
+ if prev_wind_idx > 0:
+ new_coords = coords[-1][:, self.S // 2 :] / float(self.stride)
+
+ coords_init_[:, : self.S // 2, :prev_wind_idx] = new_coords
+ coords_init_[:, self.S // 2 :, :prev_wind_idx] = new_coords[
+ :, -1
+ ].repeat(1, self.S // 2, 1, 1)
+
+ new_vis = vis[:, self.S // 2 :].unsqueeze(-1)
+ vis_init_[:, : self.S // 2, :prev_wind_idx] = new_vis
+ vis_init_[:, self.S // 2 :, :prev_wind_idx] = new_vis[:, -1].repeat(
+ 1, self.S // 2, 1, 1
+ )
+
+ coords, vis, __ = self.forward_iteration(
+ fmaps=fmaps,
+ coords_init=coords_init_[:, :, :wind_idx],
+ feat_init=feat_init[:, :, :wind_idx],
+ vis_init=vis_init_[:, :, :wind_idx],
+ track_mask=track_mask_[:, ind : ind + self.S, :wind_idx],
+ iters=iters,
+ )
+ if is_train:
+ vis_predictions.append(torch.sigmoid(vis[:, :S_local]))
+ coord_predictions.append([coord[:, :S_local] for coord in coords])
+ wind_inds.append(wind_idx)
+
+ traj_e[:, ind : ind + self.S, :wind_idx] = coords[-1][:, :S_local]
+ vis_e[:, ind : ind + self.S, :wind_idx] = vis[:, :S_local]
+
+ track_mask_[:, : ind + self.S, :wind_idx] = 0.0
+ ind = ind + self.S // 2
+
+ prev_wind_idx = wind_idx
+
+ traj_e = traj_e[:, :, inv_sort_inds]
+ vis_e = vis_e[:, :, inv_sort_inds]
+
+ vis_e = torch.sigmoid(vis_e)
+
+ train_data = (
+ (vis_predictions, coord_predictions, wind_inds, sort_inds)
+ if is_train
+ else None
+ )
+ return traj_e, feat_init, vis_e, train_data
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/losses.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/losses.py
new file mode 100644
index 0000000000000000000000000000000000000000..41e381576a76b86e2a2f40de9dd11fca1750d199
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/cotracker/losses.py
@@ -0,0 +1,61 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn.functional as F
+from dot.models.shelf.cotracker_utils.models.core.model_utils import reduce_masked_mean
+
+EPS = 1e-6
+
+
+def balanced_ce_loss(pred, gt, valid=None):
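+    # Sigmoid cross-entropy on visibility logits, averaged separately over
+    # confidently visible (gt > 0.95) and confidently occluded (gt < 0.05) points
+    # so that both classes contribute equally regardless of their frequency.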
+ total_balanced_loss = 0.0
+ for j in range(len(gt)):
+ B, S, N = gt[j].shape
+ # pred and gt are the same shape
+ for (a, b) in zip(pred[j].size(), gt[j].size()):
+ assert a == b # some shape mismatch!
+ # if valid is not None:
+ for (a, b) in zip(pred[j].size(), valid[j].size()):
+ assert a == b # some shape mismatch!
+
+ pos = (gt[j] > 0.95).float()
+ neg = (gt[j] < 0.05).float()
+
+ label = pos * 2.0 - 1.0
+ a = -label * pred[j]
+ b = F.relu(a)
+ loss = b + torch.log(torch.exp(-b) + torch.exp(a - b))
+
+ pos_loss = reduce_masked_mean(loss, pos * valid[j])
+ neg_loss = reduce_masked_mean(loss, neg * valid[j])
+
+ balanced_loss = pos_loss + neg_loss
+ total_balanced_loss += balanced_loss / float(N)
+ return total_balanced_loss
+
+
+def sequence_loss(flow_preds, flow_gt, vis, valids, gamma=0.8):
+ """Loss function defined over sequence of flow predictions"""
+ total_flow_loss = 0.0
+ for j in range(len(flow_gt)):
+ B, S, N, D = flow_gt[j].shape
+ assert D == 2
+ B, S1, N = vis[j].shape
+ B, S2, N = valids[j].shape
+ assert S == S1
+ assert S == S2
+ n_predictions = len(flow_preds[j])
+ flow_loss = 0.0
+ for i in range(n_predictions):
+ i_weight = gamma ** (n_predictions - i - 1)
+ flow_pred = flow_preds[j][i]
+ i_loss = (flow_pred - flow_gt[j]).abs() # B, S, N, 2
+ i_loss = torch.mean(i_loss, dim=3) # B, S, N
+ flow_loss += i_weight * reduce_masked_mean(i_loss, valids[j])
+ flow_loss = flow_loss / n_predictions
+ total_flow_loss += flow_loss / float(N)
+ return total_flow_loss
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/embeddings.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/embeddings.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbcd86a55bda603b1729638de9ddb339cac42f84
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/embeddings.py
@@ -0,0 +1,154 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import numpy as np
+
+
+def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False, extra_tokens=0):
+ """
+ grid_size: int of the grid height and width
+ return:
+ pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
+ """
+ if isinstance(grid_size, tuple):
+ grid_size_h, grid_size_w = grid_size
+ else:
+ grid_size_h = grid_size_w = grid_size
+ grid_h = np.arange(grid_size_h, dtype=np.float32)
+ grid_w = np.arange(grid_size_w, dtype=np.float32)
+ grid = np.meshgrid(grid_w, grid_h) # here w goes first
+ grid = np.stack(grid, axis=0)
+
+ grid = grid.reshape([2, 1, grid_size_h, grid_size_w])
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+ if cls_token and extra_tokens > 0:
+ pos_embed = np.concatenate(
+ [np.zeros([extra_tokens, embed_dim]), pos_embed], axis=0
+ )
+ return pos_embed
+
+
+def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
+ assert embed_dim % 2 == 0
+
+ # use half of dimensions to encode grid_h
+ emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
+ emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
+
+ emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
+ return emb
+
+
+def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+ """
+ embed_dim: output dimension for each position
+ pos: a list of positions to be encoded: size (M,)
+ out: (M, D)
+ """
+ assert embed_dim % 2 == 0
+ omega = np.arange(embed_dim // 2, dtype=np.float64)
+ omega /= embed_dim / 2.0
+ omega = 1.0 / 10000 ** omega # (D/2,)
+
+ pos = pos.reshape(-1) # (M,)
+ out = np.einsum("m,d->md", pos, omega) # (M, D/2), outer product
+
+ emb_sin = np.sin(out) # (M, D/2)
+ emb_cos = np.cos(out) # (M, D/2)
+
+ emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
+ return emb
+
+
+def get_2d_embedding(xy, C, cat_coords=True):
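+    # Encode x and y separately with C/2 sine/cosine frequency pairs each and
+    # concatenate the two encodings, optionally prepending the raw coordinates.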
+ B, N, D = xy.shape
+ assert D == 2
+
+ x = xy[:, :, 0:1]
+ y = xy[:, :, 1:2]
+ div_term = (
+ torch.arange(0, C, 2, device=xy.device, dtype=torch.float32) * (1000.0 / C)
+ ).reshape(1, 1, int(C / 2))
+
+ pe_x = torch.zeros(B, N, C, device=xy.device, dtype=torch.float32)
+ pe_y = torch.zeros(B, N, C, device=xy.device, dtype=torch.float32)
+
+ pe_x[:, :, 0::2] = torch.sin(x * div_term)
+ pe_x[:, :, 1::2] = torch.cos(x * div_term)
+
+ pe_y[:, :, 0::2] = torch.sin(y * div_term)
+ pe_y[:, :, 1::2] = torch.cos(y * div_term)
+
+    pe = torch.cat([pe_x, pe_y], dim=2)  # B, N, C*2
+    if cat_coords:
+        pe = torch.cat([xy, pe], dim=2)  # B, N, C*2+2
+ return pe
+
+
+def get_3d_embedding(xyz, C, cat_coords=True):
+ B, N, D = xyz.shape
+ assert D == 3
+
+ x = xyz[:, :, 0:1]
+ y = xyz[:, :, 1:2]
+ z = xyz[:, :, 2:3]
+ div_term = (
+ torch.arange(0, C, 2, device=xyz.device, dtype=torch.float32) * (1000.0 / C)
+ ).reshape(1, 1, int(C / 2))
+
+ pe_x = torch.zeros(B, N, C, device=xyz.device, dtype=torch.float32)
+ pe_y = torch.zeros(B, N, C, device=xyz.device, dtype=torch.float32)
+ pe_z = torch.zeros(B, N, C, device=xyz.device, dtype=torch.float32)
+
+ pe_x[:, :, 0::2] = torch.sin(x * div_term)
+ pe_x[:, :, 1::2] = torch.cos(x * div_term)
+
+ pe_y[:, :, 0::2] = torch.sin(y * div_term)
+ pe_y[:, :, 1::2] = torch.cos(y * div_term)
+
+ pe_z[:, :, 0::2] = torch.sin(z * div_term)
+ pe_z[:, :, 1::2] = torch.cos(z * div_term)
+
+ pe = torch.cat([pe_x, pe_y, pe_z], dim=2) # B, N, C*3
+ if cat_coords:
+ pe = torch.cat([pe, xyz], dim=2) # B, N, C*3+3
+ return pe
+
+
+def get_4d_embedding(xyzw, C, cat_coords=True):
+ B, N, D = xyzw.shape
+ assert D == 4
+
+ x = xyzw[:, :, 0:1]
+ y = xyzw[:, :, 1:2]
+ z = xyzw[:, :, 2:3]
+ w = xyzw[:, :, 3:4]
+ div_term = (
+ torch.arange(0, C, 2, device=xyzw.device, dtype=torch.float32) * (1000.0 / C)
+ ).reshape(1, 1, int(C / 2))
+
+ pe_x = torch.zeros(B, N, C, device=xyzw.device, dtype=torch.float32)
+ pe_y = torch.zeros(B, N, C, device=xyzw.device, dtype=torch.float32)
+ pe_z = torch.zeros(B, N, C, device=xyzw.device, dtype=torch.float32)
+ pe_w = torch.zeros(B, N, C, device=xyzw.device, dtype=torch.float32)
+
+ pe_x[:, :, 0::2] = torch.sin(x * div_term)
+ pe_x[:, :, 1::2] = torch.cos(x * div_term)
+
+ pe_y[:, :, 0::2] = torch.sin(y * div_term)
+ pe_y[:, :, 1::2] = torch.cos(y * div_term)
+
+ pe_z[:, :, 0::2] = torch.sin(z * div_term)
+ pe_z[:, :, 1::2] = torch.cos(z * div_term)
+
+ pe_w[:, :, 0::2] = torch.sin(w * div_term)
+ pe_w[:, :, 1::2] = torch.cos(w * div_term)
+
+    pe = torch.cat([pe_x, pe_y, pe_z, pe_w], dim=2)  # B, N, C*4
+    if cat_coords:
+        pe = torch.cat([pe, xyzw], dim=2)  # B, N, C*4+4
+ return pe
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/model_utils.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/model_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..e875f96a4bd3232707303d8c7fd8cff9368477d0
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/core/model_utils.py
@@ -0,0 +1,169 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+
+EPS = 1e-6
+
+
+def smart_cat(tensor1, tensor2, dim):
+ if tensor1 is None:
+ return tensor2
+ return torch.cat([tensor1, tensor2], dim=dim)
+
+
+def normalize_single(d):
+ # d is a whatever shape torch tensor
+ dmin = torch.min(d)
+ dmax = torch.max(d)
+ d = (d - dmin) / (EPS + (dmax - dmin))
+ return d
+
+
+def normalize(d):
+ # d is B x whatever. normalize within each element of the batch
+ out = torch.zeros(d.size())
+ if d.is_cuda:
+ out = out.cuda()
+ B = list(d.size())[0]
+ for b in list(range(B)):
+ out[b] = normalize_single(d[b])
+ return out
+
+
+def meshgrid2d(B, Y, X, stack=False, norm=False, device="cpu"):
+ # returns a meshgrid sized B x Y x X
+
+ grid_y = torch.linspace(0.0, Y - 1, Y, device=torch.device(device))
+ grid_y = torch.reshape(grid_y, [1, Y, 1])
+ grid_y = grid_y.repeat(B, 1, X)
+
+ grid_x = torch.linspace(0.0, X - 1, X, device=torch.device(device))
+ grid_x = torch.reshape(grid_x, [1, 1, X])
+ grid_x = grid_x.repeat(B, Y, 1)
+
+ if stack:
+ # note we stack in xy order
+ # (see https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.grid_sample)
+ grid = torch.stack([grid_x, grid_y], dim=-1)
+ return grid
+ else:
+ return grid_y, grid_x
+
+
+def reduce_masked_mean(x, mask, dim=None, keepdim=False):
+ # x and mask are the same shape, or at least broadcastably so < actually it's safer if you disallow broadcasting
+ # returns shape-1
+ # axis can be a list of axes
+ for (a, b) in zip(x.size(), mask.size()):
+ assert a == b # some shape mismatch!
+ prod = x * mask
+ if dim is None:
+ numer = torch.sum(prod)
+ denom = EPS + torch.sum(mask)
+ else:
+ numer = torch.sum(prod, dim=dim, keepdim=keepdim)
+ denom = EPS + torch.sum(mask, dim=dim, keepdim=keepdim)
+
+ mean = numer / denom
+ return mean
+
+
+def bilinear_sample2d(im, x, y, return_inbounds=False):
+ # x and y are each B, N
+ # output is B, C, N
+ if len(im.shape) == 5:
+ B, N, C, H, W = list(im.shape)
+ else:
+ B, C, H, W = list(im.shape)
+ N = list(x.shape)[1]
+
+ x = x.float()
+ y = y.float()
+ H_f = torch.tensor(H, dtype=torch.float32)
+ W_f = torch.tensor(W, dtype=torch.float32)
+
+    # inbound_mask = (x>-0.5).float()*(y>-0.5).float()*(x<W_f+0.5).float()*(y<H_f+0.5).float()
+
+    max_y = (H_f - 1).int()
+    max_x = (W_f - 1).int()
+
+    x0 = torch.floor(x).int()
+    x1 = x0 + 1
+    y0 = torch.floor(y).int()
+    y1 = y0 + 1
+
+    x0_clip = torch.clamp(x0, 0, max_x)
+    x1_clip = torch.clamp(x1, 0, max_x)
+    y0_clip = torch.clamp(y0, 0, max_y)
+    y1_clip = torch.clamp(y1, 0, max_y)
+    dim2 = W
+    dim1 = W * H
+
+    base = torch.arange(0, B, dtype=torch.int64, device=x.device) * dim1
+    base = torch.reshape(base, [B, 1]).repeat([1, N])
+
+    base_y0 = base + y0_clip * dim2
+    base_y1 = base + y1_clip * dim2
+
+    idx_y0_x0 = base_y0 + x0_clip
+    idx_y0_x1 = base_y0 + x1_clip
+    idx_y1_x0 = base_y1 + x0_clip
+    idx_y1_x1 = base_y1 + x1_clip
+
+    # use the flat indices to look up the four neighbouring pixels;
+    # im is B x C x H x W (or B x N x C x H x W), so move C to the last dim first
+    if len(im.shape) == 5:
+        im_flat = (im.permute(0, 3, 4, 1, 2)).reshape(B * H * W, N, C)
+        i_y0_x0 = torch.diagonal(im_flat[idx_y0_x0.long()], dim1=1, dim2=2).permute(0, 2, 1)
+        i_y0_x1 = torch.diagonal(im_flat[idx_y0_x1.long()], dim1=1, dim2=2).permute(0, 2, 1)
+        i_y1_x0 = torch.diagonal(im_flat[idx_y1_x0.long()], dim1=1, dim2=2).permute(0, 2, 1)
+        i_y1_x1 = torch.diagonal(im_flat[idx_y1_x1.long()], dim1=1, dim2=2).permute(0, 2, 1)
+    else:
+        im_flat = (im.permute(0, 2, 3, 1)).reshape(B * H * W, C)
+        i_y0_x0 = im_flat[idx_y0_x0.long()]
+        i_y0_x1 = im_flat[idx_y0_x1.long()]
+        i_y1_x0 = im_flat[idx_y1_x0.long()]
+        i_y1_x1 = im_flat[idx_y1_x1.long()]
+
+    # bilinear weights from the fractional offsets
+    x0_f = x0.float()
+    x1_f = x1.float()
+    y0_f = y0.float()
+    y1_f = y1.float()
+
+    w_y0_x0 = ((x1_f - x) * (y1_f - y)).unsqueeze(2)
+    w_y0_x1 = ((x - x0_f) * (y1_f - y)).unsqueeze(2)
+    w_y1_x0 = ((x1_f - x) * (y - y0_f)).unsqueeze(2)
+    w_y1_x1 = ((x - x0_f) * (y - y0_f)).unsqueeze(2)
+
+    output = (
+        w_y0_x0 * i_y0_x0 + w_y0_x1 * i_y0_x1 + w_y1_x0 * i_y1_x0 + w_y1_x1 * i_y1_x1
+    )
+    output = output.view(B, -1, C)
+    output = output.permute(0, 2, 1)  # B, C, N
+
+    if return_inbounds:
+        x_valid = (x > -0.5).byte() & (x < float(W_f - 0.5)).byte()
+        y_valid = (y > -0.5).byte() & (y < float(H_f - 0.5)).byte()
+        inbounds = (x_valid & y_valid).float()
+        inbounds = inbounds.reshape(B, N)
+        return output, inbounds
+
+    return output  # B, C, N
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/models/evaluation_predictor.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/evaluation_predictor.py
new file mode 100644
index 0000000000000000000000000000000000000000..1bcc7fa4b9cb6553745e03dff0b1030db630f469
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/models/evaluation_predictor.py
@@ -0,0 +1,106 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn.functional as F
+from typing import Tuple
+
+from dot.models.shelf.cotracker_utils.models.core.cotracker.cotracker import CoTracker, get_points_on_a_grid
+
+
+class EvaluationPredictor(torch.nn.Module):
+ def __init__(
+ self,
+ cotracker_model: CoTracker,
+ interp_shape: Tuple[int, int] = (384, 512),
+ grid_size: int = 6,
+ local_grid_size: int = 6,
+ single_point: bool = True,
+ n_iters: int = 6,
+ ) -> None:
+ super(EvaluationPredictor, self).__init__()
+ self.grid_size = grid_size
+ self.local_grid_size = local_grid_size
+ self.single_point = single_point
+ self.interp_shape = interp_shape
+ self.n_iters = n_iters
+
+ self.model = cotracker_model
+ self.model.eval()
+
+ def forward(self, video, queries):
+ queries = queries.clone()
+ B, T, C, H, W = video.shape
+ B, N, D = queries.shape
+
+ assert D == 3
+ assert B == 1
+
+ rgbs = video.reshape(B * T, C, H, W)
+ rgbs = F.interpolate(rgbs, tuple(self.interp_shape), mode="bilinear")
+ rgbs = rgbs.reshape(B, T, 3, self.interp_shape[0], self.interp_shape[1])
+
+ device = rgbs.device
+
+ queries[:, :, 1] *= self.interp_shape[1] / W
+ queries[:, :, 2] *= self.interp_shape[0] / H
+
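+        # In single-point mode each query is tracked independently, together with
+        # its own local support grid; otherwise all queries are tracked jointly,
+        # optionally together with a global support grid.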
+ if self.single_point:
+ traj_e = torch.zeros((B, T, N, 2), device=device)
+ vis_e = torch.zeros((B, T, N), device=device)
+ for pind in range((N)):
+ query = queries[:, pind : pind + 1]
+
+ t = query[0, 0, 0].long()
+
+ traj_e_pind, vis_e_pind = self._process_one_point(rgbs, query)
+ traj_e[:, t:, pind : pind + 1] = traj_e_pind[:, :, :1]
+ vis_e[:, t:, pind : pind + 1] = vis_e_pind[:, :, :1]
+ else:
+ if self.grid_size > 0:
+ xy = get_points_on_a_grid(self.grid_size, rgbs.shape[3:], device=device)
+ xy = torch.cat([torch.zeros_like(xy[:, :, :1]), xy], dim=2).to(
+ device
+ ) #
+ queries = torch.cat([queries, xy], dim=1) #
+
+ traj_e, __, vis_e, __ = self.model(
+ rgbs=rgbs,
+ queries=queries,
+ iters=self.n_iters,
+ )
+
+ traj_e[:, :, :, 0] *= W / float(self.interp_shape[1])
+ traj_e[:, :, :, 1] *= H / float(self.interp_shape[0])
+ return traj_e, vis_e
+
+ def _process_one_point(self, rgbs, query):
+ t = query[0, 0, 0].long()
+
+ device = rgbs.device
+ if self.local_grid_size > 0:
+ xy_target = get_points_on_a_grid(
+ self.local_grid_size,
+ (50, 50),
+ [query[0, 0, 2], query[0, 0, 1]],
+ )
+
+ xy_target = torch.cat(
+ [torch.zeros_like(xy_target[:, :, :1]), xy_target], dim=2
+ ) #
+ query = torch.cat([query, xy_target], dim=1).to(device) #
+
+ if self.grid_size > 0:
+ xy = get_points_on_a_grid(self.grid_size, rgbs.shape[3:], device=device)
+ xy = torch.cat([torch.zeros_like(xy[:, :, :1]), xy], dim=2).to(device) #
+ query = torch.cat([query, xy], dim=1).to(device) #
+ # crop the video to start from the queried frame
+ query[0, 0, 0] = 0
+ traj_e_pind, __, vis_e_pind, __ = self.model(
+ rgbs=rgbs[:, t:], queries=query, iters=self.n_iters
+ )
+
+ return traj_e_pind, vis_e_pind
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/predictor.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/predictor.py
new file mode 100644
index 0000000000000000000000000000000000000000..574a58d70b261dbdbf2d884da42aabed52f5b91a
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/predictor.py
@@ -0,0 +1,203 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn.functional as F
+
+from tqdm import tqdm
+from .models.core.cotracker.cotracker import get_points_on_a_grid
+from .models.core.model_utils import smart_cat
+from .models.build_cotracker import build_cotracker
+
+
+class CoTrackerPredictor(torch.nn.Module):
+ def __init__(self, patch_size, wind_size):
+ super().__init__()
+ self.interp_shape = (384, 512)
+ self.support_grid_size = 6
+ model = build_cotracker(patch_size, wind_size)
+
+ self.model = model
+ self.model.eval()
+ self.cached_feat = None
+
+ @torch.no_grad()
+ def forward(
+ self,
+ video, # (1, T, 3, H, W)
+ # input prompt types:
+ # - None. Dense tracks are computed in this case. You can adjust *query_frame* to compute tracks starting from a specific frame.
+ # *backward_tracking=True* will compute tracks in both directions.
+ # - queries. Queried points of shape (1, N, 3) in format (t, x, y) for frame index and pixel coordinates.
+ # - grid_size. Grid of N*N points from the first frame. if segm_mask is provided, then computed only for the mask.
+ # You can adjust *query_frame* and *backward_tracking* for the regular grid in the same way as for dense tracks.
+ queries: torch.Tensor = None,
+ segm_mask: torch.Tensor = None, # Segmentation mask of shape (B, 1, H, W)
+ grid_size: int = 0,
+ grid_query_frame: int = 0, # only for dense and regular grid tracks
+ backward_tracking: bool = False,
+ cache_features: bool = False,
+ ):
+
+ if queries is None and grid_size == 0:
+ tracks, visibilities = self._compute_dense_tracks(
+ video,
+ grid_query_frame=grid_query_frame,
+ backward_tracking=backward_tracking,
+ )
+ else:
+ tracks, visibilities = self._compute_sparse_tracks(
+ video,
+ queries,
+ segm_mask,
+ grid_size,
+ add_support_grid=(grid_size == 0 or segm_mask is not None),
+ grid_query_frame=grid_query_frame,
+ backward_tracking=backward_tracking,
+ cache_features=cache_features,
+ )
+
+ return tracks, visibilities
+
+ def _compute_dense_tracks(
+ self, video, grid_query_frame, grid_size=30, backward_tracking=False
+ ):
+ *_, H, W = video.shape
+ grid_step = W // grid_size
+ grid_width = W // grid_step
+ grid_height = H // grid_step
+ tracks = visibilities = None
+ grid_pts = torch.zeros((1, grid_width * grid_height, 3)).to(video.device)
+ grid_pts[0, :, 0] = grid_query_frame
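+ # sweep the coarse grid over every (ox, oy) offset inside a grid_step x grid_step
+ # cell so that, over all iterations, every pixel of the frame is queried once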
+ for offset in tqdm(range(grid_step * grid_step)):
+ ox = offset % grid_step
+ oy = offset // grid_step
+ grid_pts[0, :, 1] = (
+ torch.arange(grid_width).repeat(grid_height) * grid_step + ox
+ )
+ grid_pts[0, :, 2] = (
+ torch.arange(grid_height).repeat_interleave(grid_width) * grid_step + oy
+ )
+ tracks_step, visibilities_step = self._compute_sparse_tracks(
+ video=video,
+ queries=grid_pts,
+ backward_tracking=backward_tracking,
+ )
+ tracks = smart_cat(tracks, tracks_step, dim=2)
+ visibilities = smart_cat(visibilities, visibilities_step, dim=2)
+
+ return tracks, visibilities
+
+ def _compute_sparse_tracks(
+ self,
+ video,
+ queries,
+ segm_mask=None,
+ grid_size=0,
+ add_support_grid=False,
+ grid_query_frame=0,
+ backward_tracking=False,
+ cache_features=False,
+ ):
+ B, T, C, H, W = video.shape
+ assert B == 1
+
+ video = video.reshape(B * T, C, H, W)
+ video = F.interpolate(video, tuple(self.interp_shape), mode="bilinear")
+ video = video.reshape(B, T, 3, self.interp_shape[0], self.interp_shape[1])
+
+ if cache_features:
+ h, w = self.interp_shape[0], self.interp_shape[1]
+ video_ = video.reshape(B * T, C, h, w)
+ video_ = 2 * video_ - 1.0
+ fmaps_ = self.model.fnet(video_)
+ fmaps_ = fmaps_.reshape(B, T, self.model.latent_dim, h // self.model.stride, w // self.model.stride)
+ self.cached_feat = fmaps_
+
+ if queries is not None:
+ queries = queries.clone()
+ B, N, D = queries.shape
+ assert D == 3
+ queries[:, :, 1] *= self.interp_shape[1] / W
+ queries[:, :, 2] *= self.interp_shape[0] / H
+ elif grid_size > 0:
+ grid_pts = get_points_on_a_grid(grid_size, self.interp_shape, device=video.device)
+ if segm_mask is not None:
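+ # restrict the grid to points that fall inside the segmentation mask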
+ segm_mask = F.interpolate(
+ segm_mask, tuple(self.interp_shape), mode="nearest"
+ )
+ point_mask = segm_mask[0, 0][
+ (grid_pts[0, :, 1]).round().long().cpu(),
+ (grid_pts[0, :, 0]).round().long().cpu(),
+ ].bool()
+ grid_pts = grid_pts[:, point_mask]
+
+ queries = torch.cat(
+ [torch.ones_like(grid_pts[:, :, :1]) * grid_query_frame, grid_pts],
+ dim=2,
+ )
+
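+ # optionally append a regular grid of "support" points; they give the tracker
+ # extra context and are stripped from the outputs again further below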
+ if add_support_grid:
+ grid_pts = get_points_on_a_grid(self.support_grid_size, self.interp_shape, device=video.device)
+ grid_pts = torch.cat(
+ [torch.zeros_like(grid_pts[:, :, :1]), grid_pts], dim=2
+ )
+ queries = torch.cat([queries, grid_pts], dim=1)
+
+ tracks, __, visibilities, __ = self.model(rgbs=video, queries=queries, iters=6, cached_feat=self.cached_feat)
+
+ if backward_tracking:
+ tracks, visibilities = self._compute_backward_tracks(
+ video, queries, tracks, visibilities
+ )
+ if add_support_grid:
+ queries[:, -self.support_grid_size ** 2 :, 0] = T - 1
+ if add_support_grid:
+ tracks = tracks[:, :, : -self.support_grid_size ** 2]
+ visibilities = visibilities[:, :, : -self.support_grid_size ** 2]
+ thr = 0.9
+ visibilities = visibilities > thr
+
+ # correct query-point predictions
+ # see https://github.com/facebookresearch/co-tracker/issues/28
+
+ # TODO: batchify
+ for i in range(len(queries)):
+ queries_t = queries[i, :tracks.size(2), 0].to(torch.int64)
+ arange = torch.arange(0, len(queries_t))
+
+ # overwrite the predictions with the query points
+ tracks[i, queries_t, arange] = queries[i, :tracks.size(2), 1:]
+
+ # correct visibilities, the query points should be visible
+ visibilities[i, queries_t, arange] = True
+
+ tracks[:, :, :, 0] *= W / float(self.interp_shape[1])
+ tracks[:, :, :, 1] *= H / float(self.interp_shape[0])
+ return tracks, visibilities
+
+ def _compute_backward_tracks(self, video, queries, tracks, visibilities):
+ inv_video = video.flip(1).clone()
+ inv_queries = queries.clone()
+ inv_queries[:, :, 0] = inv_video.shape[1] - inv_queries[:, :, 0] - 1
+
+ if self.cached_feat is not None:
+ inv_feat = self.cached_feat.flip(1)
+ else:
+ inv_feat = None
+
+ inv_tracks, __, inv_visibilities, __ = self.model(
+ rgbs=inv_video, queries=inv_queries, iters=6, cached_feat=inv_feat
+ )
+
+ inv_tracks = inv_tracks.flip(1)
+ inv_visibilities = inv_visibilities.flip(1)
+
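+ # entries where the forward pass left tracks at zero (typically before a
+ # query's start frame) are filled with the time-reversed predictions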
+ mask = tracks == 0
+
+ tracks[mask] = inv_tracks[mask]
+ visibilities[mask[:, :, :, 0]] = inv_visibilities[mask[:, :, :, 0]]
+ return tracks, visibilities
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/utils/__init__.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4547e070da2f3ddc5bf2f466cb2242e6135c7dc3
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/utils/__init__.py
@@ -0,0 +1,5 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
diff --git a/data/dot_single_video/dot/models/shelf/cotracker_utils/utils/visualizer.py b/data/dot_single_video/dot/models/shelf/cotracker_utils/utils/visualizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..17d565911fcd79f475944c0d3d34da7ee35edb11
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/cotracker_utils/utils/visualizer.py
@@ -0,0 +1,314 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import os
+import numpy as np
+import cv2
+import torch
+import flow_vis
+
+from matplotlib import cm
+import torch.nn.functional as F
+import torchvision.transforms as transforms
+from moviepy.editor import ImageSequenceClip
+import matplotlib.pyplot as plt
+
+
+def read_video_from_path(path):
+ frames = []
+ cap = cv2.VideoCapture(path)
+ if not cap.isOpened():
+ print("Error opening video file")
+ else:
+ while cap.isOpened():
+ ret, frame = cap.read()
+ if ret:
+ frames.append(np.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
+ else:
+ break
+ cap.release()
+ return np.stack(frames)
+
+
+class Visualizer:
+ def __init__(
+ self,
+ save_dir: str = "./results",
+ grayscale: bool = False,
+ pad_value: int = 0,
+ fps: int = 10,
+ mode: str = "rainbow", # 'cool', 'optical_flow'
+ linewidth: int = 2,
+ show_first_frame: int = 10,
+ tracks_leave_trace: int = 0, # -1 for infinite
+ ):
+ self.mode = mode
+ self.save_dir = save_dir
+ if mode == "rainbow":
+ self.color_map = cm.get_cmap("gist_rainbow")
+ elif mode == "cool":
+ self.color_map = cm.get_cmap(mode)
+ self.show_first_frame = show_first_frame
+ self.grayscale = grayscale
+ self.tracks_leave_trace = tracks_leave_trace
+ self.pad_value = pad_value
+ self.linewidth = linewidth
+ self.fps = fps
+
+ def visualize(
+ self,
+ video: torch.Tensor, # (B,T,C,H,W)
+ tracks: torch.Tensor, # (B,T,N,2)
+ visibility: torch.Tensor = None, # (B, T, N, 1) bool
+ gt_tracks: torch.Tensor = None, # (B,T,N,2)
+ segm_mask: torch.Tensor = None, # (B,1,H,W)
+ filename: str = "video",
+ writer=None, # tensorboard Summary Writer, used for visualization during training
+ step: int = 0,
+ query_frame: int = 0,
+ save_video: bool = True,
+ compensate_for_camera_motion: bool = False,
+ ):
+ if compensate_for_camera_motion:
+ assert segm_mask is not None
+ if segm_mask is not None:
+ coords = tracks[0, query_frame].round().long()
+ segm_mask = segm_mask[0, query_frame][coords[:, 1], coords[:, 0]].long()
+
+ video = F.pad(
+ video,
+ (self.pad_value, self.pad_value, self.pad_value, self.pad_value),
+ "constant",
+ 255,
+ )
+ tracks = tracks + self.pad_value
+
+ if self.grayscale:
+ transform = transforms.Grayscale()
+ video = transform(video)
+ video = video.repeat(1, 1, 3, 1, 1)
+
+ res_video = self.draw_tracks_on_video(
+ video=video,
+ tracks=tracks,
+ visibility=visibility,
+ segm_mask=segm_mask,
+ gt_tracks=gt_tracks,
+ query_frame=query_frame,
+ compensate_for_camera_motion=compensate_for_camera_motion,
+ )
+ if save_video:
+ self.save_video(res_video, filename=filename, writer=writer, step=step)
+ return res_video
+
+ def save_video(self, video, filename, writer=None, step=0):
+ if writer is not None:
+ writer.add_video(
+ f"{filename}_pred_track",
+ video.to(torch.uint8),
+ global_step=step,
+ fps=self.fps,
+ )
+ else:
+ os.makedirs(self.save_dir, exist_ok=True)
+ wide_list = list(video.unbind(1))
+ wide_list = [wide[0].permute(1, 2, 0).cpu().numpy() for wide in wide_list]
+ clip = ImageSequenceClip(wide_list[2:-1], fps=self.fps)
+
+ # Write the video file
+ save_path = os.path.join(self.save_dir, f"{filename}_pred_track.mp4")
+ clip.write_videofile(save_path, codec="libx264", fps=self.fps, logger=None)
+
+ print(f"Video saved to {save_path}")
+
+ def draw_tracks_on_video(
+ self,
+ video: torch.Tensor,
+ tracks: torch.Tensor,
+ visibility: torch.Tensor = None,
+ segm_mask: torch.Tensor = None,
+ gt_tracks=None,
+ query_frame: int = 0,
+ compensate_for_camera_motion=False,
+ ):
+ B, T, C, H, W = video.shape
+ _, _, N, D = tracks.shape
+
+ assert D == 2
+ assert C == 3
+ video = video[0].permute(0, 2, 3, 1).byte().detach().cpu().numpy() # S, H, W, C
+ tracks = tracks[0].long().detach().cpu().numpy() # S, N, 2
+ if gt_tracks is not None:
+ gt_tracks = gt_tracks[0].detach().cpu().numpy()
+
+ res_video = []
+
+ # process input video
+ for rgb in video:
+ res_video.append(rgb.copy())
+
+ vector_colors = np.zeros((T, N, 3))
+ if self.mode == "optical_flow":
+ vector_colors = flow_vis.flow_to_color(tracks - tracks[query_frame][None])
+ elif segm_mask is None:
+ if self.mode == "rainbow":
+ y_min, y_max = (
+ tracks[query_frame, :, 1].min(),
+ tracks[query_frame, :, 1].max(),
+ )
+ norm = plt.Normalize(y_min, y_max)
+ for n in range(N):
+ color = self.color_map(norm(tracks[query_frame, n, 1]))
+ color = np.array(color[:3])[None] * 255
+ vector_colors[:, n] = np.repeat(color, T, axis=0)
+ else:
+ # color changes with time
+ for t in range(T):
+ color = np.array(self.color_map(t / T)[:3])[None] * 255
+ vector_colors[t] = np.repeat(color, N, axis=0)
+ else:
+ if self.mode == "rainbow":
+ vector_colors[:, segm_mask <= 0, :] = 255
+
+ y_min, y_max = (
+ tracks[0, segm_mask > 0, 1].min(),
+ tracks[0, segm_mask > 0, 1].max(),
+ )
+ norm = plt.Normalize(y_min, y_max)
+ for n in range(N):
+ if segm_mask[n] > 0:
+ color = self.color_map(norm(tracks[0, n, 1]))
+ color = np.array(color[:3])[None] * 255
+ vector_colors[:, n] = np.repeat(color, T, axis=0)
+
+ else:
+ # color changes with segm class
+ segm_mask = segm_mask.cpu()
+ color = np.zeros((segm_mask.shape[0], 3), dtype=np.float32)
+ color[segm_mask > 0] = np.array(self.color_map(1.0)[:3]) * 255.0
+ color[segm_mask <= 0] = np.array(self.color_map(0.0)[:3]) * 255.0
+ vector_colors = np.repeat(color[None], T, axis=0)
+
+ # draw tracks
+ if self.tracks_leave_trace != 0:
+ for t in range(1, T):
+ first_ind = (
+ max(0, t - self.tracks_leave_trace)
+ if self.tracks_leave_trace >= 0
+ else 0
+ )
+ curr_tracks = tracks[first_ind : t + 1]
+ curr_colors = vector_colors[first_ind : t + 1]
+ if compensate_for_camera_motion:
+ diff = (
+ tracks[first_ind : t + 1, segm_mask <= 0]
+ - tracks[t : t + 1, segm_mask <= 0]
+ ).mean(1)[:, None]
+
+ curr_tracks = curr_tracks - diff
+ curr_tracks = curr_tracks[:, segm_mask > 0]
+ curr_colors = curr_colors[:, segm_mask > 0]
+
+ res_video[t] = self._draw_pred_tracks(
+ res_video[t],
+ curr_tracks,
+ curr_colors,
+ )
+ if gt_tracks is not None:
+ res_video[t] = self._draw_gt_tracks(
+ res_video[t], gt_tracks[first_ind : t + 1]
+ )
+
+ # draw points
+ for t in range(T):
+ for i in range(N):
+ coord = (tracks[t, i, 0], tracks[t, i, 1])
+ visibile = True
+ if visibility is not None:
+ visibile = visibility[0, t, i]
+ if coord[0] != 0 and coord[1] != 0:
+ if not compensate_for_camera_motion or (
+ compensate_for_camera_motion and segm_mask[i] > 0
+ ):
+
+ cv2.circle(
+ res_video[t],
+ coord,
+ int(self.linewidth * 2),
+ vector_colors[t, i].tolist(),
+ thickness=-1 if visibile else 1,
+ )
+
+ # construct the final rgb sequence
+ if self.show_first_frame > 0:
+ res_video = [res_video[0]] * self.show_first_frame + res_video[1:]
+ return torch.from_numpy(np.stack(res_video)).permute(0, 3, 1, 2)[None].byte()
+
+ def _draw_pred_tracks(
+ self,
+ rgb: np.ndarray, # H x W x 3
+ tracks: np.ndarray, # T x 2
+ vector_colors: np.ndarray,
+ alpha: float = 0.5,
+ ):
+ T, N, _ = tracks.shape
+
+ for s in range(T - 1):
+ vector_color = vector_colors[s]
+ original = rgb.copy()
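+ # fade: older segments of the trace (small s) are blended mostly back
+ # into the unmodified frame below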
+ alpha = (s / T) ** 2
+ for i in range(N):
+ coord_y = (int(tracks[s, i, 0]), int(tracks[s, i, 1]))
+ coord_x = (int(tracks[s + 1, i, 0]), int(tracks[s + 1, i, 1]))
+ if coord_y[0] != 0 and coord_y[1] != 0:
+ cv2.line(
+ rgb,
+ coord_y,
+ coord_x,
+ vector_color[i].tolist(),
+ self.linewidth,
+ cv2.LINE_AA,
+ )
+ if self.tracks_leave_trace > 0:
+ rgb = cv2.addWeighted(rgb, alpha, original, 1 - alpha, 0)
+ return rgb
+
+ def _draw_gt_tracks(
+ self,
+ rgb: np.ndarray, # H x W x 3,
+ gt_tracks: np.ndarray, # T x 2
+ ):
+ T, N, _ = gt_tracks.shape
+ color = np.array((211.0, 0.0, 0.0))
+
+ for t in range(T):
+ for i in range(N):
+ gt_track = gt_tracks[t][i]
+ # draw a red cross
+ if gt_track[0] > 0 and gt_track[1] > 0:
+ length = self.linewidth * 3
+ coord_y = (int(gt_track[0]) + length, int(gt_track[1]) + length)
+ coord_x = (int(gt_track[0]) - length, int(gt_track[1]) - length)
+ cv2.line(
+ rgb,
+ coord_y,
+ coord_x,
+ color,
+ self.linewidth,
+ cv2.LINE_AA,
+ )
+ coord_y = (int(gt_track[0]) - length, int(gt_track[1]) + length)
+ coord_x = (int(gt_track[0]) + length, int(gt_track[1]) - length)
+ cv2.line(
+ rgb,
+ coord_y,
+ coord_x,
+ color,
+ self.linewidth,
+ cv2.LINE_AA,
+ )
+ return rgb
diff --git a/data/dot_single_video/dot/models/shelf/raft.py b/data/dot_single_video/dot/models/shelf/raft.py
new file mode 100644
index 0000000000000000000000000000000000000000..afb82d33f4b2fd93bc85325ddaef254e3eb6188b
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/raft.py
@@ -0,0 +1,139 @@
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+
+from .raft_utils.update import BasicUpdateBlock
+from .raft_utils.extractor import BasicEncoder
+from .raft_utils.corr import CorrBlock
+from .raft_utils.utils import coords_grid
+
+
+class RAFT(nn.Module):
+ def __init__(self, args):
+ super().__init__()
+ self.fnet = BasicEncoder(output_dim=256, norm_fn=args.norm_fnet, dropout=0, patch_size=args.patch_size)
+ self.cnet = BasicEncoder(output_dim=256, norm_fn=args.norm_cnet, dropout=0, patch_size=args.patch_size)
+ self.update_block = BasicUpdateBlock(hidden_dim=128, patch_size=args.patch_size, refine_alpha=args.refine_alpha)
+ self.refine_alpha = args.refine_alpha
+ self.patch_size = args.patch_size
+ self.num_iter = args.num_iter
+
+ def encode(self, frame):
+ frame = frame * 2 - 1
+ fmap = self.fnet(frame)
+ cmap = self.cnet(frame)
+ feats = torch.cat([fmap, cmap], dim=1)
+ return feats.float()
+
+ def initialize_feats(self, feats, frame):
+ if feats is None:
+ feats = self.encode(frame)
+ fmap, cmap = feats.split([256, 256], dim=1)
+ return fmap, cmap
+
+ def initialize_flow(self, fmap, coarse_flow):
+ """ Flow is represented as difference between two coordinate grids flow = coords1 - coords0"""
+ N, _, h, w = fmap.shape
+ src_pts = coords_grid(N, h, w, device=fmap.device)
+
+ if coarse_flow is not None:
+ coarse_flow = coarse_flow.permute(0, 3, 1, 2)
+ # coarse_flow = torch.stack([coarse_flow[:, 0] * (w - 1), coarse_flow[:, 1] * (h - 1)], dim=1)
+ tgt_pts = src_pts + coarse_flow
+ else:
+ tgt_pts = src_pts
+
+ return src_pts, tgt_pts
+
+ def initialize_alpha(self, fmap, coarse_alpha):
+ N, _, h, w = fmap.shape
+ if coarse_alpha is None:
+ alpha = torch.ones(N, 1, h, w, device=fmap.device)
+ else:
+ alpha = coarse_alpha[:, None]
+ return alpha.logit(eps=1e-5)
+
+ def postprocess_alpha(self, alpha):
+ alpha = alpha[:, 0]
+ return alpha.sigmoid()
+
+ def postprocess_flow(self, flow):
+ # N, C, H, W = flow.shape
+ # flow = torch.stack([flow[:, 0] / (W - 1), flow[:, 1] / (H - 1)], dim=1)
+ flow = flow.permute(0, 2, 3, 1)
+ return flow
+
+ def upsample_flow(self, flow, mask):
+ """ Upsample flow field [H/P, W/P, 2] -> [H, W, 2] using convex combination """
+ N, _, H, W = flow.shape
+ mask = mask.view(N, 1, 9, self.patch_size, self.patch_size, H, W)
+ mask = torch.softmax(mask, dim=2)
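+ # each fine-resolution pixel is predicted as a convex combination of the
+ # 3x3 coarse-resolution neighbourhood around it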
+
+ up_flow = F.unfold(self.patch_size * flow, [3, 3], padding=1)
+ up_flow = up_flow.view(N, 2, 9, 1, 1, H, W)
+
+ up_flow = torch.sum(mask * up_flow, dim=2)
+ up_flow = up_flow.permute(0, 1, 4, 2, 5, 3)
+ return up_flow.reshape(N, 2, self.patch_size * H, self.patch_size * W)
+
+ def upsample_alpha(self, alpha, mask):
+ """ Upsample alpha field [H/P, W/P, 1] -> [H, W, 1] using convex combination """
+ N, _, H, W = alpha.shape
+ mask = mask.view(N, 1, 9, self.patch_size, self.patch_size, H, W)
+ mask = torch.softmax(mask, dim=2)
+
+ up_alpha = F.unfold(alpha, [3, 3], padding=1)
+ up_alpha = up_alpha.view(N, 1, 9, 1, 1, H, W)
+
+ up_alpha = torch.sum(mask * up_alpha, dim=2)
+ up_alpha = up_alpha.permute(0, 1, 4, 2, 5, 3)
+ return up_alpha.reshape(N, 1, self.patch_size * H, self.patch_size * W)
+
+ def forward(self, src_frame=None, tgt_frame=None, src_feats=None, tgt_feats=None, coarse_flow=None, coarse_alpha=None,
+ is_train=False):
+ src_fmap, src_cmap = self.initialize_feats(src_feats, src_frame)
+ tgt_fmap, _ = self.initialize_feats(tgt_feats, tgt_frame)
+
+ corr_fn = CorrBlock(src_fmap, tgt_fmap)
+
+ net, inp = torch.split(src_cmap, [128, 128], dim=1)
+ net = torch.tanh(net)
+ inp = torch.relu(inp)
+
+ src_pts, tgt_pts = self.initialize_flow(src_fmap, coarse_flow)
+ alpha = self.initialize_alpha(src_fmap, coarse_alpha) if self.refine_alpha else None
+
+ flows_up = []
+ alphas_up = []
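+ # iterative refinement: look up correlation at the current target points,
+ # update the recurrent state, and predict residual flow (and alpha) updates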
+ for itr in range(self.num_iter):
+ tgt_pts = tgt_pts.detach()
+ if self.refine_alpha:
+ alpha = alpha.detach()
+
+ corr = corr_fn(tgt_pts)
+
+ flow = tgt_pts - src_pts
+ net, up_mask, delta_flow, up_mask_alpha, delta_alpha = self.update_block(net, inp, corr, flow, alpha)
+
+ # F(t+1) = F(t) + \Delta(t)
+ tgt_pts = tgt_pts + delta_flow
+ if self.refine_alpha:
+ alpha = alpha + delta_alpha
+
+ # upsample predictions
+ flow_up = self.upsample_flow(tgt_pts - src_pts, up_mask)
+ if self.refine_alpha:
+ alpha_up = self.upsample_alpha(alpha, up_mask_alpha)
+
+ if is_train or (itr == self.num_iter - 1):
+ flows_up.append(self.postprocess_flow(flow_up))
+ if self.refine_alpha:
+ alphas_up.append(self.postprocess_alpha(alpha_up))
+
+ flows_up = torch.stack(flows_up, dim=1)
+ alphas_up = torch.stack(alphas_up, dim=1) if self.refine_alpha else None
+ if not is_train:
+ flows_up = flows_up[:, 0]
+ alphas_up = alphas_up[:, 0] if self.refine_alpha else None
+ return flows_up, alphas_up
diff --git a/data/dot_single_video/dot/models/shelf/raft_utils/LICENSE b/data/dot_single_video/dot/models/shelf/raft_utils/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..bbbd9fc645b0d5235f0d937515868a9f020c72bf
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/raft_utils/LICENSE
@@ -0,0 +1,29 @@
+BSD 3-Clause License
+
+Copyright (c) 2020, princeton-vl
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+ list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+* Neither the name of the copyright holder nor the names of its
+ contributors may be used to endorse or promote products derived from
+ this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/data/dot_single_video/dot/models/shelf/raft_utils/__init__.py b/data/dot_single_video/dot/models/shelf/raft_utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/data/dot_single_video/dot/models/shelf/raft_utils/corr.py b/data/dot_single_video/dot/models/shelf/raft_utils/corr.py
new file mode 100644
index 0000000000000000000000000000000000000000..504ceef7ee4271cbee1792bcf622ec9b40b6490d
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/raft_utils/corr.py
@@ -0,0 +1,55 @@
+import torch
+import torch.nn.functional as F
+
+from .utils import bilinear_sampler
+
+
+class CorrBlock:
+ def __init__(self, fmap1, fmap2, num_levels=4, radius=4):
+ self.num_levels = num_levels
+ self.radius = radius
+ self.corr_pyramid = []
+
+ # all pairs correlation
+ corr = CorrBlock.corr(fmap1, fmap2)
+
+ batch, h1, w1, dim, h2, w2 = corr.shape
+ corr = corr.reshape(batch * h1 * w1, dim, h2, w2)
+
+ self.corr_pyramid.append(corr)
+ for i in range(self.num_levels - 1):
+ corr = F.avg_pool2d(corr, 2, stride=2)
+ self.corr_pyramid.append(corr)
+
+ def __call__(self, coords):
+ r = self.radius
+ coords = coords.permute(0, 2, 3, 1)
+ batch, h1, w1, _ = coords.shape
+
+ out_pyramid = []
+ for i in range(self.num_levels):
+ corr = self.corr_pyramid[i]
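+ # sample an r-neighbourhood of correlation values around each target
+ # coordinate at this pyramid level's resolution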
+ dx = torch.linspace(-r, r, 2 * r + 1, device=coords.device)
+ dy = torch.linspace(-r, r, 2 * r + 1, device=coords.device)
+ delta = torch.stack(torch.meshgrid(dy, dx, indexing="ij"), axis=-1)
+
+ centroid_lvl = coords.reshape(batch * h1 * w1, 1, 1, 2) / 2 ** i
+ delta_lvl = delta.view(1, 2 * r + 1, 2 * r + 1, 2)
+ coords_lvl = centroid_lvl + delta_lvl
+
+ corr = bilinear_sampler(corr, coords_lvl)
+ corr = corr.view(batch, h1, w1, -1)
+ out_pyramid.append(corr)
+
+ out = torch.cat(out_pyramid, dim=-1)
+ return out.permute(0, 3, 1, 2).contiguous().float()
+
+ @staticmethod
+ def corr(fmap1, fmap2):
+ batch, dim, ht, wd = fmap1.shape
+ fmap1 = fmap1.view(batch, dim, ht * wd)
+ fmap2 = fmap2.view(batch, dim, ht * wd)
+
+ corr = torch.matmul(fmap1.transpose(1, 2), fmap2)
+ corr = corr.view(batch, ht, wd, 1, ht, wd)
+ return corr / torch.sqrt(torch.tensor(dim).float())
diff --git a/data/dot_single_video/dot/models/shelf/raft_utils/extractor.py b/data/dot_single_video/dot/models/shelf/raft_utils/extractor.py
new file mode 100644
index 0000000000000000000000000000000000000000..1b88cbee6381bfbc570ac14ae0fb5de5476b7302
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/raft_utils/extractor.py
@@ -0,0 +1,194 @@
+import torch
+import torch.nn as nn
+
+
+class ResidualBlock(nn.Module):
+ def __init__(self, in_planes, planes, norm_fn='group', stride=1):
+ super().__init__()
+
+ self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, padding=1, stride=stride)
+ self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1)
+ self.relu = nn.ReLU(inplace=True)
+
+ num_groups = planes // 8
+
+ if norm_fn == 'group':
+ self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+ self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+ if not stride == 1:
+ norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+
+ elif norm_fn == 'batch':
+ self.norm1 = nn.BatchNorm2d(planes)
+ self.norm2 = nn.BatchNorm2d(planes)
+ if not stride == 1:
+ norm3 = nn.BatchNorm2d(planes)
+
+ elif norm_fn == 'instance':
+ self.norm1 = nn.InstanceNorm2d(planes)
+ self.norm2 = nn.InstanceNorm2d(planes)
+ if not stride == 1:
+ norm3 = nn.InstanceNorm2d(planes)
+
+ elif norm_fn == 'none':
+ self.norm1 = nn.Sequential()
+ self.norm2 = nn.Sequential()
+ if not stride == 1:
+ norm3 = nn.Sequential()
+
+ if stride == 1:
+ self.downsample = None
+
+ else:
+ self.downsample = nn.Sequential(
+ nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), norm3)
+
+ def forward(self, x):
+ y = x
+ y = self.relu(self.norm1(self.conv1(y)))
+ y = self.relu(self.norm2(self.conv2(y)))
+
+ if self.downsample is not None:
+ x = self.downsample(x)
+
+ return self.relu(x + y)
+
+
+class BottleneckBlock(nn.Module):
+ def __init__(self, in_planes, planes, norm_fn='group', stride=1):
+ super().__init__()
+
+ self.conv1 = nn.Conv2d(in_planes, planes // 4, kernel_size=1, padding=0)
+ self.conv2 = nn.Conv2d(planes // 4, planes // 4, kernel_size=3, padding=1, stride=stride)
+ self.conv3 = nn.Conv2d(planes // 4, planes, kernel_size=1, padding=0)
+ self.relu = nn.ReLU(inplace=True)
+
+ num_groups = planes // 8
+
+ if norm_fn == 'group':
+ self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes // 4)
+ self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes // 4)
+ self.norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+ if not stride == 1:
+ self.norm4 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
+
+ elif norm_fn == 'batch':
+ self.norm1 = nn.BatchNorm2d(planes // 4)
+ self.norm2 = nn.BatchNorm2d(planes // 4)
+ self.norm3 = nn.BatchNorm2d(planes)
+ if not stride == 1:
+ self.norm4 = nn.BatchNorm2d(planes)
+
+ elif norm_fn == 'instance':
+ self.norm1 = nn.InstanceNorm2d(planes // 4)
+ self.norm2 = nn.InstanceNorm2d(planes // 4)
+ self.norm3 = nn.InstanceNorm2d(planes)
+ if not stride == 1:
+ self.norm4 = nn.InstanceNorm2d(planes)
+
+ elif norm_fn == 'none':
+ self.norm1 = nn.Sequential()
+ self.norm2 = nn.Sequential()
+ self.norm3 = nn.Sequential()
+ if not stride == 1:
+ self.norm4 = nn.Sequential()
+
+ if stride == 1:
+ self.downsample = None
+
+ else:
+ self.downsample = nn.Sequential(
+ nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), self.norm4)
+
+ def forward(self, x):
+ y = x
+ y = self.relu(self.norm1(self.conv1(y)))
+ y = self.relu(self.norm2(self.conv2(y)))
+ y = self.relu(self.norm3(self.conv3(y)))
+
+ if self.downsample is not None:
+ x = self.downsample(x)
+
+ return self.relu(x + y)
+
+
+class BasicEncoder(nn.Module):
+ def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0, patch_size=8):
+ super().__init__()
+ assert patch_size in [4, 8]
+ if patch_size == 4:
+ stride1, stride2, stride3 = 1, 2, 2
+ else:
+ stride1, stride2, stride3 = 2, 2, 2
+
+ self.norm_fn = norm_fn
+
+ if self.norm_fn == 'group':
+ self.norm1 = nn.GroupNorm(num_groups=8, num_channels=64)
+
+ elif self.norm_fn == 'batch':
+ self.norm1 = nn.BatchNorm2d(64)
+
+ elif self.norm_fn == 'instance':
+ self.norm1 = nn.InstanceNorm2d(64)
+
+ elif self.norm_fn == 'none':
+ self.norm1 = nn.Sequential()
+
+ self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=stride1, padding=3)
+ self.relu1 = nn.ReLU(inplace=True)
+
+ self.in_planes = 64
+ self.layer1 = self._make_layer(64, stride=1)
+ self.layer2 = self._make_layer(96, stride=stride2)
+ self.layer3 = self._make_layer(128, stride=stride3)
+
+ # output convolution
+ self.conv2 = nn.Conv2d(128, output_dim, kernel_size=1)
+
+ self.dropout = None
+ if dropout > 0:
+ self.dropout = nn.Dropout2d(p=dropout)
+
+ for m in self.modules():
+ if isinstance(m, nn.Conv2d):
+ nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
+ elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
+ if m.weight is not None:
+ nn.init.constant_(m.weight, 1)
+ if m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+
+ def _make_layer(self, dim, stride=1):
+ layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
+ layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
+ layers = (layer1, layer2)
+
+ self.in_planes = dim
+ return nn.Sequential(*layers)
+
+ def forward(self, x):
+
+ # if input is list, combine batch dimension
+ is_list = isinstance(x, tuple) or isinstance(x, list)
+ if is_list:
+ batch_dim = x[0].shape[0]
+ x = torch.cat(x, dim=0)
+
+ x = self.conv1(x)
+ x = self.norm1(x)
+ x = self.relu1(x)
+
+ x = self.layer1(x)
+ x = self.layer2(x)
+ x = self.layer3(x)
+
+ x = self.conv2(x)
+
+ if self.training and self.dropout is not None:
+ x = self.dropout(x)
+
+ if is_list:
+ x = torch.split(x, [batch_dim, batch_dim], dim=0)
+
+ return x
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/shelf/raft_utils/update.py b/data/dot_single_video/dot/models/shelf/raft_utils/update.py
new file mode 100644
index 0000000000000000000000000000000000000000..a66358d4e9ed0cbfabdee098c0a451c330344833
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/raft_utils/update.py
@@ -0,0 +1,116 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class FlowHead(nn.Module):
+ def __init__(self, input_dim=128, hidden_dim=256):
+ super().__init__()
+ self.conv1 = nn.Conv2d(input_dim, hidden_dim, 3, padding=1)
+ self.conv2 = nn.Conv2d(hidden_dim, 2, 3, padding=1)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, x):
+ return self.conv2(self.relu(self.conv1(x)))
+
+
+class AlphaHead(nn.Module):
+ def __init__(self, input_dim=128, hidden_dim=256):
+ super().__init__()
+ self.conv1 = nn.Conv2d(input_dim, hidden_dim, 3, padding=1)
+ self.conv2 = nn.Conv2d(hidden_dim, 1, 3, padding=1)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, x):
+ return self.conv2(self.relu(self.conv1(x)))
+
+
+class SepConvGRU(nn.Module):
+ def __init__(self, hidden_dim=128, input_dim=192 + 128):
+ super().__init__()
+ self.convz1 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2))
+ self.convr1 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2))
+ self.convq1 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2))
+
+ self.convz2 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0))
+ self.convr2 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0))
+ self.convq2 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0))
+
+ def forward(self, h, x):
+ # horizontal
+ hx = torch.cat([h, x], dim=1)
+ z = torch.sigmoid(self.convz1(hx))
+ r = torch.sigmoid(self.convr1(hx))
+ q = torch.tanh(self.convq1(torch.cat([r * h, x], dim=1)))
+ h = (1 - z) * h + z * q
+
+ # vertical
+ hx = torch.cat([h, x], dim=1)
+ z = torch.sigmoid(self.convz2(hx))
+ r = torch.sigmoid(self.convr2(hx))
+ q = torch.tanh(self.convq2(torch.cat([r * h, x], dim=1)))
+ h = (1 - z) * h + z * q
+
+ return h
+
+
+class BasicMotionEncoder(nn.Module):
+ def __init__(self, refine_alpha, corr_levels=4, corr_radius=4):
+ super().__init__()
+ in_dim = 2 + (3 if refine_alpha else 0)
+ cor_planes = corr_levels * (2 * corr_radius + 1) ** 2
+ self.refine_alpha = refine_alpha
+ self.convc1 = nn.Conv2d(cor_planes, 256, 1, padding=0)
+ self.convc2 = nn.Conv2d(256, 192, 3, padding=1)
+ self.convf1 = nn.Conv2d(in_dim, 128, 7, padding=3)
+ self.convf2 = nn.Conv2d(128, 64, 3, padding=1)
+ self.conv = nn.Conv2d(64 + 192, 128 - in_dim, 3, padding=1)
+
+ def forward(self, flow, alpha, corr):
+ if self.refine_alpha:
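+ # pad flow (2) with alpha (1) and zeros (2) to match the 5 input channels
+ # expected by convf1 when refine_alpha is on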
+ flow = torch.cat([flow, alpha, torch.zeros_like(flow)], dim=1)
+ cor = F.relu(self.convc1(corr))
+ cor = F.relu(self.convc2(cor))
+ feat = F.relu(self.convf1(flow))
+ feat = F.relu(self.convf2(feat))
+ feat = torch.cat([cor, feat], dim=1)
+ feat = F.relu(self.conv(feat))
+ return torch.cat([feat, flow], dim=1)
+
+
+class BasicUpdateBlock(nn.Module):
+ def __init__(self, hidden_dim=128, patch_size=8, refine_alpha=False):
+ super().__init__()
+ self.refine_alpha = refine_alpha
+ self.encoder = BasicMotionEncoder(refine_alpha)
+ self.gru = SepConvGRU(hidden_dim=hidden_dim, input_dim=128 + hidden_dim)
+
+ self.flow_head = FlowHead(hidden_dim, hidden_dim=256)
+ self.mask = nn.Sequential(
+ nn.Conv2d(128, 256, 3, padding=1),
+ nn.ReLU(inplace=True),
+ nn.Conv2d(256, patch_size * patch_size * 9, 1, padding=0)
+ )
+
+ if refine_alpha:
+ self.alpha_head = AlphaHead(hidden_dim, hidden_dim=256)
+ self.alpha_mask = nn.Sequential(
+ nn.Conv2d(128, 256, 3, padding=1),
+ nn.ReLU(inplace=True),
+ nn.Conv2d(256, patch_size * patch_size * 9, 1, padding=0)
+ )
+
+ def forward(self, net, inp, corr, flow, alpha):
+ mot = self.encoder(flow, alpha, corr)
+ inp = torch.cat([inp, mot], dim=1)
+ net = self.gru(net, inp)
+
+ delta_flow = self.flow_head(net)
+ mask = .25 * self.mask(net)
+
+ delta_alpha, mask_alpha = None, None
+ if self.refine_alpha:
+ delta_alpha = self.alpha_head(net)
+ mask_alpha = .25 * self.alpha_mask(net)
+
+ return net, mask, delta_flow, mask_alpha, delta_alpha
diff --git a/data/dot_single_video/dot/models/shelf/raft_utils/utils.py b/data/dot_single_video/dot/models/shelf/raft_utils/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..74bd51a442cce371fc493baa22520e8b8f67a477
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/raft_utils/utils.py
@@ -0,0 +1,80 @@
+import torch
+import torch.nn.functional as F
+import numpy as np
+from scipy import interpolate
+
+
+class InputPadder:
+ """ Pads images such that dimensions are divisible by 8 """
+ def __init__(self, dims, mode='sintel'):
+ self.ht, self.wd = dims[-2:]
+ pad_ht = (((self.ht // 8) + 1) * 8 - self.ht) % 8
+ pad_wd = (((self.wd // 8) + 1) * 8 - self.wd) % 8
+ if mode == 'sintel':
+ self._pad = [pad_wd//2, pad_wd - pad_wd//2, pad_ht//2, pad_ht - pad_ht//2]
+ else:
+ self._pad = [pad_wd//2, pad_wd - pad_wd//2, 0, pad_ht]
+
+ def pad(self, *inputs):
+ return [F.pad(x, self._pad, mode='replicate') for x in inputs]
+
+ def unpad(self,x):
+ ht, wd = x.shape[-2:]
+ c = [self._pad[2], ht-self._pad[3], self._pad[0], wd-self._pad[1]]
+ return x[..., c[0]:c[1], c[2]:c[3]]
+
+def forward_interpolate(flow):
+ flow = flow.detach().cpu().numpy()
+ dx, dy = flow[0], flow[1]
+
+ ht, wd = dx.shape
+ x0, y0 = np.meshgrid(np.arange(wd), np.arange(ht), indexing="xy")
+
+ x1 = x0 + dx
+ y1 = y0 + dy
+
+ x1 = x1.reshape(-1)
+ y1 = y1.reshape(-1)
+ dx = dx.reshape(-1)
+ dy = dy.reshape(-1)
+
+ valid = (x1 > 0) & (x1 < wd) & (y1 > 0) & (y1 < ht)
+ x1 = x1[valid]
+ y1 = y1[valid]
+ dx = dx[valid]
+ dy = dy[valid]
+
+ flow_x = interpolate.griddata((x1, y1), dx, (x0, y0), method='nearest', fill_value=0)
+
+ flow_y = interpolate.griddata((x1, y1), dy, (x0, y0), method='nearest', fill_value=0)
+
+ flow = np.stack([flow_x, flow_y], axis=0)
+ return torch.from_numpy(flow).float()
+
+
+def bilinear_sampler(img, coords, mode='bilinear', mask=False):
+ """ Wrapper for grid_sample, uses pixel coordinates """
+ H, W = img.shape[-2:]
+ xgrid, ygrid = coords.split([1,1], dim=-1)
+ xgrid = 2*xgrid/(W-1) - 1
+ ygrid = 2*ygrid/(H-1) - 1
+
+ grid = torch.cat([xgrid, ygrid], dim=-1)
+ img = F.grid_sample(img, grid, align_corners=True)
+
+ if mask:
+ mask = (xgrid > -1) & (ygrid > -1) & (xgrid < 1) & (ygrid < 1)
+ return img, mask.float()
+
+ return img
+
+
+def coords_grid(batch, ht, wd, device):
+ coords = torch.meshgrid(torch.arange(ht, device=device), torch.arange(wd, device=device), indexing="ij")
+ coords = torch.stack(coords[::-1], dim=0).float()
+ return coords[None].repeat(batch, 1, 1, 1)
+
+
+def upflow8(flow, mode='bilinear'):
+ new_size = (8 * flow.shape[2], 8 * flow.shape[3])
+ return 8 * F.interpolate(flow, size=new_size, mode=mode, align_corners=True)
diff --git a/data/dot_single_video/dot/models/shelf/tapir.py b/data/dot_single_video/dot/models/shelf/tapir.py
new file mode 100644
index 0000000000000000000000000000000000000000..22c8ed74a26ef8864cb039553c53887a8c629258
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/tapir.py
@@ -0,0 +1,33 @@
+from torch import nn
+import torch.nn.functional as F
+from einops import rearrange
+
+from .tapir_utils.tapir_model import TAPIR
+
+class Tapir(nn.Module):
+ def __init__(self, args):
+ super().__init__()
+ self.model = TAPIR(pyramid_level=args.pyramid_level,
+ softmax_temperature=args.softmax_temperature,
+ extra_convs=args.extra_convs)
+
+ def forward(self, video, queries, backward_tracking, cache_features=False):
+ # Preprocess video
+ video = video * 2 - 1 # conversion from [0, 1] to [-1, 1]
+ video = rearrange(video, "b t c h w -> b t h w c")
+
+ # Preprocess queries
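+ # reorder (t, x, y) queries to the (t, y, x) convention used by TAPIR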
+ queries = queries[..., [0, 2, 1]]
+
+ # Inference
+ outputs = self.model(video, queries, cache_features=cache_features)
+ tracks, occlusions, expected_dist = outputs['tracks'], outputs['occlusion'], outputs['expected_dist']
+
+ # Postprocess tracks
+ tracks = rearrange(tracks, "b s t c -> b t s c")
+
+ # Postprocess visibility
+ visibles = (1 - F.sigmoid(occlusions)) * (1 - F.sigmoid(expected_dist)) > 0.5
+ visibles = rearrange(visibles, "b s t -> b t s")
+
+ return tracks, visibles
\ No newline at end of file
diff --git a/data/dot_single_video/dot/models/shelf/tapir_utils/LICENSE b/data/dot_single_video/dot/models/shelf/tapir_utils/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..75b52484ea471f882c29e02693b4f02dba175b5e
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/tapir_utils/LICENSE
@@ -0,0 +1,202 @@
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/data/dot_single_video/dot/models/shelf/tapir_utils/nets.py b/data/dot_single_video/dot/models/shelf/tapir_utils/nets.py
new file mode 100644
index 0000000000000000000000000000000000000000..67bb0dbdb89705ba4bfd18b6d74a696836605ec4
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/tapir_utils/nets.py
@@ -0,0 +1,382 @@
+# Copyright 2024 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Pytorch neural network definitions."""
+
+from typing import Sequence, Union
+
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+
+class ExtraConvBlock(nn.Module):
+ """Additional convolution block."""
+
+ def __init__(
+ self,
+ channel_dim,
+ channel_multiplier,
+ ):
+ super().__init__()
+ self.channel_dim = channel_dim
+ self.channel_multiplier = channel_multiplier
+
+ self.layer_norm = nn.LayerNorm(
+ normalized_shape=channel_dim, elementwise_affine=True#, bias=True
+ )
+ self.conv = nn.Conv2d(
+ self.channel_dim * 3,
+ self.channel_dim * self.channel_multiplier,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ )
+ self.conv_1 = nn.Conv2d(
+ self.channel_dim * self.channel_multiplier,
+ self.channel_dim,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ )
+
+ def forward(self, x):
+ x = self.layer_norm(x)
+ x = x.permute(0, 3, 1, 2)
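+ # concatenate each frame with its (edge-clamped) temporal neighbours so the
+ # convolution sees short-range temporal context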
+ prev_frame = torch.cat([x[0:1], x[:-1]], dim=0)
+ next_frame = torch.cat([x[1:], x[-1:]], dim=0)
+ resid = torch.cat([x, prev_frame, next_frame], axis=1)
+ resid = self.conv(resid)
+ resid = F.gelu(resid, approximate='tanh')
+ x += self.conv_1(resid)
+ x = x.permute(0, 2, 3, 1)
+ return x
+
+
+class ExtraConvs(nn.Module):
+ """Additional CNN."""
+
+ def __init__(
+ self,
+ num_layers=5,
+ channel_dim=256,
+ channel_multiplier=4,
+ ):
+ super().__init__()
+ self.num_layers = num_layers
+ self.channel_dim = channel_dim
+ self.channel_multiplier = channel_multiplier
+
+ self.blocks = nn.ModuleList()
+ for _ in range(self.num_layers):
+ self.blocks.append(
+ ExtraConvBlock(self.channel_dim, self.channel_multiplier)
+ )
+
+ def forward(self, x):
+ for block in self.blocks:
+ x = block(x)
+
+ return x
+
+
+class ConvChannelsMixer(nn.Module):
+ """Linear activation block for PIPs's MLP Mixer."""
+
+ def __init__(self, in_channels):
+ super().__init__()
+ self.mlp2_up = nn.Linear(in_channels, in_channels * 4)
+ self.mlp2_down = nn.Linear(in_channels * 4, in_channels)
+
+ def forward(self, x):
+ x = self.mlp2_up(x)
+ x = F.gelu(x, approximate='tanh')
+ x = self.mlp2_down(x)
+ return x
+
+
+class PIPsConvBlock(nn.Module):
+ """Convolutional block for PIPs's MLP Mixer."""
+
+ def __init__(self, in_channels, kernel_shape=3):
+ super().__init__()
+ self.layer_norm = nn.LayerNorm(
+ normalized_shape=in_channels, elementwise_affine=True#, bias=False
+ )
+ self.mlp1_up = nn.Conv1d(
+ in_channels, in_channels * 4, kernel_shape, 1, 1, groups=in_channels
+ )
+ self.mlp1_up_1 = nn.Conv1d(
+ in_channels * 4,
+ in_channels * 4,
+ kernel_shape,
+ 1,
+ 1,
+ groups=in_channels * 4,
+ )
+ self.layer_norm_1 = nn.LayerNorm(
+ normalized_shape=in_channels, elementwise_affine=True#, bias=False
+ )
+ self.conv_channels_mixer = ConvChannelsMixer(in_channels)
+
+ def forward(self, x):
+ to_skip = x
+ x = self.layer_norm(x)
+
+ x = x.permute(0, 2, 1)
+ x = self.mlp1_up(x)
+ x = F.gelu(x, approximate='tanh')
+ x = self.mlp1_up_1(x)
+ x = x.permute(0, 2, 1)
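+ # fold the 4x channel expansion back to the input width by summing strided slices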
+ x = x[..., 0::4] + x[..., 1::4] + x[..., 2::4] + x[..., 3::4]
+
+ x = x + to_skip
+ to_skip = x
+ x = self.layer_norm_1(x)
+ x = self.conv_channels_mixer(x)
+
+ x = x + to_skip
+ return x
+
+
+class PIPSMLPMixer(nn.Module):
+ """Depthwise-conv version of PIPs's MLP Mixer."""
+
+ def __init__(
+ self,
+ input_channels: int,
+ output_channels: int,
+ hidden_dim: int = 512,
+ num_blocks: int = 12,
+ kernel_shape: int = 3,
+ ):
+ """Inits Mixer module.
+
+ A depthwise-convolutional version of a MLP Mixer for processing images.
+
+ Args:
+ input_channels (int): The number of input channels.
+ output_channels (int): The number of output channels.
+ hidden_dim (int, optional): The dimension of the hidden layer. Defaults
+ to 512.
+ num_blocks (int, optional): The number of convolution blocks in the
+ mixer. Defaults to 12.
+ kernel_shape (int, optional): The size of the kernel in the convolution
+ blocks. Defaults to 3.
+ """
+
+ super().__init__()
+ self.hidden_dim = hidden_dim
+ self.num_blocks = num_blocks
+ self.linear = nn.Linear(input_channels, self.hidden_dim)
+ self.layer_norm = nn.LayerNorm(
+ normalized_shape=hidden_dim, elementwise_affine=True#, bias=False
+ )
+ self.linear_1 = nn.Linear(hidden_dim, output_channels)
+ self.blocks = nn.ModuleList([
+ PIPsConvBlock(hidden_dim, kernel_shape) for _ in range(num_blocks)
+ ])
+
+ def forward(self, x):
+ x = self.linear(x)
+ for block in self.blocks:
+ x = block(x)
+
+ x = self.layer_norm(x)
+ x = self.linear_1(x)
+ return x
+
+
+class BlockV2(nn.Module):
+ """ResNet V2 block."""
+
+ def __init__(
+ self,
+ channels_in: int,
+ channels_out: int,
+ stride: Union[int, Sequence[int]],
+ use_projection: bool,
+ ):
+ super().__init__()
+ self.padding = (1, 1, 1, 1)
+ # Handle asymmetric padding created by padding="SAME" in JAX/LAX
+ if stride == 1:
+ self.padding = (1, 1, 1, 1)
+ elif stride == 2:
+ self.padding = (0, 2, 0, 2)
+ else:
+ raise ValueError(
+ 'Check correct padding using padtype_to_pads in jax._src.lax.lax'
+ )
+
+ self.use_projection = use_projection
+ if self.use_projection:
+ self.proj_conv = nn.Conv2d(
+ in_channels=channels_in,
+ out_channels=channels_out,
+ kernel_size=1,
+ stride=stride,
+ padding=0,
+ bias=False,
+ )
+
+ self.bn_0 = nn.InstanceNorm2d(
+ num_features=channels_in,
+ eps=1e-05,
+ momentum=0.1,
+ affine=True,
+ track_running_stats=False,
+ )
+ self.conv_0 = nn.Conv2d(
+ in_channels=channels_in,
+ out_channels=channels_out,
+ kernel_size=3,
+ stride=stride,
+ padding=0,
+ bias=False,
+ )
+
+ self.conv_1 = nn.Conv2d(
+ in_channels=channels_out,
+ out_channels=channels_out,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False,
+ )
+ self.bn_1 = nn.InstanceNorm2d(
+ num_features=channels_out,
+ eps=1e-05,
+ momentum=0.1,
+ affine=True,
+ track_running_stats=False,
+ )
+
+ def forward(self, inputs):
+ x = shortcut = inputs
+
+ x = self.bn_0(x)
+ x = torch.relu(x)
+ if self.use_projection:
+ shortcut = self.proj_conv(x)
+
+ x = self.conv_0(F.pad(x, self.padding))
+
+ x = self.bn_1(x)
+ x = torch.relu(x)
+ # no issues with padding here as this layer always has stride 1
+ x = self.conv_1(x)
+
+ return x + shortcut
+
+
+class BlockGroup(nn.Module):
+ """Higher level block for ResNet implementation."""
+
+ def __init__(
+ self,
+ channels_in: int,
+ channels_out: int,
+ num_blocks: int,
+ stride: Union[int, Sequence[int]],
+ use_projection: bool,
+ ):
+ super().__init__()
+ blocks = []
+ for i in range(num_blocks):
+ blocks.append(
+ BlockV2(
+ channels_in=channels_in if i == 0 else channels_out,
+ channels_out=channels_out,
+ stride=(1 if i else stride),
+ use_projection=(i == 0 and use_projection),
+ )
+ )
+ self.blocks = nn.ModuleList(blocks)
+
+ def forward(self, inputs):
+ out = inputs
+ for block in self.blocks:
+ out = block(out)
+ return out
+
+
+class ResNet(nn.Module):
+ """ResNet model."""
+
+ def __init__(
+ self,
+ blocks_per_group: Sequence[int],
+ channels_per_group: Sequence[int] = (64, 128, 256, 512),
+ use_projection: Sequence[bool] = (True, True, True, True),
+ strides: Sequence[int] = (1, 2, 2, 2),
+ ):
+ """Initializes a ResNet model with customizable layers and configurations.
+
+ This constructor allows defining the architecture of a ResNet model by
+ setting the number of blocks, channels, projection usage, and strides for
+ each group of blocks within the network. It provides flexibility in
+ creating various ResNet configurations.
+
+ Args:
+ blocks_per_group: A sequence of 4 integers, each indicating the number
+ of residual blocks in each group.
+ channels_per_group: A sequence of 4 integers, each specifying the number
+ of output channels for the blocks in each group. Defaults to (64, 128,
+ 256, 512).
+ use_projection: A sequence of 4 booleans, each indicating whether to use
+ a projection shortcut (True) or an identity shortcut (False) in each
+ group. Defaults to (True, True, True, True).
+ strides: A sequence of 4 integers, each specifying the stride size for
+ the convolutions in each group. Defaults to (1, 2, 2, 2).
+
+ The ResNet model created will have 4 groups, with each group's
+ architecture defined by the corresponding elements in these sequences.
+ """
+ super().__init__()
+
+ self.initial_conv = nn.Conv2d(
+ in_channels=3,
+ out_channels=channels_per_group[0],
+ kernel_size=(7, 7),
+ stride=2,
+ padding=0,
+ bias=False,
+ )
+
+ block_groups = []
+ for i, _ in enumerate(strides):
+ block_groups.append(
+ BlockGroup(
+ channels_in=channels_per_group[i - 1] if i > 0 else 64,
+ channels_out=channels_per_group[i],
+ num_blocks=blocks_per_group[i],
+ stride=strides[i],
+ use_projection=use_projection[i],
+ )
+ )
+ self.block_groups = nn.ModuleList(block_groups)
+
+ def forward(self, inputs):
+ result = {}
+ out = inputs
+ out = self.initial_conv(F.pad(out, (2, 4, 2, 4)))
+ result['initial_conv'] = out
+
+ for block_id, block_group in enumerate(self.block_groups):
+ out = block_group(out)
+ result[f'resnet_unit_{block_id}'] = out
+
+ return result
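+
+# Usage sketch (comment only, not executed): the ResNet returns a dict of
+# intermediate activations keyed 'initial_conv' and 'resnet_unit_0' ...
+# 'resnet_unit_3'. TAPIR uses 'resnet_unit_1' as high-resolution features
+# (stride 4 w.r.t. the input) and 'resnet_unit_3' as low-resolution features,
+# e.g.
+#   backbone = ResNet(blocks_per_group=(2, 2, 2, 2),
+#                     channels_per_group=(64, 128, 256, 256),
+#                     strides=(1, 2, 2, 1))
+#   feats = backbone(torch.zeros(1, 3, 256, 256))
+#   feats['resnet_unit_1'].shape  # torch.Size([1, 128, 64, 64])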
diff --git a/data/dot_single_video/dot/models/shelf/tapir_utils/tapir_model.py b/data/dot_single_video/dot/models/shelf/tapir_utils/tapir_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..84d0cba28a73dcbd7dd28b836f51271f113d9511
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/tapir_utils/tapir_model.py
@@ -0,0 +1,712 @@
+# Copyright 2024 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""TAPIR models definition."""
+
+import functools
+from typing import Any, List, Mapping, NamedTuple, Optional, Sequence, Tuple
+
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+from . import nets
+from . import utils
+
+
+class FeatureGrids(NamedTuple):
+ """Feature grids for a video, used to compute trajectories.
+
+ These are per-frame outputs of the encoding resnet.
+
+ Attributes:
+ lowres: Low-resolution features, one for each resolution; 256 channels.
+ hires: High-resolution features, one for each resolution; 128 channels.
+ resolutions: Resolutions used for trajectory computation. There will be one
+ entry for the initialization, and then an entry for each PIPs refinement
+ resolution.
+ """
+
+ lowres: Sequence[torch.Tensor]
+ hires: Sequence[torch.Tensor]
+ resolutions: Sequence[Tuple[int, int]]
+
+
+class QueryFeatures(NamedTuple):
+ """Query features used to compute trajectories.
+
+ These are sampled from the query frames and are a full descriptor of the
+ tracked points. They can be acquired from a query image and then reused in a
+ separate video.
+
+ Attributes:
+ lowres: Low-resolution features, one for each resolution; each has shape
+ [batch, num_query_points, 256]
+ hires: High-resolution features, one for each resolution; each has shape
+ [batch, num_query_points, 128]
+ resolutions: Resolutions used for trajectory computation. There will be one
+ entry for the initialization, and then an entry for each PIPs refinement
+ resolution.
+ """
+
+ lowres: Sequence[torch.Tensor]
+ hires: Sequence[torch.Tensor]
+ resolutions: Sequence[Tuple[int, int]]
+
+
+class TAPIR(nn.Module):
+ """TAPIR model."""
+
+ def __init__(
+ self,
+ bilinear_interp_with_depthwise_conv: bool = False,
+ num_pips_iter: int = 4,
+ pyramid_level: int = 1,
+ mixer_hidden_dim: int = 512,
+ num_mixer_blocks: int = 12,
+ mixer_kernel_shape: int = 3,
+ patch_size: int = 7,
+ softmax_temperature: float = 20.0,
+ parallelize_query_extraction: bool = False,
+ initial_resolution: Tuple[int, int] = (256, 256),
+ blocks_per_group: Sequence[int] = (2, 2, 2, 2),
+ feature_extractor_chunk_size: int = 10,
+ extra_convs: bool = True,
+ ):
+ super().__init__()
+
+ self.highres_dim = 128
+ self.lowres_dim = 256
+ self.bilinear_interp_with_depthwise_conv = (
+ bilinear_interp_with_depthwise_conv
+ )
+ self.parallelize_query_extraction = parallelize_query_extraction
+
+ self.num_pips_iter = num_pips_iter
+ self.pyramid_level = pyramid_level
+ self.patch_size = patch_size
+ self.softmax_temperature = softmax_temperature
+ self.initial_resolution = tuple(initial_resolution)
+ self.feature_extractor_chunk_size = feature_extractor_chunk_size
+
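+ # Note: the backbone configuration below is hard-coded and shadows the
+ # `blocks_per_group` constructor argument; likewise `mixer_hidden_dim`,
+ # `num_mixer_blocks` and `mixer_kernel_shape` are not forwarded to the
+ # PIPs mixer, which therefore uses its default settings.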
+ highres_dim = 128
+ lowres_dim = 256
+ strides = (1, 2, 2, 1)
+ blocks_per_group = (2, 2, 2, 2)
+ channels_per_group = (64, highres_dim, 256, lowres_dim)
+ use_projection = (True, True, True, True)
+
+ self.resnet_torch = nets.ResNet(
+ blocks_per_group=blocks_per_group,
+ channels_per_group=channels_per_group,
+ use_projection=use_projection,
+ strides=strides,
+ )
+ self.torch_cost_volume_track_mods = nn.ModuleDict({
+ 'hid1': torch.nn.Conv2d(1, 16, 3, 1, 1),
+ 'hid2': torch.nn.Conv2d(16, 1, 3, 1, 1),
+ 'hid3': torch.nn.Conv2d(16, 32, 3, 2, 0),
+ 'hid4': torch.nn.Linear(32, 16),
+ 'occ_out': torch.nn.Linear(16, 2),
+ })
+ dim = 4 + self.highres_dim + self.lowres_dim
+ input_dim = dim + (self.pyramid_level + 2) * 49
+ self.torch_pips_mixer = nets.PIPSMLPMixer(input_dim, dim)
+
+ if extra_convs:
+ self.extra_convs = nets.ExtraConvs()
+ else:
+ self.extra_convs = None
+
+ self.cached_feats = None
+
+ def forward(
+ self,
+ video: torch.Tensor,
+ query_points: torch.Tensor,
+ is_training: bool = False,
+ query_chunk_size: Optional[int] = 512,
+ get_query_feats: bool = False,
+ refinement_resolutions: Optional[List[Tuple[int, int]]] = None,
+ cache_features: bool = False,
+ ) -> Mapping[str, torch.Tensor]:
+ """Runs a forward pass of the model.
+
+ Args:
+ video: A 5-D tensor representing a batch of sequences of images.
+ query_points: The query points for which we compute tracks.
+ is_training: Whether we are training.
+ query_chunk_size: When computing cost volumes, break the queries into
+ chunks of this size to save memory.
+ get_query_feats: Return query features for other losses like contrastive.
+ Not supported in the current version.
+ refinement_resolutions: A list of (height, width) tuples. Refinement will
+ be repeated at each specified resolution, in order to achieve high
+ accuracy on resolutions higher than what TAPIR was trained on. If None,
+ reasonable refinement resolutions will be inferred from the input video
+ size.
+ cache_features: If True, recompute the feature grids for this video and
+ store them in self.cached_feats for reuse; if False and a cache already
+ exists, the cached grids are used instead of recomputing them.
+
+ Returns:
+ A dict of outputs, including:
+ occlusion: Occlusion logits, of shape [batch, num_queries, num_frames]
+ where higher indicates more likely to be occluded.
+ tracks: predicted point locations, of shape
+ [batch, num_queries, num_frames, 2], where each point is [x, y]
+ in raster coordinates
+ expected_dist: uncertainty estimate logits, of shape
+ [batch, num_queries, num_frames], where higher indicates more likely
+ to be far from the correct answer.
+ """
+ if get_query_feats:
+ raise ValueError('Get query feats not supported in TAPIR.')
+
+ if self.cached_feats is None or cache_features:
+ feature_grids = self.get_feature_grids(
+ video,
+ is_training,
+ refinement_resolutions,
+ )
+ else:
+ feature_grids = self.cached_feats
+
+ if cache_features:
+ self.cached_feats = feature_grids
+
+ query_features = self.get_query_features(
+ video,
+ is_training,
+ query_points,
+ feature_grids,
+ refinement_resolutions,
+ )
+
+ trajectories = self.estimate_trajectories(
+ video.shape[-3:-1],
+ is_training,
+ feature_grids,
+ query_features,
+ query_points,
+ query_chunk_size,
+ )
+
+ p = self.num_pips_iter
+ out = dict(
+ occlusion=torch.mean(
+ torch.stack(trajectories['occlusion'][p::p]), dim=0
+ ),
+ tracks=torch.mean(torch.stack(trajectories['tracks'][p::p]), dim=0),
+ expected_dist=torch.mean(
+ torch.stack(trajectories['expected_dist'][p::p]), dim=0
+ ),
+ unrefined_occlusion=trajectories['occlusion'][:-1],
+ unrefined_tracks=trajectories['tracks'][:-1],
+ unrefined_expected_dist=trajectories['expected_dist'][:-1],
+ )
+
+ return out
+
+ def get_query_features(
+ self,
+ video: torch.Tensor,
+ is_training: bool,
+ query_points: torch.Tensor,
+ feature_grids: Optional[FeatureGrids] = None,
+ refinement_resolutions: Optional[List[Tuple[int, int]]] = None,
+ ) -> QueryFeatures:
+ """Computes query features, which can be used for estimate_trajectories.
+
+ Args:
+ video: A 5-D tensor representing a batch of sequences of images.
+ is_training: Whether we are training.
+ query_points: The query points for which we compute tracks.
+ feature_grids: If passed, we'll use these feature grids rather than
+ computing new ones.
+ refinement_resolutions: A list of (height, width) tuples. Refinement will
+ be repeated at each specified resolution, in order to achieve high
+ accuracy on resolutions higher than what TAPIR was trained on. If None,
+ reasonable refinement resolutions will be inferred from the input video
+ size.
+
+ Returns:
+ A QueryFeatures object which contains the required features for every
+ required resolution.
+ """
+
+ if feature_grids is None:
+ feature_grids = self.get_feature_grids(
+ video,
+ is_training=is_training,
+ refinement_resolutions=refinement_resolutions,
+ )
+
+ feature_grid = feature_grids.lowres
+ hires_feats = feature_grids.hires
+ resize_im_shape = feature_grids.resolutions
+
+ shape = video.shape
+ # shape is [batch_size, time, height, width, channels]; the 'tyx'
+ # conversion below needs [time, height, width]
+ curr_resolution = (-1, -1)
+ query_feats = []
+ hires_query_feats = []
+ for i, resolution in enumerate(resize_im_shape):
+ if utils.is_same_res(curr_resolution, resolution):
+ query_feats.append(query_feats[-1])
+ hires_query_feats.append(hires_query_feats[-1])
+ continue
+ position_in_grid = utils.convert_grid_coordinates(
+ query_points,
+ shape[1:4],
+ feature_grid[i].shape[1:4],
+ coordinate_format='tyx',
+ )
+ position_in_grid_hires = utils.convert_grid_coordinates(
+ query_points,
+ shape[1:4],
+ hires_feats[i].shape[1:4],
+ coordinate_format='tyx',
+ )
+
+ interp_features = utils.map_coordinates_3d(
+ feature_grid[i], position_in_grid
+ )
+ hires_interp = utils.map_coordinates_3d(
+ hires_feats[i], position_in_grid_hires
+ )
+
+ hires_query_feats.append(hires_interp)
+ query_feats.append(interp_features)
+
+ return QueryFeatures(
+ tuple(query_feats), tuple(hires_query_feats), tuple(resize_im_shape)
+ )
+
+ def get_feature_grids(
+ self,
+ video: torch.Tensor,
+ is_training: bool,
+ refinement_resolutions: Optional[List[Tuple[int, int]]] = None,
+ ) -> FeatureGrids:
+ """Computes feature grids.
+
+ Args:
+ video: A 5-D tensor representing a batch of sequences of images.
+ is_training: Whether we are training.
+ refinement_resolutions: A list of (height, width) tuples. Refinement will
+ be repeated at each specified resolution, to achieve high accuracy on
+ resolutions higher than what TAPIR was trained on. If None, reasonable
+ refinement resolutions will be inferred from the input video size.
+
+ Returns:
+ A FeatureGrids object containing the required features for every
+ required resolution. Note that there will be one more feature grid
+ than there are refinement_resolutions, because there is always a
+ feature grid computed for TAP-Net initialization.
+ """
+ del is_training
+ if refinement_resolutions is None:
+ refinement_resolutions = utils.generate_default_resolutions(
+ video.shape[2:4], self.initial_resolution
+ )
+
+ all_required_resolutions = [self.initial_resolution]
+ all_required_resolutions.extend(refinement_resolutions)
+
+ feature_grid = []
+ hires_feats = []
+ resize_im_shape = []
+ curr_resolution = (-1, -1)
+
+ latent = None
+ hires = None
+ video_resize = None
+ for resolution in all_required_resolutions:
+ if resolution[0] % 8 != 0 or resolution[1] % 8 != 0:
+ raise ValueError('Image resolution must be a multiple of 8.')
+
+ if not utils.is_same_res(curr_resolution, resolution):
+ if utils.is_same_res(curr_resolution, video.shape[-3:-1]):
+ video_resize = video
+ else:
+ video_resize = utils.bilinear(video, resolution)
+
+ curr_resolution = resolution
+ n, f, h, w, c = video_resize.shape
+ video_resize = video_resize.view(n * f, h, w, c).permute(0, 3, 1, 2)
+
+ if self.feature_extractor_chunk_size > 0:
+ latent_list = []
+ hires_list = []
+ chunk_size = self.feature_extractor_chunk_size
+ for start_idx in range(0, video_resize.shape[0], chunk_size):
+ video_chunk = video_resize[start_idx:start_idx + chunk_size]
+ resnet_out = self.resnet_torch(video_chunk)
+
+ u3 = resnet_out['resnet_unit_3'].permute(0, 2, 3, 1).detach()
+ latent_list.append(u3)
+ u1 = resnet_out['resnet_unit_1'].permute(0, 2, 3, 1).detach()
+ hires_list.append(u1)
+
+ latent = torch.cat(latent_list, dim=0)
+ hires = torch.cat(hires_list, dim=0)
+
+ else:
+ resnet_out = self.resnet_torch(video_resize)
+ latent = resnet_out['resnet_unit_3'].permute(0, 2, 3, 1).detach()
+ hires = resnet_out['resnet_unit_1'].permute(0, 2, 3, 1).detach()
+
+ if self.extra_convs:
+ latent = self.extra_convs(latent)
+
+ latent = latent / torch.sqrt(
+ torch.maximum(
+ torch.sum(torch.square(latent), axis=-1, keepdims=True),
+ torch.tensor(1e-12, device=latent.device),
+ )
+ )
+ hires = hires / torch.sqrt(
+ torch.maximum(
+ torch.sum(torch.square(hires), axis=-1, keepdims=True),
+ torch.tensor(1e-12, device=hires.device),
+ )
+ )
+
+ feature_grid.append(latent[None, ...])
+ hires_feats.append(hires[None, ...])
+ resize_im_shape.append(video_resize.shape[2:4])
+
+ return FeatureGrids(
+ tuple(feature_grid), tuple(hires_feats), tuple(resize_im_shape)
+ )
+
+ def estimate_trajectories(
+ self,
+ video_size: Tuple[int, int],
+ is_training: bool,
+ feature_grids: FeatureGrids,
+ query_features: QueryFeatures,
+ query_points_in_video: Optional[torch.Tensor],
+ query_chunk_size: Optional[int] = None,
+ ) -> Mapping[str, Any]:
+ """Estimates trajectories given features for a video and query features.
+
+ Args:
+ video_size: A 2-tuple containing the original [height, width] of the
+ video. Predictions will be scaled with respect to this resolution.
+ is_training: Whether we are training.
+ feature_grids: a FeatureGrids object computed for the given video.
+ query_features: a QueryFeatures object computed for the query points.
+ query_points_in_video: If provided, assume that the query points come from
+ the same video as feature_grids, and therefore constrain the resulting
+ trajectories to (approximately) pass through them.
+ query_chunk_size: When computing cost volumes, break the queries into
+ chunks of this size to save memory.
+
+ Returns:
+ A dict of outputs, including:
+ occlusion: Occlusion logits, of shape [batch, num_queries, num_frames]
+ where higher indicates more likely to be occluded.
+ tracks: predicted point locations, of shape
+ [batch, num_queries, num_frames, 2], where each point is [x, y]
+ in raster coordinates
+ expected_dist: uncertainty estimate logits, of shape
+ [batch, num_queries, num_frames], where higher indicates more likely
+ to be far from the correct answer.
+ """
+ del is_training
+
+ def train2orig(x):
+ return utils.convert_grid_coordinates(
+ x,
+ self.initial_resolution[::-1],
+ video_size[::-1],
+ coordinate_format='xy',
+ )
+
+ occ_iters = []
+ pts_iters = []
+ expd_iters = []
+ num_iters = self.num_pips_iter * (len(feature_grids.lowres) - 1)
+ for _ in range(num_iters + 1):
+ occ_iters.append([])
+ pts_iters.append([])
+ expd_iters.append([])
+
+ infer = functools.partial(
+ self.tracks_from_cost_volume,
+ im_shp=feature_grids.lowres[0].shape[0:2]
+ + self.initial_resolution
+ + (3,),
+ )
+
+ num_queries = query_features.lowres[0].shape[1]
+ perm = torch.randperm(num_queries)
+ inv_perm = torch.zeros_like(perm)
+ inv_perm[perm] = torch.arange(num_queries)
+
+ for ch in range(0, num_queries, query_chunk_size):
+ perm_chunk = perm[ch: ch + query_chunk_size]
+ chunk = query_features.lowres[0][:, perm_chunk]
+
+ if query_points_in_video is not None:
+ infer_query_points = query_points_in_video[
+ :, perm[ch: ch + query_chunk_size]
+ ]
+ num_frames = feature_grids.lowres[0].shape[1]
+ infer_query_points = utils.convert_grid_coordinates(
+ infer_query_points,
+ (num_frames,) + video_size,
+ (num_frames,) + self.initial_resolution,
+ coordinate_format='tyx',
+ )
+ else:
+ infer_query_points = None
+
+ points, occlusion, expected_dist = infer(
+ chunk,
+ feature_grids.lowres[0],
+ infer_query_points,
+ )
+ pts_iters[0].append(train2orig(points))
+ occ_iters[0].append(occlusion)
+ expd_iters[0].append(expected_dist)
+
+ mixer_feats = None
+ for i in range(num_iters):
+ feature_level = i // self.num_pips_iter + 1
+ queries = [
+ query_features.hires[feature_level][:, perm_chunk],
+ query_features.lowres[feature_level][:, perm_chunk],
+ ]
+ for _ in range(self.pyramid_level):
+ queries.append(queries[-1])
+ pyramid = [
+ feature_grids.hires[feature_level],
+ feature_grids.lowres[feature_level],
+ ]
+ for _ in range(self.pyramid_level):
+ pyramid.append(
+ F.avg_pool3d(
+ pyramid[-1],
+ kernel_size=(2, 2, 1),
+ stride=(2, 2, 1),
+ padding=0,
+ )
+ )
+
+ refined = self.refine_pips(
+ queries,
+ None,
+ pyramid,
+ points,
+ occlusion,
+ expected_dist,
+ orig_hw=self.initial_resolution,
+ last_iter=mixer_feats,
+ mixer_iter=i,
+ resize_hw=feature_grids.resolutions[feature_level],
+ )
+ points, occlusion, expected_dist, mixer_feats = refined
+ pts_iters[i + 1].append(train2orig(points))
+ occ_iters[i + 1].append(occlusion)
+ expd_iters[i + 1].append(expected_dist)
+ if (i + 1) % self.num_pips_iter == 0:
+ mixer_feats = None
+ expected_dist = expd_iters[0][-1]
+ occlusion = occ_iters[0][-1]
+
+ occlusion = []
+ points = []
+ expd = []
+ for i, _ in enumerate(occ_iters):
+ occlusion.append(torch.cat(occ_iters[i], dim=1)[:, inv_perm])
+ points.append(torch.cat(pts_iters[i], dim=1)[:, inv_perm])
+ expd.append(torch.cat(expd_iters[i], dim=1)[:, inv_perm])
+
+ out = dict(
+ occlusion=occlusion,
+ tracks=points,
+ expected_dist=expd,
+ )
+ return out
+
+ def refine_pips(
+ self,
+ target_feature,
+ frame_features,
+ pyramid,
+ pos_guess,
+ occ_guess,
+ expd_guess,
+ orig_hw,
+ last_iter=None,
+ mixer_iter=0.0,
+ resize_hw=None,
+ ):
+ del frame_features
+ del mixer_iter
+ orig_h, orig_w = orig_hw
+ resized_h, resized_w = resize_hw
+ corrs_pyr = []
+ assert len(target_feature) == len(pyramid)
+ for pyridx, (query, grid) in enumerate(zip(target_feature, pyramid)):
+ # note: interp needs [y,x]
+ coords = utils.convert_grid_coordinates(
+ pos_guess, (orig_w, orig_h), grid.shape[-2:-4:-1]
+ )
+ coords = torch.flip(coords, dims=(-1,))
+ last_iter_query = None
+ if last_iter is not None:
+ if pyridx == 0:
+ last_iter_query = last_iter[..., : self.highres_dim]
+ else:
+ last_iter_query = last_iter[..., self.highres_dim:]
+
+ ctxy, ctxx = torch.meshgrid(
+ torch.arange(-3, 4), torch.arange(-3, 4), indexing='ij'
+ )
+ ctx = torch.stack([ctxy, ctxx], dim=-1)
+ ctx = ctx.reshape(-1, 2).to(coords.device)
+ coords2 = coords.unsqueeze(3) + ctx.unsqueeze(0).unsqueeze(0).unsqueeze(0)
+ neighborhood = utils.map_coordinates_2d(grid, coords2)
+
+ # s is spatial context size
+ if last_iter_query is None:
+ patches = torch.einsum('bnfsc,bnc->bnfs', neighborhood, query)
+ else:
+ patches = torch.einsum(
+ 'bnfsc,bnfc->bnfs', neighborhood, last_iter_query
+ )
+
+ corrs_pyr.append(patches)
+ corrs_pyr = torch.concatenate(corrs_pyr, dim=-1)
+
+ corrs_chunked = corrs_pyr
+ pos_guess_input = pos_guess
+ occ_guess_input = occ_guess[..., None]
+ expd_guess_input = expd_guess[..., None]
+
+ # mlp_input is batch, num_points, num_chunks, frames_per_chunk, channels
+ if last_iter is None:
+ both_feature = torch.cat([target_feature[0], target_feature[1]], axis=-1)
+ mlp_input_features = torch.tile(
+ both_feature.unsqueeze(2), (1, 1, corrs_chunked.shape[-2], 1)
+ )
+ else:
+ mlp_input_features = last_iter
+
+ pos_guess_input = torch.zeros_like(pos_guess_input)
+
+ mlp_input = torch.cat(
+ [
+ pos_guess_input,
+ occ_guess_input,
+ expd_guess_input,
+ mlp_input_features,
+ corrs_chunked,
+ ],
+ axis=-1,
+ )
+ x = utils.einshape('bnfc->(bn)fc', mlp_input)
+ res = self.torch_pips_mixer(x.float())
+ res = utils.einshape('(bn)fc->bnfc', res, b=mlp_input.shape[0])
+
+ pos_update = utils.convert_grid_coordinates(
+ res[..., :2].detach(),
+ (resized_w, resized_h),
+ (orig_w, orig_h),
+ )
+ return (
+ pos_update + pos_guess,
+ res[..., 2] + occ_guess,
+ res[..., 3] + expd_guess,
+ res[..., 4:] + (mlp_input_features if last_iter is None else last_iter),
+ )
+
+ def tracks_from_cost_volume(
+ self,
+ interp_feature: torch.Tensor,
+ feature_grid: torch.Tensor,
+ query_points: Optional[torch.Tensor],
+ im_shp=None,
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Converts features into tracks by computing a cost volume.
+
+ The computed cost volume will have shape
+ [batch, num_queries, time, height, width], which can be very
+ memory intensive.
+
+ Args:
+ interp_feature: A tensor of features for each query point, of shape
+ [batch, num_queries, channels].
+ feature_grid: A tensor of features for the video, of shape [batch, time,
+ height, width, channels].
+ query_points: When computing tracks, we assume these points are given as
+ ground truth and we reproduce them exactly. This is a set of points of
+ shape [batch, num_points, 3], where each entry is [t, y, x] in frame/
+ raster coordinates.
+ im_shp: The shape of the original image, i.e., [batch, num_frames,
+ height, width, 3].
+
+ Returns:
+ A 3-tuple of the inferred points (of shape
+ [batch, num_points, num_frames, 2] where each point is [x, y]),
+ the inferred occlusion logits (of shape [batch, num_points, num_frames],
+ where higher means more likely to be occluded), and the expected-distance
+ logits (same shape, where higher means the prediction is more likely to
+ be far from the correct answer).
+ """
+
+ mods = self.torch_cost_volume_track_mods
+ cost_volume = torch.einsum(
+ 'bnc,bthwc->tbnhw',
+ interp_feature,
+ feature_grid,
+ )
+
+ shape = cost_volume.shape
+ batch_size, num_points = cost_volume.shape[1:3]
+ cost_volume = utils.einshape('tbnhw->(tbn)hw1', cost_volume)
+
+ cost_volume = cost_volume.permute(0, 3, 1, 2)
+ occlusion = mods['hid1'](cost_volume)
+ occlusion = torch.nn.functional.relu(occlusion)
+
+ pos = mods['hid2'](occlusion)
+ pos = pos.permute(0, 2, 3, 1)
+ pos_rshp = utils.einshape('(tb)hw1->t(b)hw1', pos, t=shape[0])
+
+ pos = utils.einshape(
+ 't(bn)hw1->bnthw', pos_rshp, b=batch_size, n=num_points
+ )
+ pos_sm = pos.reshape(pos.size(0), pos.size(1), pos.size(2), -1)
+ softmaxed = F.softmax(pos_sm * self.softmax_temperature, dim=-1)
+ pos = softmaxed.view_as(pos)
+
+ points = utils.heatmaps_to_points(pos, im_shp, query_points=query_points)
+
+ occlusion = torch.nn.functional.pad(occlusion, (0, 2, 0, 2))
+ occlusion = mods['hid3'](occlusion)
+ occlusion = torch.nn.functional.relu(occlusion)
+ occlusion = torch.mean(occlusion, dim=(-1, -2))
+ occlusion = mods['hid4'](occlusion)
+ occlusion = torch.nn.functional.relu(occlusion)
+ occlusion = mods['occ_out'](occlusion)
+
+ expected_dist = utils.einshape(
+ '(tbn)1->bnt', occlusion[..., 1:2], n=shape[2], t=shape[0]
+ )
+ occlusion = utils.einshape(
+ '(tbn)1->bnt', occlusion[..., 0:1], n=shape[2], t=shape[0]
+ )
+ return points, occlusion, expected_dist
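+
+# Usage sketch (comment only, not executed; the variable names are
+# hypothetical and the shapes follow the docstrings above):
+#   model = TAPIR()
+#   video = torch.zeros(1, 24, 256, 256, 3)   # [batch, frames, height, width, 3]
+#   queries = torch.zeros(1, 100, 3)          # [batch, points, (t, y, x)]
+#   out = model(video, queries)
+#   out['tracks'].shape                       # [1, 100, 24, 2], points as (x, y)
+#   out['occlusion'].shape                    # [1, 100, 24], occlusion logits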
diff --git a/data/dot_single_video/dot/models/shelf/tapir_utils/utils.py b/data/dot_single_video/dot/models/shelf/tapir_utils/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c1b68db3a9dc35c02ec10d691c8ce598016a446
--- /dev/null
+++ b/data/dot_single_video/dot/models/shelf/tapir_utils/utils.py
@@ -0,0 +1,317 @@
+# Copyright 2024 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Pytorch model utilities."""
+
+from typing import Any, Sequence, Union
+from einshape.src import abstract_ops
+from einshape.src import backend
+import numpy as np
+import torch
+import torch.nn.functional as F
+
+
+def bilinear(x: torch.Tensor, resolution: tuple[int, int]) -> torch.Tensor:
+ """Resizes a 5D tensor using bilinear interpolation.
+
+ Args:
+ x: A 5D tensor of shape (B, T, H, W, C) where B is batch size, T is
+ time, H is height, W is width, and C is the number of channels.
+ resolution: The target resolution as a tuple (new_height, new_width).
+
+ Returns:
+ The resized tensor.
+ """
+ b, t, h, w, c = x.size()
+ x = x.permute(0, 1, 4, 2, 3).reshape(b, t * c, h, w)
+ x = F.interpolate(x, size=resolution, mode='bilinear', align_corners=False)
+ b, _, h, w = x.size()
+ x = x.reshape(b, t, c, h, w).permute(0, 1, 3, 4, 2)
+ return x
+
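+# Example (comment only, not executed): resizing a (B, T, H, W, C) video
+# tensor with the helper above:
+#   video = torch.zeros(1, 8, 256, 320, 3)
+#   small = bilinear(video, (128, 160))  # -> torch.Size([1, 8, 128, 160, 3])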
+
+def map_coordinates_3d(
+ feats: torch.Tensor, coordinates: torch.Tensor
+) -> torch.Tensor:
+ """Maps 3D coordinates to corresponding features using bilinear interpolation.
+
+ Args:
+ feats: A 5D tensor of features with shape (B, W, H, D, C), where B is batch
+ size, W is width, H is height, D is depth, and C is the number of
+ channels.
+ coordinates: A 3D tensor of coordinates with shape (B, N, 3), where N is the
+ number of coordinates and the last dimension represents (W, H, D)
+ coordinates.
+
+ Returns:
+ The mapped features tensor.
+ """
+ x = feats.permute(0, 4, 1, 2, 3)
+ y = coordinates[:, :, None, None, :].float()
+ y[..., 0] += 0.5
+ y = 2 * (y / torch.tensor(x.shape[2:], device=y.device)) - 1
+ y = torch.flip(y, dims=(-1,))
+ out = (
+ F.grid_sample(
+ x, y, mode='bilinear', align_corners=False, padding_mode='border'
+ )
+ .squeeze(dim=(3, 4))
+ .permute(0, 2, 1)
+ )
+ return out
+
+
+def map_coordinates_2d(
+ feats: torch.Tensor, coordinates: torch.Tensor
+) -> torch.Tensor:
+ """Maps 2D coordinates to feature maps using bilinear interpolation.
+
+ The function performs bilinear interpolation on the feature maps (`feats`)
+ at the specified `coordinates`, which are given in pixel units and are
+ rescaled internally to the [-1, 1] range expected by `grid_sample`. The
+ result is a tensor of sampled features corresponding to these coordinates.
+
+ Args:
+ feats (Tensor): A 5D tensor of shape (N, T, H, W, C) representing feature
+ maps, where N is the batch size, T is the number of frames, H and W are
+ height and width, and C is the number of channels.
+ coordinates (Tensor): A 5D tensor of shape (N, P, T, S, XY) representing
+ coordinates, where N is the batch size, P is the number of points, T is
+ the number of frames, S is the number of samples, and XY represents the 2D
+ coordinates.
+
+ Returns:
+ Tensor: A 5D tensor of the sampled features corresponding to the
+ given coordinates, of shape (N, P, T, S, C).
+ """
+ n, t, h, w, c = feats.shape
+ x = feats.permute(0, 1, 4, 2, 3).view(n * t, c, h, w)
+
+ n, p, t, s, xy = coordinates.shape
+ y = coordinates.permute(0, 2, 1, 3, 4).view(n * t, p, s, xy)
+ y = 2 * (y / h) - 1
+ y = torch.flip(y, dims=(-1,)).float()
+
+ out = F.grid_sample(
+ x, y, mode='bilinear', align_corners=False, padding_mode='zeros'
+ )
+ _, c, _, _ = out.shape
+ out = out.permute(0, 2, 3, 1).view(n, t, p, s, c).permute(0, 2, 1, 3, 4)
+
+ return out
+
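+# Shape sketch (comment only, not executed): with feature maps of shape
+# (N, T, H, W, C) and pixel-space sample locations of shape (N, P, T, S, 2),
+# map_coordinates_2d returns sampled features of shape (N, P, T, S, C), e.g.
+#   feats = torch.zeros(1, 8, 64, 64, 32)
+#   coords = torch.zeros(1, 16, 8, 49, 2)
+#   out = map_coordinates_2d(feats, coords)  # -> torch.Size([1, 16, 8, 49, 32])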
+
+def soft_argmax_heatmap_batched(softmax_val, threshold=5):
+ """Test if two image resolutions are the same."""
+ b, h, w, d1, d2 = softmax_val.shape
+ y, x = torch.meshgrid(
+ torch.arange(d1, device=softmax_val.device),
+ torch.arange(d2, device=softmax_val.device),
+ indexing='ij',
+ )
+ coords = torch.stack([x + 0.5, y + 0.5], dim=-1).to(softmax_val.device)
+ softmax_val_flat = softmax_val.reshape(b, h, w, -1)
+ argmax_pos = torch.argmax(softmax_val_flat, dim=-1)
+
+ pos = coords.reshape(-1, 2)[argmax_pos]
+ valid = (
+ torch.sum(
+ torch.square(
+ coords[None, None, None, :, :, :] - pos[:, :, :, None, None, :]
+ ),
+ dim=-1,
+ keepdims=True,
+ )
+ < threshold**2
+ )
+
+ weighted_sum = torch.sum(
+ coords[None, None, None, :, :, :]
+ * valid
+ * softmax_val[:, :, :, :, :, None],
+ dim=(3, 4),
+ )
+ sum_of_weights = torch.maximum(
+ torch.sum(valid * softmax_val[:, :, :, :, :, None], dim=(3, 4)),
+ torch.tensor(1e-12, device=softmax_val.device),
+ )
+ return weighted_sum / sum_of_weights
+
+
+def heatmaps_to_points(
+ all_pairs_softmax,
+ image_shape,
+ threshold=5,
+ query_points=None,
+):
+ """Convert heatmaps to points using soft argmax."""
+
+ out_points = soft_argmax_heatmap_batched(all_pairs_softmax, threshold)
+ feature_grid_shape = all_pairs_softmax.shape[1:]
+ # Note: out_points is now [x, y]; we need to divide by [width, height].
+ # image_shape[3] is width and image_shape[2] is height.
+ out_points = convert_grid_coordinates(
+ out_points.detach(),
+ feature_grid_shape[3:1:-1],
+ image_shape[3:1:-1],
+ )
+ assert feature_grid_shape[1] == image_shape[1]
+ if query_points is not None:
+ # The [..., 0:1] is because we only care about the frame index.
+ query_frame = convert_grid_coordinates(
+ query_points.detach(),
+ image_shape[1:4],
+ feature_grid_shape[1:4],
+ coordinate_format='tyx',
+ )[..., 0:1]
+
+ query_frame = torch.round(query_frame)
+ frame_indices = torch.arange(image_shape[1], device=query_frame.device)[
+ None, None, :
+ ]
+ is_query_point = query_frame == frame_indices
+
+ is_query_point = is_query_point[:, :, :, None]
+ out_points = (
+ out_points * ~is_query_point
+ + torch.flip(query_points[:, :, None], dims=(-1,))[..., 0:2]
+ * is_query_point
+ )
+
+ return out_points
+
+
+def is_same_res(r1, r2):
+ """Test if two image resolutions are the same."""
+ return all([x == y for x, y in zip(r1, r2)])
+
+
+def convert_grid_coordinates(
+ coords: torch.Tensor,
+ input_grid_size: Sequence[int],
+ output_grid_size: Sequence[int],
+ coordinate_format: str = 'xy',
+) -> torch.Tensor:
+ """Convert grid coordinates to correct format."""
+ if isinstance(input_grid_size, tuple):
+ input_grid_size = torch.tensor(input_grid_size, device=coords.device)
+ if isinstance(output_grid_size, tuple):
+ output_grid_size = torch.tensor(output_grid_size, device=coords.device)
+
+ if coordinate_format == 'xy':
+ if input_grid_size.shape[0] != 2 or output_grid_size.shape[0] != 2:
+ raise ValueError(
+ 'If coordinate_format is xy, the shapes must be length 2.'
+ )
+ elif coordinate_format == 'tyx':
+ if input_grid_size.shape[0] != 3 or output_grid_size.shape[0] != 3:
+ raise ValueError(
+ 'If coordinate_format is tyx, the shapes must be length 3.'
+ )
+ if input_grid_size[0] != output_grid_size[0]:
+ raise ValueError('converting frame count is not supported.')
+ else:
+ raise ValueError('Recognized coordinate formats are xy and tyx.')
+
+ position_in_grid = coords
+ position_in_grid = position_in_grid * output_grid_size / input_grid_size
+
+ return position_in_grid
+
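+# Example (comment only, not executed): rescaling (x, y) points from a
+# 32x32 feature grid to a 256x256 image, with grid sizes given in
+# (width, height) order for the default 'xy' format:
+#   pts = torch.tensor([[8.0, 16.0]])
+#   convert_grid_coordinates(pts, (32, 32), (256, 256))  # -> [[64., 128.]]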
+
+class _TorchBackend(backend.Backend[torch.Tensor]):
+ """Einshape backend implementation for PyTorch tensors."""
+
+ # https://github.com/vacancy/einshape/blob/main/einshape/src/pytorch/pytorch_ops.py
+
+ def reshape(self, x: torch.Tensor, op: abstract_ops.Reshape) -> torch.Tensor:
+ return x.reshape(op.shape)
+
+ def transpose(
+ self, x: torch.Tensor, op: abstract_ops.Transpose
+ ) -> torch.Tensor:
+ return x.permute(op.perm)
+
+ def broadcast(
+ self, x: torch.Tensor, op: abstract_ops.Broadcast
+ ) -> torch.Tensor:
+ shape = op.transform_shape(x.shape)
+ for axis_position in sorted(op.axis_sizes.keys()):
+ x = x.unsqueeze(axis_position)
+ return x.expand(shape)
+
+
+def einshape(
+ equation: str, value: Union[torch.Tensor, Any], **index_sizes: int
+) -> torch.Tensor:
+ """Reshapes `value` according to the given Shape Equation.
+
+ Args:
+ equation: The Shape Equation specifying the index regrouping and reordering.
+ value: Input tensor, or tensor-like object.
+ **index_sizes: Sizes of indices, where they cannot be inferred from
+ `input_shape`.
+
+ Returns:
+ Tensor derived from `value` by reshaping as specified by `equation`.
+ """
+ if not isinstance(value, torch.Tensor):
+ value = torch.tensor(value)
+ return _TorchBackend().exec(equation, value, value.shape, **index_sizes)
+
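+# Example (comment only, not executed): einshape groups and splits axes via a
+# shape equation, e.g. flattening the batch and point axes and restoring them:
+#   x = torch.zeros(2, 5, 24, 388)
+#   flat = einshape('bnfc->(bn)fc', x)          # -> (10, 24, 388)
+#   back = einshape('(bn)fc->bnfc', flat, b=2)  # -> (2, 5, 24, 388)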
+
+def generate_default_resolutions(full_size, train_size, num_levels=None):
+ """Generate a list of logarithmically-spaced resolutions.
+
+ Generated resolutions are between train_size and full_size, inclusive, with
+ num_levels different resolutions total. Useful for generating the input to
+ refinement_resolutions in PIPs.
+
+ Args:
+ full_size: 2-tuple of ints. The full image size desired.
+ train_size: 2-tuple of ints. The smallest refinement level. Should
+ typically match the training resolution, which is (256, 256) for TAPIR.
+ num_levels: number of levels. Typically each resolution should be less than
+ twice the size of prior resolutions.
+
+ Returns:
+ A list of resolutions.
+ """
+ if all([x == y for x, y in zip(train_size, full_size)]):
+ return [train_size]
+
+ if num_levels is None:
+ size_ratio = np.array(full_size) / np.array(train_size)
+ num_levels = int(np.ceil(np.max(np.log2(size_ratio))) + 1)
+
+ if num_levels <= 1:
+ return [train_size]
+
+ h, w = full_size[0:2]
+ if h % 8 != 0 or w % 8 != 0:
+ print(
+ 'Warning: output size is not a multiple of 8. Final layer '
+ + 'will round size down.'
+ )
+ ll_h, ll_w = train_size[0:2]
+
+ sizes = []
+ for i in range(num_levels):
+ size = (
+ int(round((ll_h * (h / ll_h) ** (i / (num_levels - 1))) // 8)) * 8,
+ int(round((ll_w * (w / ll_w) ** (i / (num_levels - 1))) // 8)) * 8,
+ )
+ sizes.append(size)
+ return sizes
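+
+# Worked example (comment only, not executed): for a 480x856 video and the
+# TAPIR training size (256, 256), the size ratio is about (1.9, 3.3), so
+# num_levels = ceil(log2(3.34)) + 1 = 3 and the schedule of 8-aligned,
+# log-spaced resolutions is [(256, 256), (344, 464), (480, 856)].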
diff --git a/data/dot_single_video/dot/utils/__init__.py b/data/dot_single_video/dot/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/data/dot_single_video/dot/utils/io.py b/data/dot_single_video/dot/utils/io.py
new file mode 100644
index 0000000000000000000000000000000000000000..2d3ee176861043fcee5e7f52aa2ea0c6b4c75c3f
--- /dev/null
+++ b/data/dot_single_video/dot/utils/io.py
@@ -0,0 +1,129 @@
+import os
+import argparse
+from PIL import Image
+from glob import glob
+import numpy as np
+import json
+import torch
+import torchvision
+from torch.nn import functional as F
+
+
+def create_folder(path, verbose=False, exist_ok=True, safe=True):
+ if os.path.exists(path) and not exist_ok:
+ if not safe:
+ raise OSError
+ return False
+ try:
+ os.makedirs(path)
+ except OSError:
+ if not safe:
+ raise
+ return False
+ if verbose:
+ print(f"Created folder: {path}")
+ return True
+
+
+def read_video(path, start_step=0, time_steps=None, channels="first", exts=("jpg", "png"), resolution=None):
+ if path.endswith(".mp4"):
+ video = read_video_from_file(path, start_step, time_steps, channels, resolution)
+ else:
+ video = read_video_from_folder(path, start_step, time_steps, channels, resolution, exts)
+ return video
+
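+# Usage sketch (comment only, not executed; the paths are illustrative):
+# read_video dispatches on the path type, e.g.
+#   video = read_video("clip.mp4")                       # (T, C, H, W), values in [0, 1]
+#   frames = read_video("frames_dir", channels="last")   # (T, H, W, C), values in [0, 1]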
+
+def read_video_from_file(path, start_step, time_steps, channels, resolution):
+ video, _, _ = torchvision.io.read_video(path, output_format="TCHW", pts_unit="sec")
+ if time_steps is None:
+ time_steps = len(video) - start_step
+ video = video[start_step: start_step + time_steps]
+ # Convert to float before interpolation (bilinear resizing is not defined
+ # for uint8 tensors) and scale to [0, 1].
+ video = video.float() / 255.
+ if resolution is not None:
+ video = F.interpolate(video, size=resolution, mode="bilinear")
+ if channels == "last":
+ video = video.permute(0, 2, 3, 1)
+ return video
+
+
+def read_video_from_folder(path, start_step, time_steps, channels, resolution, exts):
+ paths = []
+ for ext in exts:
+ paths += glob(os.path.join(path, f"*.{ext}"))
+ paths = sorted(paths)
+ if time_steps is None:
+ time_steps = len(paths) - start_step
+ video = []
+ for step in range(start_step, start_step + time_steps):
+ frame = read_frame(paths[step], resolution, channels)
+ video.append(frame)
+ video = torch.stack(video)
+ return video
+
+
+def read_frame(path, resolution=None, channels="first"):
+ frame = Image.open(path).convert('RGB')
+ frame = np.array(frame)
+ frame = frame.astype(np.float32)
+ frame = frame / 255
+ frame = torch.from_numpy(frame)
+ frame = frame.permute(2, 0, 1)
+ if resolution is not None:
+ frame = F.interpolate(frame[None], size=resolution, mode="bilinear")[0]
+ if channels == "last":
+ frame = frame.permute(1, 2, 0)
+ return frame
+
+
+def write_video(video, path, channels="first", zero_padded=True, ext="png", dtype="torch"):
+ if dtype == "numpy":
+ video = torch.from_numpy(video)
+ if path.endswith(".mp4"):
+ write_video_to_file(video, path, channels)
+ else:
+ write_video_to_folder(video, path, channels, zero_padded, ext)
+
+
+def write_video_to_file(video, path, channels):
+ create_folder(os.path.dirname(path))
+ if channels == "first":
+ video = video.permute(0, 2, 3, 1)
+ video = (video.cpu() * 255.).to(torch.uint8)
+ torchvision.io.write_video(path, video, 24, "h264", options={"pix_fmt": "yuv420p", "crf": "23"})
+ return video
+
+
+def write_video_to_folder(video, path, channels, zero_padded, ext):
+ create_folder(path)
+ time_steps = video.shape[0]
+ for step in range(time_steps):
+ pad = "0" * (len(str(time_steps)) - len(str(step))) if zero_padded else ""
+ frame_path = os.path.join(path, f"{pad}{step}.{ext}")
+ write_frame(video[step], frame_path, channels)
+
+
+def write_frame(frame, path, channels="first"):
+ create_folder(os.path.dirname(path))
+ frame = frame.cpu().numpy()
+ if channels == "first":
+ frame = np.transpose(frame, (1, 2, 0))
+ frame = np.clip(np.round(frame * 255), 0, 255).astype(np.uint8)
+ frame = Image.fromarray(frame)
+ frame.save(path)
+
+
+def read_tracks(path):
+ return np.load(path)
+
+
+def write_tracks(tracks, path):
+ np.save(path, tracks)
+
+
+def read_config(path):
+ with open(path, 'r') as f:
+ config = json.load(f)
+ args = argparse.Namespace(**config)
+ return args
+
+
diff --git a/data/dot_single_video/dot/utils/log.py b/data/dot_single_video/dot/utils/log.py
new file mode 100644
index 0000000000000000000000000000000000000000..12de66c0fbbbcabe34049ca8988aaaa6a335cb3a
--- /dev/null
+++ b/data/dot_single_video/dot/utils/log.py
@@ -0,0 +1,57 @@
+import os
+os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
+
+import torch
+from torchvision.utils import make_grid
+from torch.utils.tensorboard import SummaryWriter
+
+from dot.utils.plot import to_rgb
+
+
+def detach(tensor):
+ if isinstance(tensor, torch.Tensor):
+ return tensor.detach().cpu()
+ return tensor
+
+
+def number(tensor):
+ if isinstance(tensor, torch.Tensor) and tensor.isnan().any():
+ return torch.zeros_like(tensor)
+ return tensor
+
+
+class Logger():
+ def __init__(self, args):
+ self.writer = SummaryWriter(args.log_path)
+ self.factor = args.log_factor
+ self.world_size = args.world_size
+
+ def log_scalar(self, name, scalar, global_iter):
+ if scalar is not None:
+ if type(scalar) == list:
+ for i, x in enumerate(scalar):
+ self.log_scalar(f"{name}_{i}", x, global_iter)
+ else:
+ self.writer.add_scalar(name, number(detach(scalar)), global_iter)
+
+ def log_scalars(self, name, scalars, global_iter):
+ for s in scalars:
+ self.log_scalar(f"{name}/{s}", scalars[s], global_iter)
+
+ def log_image(self, name, tensor, mode, nrow, global_iter, pos=None, occ=None):
+ tensor = detach(tensor)
+ tensor = to_rgb(tensor, mode, pos, occ)
+ grid = make_grid(tensor, nrow=nrow, normalize=False, value_range=[0, 1], pad_value=0)
+ grid = torch.nn.functional.interpolate(grid[None], scale_factor=self.factor)[0]
+ self.writer.add_image(name, grid, global_iter)
+
+ def log_video(self, name, tensor, mode, nrow, global_iter, fps=4, pos=None, occ=None):
+ tensor = detach(tensor)
+ tensor = to_rgb(tensor, mode, pos, occ, is_video=True)
+ grid = []
+ for i in range(tensor.shape[1]):
+ grid.append(make_grid(tensor[:, i], nrow=nrow, normalize=False, value_range=[0, 1], pad_value=0))
+ grid = torch.stack(grid, dim=0)
+ grid = torch.nn.functional.interpolate(grid, scale_factor=self.factor)
+ grid = grid[None]
+ self.writer.add_video(name, grid, global_iter, fps=fps)
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/metrics/__init__.py b/data/dot_single_video/dot/utils/metrics/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..c20d4718a6848cd4444fc59cacb223a1aec7eb0c
--- /dev/null
+++ b/data/dot_single_video/dot/utils/metrics/__init__.py
@@ -0,0 +1,7 @@
+def save_metrics(metrics, path):
+ names = list(metrics.keys())
+ num_values = len(metrics[names[0]])
+ with open(path, "w") as f:
+ f.write(",".join(names) + "\n")
+ for i in range(num_values):
+ f.write(",".join([str(metrics[name][i]) for name in names]) + "\n")
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/metrics/cvo_metrics.py b/data/dot_single_video/dot/utils/metrics/cvo_metrics.py
new file mode 100644
index 0000000000000000000000000000000000000000..e177c9a94159d3377cd278fee4725be83620f2ba
--- /dev/null
+++ b/data/dot_single_video/dot/utils/metrics/cvo_metrics.py
@@ -0,0 +1,32 @@
+import torch
+
+
+def compute_metrics(gt, pred, time):
+ epe_all, epe_occ, epe_vis = get_epe(pred["flow"], gt["flow"], gt["alpha"])
+ iou = get_iou(gt["alpha"], pred["alpha"])
+ metrics = {
+ "epe_all": epe_all.cpu().numpy(),
+ "epe_occ": epe_occ.cpu().numpy(),
+ "epe_vis": epe_vis.cpu().numpy(),
+ "iou": iou.cpu().numpy(),
+ "time": time
+ }
+ return metrics
+
+
+def get_epe(pred, label, vis):
+ diff = torch.norm(pred - label, p=2, dim=-1, keepdim=True)
+ epe_all = torch.mean(diff, dim=(1, 2, 3))
+ vis = vis[..., None]
+ epe_occ = torch.sum(diff * (1 - vis), dim=(1, 2, 3)) / torch.sum((1 - vis), dim=(1, 2, 3))
+ epe_vis = torch.sum((diff * vis), dim=(1, 2, 3)) / torch.sum(vis, dim=(1, 2, 3))
+ return epe_all, epe_occ, epe_vis
+
+
+def get_iou(vis1, vis2):
+ occ1 = (1 - vis1).bool()
+ occ2 = (1 - vis2).bool()
+ intersection = (occ1 & occ2).float().sum(dim=[1, 2])
+ union = (occ1 | occ2).float().sum(dim=[1, 2])
+ iou = intersection / union
+ return iou
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/metrics/tap_metrics.py b/data/dot_single_video/dot/utils/metrics/tap_metrics.py
new file mode 100644
index 0000000000000000000000000000000000000000..d296459b0bfc9ea616ef7143c50da8088c77c326
--- /dev/null
+++ b/data/dot_single_video/dot/utils/metrics/tap_metrics.py
@@ -0,0 +1,152 @@
+import numpy as np
+from typing import Mapping
+
+
+def compute_metrics(gt, pred, time, query_mode):
+ query_points = gt["query_points"].cpu().numpy()
+ gt_tracks = gt["tracks"][..., :2].permute(0, 2, 1, 3).cpu().numpy()
+ gt_occluded = (1 - gt["tracks"][..., 2]).permute(0, 2, 1).cpu().numpy()
+ pred_tracks = pred["tracks"][..., :2].permute(0, 2, 1, 3).cpu().numpy()
+ pred_occluded = (1 - pred["tracks"][..., 2]).permute(0, 2, 1).cpu().numpy()
+
+ metrics = compute_tapvid_metrics(
+ query_points,
+ gt_occluded,
+ gt_tracks,
+ pred_occluded,
+ pred_tracks,
+ query_mode=query_mode
+ )
+
+ metrics["time"] = time
+
+ return metrics
+
+
+def compute_tapvid_metrics(
+ query_points: np.ndarray,
+ gt_occluded: np.ndarray,
+ gt_tracks: np.ndarray,
+ pred_occluded: np.ndarray,
+ pred_tracks: np.ndarray,
+ query_mode: str,
+) -> Mapping[str, np.ndarray]:
+ """Computes TAP-Vid metrics (Jaccard, Pts. Within Thresh, Occ. Acc.)
+ See the TAP-Vid paper for details on the metric computation. All inputs are
+ given in raster coordinates. The first three arguments should be the direct
+ outputs of the reader: the 'query_points', 'occluded', and 'target_points'.
+ The paper metrics assume these are scaled relative to 256x256 images.
+ pred_occluded and pred_tracks are your algorithm's predictions.
+ This function takes a batch of inputs, and computes metrics separately for
+ each video. The metrics for the full benchmark are a simple mean of the
+ metrics across the full set of videos. These numbers are between 0 and 1,
+ but the paper multiplies them by 100 to ease reading.
+ Args:
+ query_points: The query points, in the format [t, y, x]. Their size is
+ [b, n, 3], where b is the batch size and n is the number of queries.
+ gt_occluded: A boolean array of shape [b, n, t], where t is the number
+ of frames. True indicates that the point is occluded.
+ gt_tracks: The target points, of shape [b, n, t, 2]. Each point is
+ in the format [x, y]
+ pred_occluded: A boolean array of predicted occlusions, in the same
+ format as gt_occluded.
+ pred_tracks: An array of track predictions from your algorithm, in the
+ same format as gt_tracks.
+ query_mode: Either 'first' or 'strided', depending on how queries are
+ sampled. If 'first', we assume the prior knowledge that all points
+ before the query point are occluded, and these are removed from the
+ evaluation.
+ Returns:
+ A dict with the following keys:
+ occlusion_accuracy: Accuracy at predicting occlusion.
+ pts_within_{x} for x in [1, 2, 4, 8, 16]: Fraction of points
+ predicted to be within the given pixel threshold, ignoring occlusion
+ prediction.
+ jaccard_{x} for x in [1, 2, 4, 8, 16]: Jaccard metric for the given
+ threshold
+ average_pts_within_thresh: average across pts_within_{x}
+ average_jaccard: average across jaccard_{x}
+ """
+
+ metrics = {}
+ # Fixed bug is described in:
+ # https://github.com/facebookresearch/co-tracker/issues/20
+ eye = np.eye(gt_tracks.shape[2], dtype=np.int32)
+
+ if query_mode == "first":
+ # evaluate frames after the query frame
+ query_frame_to_eval_frames = np.cumsum(eye, axis=1) - eye
+ elif query_mode == "strided":
+ # evaluate all frames except the query frame
+ query_frame_to_eval_frames = 1 - eye
+ else:
+ raise ValueError("Unknown query mode " + query_mode)
+
+ query_frame = query_points[..., 0]
+ query_frame = np.round(query_frame).astype(np.int32)
+ evaluation_points = query_frame_to_eval_frames[query_frame] > 0
+
+ # Occlusion accuracy is simply how often the predicted occlusion equals the
+ # ground truth.
+ occ_acc = np.sum(
+ np.equal(pred_occluded, gt_occluded) & evaluation_points,
+ axis=(1, 2),
+ ) / np.sum(evaluation_points)
+ metrics["occlusion_accuracy"] = occ_acc
+
+ # Next, convert the predictions and ground truth positions into pixel
+ # coordinates.
+ visible = np.logical_not(gt_occluded)
+ pred_visible = np.logical_not(pred_occluded)
+ all_frac_within = []
+ all_jaccard = []
+ for thresh in [1, 2, 4, 8, 16]:
+ # True positives are points that are within the threshold and where both
+ # the prediction and the ground truth are listed as visible.
+ within_dist = np.sum(
+ np.square(pred_tracks - gt_tracks),
+ axis=-1,
+ ) < np.square(thresh)
+ is_correct = np.logical_and(within_dist, visible)
+
+ # Compute the frac_within_threshold, which is the fraction of points
+ # within the threshold among points that are visible in the ground truth,
+ # ignoring whether they're predicted to be visible.
+ count_correct = np.sum(
+ is_correct & evaluation_points,
+ axis=(1, 2),
+ )
+ count_visible_points = np.sum(visible & evaluation_points, axis=(1, 2))
+ frac_correct = count_correct / count_visible_points
+ metrics["pts_within_" + str(thresh)] = frac_correct
+ all_frac_within.append(frac_correct)
+
+ true_positives = np.sum(
+ is_correct & pred_visible & evaluation_points, axis=(1, 2)
+ )
+
+ # The denominator of the jaccard metric is the true positives plus
+ # false positives plus false negatives. However, note that true positives
+ # plus false negatives is simply the number of points in the ground truth
+ # which is easier to compute than trying to compute all three quantities.
+ # Thus we just add the number of points in the ground truth to the number
+ # of false positives.
+ #
+ # False positives are simply points that are predicted to be visible,
+ # but the ground truth is not visible or too far from the prediction.
+ gt_positives = np.sum(visible & evaluation_points, axis=(1, 2))
+ false_positives = (~visible) & pred_visible
+ false_positives = false_positives | ((~within_dist) & pred_visible)
+ false_positives = np.sum(false_positives & evaluation_points, axis=(1, 2))
+ jaccard = true_positives / (gt_positives + false_positives)
+ metrics["jaccard_" + str(thresh)] = jaccard
+ all_jaccard.append(jaccard)
+ metrics["average_jaccard"] = np.mean(
+ np.stack(all_jaccard, axis=1),
+ axis=1,
+ )
+ metrics["average_pts_within_thresh"] = np.mean(
+ np.stack(all_frac_within, axis=1),
+ axis=1,
+ )
+ return metrics
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/options/base_options.py b/data/dot_single_video/dot/utils/options/base_options.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0e5ef6e5559dfe0284418170050525c32ac8126
--- /dev/null
+++ b/data/dot_single_video/dot/utils/options/base_options.py
@@ -0,0 +1,73 @@
+import argparse
+import random
+from datetime import datetime
+
+
+def str2bool(v):
+ if isinstance(v, bool):
+ return v
+ if v.lower() in ('yes', 'true', 't', 'y', '1'):
+ return True
+ elif v.lower() in ('no', 'false', 'f', 'n', '0'):
+ return False
+ else:
+ raise argparse.ArgumentTypeError('Boolean value expected.')
+
+
+class BaseOptions:
+ def initialize(self, parser):
+ parser.add_argument("--name", type=str)
+ parser.add_argument("--model", type=str, default="dot", choices=["dot", "of", "pt"])
+ parser.add_argument("--datetime", type=str, default=None)
+ parser.add_argument("--data_root", type=str)
+ parser.add_argument("--height", type=int, default=512)
+ parser.add_argument("--width", type=int, default=512)
+ parser.add_argument("--aspect_ratio", type=float, default=1)
+ parser.add_argument("--batch_size", type=int)
+ parser.add_argument("--num_tracks", type=int, default=2048)
+ parser.add_argument("--sim_tracks", type=int, default=2048)
+ parser.add_argument("--alpha_thresh", type=float, default=0.8)
+ parser.add_argument("--is_train", type=str2bool, nargs='?', const=True, default=False)
+
+ # Parallelization
+ parser.add_argument('--worker_idx', type=int, default=0)
+ parser.add_argument("--num_workers", type=int, default=2)
+
+ # Optical flow estimator
+ parser.add_argument("--estimator_config", type=str, default="configs/raft_patch_8.json")
+ parser.add_argument("--estimator_path", type=str, default="checkpoints/cvo_raft_patch_8.pth")
+ parser.add_argument("--flow_mode", type=str, default="direct", choices=["direct", "chain", "warm_start"])
+
+ # Optical flow refiner
+ parser.add_argument("--refiner_config", type=str, default="configs/raft_patch_4_alpha.json")
+ parser.add_argument("--refiner_path", type=str, default="checkpoints/movi_f_raft_patch_4_alpha.pth")
+
+ # Point tracker
+ parser.add_argument("--tracker_config", type=str, default="configs/cotracker2_patch_4_wind_8.json")
+ parser.add_argument("--tracker_path", type=str, default="checkpoints/movi_f_cotracker2_patch_4_wind_8.pth")
+ parser.add_argument("--sample_mode", type=str, default="all", choices=["all", "first", "last"])
+
+ # Dense optical tracker
+ parser.add_argument("--cell_size", type=int, default=1)
+ parser.add_argument("--cell_time_steps", type=int, default=20)
+
+ # Interpolation
+ parser.add_argument("--interpolation_version", type=str, default="torch3d", choices=["torch3d", "torch"])
+ return parser
+
+ def parse_args(self):
+ parser = argparse.ArgumentParser()
+ parser = self.initialize(parser)
+ args = parser.parse_args()
+ if args.datetime is None:
+ args.datetime = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
+ name = f"{args.datetime}_{args.name}_{args.model}"
+ if hasattr(args, 'split'):
+ name += f"_{args.split}"
+ args.checkpoint_path = f"checkpoints/{name}"
+ args.log_path = f"logs/{name}"
+ args.result_path = f"results/{name}"
+ if hasattr(args, 'world_size'):
+ args.batch_size = args.batch_size // args.world_size
+ args.master_port = f'{10000 + random.randrange(1, 10000)}'
+ return args
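+
+# Usage sketch (comment only, not executed): option subclasses extend
+# initialize() and call parse_args(), which also derives the checkpoint, log
+# and result paths from the run name, e.g.
+#   args = DemoOptions().parse_args()
+#   args.log_path  # logs/<datetime>_demo_dot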
diff --git a/data/dot_single_video/dot/utils/options/demo_options.py b/data/dot_single_video/dot/utils/options/demo_options.py
new file mode 100644
index 0000000000000000000000000000000000000000..07f01eb06db9ef288fb408d376c0941d78d1baf8
--- /dev/null
+++ b/data/dot_single_video/dot/utils/options/demo_options.py
@@ -0,0 +1,23 @@
+from .base_options import BaseOptions, str2bool
+
+
+class DemoOptions(BaseOptions):
+ def initialize(self, parser):
+ BaseOptions.initialize(self, parser)
+ parser.add_argument("--inference_mode", type=str, default="tracks_from_first_to_every_other_frame")
+ parser.add_argument("--visualization_modes", type=str, nargs="+", default=["overlay", "spaghetti_last_static"])
+ parser.add_argument("--video_path", type=str, default="orange.mp4")
+ parser.add_argument("--mask_path", type=str, default="orange.png")
+ parser.add_argument("--save_tracks", type=str2bool, nargs='?', const=True, default=False)
+ parser.add_argument("--recompute_tracks", type=str2bool, nargs='?', const=True, default=False)
+ parser.add_argument("--overlay_factor", type=float, default=0.75)
+ parser.add_argument("--rainbow_mode", type=str, default="left_right", choices=["left_right", "up_down"])
+ parser.add_argument("--save_mode", type=str, default="video", choices=["image", "video"])
+ parser.add_argument("--spaghetti_radius", type=float, default=1.5)
+ parser.add_argument("--spaghetti_length", type=int, default=40)
+ parser.add_argument("--spaghetti_grid", type=int, default=30)
+ parser.add_argument("--spaghetti_scale", type=float, default=2)
+ parser.add_argument("--spaghetti_every", type=int, default=10)
+ parser.add_argument("--spaghetti_dropout", type=float, default=0)
+ parser.set_defaults(data_root="datasets/demo", name="demo", batch_size=1, height=480, width=856, num_tracks=8192)
+ return parser
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/options/preprocess_options.py b/data/dot_single_video/dot/utils/options/preprocess_options.py
new file mode 100644
index 0000000000000000000000000000000000000000..3bf90cb5f65b887a0b8e4a2a0f53cf269cff6062
--- /dev/null
+++ b/data/dot_single_video/dot/utils/options/preprocess_options.py
@@ -0,0 +1,13 @@
+from .base_options import BaseOptions, str2bool
+
+
+class PreprocessOptions(BaseOptions):
+ def initialize(self, parser):
+ BaseOptions.initialize(self, parser)
+ parser.add_argument("--extract_movi_f", type=str2bool, nargs='?', const=True, default=False)
+ parser.add_argument("--save_tracks", type=str2bool, nargs='?', const=True, default=False)
+ parser.add_argument('--download_path', type=str, default="gs://kubric-public/tfds")
+ parser.add_argument('--num_videos', type=int, default=11000)
+ parser.set_defaults(data_root="datasets/kubric/movi_f", name="preprocess", num_workers=2, num_tracks=2048,
+ model="pt")
+ return parser
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/options/test_cvo_options.py b/data/dot_single_video/dot/utils/options/test_cvo_options.py
new file mode 100644
index 0000000000000000000000000000000000000000..6a1d634c40ebaaf6c28a7513028652c0da4d75be
--- /dev/null
+++ b/data/dot_single_video/dot/utils/options/test_cvo_options.py
@@ -0,0 +1,15 @@
+from .base_options import BaseOptions, str2bool
+
+
+class TestOptions(BaseOptions):
+ def initialize(self, parser):
+ BaseOptions.initialize(self, parser)
+ parser.add_argument("--split", type=str, choices=["clean", "final", "extended"], default="clean")
+ parser.add_argument("--filter", type=str2bool, nargs='?', const=True, default=True)
+ parser.add_argument('--filter_indices', type=int, nargs="+",
+ default=[70, 77, 93, 96, 140, 143, 162, 172, 174, 179, 187, 215, 236, 284, 285, 293, 330,
+ 358, 368, 402, 415, 458, 483, 495, 534])
+ parser.add_argument('--plot_indices', type=int, nargs="+", default=[])
+ parser.set_defaults(data_root="datasets/kubric/cvo", name="test_cvo", batch_size=1, num_workers=0,
+ sample_mode="last")
+ return parser
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/options/test_tap_options.py b/data/dot_single_video/dot/utils/options/test_tap_options.py
new file mode 100644
index 0000000000000000000000000000000000000000..5cc3ca5a82f752e77496721ef51a2f66f02ddc0b
--- /dev/null
+++ b/data/dot_single_video/dot/utils/options/test_tap_options.py
@@ -0,0 +1,11 @@
+from .base_options import BaseOptions
+
+
+class TestOptions(BaseOptions):
+ def initialize(self, parser):
+ BaseOptions.initialize(self, parser)
+ parser.add_argument("--split", type=str, choices=["davis", "rgb_stacking", "kinetics"], default="davis")
+ parser.add_argument("--query_mode", type=str, default="first", choices=["first", "strided"])
+ parser.add_argument('--plot_indices', type=int, nargs="+", default=[])
+ parser.set_defaults(data_root="datasets/tap", name="test_tap", batch_size=1, num_workers=0, num_tracks=8192)
+ return parser
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/options/train_options.py b/data/dot_single_video/dot/utils/options/train_options.py
new file mode 100644
index 0000000000000000000000000000000000000000..215b491f5f35faddd9737e24b0597f59c8257661
--- /dev/null
+++ b/data/dot_single_video/dot/utils/options/train_options.py
@@ -0,0 +1,27 @@
+from .base_options import BaseOptions
+
+
+class TrainOptions(BaseOptions):
+ def initialize(self, parser):
+ BaseOptions.initialize(self, parser)
+ parser.add_argument("--in_track_name", type=str, default="cotracker")
+ parser.add_argument("--out_track_name", type=str, default="ground_truth")
+ parser.add_argument("--num_in_tracks", type=int, default=2048)
+ parser.add_argument("--num_out_tracks", type=int, default=2048)
+ parser.add_argument("--batch_size_valid", type=int, default=4)
+ parser.add_argument("--train_iter", type=int, default=1000000)
+ parser.add_argument("--log_iter", type=int, default=10000)
+ parser.add_argument("--log_factor", type=float, default=1.)
+ parser.add_argument("--print_iter", type=int, default=100)
+ parser.add_argument("--valid_iter", type=int, default=10000)
+ parser.add_argument("--num_valid_batches", type=int, default=24)
+ parser.add_argument("--save_iter", type=int, default=1000)
+ parser.add_argument("--lr", type=float, default=0.0001)
+ parser.add_argument("--world_size", type=int, default=1)
+ parser.add_argument("--valid_ratio", type=float, default=0.01)
+ parser.add_argument("--lambda_motion_loss", type=float, default=1.)
+ parser.add_argument("--lambda_visibility_loss", type=float, default=1.)
+ parser.add_argument("--optimizer_path", type=str, default=None)
+ parser.set_defaults(data_root="datasets/kubric/movi_f", name="train", batch_size=8, refiner_path=None,
+ is_train=True, model="ofr")
+ return parser
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/plot.py b/data/dot_single_video/dot/utils/plot.py
new file mode 100644
index 0000000000000000000000000000000000000000..de035d7a02e49662dd0918fdd10e5e52cb087960
--- /dev/null
+++ b/data/dot_single_video/dot/utils/plot.py
@@ -0,0 +1,197 @@
+import os.path as osp
+import matplotlib
+import matplotlib.pyplot as plt
+import numpy as np
+import torch
+
+from dot.utils.io import create_folder
+
+
+def to_rgb(tensor, mode, tracks=None, is_video=False, to_torch=True, reshape_as_video=False):
+ if isinstance(tensor, list):
+ tensor = torch.stack(tensor)
+ tensor = tensor.cpu().numpy()
+ if is_video:
+ batch_size, time_steps = tensor.shape[:2]
+ if mode == "flow":
+ height, width = tensor.shape[-3: -1]
+ tensor = np.reshape(tensor, (-1, height, width, 2))
+ tensor = flow_to_rgb(tensor)
+ elif mode == "mask":
+ height, width = tensor.shape[-2:]
+ tensor = np.reshape(tensor, (-1, 1, height, width))
+ tensor = np.repeat(tensor, 3, axis=1)
+ else:
+ height, width = tensor.shape[-2:]
+ tensor = np.reshape(tensor, (-1, 3, height, width))
+ if tracks is not None:
+ samples = tracks.size(-2)
+ tracks = tracks.cpu().numpy()
+ tracks = np.reshape(tracks, (-1, samples, 3))
+ traj, occ = tracks[..., :2], 1 - tracks[..., 2]
+ if is_video:
+ tensor = np.reshape(tensor, (-1, time_steps, 3, height, width))
+ traj = np.reshape(traj, (-1, time_steps, samples, 2))
+ occ = np.reshape(occ, (-1, time_steps, samples))
+ new_tensor = []
+ for t in range(time_steps):
+ pos_t = traj[:, t]
+ occ_t = occ[:, t]
+ new_tensor.append(plot_tracks(tensor[:, t], pos_t, occ_t, tracks=traj[:, :t + 1]))
+ tensor = np.stack(new_tensor, axis=1)
+ else:
+ tensor = plot_tracks(tensor, traj, occ)
+ if is_video and reshape_as_video:
+ tensor = np.reshape(tensor, (batch_size, time_steps, 3, height, width))
+ else:
+ tensor = np.reshape(tensor, (-1, 3, height, width))
+ if to_torch:
+ tensor = torch.from_numpy(tensor)
+ return tensor
+
+
+def flow_to_rgb(flow, transparent=False):
+ flow = np.copy(flow)
+ H, W = flow.shape[-3: -1]
+ mul = 20.
+ scaling = mul / (H ** 2 + W ** 2) ** 0.5
+ direction = (np.arctan2(flow[..., 0], flow[..., 1]) + np.pi) / (2 * np.pi)
+ norm = np.linalg.norm(flow, axis=-1)
+ magnitude = np.clip(norm * scaling, 0., 1.)
+ saturation = np.ones_like(direction)
+ if transparent:
+ hsv = np.stack([direction, saturation, np.ones_like(magnitude)], axis=-1)
+ else:
+ hsv = np.stack([direction, saturation, magnitude], axis=-1)
+ rgb = matplotlib.colors.hsv_to_rgb(hsv)
+ rgb = np.moveaxis(rgb, -1, -3)
+ if transparent:
+ return np.concatenate([rgb, np.expand_dims(magnitude, axis=-3)], axis=-3)
+ return rgb
+
+
+def plot_tracks(rgb, points, occluded, tracks=None, trackgroup=None):
+ """Plot tracks with matplotlib.
+ Adapted from: https://github.com/google-research/kubric/blob/main/challenges/point_tracking/dataset.py"""
+ rgb = rgb.transpose(0, 2, 3, 1)
+ _, height, width, _ = rgb.shape
+ points = points.transpose(1, 0, 2).copy() # clone, otherwise it updates points array
+ # points[..., 0] *= (width - 1)
+ # points[..., 1] *= (height - 1)
+ if tracks is not None:
+ tracks = tracks.copy()
+ # tracks[..., 0] *= (width - 1)
+ # tracks[..., 1] *= (height - 1)
+ if occluded is not None:
+ occluded = occluded.transpose(1, 0)
+ disp = []
+ cmap = plt.cm.hsv
+
+ z_list = np.arange(points.shape[0]) if trackgroup is None else np.array(trackgroup)
+ # random permutation of the colors so nearby points in the list can get different colors
+ np.random.seed(0)
+ z_list = np.random.permutation(np.max(z_list) + 1)[z_list]
+ colors = cmap(z_list / (np.max(z_list) + 1))
+ figure_dpi = 64
+
+ for i in range(rgb.shape[0]):
+ fig = plt.figure(
+ figsize=(width / figure_dpi, height / figure_dpi),
+ dpi=figure_dpi,
+ frameon=False,
+ facecolor='w')
+ ax = fig.add_subplot()
+ ax.axis('off')
+ ax.imshow(rgb[i])
+
+ valid = points[:, i, 0] > 0
+ valid = np.logical_and(valid, points[:, i, 0] < rgb.shape[2] - 1)
+ valid = np.logical_and(valid, points[:, i, 1] > 0)
+ valid = np.logical_and(valid, points[:, i, 1] < rgb.shape[1] - 1)
+
+ if occluded is not None:
+ colalpha = np.concatenate([colors[:, :-1], 1 - occluded[:, i:i + 1]], axis=1)
+ else:
+ colalpha = colors[:, :-1]
+ # Note: matplotlib uses pixel coordinates, not raster.
+ ax.scatter(
+ points[valid, i, 0] - 0.5,
+ points[valid, i, 1] - 0.5,
+ s=3,
+ c=colalpha[valid],
+ )
+
+ if tracks is not None:
+ for j in range(tracks.shape[2]):
+ track_color = colors[j] # Use a different color for each track
+ x = tracks[i, :, j, 0]
+ y = tracks[i, :, j, 1]
+ valid_track = x > 0
+ valid_track = np.logical_and(valid_track, x < rgb.shape[2] - 1)
+ valid_track = np.logical_and(valid_track, y > 0)
+ valid_track = np.logical_and(valid_track, y < rgb.shape[1] - 1)
+ ax.plot(x[valid_track] - 0.5, y[valid_track] - 0.5, color=track_color, marker=None)
+
+ if occluded is not None:
+ occ2 = occluded[:, i:i + 1]
+
+ colalpha = np.concatenate([colors[:, :-1], occ2], axis=1)
+
+ ax.scatter(
+ points[valid, i, 0],
+ points[valid, i, 1],
+ s=20,
+ facecolors='none',
+ edgecolors=colalpha[valid],
+ )
+
+ plt.subplots_adjust(top=1, bottom=0, right=1, left=0, hspace=0, wspace=0)
+ plt.margins(0, 0)
+ fig.canvas.draw()
+ width, height = fig.get_size_inches() * fig.get_dpi()
+ img = np.frombuffer(
+ fig.canvas.tostring_rgb(),
+ dtype='uint8').reshape(int(height), int(width), 3)
+ disp.append(np.copy(img))
+ plt.close("all")
+
+ return np.stack(disp, axis=0).astype(float).transpose(0, 3, 1, 2) / 255 # TODO : inconsistent
+
+
+def plot_points(src_frame, tgt_frame, src_points, tgt_points, save_path, max_points=256):
+ _, H, W = src_frame.shape
+ src_frame = src_frame.permute(1, 2, 0).cpu().numpy()
+ tgt_frame = tgt_frame.permute(1, 2, 0).cpu().numpy()
+ src_points = src_points.cpu().numpy()
+ tgt_points = tgt_points.cpu().numpy()
+ src_pos, src_alpha = src_points[..., :2], src_points[..., 2]
+ tgt_pos, tgt_alpha = tgt_points[..., :2], tgt_points[..., 2]
+ src_pos = np.stack([src_pos[..., 0] * (W - 1), src_pos[..., 1] * (H - 1)], axis=-1)
+ tgt_pos = np.stack([tgt_pos[..., 0] * (W - 1), tgt_pos[..., 1] * (H - 1)], axis=-1)
+
+ plt.figure()
+ ax = plt.gca()
+ P = 10
+ plt.imshow(np.concatenate((src_frame, np.ones_like(src_frame[:, :P]), tgt_frame), axis=1))
+ indices = np.random.choice(len(src_pos), size=min(max_points, len(src_pos)), replace=False)
+ for i in indices:
+ if src_alpha[i] == 1:
+ ax.scatter(src_pos[i, 0], src_pos[i, 1], s=5, c="black", marker='x')
+ else:
+ ax.scatter(src_pos[i, 0], src_pos[i, 1], s=5, linewidths=1.5, c="black", marker='o')
+ ax.scatter(src_pos[i, 0], src_pos[i, 1], s=2.5, c="white", marker='o')
+ if tgt_alpha[i] == 1:
+ ax.scatter(W + P + tgt_pos[i, 0], tgt_pos[i, 1], s=5, c="black", marker='x')
+ else:
+ ax.scatter(W + P + tgt_pos[i, 0], tgt_pos[i, 1], s=5, linewidths=1.5, c="black", marker='o')
+ ax.scatter(W + P + tgt_pos[i, 0], tgt_pos[i, 1], s=2.5, c="white", marker='o')
+
+ plt.plot([src_pos[i, 0], W + P + tgt_pos[i, 0]], [src_pos[i, 1], tgt_pos[i, 1]], linewidth=0.5, c="black")
+
+ # Save
+ ax.axis('off')
+ plt.tight_layout()
+ plt.subplots_adjust(wspace=0)
+ create_folder(osp.dirname(save_path))
+ plt.savefig(save_path, bbox_inches='tight', pad_inches=0)
+ plt.close()
\ No newline at end of file
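
As a quick orientation to `flow_to_rgb` above: flow direction is mapped to hue via `arctan2`, and magnitude, scaled by `20 / sqrt(H**2 + W**2)` and clipped to `[0, 1]`, is mapped to value. A minimal sketch of the expected shapes, assuming only NumPy and the module above:

```python
# Minimal sketch: a constant flow field renders as a single hue.
import numpy as np
from dot.utils.plot import flow_to_rgb

H, W = 64, 64
flow = np.zeros((H, W, 2), dtype=np.float32)
flow[..., 0] = 10.0                      # constant displacement in one flow channel
rgb = flow_to_rgb(flow)                  # hue = direction, value = clipped magnitude
print(rgb.shape, rgb.min(), rgb.max())   # (3, 64, 64), values within [0, 1]
```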
diff --git a/data/dot_single_video/dot/utils/torch.py b/data/dot_single_video/dot/utils/torch.py
new file mode 100644
index 0000000000000000000000000000000000000000..6dd06e5ac671aee78d40523f00b759ef5abbc288
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch.py
@@ -0,0 +1,133 @@
+import numpy as np
+import torch
+import torch.distributed as dist
+
+
+def reduce(tensor, world_size):
+ if isinstance(tensor, torch.Tensor):
+ tensor = tensor.clone()
+ dist.all_reduce(tensor, dist.ReduceOp.SUM)
+ tensor.div_(world_size)
+ return tensor
+
+
+def expand(mask, num=1):
+ # mask: ... H W
+ # -----------------
+ # mask: ... H W
+ for _ in range(num):
+ mask[..., 1:, :] = mask[..., 1:, :] | mask[..., :-1, :]
+ mask[..., :-1, :] = mask[..., :-1, :] | mask[..., 1:, :]
+ mask[..., :, 1:] = mask[..., :, 1:] | mask[..., :, :-1]
+ mask[..., :, :-1] = mask[..., :, :-1] | mask[..., :, 1:]
+ return mask
+
+
+def differentiate(mask):
+ # mask: ... H W
+ # -----------------
+ # diff: ... H W
+ diff = torch.zeros_like(mask).bool()
+ diff_y = mask[..., 1:, :] != mask[..., :-1, :]
+ diff_x = mask[..., :, 1:] != mask[..., :, :-1]
+ diff[..., 1:, :] = diff[..., 1:, :] | diff_y
+ diff[..., :-1, :] = diff[..., :-1, :] | diff_y
+ diff[..., :, 1:] = diff[..., :, 1:] | diff_x
+ diff[..., :, :-1] = diff[..., :, :-1] | diff_x
+ return diff
+
+
+def sample_points(step, boundaries, num_samples):
+ if boundaries.ndim == 3:
+ points = []
+ for boundaries_k in boundaries:
+ points_k = sample_points(step, boundaries_k, num_samples)
+ points.append(points_k)
+ points = torch.stack(points)
+ else:
+ H, W = boundaries.shape
+ boundary_points, _ = sample_mask_points(step, boundaries, num_samples // 2)
+ num_boundary_points = boundary_points.shape[0]
+ num_random_points = num_samples - num_boundary_points
+ random_points = sample_random_points(step, H, W, num_random_points)
+ random_points = random_points.to(boundary_points.device)
+ points = torch.cat((boundary_points, random_points), dim=0)
+ return points
+
+
+def sample_mask_points(step, mask, num_points):
+ num_nonzero = int(mask.sum())
+ i, j = torch.nonzero(mask, as_tuple=True)
+ if num_points < num_nonzero:
+ sample = np.random.choice(num_nonzero, size=num_points, replace=False)
+ i, j = i[sample], j[sample]
+ t = torch.ones_like(i) * step
+ x, y = j, i
+ points = torch.stack((t, x, y), dim=-1) # [num_points, 3]
+ return points.float(), (i, j)
+
+
+def sample_random_points(step, height, width, num_points):
+ x = torch.randint(width, size=[num_points])
+ y = torch.randint(height, size=[num_points])
+ t = torch.ones(num_points) * step
+ points = torch.stack((t, x, y), dim=-1) # [num_points, 3]
+ return points.float()
+
+
+def get_grid(height, width, shape=None, dtype="torch", device="cpu", align_corners=True, normalize=True):
+ H, W = height, width
+ S = shape if shape else []
+ if align_corners:
+ x = torch.linspace(0, 1, W, device=device)
+ y = torch.linspace(0, 1, H, device=device)
+ if not normalize:
+ x = x * (W - 1)
+ y = y * (H - 1)
+ else:
+ x = torch.linspace(0.5 / W, 1.0 - 0.5 / W, W, device=device)
+ y = torch.linspace(0.5 / H, 1.0 - 0.5 / H, H, device=device)
+ if not normalize:
+ x = x * W
+ y = y * H
+ x_view, y_view, exp = [1 for _ in S] + [1, -1], [1 for _ in S] + [-1, 1], S + [H, W]
+ x = x.view(*x_view).expand(*exp)
+ y = y.view(*y_view).expand(*exp)
+ grid = torch.stack([x, y], dim=-1)
+ if dtype == "numpy":
+ grid = grid.numpy()
+ return grid
+
+
+def get_sobel_kernel(kernel_size):
+ K = kernel_size
+ sobel = torch.tensor(list(range(K))) - K // 2
+ sobel_x, sobel_y = sobel.view(-1, 1), sobel.view(1, -1)
+ sum_xy = sobel_x ** 2 + sobel_y ** 2
+ sum_xy[sum_xy == 0] = 1
+ sobel_x, sobel_y = sobel_x / sum_xy, sobel_y / sum_xy
+ sobel_kernel = torch.stack([sobel_x.unsqueeze(0), sobel_y.unsqueeze(0)], dim=0)
+ return sobel_kernel
+
+
+def to_device(data, device):
+ data = {k: v.to(device) for k, v in data.items()}
+ return data
+
+
+def get_alpha_consistency(bflow, fflow, thresh_1=0.01, thresh_2=0.5, thresh_mul=1):
+ norm = lambda x: x.pow(2).sum(dim=-1).sqrt()
+ B, H, W, C = bflow.shape
+
+ mag = norm(fflow) + norm(bflow)
+ grid = get_grid(H, W, shape=[B], device=fflow.device)
+ grid[..., 0] = grid[..., 0] + bflow[..., 0] / (W - 1)
+ grid[..., 1] = grid[..., 1] + bflow[..., 1] / (H - 1)
+ grid = grid * 2 - 1
+ fflow_warped = torch.nn.functional.grid_sample(fflow.permute(0, 3, 1, 2), grid, mode="bilinear", align_corners=True)
+ flow_diff = bflow + fflow_warped.permute(0, 2, 3, 1)
+ occ_thresh = thresh_1 * mag + thresh_2
+ occ_thresh = occ_thresh * thresh_mul
+ alpha = norm(flow_diff) < occ_thresh
+ alpha = alpha.float()
+ return alpha
\ No newline at end of file
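
The forward/backward consistency check in `get_alpha_consistency` above warps the forward flow by the backward flow and marks a pixel visible when `|bflow + warp(fflow)|` stays below `thresh_1 * (|fflow| + |bflow|) + thresh_2`. A small sketch with synthetic, mutually consistent flows (values and shapes are illustrative only):

```python
# Sketch: perfectly consistent constant flows should be marked (mostly) visible.
import torch
from dot.utils.torch import get_alpha_consistency

B, H, W = 1, 32, 32
fflow = torch.zeros(B, H, W, 2)   # forward flow, frame t -> t+1
bflow = torch.zeros(B, H, W, 2)   # backward flow, frame t+1 -> t
fflow[..., 0] = 2.0
bflow[..., 0] = -2.0              # exactly cancels the forward flow
alpha = get_alpha_consistency(bflow, fflow)
print(alpha.shape, alpha.mean())  # (1, 32, 32); mean close to 1 away from borders
```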
diff --git a/data/dot_single_video/dot/utils/torch3d/LICENSE b/data/dot_single_video/dot/utils/torch3d/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..caab102f8b1bb5578bea0395d1a3c8dd62da6308
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/LICENSE
@@ -0,0 +1,30 @@
+BSD License
+
+For PyTorch3D software
+
+Copyright (c) Meta Platforms, Inc. and affiliates. All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this
+ list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+ * Neither the name Meta nor the names of its contributors may be used to
+ endorse or promote products derived from this software without specific
+ prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/data/dot_single_video/dot/utils/torch3d/__init__.py b/data/dot_single_video/dot/utils/torch3d/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..28195c1fbca7697dc428da32823467d9c4fc427d
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/__init__.py
@@ -0,0 +1,2 @@
+from .knn import knn_points
+from .packed_to_padded import packed_to_padded
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/build.ninja b/data/dot_single_video/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/build.ninja
new file mode 100644
index 0000000000000000000000000000000000000000..29a603cfd9c35f4ab0399c9e501551e2cae251e8
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/build.ninja
@@ -0,0 +1,36 @@
+ninja_required_version = 1.3
+cxx = c++
+nvcc = /usr/local/cuda/bin/nvcc
+
+cflags = -pthread -B /mnt/zhongwei/subapp/miniconda3/envs/torch2/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -fPIC -O2 -isystem /mnt/zhongwei/subapp/miniconda3/envs/torch2/include -fPIC -O2 -isystem /mnt/zhongwei/subapp/miniconda3/envs/torch2/include -fPIC -DWITH_CUDA -DTHRUST_IGNORE_CUB_VERSION_CHECK -I/mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include/TH -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/include/python3.10 -c
+post_cflags = -std=c++14 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
+cuda_cflags = -DWITH_CUDA -DTHRUST_IGNORE_CUB_VERSION_CHECK -I/mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include/TH -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/zhongwei/subapp/miniconda3/envs/torch2/include/python3.10 -c
+cuda_post_cflags = -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -std=c++14 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80
+cuda_dlink_post_cflags =
+ldflags =
+
+rule compile
+ command = $cxx -MMD -MF $out.d $cflags -c $in -o $out $post_cflags
+ depfile = $out.d
+ deps = gcc
+
+rule cuda_compile
+ depfile = $out.d
+ deps = gcc
+ command = $nvcc --generate-dependencies-with-compile --dependency-output $out.d $cuda_cflags -c $in -o $out $cuda_post_cflags
+
+
+
+
+
+build /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/ext.o: compile /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/ext.cpp
+build /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/knn/knn.o: cuda_compile /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/knn/knn.cu
+build /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/knn/knn_cpu.o: compile /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/knn/knn_cpu.cpp
+build /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.o: cuda_compile /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.cu
+build /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/build/temp.linux-x86_64-cpython-310/mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor_cpu.o: compile /mnt/zhongwei/zhongwei/all_good_tools/dot_all/24_06_06/dot_single_video/dot_ori/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor_cpu.cpp
+
+
+
+
+
+
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/ext.cpp b/data/dot_single_video/dot/utils/torch3d/csrc/ext.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..ab0337e5db86f2aaea9eaf3d7049fd40d98d4884
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/ext.cpp
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+// clang-format off
+#include <torch/extension.h>
+// clang-format on
+#include "knn/knn.h"
+#include "packed_to_padded_tensor/packed_to_padded_tensor.h"
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
+ m.def("packed_to_padded", &PackedToPadded);
+ m.def("padded_to_packed", &PaddedToPacked);
+#ifdef WITH_CUDA
+ m.def("knn_check_version", &KnnCheckVersion);
+#endif
+ m.def("knn_points_idx", &KNearestNeighborIdx);
+ m.def("knn_points_backward", &KNearestNeighborBackward);
+}
\ No newline at end of file
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn.cu b/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn.cu
new file mode 100644
index 0000000000000000000000000000000000000000..633065c991eaa036f1b2041000c81a2638430a1a
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn.cu
@@ -0,0 +1,584 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#include <ATen/ATen.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <c10/cuda/CUDAGuard.h>
+#include <float.h>
+#include <iostream>
+#include <tuple>
+
+#include "utils/dispatch.cuh"
+#include "utils/mink.cuh"
+
+// A chunk of work is blocksize-many points of P1.
+// The number of potential chunks to do is N*(1+(P1-1)/blocksize)
+// call (1+(P1-1)/blocksize) chunks_per_cloud
+// These chunks are divided among the gridSize-many blocks.
+// In block b, we work on chunks b, b+gridSize, b+2*gridSize etc .
+// In chunk i, we work on cloud i/chunks_per_cloud on points starting from
+// blocksize*(i%chunks_per_cloud).
+
+template <typename scalar_t>
+__global__ void KNearestNeighborKernelV0(
+ const scalar_t* __restrict__ points1,
+ const scalar_t* __restrict__ points2,
+ const int64_t* __restrict__ lengths1,
+ const int64_t* __restrict__ lengths2,
+ scalar_t* __restrict__ dists,
+ int64_t* __restrict__ idxs,
+ const size_t N,
+ const size_t P1,
+ const size_t P2,
+ const size_t D,
+ const size_t K,
+ const size_t norm) {
+ // Store both dists and indices for knn in global memory.
+ const int64_t chunks_per_cloud = (1 + (P1 - 1) / blockDim.x);
+ const int64_t chunks_to_do = N * chunks_per_cloud;
+ for (int64_t chunk = blockIdx.x; chunk < chunks_to_do; chunk += gridDim.x) {
+ const int64_t n = chunk / chunks_per_cloud;
+ const int64_t start_point = blockDim.x * (chunk % chunks_per_cloud);
+ int64_t p1 = start_point + threadIdx.x;
+ if (p1 >= lengths1[n])
+ continue;
+ int offset = n * P1 * K + p1 * K;
+ int64_t length2 = lengths2[n];
+ MinK<scalar_t, int64_t> mink(dists + offset, idxs + offset, K);
+ for (int p2 = 0; p2 < length2; ++p2) {
+ // Find the distance between points1[n, p1] and points[n, p2]
+ scalar_t dist = 0;
+ for (int d = 0; d < D; ++d) {
+ scalar_t coord1 = points1[n * P1 * D + p1 * D + d];
+ scalar_t coord2 = points2[n * P2 * D + p2 * D + d];
+ scalar_t diff = coord1 - coord2;
+ scalar_t norm_diff = (norm == 2) ? (diff * diff) : abs(diff);
+ dist += norm_diff;
+ }
+ mink.add(dist, p2);
+ }
+ }
+}
+
+template <typename scalar_t, int64_t D>
+__global__ void KNearestNeighborKernelV1(
+ const scalar_t* __restrict__ points1,
+ const scalar_t* __restrict__ points2,
+ const int64_t* __restrict__ lengths1,
+ const int64_t* __restrict__ lengths2,
+ scalar_t* __restrict__ dists,
+ int64_t* __restrict__ idxs,
+ const size_t N,
+ const size_t P1,
+ const size_t P2,
+ const size_t K,
+ const size_t norm) {
+ // Same idea as the previous version, but hoist D into a template argument
+ // so we can cache the current point in a thread-local array. We still store
+ // the current best K dists and indices in global memory, so this should work
+ // for very large K and fairly large D.
+ scalar_t cur_point[D];
+ const int64_t chunks_per_cloud = (1 + (P1 - 1) / blockDim.x);
+ const int64_t chunks_to_do = N * chunks_per_cloud;
+ for (int64_t chunk = blockIdx.x; chunk < chunks_to_do; chunk += gridDim.x) {
+ const int64_t n = chunk / chunks_per_cloud;
+ const int64_t start_point = blockDim.x * (chunk % chunks_per_cloud);
+ int64_t p1 = start_point + threadIdx.x;
+ if (p1 >= lengths1[n])
+ continue;
+ for (int d = 0; d < D; ++d) {
+ cur_point[d] = points1[n * P1 * D + p1 * D + d];
+ }
+ int offset = n * P1 * K + p1 * K;
+ int64_t length2 = lengths2[n];
+ MinK<scalar_t, int64_t> mink(dists + offset, idxs + offset, K);
+ for (int p2 = 0; p2 < length2; ++p2) {
+ // Find the distance between cur_point and points[n, p2]
+ scalar_t dist = 0;
+ for (int d = 0; d < D; ++d) {
+ scalar_t diff = cur_point[d] - points2[n * P2 * D + p2 * D + d];
+ scalar_t norm_diff = (norm == 2) ? (diff * diff) : abs(diff);
+ dist += norm_diff;
+ }
+ mink.add(dist, p2);
+ }
+ }
+}
+
+// This is a shim functor to allow us to dispatch using DispatchKernel1D
+template <typename scalar_t, int64_t D>
+struct KNearestNeighborV1Functor {
+ static void run(
+ size_t blocks,
+ size_t threads,
+ const scalar_t* __restrict__ points1,
+ const scalar_t* __restrict__ points2,
+ const int64_t* __restrict__ lengths1,
+ const int64_t* __restrict__ lengths2,
+ scalar_t* __restrict__ dists,
+ int64_t* __restrict__ idxs,
+ const size_t N,
+ const size_t P1,
+ const size_t P2,
+ const size_t K,
+ const size_t norm) {
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+ KNearestNeighborKernelV1<scalar_t, D><<<blocks, threads, 0, stream>>>(
+ points1, points2, lengths1, lengths2, dists, idxs, N, P1, P2, K, norm);
+ }
+};
+
+template <typename scalar_t, int64_t D, int64_t K>
+__global__ void KNearestNeighborKernelV2(
+ const scalar_t* __restrict__ points1,
+ const scalar_t* __restrict__ points2,
+ const int64_t* __restrict__ lengths1,
+ const int64_t* __restrict__ lengths2,
+ scalar_t* __restrict__ dists,
+ int64_t* __restrict__ idxs,
+ const int64_t N,
+ const int64_t P1,
+ const int64_t P2,
+ const size_t norm) {
+ // Same general implementation as V1, but also hoist K into a template arg.
+ scalar_t cur_point[D];
+ scalar_t min_dists[K];
+ int min_idxs[K];
+ const int64_t chunks_per_cloud = (1 + (P1 - 1) / blockDim.x);
+ const int64_t chunks_to_do = N * chunks_per_cloud;
+ for (int64_t chunk = blockIdx.x; chunk < chunks_to_do; chunk += gridDim.x) {
+ const int64_t n = chunk / chunks_per_cloud;
+ const int64_t start_point = blockDim.x * (chunk % chunks_per_cloud);
+ int64_t p1 = start_point + threadIdx.x;
+ if (p1 >= lengths1[n])
+ continue;
+ for (int d = 0; d < D; ++d) {
+ cur_point[d] = points1[n * P1 * D + p1 * D + d];
+ }
+ int64_t length2 = lengths2[n];
+ MinK<scalar_t, int> mink(min_dists, min_idxs, K);
+ for (int p2 = 0; p2 < length2; ++p2) {
+ scalar_t dist = 0;
+ for (int d = 0; d < D; ++d) {
+ int offset = n * P2 * D + p2 * D + d;
+ scalar_t diff = cur_point[d] - points2[offset];
+ scalar_t norm_diff = (norm == 2) ? (diff * diff) : abs(diff);
+ dist += norm_diff;
+ }
+ mink.add(dist, p2);
+ }
+ for (int k = 0; k < mink.size(); ++k) {
+ idxs[n * P1 * K + p1 * K + k] = min_idxs[k];
+ dists[n * P1 * K + p1 * K + k] = min_dists[k];
+ }
+ }
+}
+
+// This is a shim so we can dispatch using DispatchKernel2D
+template <typename scalar_t, int64_t D, int64_t K>
+struct KNearestNeighborKernelV2Functor {
+ static void run(
+ size_t blocks,
+ size_t threads,
+ const scalar_t* __restrict__ points1,
+ const scalar_t* __restrict__ points2,
+ const int64_t* __restrict__ lengths1,
+ const int64_t* __restrict__ lengths2,
+ scalar_t* __restrict__ dists,
+ int64_t* __restrict__ idxs,
+ const int64_t N,
+ const int64_t P1,
+ const int64_t P2,
+ const size_t norm) {
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+ KNearestNeighborKernelV2<scalar_t, D, K><<<blocks, threads, 0, stream>>>(
+ points1, points2, lengths1, lengths2, dists, idxs, N, P1, P2, norm);
+ }
+};
+
+template <typename scalar_t, int64_t D, int64_t K>
+__global__ void KNearestNeighborKernelV3(
+ const scalar_t* __restrict__ points1,
+ const scalar_t* __restrict__ points2,
+ const int64_t* __restrict__ lengths1,
+ const int64_t* __restrict__ lengths2,
+ scalar_t* __restrict__ dists,
+ int64_t* __restrict__ idxs,
+ const size_t N,
+ const size_t P1,
+ const size_t P2,
+ const size_t norm) {
+ // Same idea as V2, but use register indexing for thread-local arrays.
+ // Enabling sorting for this version leads to huge slowdowns; I suspect
+ // that it forces min_dists into local memory rather than registers.
+ // As a result this version is always unsorted.
+ scalar_t cur_point[D];
+ scalar_t min_dists[K];
+ int min_idxs[K];
+ const int64_t chunks_per_cloud = (1 + (P1 - 1) / blockDim.x);
+ const int64_t chunks_to_do = N * chunks_per_cloud;
+ for (int64_t chunk = blockIdx.x; chunk < chunks_to_do; chunk += gridDim.x) {
+ const int64_t n = chunk / chunks_per_cloud;
+ const int64_t start_point = blockDim.x * (chunk % chunks_per_cloud);
+ int64_t p1 = start_point + threadIdx.x;
+ if (p1 >= lengths1[n])
+ continue;
+ for (int d = 0; d < D; ++d) {
+ cur_point[d] = points1[n * P1 * D + p1 * D + d];
+ }
+ int64_t length2 = lengths2[n];
+ RegisterMinK<scalar_t, int, K> mink(min_dists, min_idxs);
+ for (int p2 = 0; p2 < length2; ++p2) {
+ scalar_t dist = 0;
+ for (int d = 0; d < D; ++d) {
+ int offset = n * P2 * D + p2 * D + d;
+ scalar_t diff = cur_point[d] - points2[offset];
+ scalar_t norm_diff = (norm == 2) ? (diff * diff) : abs(diff);
+ dist += norm_diff;
+ }
+ mink.add(dist, p2);
+ }
+ for (int k = 0; k < mink.size(); ++k) {
+ idxs[n * P1 * K + p1 * K + k] = min_idxs[k];
+ dists[n * P1 * K + p1 * K + k] = min_dists[k];
+ }
+ }
+}
+
+// This is a shim so we can dispatch using DispatchKernel2D
+template <typename scalar_t, int64_t D, int64_t K>
+struct KNearestNeighborKernelV3Functor {
+ static void run(
+ size_t blocks,
+ size_t threads,
+ const scalar_t* __restrict__ points1,
+ const scalar_t* __restrict__ points2,
+ const int64_t* __restrict__ lengths1,
+ const int64_t* __restrict__ lengths2,
+ scalar_t* __restrict__ dists,
+ int64_t* __restrict__ idxs,
+ const size_t N,
+ const size_t P1,
+ const size_t P2,
+ const size_t norm) {
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+ KNearestNeighborKernelV3<scalar_t, D, K><<<blocks, threads, 0, stream>>>(
+ points1, points2, lengths1, lengths2, dists, idxs, N, P1, P2, norm);
+ }
+};
+
+constexpr int V1_MIN_D = 1;
+constexpr int V1_MAX_D = 32;
+
+constexpr int V2_MIN_D = 1;
+constexpr int V2_MAX_D = 8;
+constexpr int V2_MIN_K = 1;
+constexpr int V2_MAX_K = 32;
+
+constexpr int V3_MIN_D = 1;
+constexpr int V3_MAX_D = 8;
+constexpr int V3_MIN_K = 1;
+constexpr int V3_MAX_K = 4;
+
+bool InBounds(const int64_t min, const int64_t x, const int64_t max) {
+ return min <= x && x <= max;
+}
+
+bool KnnCheckVersion(int version, const int64_t D, const int64_t K) {
+ if (version == 0) {
+ return true;
+ } else if (version == 1) {
+ return InBounds(V1_MIN_D, D, V1_MAX_D);
+ } else if (version == 2) {
+ return InBounds(V2_MIN_D, D, V2_MAX_D) && InBounds(V2_MIN_K, K, V2_MAX_K);
+ } else if (version == 3) {
+ return InBounds(V3_MIN_D, D, V3_MAX_D) && InBounds(V3_MIN_K, K, V3_MAX_K);
+ }
+ return false;
+}
+
+int ChooseVersion(const int64_t D, const int64_t K) {
+ for (int version = 3; version >= 1; version--) {
+ if (KnnCheckVersion(version, D, K)) {
+ return version;
+ }
+ }
+ return 0;
+}
+
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborIdxCuda(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const int norm,
+ const int K,
+ int version) {
+ // Check inputs are on the same device
+ at::TensorArg p1_t{p1, "p1", 1}, p2_t{p2, "p2", 2},
+ lengths1_t{lengths1, "lengths1", 3}, lengths2_t{lengths2, "lengths2", 4};
+ at::CheckedFrom c = "KNearestNeighborIdxCuda";
+ at::checkAllSameGPU(c, {p1_t, p2_t, lengths1_t, lengths2_t});
+ at::checkAllSameType(c, {p1_t, p2_t});
+
+ // Set the device for the kernel launch based on the device of the input
+ at::cuda::CUDAGuard device_guard(p1.device());
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+
+ const auto N = p1.size(0);
+ const auto P1 = p1.size(1);
+ const auto P2 = p2.size(1);
+ const auto D = p2.size(2);
+ const int64_t K_64 = K;
+
+ TORCH_CHECK((norm == 1) || (norm == 2), "Norm must be 1 or 2.");
+
+ TORCH_CHECK(p2.size(2) == D, "Point sets must have the same last dimension");
+ auto long_dtype = lengths1.options().dtype(at::kLong);
+ auto idxs = at::zeros({N, P1, K}, long_dtype);
+ auto dists = at::zeros({N, P1, K}, p1.options());
+
+ if (idxs.numel() == 0) {
+ AT_CUDA_CHECK(cudaGetLastError());
+ return std::make_tuple(idxs, dists);
+ }
+
+ if (version < 0) {
+ version = ChooseVersion(D, K);
+ } else if (!KnnCheckVersion(version, D, K)) {
+ int new_version = ChooseVersion(D, K);
+ std::cout << "WARNING: Requested KNN version " << version
+ << " is not compatible with D = " << D << "; K = " << K
+ << ". Falling back to version = " << new_version << std::endl;
+ version = new_version;
+ }
+
+ // At this point we should have a valid version no matter what data the user
+ // gave us. But we can check once more to be sure; however this time
+ // assert fail since failing at this point means we have a bug in our version
+ // selection or checking code.
+ AT_ASSERTM(KnnCheckVersion(version, D, K), "Invalid version");
+
+ const size_t threads = 256;
+ const size_t blocks = 256;
+ if (version == 0) {
+ AT_DISPATCH_FLOATING_TYPES(
+ p1.scalar_type(), "knn_kernel_cuda", ([&] {
+ KNearestNeighborKernelV0<scalar_t><<<blocks, threads, 0, stream>>>(
+ p1.contiguous().data_ptr<scalar_t>(),
+ p2.contiguous().data_ptr<scalar_t>(),
+ lengths1.contiguous().data_ptr<int64_t>(),
+ lengths2.contiguous().data_ptr<int64_t>(),
+ dists.data_ptr<scalar_t>(),
+ idxs.data_ptr<int64_t>(),
+ N,
+ P1,
+ P2,
+ D,
+ K,
+ norm);
+ }));
+ } else if (version == 1) {
+ AT_DISPATCH_FLOATING_TYPES(p1.scalar_type(), "knn_kernel_cuda", ([&] {
+ DispatchKernel1D<
+ KNearestNeighborV1Functor,
+ scalar_t,
+ V1_MIN_D,
+ V1_MAX_D>(
+ D,
+ blocks,
+ threads,
+ p1.contiguous().data_ptr<scalar_t>(),
+ p2.contiguous().data_ptr<scalar_t>(),
+ lengths1.contiguous().data_ptr<int64_t>(),
+ lengths2.contiguous().data_ptr<int64_t>(),
+ dists.data_ptr<scalar_t>(),
+ idxs.data_ptr<int64_t>(),
+ N,
+ P1,
+ P2,
+ K,
+ norm);
+ }));
+ } else if (version == 2) {
+ AT_DISPATCH_FLOATING_TYPES(p1.scalar_type(), "knn_kernel_cuda", ([&] {
+ DispatchKernel2D<
+ KNearestNeighborKernelV2Functor,
+ scalar_t,
+ V2_MIN_D,
+ V2_MAX_D,
+ V2_MIN_K,
+ V2_MAX_K>(
+ D,
+ K_64,
+ blocks,
+ threads,
+ p1.contiguous().data_ptr<scalar_t>(),
+ p2.contiguous().data_ptr<scalar_t>(),
+ lengths1.contiguous().data_ptr<int64_t>(),
+ lengths2.contiguous().data_ptr<int64_t>(),
+ dists.data_ptr<scalar_t>(),
+ idxs.data_ptr<int64_t>(),
+ N,
+ P1,
+ P2,
+ norm);
+ }));
+ } else if (version == 3) {
+ AT_DISPATCH_FLOATING_TYPES(p1.scalar_type(), "knn_kernel_cuda", ([&] {
+ DispatchKernel2D<
+ KNearestNeighborKernelV3Functor,
+ scalar_t,
+ V3_MIN_D,
+ V3_MAX_D,
+ V3_MIN_K,
+ V3_MAX_K>(
+ D,
+ K_64,
+ blocks,
+ threads,
+ p1.contiguous().data_ptr<scalar_t>(),
+ p2.contiguous().data_ptr<scalar_t>(),
+ lengths1.contiguous().data_ptr<int64_t>(),
+ lengths2.contiguous().data_ptr<int64_t>(),
+ dists.data_ptr<scalar_t>(),
+ idxs.data_ptr<int64_t>(),
+ N,
+ P1,
+ P2,
+ norm);
+ }));
+ }
+ AT_CUDA_CHECK(cudaGetLastError());
+ return std::make_tuple(idxs, dists);
+}
+
+// ------------------------------------------------------------- //
+// Backward Operators //
+// ------------------------------------------------------------- //
+
+// TODO(gkioxari) support all data types once AtomicAdd supports doubles.
+// Currently, support is for floats only.
+__global__ void KNearestNeighborBackwardKernel(
+ const float* __restrict__ p1, // (N, P1, D)
+ const float* __restrict__ p2, // (N, P2, D)
+ const int64_t* __restrict__ lengths1, // (N,)
+ const int64_t* __restrict__ lengths2, // (N,)
+ const int64_t* __restrict__ idxs, // (N, P1, K)
+ const float* __restrict__ grad_dists, // (N, P1, K)
+ float* __restrict__ grad_p1, // (N, P1, D)
+ float* __restrict__ grad_p2, // (N, P2, D)
+ const size_t N,
+ const size_t P1,
+ const size_t P2,
+ const size_t K,
+ const size_t D,
+ const size_t norm) {
+ const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
+ const size_t stride = gridDim.x * blockDim.x;
+
+ for (size_t i = tid; i < N * P1 * K * D; i += stride) {
+ const size_t n = i / (P1 * K * D); // batch index
+ size_t rem = i % (P1 * K * D);
+ const size_t p1_idx = rem / (K * D); // index of point in p1
+ rem = rem % (K * D);
+ const size_t k = rem / D; // k-th nearest neighbor
+ const size_t d = rem % D; // d-th dimension in the feature vector
+
+ const size_t num1 = lengths1[n]; // number of valid points in p1 in batch
+ const size_t num2 = lengths2[n]; // number of valid points in p2 in batch
+ if ((p1_idx < num1) && (k < num2)) {
+ const float grad_dist = grad_dists[n * P1 * K + p1_idx * K + k];
+ // index of point in p2 corresponding to the k-th nearest neighbor
+ const size_t p2_idx = idxs[n * P1 * K + p1_idx * K + k];
+ // If the index is the pad value of -1 then ignore it
+ if (p2_idx == -1) {
+ continue;
+ }
+ float diff = 0.0;
+ if (norm == 1) {
+ float sign =
+ (p1[n * P1 * D + p1_idx * D + d] > p2[n * P2 * D + p2_idx * D + d])
+ ? 1.0
+ : -1.0;
+ diff = grad_dist * sign;
+ } else { // norm is 2
+ diff = 2.0 * grad_dist *
+ (p1[n * P1 * D + p1_idx * D + d] - p2[n * P2 * D + p2_idx * D + d]);
+ }
+ atomicAdd(grad_p1 + n * P1 * D + p1_idx * D + d, diff);
+ atomicAdd(grad_p2 + n * P2 * D + p2_idx * D + d, -1.0f * diff);
+ }
+ }
+}
+
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborBackwardCuda(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const at::Tensor& idxs,
+ int norm,
+ const at::Tensor& grad_dists) {
+ // Check inputs are on the same device
+ at::TensorArg p1_t{p1, "p1", 1}, p2_t{p2, "p2", 2},
+ lengths1_t{lengths1, "lengths1", 3}, lengths2_t{lengths2, "lengths2", 4},
+ idxs_t{idxs, "idxs", 5}, grad_dists_t{grad_dists, "grad_dists", 6};
+ at::CheckedFrom c = "KNearestNeighborBackwardCuda";
+ at::checkAllSameGPU(
+ c, {p1_t, p2_t, lengths1_t, lengths2_t, idxs_t, grad_dists_t});
+ at::checkAllSameType(c, {p1_t, p2_t, grad_dists_t});
+
+ // Set the device for the kernel launch based on the device of the input
+ at::cuda::CUDAGuard device_guard(p1.device());
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+
+ const auto N = p1.size(0);
+ const auto P1 = p1.size(1);
+ const auto P2 = p2.size(1);
+ const auto D = p2.size(2);
+ const auto K = idxs.size(2);
+
+ TORCH_CHECK(p1.size(2) == D, "Point sets must have the same last dimension");
+ TORCH_CHECK(idxs.size(0) == N, "KNN idxs must have the same batch dimension");
+ TORCH_CHECK(
+ idxs.size(1) == P1, "KNN idxs must have the same point dimension as p1");
+ TORCH_CHECK(grad_dists.size(0) == N);
+ TORCH_CHECK(grad_dists.size(1) == P1);
+ TORCH_CHECK(grad_dists.size(2) == K);
+
+ auto grad_p1 = at::zeros({N, P1, D}, p1.options());
+ auto grad_p2 = at::zeros({N, P2, D}, p2.options());
+
+ if (grad_p1.numel() == 0 || grad_p2.numel() == 0) {
+ AT_CUDA_CHECK(cudaGetLastError());
+ return std::make_tuple(grad_p1, grad_p2);
+ }
+
+ const int blocks = 64;
+ const int threads = 512;
+
+ KNearestNeighborBackwardKernel<<<blocks, threads, 0, stream>>>(
+ p1.contiguous().data_ptr<float>(),
+ p2.contiguous().data_ptr<float>(),
+ lengths1.contiguous().data_ptr<int64_t>(),
+ lengths2.contiguous().data_ptr<int64_t>(),
+ idxs.contiguous().data_ptr<int64_t>(),
+ grad_dists.contiguous().data_ptr<float>(),
+ grad_p1.data_ptr<float>(),
+ grad_p2.data_ptr<float>(),
+ N,
+ P1,
+ P2,
+ K,
+ D,
+ norm);
+
+ AT_CUDA_CHECK(cudaGetLastError());
+ return std::make_tuple(grad_p1, grad_p2);
+}
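
The chunk-scheduling comment at the top of `knn.cu` is easy to misread, so here is the same index arithmetic written out in plain Python: each chunk covers `blockDim.x` points of one cloud, and block `b` walks chunks `b, b + gridSize, b + 2 * gridSize, ...`. This is only an illustration of the mapping, not part of the extension:

```python
# Illustration of the chunk -> (cloud, start_point) mapping used by the kernels above.
def chunk_assignment(N, P1, block_dim, grid_size):
    chunks_per_cloud = 1 + (P1 - 1) // block_dim
    chunks_to_do = N * chunks_per_cloud
    for block in range(grid_size):
        for chunk in range(block, chunks_to_do, grid_size):
            cloud = chunk // chunks_per_cloud
            start_point = block_dim * (chunk % chunks_per_cloud)
            yield block, cloud, start_point

# e.g. 2 clouds of 1000 points, blockDim.x = 256, gridSize = 3:
for block, cloud, start in chunk_assignment(2, 1000, 256, 3):
    pass  # block `block` handles points [start, start + 256) of cloud `cloud`
```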
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn.h b/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn.h
new file mode 100644
index 0000000000000000000000000000000000000000..c27126cf52ac273f8a46313571648cb7b1fdf1f5
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn.h
@@ -0,0 +1,157 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#pragma once
+#include <torch/extension.h>
+#include <tuple>
+#include "utils/pytorch3d_cutils.h"
+
+// Compute indices of K nearest neighbors in pointcloud p2 to points
+// in pointcloud p1.
+//
+// Args:
+// p1: FloatTensor of shape (N, P1, D) giving a batch of pointclouds each
+// containing P1 points of dimension D.
+// p2: FloatTensor of shape (N, P2, D) giving a batch of pointclouds each
+// containing P2 points of dimension D.
+// lengths1: LongTensor, shape (N,), giving actual length of each P1 cloud.
+// lengths2: LongTensor, shape (N,), giving actual length of each P2 cloud.
+// norm: int specifying the norm for the distance (1 for L1, 2 for L2)
+// K: int giving the number of nearest points to return.
+// version: Integer telling which implementation to use.
+//
+// Returns:
+// p1_neighbor_idx: LongTensor of shape (N, P1, K), where
+// p1_neighbor_idx[n, i, k] = j means that the kth nearest
+// neighbor to p1[n, i] in the cloud p2[n] is p2[n, j].
+// It is padded with zeros so that it can be used easily in a later
+// gather() operation.
+//
+// p1_neighbor_dists: FloatTensor of shape (N, P1, K) containing the squared
+// distance from each point p1[n, p, :] to its K neighbors
+// p2[n, p1_neighbor_idx[n, p, k], :].
+
+// CPU implementation.
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborIdxCpu(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const int norm,
+ const int K);
+
+// CUDA implementation
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborIdxCuda(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const int norm,
+ const int K,
+ const int version);
+
+// Implementation which is exposed.
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborIdx(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const int norm,
+ const int K,
+ const int version) {
+ if (p1.is_cuda() || p2.is_cuda()) {
+#ifdef WITH_CUDA
+ CHECK_CUDA(p1);
+ CHECK_CUDA(p2);
+ return KNearestNeighborIdxCuda(
+ p1, p2, lengths1, lengths2, norm, K, version);
+#else
+ AT_ERROR("Not compiled with GPU support.");
+#endif
+ }
+ return KNearestNeighborIdxCpu(p1, p2, lengths1, lengths2, norm, K);
+}
+
+// Compute gradients with respect to p1 and p2
+//
+// Args:
+// p1: FloatTensor of shape (N, P1, D) giving a batch of pointclouds each
+// containing P1 points of dimension D.
+// p2: FloatTensor of shape (N, P2, D) giving a batch of pointclouds each
+// containing P2 points of dimension D.
+// lengths1: LongTensor, shape (N,), giving actual length of each P1 cloud.
+// lengths2: LongTensor, shape (N,), giving actual length of each P2 cloud.
+// p1_neighbor_idx: LongTensor of shape (N, P1, K), where
+// p1_neighbor_idx[n, i, k] = j means that the kth nearest
+// neighbor to p1[n, i] in the cloud p2[n] is p2[n, j].
+// It is padded with zeros so that it can be used easily in a later
+// gather() operation. This is computed from the forward pass.
+// norm: int specifying the norm for the distance (1 for L1, 2 for L2)
+// grad_dists: FloatTensor of shape (N, P1, K) which contains the input
+// gradients.
+//
+// Returns:
+// grad_p1: FloatTensor of shape (N, P1, D) containing the output gradients
+// wrt p1.
+// grad_p2: FloatTensor of shape (N, P2, D) containing the output gradients
+// wrt p2.
+
+// CPU implementation.
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborBackwardCpu(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const at::Tensor& idxs,
+ const int norm,
+ const at::Tensor& grad_dists);
+
+// CUDA implementation
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborBackwardCuda(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const at::Tensor& idxs,
+ const int norm,
+ const at::Tensor& grad_dists);
+
+// Implementation which is exposed.
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborBackward(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const at::Tensor& idxs,
+ const int norm,
+ const at::Tensor& grad_dists) {
+ if (p1.is_cuda() || p2.is_cuda()) {
+#ifdef WITH_CUDA
+ CHECK_CUDA(p1);
+ CHECK_CUDA(p2);
+ return KNearestNeighborBackwardCuda(
+ p1, p2, lengths1, lengths2, idxs, norm, grad_dists);
+#else
+ AT_ERROR("Not compiled with GPU support.");
+#endif
+ }
+ return KNearestNeighborBackwardCpu(
+ p1, p2, lengths1, lengths2, idxs, norm, grad_dists);
+}
+
+// Utility to check whether a KNN version can be used.
+//
+// Args:
+// version: Integer in the range 0 <= version <= 3 indicating one of our
+// KNN implementations.
+// D: Number of dimensions for the input and query point clouds
+// K: Number of neighbors to be found
+//
+// Returns:
+// Whether the indicated KNN version can be used.
+bool KnnCheckVersion(int version, const int64_t D, const int64_t K);
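
The doc comment above fully specifies what `KNearestNeighborIdx` returns, so a brute-force PyTorch reference makes a handy sanity check. The sketch below reproduces the documented semantics for `norm=2` and full-length clouds; it is not the extension API itself (the compiled op is reached through the `knn_points` wrapper imported in `dot/utils/torch3d/__init__.py`):

```python
# Brute-force reference matching the documented output of KNearestNeighborIdx
# for norm=2 and clouds with no padding.
import torch

def knn_reference(p1, p2, K):
    # p1: (N, P1, D), p2: (N, P2, D)
    d2 = torch.cdist(p1, p2) ** 2                      # squared L2, (N, P1, P2)
    dists, idxs = d2.topk(K, dim=-1, largest=False)    # K smallest per p1 point
    return idxs, dists                                 # both (N, P1, K)

p1, p2 = torch.rand(2, 128, 3), torch.rand(2, 256, 3)
idxs, dists = knn_reference(p1, p2, K=4)
print(idxs.shape, dists.shape)
```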
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn_cpu.cpp b/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn_cpu.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..896e6f6ab2c952f214fb537cadd10423a8fb1663
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/knn/knn_cpu.cpp
@@ -0,0 +1,128 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#include <torch/extension.h>
+#include <queue>
+#include <tuple>
+
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborIdxCpu(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const int norm,
+ const int K) {
+ const int N = p1.size(0);
+ const int P1 = p1.size(1);
+ const int D = p1.size(2);
+
+ auto long_opts = lengths1.options().dtype(torch::kInt64);
+ torch::Tensor idxs = torch::full({N, P1, K}, 0, long_opts);
+ torch::Tensor dists = torch::full({N, P1, K}, 0, p1.options());
+
+ auto p1_a = p1.accessor<float, 3>();
+ auto p2_a = p2.accessor<float, 3>();
+ auto lengths1_a = lengths1.accessor<int64_t, 1>();
+ auto lengths2_a = lengths2.accessor<int64_t, 1>();
+ auto idxs_a = idxs.accessor<int64_t, 3>();
+ auto dists_a = dists.accessor<float, 3>();
+
+ for (int n = 0; n < N; ++n) {
+ const int64_t length1 = lengths1_a[n];
+ const int64_t length2 = lengths2_a[n];
+ for (int64_t i1 = 0; i1 < length1; ++i1) {
+ // Use a priority queue to store (distance, index) tuples.
+ std::priority_queue<std::tuple<float, int64_t>> q;
+ for (int64_t i2 = 0; i2 < length2; ++i2) {
+ float dist = 0;
+ for (int d = 0; d < D; ++d) {
+ float diff = p1_a[n][i1][d] - p2_a[n][i2][d];
+ if (norm == 1) {
+ dist += abs(diff);
+ } else { // norm is 2 (default)
+ dist += diff * diff;
+ }
+ }
+ int size = static_cast<int>(q.size());
+ if (size < K || dist < std::get<0>(q.top())) {
+ q.emplace(dist, i2);
+ if (size >= K) {
+ q.pop();
+ }
+ }
+ }
+ while (!q.empty()) {
+ auto t = q.top();
+ q.pop();
+ const int k = q.size();
+ dists_a[n][i1][k] = std::get<0>(t);
+ idxs_a[n][i1][k] = std::get<1>(t);
+ }
+ }
+ }
+ return std::make_tuple(idxs, dists);
+}
+
+// ------------------------------------------------------------- //
+// Backward Operators //
+// ------------------------------------------------------------- //
+
+std::tuple<at::Tensor, at::Tensor> KNearestNeighborBackwardCpu(
+ const at::Tensor& p1,
+ const at::Tensor& p2,
+ const at::Tensor& lengths1,
+ const at::Tensor& lengths2,
+ const at::Tensor& idxs,
+ const int norm,
+ const at::Tensor& grad_dists) {
+ const int N = p1.size(0);
+ const int P1 = p1.size(1);
+ const int D = p1.size(2);
+ const int P2 = p2.size(1);
+ const int K = idxs.size(2);
+
+ torch::Tensor grad_p1 = torch::full({N, P1, D}, 0, p1.options());
+ torch::Tensor grad_p2 = torch::full({N, P2, D}, 0, p2.options());
+
+ auto p1_a = p1.accessor<float, 3>();
+ auto p2_a = p2.accessor<float, 3>();
+ auto lengths1_a = lengths1.accessor<int64_t, 1>();
+ auto lengths2_a = lengths2.accessor<int64_t, 1>();
+ auto idxs_a = idxs.accessor<int64_t, 3>();
+ auto grad_dists_a = grad_dists.accessor<float, 3>();
+ auto grad_p1_a = grad_p1.accessor<float, 3>();
+ auto grad_p2_a = grad_p2.accessor<float, 3>();
+
+ for (int n = 0; n < N; ++n) {
+ const int64_t length1 = lengths1_a[n];
+ int64_t length2 = lengths2_a[n];
+ length2 = (length2 < K) ? length2 : K;
+ for (int64_t i1 = 0; i1 < length1; ++i1) {
+ for (int64_t k = 0; k < length2; ++k) {
+ const int64_t i2 = idxs_a[n][i1][k];
+ // If the index is the pad value of -1 then ignore it
+ if (i2 == -1) {
+ continue;
+ }
+ for (int64_t d = 0; d < D; ++d) {
+ float diff = 0.0;
+ if (norm == 1) {
+ float sign = (p1_a[n][i1][d] > p2_a[n][i2][d]) ? 1.0 : -1.0;
+ diff = grad_dists_a[n][i1][k] * sign;
+ } else { // norm is 2 (default)
+ diff = 2.0f * grad_dists_a[n][i1][k] *
+ (p1_a[n][i1][d] - p2_a[n][i2][d]);
+ }
+ grad_p1_a[n][i1][d] += diff;
+ grad_p2_a[n][i2][d] += -1.0f * diff;
+ }
+ }
+ }
+ }
+ return std::make_tuple(grad_p1, grad_p2);
+}
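
The backward pass above hand-codes the gradient of the squared distance: for `norm=2`, `d(dist)/d(p1) = 2 * grad_dist * (p1 - p2)` and the negative of that for `p2`; for `norm=1` it is the sign times `grad_dist`. A tiny autograd check of the L2 rule on a single point pair (illustrative only):

```python
# Autograd check of the L2 backward rule implemented in KNearestNeighborBackwardCpu.
import torch

p1 = torch.rand(3, requires_grad=True)
p2 = torch.rand(3, requires_grad=True)
dist = ((p1 - p2) ** 2).sum()     # squared L2 distance, as in the forward pass
dist.backward()
assert torch.allclose(p1.grad, 2 * (p1 - p2).detach())
assert torch.allclose(p2.grad, -2 * (p1 - p2).detach())
```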
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.cu b/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.cu
new file mode 100644
index 0000000000000000000000000000000000000000..24c05ad54eef1ae1fa53f8caf2c8022832214a55
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.cu
@@ -0,0 +1,241 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#include <ATen/ATen.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <c10/cuda/CUDAGuard.h>
+
+// Kernel for inputs_packed of shape (F, D), where D > 1
+template <typename scalar_t>
+__global__ void PackedToPaddedKernel(
+ const scalar_t* __restrict__ inputs_packed,
+ const int64_t* __restrict__ first_idxs,
+ scalar_t* __restrict__ inputs_padded,
+ const size_t batch_size,
+ const size_t max_size,
+ const size_t num_inputs,
+ const size_t D) {
+ // Batch elements split evenly across blocks (num blocks = batch_size) and
+ // values for each element split across threads in the block. Each thread adds
+ // the values of its respective input elements to the global inputs_padded
+ // tensor.
+ const size_t tid = threadIdx.x;
+ const size_t batch_idx = blockIdx.x;
+
+ const int64_t start = first_idxs[batch_idx];
+ const int64_t end =
+ batch_idx + 1 < batch_size ? first_idxs[batch_idx + 1] : num_inputs;
+ const int num = end - start;
+ for (size_t f = tid; f < num; f += blockDim.x) {
+ for (size_t j = 0; j < D; ++j) {
+ inputs_padded[batch_idx * max_size * D + f * D + j] =
+ inputs_packed[(start + f) * D + j];
+ }
+ }
+}
+
+// Kernel for inputs of shape (F, 1)
+template <typename scalar_t>
+__global__ void PackedToPaddedKernelD1(
+ const scalar_t* __restrict__ inputs_packed,
+ const int64_t* __restrict__ first_idxs,
+ scalar_t* __restrict__ inputs_padded,
+ const size_t batch_size,
+ const size_t max_size,
+ const size_t num_inputs) {
+ // Batch elements split evenly across blocks (num blocks = batch_size) and
+ // values for each element split across threads in the block. Each thread adds
+ // the values of its respective input elements to the global inputs_padded
+ // tensor.
+ const size_t tid = threadIdx.x;
+ const size_t batch_idx = blockIdx.x;
+
+ const int64_t start = first_idxs[batch_idx];
+ const int64_t end =
+ batch_idx + 1 < batch_size ? first_idxs[batch_idx + 1] : num_inputs;
+ const int num = end - start;
+ for (size_t f = tid; f < num; f += blockDim.x) {
+ inputs_padded[batch_idx * max_size + f] = inputs_packed[start + f];
+ }
+}
+
+// Kernel for inputs_padded of shape (B, F, D), where D > 1
+template <typename scalar_t>
+__global__ void PaddedToPackedKernel(
+ const scalar_t* __restrict__ inputs_padded,
+ const int64_t* __restrict__ first_idxs,
+ scalar_t* __restrict__ inputs_packed,
+ const size_t batch_size,
+ const size_t max_size,
+ const size_t num_inputs,
+ const size_t D) {
+ // Batch elements split evenly across blocks (num blocks = batch_size) and
+ // values for each element split across threads in the block. Each thread adds
+ // the values of its respective input elements to the global inputs_packed
+ // tensor.
+ const size_t tid = threadIdx.x;
+ const size_t batch_idx = blockIdx.x;
+
+ const int64_t start = first_idxs[batch_idx];
+ const int64_t end =
+ batch_idx + 1 < batch_size ? first_idxs[batch_idx + 1] : num_inputs;
+ const int num = end - start;
+ for (size_t f = tid; f < num; f += blockDim.x) {
+ for (size_t j = 0; j < D; ++j) {
+ inputs_packed[(start + f) * D + j] =
+ inputs_padded[batch_idx * max_size * D + f * D + j];
+ }
+ }
+}
+
+// Kernel for inputs_padded of shape (B, F, 1)
+template <typename scalar_t>
+__global__ void PaddedToPackedKernelD1(
+ const scalar_t* __restrict__ inputs_padded,
+ const int64_t* __restrict__ first_idxs,
+ scalar_t* __restrict__ inputs_packed,
+ const size_t batch_size,
+ const size_t max_size,
+ const size_t num_inputs) {
+ // Batch elements split evenly across blocks (num blocks = batch_size) and
+ // values for each element split across threads in the block. Each thread adds
+ // the values of its respective input elements to the global inputs_packed
+ // tensor.
+ const size_t tid = threadIdx.x;
+ const size_t batch_idx = blockIdx.x;
+
+ const int64_t start = first_idxs[batch_idx];
+ const int64_t end =
+ batch_idx + 1 < batch_size ? first_idxs[batch_idx + 1] : num_inputs;
+ const int num = end - start;
+ for (size_t f = tid; f < num; f += blockDim.x) {
+ inputs_packed[start + f] = inputs_padded[batch_idx * max_size + f];
+ }
+}
+
+at::Tensor PackedToPaddedCuda(
+ const at::Tensor inputs_packed,
+ const at::Tensor first_idxs,
+ const int64_t max_size) {
+ // Check inputs are on the same device
+ at::TensorArg inputs_packed_t{inputs_packed, "inputs_packed", 1},
+ first_idxs_t{first_idxs, "first_idxs", 2};
+ at::CheckedFrom c = "PackedToPaddedCuda";
+ at::checkAllSameGPU(c, {inputs_packed_t, first_idxs_t});
+
+ // Set the device for the kernel launch based on the device of the input
+ at::cuda::CUDAGuard device_guard(inputs_packed.device());
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+
+ const int64_t num_inputs = inputs_packed.size(0);
+ const int64_t batch_size = first_idxs.size(0);
+
+ TORCH_CHECK(
+ inputs_packed.dim() == 2, "inputs_packed must be a 2-dimensional tensor");
+ const int64_t D = inputs_packed.size(1);
+ at::Tensor inputs_padded =
+ at::zeros({batch_size, max_size, D}, inputs_packed.options());
+
+ if (inputs_padded.numel() == 0) {
+ AT_CUDA_CHECK(cudaGetLastError());
+ return inputs_padded;
+ }
+
+ const int threads = 512;
+ const int blocks = batch_size;
+ if (D == 1) {
+ AT_DISPATCH_FLOATING_TYPES(
+ inputs_packed.scalar_type(), "packed_to_padded_d1_kernel", ([&] {
+ PackedToPaddedKernelD1<scalar_t><<<blocks, threads, 0, stream>>>(
+ inputs_packed.contiguous().data_ptr<scalar_t>(),
+ first_idxs.contiguous().data_ptr<int64_t>(),
+ inputs_padded.data_ptr<scalar_t>(),
+ batch_size,
+ max_size,
+ num_inputs);
+ }));
+ } else {
+ AT_DISPATCH_FLOATING_TYPES(
+ inputs_packed.scalar_type(), "packed_to_padded_kernel", ([&] {
+ PackedToPaddedKernel<scalar_t><<<blocks, threads, 0, stream>>>(
+ inputs_packed.contiguous().data_ptr<scalar_t>(),
+ first_idxs.contiguous().data_ptr<int64_t>(),
+ inputs_padded.data_ptr<scalar_t>(),
+ batch_size,
+ max_size,
+ num_inputs,
+ D);
+ }));
+ }
+
+ AT_CUDA_CHECK(cudaGetLastError());
+ return inputs_padded;
+}
+
+at::Tensor PaddedToPackedCuda(
+ const at::Tensor inputs_padded,
+ const at::Tensor first_idxs,
+ const int64_t num_inputs) {
+ // Check inputs are on the same device
+ at::TensorArg inputs_padded_t{inputs_padded, "inputs_padded", 1},
+ first_idxs_t{first_idxs, "first_idxs", 2};
+ at::CheckedFrom c = "PaddedToPackedCuda";
+ at::checkAllSameGPU(c, {inputs_padded_t, first_idxs_t});
+
+ // Set the device for the kernel launch based on the device of the input
+ at::cuda::CUDAGuard device_guard(inputs_padded.device());
+ cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+
+ const int64_t batch_size = inputs_padded.size(0);
+ const int64_t max_size = inputs_padded.size(1);
+
+ TORCH_CHECK(batch_size == first_idxs.size(0), "sizes mismatch");
+ TORCH_CHECK(
+ inputs_padded.dim() == 3,
+ "inputs_padded must be a 3-dimensional tensor");
+ const int64_t D = inputs_padded.size(2);
+
+ at::Tensor inputs_packed =
+ at::zeros({num_inputs, D}, inputs_padded.options());
+
+ if (inputs_packed.numel() == 0) {
+ AT_CUDA_CHECK(cudaGetLastError());
+ return inputs_packed;
+ }
+
+ const int threads = 512;
+ const int blocks = batch_size;
+
+ if (D == 1) {
+ AT_DISPATCH_FLOATING_TYPES(
+ inputs_padded.scalar_type(), "padded_to_packed_d1_kernel", ([&] {
+ PaddedToPackedKernelD1<scalar_t><<<blocks, threads, 0, stream>>>(
+ inputs_padded.contiguous().data_ptr<scalar_t>(),
+ first_idxs.contiguous().data_ptr<int64_t>(),
+ inputs_packed.data_ptr<scalar_t>(),
+ batch_size,
+ max_size,
+ num_inputs);
+ }));
+ } else {
+ AT_DISPATCH_FLOATING_TYPES(
+ inputs_padded.scalar_type(), "padded_to_packed_kernel", ([&] {
+ PaddedToPackedKernel<scalar_t><<<blocks, threads, 0, stream>>>(
+ inputs_padded.contiguous().data_ptr<scalar_t>(),
+ first_idxs.contiguous().data_ptr<int64_t>(),
+ inputs_packed.data_ptr<scalar_t>(),
+ batch_size,
+ max_size,
+ num_inputs,
+ D);
+ }));
+ }
+
+ AT_CUDA_CHECK(cudaGetLastError());
+ return inputs_packed;
+}
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.h b/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.h
new file mode 100644
index 0000000000000000000000000000000000000000..97ad2e3ee95acff21106a9f5312c6ccf1f095e45
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor.h
@@ -0,0 +1,109 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#pragma once
+#include <torch/extension.h>
+#include "utils/pytorch3d_cutils.h"
+
+// PackedToPadded
+// Converts a packed tensor into a padded tensor, restoring the batch dimension.
+// Refer to pytorch3d/structures/meshes.py for details on packed/padded tensors.
+//
+// Inputs:
+// inputs_packed: FloatTensor of shape (F, D), representing the packed batch
+// tensor, e.g. areas for faces in a batch of meshes.
+// first_idxs: LongTensor of shape (N,) where N is the number of
+// elements in the batch and `first_idxs[i] = f`
+// means that the inputs for batch element i begin at
+// `inputs[f]`.
+// max_size: Max length of an element in the batch.
+// Returns:
+// inputs_padded: FloatTensor of shape (N, max_size, D) where max_size is max
+// of `sizes`. The values for batch element i which start at
+// `inputs_packed[first_idxs[i]]` will be copied to
+// `inputs_padded[i, :]`, with zeros padding out the extra
+// inputs.
+//
+
+// PaddedToPacked
+// Converts a padded tensor into a packed tensor.
+// Refer to pytorch3d/structures/meshes.py for details on packed/padded tensors.
+//
+// Inputs:
+// inputs_padded: FloatTensor of shape (N, max_size, D), representing the
+// padded tensor, e.g. areas for faces in a batch of meshes.
+// first_idxs: LongTensor of shape (N,) where N is the number of
+// elements in the batch and `first_idxs[i] = f`
+// means that the inputs for batch element i begin at
+// `inputs_packed[f]`.
+// num_inputs: Number of packed entries (= F)
+// Returns:
+// inputs_packed: FloatTensor of shape (F, D), where
+// `inputs_packed[first_idx[i]:] = inputs_padded[i, :]`.
+//
+//
+
+// Cpu implementation.
+at::Tensor PackedToPaddedCpu(
+ const at::Tensor inputs_packed,
+ const at::Tensor first_idxs,
+ const int64_t max_size);
+
+// Cpu implementation.
+at::Tensor PaddedToPackedCpu(
+ const at::Tensor inputs_padded,
+ const at::Tensor first_idxs,
+ const int64_t num_inputs);
+
+#ifdef WITH_CUDA
+// Cuda implementation.
+at::Tensor PackedToPaddedCuda(
+ const at::Tensor inputs_packed,
+ const at::Tensor first_idxs,
+ const int64_t max_size);
+
+// Cuda implementation.
+at::Tensor PaddedToPackedCuda(
+ const at::Tensor inputs_padded,
+ const at::Tensor first_idxs,
+ const int64_t num_inputs);
+#endif
+
+// Implementation which is exposed.
+at::Tensor PackedToPadded(
+ const at::Tensor inputs_packed,
+ const at::Tensor first_idxs,
+ const int64_t max_size) {
+ if (inputs_packed.is_cuda()) {
+#ifdef WITH_CUDA
+ CHECK_CUDA(inputs_packed);
+ CHECK_CUDA(first_idxs);
+ return PackedToPaddedCuda(inputs_packed, first_idxs, max_size);
+#else
+ AT_ERROR("Not compiled with GPU support.");
+#endif
+ }
+ return PackedToPaddedCpu(inputs_packed, first_idxs, max_size);
+}
+
+// Implementation which is exposed.
+at::Tensor PaddedToPacked(
+ const at::Tensor inputs_padded,
+ const at::Tensor first_idxs,
+ const int64_t num_inputs) {
+ if (inputs_padded.is_cuda()) {
+#ifdef WITH_CUDA
+ CHECK_CUDA(inputs_padded);
+ CHECK_CUDA(first_idxs);
+ return PaddedToPackedCuda(inputs_padded, first_idxs, num_inputs);
+#else
+ AT_ERROR("Not compiled with GPU support.");
+#endif
+ }
+ return PaddedToPackedCpu(inputs_padded, first_idxs, num_inputs);
+}
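A quick orientation for this vendored header: the sketch below is illustrative only and is not part of PyTorch3D; it rebuilds the same layout transform with plain ATen ops on toy sizes, assuming libtorch is available. `PackedToPadded` restores the batch dimension from a packed tensor plus `first_idxs`; `PaddedToPacked` inverts it given `num_inputs`.

```cpp
#include <iostream>
#include <torch/torch.h>

// Toy batch of N=2 elements with 2 and 3 rows, stored packed as (F=5, D=2).
// first_idxs = [0, 2] marks where each element's rows begin in the packed tensor.
// PackedToPadded(packed, first_idxs, /*max_size=*/3) would return the (2, 3, 2)
// zero-padded tensor assembled below; PaddedToPacked(padded, first_idxs, 5)
// would recover `packed`.
int main() {
  auto packed = torch::arange(10, torch::kFloat).reshape({5, 2});
  auto first_idxs = torch::tensor({0, 2}, torch::kLong);

  auto padded = torch::zeros({2, 3, 2});
  padded[0].narrow(0, 0, 2).copy_(packed.narrow(0, 0, 2));  // element 0: rows 0..1
  padded[1].narrow(0, 0, 3).copy_(packed.narrow(0, 2, 3));  // element 1: rows 2..4
  std::cout << padded << std::endl;
  return 0;
}
```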
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor_cpu.cpp b/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor_cpu.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..34a002a8ed582a73c759b1ced2ca05d1507cb3f3
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/packed_to_padded_tensor/packed_to_padded_tensor_cpu.cpp
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#include <torch/extension.h>
+
+at::Tensor PackedToPaddedCpu(
+ const at::Tensor inputs_packed,
+ const at::Tensor first_idxs,
+ const int64_t max_size) {
+ const int64_t num_inputs = inputs_packed.size(0);
+ const int64_t batch_size = first_idxs.size(0);
+
+ AT_ASSERTM(
+ inputs_packed.dim() == 2, "inputs_packed must be a 2-dimensional tensor");
+ const int64_t D = inputs_packed.size(1);
+
+ torch::Tensor inputs_padded =
+ torch::zeros({batch_size, max_size, D}, inputs_packed.options());
+
+ auto inputs_packed_a = inputs_packed.accessor<float, 2>();
+ auto first_idxs_a = first_idxs.accessor<int64_t, 1>();
+ auto inputs_padded_a = inputs_padded.accessor<float, 3>();
+
+ for (int b = 0; b < batch_size; ++b) {
+ const int64_t start = first_idxs_a[b];
+ const int64_t end = b + 1 < batch_size ? first_idxs_a[b + 1] : num_inputs;
+ const int64_t num = end - start;
+ for (int i = 0; i < num; ++i) {
+ for (int j = 0; j < D; ++j) {
+ inputs_padded_a[b][i][j] = inputs_packed_a[start + i][j];
+ }
+ }
+ }
+ return inputs_padded;
+}
+
+at::Tensor PaddedToPackedCpu(
+ const at::Tensor inputs_padded,
+ const at::Tensor first_idxs,
+ const int64_t num_inputs) {
+ const int64_t batch_size = inputs_padded.size(0);
+
+ AT_ASSERTM(
+ inputs_padded.dim() == 3, "inputs_padded must be a 3-dimensional tensor");
+ const int64_t D = inputs_padded.size(2);
+
+ torch::Tensor inputs_packed =
+ torch::zeros({num_inputs, D}, inputs_padded.options());
+
+ auto inputs_padded_a = inputs_padded.accessor<float, 3>();
+ auto first_idxs_a = first_idxs.accessor<int64_t, 1>();
+ auto inputs_packed_a = inputs_packed.accessor<float, 2>();
+
+ for (int b = 0; b < batch_size; ++b) {
+ const int64_t start = first_idxs_a[b];
+ const int64_t end = b + 1 < batch_size ? first_idxs_a[b + 1] : num_inputs;
+ const int64_t num = end - start;
+ for (int i = 0; i < num; ++i) {
+ for (int j = 0; j < D; ++j) {
+ inputs_packed_a[start + i][j] = inputs_padded_a[b][i][j];
+ }
+ }
+ }
+ return inputs_packed;
+}
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/utils/dispatch.cuh b/data/dot_single_video/dot/utils/torch3d/csrc/utils/dispatch.cuh
new file mode 100644
index 0000000000000000000000000000000000000000..eff9521630a2298c3f3d29b3b07adcdc2d44db8a
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/utils/dispatch.cuh
@@ -0,0 +1,357 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+// This file provides utilities for dispatching to specialized versions of
+// functions. This is especially useful for CUDA kernels, since specializing
+// them to particular input sizes can often allow the compiler to unroll loops
+// and place arrays into registers, which can give huge performance speedups.
+//
+// As an example, suppose we have the following function which is specialized
+// based on a compile-time int64_t value:
+//
+// template <typename T, int64_t x>
+// struct SquareOffset {
+// static void run(T y) {
+// T val = x * x + y;
+// std::cout << val << std::endl;
+// }
+// }
+//
+// This function takes one compile-time argument x, and one run-time argument y.
+// We might want to compile specialized versions of this for x=0, x=1, etc and
+// then dispatch to the correct one based on the runtime value of x.
+// One simple way to achieve this is with a lookup table:
+//
+// template <typename T>
+// void DispatchSquareOffset(const int64_t x, T y) {
+// if (x == 0) {
+// SquareOffset<T, 0>::run(y);
+// } else if (x == 1) {
+// SquareOffset<T, 1>::run(y);
+// } else if (x == 2) {
+// SquareOffset<T, 2>::run(y);
+// }
+// }
+//
+// This function takes both x and y as run-time arguments, and dispatches to
+// different specialized versions of SquareOffset based on the run-time value
+// of x. This works, but it's tedious and error-prone. If we want to change the
+// set of x values for which we provide compile-time specializations, then we
+// will need to do a lot of tedious editing of the dispatch function. Also, if we
+// want to provide compile-time specializations for a function other than
+// SquareOffset, we will need to duplicate the entire lookup table.
+//
+// To solve these problems, we can use the DispatchKernel1D function provided by
+// this file instead:
+//
+// template <typename T>
+// void DispatchSquareOffset(const int64_t x, T y) {
+// constexpr int64_t xmin = 0;
+// constexpr int64_t xmax = 2;
+// DispatchKernel1D<SquareOffset, T, xmin, xmax>(x, y);
+// }
+//
+// DispatchKernel1D uses template metaprogramming to compile specialized
+// versions of SquareOffset for all values of x with xmin <= x <= xmax, and
+// then dispatches to the correct one based on the run-time value of x. If we
+// want to change the range of x values for which SquareOffset is specialized
+// at compile-time, then all we have to do is change the values of the
+// compile-time constants xmin and xmax.
+//
+// This file also allows us to similarly dispatch functions that depend on two
+// compile-time int64_t values, using the DispatchKernel2D function like this:
+//
+// template <typename T, int64_t x, int64_t y>
+// struct Sum {
+// static void run(T z, T w) {
+// T val = x + y + z + w;
+// std::cout << val << std::endl;
+// }
+// }
+//
+// template <typename T>
+// void DispatchSum(const int64_t x, const int64_t y, int z, int w) {
+// constexpr int64_t xmin = 1;
+// constexpr int64_t xmax = 3;
+// constexpr int64_t ymin = 2;
+// constexpr int64_t ymax = 5;
+// DispatchKernel2D<Sum, T, xmin, xmax, ymin, ymax>(x, y, z, w);
+// }
+//
+// Like its 1D counterpart, DispatchKernel2D uses template metaprogramming to
+// compile specialized versions of sum for all values of (x, y) with
+// xmin <= x <= xmax and ymin <= y <= ymax, then dispatches to the correct
+// specialized version based on the runtime values of x and y.
+
+// Define some helper structs in an anonymous namespace.
+namespace {
+
+// 1D dispatch: general case.
+// Kernel is the function we want to dispatch to; it should take a typename and
+// an int64_t as template args, and it should define a static void function
+// run which takes any number of arguments of any type.
+// In order to dispatch, we will take an additional template argument curN,
+// and increment it via template recursion until it is equal to the run-time
+// argument N.
+template <
+ template <typename, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ int64_t curN,
+ typename... Args>
+struct DispatchKernelHelper1D {
+ static void run(const int64_t N, Args... args) {
+ if (curN == N) {
+ // The compile-time value curN is equal to the run-time value N, so we
+ // can dispatch to the run method of the Kernel.
+ Kernel<T, curN>::run(args...);
+ } else if (curN < N) {
+ // Increment curN via template recursion
+ DispatchKernelHelper1D<Kernel, T, minN, maxN, curN + 1, Args...>::run(
+ N, args...);
+ }
+ // We shouldn't get here -- throw an error?
+ }
+};
+
+// 1D dispatch: Specialization when curN == maxN
+// We need this base case to avoid infinite template recursion.
+template <
+ template <typename, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ typename... Args>
+struct DispatchKernelHelper1D<Kernel, T, minN, maxN, maxN, Args...> {
+ static void run(const int64_t N, Args... args) {
+ if (N == maxN) {
+ Kernel<T, maxN>::run(args...);
+ }
+ // We shouldn't get here -- throw an error?
+ }
+};
+
+// 2D dispatch, general case.
+// This is similar to the 1D case: we take additional template args curN and
+// curM, and increment them via template recursion until they are equal to
+// the run-time values of N and M, at which point we dispatch to the run
+// method of the kernel.
+template <
+ template <typename, int64_t, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ int64_t curN,
+ int64_t minM,
+ int64_t maxM,
+ int64_t curM,
+ typename... Args>
+struct DispatchKernelHelper2D {
+ static void run(const int64_t N, const int64_t M, Args... args) {
+ if (curN == N && curM == M) {
+ Kernel<T, curN, curM>::run(args...);
+ } else if (curN < N && curM < M) {
+ // Increment both curN and curM. This isn't strictly necessary; we could
+ // just increment one or the other at each step. But this helps to cut
+ // on the number of recursive calls we make.
+ DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ curN + 1,
+ minM,
+ maxM,
+ curM + 1,
+ Args...>::run(N, M, args...);
+ } else if (curN < N) {
+ // Increment curN only
+ DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ curN + 1,
+ minM,
+ maxM,
+ curM,
+ Args...>::run(N, M, args...);
+ } else if (curM < M) {
+ // Increment curM only
+ DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ curN,
+ minM,
+ maxM,
+ curM + 1,
+ Args...>::run(N, M, args...);
+ }
+ }
+};
+
+// 2D dispatch, specialization for curN == maxN
+template <
+ template <typename, int64_t, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ int64_t minM,
+ int64_t maxM,
+ int64_t curM,
+ typename... Args>
+struct DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ maxN,
+ minM,
+ maxM,
+ curM,
+ Args...> {
+ static void run(const int64_t N, const int64_t M, Args... args) {
+ if (maxN == N && curM == M) {
+ Kernel<T, maxN, curM>::run(args...);
+ } else if (curM < maxM) {
+ DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ maxN,
+ minM,
+ maxM,
+ curM + 1,
+ Args...>::run(N, M, args...);
+ }
+ // We should not get here -- throw an error?
+ }
+};
+
+// 2D dispatch, specialization for curM == maxM
+template <
+ template <typename, int64_t, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ int64_t curN,
+ int64_t minM,
+ int64_t maxM,
+ typename... Args>
+struct DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ curN,
+ minM,
+ maxM,
+ maxM,
+ Args...> {
+ static void run(const int64_t N, const int64_t M, Args... args) {
+ if (curN == N && maxM == M) {
+ Kernel<T, curN, maxM>::run(args...);
+ } else if (curN < maxN) {
+ DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ curN + 1,
+ minM,
+ maxM,
+ maxM,
+ Args...>::run(N, M, args...);
+ }
+ // We should not get here -- throw an error?
+ }
+};
+
+// 2D dispatch, specialization for curN == maxN, curM == maxM
+template <
+ template <typename, int64_t, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ int64_t minM,
+ int64_t maxM,
+ typename... Args>
+struct DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ maxN,
+ minM,
+ maxM,
+ maxM,
+ Args...> {
+ static void run(const int64_t N, const int64_t M, Args... args) {
+ if (maxN == N && maxM == M) {
+ Kernel<T, maxN, maxM>::run(args...);
+ }
+ // We should not get here -- throw an error?
+ }
+};
+
+} // namespace
+
+// This is the function we expect users to call to dispatch to 1D functions
+template <
+ template <typename, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ typename... Args>
+void DispatchKernel1D(const int64_t N, Args... args) {
+ if (minN <= N && N <= maxN) {
+ // Kick off the template recursion by calling the Helper with curN = minN
+ DispatchKernelHelper1D<Kernel, T, minN, maxN, minN, Args...>::run(
+ N, args...);
+ }
+ // Maybe throw an error if we tried to dispatch outside the allowed range?
+}
+
+// This is the function we expect users to call to dispatch to 2D functions
+template <
+ template <typename, int64_t, int64_t>
+ class Kernel,
+ typename T,
+ int64_t minN,
+ int64_t maxN,
+ int64_t minM,
+ int64_t maxM,
+ typename... Args>
+void DispatchKernel2D(const int64_t N, const int64_t M, Args... args) {
+ if (minN <= N && N <= maxN && minM <= M && M <= maxM) {
+ // Kick off the template recursion by calling the Helper with curN = minN
+ // and curM = minM
+ DispatchKernelHelper2D<
+ Kernel,
+ T,
+ minN,
+ maxN,
+ minN,
+ minM,
+ maxM,
+ minM,
+ Args...>::run(N, M, args...);
+ }
+ // Maybe throw an error if we tried to dispatch outside the specified range?
+}
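Because this header contains only host-side template metaprogramming (no device constructs), it can be exercised from plain C++. The sketch below is an illustration that mirrors the SquareOffset example from the comment at the top of the file; the include path and file name are assumptions.

```cpp
// toy_dispatch.cpp -- illustrative use of DispatchKernel1D.
#include <cstdint>
#include <iostream>
#include "dispatch.cuh"  // assumed to be on the include path

// Functor specialized on a compile-time x, as in the example comment above.
template <typename T, int64_t x>
struct SquareOffset {
  static void run(T y) {
    std::cout << x * x + y << std::endl;  // x is a compile-time constant here
  }
};

template <typename T>
void DispatchSquareOffset(const int64_t x, T y) {
  constexpr int64_t xmin = 0;
  constexpr int64_t xmax = 4;
  // Instantiates SquareOffset<T, 0> ... SquareOffset<T, 4> and picks one at runtime.
  DispatchKernel1D<SquareOffset, T, xmin, xmax>(x, y);
}

int main() {
  DispatchSquareOffset<float>(3, 0.5f);  // prints 9.5
  return 0;
}
```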
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/utils/float_math.cuh b/data/dot_single_video/dot/utils/torch3d/csrc/utils/float_math.cuh
new file mode 100644
index 0000000000000000000000000000000000000000..9678eee8a2d544c4078933803416c90c895e071b
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/utils/float_math.cuh
@@ -0,0 +1,153 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#pragma once
+#include <thrust/tuple.h>
+
+// Set epsilon
+#ifdef _MSC_VER
+#define vEpsilon 1e-8f
+#else
+const auto vEpsilon = 1e-8;
+#endif
+
+// Common functions and operators for float2.
+
+__device__ inline float2 operator-(const float2& a, const float2& b) {
+ return make_float2(a.x - b.x, a.y - b.y);
+}
+
+__device__ inline float2 operator+(const float2& a, const float2& b) {
+ return make_float2(a.x + b.x, a.y + b.y);
+}
+
+__device__ inline float2 operator/(const float2& a, const float2& b) {
+ return make_float2(a.x / b.x, a.y / b.y);
+}
+
+__device__ inline float2 operator/(const float2& a, const float b) {
+ return make_float2(a.x / b, a.y / b);
+}
+
+__device__ inline float2 operator*(const float2& a, const float2& b) {
+ return make_float2(a.x * b.x, a.y * b.y);
+}
+
+__device__ inline float2 operator*(const float a, const float2& b) {
+ return make_float2(a * b.x, a * b.y);
+}
+
+__device__ inline float FloatMin3(const float a, const float b, const float c) {
+ return fminf(a, fminf(b, c));
+}
+
+__device__ inline float FloatMax3(const float a, const float b, const float c) {
+ return fmaxf(a, fmaxf(b, c));
+}
+
+__device__ inline float dot(const float2& a, const float2& b) {
+ return a.x * b.x + a.y * b.y;
+}
+
+// Backward pass for the dot product.
+// Args:
+// a, b: Coordinates of two points.
+// grad_dot: Upstream gradient for the output.
+//
+// Returns:
+// tuple of gradients for each of the input points:
+// (float2 grad_a, float2 grad_b)
+//
+__device__ inline thrust::tuple<float2, float2>
+DotBackward(const float2& a, const float2& b, const float& grad_dot) {
+ return thrust::make_tuple(grad_dot * b, grad_dot * a);
+}
+
+__device__ inline float sum(const float2& a) {
+ return a.x + a.y;
+}
+
+// Common functions and operators for float3.
+
+__device__ inline float3 operator-(const float3& a, const float3& b) {
+ return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);
+}
+
+__device__ inline float3 operator+(const float3& a, const float3& b) {
+ return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
+}
+
+__device__ inline float3 operator/(const float3& a, const float3& b) {
+ return make_float3(a.x / b.x, a.y / b.y, a.z / b.z);
+}
+
+__device__ inline float3 operator/(const float3& a, const float b) {
+ return make_float3(a.x / b, a.y / b, a.z / b);
+}
+
+__device__ inline float3 operator*(const float3& a, const float3& b) {
+ return make_float3(a.x * b.x, a.y * b.y, a.z * b.z);
+}
+
+__device__ inline float3 operator*(const float a, const float3& b) {
+ return make_float3(a * b.x, a * b.y, a * b.z);
+}
+
+__device__ inline float dot(const float3& a, const float3& b) {
+ return a.x * b.x + a.y * b.y + a.z * b.z;
+}
+
+__device__ inline float sum(const float3& a) {
+ return a.x + a.y + a.z;
+}
+
+__device__ inline float3 cross(const float3& a, const float3& b) {
+ return make_float3(
+ a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x);
+}
+
+__device__ inline thrust::tuple<float3, float3>
+cross_backward(const float3& a, const float3& b, const float3& grad_cross) {
+ const float grad_ax = -grad_cross.y * b.z + grad_cross.z * b.y;
+ const float grad_ay = grad_cross.x * b.z - grad_cross.z * b.x;
+ const float grad_az = -grad_cross.x * b.y + grad_cross.y * b.x;
+ const float3 grad_a = make_float3(grad_ax, grad_ay, grad_az);
+
+ const float grad_bx = grad_cross.y * a.z - grad_cross.z * a.y;
+ const float grad_by = -grad_cross.x * a.z + grad_cross.z * a.x;
+ const float grad_bz = grad_cross.x * a.y - grad_cross.y * a.x;
+ const float3 grad_b = make_float3(grad_bx, grad_by, grad_bz);
+
+ return thrust::make_tuple(grad_a, grad_b);
+}
+
+__device__ inline float norm(const float3& a) {
+ return sqrt(dot(a, a));
+}
+
+__device__ inline float3 normalize(const float3& a) {
+ return a / (norm(a) + vEpsilon);
+}
+
+__device__ inline float3 normalize_backward(
+ const float3& a,
+ const float3& grad_normz) {
+ const float a_norm = norm(a) + vEpsilon;
+ const float3 out = a / a_norm;
+
+ const float grad_ax = grad_normz.x * (1.0f - out.x * out.x) / a_norm +
+ grad_normz.y * (-out.x * out.y) / a_norm +
+ grad_normz.z * (-out.x * out.z) / a_norm;
+ const float grad_ay = grad_normz.x * (-out.x * out.y) / a_norm +
+ grad_normz.y * (1.0f - out.y * out.y) / a_norm +
+ grad_normz.z * (-out.y * out.z) / a_norm;
+ const float grad_az = grad_normz.x * (-out.x * out.z) / a_norm +
+ grad_normz.y * (-out.y * out.z) / a_norm +
+ grad_normz.z * (1.0f - out.z * out.z) / a_norm;
+ return make_float3(grad_ax, grad_ay, grad_az);
+}
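As a reading aid: `normalize_backward` above is the vector-Jacobian product of the normalization $\hat{a} = a / (\lVert a\rVert + \epsilon)$. Written out (a sketch of the math the code implements, with $g$ the upstream gradient `grad_normz` and the epsilon absorbed into the norm term):

```latex
\frac{\partial \hat{a}_i}{\partial a_j} = \frac{\delta_{ij} - \hat{a}_i \hat{a}_j}{\lVert a \rVert + \epsilon},
\qquad
\nabla_a L = \frac{\bigl(I - \hat{a}\,\hat{a}^{\top}\bigr)\, g}{\lVert a \rVert + \epsilon},
```

which matches the three `grad_a*` expressions term by term.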
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/utils/geometry_utils.cuh b/data/dot_single_video/dot/utils/torch3d/csrc/utils/geometry_utils.cuh
new file mode 100644
index 0000000000000000000000000000000000000000..940dbb2c60a3a1c36b8620d86b540c32bc137537
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/utils/geometry_utils.cuh
@@ -0,0 +1,792 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#include <float.h>
+#include <math.h>
+#include <thrust/tuple.h>
+#include "float_math.cuh"
+
+// Set epsilon for preventing floating point errors and division by 0.
+#ifdef _MSC_VER
+#define kEpsilon 1e-8f
+#else
+const auto kEpsilon = 1e-8;
+#endif
+
+// ************************************************************* //
+// vec2 utils //
+// ************************************************************* //
+
+// Determines whether a point p is on the right side of a 2D line segment
+// given by the end points v0, v1.
+//
+// Args:
+// p: vec2 Coordinates of a point.
+// v0, v1: vec2 Coordinates of the end points of the edge.
+//
+// Returns:
+// area: The signed area of the parallelogram given by the vectors
+// A = p - v0
+// B = v1 - v0
+//
+__device__ inline float
+EdgeFunctionForward(const float2& p, const float2& v0, const float2& v1) {
+ return (p.x - v0.x) * (v1.y - v0.y) - (p.y - v0.y) * (v1.x - v0.x);
+}
+
+// Backward pass for the edge function returning partial derivatives for each
+// of the input points.
+//
+// Args:
+// p: vec2 Coordinates of a point.
+// v0, v1: vec2 Coordinates of the end points of the edge.
+// grad_edge: Upstream gradient for output from edge function.
+//
+// Returns:
+// tuple of gradients for each of the input points:
+// (float2 d_edge_dp, float2 d_edge_dv0, float2 d_edge_dv1)
+//
+__device__ inline thrust::tuple<float2, float2, float2> EdgeFunctionBackward(
+ const float2& p,
+ const float2& v0,
+ const float2& v1,
+ const float& grad_edge) {
+ const float2 dedge_dp = make_float2(v1.y - v0.y, v0.x - v1.x);
+ const float2 dedge_dv0 = make_float2(p.y - v1.y, v1.x - p.x);
+ const float2 dedge_dv1 = make_float2(v0.y - p.y, p.x - v0.x);
+ return thrust::make_tuple(
+ grad_edge * dedge_dp, grad_edge * dedge_dv0, grad_edge * dedge_dv1);
+}
+
+// The forward pass for computing the barycentric coordinates of a point
+// relative to a triangle.
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1, v2: Coordinates of the triangle vertices.
+//
+// Returns
+// bary: (w0, w1, w2) barycentric coordinates in the range [0, 1].
+//
+__device__ inline float3 BarycentricCoordsForward(
+ const float2& p,
+ const float2& v0,
+ const float2& v1,
+ const float2& v2) {
+ const float area = EdgeFunctionForward(v2, v0, v1) + kEpsilon;
+ const float w0 = EdgeFunctionForward(p, v1, v2) / area;
+ const float w1 = EdgeFunctionForward(p, v2, v0) / area;
+ const float w2 = EdgeFunctionForward(p, v0, v1) / area;
+ return make_float3(w0, w1, w2);
+}
+
+// The backward pass for computing the barycentric coordinates of a point
+// relative to a triangle.
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1, v2: (x, y) coordinates of the triangle vertices.
+// grad_bary_upstream: vec3 Upstream gradient for each of the
+// barycentric coordinates [grad_w0, grad_w1, grad_w2].
+//
+// Returns
+// tuple of gradients for each of the triangle vertices:
+// (float2 grad_v0, float2 grad_v1, float2 grad_v2)
+//
+__device__ inline thrust::tuple<float2, float2, float2, float2>
+BarycentricCoordsBackward(
+ const float2& p,
+ const float2& v0,
+ const float2& v1,
+ const float2& v2,
+ const float3& grad_bary_upstream) {
+ const float area = EdgeFunctionForward(v2, v0, v1) + kEpsilon;
+ const float area2 = pow(area, 2.0f);
+ const float e0 = EdgeFunctionForward(p, v1, v2);
+ const float e1 = EdgeFunctionForward(p, v2, v0);
+ const float e2 = EdgeFunctionForward(p, v0, v1);
+
+ const float grad_w0 = grad_bary_upstream.x;
+ const float grad_w1 = grad_bary_upstream.y;
+ const float grad_w2 = grad_bary_upstream.z;
+
+ // Calculate component of the gradient from each of w0, w1 and w2.
+ // e.g. for w0:
+ // dloss/dw0_v = dl/dw0 * dw0/dw0_top * dw0_top/dv
+ // + dl/dw0 * dw0/dw0_bot * dw0_bot/dv
+ const float dw0_darea = -e0 / (area2);
+ const float dw0_e0 = 1 / area;
+ const float dloss_d_w0area = grad_w0 * dw0_darea;
+ const float dloss_e0 = grad_w0 * dw0_e0;
+ auto de0_dv = EdgeFunctionBackward(p, v1, v2, dloss_e0);
+ auto dw0area_dv = EdgeFunctionBackward(v2, v0, v1, dloss_d_w0area);
+ const float2 dw0_p = thrust::get<0>(de0_dv);
+ const float2 dw0_dv0 = thrust::get<1>(dw0area_dv);
+ const float2 dw0_dv1 = thrust::get<1>(de0_dv) + thrust::get<2>(dw0area_dv);
+ const float2 dw0_dv2 = thrust::get<2>(de0_dv) + thrust::get<0>(dw0area_dv);
+
+ const float dw1_darea = -e1 / (area2);
+ const float dw1_e1 = 1 / area;
+ const float dloss_d_w1area = grad_w1 * dw1_darea;
+ const float dloss_e1 = grad_w1 * dw1_e1;
+ auto de1_dv = EdgeFunctionBackward(p, v2, v0, dloss_e1);
+ auto dw1area_dv = EdgeFunctionBackward(v2, v0, v1, dloss_d_w1area);
+ const float2 dw1_p = thrust::get<0>(de1_dv);
+ const float2 dw1_dv0 = thrust::get<2>(de1_dv) + thrust::get<1>(dw1area_dv);
+ const float2 dw1_dv1 = thrust::get<2>(dw1area_dv);
+ const float2 dw1_dv2 = thrust::get<1>(de1_dv) + thrust::get<0>(dw1area_dv);
+
+ const float dw2_darea = -e2 / (area2);
+ const float dw2_e2 = 1 / area;
+ const float dloss_d_w2area = grad_w2 * dw2_darea;
+ const float dloss_e2 = grad_w2 * dw2_e2;
+ auto de2_dv = EdgeFunctionBackward(p, v0, v1, dloss_e2);
+ auto dw2area_dv = EdgeFunctionBackward(v2, v0, v1, dloss_d_w2area);
+ const float2 dw2_p = thrust::get<0>(de2_dv);
+ const float2 dw2_dv0 = thrust::get<1>(de2_dv) + thrust::get<1>(dw2area_dv);
+ const float2 dw2_dv1 = thrust::get<2>(de2_dv) + thrust::get<2>(dw2area_dv);
+ const float2 dw2_dv2 = thrust::get<0>(dw2area_dv);
+
+ const float2 dbary_p = dw0_p + dw1_p + dw2_p;
+ const float2 dbary_dv0 = dw0_dv0 + dw1_dv0 + dw2_dv0;
+ const float2 dbary_dv1 = dw0_dv1 + dw1_dv1 + dw2_dv1;
+ const float2 dbary_dv2 = dw0_dv2 + dw1_dv2 + dw2_dv2;
+
+ return thrust::make_tuple(dbary_p, dbary_dv0, dbary_dv1, dbary_dv2);
+}
+
+// Forward pass for applying perspective correction to barycentric coordinates.
+//
+// Args:
+// bary: Screen-space barycentric coordinates for a point
+// z0, z1, z2: Camera-space z-coordinates of the triangle vertices
+//
+// Returns
+// World-space barycentric coordinates
+//
+__device__ inline float3 BarycentricPerspectiveCorrectionForward(
+ const float3& bary,
+ const float z0,
+ const float z1,
+ const float z2) {
+ const float w0_top = bary.x * z1 * z2;
+ const float w1_top = z0 * bary.y * z2;
+ const float w2_top = z0 * z1 * bary.z;
+ const float denom = fmaxf(w0_top + w1_top + w2_top, kEpsilon);
+ const float w0 = w0_top / denom;
+ const float w1 = w1_top / denom;
+ const float w2 = w2_top / denom;
+ return make_float3(w0, w1, w2);
+}
+
+// Backward pass for applying perspective correction to barycentric coordinates.
+//
+// Args:
+// bary: Screen-space barycentric coordinates for a point
+// z0, z1, z2: Camera-space z-coordinates of the triangle vertices
+// grad_out: Upstream gradient of the loss with respect to the corrected
+// barycentric coordinates.
+//
+// Returns a tuple of:
+// grad_bary: Downstream gradient of the loss with respect to the
+// uncorrected barycentric coordinates.
+// grad_z0, grad_z1, grad_z2: Downstream gradient of the loss with respect
+// to the z-coordinates of the triangle verts
+__device__ inline thrust::tuple<float3, float, float, float>
+BarycentricPerspectiveCorrectionBackward(
+ const float3& bary,
+ const float z0,
+ const float z1,
+ const float z2,
+ const float3& grad_out) {
+ // Recompute forward pass
+ const float w0_top = bary.x * z1 * z2;
+ const float w1_top = z0 * bary.y * z2;
+ const float w2_top = z0 * z1 * bary.z;
+ const float denom = fmaxf(w0_top + w1_top + w2_top, kEpsilon);
+
+ // Now do backward pass
+ const float grad_denom_top =
+ -w0_top * grad_out.x - w1_top * grad_out.y - w2_top * grad_out.z;
+ const float grad_denom = grad_denom_top / (denom * denom);
+ const float grad_w0_top = grad_denom + grad_out.x / denom;
+ const float grad_w1_top = grad_denom + grad_out.y / denom;
+ const float grad_w2_top = grad_denom + grad_out.z / denom;
+ const float grad_bary_x = grad_w0_top * z1 * z2;
+ const float grad_bary_y = grad_w1_top * z0 * z2;
+ const float grad_bary_z = grad_w2_top * z0 * z1;
+ const float3 grad_bary = make_float3(grad_bary_x, grad_bary_y, grad_bary_z);
+ const float grad_z0 = grad_w1_top * bary.y * z2 + grad_w2_top * bary.z * z1;
+ const float grad_z1 = grad_w0_top * bary.x * z2 + grad_w2_top * bary.z * z0;
+ const float grad_z2 = grad_w0_top * bary.x * z1 + grad_w1_top * bary.y * z0;
+ return thrust::make_tuple(grad_bary, grad_z0, grad_z1, grad_z2);
+}
+
+// Clip negative barycentric coordinates to 0.0 and renormalize so
+// the barycentric coordinates for a point sum to 1. When the blur_radius
+// is greater than 0, a face will still be recorded as overlapping a pixel
+// if the pixel is outside the face. In this case at least one of the
+// barycentric coordinates for the pixel relative to the face will be negative.
+// Clipping will ensure that the texture and z buffer are interpolated
+// correctly.
+//
+// Args
+// bary: (w0, w1, w2) barycentric coordinates which can be outside the
+// range [0, 1].
+//
+// Returns
+// bary: (w0, w1, w2) barycentric coordinates in the range [0, 1] which
+// satisfy the condition: sum(w0, w1, w2) = 1.0.
+//
+__device__ inline float3 BarycentricClipForward(const float3 bary) {
+ float3 w = make_float3(0.0f, 0.0f, 0.0f);
+ // Clamp lower bound only
+ w.x = max(bary.x, 0.0);
+ w.y = max(bary.y, 0.0);
+ w.z = max(bary.z, 0.0);
+ float w_sum = w.x + w.y + w.z;
+ w_sum = fmaxf(w_sum, 1e-5);
+ w.x /= w_sum;
+ w.y /= w_sum;
+ w.z /= w_sum;
+
+ return w;
+}
+
+// Backward pass for barycentric coordinate clipping.
+//
+// Args
+// bary: (w0, w1, w2) barycentric coordinates which can be outside the
+// range [0, 1].
+// grad_baryclip_upstream: vec3 Upstream gradient for each of the clipped
+// barycentric coordinates [grad_w0, grad_w1, grad_w2].
+//
+// Returns
+// vec3 of gradients for the unclipped barycentric coordinates:
+// (grad_w0, grad_w1, grad_w2)
+//
+__device__ inline float3 BarycentricClipBackward(
+ const float3 bary,
+ const float3 grad_baryclip_upstream) {
+ // Redo some of the forward pass calculations
+ float3 w = make_float3(0.0f, 0.0f, 0.0f);
+ // Clamp lower bound only
+ w.x = max(bary.x, 0.0);
+ w.y = max(bary.y, 0.0);
+ w.z = max(bary.z, 0.0);
+ float w_sum = w.x + w.y + w.z;
+
+ float3 grad_bary = make_float3(1.0f, 1.0f, 1.0f);
+ float3 grad_clip = make_float3(1.0f, 1.0f, 1.0f);
+ float3 grad_sum = make_float3(1.0f, 1.0f, 1.0f);
+
+ // Check if sum was clipped.
+ float grad_sum_clip = 1.0f;
+ if (w_sum < 1e-5) {
+ grad_sum_clip = 0.0f;
+ w_sum = 1e-5;
+ }
+
+ // Check if any of bary values have been clipped.
+ if (bary.x < 0.0f) {
+ grad_clip.x = 0.0f;
+ }
+ if (bary.y < 0.0f) {
+ grad_clip.y = 0.0f;
+ }
+ if (bary.z < 0.0f) {
+ grad_clip.z = 0.0f;
+ }
+
+ // Gradients of the sum.
+ grad_sum.x = -w.x / (pow(w_sum, 2.0f)) * grad_sum_clip;
+ grad_sum.y = -w.y / (pow(w_sum, 2.0f)) * grad_sum_clip;
+ grad_sum.z = -w.z / (pow(w_sum, 2.0f)) * grad_sum_clip;
+
+ // Gradients for each of the bary coordinates including the cross terms
+ // from the sum.
+ grad_bary.x = grad_clip.x *
+ (grad_baryclip_upstream.x * (1.0f / w_sum + grad_sum.x) +
+ grad_baryclip_upstream.y * (grad_sum.y) +
+ grad_baryclip_upstream.z * (grad_sum.z));
+
+ grad_bary.y = grad_clip.y *
+ (grad_baryclip_upstream.y * (1.0f / w_sum + grad_sum.y) +
+ grad_baryclip_upstream.x * (grad_sum.x) +
+ grad_baryclip_upstream.z * (grad_sum.z));
+
+ grad_bary.z = grad_clip.z *
+ (grad_baryclip_upstream.z * (1.0f / w_sum + grad_sum.z) +
+ grad_baryclip_upstream.x * (grad_sum.x) +
+ grad_baryclip_upstream.y * (grad_sum.y));
+
+ return grad_bary;
+}
+
+// Return minimum distance between line segment (v1 - v0) and point p.
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1: Coordinates of the end points of the line segment.
+//
+// Returns:
+// squared distance to the boundary of the triangle.
+//
+__device__ inline float
+PointLineDistanceForward(const float2& p, const float2& a, const float2& b) {
+ const float2 ba = b - a;
+ float l2 = dot(ba, ba);
+ float t = dot(ba, p - a) / l2;
+ if (l2 <= kEpsilon) {
+ return dot(p - b, p - b);
+ }
+ t = __saturatef(t); // clamp to the interval [+0.0, 1.0]
+ const float2 p_proj = a + t * ba;
+ const float2 d = (p_proj - p);
+ return dot(d, d); // squared distance
+}
+
+// Backward pass for point to line distance in 2D.
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1: Coordinates of the end points of the line segment.
+// grad_dist: Upstream gradient for the distance.
+//
+// Returns:
+// tuple of gradients for each of the input points:
+// (float2 grad_p, float2 grad_v0, float2 grad_v1)
+//
+__device__ inline thrust::tuple<float2, float2, float2>
+PointLineDistanceBackward(
+ const float2& p,
+ const float2& v0,
+ const float2& v1,
+ const float& grad_dist) {
+ // Redo some of the forward pass calculations.
+ const float2 v1v0 = v1 - v0;
+ const float2 pv0 = p - v0;
+ const float t_bot = dot(v1v0, v1v0);
+ const float t_top = dot(v1v0, pv0);
+ float tt = t_top / t_bot;
+ tt = __saturatef(tt);
+ const float2 p_proj = (1.0f - tt) * v0 + tt * v1;
+ const float2 d = p - p_proj;
+ const float dist = sqrt(dot(d, d));
+
+ const float2 grad_p = -1.0f * grad_dist * 2.0f * (p_proj - p);
+ const float2 grad_v0 = grad_dist * (1.0f - tt) * 2.0f * (p_proj - p);
+ const float2 grad_v1 = grad_dist * tt * 2.0f * (p_proj - p);
+
+ return thrust::make_tuple(grad_p, grad_v0, grad_v1);
+}
+
+// The forward pass for calculating the shortest distance between a point
+// and a triangle.
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1, v2: Coordinates of the three triangle vertices.
+//
+// Returns:
+// shortest squared distance from a point to a triangle.
+//
+__device__ inline float PointTriangleDistanceForward(
+ const float2& p,
+ const float2& v0,
+ const float2& v1,
+ const float2& v2) {
+ // Compute distance to all 3 edges of the triangle and return the min.
+ const float e01_dist = PointLineDistanceForward(p, v0, v1);
+ const float e02_dist = PointLineDistanceForward(p, v0, v2);
+ const float e12_dist = PointLineDistanceForward(p, v1, v2);
+ const float edge_dist = fminf(fminf(e01_dist, e02_dist), e12_dist);
+ return edge_dist;
+}
+
+// Backward pass for point triangle distance.
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1, v2: Coordinates of the three triangle vertices.
+// grad_dist: Upstream gradient for the distance.
+//
+// Returns:
+// tuple of gradients for each of the triangle vertices:
+// (float2 grad_v0, float2 grad_v1, float2 grad_v2)
+//
+__device__ inline thrust::tuple<float2, float2, float2, float2>
+PointTriangleDistanceBackward(
+ const float2& p,
+ const float2& v0,
+ const float2& v1,
+ const float2& v2,
+ const float& grad_dist) {
+ // Compute distance to all 3 edges of the triangle.
+ const float e01_dist = PointLineDistanceForward(p, v0, v1);
+ const float e02_dist = PointLineDistanceForward(p, v0, v2);
+ const float e12_dist = PointLineDistanceForward(p, v1, v2);
+
+ // Initialize output tensors.
+ float2 grad_v0 = make_float2(0.0f, 0.0f);
+ float2 grad_v1 = make_float2(0.0f, 0.0f);
+ float2 grad_v2 = make_float2(0.0f, 0.0f);
+ float2 grad_p = make_float2(0.0f, 0.0f);
+
+ // Find which edge is the closest and return PointLineDistanceBackward for
+ // that edge.
+ if (e01_dist <= e02_dist && e01_dist <= e12_dist) {
+ // Closest edge is v1 - v0.
+ auto grad_e01 = PointLineDistanceBackward(p, v0, v1, grad_dist);
+ grad_p = thrust::get<0>(grad_e01);
+ grad_v0 = thrust::get<1>(grad_e01);
+ grad_v1 = thrust::get<2>(grad_e01);
+ } else if (e02_dist <= e01_dist && e02_dist <= e12_dist) {
+ // Closest edge is v2 - v0.
+ auto grad_e02 = PointLineDistanceBackward(p, v0, v2, grad_dist);
+ grad_p = thrust::get<0>(grad_e02);
+ grad_v0 = thrust::get<1>(grad_e02);
+ grad_v2 = thrust::get<2>(grad_e02);
+ } else if (e12_dist <= e01_dist && e12_dist <= e02_dist) {
+ // Closest edge is v2 - v1.
+ auto grad_e12 = PointLineDistanceBackward(p, v1, v2, grad_dist);
+ grad_p = thrust::get<0>(grad_e12);
+ grad_v1 = thrust::get<1>(grad_e12);
+ grad_v2 = thrust::get<2>(grad_e12);
+ }
+
+ return thrust::make_tuple(grad_p, grad_v0, grad_v1, grad_v2);
+}
+
+// ************************************************************* //
+// vec3 utils //
+// ************************************************************* //
+
+// Computes the area of a triangle (v0, v1, v2).
+//
+// Args:
+// v0, v1, v2: vec3 coordinates of the triangle vertices
+//
+// Returns
+// area: float: The area of the triangle
+//
+__device__ inline float
+AreaOfTriangle(const float3& v0, const float3& v1, const float3& v2) {
+ float3 p0 = v1 - v0;
+ float3 p1 = v2 - v0;
+
+ // compute the magnitude (hypotenuse form) of the cross product (p0 x p1)
+ float dd = hypot(
+ p0.y * p1.z - p0.z * p1.y,
+ hypot(p0.z * p1.x - p0.x * p1.z, p0.x * p1.y - p0.y * p1.x));
+
+ return dd / 2.0;
+}
+
+// Computes the barycentric coordinates of a point p relative
+// to a triangle (v0, v1, v2), i.e. p = w0 * v0 + w1 * v1 + w2 * v2
+// s.t. w0 + w1 + w2 = 1.0
+//
+// NOTE that this function assumes that p lives on the space spanned
+// by (v0, v1, v2).
+// TODO(gkioxari) explicitly check whether p is coplanar with (v0, v1, v2)
+// and throw an error if check fails
+//
+// Args:
+// p: vec3 coordinates of a point
+// v0, v1, v2: vec3 coordinates of the triangle vertices
+//
+// Returns
+// bary: (w0, w1, w2) barycentric coordinates
+//
+__device__ inline float3 BarycentricCoords3Forward(
+ const float3& p,
+ const float3& v0,
+ const float3& v1,
+ const float3& v2) {
+ float3 p0 = v1 - v0;
+ float3 p1 = v2 - v0;
+ float3 p2 = p - v0;
+
+ const float d00 = dot(p0, p0);
+ const float d01 = dot(p0, p1);
+ const float d11 = dot(p1, p1);
+ const float d20 = dot(p2, p0);
+ const float d21 = dot(p2, p1);
+
+ const float denom = d00 * d11 - d01 * d01 + kEpsilon;
+ const float w1 = (d11 * d20 - d01 * d21) / denom;
+ const float w2 = (d00 * d21 - d01 * d20) / denom;
+ const float w0 = 1.0f - w1 - w2;
+
+ return make_float3(w0, w1, w2);
+}
+
+// Checks whether the point p is inside the triangle (v0, v1, v2).
+// A point is inside the triangle, if all barycentric coordinates
+// wrt the triangle are >= 0 & <= 1.
+// If the triangle is degenerate, aka line or point, then return False.
+//
+// NOTE that this function assumes that p lives on the space spanned
+// by (v0, v1, v2).
+// TODO(gkioxari) explicitly check whether p is coplanar with (v0, v1, v2)
+// and throw an error if check fails
+//
+// Args:
+// p: vec3 coordinates of a point
+// v0, v1, v2: vec3 coordinates of the triangle vertices
+// min_triangle_area: triangles less than this size are considered
+// points/lines, IsInsideTriangle returns False
+//
+// Returns:
+// inside: bool indicating whether p is inside the triangle
+//
+__device__ inline bool IsInsideTriangle(
+ const float3& p,
+ const float3& v0,
+ const float3& v1,
+ const float3& v2,
+ const double min_triangle_area) {
+ bool inside;
+ if (AreaOfTriangle(v0, v1, v2) < min_triangle_area) {
+ inside = 0;
+ } else {
+ float3 bary = BarycentricCoords3Forward(p, v0, v1, v2);
+ bool x_in = 0.0f <= bary.x && bary.x <= 1.0f;
+ bool y_in = 0.0f <= bary.y && bary.y <= 1.0f;
+ bool z_in = 0.0f <= bary.z && bary.z <= 1.0f;
+ inside = x_in && y_in && z_in;
+ }
+ return inside;
+}
+
+// Computes the minimum squared Euclidean distance between the point p
+// and the segment spanned by (v0, v1).
+// To find this we parametrize p as: x(t) = v0 + t * (v1 - v0)
+// and find t which minimizes (x(t) - p) ^ 2.
+// Note that p does not need to live in the space spanned by (v0, v1)
+//
+// Args:
+// p: vec3 coordinates of a point
+// v0, v1: vec3 coordinates of start and end of segment
+//
+// Returns:
+// dist: the minimum squared distance of p from segment (v0, v1)
+//
+
+__device__ inline float
+PointLine3DistanceForward(const float3& p, const float3& v0, const float3& v1) {
+ const float3 v1v0 = v1 - v0;
+ const float3 pv0 = p - v0;
+ const float t_bot = dot(v1v0, v1v0);
+ const float t_top = dot(pv0, v1v0);
+ // if t_bot small, then v0 == v1, set tt to 0.
+ float tt = (t_bot < kEpsilon) ? 0.0f : (t_top / t_bot);
+
+ tt = __saturatef(tt); // clamps to [0, 1]
+
+ const float3 p_proj = v0 + tt * v1v0;
+ const float3 diff = p - p_proj;
+ const float dist = dot(diff, diff);
+ return dist;
+}
+
+// Backward function of the minimum squared Euclidean distance between the point
+// p and the line segment (v0, v1).
+//
+// Args:
+// p: vec3 coordinates of a point
+// v0, v1: vec3 coordinates of start and end of segment
+// grad_dist: Float of the gradient wrt dist
+//
+// Returns:
+// tuple of gradients for the point and line segment (v0, v1):
+// (float3 grad_p, float3 grad_v0, float3 grad_v1)
+
+__device__ inline thrust::tuple<float3, float3, float3>
+PointLine3DistanceBackward(
+ const float3& p,
+ const float3& v0,
+ const float3& v1,
+ const float& grad_dist) {
+ const float3 v1v0 = v1 - v0;
+ const float3 pv0 = p - v0;
+ const float t_bot = dot(v1v0, v1v0);
+ const float t_top = dot(v1v0, pv0);
+
+ float3 grad_p = make_float3(0.0f, 0.0f, 0.0f);
+ float3 grad_v0 = make_float3(0.0f, 0.0f, 0.0f);
+ float3 grad_v1 = make_float3(0.0f, 0.0f, 0.0f);
+
+ const float tt = t_top / t_bot;
+
+ if (t_bot < kEpsilon) {
+ // if t_bot small, then v0 == v1,
+ // and dist = 0.5 * dot(pv0, pv0) + 0.5 * dot(pv1, pv1)
+ grad_p = grad_dist * 2.0f * pv0;
+ grad_v0 = -0.5f * grad_p;
+ grad_v1 = grad_v0;
+ } else if (tt < 0.0f) {
+ grad_p = grad_dist * 2.0f * pv0;
+ grad_v0 = -1.0f * grad_p;
+ // no gradients wrt v1
+ } else if (tt > 1.0f) {
+ grad_p = grad_dist * 2.0f * (p - v1);
+ grad_v1 = -1.0f * grad_p;
+ // no gradients wrt v0
+ } else {
+ const float3 p_proj = v0 + tt * v1v0;
+ const float3 diff = p - p_proj;
+ const float3 grad_base = grad_dist * 2.0f * diff;
+ grad_p = grad_base - dot(grad_base, v1v0) * v1v0 / t_bot;
+ const float3 dtt_v0 = (-1.0f * v1v0 - pv0 + 2.0f * tt * v1v0) / t_bot;
+ grad_v0 = (-1.0f + tt) * grad_base - dot(grad_base, v1v0) * dtt_v0;
+ const float3 dtt_v1 = (pv0 - 2.0f * tt * v1v0) / t_bot;
+ grad_v1 = -dot(grad_base, v1v0) * dtt_v1 - tt * grad_base;
+ }
+
+ return thrust::make_tuple(grad_p, grad_v0, grad_v1);
+}
+
+// Computes the squared distance of a point p relative to a triangle (v0, v1,
+// v2). If the point's projection p0 on the plane spanned by (v0, v1, v2) is
+// inside the triangle with vertices (v0, v1, v2), then the returned value is
+// the squared distance of p to its projection p0. Otherwise, the returned value
+// is the smallest squared distance of p from the line segments (v0, v1), (v0,
+// v2) and (v1, v2).
+//
+// Args:
+// p: vec3 coordinates of a point
+// v0, v1, v2: vec3 coordinates of the triangle vertices
+// min_triangle_area: triangles less than this size are considered
+// points/lines, IsInsideTriangle returns False
+//
+// Returns:
+// dist: Float of the squared distance
+//
+
+__device__ inline float PointTriangle3DistanceForward(
+ const float3& p,
+ const float3& v0,
+ const float3& v1,
+ const float3& v2,
+ const double min_triangle_area) {
+ float3 normal = cross(v2 - v0, v1 - v0);
+ const float norm_normal = norm(normal);
+ normal = normalize(normal);
+
+ // p0 is the projection of p on the plane spanned by (v0, v1, v2)
+ // i.e. p0 = p + t * normal, s.t. (p0 - v0) is orthogonal to normal
+ const float t = dot(v0 - p, normal);
+ const float3 p0 = p + t * normal;
+
+ bool is_inside = IsInsideTriangle(p0, v0, v1, v2, min_triangle_area);
+ float dist = 0.0f;
+
+ if ((is_inside) && (norm_normal > kEpsilon)) {
+ // if projection p0 is inside triangle spanned by (v0, v1, v2)
+ // then distance is equal to norm(p0 - p)^2
+ dist = t * t;
+ } else {
+ const float e01 = PointLine3DistanceForward(p, v0, v1);
+ const float e02 = PointLine3DistanceForward(p, v0, v2);
+ const float e12 = PointLine3DistanceForward(p, v1, v2);
+
+ dist = (e01 > e02) ? e02 : e01;
+ dist = (dist > e12) ? e12 : dist;
+ }
+
+ return dist;
+}
+
+// The backward pass for computing the squared distance of a point
+// to the triangle (v0, v1, v2).
+//
+// Args:
+// p: xyz coordinates of a point
+// v0, v1, v2: xyz coordinates of the triangle vertices
+// grad_dist: Float of the gradient wrt dist
+// min_triangle_area: triangles less than this size are considered
+// points/lines, IsInsideTriangle returns False
+//
+// Returns:
+// tuple of gradients for the point and triangle:
+// (float3 grad_p, float3 grad_v0, float3 grad_v1, float3 grad_v2)
+//
+
+__device__ inline thrust::tuple<float3, float3, float3, float3>
+PointTriangle3DistanceBackward(
+ const float3& p,
+ const float3& v0,
+ const float3& v1,
+ const float3& v2,
+ const float& grad_dist,
+ const double min_triangle_area) {
+ const float3 v2v0 = v2 - v0;
+ const float3 v1v0 = v1 - v0;
+ const float3 v0p = v0 - p;
+ float3 raw_normal = cross(v2v0, v1v0);
+ const float norm_normal = norm(raw_normal);
+ float3 normal = normalize(raw_normal);
+
+ // p0 is the projection of p on the plane spanned by (v0, v1, v2)
+ // i.e. p0 = p + t * normal, s.t. (p0 - v0) is orthogonal to normal
+ const float t = dot(v0 - p, normal);
+ const float3 p0 = p + t * normal;
+ const float3 diff = t * normal;
+
+ bool is_inside = IsInsideTriangle(p0, v0, v1, v2, min_triangle_area);
+
+ float3 grad_p = make_float3(0.0f, 0.0f, 0.0f);
+ float3 grad_v0 = make_float3(0.0f, 0.0f, 0.0f);
+ float3 grad_v1 = make_float3(0.0f, 0.0f, 0.0f);
+ float3 grad_v2 = make_float3(0.0f, 0.0f, 0.0f);
+
+ if ((is_inside) && (norm_normal > kEpsilon)) {
+ // derivative of dist wrt p
+ grad_p = -2.0f * grad_dist * t * normal;
+ // derivative of dist wrt normal
+ const float3 grad_normal = 2.0f * grad_dist * t * (v0p + diff);
+ // derivative of dist wrt raw_normal
+ const float3 grad_raw_normal = normalize_backward(raw_normal, grad_normal);
+ // derivative of dist wrt v2v0 and v1v0
+ const auto grad_cross = cross_backward(v2v0, v1v0, grad_raw_normal);
+ const float3 grad_cross_v2v0 = thrust::get<0>(grad_cross);
+ const float3 grad_cross_v1v0 = thrust::get<1>(grad_cross);
+ grad_v0 =
+ grad_dist * 2.0f * t * normal - (grad_cross_v2v0 + grad_cross_v1v0);
+ grad_v1 = grad_cross_v1v0;
+ grad_v2 = grad_cross_v2v0;
+ } else {
+ const float e01 = PointLine3DistanceForward(p, v0, v1);
+ const float e02 = PointLine3DistanceForward(p, v0, v2);
+ const float e12 = PointLine3DistanceForward(p, v1, v2);
+
+ if ((e01 <= e02) && (e01 <= e12)) {
+ // e01 is smallest
+ const auto grads = PointLine3DistanceBackward(p, v0, v1, grad_dist);
+ grad_p = thrust::get<0>(grads);
+ grad_v0 = thrust::get<1>(grads);
+ grad_v1 = thrust::get<2>(grads);
+ } else if ((e02 <= e01) && (e02 <= e12)) {
+ // e02 is smallest
+ const auto grads = PointLine3DistanceBackward(p, v0, v2, grad_dist);
+ grad_p = thrust::get<0>(grads);
+ grad_v0 = thrust::get<1>(grads);
+ grad_v2 = thrust::get<2>(grads);
+ } else if ((e12 <= e01) && (e12 <= e02)) {
+ // e12 is smallest
+ const auto grads = PointLine3DistanceBackward(p, v1, v2, grad_dist);
+ grad_p = thrust::get<0>(grads);
+ grad_v1 = thrust::get<1>(grads);
+ grad_v2 = thrust::get<2>(grads);
+ }
+ }
+
+ return thrust::make_tuple(grad_p, grad_v0, grad_v1, grad_v2);
+}
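For readers following the math in this file, a brief aside: the 2D barycentric helpers above use the standard edge-function construction, and the perspective correction is the usual $1/z$ weighting written with the denominators cleared, i.e.

```latex
E(p; a, b) = (p_x - a_x)(b_y - a_y) - (p_y - a_y)(b_x - a_x), \qquad
w_0 = \frac{E(p; v_1, v_2)}{E(v_2; v_0, v_1)}, \quad
w_1 = \frac{E(p; v_2, v_0)}{E(v_2; v_0, v_1)}, \quad
w_2 = \frac{E(p; v_0, v_1)}{E(v_2; v_0, v_1)},
```

```latex
w_i' = \frac{w_i / z_i}{\sum_j w_j / z_j}
     = \frac{w_i \prod_{j \neq i} z_j}{\sum_k w_k \prod_{j \neq k} z_j},
```

with the denominators stabilized by `kEpsilon` in the code.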
diff --git a/data/dot_single_video/dot/utils/torch3d/csrc/utils/geometry_utils.h b/data/dot_single_video/dot/utils/torch3d/csrc/utils/geometry_utils.h
new file mode 100644
index 0000000000000000000000000000000000000000..ce6e37cc4c1dbf444e04bbc302d1900b85d4834f
--- /dev/null
+++ b/data/dot_single_video/dot/utils/torch3d/csrc/utils/geometry_utils.h
@@ -0,0 +1,823 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#include <algorithm>
+#include <cmath>
+#include <tuple>
+#include "vec2.h"
+#include "vec3.h"
+
+// Set epsilon for preventing floating point errors and division by 0.
+const auto kEpsilon = 1e-8;
+
+// Determines whether a point p is on the right side of a 2D line segment
+// given by the end points v0, v1.
+//
+// Args:
+// p: vec2 Coordinates of a point.
+// v0, v1: vec2 Coordinates of the end points of the edge.
+//
+// Returns:
+// area: The signed area of the parallelogram given by the vectors
+// A = p - v0
+// B = v1 - v0
+//
+// v1 ________
+// /\ /
+// A / \ /
+// / \ /
+// v0 /______\/
+// B p
+//
+// The area can also be interpreted as the cross product A x B.
+// If the sign of the area is positive, the point p is on the
+// right side of the edge. Negative area indicates the point is on
+// the left side of the edge. i.e. for an edge v1 - v0:
+//
+// v1
+// /
+// /
+// - / +
+// /
+// /
+// v0
+//
+template <typename T>
+T EdgeFunctionForward(const vec2<T>& p, const vec2<T>& v0, const vec2<T>& v1) {
+ const T edge = (p.x - v0.x) * (v1.y - v0.y) - (p.y - v0.y) * (v1.x - v0.x);
+ return edge;
+}
+
+// Backward pass for the edge function returning partial derivatives for each
+// of the input points.
+//
+// Args:
+// p: vec2 Coordinates of a point.
+// v0, v1: vec2 Coordinates of the end points of the edge.
+// grad_edge: Upstream gradient for output from edge function.
+//
+// Returns:
+// tuple of gradients for each of the input points:
+// (vec2 d_edge_dp, vec2 d_edge_dv0, vec2 d_edge_dv1)
+//
+template <typename T>
+inline std::tuple<vec2<T>, vec2<T>, vec2<T>> EdgeFunctionBackward(
+ const vec2<T>& p,
+ const vec2<T>& v0,
+ const vec2<T>& v1,
+ const T grad_edge) {
+ const vec2<T> dedge_dp(v1.y - v0.y, v0.x - v1.x);
+ const vec2<T> dedge_dv0(p.y - v1.y, v1.x - p.x);
+ const vec2<T> dedge_dv1(v0.y - p.y, p.x - v0.x);
+ return std::make_tuple(
+ grad_edge * dedge_dp, grad_edge * dedge_dv0, grad_edge * dedge_dv1);
+}
+
+// The forward pass for computing the barycentric coordinates of a point
+// relative to a triangle.
+// Ref:
+// https://www.scratchapixel.com/lessons/3d-basic-rendering/ray-tracing-rendering-a-triangle/barycentric-coordinates
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1, v2: Coordinates of the triangle vertices.
+//
+// Returns
+// bary: (w0, w1, w2) barycentric coordinates in the range [0, 1].
+//
+template <typename T>
+vec3<T> BarycentricCoordinatesForward(
+ const vec2<T>& p,
+ const vec2<T>& v0,
+ const vec2<T>& v1,
+ const vec2<T>& v2) {
+ const T area = EdgeFunctionForward(v2, v0, v1) + kEpsilon;
+ const T w0 = EdgeFunctionForward(p, v1, v2) / area;
+ const T w1 = EdgeFunctionForward(p, v2, v0) / area;
+ const T w2 = EdgeFunctionForward(p, v0, v1) / area;
+ return vec3<T>(w0, w1, w2);
+}
+
+// The backward pass for computing the barycentric coordinates of a point
+// relative to a triangle.
+//
+// Args:
+// p: Coordinates of a point.
+// v0, v1, v2: (x, y) coordinates of the triangle vertices.
+// grad_bary_upstream: vec3 Upstream gradient for each of the
+// barycentric coordinates [grad_w0, grad_w1, grad_w2].
+//
+// Returns
+// tuple of gradients for each of the triangle vertices:
+// (vec2 grad_v0, vec2 grad_v1, vec2 grad_v2)
+//
+template <typename T>
+inline std::tuple<vec2<T>, vec2<T>, vec2<T>, vec2<T>> BarycentricCoordsBackward(
+ const vec2<T>& p,
+ const vec2<T>& v0,
+ const vec2<T>& v1,
+ const vec2<T>& v2,
+ const vec3<T>& grad_bary_upstream) {
+ const T area = EdgeFunctionForward(v2, v0, v1) + kEpsilon;
+ const T area2 = pow(area, 2.0f);
+ const T area_inv = 1.0f / area;
+ const T e0 = EdgeFunctionForward(p, v1, v2);
+ const T e1 = EdgeFunctionForward(p, v2, v0);
+ const T e2 = EdgeFunctionForward(p, v0, v1);
+
+ const T grad_w0 = grad_bary_upstream.x;
+ const T grad_w1 = grad_bary_upstream.y;
+ const T grad_w2 = grad_bary_upstream.z;
+
+ // Calculate component of the gradient from each of w0, w1 and w2.
+ // e.g. for w0:
+ // dloss/dw0_v = dl/dw0 * dw0/dw0_top * dw0_top/dv
+ // + dl/dw0 * dw0/dw0_bot * dw0_bot/dv
+ const T dw0_darea = -e0 / (area2);
+ const T dw0_e0 = area_inv;
+ const T dloss_d_w0area = grad_w0 * dw0_darea;
+ const T dloss_e0 = grad_w0 * dw0_e0;
+ auto de0_dv = EdgeFunctionBackward(p, v1, v2, dloss_e0);
+ auto dw0area_dv = EdgeFunctionBackward(v2, v0, v1, dloss_d_w0area);
+ const vec2<T> dw0_p = std::get<0>(de0_dv);
+ const vec2<T> dw0_dv0 = std::get<1>(dw0area_dv);
+ const vec2<T> dw0_dv1 = std::get<1>(de0_dv) + std::get<2>(dw0area_dv);
+ const vec2<T> dw0_dv2 = std::get<2>(de0_dv) + std::get<0>(dw0area_dv);
+
+ const T dw1_darea = -e1 / (area2);
+ const T dw1_e1 = area_inv;
+ const T dloss_d_w1area = grad_w1 * dw1_darea;
+ const T dloss_e1 = grad_w1 * dw1_e1;
+ auto de1_dv = EdgeFunctionBackward(p, v2, v0, dloss_e1);
+ auto dw1area_dv = EdgeFunctionBackward(v2, v0, v1, dloss_d_w1area);
+ const vec2<T> dw1_p = std::get<0>(de1_dv);
+ const vec2<T> dw1_dv0 = std::get<2>(de1_dv) + std::get<1>(dw1area_dv);
+ const vec2<T> dw1_dv1 = std::get<2>(dw1area_dv);
+ const vec2<T> dw1_dv2 = std::get<1>(de1_dv) + std::get<0>(dw1area_dv);
+
+ const T dw2_darea = -e2 / (area2);
+ const T dw2_e2 = area_inv;
+ const T dloss_d_w2area = grad_w2 * dw2_darea;
+ const T dloss_e2 = grad_w2 * dw2_e2;
+ auto de2_dv = EdgeFunctionBackward(p, v0, v1, dloss_e2);
+ auto dw2area_dv = EdgeFunctionBackward(v2, v0, v1, dloss_d_w2area);
+ const vec2<T> dw2_p = std::get<0>(de2_dv);
+ const vec2<T> dw2_dv0 = std::get<1>(de2_dv) + std::get<1>(dw2area_dv);
+ const vec2<T> dw2_dv1 = std::get<2>(de2_dv) + std::get<2>(dw2area_dv);
+ const vec2<T> dw2_dv2 = std::get<0>(dw2area_dv);
+
+ const vec2<T> dbary_p = dw0_p + dw1_p + dw2_p;
+ const vec2<T> dbary_dv0 = dw0_dv0 + dw1_dv0 + dw2_dv0;
+ const vec2<T> dbary_dv1 = dw0_dv1 + dw1_dv1 + dw2_dv1;
+ const vec2<T> dbary_dv2 = dw0_dv2 + dw1_dv2 + dw2_dv2;
+
+ return std::make_tuple(dbary_p, dbary_dv0, dbary_dv1, dbary_dv2);
+}
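+
+// Structure of the gradients above: each coordinate has the form
+// w_i = e_i / area, so by the quotient rule
+//   dw_i/de_i = 1 / area   and   dw_i/darea = -e_i / area^2,
+// which is what dw*_e* and dw*_darea encode; the vertex gradients are then
+// accumulated through EdgeFunctionBackward for both the numerator edge and
+// the shared denominator edge (v2, v0, v1).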
+
+// Forward pass for applying perspective correction to barycentric coordinates.
+//
+// Args:
+// bary: Screen-space barycentric coordinates for a point
+// z0, z1, z2: Camera-space z-coordinates of the triangle vertices
+//
+// Returns
+// World-space barycentric coordinates
+//
+template <typename T>
+inline vec3<T> BarycentricPerspectiveCorrectionForward(
+ const vec3<T>& bary,
+ const T z0,
+ const T z1,
+ const T z2) {
+ const T w0_top = bary.x * z1 * z2;
+ const T w1_top = bary.y * z0 * z2;
+ const T w2_top = bary.z * z0 * z1;
+ const T denom = std::max<T>(w0_top + w1_top + w2_top, kEpsilon);
+ const T w0 = w0_top / denom;
+ const T w1 = w1_top / denom;
+ const T w2 = w2_top / denom;
+ return vec3<T>(w0, w1, w2);
+}
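+
+// Intuition check: with screen-space bary = (0.5, 0.5, 0.0) and depths
+// z0 = 1, z1 = 3, z2 = 1, the numerators are (1.5, 0.5, 0.0), the denominator
+// is 2.0, and the corrected coordinates are (0.75, 0.25, 0.0): the vertex
+// closer to the camera (smaller z) receives the larger weight.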
+
+// Backward pass for applying perspective correction to barycentric coordinates.
+//
+// Args:
+// bary: Screen-space barycentric coordinates for a point
+// z0, z1, z2: Camera-space z-coordinates of the triangle vertices
+// grad_out: Upstream gradient of the loss with respect to the corrected
+// barycentric coordinates.
+//
+// Returns a tuple of:
+// grad_bary: Downstream gradient of the loss with respect to the
+// uncorrected barycentric coordinates.
+// grad_z0, grad_z1, grad_z2: Downstream gradient of the loss with respect
+// to the z-coordinates of the triangle verts
+template <typename T>
+inline std::tuple<vec3<T>, T, T, T> BarycentricPerspectiveCorrectionBackward(
+ const vec3<T>& bary,
+ const T z0,
+ const T z1,
+ const T z2,
+ const vec3<T>& grad_out) {
+ // Recompute forward pass
+ const T w0_top = bary.x * z1 * z2;
+ const T w1_top = bary.y * z0 * z2;
+ const T w2_top = bary.z * z0 * z1;
+ const T denom = std::max<T>(w0_top + w1_top + w2_top, kEpsilon);
+
+ // Now do backward pass
+ const T grad_denom_top =
+ -w0_top * grad_out.x - w1_top * grad_out.y - w2_top * grad_out.z;
+ const T grad_denom = grad_denom_top / (denom * denom);
+ const T grad_w0_top = grad_denom + grad_out.x / denom;
+ const T grad_w1_top = grad_denom + grad_out.y / denom;
+ const T grad_w2_top = grad_denom + grad_out.z / denom;
+ const T grad_bary_x = grad_w0_top * z1 * z2;
+ const T grad_bary_y = grad_w1_top * z0 * z2;
+ const T grad_bary_z = grad_w2_top * z0 * z1;
+ const vec3<T> grad_bary(grad_bary_x, grad_bary_y, grad_bary_z);
+ const T grad_z0 = grad_w1_top * bary.y * z2 + grad_w2_top * bary.z * z1;
+ const T grad_z1 = grad_w0_top * bary.x * z2 + grad_w2_top * bary.z * z0;
+ const T grad_z2 = grad_w0_top * bary.x * z1 + grad_w1_top * bary.y * z0;
+ return std::make_tuple(grad_bary, grad_z0, grad_z1, grad_z2);
+}
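+
+// Derivation note: since w_i = w_i_top / denom with denom = sum_j w_j_top,
+//   dL/d(w_i_top) = grad_out_i / denom - (sum_j w_j_top * grad_out_j) / denom^2,
+// which is exactly grad_w*_top = grad_denom + grad_out.* / denom above. The
+// bary and z gradients then follow from each w_i_top being the product of one
+// barycentric component and two of the depths.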
+
+// Clip negative barycentric coordinates to 0.0 and renormalize so
+// the barycentric coordinates for a point sum to 1. When the blur_radius
+// is greater than 0, a face will still be recorded as overlapping a pixel
+// if the pixel is outside the face. In this case at least one of the
+// barycentric coordinates for the pixel relative to the face will be negative.
+// Clipping will ensure that the texture and z buffer are interpolated
+// correctly.
+//
+// Args
+// bary: (w0, w1, w2) barycentric coordinates which can contain values < 0.
+//
+// Returns
+// bary: (w0, w1, w2) barycentric coordinates in the range [0, 1] which
+// satisfy the condition: sum(w0, w1, w2) = 1.0.
+//
+template <typename T>
+vec3<T> BarycentricClipForward(const vec3<T> bary) {
+ vec3<T> w(0.0f, 0.0f, 0.0f);
+ // Only clamp negative values to 0.0.
+ // No need to clamp values > 1.0 as they will be renormalized.
+ w.x = std::max(bary.x, 0.0f);
+ w.y = std::max(bary.y, 0.0f);
+ w.z = std::max(bary.z, 0.0f);
+ float w_sum = w.x + w.y + w.z;
+ w_sum = std::fmaxf(w_sum, 1e-5);
+ w.x /= w_sum;
+ w.y /= w_sum;
+ w.z /= w_sum;
+ return w;
+}
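+
+// Example: bary = (-0.2, 0.6, 0.6) clamps to (0.0, 0.6, 0.6), which sums to
+// 1.2 and renormalizes to (0.0, 0.5, 0.5), so downstream interpolation of
+// textures and the z buffer never sees negative weights.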
+
+// Backward pass for barycentric coordinate clipping.
+//
+// Args
+// bary: (w0, w1, w2) barycentric coordinates which can contain values < 0.
+// grad_baryclip_upstream: vec3 Upstream gradient for each of the clipped
+// barycentric coordinates [grad_w0, grad_w1, grad_w2].
+//
+// Returns
+// vec3 of gradients for the unclipped barycentric coordinates:
+// (grad_w0, grad_w1, grad_w2)
+//
+template <typename T>
+vec3<T> BarycentricClipBackward(
+ const vec3<T> bary,
+ const vec3<T> grad_baryclip_upstream) {
+ // Redo some of the forward pass calculations
+ vec3<T> w(0.0f, 0.0f, 0.0f);
+ w.x = std::max(bary.x, 0.0f);
+ w.y = std::max(bary.y, 0.0f);
+ w.z = std::max(bary.z, 0.0f);
+ float w_sum = w.x + w.y + w.z;
+
+ vec3<T> grad_bary(1.0f, 1.0f, 1.0f);
+ vec3<T> grad_clip(1.0f, 1.0f, 1.0f);
+ vec3