You agree to not use the model weights to conduct experiments that cause harm to human subjects.
Introduction
DropletVideo is a project exploring high-order spatio-temporal consistency in image-to-video generation. It is trained on DropletVideo-10M. The model supports multi-resolution inputs and dynamic FPS control for motion intensity, and it demonstrates potential for 3D consistency. For further details, see our project page and the technical report.
Features
- Multi-resolution inputs, accommodating pixel values from 512x512x85 (default 672x384x85) to 896x896x85 (default 1120x640x85), and videos with different aspect ratios.
- Dynamic FPS control for motion intensity.
Installation
Follow the steps below to set up the environment for our project.
Our tested System Environment:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
NVIDIA A100-SXM4-80GB
Driver Version: 550.144.03
(Optional) Create a conda environment and activate it:
conda create -n DropletVideo python=3.8
conda activate DropletVideo
Install the required dependencies:
cd DropletVideo_inference
pip install -r requirements.txt
We provide a requirements.txt file that contains all necessary dependencies for easy installation. The DropletVideo-5B checkpoints are stored in the DropletVideo-V1.0-weights folder.
The distribution of internal model weights is as follows:
The text_encoder and the tokenizer use the google-t5 model weights (without further training). The scheduler defines the denoising strategy used during inference. The vae is the pixel-to-latent network for our project. The transformer contains our 5B transformer model weights.
DropletVideo-V1.0-weights/
├── configuration.json
├── LICENSE
├── model_index.json
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   ├── config.json
│   ├── model-00001-of-00002.safetensors
│   ├── model-00002-of-00002.safetensors
│   └── model.safetensors.index.json
├── tokenizer
│   ├── added_tokens.json
│   ├── special_tokens_map.json
│   ├── spiece.model
│   └── tokenizer_config.json
├── transformer
│   ├── config.json
│   └── diffusion_pytorch_model.safetensors
└── vae
    ├── config.json
    └── diffusion_pytorch_model.safetensors
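Before running inference, it can help to confirm that the download is complete. The following is a minimal sketch, not part of the DropletVideo code base: the helper name `missing_checkpoint_files` is hypothetical, and the file list is copied from the directory layout above.

```python
from pathlib import Path
from typing import List

# Relative paths copied from the checkpoint layout listed in this README.
EXPECTED_FILES = [
    "model_index.json",
    "scheduler/scheduler_config.json",
    "text_encoder/config.json",
    "text_encoder/model-00001-of-00002.safetensors",
    "text_encoder/model-00002-of-00002.safetensors",
    "text_encoder/model.safetensors.index.json",
    "tokenizer/spiece.model",
    "tokenizer/tokenizer_config.json",
    "transformer/config.json",
    "transformer/diffusion_pytorch_model.safetensors",
    "vae/config.json",
    "vae/diffusion_pytorch_model.safetensors",
]


def missing_checkpoint_files(root: str) -> List[str]:
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("DropletVideo-V1.0-weights")
    print("all files present" if not missing else f"missing: {missing}")
```

An empty result only means the files exist; it does not verify their contents.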
Note:
All model weights are stored in the safetensors format. Safetensors is a file format designed for storing tensor data, aiming to provide efficient and secure read and write operations. It is commonly used to store weights and parameters of machine learning models. The snippet below shows how to read a safetensors file. You can inspect the model weights through the state_dict variable.
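from safetensors.torch import load_file
# Example path taken from the checkpoint layout; point it at any .safetensors file.
file_path = "DropletVideo-V1.0-weights/transformer/diffusion_pytorch_model.safetensors"
state_dict = load_file(file_path)  # dict mapping tensor names to torch.Tensor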
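If you only want to list tensor names and shapes, the weights do not have to be loaded at all. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing each tensor. The sketch below reads that header with the standard library only; the function names `read_safetensors_header` and `count_parameters` are hypothetical and not part of the safetensors package or of this project.

```python
import json
import math
import struct
from typing import Dict


def read_safetensors_header(path: str) -> Dict[str, dict]:
    """Return the tensor entries of a .safetensors header without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def count_parameters(header: Dict[str, dict]) -> int:
    """Sum the element counts of all tensors described in the header."""
    return sum(math.prod(info["shape"]) for info in header.values())


def summarize(header: Dict[str, dict]) -> None:
    """Print name, dtype, and shape for every tensor in the header."""
    for name, info in sorted(header.items()):
        print(name, info["dtype"], info["shape"])
```

For example, `summarize(read_safetensors_header("DropletVideo-V1.0-weights/vae/diffusion_pytorch_model.safetensors"))` would print one line per tensor, assuming the file exists at that path.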
Usage
Once the installation is complete, you can run the demo using the following command:
python inference.py --ckpt DropletVideo-V1.0-weights --ref_img_dir your_path_to_ref_img --FPS 4 --prompt your_text_input
Example:
python inference.py --ckpt DropletVideo-V1.0-weights --ref_img_dir assets/752.jpg --FPS 4 --prompt "The video showcases a magnificent music hall, with the focal point being a black triangular piano in the center. The entire scene is elegant and rich in artistic atmosphere. The video begins with warm lighting that illuminates the ornate ceiling, followed by a lavish chandelier. These chandeliers are arranged in a circular pattern, with a soft white light emanating from the center. The wall decorations and carvings are exquisite, with the walls predominantly featuring gold and ivory white, creating a sense of solemnity and elegance. The camera moves from the left rear of the piano to the right, revealing every decorative detail of the music hall, including the second-floor gallery, ornate arched windows with decorations, and rows of empty seats facing the audience. As the camera pans, the piano's outline becomes more distinct, with the half-open lid revealing the smooth black keys that glow slightly under the spotlight. As the movement continues, the acoustic structure of the hall, such as the wooden floor and sound-absorbing walls, is gradually revealed, making the space more suitable for music performance. The video concludes with the camera stopping at the center, showcasing the entire hall, with the piano and background forming a beautiful artistic landscape. The hall is spacious, but its design and decoration convey a sense of solemnity and tranquility."
Command Line Arguments
1. Required arguments
- --ckpt: Path to the model weights.
- --ref_img_dir: Path to the input reference image.
- --FPS: FPS value used as an input condition; it controls the motion strength.
- --prompt: The input text prompt.
2. Other arguments
- --width: Width of the generated video.
- --height: Height of the generated video.
- --video_length: Number of frames in the generated video.
- --num_inference_steps: Number of denoising steps during inference. Higher values generally improve the quality of the generated video at a higher computational cost. We normally set it to 50.
- --seed: Random seed for inference. Different seeds produce different results.
- --guidance_scale: Guidance scale of the denoising process. It controls how strongly the generated video follows the input prompt; higher values give closer adherence to the prompt.
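Long prompts with commas and quotes are easy to break when typed into a shell. One option is to assemble the command as an argument list in Python, which avoids shell quoting altogether. This is a sketch under assumptions: the function name `build_inference_command` is hypothetical, and it only uses the flags documented above; optional flags are passed only when a value is given, so the script's own defaults apply otherwise.

```python
import shlex
import sys
from typing import List, Optional


def build_inference_command(
    ckpt: str,
    ref_img: str,
    fps: int,
    prompt: str,
    width: Optional[int] = None,
    height: Optional[int] = None,
    video_length: Optional[int] = None,
    num_inference_steps: Optional[int] = None,
    seed: Optional[int] = None,
    guidance_scale: Optional[float] = None,
) -> List[str]:
    """Build the argument list for inference.py from the documented flags."""
    cmd = [
        sys.executable, "inference.py",
        "--ckpt", ckpt,
        "--ref_img_dir", ref_img,
        "--FPS", str(fps),
        "--prompt", prompt,
    ]
    optional = {
        "--width": width,
        "--height": height,
        "--video_length": video_length,
        "--num_inference_steps": num_inference_steps,
        "--seed": seed,
        "--guidance_scale": guidance_scale,
    }
    for flag, value in optional.items():
        if value is not None:  # omit unset flags so inference.py uses its defaults
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command(
        "DropletVideo-V1.0-weights", "assets/752.jpg", 4,
        "The video showcases a magnificent music hall.", seed=42,
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the DropletVideo_inference directory.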
Note:
DropletVideo supports arbitrary input resolutions. However, this version does not include an auto-padding function, so the input width and height must be divisible by 16. The default width, height, and video_length are 672, 384, and 85. With this setting, videos can be generated on a single A100-40GB GPU. You can reduce these values while debugging or optimizing.
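Because there is no auto-padding in this version, it may be worth checking the requested size before launching a run. The sketch below is illustrative only: `check_resolution` and `round_down_to_multiple` are hypothetical helpers, not functions from inference.py, and choosing to round down rather than pad is an assumption.

```python
def check_resolution(width: int, height: int, multiple: int = 16) -> None:
    """Raise ValueError if width or height is not a positive multiple of `multiple`."""
    bad = {
        name: value
        for name, value in (("width", width), ("height", height))
        if value <= 0 or value % multiple != 0
    }
    if bad:
        raise ValueError(f"must be positive multiples of {multiple}: {bad}")


def round_down_to_multiple(value: int, multiple: int = 16) -> int:
    """Return the largest multiple of `multiple` not exceeding `value` (at least one multiple)."""
    return max(multiple, (value // multiple) * multiple)


if __name__ == "__main__":
    check_resolution(672, 384)  # the default size passes silently
    print(round_down_to_multiple(700))
```

The rounded values are only suggestions for `--width` and `--height`; how the reference image is resized to match is left to the user.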
Credits
This project leverages the following open-source frameworks. We appreciate their contributions and efforts in making this work possible.
- CogVideoX-Fun - Training Strategies
- CogVideoX - VAE compression
- EasyAnimate - I2V model configuration
- Open-Sora-Plan - Data processing
- Open-Sora - Extra control
Citation
If you find our work helpful, please leave us a star and cite our paper.
@article{zhang2025dropletvideo,
title={DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation},
author={Zhang, Runze and Du, Guoguang and Li, Xiaochuan and Jia, Qi and Jin, Liang and Liu, Lu and Wang, Jingjing and Xu, Cong and Guo, Zhenhua and Zhao, Yaqian and Gong, Xiaoli and Li, Rengang and Fan, Baoyu},
journal={arXiv preprint arXiv:2503.06053},
year={2025}
}
Contact us
If you have any questions, comments, or suggestions, please contact us at [email protected].
License
This project is released under the Apache 2.0 license.
Model tree for DropletX/DropletVideo-5B
Base model
alibaba-pai/CogVideoX-Fun-V1.1-5b-InP