oieieio happynear committed
Commit 559ff44 · verified · 0 parent(s)

Duplicate from IamCreateAI/Ruyi-Mini-7B

Co-authored-by: Feng Wang <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,133 @@
+ ---
+ language:
+ - "en"
+ tags:
+ - video generation
+ - CreateAI
+ license: apache-2.0
+ pipeline_tag: image-to-video
+ ---
+
+
+ # Ruyi-Mini-7B
+ [Hugging Face](https://huggingface.co/IamCreateAI/Ruyi-Mini-7B) | [GitHub](https://github.com/IamCreateAI/Ruyi-Models)
+
+ An image-to-video model by CreateAI.
+
+ ## Overview
+
+ Ruyi-Mini-7B is an open-source image-to-video generation model. Starting from an input image, Ruyi produces subsequent video frames at resolutions ranging from 360p to 720p, supports various aspect ratios, and generates clips up to 5 seconds long. Enhanced with motion and camera control, Ruyi offers greater flexibility and creativity in video generation. We are releasing the model under the permissive Apache 2.0 license.
+
+ ## Update
+
+ Dec 24, 2024: The diffusion model is updated to fix the black lines that appeared when creating 3:4 or 4:5 videos.
+
+ Dec 16, 2024: Ruyi-Mini-7B is released.
+
+ ## Installation
+
+ Install the code from GitHub:
+ ```bash
+ git clone https://github.com/IamCreateAI/Ruyi-Models
+ cd Ruyi-Models
+ pip install -r requirements.txt
+ ```
+
+ ## Running
+
+ We provide two ways to run our model. The first is to run the Python script directly:
+
+ ```bash
+ python3 predict_i2v.py
+ ```
+
+ Alternatively, use the ComfyUI wrapper in our [GitHub repo](https://github.com/IamCreateAI/Ruyi-Models).
+
+ ## Model Architecture
+
+ Ruyi-Mini-7B is an advanced image-to-video model with about 7.1 billion parameters. The architecture is modified from the [EasyAnimate V4 model](https://github.com/aigc-apps/EasyAnimate), whose transformer module is inherited from [HunyuanDiT](https://github.com/Tencent/HunyuanDiT). It comprises three key components:
+ 1. Causal VAE Module: Handles video compression and decompression. It reduces spatial resolution to 1/8 and temporal resolution to 1/4, with each latent pixel represented by 16 floating-point values after compression (see the sketch after this list).
+ 2. Diffusion Transformer Module: Generates the compressed video data using 3D full attention, with:
+ - 2D Normalized-RoPE for the spatial dimensions;
+ - Sin-cos position embedding for the temporal dimension;
+ - DDPM (Denoising Diffusion Probabilistic Models) for model training.
+ 3. CLIP Feature Extractor: Ruyi uses a CLIP model to extract semantic features from the input image and guide the whole video generation; the CLIP features are injected into the transformer via cross-attention.
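To make the compression factors in item 1 concrete, here is a minimal latent-shape sketch. It assumes the frame count divides evenly by the temporal factor; the exact first-frame handling of the causal VAE may differ, so treat it as an approximation rather than the model's exact bookkeeping.

```python
# Rough latent-shape estimate from the stated factors:
# spatial 1/8, temporal 1/4, 16 values per latent pixel.
def latent_shape(height, width, frames,
                 latent_channels=16, spatial_factor=8, temporal_factor=4):
    return (latent_channels,
            frames // temporal_factor,   # approximation; a causal VAE may treat the first frame specially
            height // spatial_factor,
            width // spatial_factor)

# Example: a 720x1280 clip with 120 frames
print(latent_shape(720, 1280, 120))  # (16, 30, 90, 160)
```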
+
+ ## Training Data and Methodology
+ The training process is divided into four phases:
+ - Phase 1: Pre-training from scratch with ~200M video clips and ~30M images at 256 resolution, using a batch size of 4096 for 350,000 iterations to achieve full convergence.
+ - Phase 2: Fine-tuning with ~60M video clips at multi-scale resolutions (384–512), with a batch size of 1024 for 60,000 iterations.
+ - Phase 3: High-quality fine-tuning with ~20M video clips and ~8M images at 384–1024 resolutions, with dynamic batch sizes based on available memory, for 10,000 iterations.
+ - Phase 4: Image-to-video training with ~10M curated high-quality video clips, with dynamic batch sizes based on available memory, for ~10,000 iterations.
+
+ ## Hardware Requirements
+ The VRAM cost of Ruyi depends on the resolution and duration of the video. Here we list the costs for some typical video sizes, measured on a single A100.
+ |Video Size (HxWxFrames) | 360x480x120 | 384x672x120 | 480x640x120 | 630x1120x120 | 720x1280x120 |
+ |:--:|:--:|:--:|:--:|:--:|:--:|
+ |VRAM | 21.5GB | 25.5GB | 27.7GB | 44.9GB | 54.8GB |
+ |Time (mm:ss) | 03:10 | 05:29 | 06:49 | 24:18 | 39:02 |
+
+ For 24GB VRAM cards such as the RTX 4090, we provide `low_gpu_memory_mode`, under which the model can still generate 720x1280x120 videos, at the cost of longer generation time.
+
+ ## Showcase
+
+ ### Image to Video Effects
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td><video src="https://github.com/user-attachments/assets/4dedf40b-82f2-454c-9a67-5f4ed243f5ea" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
+ <td><video src="https://github.com/user-attachments/assets/905fef17-8c5d-49b0-a49a-6ae7e212fa07" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
+ <td><video src="https://github.com/user-attachments/assets/20daab12-b510-448a-9491-389d7bdbbf2e" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
+ <td><video src="https://github.com/user-attachments/assets/f1bb0a91-d52a-4611-bac2-8fcf9658cac0" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
+ </tr>
+ </table>
+
+ ### Camera Control
+
+ <table border="0" style="width: 100%; text-align: center; ">
+ <tr>
+ <td align=center><img src="https://github.com/user-attachments/assets/8aedcea6-3b8e-4c8b-9fed-9ceca4d41954" width="100%" style="max-height:240px; min-height: 100px; margin-top: 20%;"></img></td>
+ <td align=center><video src="https://github.com/user-attachments/assets/d9d027d4-0d4f-45f5-9d46-49860b562c69" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video></td>
+ <td align=center><video src="https://github.com/user-attachments/assets/7716a67b-1bb8-4d44-b128-346cbc35e4ee" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video></td>
+ </tr>
+ <tr><td>input</td><td>left</td><td>right</td></tr>
+ <tr>
+ <td align=center><video src="https://github.com/user-attachments/assets/cc1f1928-cab7-4c4b-90af-928936102e66" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video></td>
+ <td align=center><video src="https://github.com/user-attachments/assets/c742ea2c-503a-454f-a61a-10b539100cd9" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video></td>
+ <td align=center><video src="https://github.com/user-attachments/assets/442839fa-cc53-4b75-b015-909e44c065e0" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video></td>
+ </tr>
+ <tr><td>static</td><td>up</td><td>down</td></tr>
+ </table>
+
+ ### Motion Amplitude Control
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td align=center><video src="https://github.com/user-attachments/assets/0020bd54-0ff6-46ad-91ee-d9f0df013772" width="100%" controls autoplay loop></video>motion 1</td>
+ <td align=center><video src="https://github.com/user-attachments/assets/d1c26419-54e3-4b86-8ae3-98e12de3022e" width="100%" controls autoplay loop></video>motion 2</td>
+ <td align=center><video src="https://github.com/user-attachments/assets/535147a2-049a-4afc-8d2a-017bc778977e" width="100%" controls autoplay loop></video>motion 3</td>
+ <td align=center><video src="https://github.com/user-attachments/assets/bf893d53-2e11-406f-bb9a-2aacffcecd44" width="100%" controls autoplay loop></video>motion 4</td>
+ </tr>
+ </table>
+
+ ## Limitations
+ There are some known limitations in this experimental release. Text, hands, and crowded human faces may be distorted. The video may also cut to another scene when the model does not know how to generate future frames. We are still working on these problems and will update the model as we make progress.
+
+
+ ## BibTeX
+ ```
+ @misc{createai2024ruyi,
+ title={Ruyi-Mini-7B},
+ author={CreateAI Team},
+ year={2024},
+ publisher={GitHub},
+ journal={GitHub repository},
+ howpublished={\url{https://github.com/IamCreateAI/Ruyi-Models}}
+ }
+ ```
+
+ ## Contact Us
+
+ You are welcome to join our [Discord](https://discord.com/invite/nueQFQwwGw) or WeChat group (scan the QR code to add the Ruyi Assistant and join the official group) for further discussion!
+
+ ![wechat](https://github.com/user-attachments/assets/cc5e25c6-34ab-4be1-a59b-7d5789264a9c)
embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b2e2591798abd3b934815a2c737b75b1a6555728ca19b68af9ce1d53cc7878d5
+ size 74976632
image_encoder/config.json ADDED
@@ -0,0 +1,179 @@
+ {
+ "_name_or_path": "openai/clip-vit-large-patch14-336",
+ "architectures": [
+ "CLIPModel"
+ ],
+ "initializer_factor": 1.0,
+ "logit_scale_init_value": 2.6592,
+ "model_type": "clip",
+ "projection_dim": 768,
+ "text_config": {
+ "_name_or_path": "",
+ "add_cross_attention": false,
+ "architectures": null,
+ "attention_dropout": 0.0,
+ "bad_words_ids": null,
+ "bos_token_id": 0,
+ "chunk_size_feed_forward": 0,
+ "cross_attention_hidden_size": null,
+ "decoder_start_token_id": null,
+ "diversity_penalty": 0.0,
+ "do_sample": false,
+ "dropout": 0.0,
+ "early_stopping": false,
+ "encoder_no_repeat_ngram_size": 0,
+ "eos_token_id": 2,
+ "exponential_decay_length_penalty": null,
+ "finetuning_task": null,
+ "forced_bos_token_id": null,
+ "forced_eos_token_id": null,
+ "hidden_act": "quick_gelu",
+ "hidden_size": 768,
+ "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1"
+ },
+ "initializer_factor": 1.0,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "is_decoder": false,
+ "is_encoder_decoder": false,
+ "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1
+ },
+ "layer_norm_eps": 1e-05,
+ "length_penalty": 1.0,
+ "max_length": 20,
+ "max_position_embeddings": 77,
+ "min_length": 0,
+ "model_type": "clip_text_model",
+ "no_repeat_ngram_size": 0,
+ "num_attention_heads": 12,
+ "num_beam_groups": 1,
+ "num_beams": 1,
+ "num_hidden_layers": 12,
+ "num_return_sequences": 1,
+ "output_attentions": false,
+ "output_hidden_states": false,
+ "output_scores": false,
+ "pad_token_id": 1,
+ "prefix": null,
+ "problem_type": null,
+ "projection_dim": 768,
+ "pruned_heads": {},
+ "remove_invalid_values": false,
+ "repetition_penalty": 1.0,
+ "return_dict": true,
+ "return_dict_in_generate": false,
+ "sep_token_id": null,
+ "task_specific_params": null,
+ "temperature": 1.0,
+ "tf_legacy_loss": false,
+ "tie_encoder_decoder": false,
+ "tie_word_embeddings": true,
+ "tokenizer_class": null,
+ "top_k": 50,
+ "top_p": 1.0,
+ "torch_dtype": null,
+ "torchscript": false,
+ "transformers_version": "4.21.3",
+ "typical_p": 1.0,
+ "use_bfloat16": false,
+ "vocab_size": 49408
+ },
+ "text_config_dict": {
+ "hidden_size": 768,
+ "intermediate_size": 3072,
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "projection_dim": 768
+ },
+ "torch_dtype": "float32",
+ "transformers_version": null,
+ "vision_config": {
+ "_name_or_path": "",
+ "add_cross_attention": false,
+ "architectures": null,
+ "attention_dropout": 0.0,
+ "bad_words_ids": null,
+ "bos_token_id": null,
+ "chunk_size_feed_forward": 0,
+ "cross_attention_hidden_size": null,
+ "decoder_start_token_id": null,
+ "diversity_penalty": 0.0,
+ "do_sample": false,
+ "dropout": 0.0,
+ "early_stopping": false,
+ "encoder_no_repeat_ngram_size": 0,
+ "eos_token_id": null,
+ "exponential_decay_length_penalty": null,
+ "finetuning_task": null,
+ "forced_bos_token_id": null,
+ "forced_eos_token_id": null,
+ "hidden_act": "quick_gelu",
+ "hidden_size": 1024,
+ "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1"
+ },
+ "image_size": 336,
+ "initializer_factor": 1.0,
+ "initializer_range": 0.02,
+ "intermediate_size": 4096,
+ "is_decoder": false,
+ "is_encoder_decoder": false,
+ "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1
+ },
+ "layer_norm_eps": 1e-05,
+ "length_penalty": 1.0,
+ "max_length": 20,
+ "min_length": 0,
+ "model_type": "clip_vision_model",
+ "no_repeat_ngram_size": 0,
+ "num_attention_heads": 16,
+ "num_beam_groups": 1,
+ "num_beams": 1,
+ "num_channels": 3,
+ "num_hidden_layers": 24,
+ "num_return_sequences": 1,
+ "output_attentions": false,
+ "output_hidden_states": false,
+ "output_scores": false,
+ "pad_token_id": null,
+ "patch_size": 14,
+ "prefix": null,
+ "problem_type": null,
+ "projection_dim": 768,
+ "pruned_heads": {},
+ "remove_invalid_values": false,
+ "repetition_penalty": 1.0,
+ "return_dict": true,
+ "return_dict_in_generate": false,
+ "sep_token_id": null,
+ "task_specific_params": null,
+ "temperature": 1.0,
+ "tf_legacy_loss": false,
+ "tie_encoder_decoder": false,
+ "tie_word_embeddings": true,
+ "tokenizer_class": null,
+ "top_k": 50,
+ "top_p": 1.0,
+ "torch_dtype": null,
+ "torchscript": false,
+ "transformers_version": "4.21.3",
+ "typical_p": 1.0,
+ "use_bfloat16": false
+ },
+ "vision_config_dict": {
+ "hidden_size": 1024,
+ "image_size": 336,
+ "intermediate_size": 4096,
+ "num_attention_heads": 16,
+ "num_hidden_layers": 24,
+ "patch_size": 14,
+ "projection_dim": 768
+ }
+ }
image_encoder/preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
+ {
+ "crop_size": 336,
+ "do_center_crop": true,
+ "do_normalize": true,
+ "do_resize": true,
+ "feature_extractor_type": "CLIPFeatureExtractor",
+ "image_mean": [
+ 0.48145466,
+ 0.4578275,
+ 0.40821073
+ ],
+ "image_std": [
+ 0.26862954,
+ 0.26130258,
+ 0.27577711
+ ],
+ "resample": 3,
+ "size": 336
+ }
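Per `_name_or_path` in the config above, the image encoder is a standard `openai/clip-vit-large-patch14-336` checkpoint, so it can in principle be loaded on its own with `transformers`. A minimal sketch, assuming the published repo id `IamCreateAI/Ruyi-Mini-7B` and the `image_encoder/` subfolder shown in this commit; the Ruyi-Models pipeline code presumably wires this up itself.

```python
from transformers import CLIPImageProcessor, CLIPModel

# Load the CLIP image encoder and its preprocessor from this repo's subfolder.
repo_id = "IamCreateAI/Ruyi-Mini-7B"
image_encoder = CLIPModel.from_pretrained(repo_id, subfolder="image_encoder")
processor = CLIPImageProcessor.from_pretrained(repo_id, subfolder="image_encoder")
```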
image_encoder/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c6032c2e0caae3dc2d4fba35535fa6307dbb49df59c7e182b1bc4b3329b81801
+ size 1711974081
model_index.json ADDED
@@ -0,0 +1,13 @@
+ {
+ "_class_name": "RuyiInpaintPipeline",
+ "_diffusers_version": "0.29.0.dev0",
+ "feature_extractor": [
+ null,
+ null
+ ],
+ "requires_safety_checker": false,
+ "safety_checker": [
+ null,
+ null
+ ]
+ }
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,21 @@
+ {
+ "_class_name": "DDPMScheduler",
+ "_diffusers_version": "0.29.0.dev0",
+ "beta_end": 0.03,
+ "beta_schedule": "scaled_linear",
+ "beta_start": 0.00085,
+ "clip_sample": false,
+ "clip_sample_range": 1.0,
+ "dynamic_thresholding_ratio": 0.995,
+ "num_train_timesteps": 1000,
+ "prediction_type": "v_prediction",
+ "rescale_betas_zero_snr": false,
+ "sample_max_value": 1.0,
+ "set_alpha_to_one": false,
+ "skip_prk_steps": true,
+ "steps_offset": 1,
+ "thresholding": false,
+ "timestep_spacing": "leading",
+ "trained_betas": null,
+ "variance_type": "fixed_small"
+ }
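Since `scheduler_config.json` declares a stock diffusers `DDPMScheduler` configured for v-prediction, it can be instantiated independently of the rest of the pipeline. A minimal sketch, assuming the repo id `IamCreateAI/Ruyi-Mini-7B` and the `scheduler/` subfolder from this commit.

```python
from diffusers import DDPMScheduler

# Load the v-prediction DDPM scheduler defined above.
scheduler = DDPMScheduler.from_pretrained(
    "IamCreateAI/Ruyi-Mini-7B", subfolder="scheduler"
)
print(scheduler.config.prediction_type)  # -> "v_prediction"
```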
transformer/config.json ADDED
@@ -0,0 +1,30 @@
+ {
+ "_class_name": "HunyuanTransformer3DModel",
+ "_diffusers_version": "0.30.2",
+ "activation_fn": "gelu-approximate",
+ "after_norm": false,
+ "attention_head_dim": 176,
+ "basic_block_type": "basic",
+ "cross_attention_dim": 1024,
+ "cross_attention_dim_t5": 2048,
+ "hidden_size": 2816,
+ "in_channels": 48,
+ "learn_sigma": true,
+ "mlp_ratio": 4.3637,
+ "motion_module_kwargs": null,
+ "motion_module_kwargs_even": null,
+ "motion_module_kwargs_odd": null,
+ "motion_module_type": "VanillaGrid",
+ "n_query": 16,
+ "norm_type": "layer_norm",
+ "num_attention_heads": 16,
+ "num_layers": 40,
+ "out_channels": 32,
+ "patch_size": 2,
+ "pooled_projection_dim": 1024,
+ "projection_dim": 1024,
+ "sample_size": 128,
+ "text_len": 77,
+ "text_len_t5": 256,
+ "time_position_encoding": true
+ }
transformer/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c8c3a2dfa1a190a577067b5a983f551f8b5082436fb10ba17888f9ed6597acd
+ size 14489153432
vae/config.json ADDED
@@ -0,0 +1,36 @@
+ {
+ "_class_name": "AutoencoderKL",
+ "_diffusers_version": "0.22.0.dev0",
+ "act_fn": "silu",
+ "block_out_channels": [
+ 128,
+ 256,
+ 512,
+ 512
+ ],
+ "down_block_types": [
+ "SpatialDownBlock3D",
+ "SpatialTemporalDownBlock3D",
+ "SpatialTemporalDownBlock3D",
+ "SpatialTemporalDownBlock3D"
+ ],
+ "force_upcast": true,
+ "in_channels": 3,
+ "latent_channels": 16,
+ "layers_per_block": 2,
+ "norm_num_groups": 32,
+ "out_channels": 3,
+ "sample_size": 256,
+ "scaling_factor": 0.18215,
+ "slice_compression_vae": true,
+ "use_tiling": true,
+ "mid_block_attention_type": "3d",
+ "mini_batch_encoder": 8,
+ "mini_batch_decoder": 2,
+ "up_block_types": [
+ "SpatialUpBlock3D",
+ "SpatialTemporalUpBlock3D",
+ "SpatialTemporalUpBlock3D",
+ "SpatialTemporalUpBlock3D"
+ ]
+ }
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:68e09062cf5d03d95f40bbef7a3372a24266ab7a6578cc0e8e8490664f8caaab
+ size 1055883284