# SAT CogVideoX-2B

This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
fine-tuning code for SAT weights.

This code is the framework used by the team to train the model. It has few comments and requires careful study.

## Inference Model

1. Ensure that you have correctly installed the dependencies required by this folder.

```shell
pip install -r requirements.txt
```

2. Download the model weights

First, download the model files from the SAT mirror.

```shell
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```

After unzipping, the model directory structure should look like this:

```
.
β”œβ”€β”€ transformer
β”‚   β”œβ”€β”€ 1000
β”‚   β”‚   └── mp_rank_00_model_states.pt
β”‚   └── latest
└── vae
    └── 3d-vae.pt
```
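
Before moving on, you can quickly confirm that the weights landed where the configuration expects them. A minimal sketch; the base path `CogVideoX-2b-sat` is just the directory created above, adjust it if you unzipped elsewhere:

```python
from pathlib import Path

# Base path of the downloaded SAT weights; adjust if you unzipped elsewhere.
base = Path("CogVideoX-2b-sat")

expected = [
    base / "transformer" / "1000" / "mp_rank_00_model_states.pt",
    base / "transformer" / "latest",
    base / "vae" / "3d-vae.pt",
]

for path in expected:
    print(f"{'ok' if path.exists() else 'MISSING':>7}  {path}")
```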

Next, clone the T5 model. It is not trained or fine-tuned, but it is required as the text encoder.

```shell
git lfs install 
git clone https://huggingface.co/google/t5-v1_1-xxl.git
```

The **tf_model.h5** file is not needed and can be deleted.
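
If you want to verify the T5 clone before wiring it into the config, here is a minimal sketch using the standard `transformers` API (not the SAT loader). The directory name `t5-v1_1-xxl` is simply what `git clone` creates above, and the bfloat16 cast is only there to keep memory usage manageable:

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Local clone from the step above; only the tokenizer and encoder weights are
# needed, which is why tf_model.h5 can be removed.
model_dir = "t5-v1_1-xxl"

tokenizer = T5Tokenizer.from_pretrained(model_dir)
encoder = T5EncoderModel.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

tokens = tokenizer("a quick sanity-check prompt", return_tensors="pt")
hidden = encoder(**tokens).last_hidden_state
print(hidden.shape)  # (1, seq_len, 4096) for t5-v1_1-xxl
```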

3. Modify the file `configs/cogvideox_2b_infer.yaml`.

```yaml
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path

conditioner_config:
  target: sgm.modules.GeneralConditioner
  params:
    emb_models:
      - is_trainable: false
        input_key: txt
        ucg_rate: 0.1
        target: sgm.modules.encoders.modules.FrozenT5Embedder
        params:
          model_dir: "google/t5-v1_1-xxl" ## T5 model path
          max_length: 226

first_stage_config:
  target: sgm.models.autoencoder.VideoAutoencoderInferenceWrapper
  params:
    cp_size: 1
    ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE model path
```

+ If using a txt file to supply multiple prompts, edit `configs/test.txt`, with one prompt per line (see the sketch
  after this list for generating such a file). If you don't know how to write prompts, you can first use
  [this code](../inference/convert_demo.py) to call an LLM to refine them.
+ If using the command line as input, modify

```yaml
input_type: cli
```

so that prompts can be entered from the command line.
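
As referenced above, when prompts come from a txt file the format is simply one prompt per line. A small sketch for generating such a file; the prompts themselves and the target path `configs/test.txt` are only illustrative:

```python
from pathlib import Path

# Illustrative prompts; the file format is one prompt per line.
prompts = [
    "A golden retriever runs through a sunlit meadow in slow motion.",
    "A paper boat drifts down a rain-soaked city street at night.",
]

Path("configs/test.txt").write_text("\n".join(prompts) + "\n", encoding="utf-8")
```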

If you want to change the output video directory, you can modify:

```yaml
output_dir: outputs/
```

By default, videos are saved in the `outputs/` folder.

4. Run the inference code to start inference

```shell
bash inference.sh
```

## Fine-Tuning the Model

### Preparing the Dataset

The dataset format should be as follows:

```
.
β”œβ”€β”€ labels
β”‚Β Β  β”œβ”€β”€ 1.txt
β”‚Β Β  β”œβ”€β”€ 2.txt
β”‚Β Β  β”œβ”€β”€ ...
└── videos
    β”œβ”€β”€ 1.mp4
    β”œβ”€β”€ 2.mp4
    β”œβ”€β”€ ...
```

Each txt file should have the same name as its corresponding video file and contain the label for that video. Videos
and labels should correspond one to one; a video should not have multiple labels.

For style fine-tuning, please prepare at least 50 videos and labels with similar styles to facilitate fitting.
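
Before launching a run, it can be worth checking that every video has exactly one matching label file. A small sketch assuming the layout above; the dataset root `my_dataset` is a hypothetical path:

```python
from pathlib import Path

# Hypothetical dataset root following the labels/ + videos/ layout above.
root = Path("my_dataset")

labels = {p.stem for p in (root / "labels").glob("*.txt")}
videos = {p.stem for p in (root / "videos").glob("*.mp4")}

print("videos missing a label:", sorted(videos - labels))
print("labels missing a video:", sorted(labels - videos))
print("paired samples:", len(videos & labels))
```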

### Modifying the Configuration File

We support both LoRA and full-parameter fine-tuning. Please note that both methods only fine-tune the `transformer` part; the VAE is not modified, and T5 is used only as an encoder.

Modify `configs/cogvideox_2b_sft.yaml` (for full-parameter fine-tuning) as follows.

```yaml
  # checkpoint_activations: True ## using gradient checkpointing (both checkpoint_activations in the configuration file need to be set to True)
  model_parallel_size: 1 # Model parallel size
  experiment_name: lora-disney  # Experiment name (do not change)
  mode: finetune # Mode (do not change)
  load: "{your_CogVideoX-2b-sat_path}/transformer" # Transformer model path
  no_load_rng: True # Do not load the random number generator state from the checkpoint
  train_iters: 1000 # Number of training iterations
  eval_iters: 1 # Number of evaluation iterations
  eval_interval: 100 # Evaluation interval
  eval_batch_size: 1 # Batch size for evaluation
  save: ckpts # Model save path
  save_interval: 100 # Model save interval
  log_interval: 20 # Log output interval
  train_data: [ "your train data path" ]
  valid_data: [ "your val data path" ] # Training and validation sets can be the same
  split: 1,0,0 # Ratio of training, validation, and test sets
  num_workers: 8 # Number of worker threads for data loading
```

If you wish to use LoRA fine-tuning, you also need to modify the `model` section:

```yaml
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ] ## Uncomment
  log_keys:
    - txt

  lora_config: ## Uncomment
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
```
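
For intuition on the `r: 256` setting: a LoRA adapter of rank `r` applied to a weight matrix of shape `(d_out, d_in)` adds roughly `r * (d_in + d_out)` trainable parameters while the original matrix stays frozen. A back-of-the-envelope sketch; the layer width 1920 is only an illustrative value, not read from the model config:

```python
def lora_extra_params(d_in: int, d_out: int, r: int = 256) -> int:
    """Trainable parameters added by one rank-r LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

# Illustrative square attention projection of width 1920.
print(f"{lora_extra_params(1920, 1920):,}")  # 983,040 extra trainable parameters
```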

### Fine-Tuning and Validation

1. Run the fine-tuning code to start fine-tuning.

```shell
bash finetune.sh
```

### Converting to Huggingface Diffusers Supported Weights

The SAT weight format is different from Huggingface's weight format and needs to be converted. Please run:

```shell
python ../tools/convert_weight_sat2hf.py
```

**Note**: This conversion has not yet been tested with LoRA fine-tuned models.
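
Once converted, the weights can be smoke-tested with the `CogVideoXPipeline` from `diffusers`. A hedged sketch; the directory `converted-cogvideox-2b` is only a placeholder for wherever the conversion script wrote the Diffusers-format weights:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Placeholder path: point this at the converted Diffusers-format weights.
pipe = CogVideoXPipeline.from_pretrained("converted-cogvideox-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # keeps VRAM usage modest

frames = pipe(
    prompt="A paper boat drifts down a rain-soaked city street at night.",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```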