jbilcke-hf HF Staff committed on
Commit bd0b128 · 1 Parent(s): 52addc5

update research

Files changed (1)
  1. how-frames-work.md +71 -30
how-frames-work.md CHANGED
@@ -133,36 +133,77 @@ This compression strategy:
  2. **Pools temporal information**: Averages remaining frames
  3. **Maintains continuity**: Ensures smooth transitions

- ## Why Can't We Easily Change to 18 Frames?
-
- Changing from 33 to 18 frames per chunk is problematic for multiple reasons:
-
- ### 1. Training-Time Fixed Parameters
- According to the paper:
- - The model was trained with **33-frame chunks at 25 FPS**
- - The hybrid history conditioning ratios were optimized for 33-frame segments
- - The entire dataset was annotated and partitioned into 6-second clips containing multiple 33-frame chunks
-
- ### 2. VAE Constraints
- - **884 VAE**: Requires (n-1) or (n-2) divisible by 4
-   - 18 frames: (18-1)/4 = 4.25 ❌ (not integer)
-   - Would need 17 or 21 frames for proper compression
- - **888 VAE**: Requires (n-1) divisible by 8
-   - 18 frames: (18-1)/8 = 2.125 ❌ (not integer)
-   - Would need 17 or 25 frames instead
-
- ### 3. Hardcoded Dependencies
- Multiple components assume 33-frame chunks:
- - **sample_inference.py**: Lines 525, 527, 613, 615 hardcode 34/37/66/69
- - **cameranet.py**: Line 150 specifically checks for 34 or 66 frames
- - **ActionToPoseFromID**: Hardcoded duration=33 for camera trajectory generation
- - **app.py**: sample_n_frames=33 is fixed
-
- ### 4. Model Architecture Assumptions
- - **MM-DiT backbone**: Trained with specific sequence lengths
- - **Rotary Position Embeddings**: Optimized for 37/69 frame sequences
- - **Camera encoder**: Designed for 33-frame action sequences
- - **Attention patterns**: Expect these specific sequence lengths
+ ## Case Study: Using 17 Frames Instead of 33
+
+ While the model is trained on 33-frame chunks, we can theoretically adapt it to use 17 frames, which is roughly half the duration and maintains VAE compatibility:
+
+ ### 1. Why 17 Frames Works with the VAE
+
+ 17 frames is compatible with both VAE architectures (checked in the sketch after this list):
+
+ - **884 VAE** (4:1 temporal compression):
+   - Formula: (17-1)/4 + 1 = 5 latent frames ✓
+   - Clean division ensures proper encoding/decoding
+
+ - **888 VAE** (8:1 temporal compression):
+   - Formula: (17-1)/8 + 1 = 3 latent frames ✓
+   - Also divides cleanly for proper compression
+
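+ This arithmetic is easy to check mechanically. Below is a minimal sketch that assumes only the (n-1)/stride + 1 relationship described above, not the actual VAE implementation:
+
+ ```python
+ def latent_frames(n_frames: int, stride: int) -> int | None:
+     """Map a pixel-space frame count to a latent frame count for a causal
+     VAE with the given temporal stride; None if the count is incompatible."""
+     if (n_frames - 1) % stride != 0:
+         return None
+     return (n_frames - 1) // stride + 1
+
+ assert latent_frames(17, 4) == 5  # 884 VAE
+ assert latent_frames(17, 8) == 3  # 888 VAE
+ assert latent_frames(33, 4) == 9  # the native 33-frame chunk
+ ```
+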
+ ### 2. Required Code Modifications
+
+ To implement 17-frame generation, you would need to update:
+
+ #### a. Core Frame Configuration
+ - **app.py**: Change `args.sample_n_frames = 17`
+ - **ActionToPoseFromID**: Update the `duration=17` parameter
+ - **sample_inference.py**: Adjust the target_length calculations:
+ ```python
+ if is_image:
+     target_length = 18  # 17 generated + 1 initial frame
+ else:
+     target_length = 34  # 2 × 17 for video continuation
+ ```
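+
+ Because these values are coupled, deriving them from a single source of truth is safer than editing each site by hand. The `ChunkConfig` helper below is a hypothetical sketch, not part of the repository:
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class ChunkConfig:
+     sample_n_frames: int = 17  # was 33
+
+     def target_length(self, is_image: bool) -> int:
+         # Image start: generated frames + 1 initial frame.
+         # Video continuation: two chunks (history + new frames).
+         return self.sample_n_frames + 1 if is_image else 2 * self.sample_n_frames
+
+ cfg = ChunkConfig()
+ assert cfg.target_length(is_image=True) == 18
+ assert cfg.target_length(is_image=False) == 34
+ ```
+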
+ #### b. RoPE Embeddings
+ - For image-to-video: Use 21 instead of 37 (18 + 3 for alignment)
+ - For video-to-video: Use 37 instead of 69 (34 + 3 for alignment)
+
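+ The +3 offset appears consistent across modes (34 → 37 and 66 → 69 in the 33-frame setup), so the new lengths can be sanity-checked under that assumption:
+
+ ```python
+ ROPE_ALIGNMENT = 3  # assumption: inferred from 34 -> 37 and 66 -> 69
+
+ def rope_length(target_length: int) -> int:
+     return target_length + ROPE_ALIGNMENT
+
+ assert rope_length(18) == 21  # 17-frame image-to-video
+ assert rope_length(34) == 37  # 17-frame video-to-video
+ ```
+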
+ #### c. CameraNet Compression
+ Update the frame count checks in `cameranet.py`:
+ ```python
+ if x.shape[-1] == 34 or x.shape[-1] == 18:  # support both 33- and 17-frame modes
+     # adjust the temporal compression logic for the shorter sequence
+     ...
+ ```
+
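+ Since the surrounding text describes this step as averaging frames over time, a generic sketch of that idea follows; the pooled length and tensor layout here are assumptions, not the actual `cameranet.py` logic:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def compress_time(x: torch.Tensor, target_t: int) -> torch.Tensor:
+     """Average-pool a (B, C, T) feature map down to target_t time steps."""
+     return F.adaptive_avg_pool1d(x, target_t)
+
+ feats = torch.randn(1, 128, 18)   # hypothetical: 17 frames + 1 initial
+ pooled = compress_time(feats, 5)  # 5 latent frames for the 884 VAE
+ ```
+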
+ ### 3. Trade-offs and Considerations
+
+ **Advantages of 17 frames:**
+ - **Reduced memory usage**: ~48% less VRAM required
+ - **Faster generation**: Shorter sequences process more quickly
+ - **More responsive**: Actions complete in 0.68 seconds vs 1.32 seconds (see the arithmetic sketch after these lists)
+
+ **Disadvantages:**
+ - **Quality degradation**: Model wasn't trained on 17-frame chunks
+ - **Choppy motion**: Less temporal information for smooth transitions
+ - **Action granularity**: Shorter actions may feel abrupt
+ - **Potential artifacts**: VAE and attention patterns optimized for 33 frames
+
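+ The responsiveness numbers follow directly from the 25 FPS frame rate; the memory figure assumes roughly linear scaling with sequence length, so treat it as an estimate:
+
+ ```python
+ FPS = 25
+ print(f"17-frame chunk: {17 / FPS:.2f} s")       # 0.68 s
+ print(f"33-frame chunk: {33 / FPS:.2f} s")       # 1.32 s
+ print(f"sequence reduction: {1 - 17 / 33:.0%}")  # 48%
+ ```
+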
+ ### 4. Why Other Frame Counts Are Problematic
+
+ Not all frame counts satisfy the VAE constraints (a quick check follows the list):
+ - **18 frames**: ❌ (18-1)/4 = 4.25 (not an integer for the 884 VAE)
+ - **19 frames**: ❌ (19-1)/4 = 4.5 (not an integer)
+ - **20 frames**: ❌ (20-1)/4 = 4.75 (not an integer)
+ - **21 frames**: ✓ Works with the 884 VAE (6 latent frames)
+ - **25 frames**: ✓ Works with both VAEs (7 and 4 latent frames)
+
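+ The same divisibility rule can enumerate every workable count in a range, using only the constraints stated above:
+
+ ```python
+ def vae_compatible(n: int) -> dict[str, bool]:
+     return {
+         "884": (n - 1) % 4 == 0,  # 4:1 temporal compression
+         "888": (n - 1) % 8 == 0,  # 8:1 temporal compression
+     }
+
+ for n in range(17, 34):
+     flags = vae_compatible(n)
+     if any(flags.values()):
+         print(n, flags)  # 17, 21, 25, 29, and 33 pass the 884 check
+ ```
+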
+ ### 5. Implementation Note
+
+ While technically possible, using 17 frames would require:
+ 1. **Extensive testing**: Verify quality and temporal consistency
+ 2. **Possible fine-tuning**: The model may need adaptation for optimal results
+ 3. **Adjustment of action speeds**: Camera movements are calibrated for 33 frames (see the sketch after this list)
+ 4. **Modified training strategy**: If fine-tuning, adjust the hybrid history ratios
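+
+ For point 3, one plausible (untested) adjustment is to scale the per-frame camera deltas so that a full action still covers the same trajectory distance in half the frames; the helper below is purely illustrative:
+
+ ```python
+ OLD_CHUNK, NEW_CHUNK = 33, 17
+
+ def rescale_camera_delta(per_frame_delta: float) -> float:
+     # Assumption: preserve total displacement per action chunk,
+     # so each of the fewer frames moves proportionally further.
+     return per_frame_delta * OLD_CHUNK / NEW_CHUNK
+ ```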

  ## Recommendations for Frame Count Modification