Commit bd0b128 · 1 Parent(s): 52addc5

update research

how-frames-work.md CHANGED (+71 -30)
@@ -133,36 +133,77 @@ This compression strategy:
2. **Pools temporal information**: Averages remaining frames
3. **Maintains continuity**: Ensures smooth transitions

## Case Study: Using 17 Frames Instead of 33

While the model is trained on 33-frame chunks, it can in principle be adapted to use 17 frames, which is roughly half the duration and keeps the frame count compatible with the VAE:

### 1. Why 17 Frames Works with VAE

17 frames is actually compatible with both VAE architectures:

- **884 VAE** (4:1 temporal compression):
  - Formula: (17-1)/4 + 1 = 5 latent frames ✓
  - Clean division ensures proper encoding/decoding

- **888 VAE** (8:1 temporal compression):
  - Formula: (17-1)/8 + 1 = 3 latent frames ✓
  - Also divides cleanly for proper compression

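Both checks above are instances of one rule, written here in general form. The symbols $T$ (input frame count) and $r$ (temporal compression ratio, 4 or 8) are introduced only for illustration and are not names from the codebase:

$$
T_{\text{latent}} = \frac{T - 1}{r} + 1, \qquad \text{valid only when } (T - 1) \bmod r = 0
$$

For the 33-frame chunks the model is trained on, the same rule gives $(33-1)/4 + 1 = 9$ latent frames for the 884 VAE and $(33-1)/8 + 1 = 5$ for the 888 VAE.
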
### 2. Required Code Modifications

To implement 17-frame generation, you would need to update:

#### a. Core Frame Configuration
- **app.py**: Set `args.sample_n_frames = 17`
- **ActionToPoseFromID**: Pass `duration=17`
- **sample_inference.py**: Adjust the `target_length` calculations:
  ```python
  if is_image:
      target_length = 18  # 17 generated + 1 initial
  else:
      target_length = 34  # 2 × 17 for video continuation
  ```

#### b. RoPE Embeddings
- For image-to-video: Use 21 instead of 37 (18 + 3 for alignment)
- For video-to-video: Use 37 instead of 69 (34 + 3 for alignment)

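The hard-coded lengths in (a) and (b) all follow from the frame count, so they could be derived rather than edited by hand. The sketch below is only an illustration of that arithmetic: the helper names `target_length` and `rope_length` and the constant `ROPE_ALIGNMENT_PAD` are hypothetical and do not exist in the codebase, and the padding of 3 is taken directly from the "+ 3 for alignment" figures above.

```python
# Hypothetical helpers: none of these names come from the project itself.
ROPE_ALIGNMENT_PAD = 3  # the "+ 3 for alignment" term quoted in the text


def target_length(sample_n_frames: int, is_image: bool) -> int:
    """Sequence length for one chunk, following the rule described in (a)."""
    if is_image:
        # generated frames + 1 initial frame
        return sample_n_frames + 1
    # video continuation: 2 x the chunk size
    return 2 * sample_n_frames


def rope_length(sample_n_frames: int, is_image: bool) -> int:
    """RoPE length as described in (b): target length plus alignment padding."""
    return target_length(sample_n_frames, is_image) + ROPE_ALIGNMENT_PAD


if __name__ == "__main__":
    for n in (33, 17):
        print(n, rope_length(n, is_image=True), rope_length(n, is_image=False))
```

With 33 frames this reproduces the existing values 37 and 69, and with 17 frames it gives the 21 and 37 listed above, which is a quick consistency check on the numbers.
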
#### c. CameraNet Compression
Update the frame count checks in `cameranet.py`:
```python
if x.shape[-1] == 34 or x.shape[-1] == 18:  # Support both 33 and 17 frame modes
    ...  # Adjust compression logic for shorter sequences
```

### 3. Trade-offs and Considerations

**Advantages of 17 frames:**
- **Reduced memory usage**: ~48% fewer frames per chunk (17 vs 33); the actual VRAM saving depends on how much memory is taken by model weights, which do not shrink
- **Faster generation**: Shorter sequences process faster
- **More responsive**: Actions complete in 0.68 seconds vs 1.32 seconds (both figures correspond to 25 fps)

**Disadvantages:**
- **Quality degradation**: Model wasn't trained on 17-frame chunks
- **Choppy motion**: Less temporal information for smooth transitions
- **Action granularity**: Shorter actions may feel abrupt
- **Potential artifacts**: VAE and attention patterns are optimized for 33 frames

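The timing and frame-reduction figures above can be reproduced with a few lines of arithmetic. This is a sketch under one assumption: the 25 fps rate is inferred from the quoted durations (17 / 0.68 = 33 / 1.32 = 25) and is not stated in the source; the function names are illustrative only.

```python
FPS = 25  # assumption: inferred from the quoted 0.68 s and 1.32 s durations


def chunk_seconds(num_frames: int, fps: int = FPS) -> float:
    """Playback duration of one generated chunk."""
    return num_frames / fps


def frame_reduction(old_frames: int, new_frames: int) -> float:
    """Fraction of frames removed when shrinking the chunk size."""
    return 1 - new_frames / old_frames


if __name__ == "__main__":
    print(round(chunk_seconds(17), 2), round(chunk_seconds(33), 2))
    print(f"{frame_reduction(33, 17):.0%} fewer frames per chunk")
```

Note that the ~48% figure is a reduction in frame count; it is an upper bound on, not a measurement of, the VRAM saving.
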
### 4. Why Other Frame Counts Are Problematic

Not all frame counts work with the VAE constraints:
- **18 frames**: ❌ (18-1)/4 = 4.25 (not integer for 884 VAE)
- **19 frames**: ❌ (19-1)/4 = 4.5 (not integer)
- **20 frames**: ❌ (20-1)/4 = 4.75 (not integer)
- **21 frames**: ✓ Works with 884 VAE (6 latent frames), but not with 888 VAE since (21-1)/8 = 2.5
- **25 frames**: ✓ Works with both VAEs (7 and 4 latent frames)

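The list above can be checked mechanically. The following is a minimal sketch, assuming the divisibility rule used throughout this section (`(frames - 1)` must be a multiple of the temporal compression ratio); `latent_frames` is a hypothetical helper, not a function from the codebase.

```python
from typing import Optional


def latent_frames(num_frames: int, ratio: int) -> Optional[int]:
    """Latent frame count for a VAE with the given temporal compression ratio.

    Returns None when (num_frames - 1) is not divisible by the ratio,
    i.e. when the frame count is incompatible with that VAE.
    """
    if num_frames < 1 or (num_frames - 1) % ratio != 0:
        return None
    return (num_frames - 1) // ratio + 1


if __name__ == "__main__":
    # Ratio 4 corresponds to the 884 VAE, ratio 8 to the 888 VAE.
    for n in (17, 18, 19, 20, 21, 25, 33):
        print(f"{n} frames: 884 -> {latent_frames(n, 4)}, 888 -> {latent_frames(n, 8)}")
```

Frame counts that work with both VAEs are exactly those of the form 8k + 1 (9, 17, 25, 33, ...), which is why 17 and 25 are the nearest alternatives to 33.
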
### 5. Implementation Note

While technically possible, using 17 frames would require:
1. **Extensive testing**: Verify quality and temporal consistency
2. **Possible fine-tuning**: The model may need adaptation for optimal results
3. **Adjustment of action speeds**: Camera movements calibrated for 33 frames
4. **Modified training strategy**: If fine-tuning, adjust hybrid history ratios

## Recommendations for Frame Count Modification