# How Frames Work in Hunyuan-GameCraft
## Overview
The Hunyuan-GameCraft system generates high-dynamic interactive game videos using diffusion models and a hybrid history-conditioned training strategy. The frame handling is complex due to several factors:
1. **Causal VAE compression** (spatial and temporal with different ratios)
2. **Hybrid history conditioning** (using past frames/clips as context for autoregressive generation)
3. **Different generation modes** (image-to-video vs. video-to-video continuation)
4. **Rotary position embeddings (RoPE)** requirements for the MM-DiT backbone
## Paper Context
According to the paper, Hunyuan-GameCraft:
- Operates at **25 FPS** with each video chunk comprising **33-frame clips** at 720p resolution
- Uses a **causal VAE** for encoding/decoding that has uneven encoding of initial vs. subsequent frames
- Implements **chunk-wise autoregressive extension** where each chunk corresponds to one action
- Employs **hybrid history conditioning** with ratios: 70% single historical clip, 5% multiple clips, 25% single frame
## Key Frame Numbers Explained
### The Magic Numbers: 33, 34, 37, 66, 69
These numbers are fundamental to the architecture and not arbitrary:
- **33 frames**: The base video chunk size - each action generates exactly 33 frames (1.32 seconds at 25 FPS)
- **34 frames**: Used for image-to-video generation in latent space (33 + 1 initial frame)
- **37 frames**: Used for rotary position embeddings when starting from an image
- **66 frames**: Used for video-to-video continuation in latent space (2 × 33-frame chunks)
- **69 frames**: Used for rotary position embeddings when continuing from video
### Why These Specific Numbers?
The paper mentions "chunk latent denoising" where each chunk is a 33-frame segment. The specific numbers arise from:
1. **Base Chunk Size**: 33 frames per action (fixed by training)
2. **Initial Frame Handling**: +1 frame for the reference image in image-to-video mode
3. **RoPE Alignment**: +3 frames for proper positional encoding alignment in the transformer
4. **History Conditioning**: Doubling for video continuation (using previous chunk as context)
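These relationships can be sketched in a few lines of Python. The helper below is illustrative only (its name and structure are not taken from the repository); the constants simply mirror the hardcoded 33/34/37/66/69 values discussed above.

```python
CHUNK_FRAMES = 33   # base chunk size the model is trained on (one action = 33 frames)
ROPE_PAD = 3        # extra frames added for RoPE positional-encoding alignment

def frame_counts(is_image: bool) -> tuple:
    """Illustrative helper (not project code): derive the latent target length
    and the RoPE length for one generation step from the 33-frame chunk size."""
    if is_image:
        target_length = CHUNK_FRAMES + 1   # 33 generated frames + 1 reference image = 34
    else:
        target_length = 2 * CHUNK_FRAMES   # previous chunk as history + new chunk = 66
    return target_length, target_length + ROPE_PAD   # (34, 37) or (66, 69)

assert frame_counts(is_image=True) == (34, 37)
assert frame_counts(is_image=False) == (66, 69)
```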
## VAE Compression Explained
### VAE Types and the "4n+1" / "8n+1" Formula
The project uses different VAE (Variational Autoencoder) models identified by codes like "884" or "888":
#### VAE Naming Convention: "XYZ-16c-hy0801"
- **First digit (X)**: Temporal compression ratio
- **Second digit (Y)**: Spatial compression ratio (height)
- **Third digit (Z)**: Spatial compression ratio (width)
- **16c**: 16 latent channels
- **hy0801**: Version identifier
#### "884" VAE (Default in Code)
- **Temporal compression**: 4:1 (every 4 frames → 1 latent frame)
- **Spatial compression**: 8:1 for both height and width
- **Frame formula**:
- Standard: `latent_frames = (video_frames - 1) // 4 + 1`
- Special handling: `latent_frames = (video_frames - 2) // 4 + 2` (used by the pipeline for the 34- and 66-frame inputs shown below)
- **Why "4n+1"?**: The causal VAE requires frames in multiples of 4 plus 1 for proper temporal compression
- Example: 33 frames → (33-1)/4 + 1 = 9 latent frames
- Example: 34 frames → (34-2)/4 + 2 = 10 latent frames (special case in pipeline)
- Example: 66 frames → (66-2)/4 + 2 = 18 latent frames
#### "888" VAE (Alternative)
- **Temporal compression**: 8:1 (every 8 frames → 1 latent frame)
- **Spatial compression**: 8:1 for both height and width
- **Frame formula**: `latent_frames = (video_frames - 1) // 8 + 1`
- **Why "8n+1"?**: Similar principle but with 8:1 temporal compression
- Example: 33 frames → (33-1)/8 + 1 = 5 latent frames
- Example: 65 frames → (65-1)/8 + 1 = 9 latent frames
#### No Compression VAE
- When VAE code doesn't match the pattern, no temporal compression is applied
- `latent_frames = video_frames`
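Putting the three cases together, a minimal sketch of the frame-count arithmetic might look like the following. This is not the project's VAE code; in particular, the condition for the 884 "special handling" branch is an assumption (it is applied here to 4n+2 frame counts, which covers the 34- and 66-frame inputs used by the pipeline).

```python
def latent_frame_count(video_frames: int, vae_code: str = "884-16c-hy0801") -> int:
    """Sketch of the temporal-compression formulas described above (not project code)."""
    if "884" in vae_code:
        if (video_frames - 1) % 4 == 0:
            return (video_frames - 1) // 4 + 1   # standard 4n+1 inputs, e.g. 33 -> 9
        return (video_frames - 2) // 4 + 2       # assumed special handling, e.g. 34 -> 10, 66 -> 18
    if "888" in vae_code:
        return (video_frames - 1) // 8 + 1       # 8:1 temporal compression, e.g. 33 -> 5
    return video_frames                          # no temporal compression

assert latent_frame_count(33) == 9
assert latent_frame_count(34) == 10
assert latent_frame_count(66) == 18
assert latent_frame_count(33, "888-16c-hy0801") == 5
```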
### Why Different Formulas?
The formulas handle the causal nature of the VAE as mentioned in the paper:
1. **Causal VAE Characteristics**: The paper states that causal VAEs have "uneven encoding of initial versus subsequent frames"
2. **First Frame Special Treatment**: The initial frame requires different handling than subsequent frames
3. **Temporal Consistency**: The causal attention ensures each frame only attends to previous frames, maintaining temporal coherence
4. **Chunk Boundaries**: The formulas ensure proper alignment with the 33-frame chunk size used in training
## Frame Processing Pipeline
### 1. Image-to-Video Generation (First Segment)
```python
# Starting from a single image
if is_image:
    target_length = 34  # in latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(37, height, width)
```
**Why 34 and 37?**
- 34 frames in latent space = 33 generated frames + 1 initial frame
- 37 for RoPE = 34 + 3 extra for positional encoding alignment
### 2. Video-to-Video Continuation
```python
# Continuing from existing video
else:
    target_length = 66  # in latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(69, height, width)
```
**Why 66 and 69?**
- 66 frames = 2 × 33 frames (using previous segment as context)
- 69 for RoPE = 66 + 3 extra for positional encoding alignment
### 3. Camera Network Compression
The CameraNet has special handling for these frame counts:
```python
def compress_time(self, x, num_frames):
    if x.shape[-1] == 66 or x.shape[-1] == 34:
        # Split the temporal axis into two segments
        x_len = x.shape[-1]
        # First segment: keep first frame, pool the rest
        x_clip1 = x[..., :x_len // 2]
        # Second segment: keep first frame, pool the rest
        x_clip2 = x[..., x_len // 2:x_len]
```
This compression strategy:
1. **Preserves key frames**: First frame of each segment
2. **Pools temporal information**: Averages remaining frames
3. **Maintains continuity**: Ensures smooth transitions
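As an illustration of this strategy (not the actual CameraNet implementation, which also handles projections and other details), a self-contained sketch could keep the first frame of each half and average-pool the remaining frames in pairs; the pair-wise kernel size is an assumption here:

```python
import torch
import torch.nn.functional as F

def _compress_half(clip: torch.Tensor) -> torch.Tensor:
    """Keep the first frame, average-pool the remaining frames in pairs."""
    first, rest = clip[..., :1], clip[..., 1:]
    pooled = F.avg_pool1d(rest, kernel_size=2, stride=2)   # expects (N, C, T)
    return torch.cat([first, pooled], dim=-1)

def compress_time_sketch(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, channels, T) with T == 34 or 66, matching the two generation modes."""
    if x.shape[-1] in (34, 66):
        half = x.shape[-1] // 2
        return torch.cat([_compress_half(x[..., :half]),
                          _compress_half(x[..., half:])], dim=-1)
    return x

x = torch.randn(1, 16, 66)
print(compress_time_sketch(x).shape)   # torch.Size([1, 16, 34])
```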
## Case Study: Using 17 Frames Instead of 33
While the model is trained on 33-frame chunks, we can theoretically adapt it to use 17 frames, which is roughly half the duration and maintains VAE compatibility:
### 1. Why 17 Frames Works with VAE
17 frames is actually compatible with both VAE architectures:
- **884 VAE** (4:1 temporal compression):
- Formula: (17-1)/4 + 1 = 5 latent frames ✓
- Clean division ensures proper encoding/decoding
- **888 VAE** (8:1 temporal compression):
- Formula: (17-1)/8 + 1 = 3 latent frames ✓
- Also divides cleanly for proper compression
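Both checks are plain integer arithmetic and can be reproduced directly:

```python
frames = 17
assert (frames - 1) % 4 == 0 and (frames - 1) // 4 + 1 == 5   # 884 VAE: 5 latent frames
assert (frames - 1) % 8 == 0 and (frames - 1) // 8 + 1 == 3   # 888 VAE: 3 latent frames
```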
### 2. Required Code Modifications
To implement 17-frame generation, you would need to update:
#### a. Core Frame Configuration
- **app.py**: Change `args.sample_n_frames = 17`
- **ActionToPoseFromID**: Update `duration=17` parameter
- **sample_inference.py**: Adjust target_length calculations:
```python
if is_image:
    target_length = 18  # 17 generated + 1 initial
else:
    target_length = 34  # 2 × 17 for video continuation
```
#### b. RoPE Embeddings
- For image-to-video: Use 21 instead of 37 (18 + 3 for alignment)
- For video-to-video: Use 37 instead of 69 (34 + 3 for alignment)
#### c. CameraNet Compression
Update the frame count checks in `cameranet.py`:
```python
if x.shape[-1] == 34 or x.shape[-1] == 18:  # support both 33- and 17-frame modes
    # Adjust compression logic for shorter sequences
    ...
```
### 3. Trade-offs and Considerations
**Advantages of 17 frames:**
- **Reduced memory usage**: ~48% less VRAM required
- **Faster generation**: Shorter sequences process quicker
- **More responsive**: Actions complete in 0.68 seconds vs 1.32 seconds
**Disadvantages:**
- **Quality degradation**: Model wasn't trained on 17-frame chunks
- **Choppy motion**: Less temporal information for smooth transitions
- **Action granularity**: Shorter actions may feel abrupt
- **Potential artifacts**: VAE and attention patterns optimized for 33 frames
### 4. Why Other Frame Counts Are Problematic
Not all frame counts work with the VAE constraints:
- **18 frames**: ✗ (18-1)/4 = 4.25 (not an integer for the 884 VAE)
- **19 frames**: ✗ (19-1)/4 = 4.5 (not an integer)
- **20 frames**: ✗ (20-1)/4 = 4.75 (not an integer)
- **21 frames**: ✓ Works with the 884 VAE (6 latent frames)
- **25 frames**: ✓ Works with both VAEs (7 and 4 latent frames)
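A quick way to screen candidate frame counts against the plain `4n+1` / `8n+1` rule (leaving aside the special 4n+2 handling used internally for 34 and 66) is a one-line check; the helper name below is illustrative:

```python
def vae_compatible(frames: int, temporal_ratio: int = 4) -> bool:
    """True if `frames` matches the (temporal_ratio * n + 1) pattern."""
    return frames >= 1 and (frames - 1) % temporal_ratio == 0

print([n for n in range(17, 42) if vae_compatible(n, 4)])  # [17, 21, 25, 29, 33, 37, 41]
print([n for n in range(17, 42) if vae_compatible(n, 8)])  # [17, 25, 33, 41]
```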
### 5. Implementation Note
While technically possible, using 17 frames would require:
1. **Extensive testing**: Verify quality and temporal consistency
2. **Possible fine-tuning**: The model may need adaptation for optimal results
3. **Adjustment of action speeds**: Camera movements calibrated for 33 frames
4. **Modified training strategy**: If fine-tuning, adjust hybrid history ratios
## Recommendations for Frame Count Modification
If you must change frame counts, consider:
1. **Use VAE-compatible numbers**:
- For 884 VAE: 17, 21, 25, 29, 33, 37... (4n+1 pattern)
- For 888 VAE: 17, 25, 33, 41... (8n+1 pattern)
2. **Modify all dependent locations**:
- `sample_inference.py`: Update target_length logic
- `cameranet.py`: Update compress_time conditions
- `ActionToPoseFromID`: Change duration parameter
- App configuration: Update sample_n_frames
3. **Consider retraining or fine-tuning**:
- The model may need adaptation for different sequence lengths
- Quality might be suboptimal without retraining
4. **Test thoroughly**:
- Different frame counts may expose edge cases
- Ensure VAE encoding/decoding works correctly
- Verify temporal consistency in generated videos
## Technical Details
### Latent Space Calculation Examples
For **884 VAE** (4:1 temporal compression):
```
Input: 33 frames → (33-1)/4 + 1 = 9 latent frames
Input: 34 frames → (34-2)/4 + 2 = 10 latent frames (special case)
Input: 66 frames → (66-2)/4 + 2 = 18 latent frames
```
For **888 VAE** (8:1 temporal compression):
```
Input: 33 frames → (33-1)/8 + 1 = 5 latent frames
Input: 65 frames → (65-1)/8 + 1 = 9 latent frames
```
### Memory Implications
Fewer frames = less memory usage:
- 33 frames at 704×1216: ~85MB per frame in FP16
- 18 frames would use roughly 45% less memory
- But VAE constraints limit viable options
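As a rough back-of-the-envelope check (taking the ~85 MB-per-frame figure above as a given approximation; real usage also depends on activations, attention maps, and batch size):

```python
mb_per_frame = 85   # approximate FP16 figure quoted above, not re-measured here
for frames in (33, 18, 17):
    saving = 1 - frames / 33
    print(f"{frames:2d} frames ≈ {frames * mb_per_frame:5d} MB ({saving:.0%} less than 33 frames)")
```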
## Paper-Code Consistency Analysis
The documentation is consistent with both the paper and the codebase:
### From the Paper:
- "The system operates at 25 fps, with each video chunk comprising 33-frame clips at 720p resolution"
- Uses "chunk latent denoising process" for autoregressive generation
- Implements "hybrid history-conditioned training strategy"
- Mentions causal VAE's "uneven encoding of initial versus subsequent frames"
### From the Code:
- `sample_n_frames = 33` throughout the codebase
- VAE compression formulas match the 884/888 patterns
- Hardcoded frame values (34, 37, 66, 69) align with the chunk-based architecture
- CameraNet's special handling for 34/66 frames confirms the two-mode generation
## Conclusion
The frame counts in Hunyuan-GameCraft are fundamental to its architecture:
1. **33 frames** is the atomic unit, trained into the model and fixed by the dataset construction
2. **34/37 and 66/69** emerge from the interaction between:
- The 33-frame chunk size
- Causal VAE requirements
- MM-DiT transformer's RoPE needs
- Hybrid history conditioning strategy
3. The **884 VAE** (4:1 temporal compression) is the default, requiring frames in patterns of 4n+1 or 4n+2
4. Changing to different frame counts (like 18) would require:
- Retraining the entire model
- Reconstructing the dataset
- Modifying the VAE architecture
- Updating all hardcoded dependencies
The system's design reflects careful engineering trade-offs between generation quality, temporal consistency, and computational efficiency, as validated by the paper's experimental results showing superior performance compared to alternatives. |