# How Frames Work in Hunyuan-GameCraft

## Overview

The Hunyuan-GameCraft system generates high-dynamic interactive game videos using diffusion models and a hybrid history-conditioned training strategy. Frame handling is complex due to several factors:

1. **Causal VAE compression** (spatial and temporal, with different ratios)
2. **Hybrid history conditioning** (using past frames/clips as context for autoregressive generation)
3. **Different generation modes** (image-to-video vs. video-to-video continuation)
4. **Rotary position embedding (RoPE)** requirements for the MM-DiT backbone

## Paper Context

According to the paper, Hunyuan-GameCraft:

- Operates at **25 FPS**, with each video chunk comprising **33-frame clips** at 720p resolution
- Uses a **causal VAE** for encoding/decoding that encodes initial and subsequent frames unevenly
- Implements **chunk-wise autoregressive extension**, where each chunk corresponds to one action
- Employs **hybrid history conditioning** with ratios: 70% single historical clip, 5% multiple clips, 25% single frame

## Key Frame Numbers Explained

### The Magic Numbers: 33, 34, 37, 66, 69

These numbers are fundamental to the architecture and not arbitrary:

- **33 frames**: The base video chunk size; each action generates exactly 33 frames (1.32 seconds at 25 FPS)
- **34 frames**: Used for image-to-video generation in latent space (33 + 1 initial frame)
- **37 frames**: Used for rotary position embeddings when starting from an image
- **66 frames**: Used for video-to-video continuation in latent space (2 × 33-frame chunks)
- **69 frames**: Used for rotary position embeddings when continuing from video

### Why These Specific Numbers?

The paper describes a "chunk latent denoising" process in which each chunk is a 33-frame segment. The specific numbers arise from (see the sketch after this list):

1. **Base Chunk Size**: 33 frames per action (fixed by training)
2. **Initial Frame Handling**: +1 frame for the reference image in image-to-video mode
3. **RoPE Alignment**: +3 frames for proper positional encoding alignment in the transformer
4. **History Conditioning**: Doubling for video continuation (the previous chunk serves as context)
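The arithmetic can be stated directly. The following sketch is illustrative only (the helper and constants are not taken from the codebase); it simply restates the 33 + 1 → 34 → 37 and 2 × 33 → 66 → 69 relationships described above.

```python
# Illustrative only: this helper is not part of the Hunyuan-GameCraft codebase.
# It restates the frame arithmetic described above.

CHUNK_FRAMES = 33  # base chunk size, one action per chunk (fixed by training)
ROPE_PAD = 3       # extra frames added for RoPE alignment in the MM-DiT backbone


def target_and_rope_length(is_image: bool) -> tuple[int, int]:
    """Return (target_length, rope_length) for the two generation modes."""
    if is_image:
        # Image-to-video: one reference frame plus one 33-frame chunk
        target_length = CHUNK_FRAMES + 1   # 34
    else:
        # Video-to-video: the previous chunk (history) plus a new chunk
        target_length = 2 * CHUNK_FRAMES   # 66
    return target_length, target_length + ROPE_PAD  # 37 or 69 for RoPE


assert target_and_rope_length(is_image=True) == (34, 37)
assert target_and_rope_length(is_image=False) == (66, 69)
```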
## VAE Compression Explained

### VAE Types and the "4n+1" / "8n+1" Formula

The project uses different VAE (Variational Autoencoder) models identified by codes such as "884" or "888".

#### VAE Naming Convention: "XYZ-16c-hy0801"

- **First digit (X)**: Spatial compression ratio (height)
- **Second digit (Y)**: Spatial compression ratio (width)
- **Third digit (Z)**: Temporal compression ratio
- **16c**: 16 latent channels
- **hy0801**: Version identifier

Read this way, "884" means 8× spatial compression in both dimensions and 4× temporal compression, which matches the formulas below.

#### "884" VAE (Default in Code)

- **Temporal compression**: 4:1 (every 4 frames → 1 latent frame)
- **Spatial compression**: 8:1 for both height and width
- **Frame formula**:
  - Standard: `latent_frames = (video_frames - 1) // 4 + 1`
  - Special handling: `latent_frames = (video_frames - 2) // 4 + 2` (used when a separately encoded first frame or history clip is prepended, as with the 34- and 66-frame targets)
- **Why "4n+1"?**: The causal VAE requires frames in multiples of 4 plus 1 for proper temporal compression
- Example: 33 frames → (33-1)/4 + 1 = 9 latent frames
- Example: 34 frames → (34-2)/4 + 2 = 10 latent frames (special case in the pipeline)
- Example: 66 frames → (66-2)/4 + 2 = 18 latent frames (special case in the pipeline)

#### "888" VAE (Alternative)

- **Temporal compression**: 8:1 (every 8 frames → 1 latent frame)
- **Spatial compression**: 8:1 for both height and width
- **Frame formula**: `latent_frames = (video_frames - 1) // 8 + 1`
- **Why "8n+1"?**: The same principle, but with 8:1 temporal compression
- Example: 33 frames → (33-1)/8 + 1 = 5 latent frames
- Example: 65 frames → (65-1)/8 + 1 = 9 latent frames

#### No Compression VAE

- When the VAE code doesn't match either pattern, no temporal compression is applied
- `latent_frames = video_frames`

### Why Different Formulas?

The formulas handle the causal nature of the VAE, as mentioned in the paper:

1. **Causal VAE Characteristics**: The paper states that causal VAEs have "uneven encoding of initial versus subsequent frames"
2. **First Frame Special Treatment**: The initial frame requires different handling than subsequent frames
3. **Temporal Consistency**: Causal attention ensures each frame only attends to previous frames, maintaining temporal coherence
4. **Chunk Boundaries**: The formulas ensure proper alignment with the 33-frame chunk size used in training

## Frame Processing Pipeline

### 1. Image-to-Video Generation (First Segment)

```python
# Starting from a single image
if is_image:
    target_length = 34  # In latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(37, height, width)
```

**Why 34 and 37?**

- 34 frames in latent space = 33 generated frames + 1 initial frame
- 37 for RoPE = 34 + 3 extra frames for positional encoding alignment

### 2. Video-to-Video Continuation

```python
# Continuing from existing video
else:
    target_length = 66  # In latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(69, height, width)
```

**Why 66 and 69?**

- 66 frames = 2 × 33 frames (using the previous segment as context)
- 69 for RoPE = 66 + 3 extra frames for positional encoding alignment

### 3. Camera Network Compression

The CameraNet has special handling for these frame counts:

```python
def compress_time(self, x, num_frames):
    if x.shape[-1] == 66 or x.shape[-1] == 34:
        # Split into two segments
        x_len = x.shape[-1]
        # First segment: keep first frame, pool the rest
        x_clip1 = x[..., :x_len // 2]
        # Second segment: keep first frame, pool the rest
        x_clip2 = x[..., x_len // 2:x_len]
```

This compression strategy:

1. **Preserves key frames**: First frame of each segment
2. **Pools temporal information**: Averages the remaining frames
3. **Maintains continuity**: Ensures smooth transitions
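To tie the VAE formulas and the target lengths together, here is a hypothetical helper (not taken from the codebase) that restates the documented formulas. The assertions reproduce the examples above and also show why the +3 RoPE padding lines up: 37 and 69 map to the same latent lengths (10 and 18) as the 34- and 66-frame targets under the special-case formula.

```python
# Hypothetical helper, not taken from the Hunyuan-GameCraft codebase:
# it restates the latent-length formulas documented above.

def latent_frames(video_frames: int, vae: str = "884", special: bool = False) -> int:
    """Latent frames produced for a pixel-space frame count under a given VAE."""
    if "884" in vae:
        if special:
            # Special handling: a separately encoded first frame / history clip is prepended
            return (video_frames - 2) // 4 + 2
        return (video_frames - 1) // 4 + 1
    if "888" in vae:
        return (video_frames - 1) // 8 + 1
    return video_frames  # no temporal compression for unrecognized codes


assert latent_frames(33) == 9                                # one 33-frame chunk
assert latent_frames(34, special=True) == 10                 # image-to-video target
assert latent_frames(66, special=True) == 18                 # video-to-video target
assert latent_frames(37) == 10 and latent_frames(69) == 18   # RoPE lengths match the targets
assert latent_frames(33, vae="888") == 5
```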
## Case Study: Using 17 Frames Instead of 33

While the model is trained on 33-frame chunks, it can in principle be adapted to 17-frame chunks, which is roughly half the duration and maintains VAE compatibility.

### 1. Why 17 Frames Works with the VAE

17 frames is compatible with both VAE architectures:

- **884 VAE** (4:1 temporal compression):
  - Formula: (17-1)/4 + 1 = 5 latent frames ✓
  - Clean division ensures proper encoding/decoding
- **888 VAE** (8:1 temporal compression):
  - Formula: (17-1)/8 + 1 = 3 latent frames ✓
  - Also divides cleanly for proper compression

### 2. Required Code Modifications

To implement 17-frame generation, you would need to update:

#### a. Core Frame Configuration

- **app.py**: Change `args.sample_n_frames = 17`
- **ActionToPoseFromID**: Update the `duration=17` parameter
- **sample_inference.py**: Adjust the target_length calculations:

```python
if is_image:
    target_length = 18  # 17 generated + 1 initial
else:
    target_length = 34  # 2 × 17 for video continuation
```

#### b. RoPE Embeddings

- For image-to-video: use 21 instead of 37 (18 + 3 for alignment)
- For video-to-video: use 37 instead of 69 (34 + 3 for alignment)

#### c. CameraNet Compression

Update the frame-count checks in `cameranet.py`:

```python
if x.shape[-1] == 34 or x.shape[-1] == 18:
    # Support both 33- and 17-frame modes;
    # adjust the compression logic for shorter sequences
    ...
```

### 3. Trade-offs and Considerations

**Advantages of 17 frames:**

- **Reduced memory usage**: ~48% less VRAM required
- **Faster generation**: Shorter sequences process more quickly
- **More responsive**: Actions complete in 0.68 seconds instead of 1.32 seconds

**Disadvantages:**

- **Quality degradation**: The model wasn't trained on 17-frame chunks
- **Choppy motion**: Less temporal information for smooth transitions
- **Action granularity**: Shorter actions may feel abrupt
- **Potential artifacts**: The VAE and attention patterns are optimized for 33 frames

### 4. Why Other Frame Counts Are Problematic

Not all frame counts work with the VAE constraints:

- **18 frames**: ❌ (18-1)/4 = 4.25 (not an integer for the 884 VAE)
- **19 frames**: ❌ (19-1)/4 = 4.5 (not an integer)
- **20 frames**: ❌ (20-1)/4 = 4.75 (not an integer)
- **21 frames**: ✓ Works with the 884 VAE (6 latent frames)
- **25 frames**: ✓ Works with both VAEs (7 and 4 latent frames)

### 5. Implementation Note

While technically possible, using 17 frames would require:

1. **Extensive testing**: Verify quality and temporal consistency
2. **Possible fine-tuning**: The model may need adaptation for optimal results
3. **Adjustment of action speeds**: Camera movements are calibrated for 33 frames
4. **A modified training strategy**: If fine-tuning, adjust the hybrid history ratios

## Recommendations for Frame Count Modification

If you must change frame counts, consider the following (a small compatibility check follows this list):

1. **Use VAE-compatible numbers**:
   - For the 884 VAE: 17, 21, 25, 29, 33, 37... (4n+1 pattern)
   - For the 888 VAE: 17, 25, 33, 41... (8n+1 pattern)
2. **Modify all dependent locations**:
   - `sample_inference.py`: Update the target_length logic
   - `cameranet.py`: Update the compress_time conditions
   - `ActionToPoseFromID`: Change the duration parameter
   - App configuration: Update sample_n_frames
3. **Consider retraining or fine-tuning**:
   - The model may need adaptation for different sequence lengths
   - Quality might be suboptimal without retraining
4. **Test thoroughly**:
   - Different frame counts may expose edge cases
   - Ensure VAE encoding/decoding works correctly
   - Verify temporal consistency in generated videos
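The "VAE-compatible numbers" above follow directly from the 4n+1 / 8n+1 constraints. The snippet below is a standalone illustration (not part of the codebase) that enumerates the compatible chunk sizes listed in the recommendations.

```python
# Illustrative check, not from the codebase: which chunk sizes satisfy the
# 4n+1 (884 VAE) and 8n+1 (888 VAE) constraints discussed above.

def is_compatible(frames: int, temporal_ratio: int) -> bool:
    """True if `frames` encodes to an integer latent length, i.e. frames = ratio*n + 1."""
    return (frames - 1) % temporal_ratio == 0


compatible_884 = [f for f in range(17, 38) if is_compatible(f, 4)]
compatible_888 = [f for f in range(17, 42) if is_compatible(f, 8)]

print(compatible_884)  # [17, 21, 25, 29, 33, 37]
print(compatible_888)  # [17, 25, 33, 41]
```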
## Technical Details

### Latent Space Calculation Examples

For the **884 VAE** (4:1 temporal compression):

```
Input: 33 frames → (33-1)/4 + 1 = 9 latent frames
Input: 34 frames → (34-2)/4 + 2 = 10 latent frames (special case)
Input: 66 frames → (66-2)/4 + 2 = 18 latent frames (special case)
```

For the **888 VAE** (8:1 temporal compression):

```
Input: 33 frames → (33-1)/8 + 1 = 5 latent frames
Input: 65 frames → (65-1)/8 + 1 = 9 latent frames
```

### Memory Implications

Fewer frames mean less memory usage:

- 33 frames at 704×1216: ~85 MB per frame in FP16
- An 18-frame chunk would use ~46% less memory, but VAE constraints rule it out; 17 frames (~48% less, as discussed above) is the nearest viable option

## Paper-Code Consistency Analysis

This documentation is consistent with both the paper and the codebase.

### From the Paper

- "The system operates at 25 fps, with each video chunk comprising 33-frame clips at 720p resolution"
- Uses a "chunk latent denoising process" for autoregressive generation
- Implements a "hybrid history-conditioned training strategy"
- Mentions the causal VAE's "uneven encoding of initial versus subsequent frames"

### From the Code

- `sample_n_frames = 33` appears throughout the codebase
- The VAE compression formulas match the 884/888 patterns
- The hardcoded frame values (34, 37, 66, 69) align with the chunk-based architecture
- CameraNet's special handling of 34/66 frames confirms the two-mode generation

## Conclusion

The frame counts in Hunyuan-GameCraft are fundamental to its architecture:

1. **33 frames** is the atomic unit, trained into the model and fixed by the dataset construction.
2. **34/37 and 66/69** emerge from the interaction between:
   - the 33-frame chunk size,
   - causal VAE requirements,
   - the MM-DiT transformer's RoPE needs, and
   - the hybrid history conditioning strategy.
3. The **884 VAE** (4:1 temporal compression) is the default, requiring frame counts in the 4n+1 or 4n+2 patterns.
4. Changing to a different frame count (like 18) would require:
   - retraining the entire model,
   - reconstructing the dataset,
   - modifying the VAE architecture, and
   - updating all hardcoded dependencies.

The system's design reflects careful engineering trade-offs between generation quality, temporal consistency, and computational efficiency, as validated by the paper's experimental results showing superior performance compared to alternatives.