How Frames Work in Hunyuan-GameCraft

Overview

The Hunyuan-GameCraft system generates high-dynamic interactive game videos using diffusion models and a hybrid history-conditioned training strategy. The frame handling is complex due to several factors:

  1. Causal VAE compression (spatial and temporal with different ratios)
  2. Hybrid history conditioning (using past frames/clips as context for autoregressive generation)
  3. Different generation modes (image-to-video vs. video-to-video continuation)
  4. Rotary position embeddings (RoPE) requirements for the MM-DiT backbone

Paper Context

According to the paper, Hunyuan-GameCraft:

  • Operates at 25 FPS with each video chunk comprising 33-frame clips at 720p resolution
  • Uses a causal VAE for encoding/decoding that has uneven encoding of initial vs. subsequent frames
  • Implements chunk-wise autoregressive extension where each chunk corresponds to one action
  • Employs hybrid history conditioning with ratios: 70% single historical clip, 5% multiple clips, 25% single frame

Key Frame Numbers Explained

The Magic Numbers: 33, 34, 37, 66, 69

These numbers are fundamental to the architecture and not arbitrary:

  • 33 frames: The base video chunk size - each action generates exactly 33 frames (1.32 seconds at 25 FPS)
  • 34 frames: Used for image-to-video generation in latent space (33 + 1 initial frame)
  • 37 frames: Used for rotary position embeddings when starting from an image
  • 66 frames: Used for video-to-video continuation in latent space (2 × 33 frame chunks)
  • 69 frames: Used for rotary position embeddings when continuing from video

Why These Specific Numbers?

The paper mentions "chunk latent denoising" where each chunk is a 33-frame segment. The specific numbers arise from:

  1. Base Chunk Size: 33 frames per action (fixed by training)
  2. Initial Frame Handling: +1 frame for the reference image in image-to-video mode
  3. RoPE Alignment: +3 frames for proper positional encoding alignment in the transformer
  4. History Conditioning: Doubling for video continuation (using previous chunk as context)
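
Putting these together, the magic numbers follow from simple arithmetic. The constant names below are illustrative only, not identifiers from the codebase:

CHUNK_FRAMES = 33                   # frames generated per action
ROPE_PAD = 3                        # extra frames for RoPE alignment

I2V_TARGET = CHUNK_FRAMES + 1       # 34: one chunk plus the initial reference frame
I2V_ROPE = I2V_TARGET + ROPE_PAD    # 37

V2V_TARGET = 2 * CHUNK_FRAMES       # 66: previous chunk as history plus the new chunk
V2V_ROPE = V2V_TARGET + ROPE_PAD    # 69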

VAE Compression Explained

VAE Types and the "4n+1" / "8n+1" Formula

The project uses different VAE (Variational Autoencoder) models identified by codes like "884" or "888":

VAE Naming Convention: "XYZ-16c-hy0801"

  • First digit (X): Temporal compression ratio
  • Second digit (Y): Spatial compression ratio (height)
  • Third digit (Z): Spatial compression ratio (width)
  • 16c: 16 latent channels
  • hy0801: Version identifier

"884" VAE (Default in Code)

  • Temporal compression: 4:1 (every 4 frames → 1 latent frame)
  • Spatial compression: 8:1 for both height and width
  • Frame formula:
    • Standard: latent_frames = (video_frames - 1) // 4 + 1
    • Special handling: latent_frames = (video_frames - 2) // 4 + 2 (for certain cases)
  • Why "4n+1"?: The causal VAE requires frames in multiples of 4 plus 1 for proper temporal compression
    • Example: 33 frames → (33-1)/4 + 1 = 9 latent frames
    • Example: 34 frames → (34-2)/4 + 2 = 10 latent frames (special case in pipeline)
    • Example: 66 frames → (66-2)/4 + 2 = 18 latent frames

"888" VAE (Alternative)

  • Temporal compression: 8:1 (every 8 frames → 1 latent frame)
  • Spatial compression: 8:1 for both height and width
  • Frame formula: latent_frames = (video_frames - 1) // 8 + 1
  • Why "8n+1"?: Similar principle but with 8:1 temporal compression
    • Example: 33 frames → (33-1)/8 + 1 = 5 latent frames
    • Example: 65 frames → (65-1)/8 + 1 = 9 latent frames

No Compression VAE

  • When VAE code doesn't match the pattern, no temporal compression is applied
  • latent_frames = video_frames
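
The three cases above can be folded into one small helper. This is a minimal sketch, assuming the VAE is identified by the leading digits of its name string; the function name is illustrative and not taken from the codebase:

def latent_frames(video_frames, vae="884-16c-hy0801", special=False):
    """Map a pixel-space frame count to a latent frame count."""
    if vae.startswith("884"):
        if special:
            # Special handling used by the pipeline for the 34/66-frame inputs
            return (video_frames - 2) // 4 + 2
        return (video_frames - 1) // 4 + 1      # standard 4n+1 case
    if vae.startswith("888"):
        return (video_frames - 1) // 8 + 1      # 8n+1 case
    return video_frames                         # no temporal compression

print(latent_frames(33))                        # 9
print(latent_frames(34, special=True))          # 10 (special case)
print(latent_frames(33, vae="888-16c-hy0801"))  # 5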

Why Different Formulas?

The formulas handle the causal nature of the VAE as mentioned in the paper:

  1. Causal VAE Characteristics: The paper states that causal VAEs have "uneven encoding of initial versus subsequent frames"
  2. First Frame Special Treatment: The initial frame requires different handling than subsequent frames
  3. Temporal Consistency: The causal attention ensures each frame only attends to previous frames, maintaining temporal coherence
  4. Chunk Boundaries: The formulas ensure proper alignment with the 33-frame chunk size used in training

Frame Processing Pipeline

1. Image-to-Video Generation (First Segment)

# Starting from a single image
if is_image:
    target_length = 34  # In latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(37, height, width)

Why 34 and 37?

  • 34 frames in latent space = 33 generated frames + 1 initial frame
  • 37 for RoPE = 34 + 3 extra for positional encoding alignment

2. Video-to-Video Continuation

# Continuing from existing video
else:
    target_length = 66  # In latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(69, height, width)

Why 66 and 69?

  • 66 frames = 2 × 33 frames (using previous segment as context)
  • 69 for RoPE = 66 + 3 extra for positional encoding alignment

3. Camera Network Compression

The CameraNet has special handling for these frame counts:

def compress_time(self, x, num_frames):
    if x.shape[-1] == 66 or x.shape[-1] == 34:
        # Split into two segments
        x_len = x.shape[-1]
        # First segment: keep first frame, pool the rest
        x_clip1 = x[...,:x_len//2]
        # Second segment: keep first frame, pool the rest
        x_clip2 = x[...,x_len//2:x_len]

This compression strategy:

  1. Preserves key frames: First frame of each segment
  2. Pools temporal information: Averages remaining frames
  3. Maintains continuity: Ensures smooth transitions
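
A minimal PyTorch sketch of this strategy (keep the first frame of each half, mean-pool the remainder); it illustrates the idea rather than reproducing the exact cameranet.py implementation:

import torch

def compress_half(x_half):
    """x_half: [..., T] - keep frame 0 and mean-pool the remaining frames."""
    first = x_half[..., :1]                              # preserved key frame
    pooled = x_half[..., 1:].mean(dim=-1, keepdim=True)  # temporal average of the rest
    return torch.cat([first, pooled], dim=-1)

def compress_time_sketch(x):
    # Two-segment handling for the 34 (image-to-video) and 66 (continuation) cases
    if x.shape[-1] in (34, 66):
        half = x.shape[-1] // 2
        return torch.cat([compress_half(x[..., :half]),
                          compress_half(x[..., half:])], dim=-1)
    return x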

Case Study: Using 17 Frames Instead of 33

While the model is trained on 33-frame chunks, we can theoretically adapt it to use 17 frames, which is roughly half the duration and maintains VAE compatibility:

1. Why 17 Frames Works with VAE

17 frames is actually compatible with both VAE architectures:

  • 884 VAE (4:1 temporal compression):

    • Formula: (17-1)/4 + 1 = 5 latent frames ✓
    • Clean division ensures proper encoding/decoding
  • 888 VAE (8:1 temporal compression):

    • Formula: (17-1)/8 + 1 = 3 latent frames ✓
    • Also divides cleanly for proper compression
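
A standalone arithmetic check of these two cases:

# Quick check of 17-frame compatibility with both VAE formulas
assert (17 - 1) // 4 + 1 == 5   # 884 VAE: 5 latent frames
assert (17 - 1) // 8 + 1 == 3   # 888 VAE: 3 latent frames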

2. Required Code Modifications

To implement 17-frame generation, you would need to update:

a. Core Frame Configuration

  • app.py: Change args.sample_n_frames = 17
  • ActionToPoseFromID: Update duration=17 parameter
  • sample_inference.py: Adjust target_length calculations:
    if is_image:
        target_length = 18  # 17 generated + 1 initial
    else:
        target_length = 34  # 2 × 17 for video continuation
    

b. RoPE Embeddings

  • For image-to-video: Use 21 instead of 37 (18 + 3 for alignment)
  • For video-to-video: Use 37 instead of 69 (34 + 3 for alignment)

c. CameraNet Compression

Update the frame count checks in cameranet.py:

if x.shape[-1] in (66, 34, 18):  # cover the 33-frame (34/66) and 17-frame (18/34) modes
    # Adjust the split/pooling logic for the shorter sequences

3. Trade-offs and Considerations

Advantages of 17 frames:

  • Reduced memory usage: ~48% less VRAM required
  • Faster generation: Shorter sequences process quicker
  • More responsive: Actions complete in 0.68 seconds vs 1.32 seconds

Disadvantages:

  • Quality degradation: Model wasn't trained on 17-frame chunks
  • Choppy motion: Less temporal information for smooth transitions
  • Action granularity: Shorter actions may feel abrupt
  • Potential artifacts: VAE and attention patterns optimized for 33 frames

4. Why Other Frame Counts Are Problematic

Not all frame counts work with the VAE constraints:

  • 18 frames: ❌ (18-1)/4 = 4.25 (not integer for 884 VAE)
  • 19 frames: ❌ (19-1)/4 = 4.5 (not integer)
  • 20 frames: ❌ (20-1)/4 = 4.75 (not integer)
  • 21 frames: ✓ Works with 884 VAE (6 latent frames)
  • 25 frames: ✓ Works with both VAEs (7 and 4 latent frames)
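
This constraint can be expressed as a small predicate. A minimal sketch with illustrative names, not taken from the codebase:

def vae_compatible(frames, temporal_ratio=4):
    """True if `frames` fits the causal-VAE pattern (temporal_ratio * n + 1)."""
    return frames >= 1 and (frames - 1) % temporal_ratio == 0

print([n for n in range(17, 26) if vae_compatible(n)])      # [17, 21, 25] for the 884 VAE
print([n for n in range(17, 42) if vae_compatible(n, 8)])   # [17, 25, 33, 41] for the 888 VAE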

5. Implementation Note

While technically possible, using 17 frames would require:

  1. Extensive testing: Verify quality and temporal consistency
  2. Possible fine-tuning: The model may need adaptation for optimal results
  3. Adjustment of action speeds: Camera movements calibrated for 33 frames
  4. Modified training strategy: If fine-tuning, adjust hybrid history ratios

Recommendations for Frame Count Modification

If you must change frame counts, consider:

  1. Use VAE-compatible numbers:

    • For 884 VAE: 17, 21, 25, 29, 33, 37... (4n+1 pattern)
    • For 888 VAE: 17, 25, 33, 41... (8n+1 pattern)
  2. Modify all dependent locations:

    • sample_inference.py: Update target_length logic
    • cameranet.py: Update compress_time conditions
    • ActionToPoseFromID: Change duration parameter
    • App configuration: Update sample_n_frames
  3. Consider retraining or fine-tuning:

    • The model may need adaptation for different sequence lengths
    • Quality might be suboptimal without retraining
  4. Test thoroughly:

    • Different frame counts may expose edge cases
    • Ensure VAE encoding/decoding works correctly
    • Verify temporal consistency in generated videos

Technical Details

Latent Space Calculation Examples

For 884 VAE (4:1 temporal compression):

Input: 33 frames → (33-1)/4 + 1 = 9 latent frames
Input: 34 frames → (34-2)/4 + 2 = 10 latent frames (special case)
Input: 66 frames → (66-2)/4 + 2 = 18 latent frames

For 888 VAE (8:1 temporal compression):

Input: 33 frames → (33-1)/8 + 1 = 5 latent frames
Input: 65 frames → (65-1)/8 + 1 = 9 latent frames

Memory Implications

Fewer frames = less memory usage:

  • 33 frames at 704×1216: ~85MB per frame in FP16
  • 18 frames would use ~46% less memory
  • But VAE constraints limit viable options
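
Assuming per-chunk activation memory scales roughly linearly with the number of frames (a simplification), the relative savings for VAE-compatible counts can be estimated as follows:

# Rough relative memory savings vs. the 33-frame baseline (linear-scaling assumption)
BASELINE = 33
for frames in (17, 21, 25):
    saving = 1 - frames / BASELINE
    print(f"{frames} frames -> ~{saving:.0%} less per-chunk memory")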

Paper-Code Consistency Analysis

The documentation is consistent with both the paper and the codebase:

From the Paper:

  • "The system operates at 25 fps, with each video chunk comprising 33-frame clips at 720p resolution"
  • Uses "chunk latent denoising process" for autoregressive generation
  • Implements "hybrid history-conditioned training strategy"
  • Mentions causal VAE's "uneven encoding of initial versus subsequent frames"

From the Code:

  • sample_n_frames = 33 throughout the codebase
  • VAE compression formulas match the 884/888 patterns
  • Hardcoded frame values (34, 37, 66, 69) align with the chunk-based architecture
  • CameraNet's special handling for 34/66 frames confirms the two-mode generation

Conclusion

The frame counts in Hunyuan-GameCraft are fundamental to its architecture:

  1. 33 frames is the atomic unit, trained into the model and fixed by the dataset construction

  2. 34/37 and 66/69 emerge from the interaction between:

    • The 33-frame chunk size
    • Causal VAE requirements
    • MM-DiT transformer's RoPE needs
    • Hybrid history conditioning strategy
  3. The 884 VAE (4:1 temporal compression) is the default, requiring frames in patterns of 4n+1 or 4n+2

  4. Changing to different frame counts (like 18) would require:

    • Retraining the entire model
    • Reconstructing the dataset
    • Modifying the VAE architecture
    • Updating all hardcoded dependencies

The system's design reflects careful engineering trade-offs between generation quality, temporal consistency, and computational efficiency, as validated by the paper's experimental results showing superior performance compared to alternatives.