# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Hunyuan-GameCraft is a high-dynamic interactive game video generation system that creates gameplay videos with controllable camera movements and actions. The system uses diffusion models and action-controlled generation to synthesize realistic game footage from reference images and keyboard/mouse input controls.

## Key Commands

### Installation
```bash
# Create and activate conda environment
conda create -n HYGameCraft python==3.10
conda activate HYGameCraft

# Install PyTorch and dependencies
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install requirements
python -m pip install -r requirements.txt

# Install flash attention (optional, for acceleration)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention@v2.6.3
```
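
Before downloading weights, a quick sanity check (a minimal sketch; your version string may differ) confirms that PyTorch imports and can see the GPU:

```bash
# Verify PyTorch installed correctly and CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Expected output resembles: 2.5.1 True
```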

### Download Models
```bash
cd weights
huggingface-cli download tencent/Hunyuan-GameCraft-1.0 --local-dir ./
```
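
Run from the repository root, a short listing confirms the checkpoints landed where the inference commands below expect them:

```bash
# Both checkpoints should be present after the download completes
ls -lh weights/gamecraft_models/
# Expect mp_rank_00_model_states.pt and mp_rank_00_model_states_distill.pt
```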

### Run Inference

**Multi-GPU (8 GPUs) - Standard Model:**
```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --image-path "asset/village.png" \
    --prompt "YOUR_PROMPT" \
    --ckpt weights/gamecraft_models/mp_rank_00_model_states.pt \
    --video-size 704 1216 \
    --cfg-scale 2.0 \
    --image-start \
    --action-list w s d a \
    --action-speed-list 0.2 0.2 0.2 0.2 \
    --seed 250160 \
    --infer-steps 50 \
    --save-path './results/'
```

**Single GPU with Low VRAM (24GB minimum):**
```bash
export DISABLE_SP=1
export CPU_OFFLOAD=1
torchrun --nnodes=1 --nproc_per_node=1 --master_port 29605 hymm_sp/sample_batch.py \
    --ckpt weights/gamecraft_models/mp_rank_00_model_states.pt \
    --cpu-offload \
    --use-fp8 \
    [other parameters...]
```
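
Assembled into one copy-pasteable command, a full single-GPU invocation might look like the sketch below; the scene arguments (image, prompt, actions, seed) are simply reused from the multi-GPU example above and stand in for your own inputs:

```bash
# Single-GPU, low-VRAM run: sequence parallelism disabled, model offloaded
# to CPU, FP8 quantization enabled
export DISABLE_SP=1
export CPU_OFFLOAD=1
torchrun --nnodes=1 --nproc_per_node=1 --master_port 29605 hymm_sp/sample_batch.py \
    --image-path "asset/village.png" \
    --prompt "YOUR_PROMPT" \
    --ckpt weights/gamecraft_models/mp_rank_00_model_states.pt \
    --video-size 704 1216 \
    --cfg-scale 2.0 \
    --image-start \
    --action-list w s d a \
    --action-speed-list 0.2 0.2 0.2 0.2 \
    --seed 250160 \
    --infer-steps 50 \
    --save-path './results/' \
    --cpu-offload \
    --use-fp8
```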

**Distilled Model (faster, 8 inference steps):**
```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ckpt weights/gamecraft_models/mp_rank_00_model_states_distill.pt \
    --cfg-scale 1.0 \
    --infer-steps 8 \
    --use-fp8 \
    [other parameters...]
```
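
Likewise as a sketch, a complete distilled-model command, assuming the scene arguments and `--image-start` carry over unchanged from the standard example:

```bash
# Distilled model: lower guidance scale and far fewer denoising steps
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --image-path "asset/village.png" \
    --prompt "YOUR_PROMPT" \
    --ckpt weights/gamecraft_models/mp_rank_00_model_states_distill.pt \
    --video-size 704 1216 \
    --cfg-scale 1.0 \
    --image-start \
    --action-list w s d a \
    --action-speed-list 0.2 0.2 0.2 0.2 \
    --seed 250160 \
    --infer-steps 8 \
    --use-fp8 \
    --save-path './results/'
```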

## Architecture Overview

### Core Components

1. **Main Entry Points**
   - `hymm_sp/sample_batch.py`: Main script for batch video generation with distributed processing
   - `hymm_sp/sample_inference.py`: Core inference logic and model sampling
   - `hymm_sp/config.py`: Configuration parsing and argument handling

2. **Model Architecture (`hymm_sp/modules/`)**
   - `models.py`: Core diffusion model implementation
   - `cameranet.py`: Camera control and action encoding for game interactions
   - `token_refiner.py`: Text token refinement for prompt conditioning
   - `parallel_states.py`: Distributed training/inference state management
   - `fp8_optimization.py`: FP8 quantization for memory/speed optimization

3. **VAE Module (`hymm_sp/vae/`)**
   - `autoencoder_kl_causal_3d.py`: 3D causal VAE for video encoding/decoding
   - Handles latent space conversion for video frames

4. **Diffusion Pipeline (`hymm_sp/diffusion/`)**
   - `pipeline_hunyuan_video_game.py`: Custom pipeline for game video generation
   - `scheduling_flow_match_discrete.py`: Flow matching scheduler for denoising

5. **Data Processing (`hymm_sp/data_kits/`)**
   - `video_dataset.py`: Dataset handling for video inputs
   - `data_tools.py`: Video saving and processing utilities

### Key Features

- **Action Control**: Maps keyboard inputs (w/a/s/d) to continuous camera space for smooth transitions
- **Hybrid History Conditioning**: Extends video sequences autoregressively while preserving scene context
- **Model Distillation**: Accelerated inference model (8 steps vs 50 steps)
- **Memory Optimization**: FP8 quantization, CPU offloading, and SageAttention support
- **Distributed Processing**: Multi-GPU support with sequence parallelism

### Important Parameters

- `--action-list`: Sequence of keyboard actions (w/a/s/d)
- `--action-speed-list`: Movement speed for each action (0.0-3.0)
- `--video-size`: Output resolution (height width)
- `--cfg-scale`: Classifier-free guidance scale (1.0 for distilled, 2.0 for standard)
- `--infer-steps`: Denoising steps (8 for distilled, 50 for standard)
- `--use-fp8`: Enable FP8 optimization for memory reduction
- `--cpu-offload`: Offload model to CPU for low VRAM scenarios
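
The two action parameters are paired positionally: each entry in `--action-list` takes the speed at the same index in `--action-speed-list`, so both lists should have the same length (as in the examples above, where four actions are matched with four speeds). Since each action contributes a fixed-length chunk (33 frames at 25 FPS, per the development notes below), output duration can be estimated before a run; the one-liner here is illustrative, not part of the repository:

```bash
# Estimate output length: 33 frames per action at 25 FPS
# (figures from the development notes below)
ACTIONS=(w s d a)
python -c "import sys; n = int(sys.argv[1]); f = n * 33; print(f'{f} frames, ~{f/25:.1f} s')" "${#ACTIONS[@]}"
# -> 132 frames, ~5.3 s
```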

### Model Weights Structure
```
weights/
├── gamecraft_models/
│   ├── mp_rank_00_model_states.pt         # Standard model
│   └── mp_rank_00_model_states_distill.pt # Distilled model
└── stdmodels/
    ├── vae_3d/                             # 3D VAE model
    ├── llava-llama-3-8b-v1_1-transformers/ # Text encoder
    └── openai_clip-vit-large-patch14/      # CLIP encoder
```

## Development Notes

- Environment variable `MODEL_BASE` should point to `weights/stdmodels`
- Use `export DISABLE_SP=1` and `export CPU_OFFLOAD=1` for single GPU inference
- Minimum GPU memory: 24 GB (expect very slow inference); recommended: 80 GB per GPU
- Action length determines video duration (1 action = 33 frames at 25 FPS)
- SageAttention can be installed for additional acceleration
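
A minimal shell setup collecting the environment variables above (paths assume the repository root; the last two variables are only needed for single-GPU, low-VRAM runs):

```bash
# Point the code at the standard component weights (VAE, text encoder, CLIP)
export MODEL_BASE="weights/stdmodels"
# Single-GPU / low-VRAM only: disable sequence parallelism, enable CPU offload
export DISABLE_SP=1
export CPU_OFFLOAD=1
```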