File size: 6,346 Bytes
84a1c35
 
 
 
 
 
 
 
 
 
 
 
 
 
dd8bc5d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f1803ce
dd8bc5d
f1803ce
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
dd8bc5d
 
f1803ce
 
 
 
 
 
 
 
dd8bc5d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
---
title: "Matrix"
emoji: 🐟
colorFrom: blue
colorTo: blue
sdk: docker
app_file: server.py
pinned: true
short_description: AI Gaming server
app_port: 8080
disable_embedding: false
---


<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

# Matrix-Game: Interactive World Foundation Model
<font size=7><div align='center' >  [[πŸ€— Huggingface](https://huggingface.co/Skywork/Matrix-Game)] [[πŸ“– Technical Report](https://github.com/SkyworkAI/Matrix-Game/blob/main/assets/report.pdf)] [[πŸš€ Project Website](https://matrix-game-homepage.github.io/)] </div></font>

<div align="center">
  <img src="assets/videos/demo.gif" alt="teaser" />
</div>

## πŸ“ Overview
**Matrix-Game** is a 17B-parameter interactive world foundation model for controllable game world generation.

## ✨ Key Features

- 🎯 **Feature 1**: **Interactive Generation.**  A diffusion-based image-to-world model that generates high-quality videos conditioned on keyboard and mouse inputs, enabling fine-grained control and dynamic scene evolution.
- πŸš€ **Feature 2**: **GameWorld Score.** A comprehensive benchmark for evaluating Minecraft world models across four key dimensions, including visual quality, temporal quality, action controllability, and physical rule understanding. 
- πŸ’‘ **Feature 3**: **Matrix-Game Dataset** A large-scale Minecraft dataset with fine-grained action annotations, supporting scalable training for interactive and physically grounded world modeling.

## πŸ”₯ Latest Updates

* [2025-05] πŸŽ‰ Initial release of Matrix-Game Model

## πŸš€ Performance Comparison
### GameWorld Score Benchmark Comparison

| Model     | Image Quality ↑ | Aesthetic Quality ↑ | Temporal Cons. ↑ | Motion Smooth. ↑ | Keyboard Acc. ↑ | Mouse Acc. ↑ | 3D Cons. ↑ |
|-----------|------------------|-------------|-------------------|-------------------|------------------|---------------|-------------|
| Oasis     | 0.65             | 0.48        | 0.94              | **0.98**          | 0.77             | 0.56          | 0.56        |
| MineWorld | 0.69             | 0.47        | 0.95              | **0.98**          | 0.86             | 0.64          | 0.51        |
| **Ours**  | **0.72**         | **0.49**    | **0.97**          | **0.98**          | **0.95**         | **0.95**      | **0.76**    |

**Metric Descriptions**:

- **Image Quality** / **Aesthetic**: Visual fidelity and perceptual appeal of generated frames  
- **Temporal Consistency** / **Motion Smoothness**: Temporal coherence and smoothness between frames  
- **Keyboard Accuracy** / **Mouse Accuracy**: Accuracy in following user control signals  
- **3D Consistency**: Geometric stability and physical plausibility over time

  Please check our [GameWorld](https://github.com/SkyworkAI/Matrix-Game/tree/main/GameWorldScore) benchmark for detailed implementation.

### Human Evaluation

![Human Win Rate](assets/imgs/human_win_rate.png)

> Double-blind human evaluation by two independent groups across four key dimensions: **Overall Quality**, **Controllability**, **Visual Quality**, and **Temporal Consistency**.  
> Scores represent the percentage of pairwise comparisons in which each method was preferred. Matrix-Game consistently outperforms prior models across all metrics and both groups.


## πŸš€ Quick Start

```
# clone the repository:
git clone https://github.com/SkyworkAI/Matrix-Game.git
cd Matrix-Game

# install dependencies:
pip install -r requirements.txt

# install apex and FlashAttention-3
# Our project also depends on [apex](https://github.com/NVIDIA/apex) and [FlashAttention-3](https://github.com/Dao-AILab/flash-attention)

# Run batch inference to generate videos
bash run_inference.sh

# Run interactive websocket server
python server.py --model_root ./models/matrixgame
```

## Interactive WebSocket Server

We've implemented a real-time interactive WebSocket server that uses the Matrix-Game model to generate game frames based on keyboard and mouse inputs:

### Features:
- **Real-time Generation**: Frames are generated on-the-fly based on user inputs
- **Keyboard & Mouse Control**: Move through the virtual world using WASD keys and mouse movements
- **Multiple Scenes**: Choose from different environments (forest, desert, beach, hills, etc.)
- **Fallback Mode**: Automatically falls back to demo mode when GPU resources are unavailable

### Usage:
```bash
# Basic startup
python server.py

# With custom model paths
python server.py --model_root ./models/matrixgame --port 8080

# With individual model component paths
python server.py --dit_path ./custom/dit --vae_path ./custom/vae --textenc_path ./custom/textenc
```

### Connection:
- WebSocket endpoint: ws://localhost:8080/ws
- Web client: http://localhost:8080/

### System Requirements:
- NVIDIA GPU with CUDA support
- 24GB+ VRAM recommended for smooth frame generation


## πŸ”§ Hardware Requirements
- **GPU**:
  - NVIDIA A100/H100
- **VRAM**:
  - Requires **β‰₯80GB of GPU memory** for a single 65-frame video inference.


## ⭐ Acknowledgements

We would like to express our gratitude to:

- [Diffusers](https://github.com/huggingface/diffusers) for their excellent diffusion model framework
- [HunyuanVideo](https://github.com/Tencent/HunyuanVideo) for their strong base model
- [MineDojo](https://minedojo.org/knowledge_base) for their Minecraft video dataset
- [MineRL](https://github.com/minerllabs/minerl) for their excellent gym framework
- [Video-Pre-Training](https://github.com/openai/Video-Pre-Training) for their accurate Inverse Dynamics Model
- [GameFactory](https://github.com/KwaiVGI/GameFactory) for their idea of action control module 

We are grateful to the broader research community for their open exploration and contributions to the field of interactive world generation.

## πŸ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## πŸ“Ž Citation
If you find this project useful, please cite our paper:
```bibtex
@article{zhang2025matrixgame,
  title     = {Matrix-Game: Interactive World Foundation Model},
  author    = {Yifan Zhang and Chunli Peng and Boyang Wang and Puyi Wang and Qingcheng Zhu and Zedong Gao and Eric Li and Yang Liu and Yahui Zhou},
  journal   = {arXiv},
  year      = {2025}
}
```