|
--- |
|
license: apache-2.0 |
|
pipeline_tag: feature-extraction |
|
--- |
|
|
|
# UniTok: A Unified Tokenizer for Visual Generation and Understanding |
|
|
|
This repository contains UniTok, a unified visual tokenizer for both image generation and understanding tasks, as presented in [UniTok: A Unified Tokenizer for Visual Generation and Understanding](https://hf.co/papers/2502.20321). |
|
|
|
Project Page: https://foundationvision.github.io/UniTok/ <br> |
|
Code: https://github.com/FoundationVision/UniTok |
|
|
|
<p align="center"> |
|
<img src="https://github.com/FoundationVision/UniTok/blob/main/assets/teaser.png?raw=true" width=93%> |
|
</p>
|
|
|
UniTok encodes fine-grained details for generation and captures high-level semantics for understanding. It's compatible with autoregressive generative models (e.g., LlamaGen), multimodal understanding models (e.g., LLaVA), and unified MLLMs (e.g., Chameleon and Liquid). |
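To enlarge the effective vocabulary without blowing up a single codebook, the UniTok paper quantizes each latent token chunk-wise with multiple independent sub-codebooks (multi-codebook quantization). The sketch below is a minimal NumPy illustration of that idea only; the sizes and the `quantize` helper are hypothetical and do not reflect UniTok's actual configuration or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes -- NOT UniTok's actual configuration.
num_codebooks = 4        # each latent token is split into 4 chunks
codebook_size = 256      # entries per sub-codebook
dim = 64                 # full latent dimension
sub_dim = dim // num_codebooks

# One independent sub-codebook per chunk: (num_codebooks, codebook_size, sub_dim)
codebooks = rng.normal(size=(num_codebooks, codebook_size, sub_dim))

def quantize(z):
    """Quantize latents z of shape (N, dim) chunk by chunk.

    Each chunk is replaced by its nearest entry in the corresponding
    sub-codebook, so the effective vocabulary has
    codebook_size ** num_codebooks combinations.
    Returns (quantized latents (N, dim), indices (N, num_codebooks)).
    """
    chunks = z.reshape(len(z), num_codebooks, sub_dim)
    # Squared distance from every chunk to every sub-codebook entry.
    dists = ((chunks[:, :, None, :] - codebooks[None]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=-1)                      # (N, num_codebooks)
    zq = codebooks[np.arange(num_codebooks), idx]    # (N, num_codebooks, sub_dim)
    return zq.reshape(len(z), dim), idx

z = rng.normal(size=(2, dim))
zq, idx = quantize(z)
print(zq.shape, idx.shape)  # (2, 64) (2, 4)
```

With these toy sizes, storing only 4 × 256 entries yields 256⁴ ≈ 4.3 billion possible token combinations, which is why splitting the vocabulary across sub-codebooks scales better than one monolithic codebook.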
|
|
|
Built upon UniTok, we construct an MLLM capable of both multimodal generation and understanding, which sets a new state-of-the-art among unified autoregressive MLLMs. The weights of our MLLM will be released soon. |
|
|
|
<p align="center"> |
|
<img src="https://github.com/FoundationVision/UniTok/blob/main/assets/samples.png?raw=true" width=93%> |
|
</p>
|
|
|
## Performance |
|
|
|
<table> |
|
<thead> |
|
<tr> |
|
<th>Method</th> |
|
<th>#Tokens</th> |
|
<th>rFID ↓</th> |
|
    <th>Accuracy ↑</th>
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td colspan="4"><i>VQVAE Model</i></td> |
|
</tr> |
|
<tr align="center"> |
|
<td>VQ-GAN</td> |
|
<td>256</td> |
|
<td>4.98</td> |
|
<td>--</td> |
|
</tr> |
|
<tr align="center"> |
|
<td>RQ-VAE</td> |
|
<td>256</td> |
|
<td>1.30</td> |
|
<td>--</td> |
|
</tr> |
|
<tr align="center"> |
|
<td>VAR</td> |
|
<td>680</td> |
|
<td>0.90</td> |
|
<td>--</td> |
|
</tr> |
|
<tr> |
|
<td colspan="4"><i>CLIP Model</i></td> |
|
</tr> |
|
<tr align="center"> |
|
<td>CLIP</td> |
|
<td>256</td> |
|
<td>--</td> |
|
<td>76.2</td> |
|
</tr> |
|
<tr align="center"> |
|
<td>SigLIP</td> |
|
<td>256</td> |
|
<td>--</td> |
|
<td>80.5</td> |
|
</tr> |
|
<tr align="center"> |
|
<td>ViTamin</td> |
|
<td>256</td> |
|
<td>--</td> |
|
<td>81.2</td> |
|
</tr> |
|
<tr> |
|
<td colspan="4"><i>Unified Model</i></td> |
|
</tr> |
|
<tr align="center"> |
|
<td>TokenFlow †</td> |
|
<td>680</td> |
|
<td>1.37</td> |
|
<td>--</td> |
|
</tr> |
|
<tr align="center"> |
|
<td>VILA-U †</td> |
|
<td>256</td> |
|
<td>1.80</td> |
|
<td>73.3</td> |
|
</tr> |
|
<tr align="center"> |
|
<td>UniTok</td> |
|
<td>256</td> |
|
<td>0.39</td> |
|
<td>70.5</td> |
|
</tr> |
|
<tr align="center"> |
|
<td>UniTok †</td> |
|
<td>256</td> |
|
<td>0.38</td> |
|
<td>78.6</td> |
|
</tr> |
|
</tbody> |
|
</table> |
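For reference, rFID in the table is the reconstruction FID: the Fréchet Inception Distance between original and reconstructed images, computed from Gaussian fits to their Inception features (lower is better):

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the feature mean and covariance of the reference and reconstructed image sets, respectively.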
|
|
|
|
|
This repository hosts the UniTok checkpoints.
|
|
|
For more details and tutorials, see https://github.com/FoundationVision/UniTok.
|
|
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{unitok, |
|
title={UniTok: A Unified Tokenizer for Visual Generation and Understanding}, |
|
author={Ma, Chuofan and Jiang, Yi and Wu, Junfeng and Yang, Jihan and Yu, Xin and Yuan, Zehuan and Peng, Bingyue and Qi, Xiaojuan}, |
|
journal={arXiv preprint arXiv:2502.20321}, |
|
year={2025} |
|
} |
|
``` |