arXiv:2105.13290

CogView: Mastering Text-to-Image Generation via Transformers

Published on May 26, 2021
Authors: Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang

Abstract

Text-to-image generation in the general domain has long been an open problem: it requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, as well as methods to stabilize pretraining, e.g., eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and DALL-E, a recent work in the same vein.
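
The abstract describes the core recipe: a VQ-VAE quantizes each image into a grid of discrete codebook tokens, these are concatenated after the caption's text tokens, and a GPT-style Transformer is trained to predict every next token left to right. Below is a minimal PyTorch sketch of that idea; the vocabulary sizes, sequence lengths, layer counts, and class names are illustrative assumptions, not the paper's actual 4-billion-parameter configuration.

```python
# Sketch of a CogView-style pipeline: text tokens and VQ-VAE image tokens
# share one sequence, modeled autoregressively by a causal Transformer.
# All sizes are illustrative placeholders, not the paper's configuration.
import torch
import torch.nn as nn

TEXT_VOCAB = 50_000   # hypothetical text-token vocabulary size
IMAGE_VOCAB = 8_192   # hypothetical VQ-VAE codebook size
D_MODEL = 512         # toy width; the real model has 4B parameters

class TextToImageTransformer(nn.Module):
    def __init__(self, max_len=1088):
        super().__init__()
        # One shared embedding table over text + image tokens,
        # with image-token ids offset past the text vocabulary.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(max_len, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq) ids; a causal mask enforces
        # left-to-right, next-token prediction.
        seq = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq)
        x = self.embed(tokens) + self.pos(
            torch.arange(seq, device=tokens.device))
        x = self.blocks(x, mask=mask)
        return self.head(x)  # next-token logits over the joint vocabulary

# One training step: text and image tokens are treated uniformly,
# so the same cross-entropy loss covers both modalities.
model = TextToImageTransformer()
text = torch.randint(0, TEXT_VOCAB, (2, 64))  # caption token ids
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB,
                      (2, 1024))              # 32x32 grid of VQ codes
tokens = torch.cat([text, image], dim=1)
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

At sampling time, the same model is conditioned on a caption's text tokens and image tokens are drawn one at a time, then decoded back to pixels by the VQ-VAE decoder; the stabilization tricks the abstract mentions address overflow in this training loop at large scale.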
