Vision Transformer

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale and further refined in the follow-up paper How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. The weights were converted from the S_16-i21k-300ep-lr_0.001-aug_light1-wd_0.03-do_0.0-sd_0.0.npz file hosted in the GCS buckets referenced by the original repository.
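
Below is a minimal usage sketch. It assumes the checkpoint can be loaded with the generic Auto classes from the transformers library and that an image processor configuration is bundled with the repository; the sample image URL is purely illustrative.

```python
from PIL import Image
import requests
from transformers import AutoImageProcessor, AutoModel

# Example image (the COCO sample commonly used in ViT examples)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Assumption: the checkpoint loads with the generic Auto classes; if the repo
# ships a custom configuration, different classes may be required.
model_id = "cs-giung/vit-small-patch16-imagenet21k-augreg"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Preprocess to 224x224 and run a forward pass to obtain token embeddings
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# For ViT-S/16 at 224x224: 196 patch tokens + 1 [CLS] token, hidden size 384
print(outputs.last_hidden_state.shape)  # expected: torch.Size([1, 197, 384])
```

For classification over the 21,843 ImageNet-21k classes, ViTForImageClassification could be used instead of AutoModel, provided the classifier head weights are included in the checkpoint.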

Model size: 30.1M params · Tensor type: F32 · Format: Safetensors