# vit_base_patch16_224
Implementation of the Vision Transformer (ViT) proposed in [An Image Is
Worth 16x16 Words: Transformers For Image Recognition At
Scale](https://arxiv.org/pdf/2010.11929.pdf).
The following image from the authors shows the architecture.
``` python
ViT.vit_small_patch16_224()
ViT.vit_base_patch16_224()
ViT.vit_base_patch16_384()
ViT.vit_base_patch32_384()
ViT.vit_huge_patch16_224()
ViT.vit_huge_patch32_384()
ViT.vit_large_patch16_224()
ViT.vit_large_patch16_384()
ViT.vit_large_patch32_384()
```
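The constructor names above appear to follow the pattern `vit_<size>_patch<patch size>_<input resolution>`. The sketch below makes that reading explicit. `parse_variant` is a hypothetical helper written only for illustration, not a function of this library, and the naming interpretation itself is an assumption drawn from the names and the paper title.

```python
import re


def parse_variant(name: str) -> dict:
    """Split a variant name such as 'vit_base_patch16_224' into its parts.

    Hypothetical helper for illustration; it is not part of the library.
    """
    match = re.fullmatch(r"vit_([a-z]+)_patch(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognised variant name: {name}")
    size, patch_size, image_size = match.groups()
    return {
        "size": size,                    # small, base, large or huge
        "patch_size": int(patch_size),   # side of each square patch, in pixels
        "image_size": int(image_size),   # side of the expected square input, in pixels
    }


print(parse_variant("vit_base_patch16_224"))
# {'size': 'base', 'patch_size': 16, 'image_size': 224}
```

Read this way, `vit_large_patch32_384` would be the large model with 32x32 patches on 384x384 inputs.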
Examples:
``` python
import torch
from torch import nn

# ViT and ViTTokens are assumed to be already imported from this library

# change activation
ViT.vit_base_patch16_224(activation=nn.SELU)
# change number of classes (default is 1000)
ViT.vit_base_patch16_224(n_classes=100)
# pass a different block (default is TransformerEncoderBlock);
# MyCoolTransformerBlock stands for your own block class
ViT.vit_base_patch16_224(block=MyCoolTransformerBlock)
# get features
model = ViT.vit_base_patch16_224()
# first access .features: this activates the forward hooks and tells the model you'd like to get the features
model.encoder.features
model(torch.randn((1, 3, 224, 224)))
# get the features from the encoder
features = model.encoder.features
print([x.shape for x in features])
# [torch.Size([1, 197, 768]), torch.Size([1, 197, 768]), ...]
# change the tokens, you have to subclass ViTTokens
class MyTokens(ViTTokens):
    def __init__(self, emb_size: int):
        super().__init__(emb_size)
        self.my_new_token = nn.Parameter(torch.randn(1, 1, emb_size))

ViT(tokens=MyTokens)
```
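The feature shape `[1, 197, 768]` printed in the example can be checked by hand: a 224x224 image cut into 16x16 patches gives 14 x 14 = 196 patches, and the class token described in the paper adds one more. The sketch below is plain arithmetic under that assumption. `sequence_length` is a hypothetical helper, not a library function, and the `n_tokens` argument assumes that every extra token (for example one added through a `ViTTokens` subclass) lengthens the sequence by one.

```python
def sequence_length(image_size: int, patch_size: int, n_tokens: int = 1) -> int:
    """Number of tokens the encoder sees for a square input.

    Hypothetical helper for illustration; n_tokens counts the extra learnable
    tokens (1 for the class token alone, which is an assumption about the default).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + n_tokens


for image_size, patch_size in [(224, 16), (384, 16), (384, 32)]:
    print(image_size, patch_size, sequence_length(image_size, patch_size))
# 224 16 197
# 384 16 577
# 384 32 145
```

The last dimension, 768, is the embedding size of the base model and does not depend on the image or patch size.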