
Efficient Online Inference of Vision Transformers by Training-Free Tokenization

This repository wraps the code for the arXiv paper Efficient Online Inference of Vision Transformers by Training-Free Tokenization into a ready-to-use library for your own applications.

Authors: Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto

Affiliations: University of Sussex, University of Surrey, Basque Center for Applied Mathematics, Monash University (Indonesia)

Installation

git clone https://github.com/wearepal/visual-word-tokenizer.git
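
The library is used directly from the repository root. As a rough sketch of a typical setup (the repository itself defines the authoritative dependency list), the code shown below needs PyTorch, Transformers, and the HuggingFace Hub client:

cd visual-word-tokenizer
pip install torch transformers huggingface_hub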

Usage

Inter-image Approach

Note that the inter-image approach requires attention masking to work. The HuggingFace encoder implementation of the transformer already exposes an attention_mask argument, but it goes unused by the vision transformer. Add the following line to the call to self.encoder:

attention_mask=getattr(hidden_states, 'attention_mask', None),

Please refer to modeling_clip.py and modeling_blip.py in the examples folder for more clarity.
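
For orientation, here is a minimal sketch of where the line goes, based on the structure of CLIPVisionTransformer.forward in HuggingFace transformers (attribute names may vary between versions; modeling_blip.py is patched analogously):

# sketch of the patched encoder call inside CLIPVisionTransformer.forward
hidden_states = self.embeddings(pixel_values)
hidden_states = self.pre_layrnorm(hidden_states)

encoder_outputs = self.encoder(
    inputs_embeds=hidden_states,
    # forward the mask attached to hidden_states by the tokenizer, if any
    attention_mask=getattr(hidden_states, 'attention_mask', None),
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)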

from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model


model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# load your pre-processing dataset here...
pre_process_data = ['A list of images', '...']  # dummy data

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings 
vwt.learn_words(
    pre_process_data,
    vocab_size=1000, # number of visual words
    batch_size=1024 # batch size for clustering
)
# deploy the model for inference on your downstream task...

# saving the visual word vocabulary
vwt.save_pretrained('tokenizer')

# reusing the visual word vocabulary
new_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
wrap_model(new_model.vision_model, thresh=0.1)

new_vwt = new_model.vision_model.embeddings
new_vwt.load_words('tokenizer/vocab.pt')
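
Once wrapped, the model is called like a regular CLIP model. Below is a minimal inference sketch, assuming the standard CLIPProcessor from HuggingFace transformers and a placeholder image path:

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')

image = Image.open('example.jpg').convert('RGB')  # placeholder: any RGB image
inputs = processor(images=image, return_tensors='pt')

# patches are mapped to visual words inside the wrapped embeddings
image_features = model.get_image_features(pixel_values=inputs['pixel_values'])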

You may also load pre-processed visual word vocabularies from HuggingFace. We provide ImageNet-1K vocabularies of size 100, 1000, and 10000.

from huggingface_hub import snapshot_download

from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model


model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings 

# downloading the visual word vocabulary
snapshot_download(repo_id='LeonidasY/inter-image-imgnet-100', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-1000', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-10000', local_dir='tokenizer')

# loading the visual word vocabulary
vwt.load_words('tokenizer/vocab.pt')

# deploy the model for inference on your downstream task...

Citation

@misc{gee2024efficientonlineinferencevision,
      title={Efficient Online Inference of Vision Transformers by Training-Free Tokenization}, 
      author={Leonidas Gee and Wing Yan Li and Viktoriia Sharmanska and Novi Quadrianto},
      year={2024},
      eprint={2411.15397},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.15397}, 
}