# Efficient Online Inference of Vision Transformers by Training-Free Tokenization

This [repository](https://github.com/wearepal/visual-word-tokenizer) wraps the code for the arXiv paper [**Efficient Online Inference of Vision Transformers by Training-Free Tokenization**](https://arxiv.org/abs/2411.15397) into a ready-to-use library for your own applications.

**Authors:** Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto

**Affiliations:** University of Sussex, University of Surrey, Basque Center for Applied Mathematics, Monash University (Indonesia)

## Installation

```
git clone https://github.com/wearepal/visual-word-tokenizer.git
```

## Usage

### Inter-image Approach

Note that the inter-image approach requires attention masking to work. The HuggingFace encoder implementation already accepts an `attention_mask` argument, but it is left unused by the vision transformer. Add the following line to the arguments of the `self.encoder` call:

```python
attention_mask=getattr(hidden_states, 'attention_mask', None),
```

Please refer to *modeling_clip.py* and *modeling_blip.py* in the examples folder for more clarity (a sketch of the patched call is also given at the end of this README).

```python
from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# load your pre-processing dataset here...
pre_process_data = ['A list of images', '...']  # dummy data

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)
vwt = model.vision_model.embeddings

vwt.learn_words(
    pre_process_data,
    vocab_size=1000,  # number of visual words
    batch_size=1024   # batch size for clustering
)

# deploy the model for inference on your downstream task...

# saving the visual word vocabulary
vwt.save_pretrained('pre_process_data')

# reusing the visual word vocabulary
new_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
wrap_model(new_model.vision_model, thresh=0.1)

new_vwt = new_model.vision_model.embeddings
new_vwt.load_words('pre_process_data/vocab.pt')
```

You may also load pre-processed visual words from the HuggingFace Hub. We provide ImageNet-1K vocabularies with sizes of 100, 1000, and 10000.

```python
from huggingface_hub import snapshot_download
from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)
vwt = model.vision_model.embeddings

# downloading the visual word vocabulary
snapshot_download(repo_id='LeonidasY/inter-image-imgnet-100', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-1000', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-10000', local_dir='tokenizer')

# loading the visual word vocabulary
vwt.load_words('tokenizer/vocab.pt')

# deploy the model for inference on your downstream task...
```

## Citation

```
@misc{gee2024efficientonlineinferencevision,
  title={Efficient Online Inference of Vision Transformers by Training-Free Tokenization},
  author={Leonidas Gee and Wing Yan Li and Viktoriia Sharmanska and Novi Quadrianto},
  year={2024},
  eprint={2411.15397},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.15397},
}
```
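## Example Sketches

For readers who prefer not to open the example files right away, here is a minimal sketch of the `self.encoder` patch described in the Usage section. It assumes the call structure of `CLIPVisionTransformer.forward` in HuggingFace `transformers`; the surrounding argument names can differ between versions, so *examples/modeling_clip.py* remains the reference.

```python
# Sketch only: inside CLIPVisionTransformer.forward (modeling_clip.py),
# pass the attention mask that the tokenizer wrapper attaches to the
# embedding output on to the encoder.
encoder_outputs = self.encoder(
    inputs_embeds=hidden_states,
    attention_mask=getattr(hidden_states, 'attention_mask', None),  # added line
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```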
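Similarly, an illustrative sketch of the "deploy the model for inference" step as zero-shot classification with CLIP. It assumes that the patched `CLIPModel` keeps the standard HuggingFace `CLIPModel` interface and that a visual word vocabulary has already been saved or downloaded to `tokenizer/vocab.pt`; the image URL and text prompts are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPProcessor

from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')

# wrap the vision tower and load a previously saved visual word vocabulary
wrap_model(model.vision_model, thresh=0.1)
model.vision_model.embeddings.load_words('tokenizer/vocab.pt')

# placeholder inputs
image = Image.open(requests.get(
    'http://images.cocodataset.org/val2017/000000039769.jpg', stream=True
).raw)
texts = ['a photo of a cat', 'a photo of a dog']

inputs = processor(text=texts, images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# probabilities over the text prompts for the image
print(outputs.logits_per_image.softmax(dim=-1))
```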