
Efficient Online Inference of Vision Transformers by Training-Free Tokenization

This repository wraps the code for the arXiv paper Efficient Online Inference of Vision Transformers by Training-Free Tokenization into a ready-to-use library for your own applications.

Authors: Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto

Affiliations: University of Sussex, University of Surrey, Basque Center for Applied Mathematics, Monash University (Indonesia)

Installation

git clone https://github.com/wearepal/visual-word-tokenizer.git
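
The library is used directly from the repository root. As a rough sketch of a typical setup (the repository itself defines the authoritative dependency list), the code shown below needs PyTorch, Transformers, and the HuggingFace Hub client:

cd visual-word-tokenizer
pip install torch transformers huggingface_hub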

Usage

Inter-image Approach

Note that the inter-image approach requires attention masking to work. The HuggingFace encoder implementation of the transformer already exposes an attention_mask argument, but it goes unused by the vision transformer. Add the following line to the call to self.encoder:

attention_mask=getattr(hidden_states, 'attention_mask', None),

Please refer to modeling_clip.py and modeling_blip.py in the examples folder for more clarity.
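
For orientation, here is a minimal sketch of where the line goes, based on the structure of CLIPVisionTransformer.forward in HuggingFace transformers (attribute names may vary between versions; modeling_blip.py is patched analogously):

# sketch of the patched encoder call inside CLIPVisionTransformer.forward
hidden_states = self.embeddings(pixel_values)
hidden_states = self.pre_layrnorm(hidden_states)

encoder_outputs = self.encoder(
    inputs_embeds=hidden_states,
    # forward the mask attached to hidden_states by the tokenizer, if any
    attention_mask=getattr(hidden_states, 'attention_mask', None),
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)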

from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model


model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# load your pre-processing dataset here...
pre_process_data = ['A list of images', '...']  # dummy data

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings 
vwt.learn_words(
    pre_process_data,
    vocab_size=1000, # number of visual words
    batch_size=1024 # batch size for clustering
)
# deploy the model for inference on your downstream task...

# saving the visual word vocabulary
vwt.save_pretrained('tokenizer')

# reusing the visual word vocabulary
new_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
wrap_model(new_model.vision_model, thresh=0.1)

new_vwt = new_model.vision_model.embeddings
new_vwt.load_words('tokenizer/vocab.pt')
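
Once wrapped, the model is called like a regular CLIP model. Below is a minimal inference sketch, assuming the standard CLIPProcessor from HuggingFace transformers and a placeholder image path:

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')

image = Image.open('example.jpg').convert('RGB')  # placeholder: any RGB image
inputs = processor(images=image, return_tensors='pt')

# patches are mapped to visual words inside the wrapped embeddings
image_features = model.get_image_features(pixel_values=inputs['pixel_values'])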

You may also load pre-processed visual word vocabularies from HuggingFace. We provide ImageNet-1K vocabularies of size 100, 1000, and 10000.

from huggingface_hub import snapshot_download

from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model


model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings 

# downloading the visual word vocabulary
snapshot_download(repo_id='LeonidasY/inter-image-imgnet-100', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-1000', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-10000', local_dir='tokenizer')

# loading the visual word vocabulary
vwt.load_words('tokenizer/vocab.pt')

# deploy the model for inference on your downstream task...

Citation

@misc{gee2024efficientonlineinferencevision,
      title={Efficient Online Inference of Vision Transformers by Training-Free Tokenization}, 
      author={Leonidas Gee and Wing Yan Li and Viktoriia Sharmanska and Novi Quadrianto},
      year={2024},
      eprint={2411.15397},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.15397}, 
}