|
# Efficient Online Inference of Vision Transformers by Training-Free Tokenization |
|
|
|
The [repository](https://github.com/wearepal/visual-word-tokenizer) wraps the code for the paper [**Efficient Online Inference of Vision Transformers by Training-Free Tokenization**](https://arxiv.org/abs/2411.15397) (arXiv) into a ready-to-use library for your own applications.
|
|
|
**Authors:** Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto |
|
|
|
**Affiliations:** University of Sussex, University of Surrey, Basque Center for Applied Mathematics, Monash University (Indonesia) |
|
|
|
## Installation |
|
```
git clone https://github.com/wearepal/visual-word-tokenizer.git
```
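
The usage examples below import from the `examples` and `vwt` packages, so run your code from the repository root (or add it to `PYTHONPATH`); a minimal sketch, assuming a POSIX shell:

```
cd visual-word-tokenizer
export PYTHONPATH="$PWD:$PYTHONPATH"
```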
|
|
|
## Usage |
|
|
|
### Inter-image Approach |
|
Note that the inter-image approach requires attention masking to work. The HuggingFace encoder implementation already exposes an `attention_mask` argument, but it goes unused by the vision transformer. To pass the tokenizer's mask through, add the following line to the arguments of the `self.encoder(...)` call:
|
|
|
```python
attention_mask=getattr(hidden_states, 'attention_mask', None),
```
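
For orientation, the surrounding call in HuggingFace's `CLIPVisionTransformer.forward` then looks roughly like this (a sketch; argument names vary across `transformers` versions):

```python
# inside CLIPVisionTransformer.forward, after the embeddings are computed
encoder_outputs = self.encoder(
    inputs_embeds=hidden_states,
    # added: forward the mask attached by the visual word tokenizer, if any
    attention_mask=getattr(hidden_states, 'attention_mask', None),
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```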
|
|
|
Please refer to *modeling_clip.py* and *modeling_blip.py* in the *examples* folder for complete reference implementations.
|
|
|
```python
from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# load your pre-processing dataset here...
pre_process_data = ['A list of images', '...']  # dummy data

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings
vwt.learn_words(
    pre_process_data,
    vocab_size=1000,  # number of visual words
    batch_size=1024   # batch size for clustering
)
# deploy the model for inference on your downstream task...

# saving the visual word vocabulary
vwt.save_pretrained('tokenizer')

# reusing the visual word vocabulary
new_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
wrap_model(new_model.vision_model, thresh=0.1)

new_vwt = new_model.vision_model.embeddings
new_vwt.load_words('tokenizer/vocab.pt')
```
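
As a sketch of the deployment step, the wrapped model is assumed to keep the standard CLIP interface; the processor usage and sample image below are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')
image = Image.open('example.jpg')  # any RGB image

# encode the image as usual; tokenization happens inside the wrapped model
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    image_features = model.get_image_features(**inputs)
```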
|
|
|
You may also load pre-processed visual words from HuggingFace. We provide ImageNet-1K vocabularies of size 100, 1000, and 10000.
|
|
|
```python
from huggingface_hub import snapshot_download

from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings

# downloading the visual word vocabulary
snapshot_download(repo_id='LeonidasY/inter-image-imgnet-100', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-1000', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-10000', local_dir='tokenizer')

# loading the visual word vocabulary
vwt.load_words('tokenizer/vocab.pt')

# deploy the model for inference on your downstream task...
```
|
|
|
## Citation |
|
```
@misc{gee2024efficientonlineinferencevision,
  title={Efficient Online Inference of Vision Transformers by Training-Free Tokenization},
  author={Leonidas Gee and Wing Yan Li and Viktoriia Sharmanska and Novi Quadrianto},
  year={2024},
  eprint={2411.15397},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.15397},
}
```
|
|