|
# Efficient Online Inference of Vision Transformers by Training-Free Tokenization |
|
|
|
The [repository](https://github.com/wearepal/visual-word-tokenizer) wraps the code for the paper [**Efficient Online Inference of Vision Transformers by Training-Free Tokenization**](https://arxiv.org/abs/2411.15397) (arXiv) into a ready-to-use library for your own applications.
|
|
|
**Authors:** Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto |
|
|
|
**Affiliations:** University of Sussex, University of Surrey, Basque Center for Applied Mathematics, Monash University (Indonesia) |
|
|
|
## Installation |
|
```
git clone https://github.com/wearepal/visual-word-tokenizer.git
```
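
The usage examples below import from the `examples` and `vwt` packages, so run your code from the repository root (or add it to `PYTHONPATH`); a minimal sketch, assuming a POSIX shell:

```
cd visual-word-tokenizer
export PYTHONPATH="$PWD:$PYTHONPATH"
```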
|
|
|
## Usage |
|
|
|
### Inter-image Approach |
|
Note that the inter-image approach requires attention masking to work. The HuggingFace encoder implementation already exposes an `attention_mask` argument, but it goes unused by the vision transformer. To pass the tokenizer's mask through, add the following line to the arguments of the `self.encoder(...)` call:
|
|
|
```python
attention_mask=getattr(hidden_states, 'attention_mask', None),
```
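
For orientation, the surrounding call in HuggingFace's `CLIPVisionTransformer.forward` then looks roughly like this (a sketch; argument names vary across `transformers` versions):

```python
# inside CLIPVisionTransformer.forward, after the embeddings are computed
encoder_outputs = self.encoder(
    inputs_embeds=hidden_states,
    # added: forward the mask attached by the visual word tokenizer, if any
    attention_mask=getattr(hidden_states, 'attention_mask', None),
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```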
|
|
|
Please refer to *modeling_clip.py* and *modeling_blip.py* in the *examples* folder for complete reference implementations.
|
|
|
```python
from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# load your pre-processing dataset here...
pre_process_data = ['A list of images', '...']  # dummy data

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings
vwt.learn_words(
    pre_process_data,
    vocab_size=1000,  # number of visual words
    batch_size=1024   # batch size for clustering
)
# deploy the model for inference on your downstream task...

# saving the visual word vocabulary
vwt.save_pretrained('tokenizer')

# reusing the visual word vocabulary
new_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
wrap_model(new_model.vision_model, thresh=0.1)

new_vwt = new_model.vision_model.embeddings
new_vwt.load_words('tokenizer/vocab.pt')
```
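
As a sketch of the deployment step, the wrapped model is assumed to keep the standard CLIP interface; the processor usage and sample image below are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')
image = Image.open('example.jpg')  # any RGB image

# encode the image as usual; tokenization happens inside the wrapped model
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    image_features = model.get_image_features(**inputs)
```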
|
|
|
You may also load pre-processed visual words from HuggingFace. We provide ImageNet-1K vocabularies of size 100, 1000, and 10000.
|
|
|
```python
from huggingface_hub import snapshot_download

from examples.modeling_clip import CLIPModel
from vwt.inter import wrap_model

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')

# initializing an inter-image tokenizer
wrap_model(model.vision_model, thresh=0.1)

vwt = model.vision_model.embeddings

# downloading the visual word vocabulary
snapshot_download(repo_id='LeonidasY/inter-image-imgnet-100', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-1000', local_dir='tokenizer')
# snapshot_download(repo_id='LeonidasY/inter-image-imgnet-10000', local_dir='tokenizer')

# loading the visual word vocabulary
vwt.load_words('tokenizer/vocab.pt')

# deploy the model for inference on your downstream task...
```
|
|
|
## Citation |
|
```
@misc{gee2024efficientonlineinferencevision,
  title={Efficient Online Inference of Vision Transformers by Training-Free Tokenization},
  author={Leonidas Gee and Wing Yan Li and Viktoriia Sharmanska and Novi Quadrianto},
  year={2024},
  eprint={2411.15397},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.15397},
}
```
|
|