
	---
	license: mit
	language:
	- ar
	- kn
	- ar
	- ka
	- af
	- kk
	- am
	- km
	- ar
	- ky
	- ar
	- ko
	- as
	- lo
	- az
	- ml
	- az
	- mr
	- be
	- mk
	- bn
	- my
	- bs
	- nl
	- bg
	- ca
	- 'no'
	- cs
	- ne
	- ku
	- pl
	- cy
	- pt
	- da
	- ro
	- de
	- ru
	- el
	- sa
	- en
	- si
	- eo
	- sk
	- et
	- sl
	- eu
	- sd
	- fi
	- so
	- fr
	- es
	- gd
	- sr
	- ga
	- su
	- gl
	- sv
	- gu
	- sw
	- ha
	- ta
	- he
	- te
	- hi
	- th
	- hr
	- tr
	- hu
	- ug
	- hy
	- uk
	- id
	- ur
	- is
	- vi
	- it
	- xh
	- jv
	- zh
	- ja
	pipeline_tag: zero-shot-image-classification
	tags:
	- siglip
	- clip
	- mexma
	---

	## Model Summary

	MEXMA-SigLIP combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with the image encoder from the
	[SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) model. The result is a high-performance CLIP-style model that supports 80 languages.
	MEXMA-SigLIP achieves state-of-the-art results on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) dataset among models whose licenses permit commercial use.
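
As a rough illustration of how a CLIP-style dual encoder compares an image with candidate captions, the sketch below scores toy vectors by cosine similarity. The vectors and the helpers `l2_normalize` and `cosine_scores` are assumptions made up for this example; they stand in for real encoder outputs and do not describe this model's internals.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each text embedding."""
    img = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]

# Toy 3-dimensional embeddings (hypothetical values, not real model outputs).
image_emb = [0.9, 0.1, 0.2]
text_embs = [
    [0.1, 0.9, 0.0],  # caption 0
    [0.0, 0.2, 0.9],  # caption 1
    [0.8, 0.2, 0.3],  # caption 2, closest in direction to the image
]

scores = cosine_scores(image_emb, text_embs)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

In a real dual encoder the two embeddings come from separate image and text networks that were trained to place matching pairs close together, which is why a shared similarity score is meaningful across languages.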


	## How to use

	```python
	from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
	from PIL import Image
	import requests
	import torch

	# Load the model (custom code from the repository), tokenizer, and image processor.
	model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
	tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
	processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

	# Download the image and convert it to a bfloat16 tensor on the GPU.
	img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
	img = processor(images=img, return_tensors="pt")["pixel_values"]
	img = img.to(torch.bfloat16).to("cuda")

	with torch.inference_mode():
	    # Candidate labels in Russian, English, and Hindi.
	    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
	    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
	    # Softmax over the candidate texts gives one probability per label.
	    probs = image_logits.softmax(dim=-1)
	    print(probs)
	```
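
The last step of the snippet above applies a softmax across the candidate texts. To show what that step does and how the result maps back to labels, here is a minimal sketch in plain Python. The logit values, the English placeholder labels, and the `rank_labels` helper are assumptions for illustration only; they are not outputs or APIs of the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, logits):
    """Pair each label with its probability, most likely first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Made-up logits for three candidate labels (not real model outputs).
labels = ["a cat", "a dog", "the Eiffel Tower"]
logits = [-2.0, -1.5, 6.0]

ranked = rank_labels(labels, logits)
for label, prob in ranked:
    print(f"{label}: {prob:.4f}")
```

Because softmax normalizes over the supplied candidates, the probabilities always sum to one, even when none of the labels describes the image well. Including a few distractor labels makes the top score easier to interpret.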

	## Acknowledgements

	I thank [ML Collective](https://mlcollective.org/) and [Lambda](https://lambdalabs.com/) for providing compute resources to train the model.