	---
	license: gemma
	datasets:
	- damerajee/clean_hin_vqa
	language:
	- en
	- hi
	inference: false
	library_name: transformers
	pipeline_tag: visual-question-answering
	tags:
	- visual-question-answering
	- Bilingual
	---

	# ViLaH
	ViLaH (Vision Language Hindi) is a 3-billion-parameter model fine-tuned from the base model [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224). It takes an image together with a Hindi or English text prompt and generates text in either language.


	# Training Details
	* Model Configuration: Fine-tuned for a single epoch on a V100 GPU.
	* Training Duration: Approximately one day.
	* Evaluation Loss: Achieved an eval loss of 1.6384 at the end of the epoch.
	* Status: The model is still being trained on a higher-quality dataset.
	* Limitations: Performance may be limited because the training data was small and the model was trained for only one epoch.

	# Dataset
	The model was fine-tuned on a single dataset:
	* [damerajee/clean_hin_vqa](https://huggingface.co/datasets/damerajee/clean_hin_vqa): This dataset was derived from [Lin-Chen/ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and filtered to include only images from the COCO dataset. The original dataset was translated into Hindi and cleaned to produce Hindi visual question answering content.
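
The COCO-only filtering step described above might look roughly like the sketch below. It is an illustration, not the authors' actual pipeline: the record layout (an `image` path field), the `coco/` path prefix, and the sample records are all assumptions made for the example.

```python
# Hypothetical ShareGPT4V-style records; field names and paths are assumptions.
records = [
    {"id": "a", "image": "coco/train2017/000000000009.jpg"},
    {"id": "b", "image": "sam/images/sa_1.jpg"},
    {"id": "c", "image": "coco/train2017/000000000025.jpg"},
]


def is_coco(record, prefix="coco/"):
    """Return True when the record's image path starts with the COCO prefix."""
    return record.get("image", "").startswith(prefix)


# Keep only the records whose image comes from the assumed COCO folder.
coco_records = [r for r in records if is_coco(r)]
print([r["id"] for r in coco_records])
```

Filtering on the path prefix keeps the check cheap, since no image has to be opened to decide whether a record stays.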

	# How to Use

	```python
	!pip install peft trl datasets accelerate bitsandbytes
	!pip install transformers --upgrade
	```
	### To run the model on a single T4 GPU in float16
	```python
	from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
	from datasets import load_dataset
	import torch

	dataset = load_dataset("damerajee/clean_hin_vqa",split='train')
	test_example = dataset[10000]
	test_image = test_example["image"]
	text = test_example['question']

	device_index = torch.cuda.current_device()
	print("device_index:",device_index)
	base_model = PaliGemmaForConditionalGeneration.from_pretrained("BhashaAI/ViLaH",device_map={"": device_index},torch_dtype=torch.float16,low_cpu_mem_usage=True)
	processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

	inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
	for k, v in inputs.items():
	    print(k, v.shape)

	MAX_LENGTH = 200
	# Autoregressively generate
	# We sample with a temperature and a repetition penalty here; for other decoding strategies see https://huggingface.co/blog/how-to-generate
	generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH, temperature=0.7, repetition_penalty=2.0, do_sample=True)

	# Next we turn the predicted token IDs back into a string using the batch_decode method
	# generate() returns the prompt (image tokens plus our text prompt) followed by the new tokens,
	# so we chop off the prompt using the length of the processed input
	num_prompt_tokens = inputs["input_ids"].shape[-1]
	generated_text = processor.batch_decode(generated_ids[:, num_prompt_tokens:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
	print(generated_text)
	```
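
The decoding step above drops the prompt from the output of `generate`. As a toy illustration with plain Python lists: for a decoder-style model the returned sequence starts with the prompt tokens, so slicing at the prompt length leaves only the new tokens. The token IDs below are made up, and `strip_prompt` is a hypothetical helper, not part of transformers.

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the first `prompt_length` tokens from each sequence in the batch."""
    return [sequence[prompt_length:] for sequence in generated_ids]


# Made-up IDs: 3 image placeholder tokens, then a 2-token text prompt.
prompt_ids = [9, 9, 9, 101, 102]
# A batch of one sequence: the prompt followed by 3 newly generated tokens.
generated_ids = [prompt_ids + [7, 8, 1]]

new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)  # [[7, 8, 1]]
```

With tensors the same idea is `generated_ids[:, prompt_length:]`, where the prompt length can be read from the shape of the processed `input_ids`.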
	### To run the model on a single T4 GPU in 4-bit
	```python
	from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
	from datasets import load_dataset
	import torch

	dataset = load_dataset("damerajee/clean_hin_vqa",split='train')
	test_example = dataset[10000]
	test_image = test_example["image"]
	text = test_example['question']

	device_index = torch.cuda.current_device()
	print("device_index:",device_index)
	quantization_config = BitsAndBytesConfig(load_in_4bit=True)
	base_model = PaliGemmaForConditionalGeneration.from_pretrained("BhashaAI/ViLaH",device_map={"": device_index},quantization_config=quantization_config,torch_dtype=torch.float16,low_cpu_mem_usage=True)
	processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

	inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
	for k, v in inputs.items():
	    print(k, v.shape)

	MAX_LENGTH = 200
	# Autoregressively generate
	# We sample with a temperature and a repetition penalty here; for other decoding strategies see https://huggingface.co/blog/how-to-generate
	generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH, temperature=0.7, repetition_penalty=2.0, do_sample=True)

	# Next we turn the predicted token IDs back into a string using the batch_decode method
	# generate() returns the prompt (image tokens plus our text prompt) followed by the new tokens,
	# so we chop off the prompt using the length of the processed input
	num_prompt_tokens = inputs["input_ids"].shape[-1]
	generated_text = processor.batch_decode(generated_ids[:, num_prompt_tokens:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
	print(generated_text)
	```
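
As a rough guide to why 4-bit loading helps on a small GPU, the weight memory for the two setups can be estimated from the parameter count. This is back-of-the-envelope arithmetic only: it takes the nominal 3 billion parameters at face value, assumes every weight is stored at the stated width, and ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 3_000_000_000  # nominal size from the model description

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 2 bytes per weight
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # half a byte per weight
print(f"float16: {fp16_gib:.2f} GiB, 4-bit: {int4_gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the float16 footprint, which leaves more headroom for activations and longer generations.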