---
license: gemma
datasets:
- damerajee/clean_hin_vqa
language:
- en
- hi
inference: false
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- visual-question-answering
- Bilingual
---

# ViLaH

ViLaH (Vision Language Hindi) is a 3-billion-parameter model fine-tuned from the base model [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224). It takes an image plus bilingual (Hindi and English) text as input and generates Hindi or English text as output.

# Training Details

* Model Configuration: Fine-tuned for a single epoch on one V100 GPU.
* Training Duration: Approximately one day.
* Evaluation Loss: 1.6384 at the end of the epoch.
* The model is still being trained on a higher-quality dataset.
* Performance may be limited because the training data was insufficient and the model was trained for only one epoch.

# Dataset

The model was fine-tuned on a single dataset:

* [damerajee/clean_hin_vqa](https://huggingface.co/datasets/damerajee/clean_hin_vqa): Derived from [Lin-Chen/ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and filtered to include only images from the COCO dataset. The original dataset was translated and cleaned to produce high-quality Hindi visual question answering content.

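The COCO-only filtering described above can be illustrated with a minimal sketch. It is not the script used to build the dataset: the record layout, the `image` field holding a relative path, the `coco/` path prefix, and the sample questions are all assumptions made for illustration.

```python
# Hypothetical records: the "image" field, the path prefixes and the
# questions are assumptions, not the actual source schema.
records = [
    {"image": "coco/train2017/000000000009.jpg", "question": "What is in the bowl?"},
    {"image": "sam/images/sa_545504.jpg", "question": "What is shown here?"},
    {"image": "coco/train2017/000000000025.jpg", "question": "Which animal is visible?"},
]


def is_coco(record):
    """Return True when the record's image path points into the COCO folder."""
    return record["image"].startswith("coco/")


# Keep only the records whose image comes from COCO
coco_only = [r for r in records if is_coco(r)]
print(len(coco_only))
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter` instead of a list comprehension.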
# How to Use

```python
!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade
```

### Run the model on a single T4 GPU in float16

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Take one example (image + question) from the fine-tuning dataset
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Load the model in float16 on the current GPU
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate.
# We sample with temperature 0.7 and a repetition penalty; for other
# decoding methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# generate() returns the prompt (image tokens + text prompt) followed by the
# new tokens, so we drop the first prompt_len positions before decoding.
prompt_len = inputs["input_ids"].shape[-1]
generated_text = processor.batch_decode(
    generated_ids[:, prompt_len:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```

### Run the model on a single T4 GPU in 4-bit

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    PaliGemmaForConditionalGeneration,
)

# Take one example (image + question) from the fine-tuning dataset
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Quantize the weights to 4-bit with bitsandbytes to reduce GPU memory use
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate.
# We sample with temperature 0.7 and a repetition penalty; for other
# decoding methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# generate() returns the prompt (image tokens + text prompt) followed by the
# new tokens, so we drop the first prompt_len positions before decoding.
prompt_len = inputs["input_ids"].shape[-1]
generated_text = processor.batch_decode(
    generated_ids[:, prompt_len:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
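The prompt-stripping step in the snippets above can be looked at in isolation. The sketch below assumes that `generate` returns each sequence as the prompt tokens followed by the newly generated tokens. The helper name `strip_prompt` and the toy token IDs are hypothetical, and plain Python lists stand in for tensors so the example runs without a GPU.

```python
def strip_prompt(generated_ids, prompt_len):
    """Drop the first prompt_len token IDs from every generated sequence."""
    return [seq[prompt_len:] for seq in generated_ids]


# Toy token IDs standing in for real model output (hypothetical values)
prompt_ids = [9, 9, 2, 41, 8]                 # image tokens + text prompt
generated_ids = [prompt_ids + [601, 602, 1]]  # prompt followed by new tokens

new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)  # [[601, 602, 1]]
```

Slicing by the input length avoids counting image and text tokens separately, which depends on how many special tokens the processor adds.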