---
license: gemma
datasets:
- damerajee/clean_hin_vqa
language:
- en
- hi
inference: false
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- visual-question-answering
- Bilingual 
---

# ViLaH 
ViLaH (Vision Language Hindi) is a 3-billion-parameter model fine-tuned from the base model [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224). It takes an image together with Hindi or English text as input and generates Hindi or English text as output.


# Training Details 
* Model Configuration: Fine-tuned for a single epoch on a V100 GPU.
* Training Duration: Approximately one day.
* Evaluation Loss: Achieved an eval loss of 1.6384 at the end of the epoch.
* The model is still being trained on a better-quality dataset.
* The model's performance may be limited by the small amount of training data and by the fact that it was trained for only one epoch.
  
# Dataset 
The model was fine-tuned on a single dataset:
* [damerajee/clean_hin_vqa](https://huggingface.co/datasets/damerajee/clean_hin_vqa): This dataset was derived from [Lin-Chen/ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and filtered to include only images from the COCO dataset. The original dataset was translated and cleaned to ensure high-quality Hindi visual question answering content.
  
# How to Use 

```python
!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade
```
### To Run the model on a single T4 GPU in Float16
```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

dataset = load_dataset("damerajee/clean_hin_vqa",split='train')
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example['question']

device_index = torch.cuda.current_device()
print("device_index:",device_index)
base_model = PaliGemmaForConditionalGeneration.from_pretrained("BhashaAI/ViLaH",device_map={"": device_index},torch_dtype=torch.float16,low_cpu_mem_usage=True)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k,v in inputs.items():
  print(k,v.shape)

MAX_LENGTH = 200
# Autoregressively generate
# We sample with temperature 0.7 and a repetition penalty here; for other decoding methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH, temperature=0.7, repetition_penalty=2.0, do_sample=True)

# Next we turn each predicted token ID back into a string using the decode method
# We chop off the prompt, which consists of image tokens and our text prompt.
# generate() returns the prompt followed by the new tokens, so the prompt length
# is the length of the processed input sequence.
num_prompt_tokens = inputs["input_ids"].shape[-1]
generated_text = processor.batch_decode(generated_ids[:, num_prompt_tokens:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
generated_text

```
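
The prompt-stripping step above works because `generate` returns the prompt tokens followed by the newly generated ones. The toy sketch below illustrates that slicing with plain Python lists, so it runs without a GPU or the model. The helper name `strip_prompt` and the token IDs are made up for illustration and are not part of the ViLaH or `transformers` API.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` token IDs from every generated sequence."""
    return [seq[prompt_length:] for seq in sequences]


# Hypothetical token IDs: three "image" placeholder tokens, then two text tokens.
prompt_ids = [9, 9, 9, 101, 102]

# A generated sequence is the prompt followed by the newly generated tokens.
generated = [prompt_ids + [201, 202, 203]]

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [[201, 202, 203]]
```

With real tensors, the equivalent slice is `generated_ids[:, prompt_length:]`, where `prompt_length` is the number of tokens the processor produced for the prompt.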
### To Run the model on a single T4 GPU in 4-bit
```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration

dataset = load_dataset("damerajee/clean_hin_vqa",split='train')
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example['question']

device_index = torch.cuda.current_device()
print("device_index:",device_index)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained("BhashaAI/ViLaH",device_map={"": device_index},quantization_config=quantization_config,torch_dtype=torch.float16,low_cpu_mem_usage=True)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k,v in inputs.items():
  print(k,v.shape)

MAX_LENGTH = 200
# Autoregressively generate
# We sample with temperature 0.7 and a repetition penalty here; for other decoding methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH, temperature=0.7, repetition_penalty=2.0, do_sample=True)

# Next we turn each predicted token ID back into a string using the decode method
# We chop off the prompt, which consists of image tokens and our text prompt.
# generate() returns the prompt followed by the new tokens, so the prompt length
# is the length of the processed input sequence.
num_prompt_tokens = inputs["input_ids"].shape[-1]
generated_text = processor.batch_decode(generated_ids[:, num_prompt_tokens:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
generated_text
```
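
As a rough guide to choosing between the two loading modes above, the sketch below estimates how much memory the weights alone would take at each precision. It assumes exactly 3 billion parameters (the rounded figure from the model description) and that every parameter is stored at the stated bit width. Real usage will be higher, since activations, the generation cache, and quantization metadata are ignored, and a 4-bit configuration may keep some modules in higher precision. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: "3 billion parameters", rounded as in the model description.
NUM_PARAMS = 3e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{int4_gib:.1f} GiB")
```

Under these assumptions the float16 weights come to roughly 5.6 GiB and the 4-bit weights to roughly 1.4 GiB, which suggests why both variants can fit on a 16 GB T4, with the 4-bit variant leaving more headroom.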