---
license: gemma
datasets:
- damerajee/clean_hin_vqa
language:
- en
- hi
inference: false
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- visual-question-answering
- Bilingual
---

# ViLaH

ViLaH (Vision Language Hindi) is a 3-billion-parameter model fine-tuned from the base model [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224). It takes an image together with a Hindi or English text prompt and generates text in Hindi or English.

# Training Details

* Model Configuration: Fine-tuned for a single epoch on a V100 GPU.
* Training Duration: Approximately one day.
* Evaluation Loss: Reached an eval loss of 1.6384 at the end of the epoch.
* Training is ongoing with a higher-quality dataset.
* The model's performance may be limited by the small amount of training data and by the fact that it was trained for only one epoch.
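
As a rough way to read the evaluation loss reported above, it can be converted into a perplexity. This is only a sketch: it assumes the loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training in `transformers` but is not stated in this card. The helper name `perplexity_from_loss` is made up for illustration.

```python
import math


def perplexity_from_loss(mean_cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy_nats)


# Eval loss reported in the training details
eval_loss = 1.6384
print(f"perplexity ~ {perplexity_from_loss(eval_loss):.2f}")
```

Under that assumption the reported loss corresponds to a perplexity of roughly 5.15, meaning the model is on average about as uncertain as if it were choosing uniformly among five tokens at each step.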

# Dataset

The model was fine-tuned on a single dataset:

* [damerajee/clean_hin_vqa](https://huggingface.co/datasets/damerajee/clean_hin_vqa): This dataset was derived from [Lin-Chen/ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and filtered to include only images from the COCO dataset. The original dataset was translated and cleaned to ensure high-quality Hindi visual question answering content.

# How to Use

```python
!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade
```

### To run the model on a single T4 GPU in float16

```python
from datasets import load_dataset
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
import torch

# Take one image/question pair from the fine-tuning dataset as a test input
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Load the model weights in float16 on the current GPU
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to(base_model.device)
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200
# Autoregressively generate
# We sample with temperature 0.7 and a repetition penalty here; for other
# decoding methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# Next we turn each predicted token ID back into a string using the decode method
# generate() returns the prompt followed by the new tokens, so we chop off the
# prompt, which consists of image tokens and our text prompt
num_prompt_tokens = inputs["input_ids"].shape[1]
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
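
The prompt that gets chopped off before decoding is mostly image tokens. The sketch below estimates how many there are, assuming the 224-pixel base checkpoint splits the image into 14x14-pixel patches with one token per patch (the patch size is an assumption taken from the PaliGemma vision tower, not from this card). The function name `image_token_count` is hypothetical.

```python
def image_token_count(image_size: int, patch_size: int) -> int:
    """Number of image tokens when a square image is cut into square patches."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# 224 comes from the base model name; 14 is the assumed patch size
print(image_token_count(224, 14))
```

If the assumption holds, every prompt starts with 256 image tokens regardless of how short the question is, which is why the `input_ids` shape printed by the example is much longer than the text alone.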

### To run the model on a single T4 GPU in 4-bit

```python
from datasets import load_dataset
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
import torch

# Take one image/question pair from the fine-tuning dataset as a test input
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Quantize the weights to 4 bits with bitsandbytes while loading
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to(base_model.device)
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200
# Autoregressively generate
# We sample with temperature 0.7 and a repetition penalty here; for other
# decoding methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# Next we turn each predicted token ID back into a string using the decode method
# generate() returns the prompt followed by the new tokens, so we chop off the
# prompt, which consists of image tokens and our text prompt
num_prompt_tokens = inputs["input_ids"].shape[1]
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
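
To get a feel for the difference between the two loading modes, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It assumes exactly 3 billion parameters (the rounded figure from this card) and ignores activations, the generation cache, quantization metadata, and any layers that bitsandbytes leaves unquantized, so real usage will be higher. The helper `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights only, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


NUM_PARAMS = 3e9  # rounded parameter count stated in this card

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Both estimates fit within the 16 GB of a T4, but the 4-bit variant leaves considerably more headroom for longer generations or larger batches, at some possible cost in output quality.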