---
license: mit
datasets:
- nielsr/docvqa_1200_examples_donut
language:
- en
library_name: transformers
pipeline_tag: visual-question-answering
---

### IDEFICS2-OCR

Fine-tuned from Idefics2-8b with fp16 weight updates on the nielsr/docvqa_1200_examples_donut dataset of document VQA pairs.

## Usage

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("smishr-18/Idefics2-OCR", do_image_splitting=False)

# Load the weights in 4-bit NF4 and run the computation in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = AutoModelForVision2Seq.from_pretrained(
    "smishr-18/Idefics2-OCR",
    quantization_config=bnb_config,
    device_map=device,
    low_cpu_mem_usage=True
)

image = load_image("https://images.pokemontcg.io/pl1/1_hires.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain."},
            {"type": "image"},
            {"type": "text", "text": "What is the reflex energy in the image?"}
        ]
    }
]

# Build the prompt from the chat template, then encode it together with the image
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text.strip()], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate texts
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# The reflex energy in the image is 70.
```
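
The decoded string typically contains the prompt as well as the model's reply. The following is a minimal sketch for keeping only the reply, assuming the decoded text uses the `User:` / `Assistant:` turn layout of the Idefics2 chat template. The helper name `extract_answer` and the sample string are illustrative and not part of the model's API.

```python
def extract_answer(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last assistant marker.

    Falls back to the whole (stripped) string when the marker is absent.
    The "Assistant:" marker is an assumption about the chat template output.
    """
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Illustrative decoded string, not real model output
sample = (
    "User: Explain. What is the reflex energy in the image? \n"
    "Assistant: The reflex energy in the image is 70."
)
print(extract_answer(sample))
```

In the usage snippet above this would be applied as `extract_answer(generated_texts[0])`.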

## Limitations

The model was fine-tuned on a T4 GPU with limited resources. It could be fine-tuned further, with more adapters, on devices where ```torch.cuda.get_device_capability()[0] >= 8```, that is, Ampere or newer GPUs.
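
The capability check mentioned above can be sketched as follows. This is an illustration only: the helper names `pick_compute_dtype` and `detect_capability_major` are hypothetical, and preferring bf16 on capability 8 and above is an assumption rather than a setting used for this model.

```python
def pick_compute_dtype(capability_major: int) -> str:
    """Map a CUDA compute-capability major version to a dtype name.

    Assumption: bf16 on Ampere or newer (major >= 8), fp16 otherwise.
    """
    return "bfloat16" if capability_major >= 8 else "float16"


def detect_capability_major() -> int:
    """Return the major compute capability of the current GPU.

    Returns 0 when PyTorch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.get_device_capability()[0]


print(pick_compute_dtype(7))  # a T4 reports capability 7.5, so this prints float16
print(pick_compute_dtype(detect_capability_major()))
```

The returned name can be turned into a dtype with `getattr(torch, name)` before passing it to `BitsAndBytesConfig(bnb_4bit_compute_dtype=...)`.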

- **Developed by:** Shubh Mishra, Aug 2024
- **Model Type:** VLM
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** HuggingFaceM4/idefics2-8b