---
library_name: transformers
license: mit
language:
- en
metrics:
- accuracy
---

# Model Card for Logic2Vision

Logic2Vision is a [LLaVA-1.5-13B](https://huggingface.co/llava-hf/llava-1.5-13b-hf) model finetuned on the [VisReas dataset](https://arxiv.org/abs/2403.10534) for complex visual reasoning tasks.

## Model Details

### Model Description

Logic2Vision is a [LLaVA-1.5-13B](https://huggingface.co/llava-hf/llava-1.5-13b-hf) model finetuned on the [VisReas dataset](https://arxiv.org/abs/2403.10534) for complex visual reasoning tasks. The model was finetuned with LoRA to generate Python pseudocode outputs that solve complex visual reasoning tasks.

- **Developed by:** Sangwu Lee and Syeda Akter
- **Model type:** Multimodal (Text + Image)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** [LLaVA-1.5-13B](https://huggingface.co/llava-hf/llava-1.5-13b-hf)

### Model Sources

- **Repository:** TBD
- **Paper:** [VisReas dataset](https://arxiv.org/abs/2403.10534)

## Uses

Inference works the same way as for [LLaVA-1.5-13B](https://huggingface.co/llava-hf/llava-1.5-13b-hf): the prompt supplies the question together with the pseudocode, and the model generates a step-by-step execution log that ends with the answer.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image

# Load and normalize the input image (fill in the path to your image).
image = Image.open("")
image = image.convert("RGB")

question = "What material attribute do the stove, the oven behind the white and dirty wall and the tea_kettle have in common?"
codes = """
selected_wall = select(wall)
filtered_wall = filter(selected_wall, ['white', 'dirty'])
related_oven = relate(oven, behind, o, filtered_wall)
selected_stove = select(stove)
selected_tea_kettle = select(tea_kettle)
materials = query_material(related_oven, selected_stove, selected_tea_kettle)
material = common(materials)
"""

# <image> is the standard LLaVA-1.5 placeholder token that the processor replaces with image features.
prompt = """
USER: <image>
Executes the code and logs the results step-by-step to provide an answer to the question.
Question
{question}
Code
{codes}
ASSISTANT: Log
"""
prompt = prompt.format(question=question, codes=codes)

model = LlavaForConditionalGeneration.from_pretrained("RE-N-Y/logic2vision", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
processor = AutoProcessor.from_pretrained("RE-N-Y/logic2vision")
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.padding_side = "left"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generate_ids, skip_special_tokens=True))
```

## Bias, Risks, and Limitations

TBD

## Training / Evaluation Details

The model was finetuned on 2 A6000 GPUs on CMU LTI's Babel cluster. Finetuning used LoRA (`r=8, alpha=16, dropout=0.05, task_type="CAUSAL_LM"`), with LoRA modules attached to `["q_proj", "v_proj"]`; a minimal configuration sketch is included at the end of this card. We used DDP for distributed training and BF16 to speed up training. For more details, check [our paper](https://arxiv.org/abs/2403.10534)!

### Results

TBD

## Citation

**BibTeX:**

```
@misc{akter2024visreas,
    title={VISREAS: Complex Visual Reasoning with Unanswerable Questions},
    author={Syeda Nahida Akter and Sangwu Lee and Yingshan Chang and Yonatan Bisk and Eric Nyberg},
    year={2024},
    eprint={2403.10534},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```

## Model Card Authors

TBD
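
## LoRA Configuration Sketch

For reference, below is a minimal sketch, using the `peft` library, of a LoRA setup matching the hyperparameters reported under Training / Evaluation Details (`r=8, alpha=16, dropout=0.05`, adapters on `q_proj`/`v_proj`, causal LM task). It is illustrative only and not the exact training script used for Logic2Vision; the base checkpoint and loading options are assumptions based on the model description above.

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Assumed base model (LLaVA-1.5-13B), loaded in BF16 as described in the training details.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-13b-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# LoRA hyperparameters reported in this card: r=8, alpha=16, dropout=0.05,
# with adapters attached to the attention query/value projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the LoRA adapter weights are trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: trainable parameter count should be a small fraction
```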