# Model Card for llava-phi-2-3b

This is a multimodal implementation of the Phi-2 model, inspired by LLaVA-Phi.
## Model Details
- LLM Backbone: Phi2
- Vision Tower: clip-vit-large-patch14-336
- Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
- Finetuning Dataset: the Instruct-150k dataset, based on COCO
- Finetuned Model: marianna13/llava-phi-2-3b
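The checkpoint above can likely be driven through the `transformers` LLaVA classes. The sketch below is a minimal, hedged example: the model id comes from this card, but the prompt template and class compatibility are assumptions; check the repository's chat template before relying on it.

```python
def build_prompt(question: str) -> str:
    # Generic single-turn LLaVA-style prompt. The exact template for this
    # checkpoint is an assumption -- verify against the model repository.
    return f"USER: <image>\n{question} ASSISTANT:"

def describe(image_path: str, question: str, max_new_tokens: int = 64) -> str:
    # Heavy dependencies are imported lazily so the prompt helper above
    # stays usable without transformers/PIL installed.
    from transformers import AutoProcessor, LlavaForConditionalGeneration
    from PIL import Image

    model_id = "marianna13/llava-phi-2-3b"  # from this model card
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path)
    inputs = processor(text=build_prompt(question), images=image,
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Usage would look like `describe("photo.jpg", "What is in this image?")`; generation settings (sampling, temperature) can be passed through `model.generate` as usual.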
## Model Sources
- Original Repository: LLaVA-Phi
- Paper: LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
- Demo: Demo Link