Inference taking so long

#14
by J812 - opened

Hi, thank you for this great model.
I am trying to run predictions on a picture locally on my PC (CPU only, no GPU), and it has been running for about 1.5 hours with no prediction yet. Is this expected?

Llava Hugging Face org

Hey! It's recommended to run inference with these models on a GPU. On CPU this is expected: it's a 34B model, and llava-1.6 also processes more sub-images per input than other VLMs. From my experience, even a 7B model on CPU took around 30-40 minutes.

If you cannot fit the 34B model on a small GPU, I recommend taking advantage of the different optimization methods we support. For example, load the model in 4-bit with bitsandbytes (docs here) or use Flash Attention for long-context sequences. Hope this helps!
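
A minimal sketch of the 4-bit path, assuming a recent transformers + bitsandbytes install and the llava-hf/llava-v1.6-34b-hf checkpoint (swap in whichever LLaVA-1.6 checkpoint you are actually using, and double-check the prompt format against its model card):

```python
import torch
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)

model_id = "llava-hf/llava-v1.6-34b-hf"  # assumed checkpoint; adjust as needed

# 4-bit quantization config: weights are stored in 4-bit, compute runs in fp16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # place weights on the available GPU(s)
)

image = Image.open("example.jpg")  # hypothetical local image
# Prompt format as documented for the 34B checkpoint; verify for your model
prompt = (
    "<|im_start|>system\nAnswer the questions.<|im_end|>"
    "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>"
    "<|im_start|>assistant\n"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

Passing attn_implementation="flash_attention_2" to from_pretrained enables Flash Attention, provided the flash-attn package is installed and your GPU supports it.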

Llava Hugging Face org

It will take forever to infer if you use 4-bit inference with the 34B model on a T4, because the weights are quantized/dequantized on the fly, so I do not recommend that. To be honest, there's no free lunch when it comes to fitting 34B models into smaller hardware.
