Inference taking so long

#14
by J812 - opened

Hi, thank you for this great model.
I am trying to run predictions on a picture locally on my PC (CPU only, no GPU), and it has been running for about 1.5 hours with no prediction yet. Is this expected?

Llava Hugging Face org

Hey! It's recommended to run inference with these models on a GPU. On CPU this is expected: it's a 34B model, and llava-1.6 also processes more sub-images per input than other VLMs. From my experience, even a 7B model on CPU took around 30-40 minutes.

If you cannot fit the 34B model on a small GPU, I recommend taking advantage of the different optimization methods we support. For example, load the model in 4-bit with bitsandbytes (docs here) or use Flash Attention for long-context sequences. Hope this helps!
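
A minimal sketch of the 4-bit path, assuming a recent transformers + bitsandbytes install and the llava-hf/llava-v1.6-34b-hf checkpoint (swap in whichever LLaVA-1.6 checkpoint you are actually using, and double-check the prompt format against its model card):

```python
import torch
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)

model_id = "llava-hf/llava-v1.6-34b-hf"  # assumed checkpoint; adjust as needed

# 4-bit quantization config: weights are stored in 4-bit, compute runs in fp16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # place weights on the available GPU(s)
)

image = Image.open("example.jpg")  # hypothetical local image
# Prompt format as documented for the 34B checkpoint; verify for your model
prompt = (
    "<|im_start|>system\nAnswer the questions.<|im_end|>"
    "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>"
    "<|im_start|>assistant\n"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

Passing attn_implementation="flash_attention_2" to from_pretrained enables Flash Attention, provided the flash-attn package is installed and your GPU supports it.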

Llava Hugging Face org

It will take forever to infer if you use 4-bit inference with the 34B model on a T4, because the weights are quantized/dequantized on the fly, so I do not recommend that. To be honest, there's no free lunch when it comes to fitting 34B models into smaller hardware.
