Score image-text similarity using CLIP or SigLIP models
Annotate and describe images with text prompts
Cobra: Extending Mamba to MLLM for Efficient Inference