CurHarsh/sft_robotics_vlm_all_task_821_Llama-3.2-11B-Vision-Instruct Image-Text-to-Text • Updated 14 days ago • 56
CurHarsh/sft_robotics_vlm_all_task_821_Qwen2-VL-7B-Instruct Image-Text-to-Text • Updated 14 days ago • 23
CurHarsh/sft_robotics_vlm_all_task_821_llava-v1.6-mistral-7b-hf Image-Text-to-Text • Updated 14 days ago • 23
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Paper • 2411.04923 • Published Nov 7, 2024 • 21