first vision language model built off openai/gpt-oss-20b just dropped! π₯
InternVL3.5 comes with 32 models π€― pre-trained, fine-tuned, aligned in various sizes OpenGVLab/internvl35-68ac87bd52ebe953485927fb comes with gpt-oss or Qwen3 for LLM part ‡οΈ
We just released TRL v0.20 with major multimodal upgrades!
ποΈ VLM support for GRPO (highly requested by the community!) ποΈ New GSPO trainer (from @Qwen, released last week, VLM-ready) π New MPO trainer (multimodal by design, as in the paper)
Yet Another New Multimodal Fine-Tuning Recipe π₯§
π§βπ³ In this @HuggingFace Face Cookbook notebook, we demonstrate how to align a multimodal model (VLM) using Mixed Preference Optimization (MPO) using trl.
π‘ This recipe is powered by the new MPO support in trl, enabled through a recent upgrade to the DPO trainer!
We align the multimodal model using multiple optimization objectives (losses), guided by a preference dataset (chosen vs. rejected multimodal pairs).
Many VLMs claim to process hours of video. But can they follow the story?π€ Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!β³
The results are in, and they're revealing. Only Gemini 2.5 pro handles 1-hour-long videos. Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking pointsβnow the community can start fixing them.π
Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.