merve posted an update about 11 hours ago
large AI labs dropped so many open models last week 🔥 don't miss out on them

β†’ Apple released on-device vision LMs apple/fastvlm-68ac97b9cd5cacefdd04872e & apple/mobileclip2-68ac947dcb035c54bcd20c47
β†’ OpenGVLab released InternVL3.5, 32 new vision LMs with one based on gpt-oss! (OS) OpenGVLab/internvl35-68ac87bd52ebe953485927fb
β†’ MSFT released a killer small TTS model (OS) microsoft/VibeVoice-1.5B

find more here: https://huggingface.co/collections/merve/august-29-releases-68b5a3754cfb8abf59e2b486

sergiopaniego posted an update 6 days ago
It's now possible to do end-to-end ML without leaving the @huggingface Hub, by combining TRL + HF Jobs + Trackio!!

🐑We just released a full guide explaining the process.

Go check it out!

πŸ“– Guide: https://huggingface.co/docs/trl/main/en/jobs_training

πŸ’‘ Reminder: HF Jobs is only available for Pro, Team, or Enterprise plans. Yet another reason to upgrade

merve posted an update 7 days ago
first vision language model built off openai/gpt-oss-20b just dropped! πŸ”₯

InternVL3.5 comes with 32 models 🤯 pre-trained, fine-tuned, and aligned checkpoints in various sizes OpenGVLab/internvl35-68ac87bd52ebe953485927fb
pick gpt-oss or Qwen3 as the LLM part

sergiopaniego posted an update 21 days ago
New Zero-Shot Object Detectors in transformers! πŸ₯½

We’ve added LLMDet and MM GroundingDINO, plus a demo Space to compare them with others πŸ–ΌοΈ

Play with it: ariG23498/zero-shot-od
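
As a rough sketch of the zero-shot detection flow these models plug into: the checkpoint below is the existing grounding-dino-tiny, not one of the new releases, and the post-processing kwargs vary a bit across transformers versions.

```python
# Zero-shot object detection sketch; swap model_id for an LLMDet or
# MM GroundingDINO checkpoint from the Hub.
import requests
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat. a remote control."  # lowercase queries separated by periods

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# kwarg names differ slightly across transformers versions
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, threshold=0.4, text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)
print(results[0])
```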

sergiopaniego posted an update 25 days ago
Latest TRL release brings major upgrades for multimodal alignment!

We dive into 3 new techniques to improve VLM post-training in our new blog:

πŸŒ‹ GRPO
🎞️ GSPO
πŸ™ MPO
βž• vLLM integration for online training w/ transformers backend\

🐑 Blog: https://huggingface.co/blog/trl-vlm-alignment
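
The vLLM piece boils down to a couple of config flags. A hedged sketch, with flag names as in recent TRL releases (double-check against the blog):

```python
# Sketch: enabling vLLM-backed generation for online training in TRL.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="vlm-grpo-vllm",
    use_vllm=True,          # generate rollouts with vLLM instead of model.generate
    vllm_mode="colocate",   # run vLLM in the training process (vs. a separate server)
)
```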

merve posted an update 26 days ago
GPT-4.1-mini level model right on your iPhone 🤯

openbmb/MiniCPM-V-4 is only 4B while surpassing GPT-4.1-mini in vision benchmarks πŸ”₯

allows commercial use as well!

merve posted an update 28 days ago
we're all sleeping on this OCR model rednote-hilab/dots.ocr πŸ”₯

dots.ocr is a new 3B model with SOTA performance, support for 100 languages & commercial use allowed! 🤯

a single e2e model to extract text, tables, formulas, and more from images into markdown 📝
try it: MohamedRashad/Dots-OCR
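
A loosely hedged sketch of how you might call it, assuming the checkpoint works with transformers' image-text-to-text pipeline via remote code; the model card documents the actual prompts and API.

```python
# Assumption: dots.ocr is usable through the image-text-to-text pipeline with
# trust_remote_code; check the model card for the exact prompt format.
from transformers import pipeline

ocr = pipeline("image-text-to-text", model="rednote-hilab/dots.ocr", trust_remote_code=True)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/scanned_page.png"},  # placeholder image
    {"type": "text", "text": "Extract the text, tables and formulas as markdown."},
]}]
print(ocr(text=messages, max_new_tokens=1024))
```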

sergiopaniego posted an update 28 days ago
Want to learn how to align a Vision Language Model (VLM) for reasoning using GRPO and TRL? πŸŒ‹

πŸ§‘β€πŸ³ We've got you covered!!

NEW multimodal post-training recipe to align a VLM using TRL in the @HuggingFace Cookbook.

Go to the recipe 👉 https://huggingface.co/learn/cookbook/fine_tuning_vlm_grpo_trl

Powered by the latest TRL v0.20 release, this recipe shows how to teach Qwen2.5-VL-3B-Instruct to reason over images πŸŒ‹
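
Condensed, the setup looks roughly like this. The reward below is a toy format check, not the recipe's actual rewards, and the dataset id is an assumption; see the recipe for the real thing.

```python
# Rough shape of GRPO VLM training with TRL; toy reward, assumed dataset id.
from datasets import load_dataset
from transformers import AutoModelForImageTextToText
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Toy reward: 1.0 when the completion wraps its reasoning in <think> tags.
    texts = [c if isinstance(c, str) else c[0]["content"] for c in completions]
    return [1.0 if "<think>" in t and "</think>" in t else 0.0 for t in texts]

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto"
)
dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train")  # assumed id

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward],
    args=GRPOConfig(output_dir="qwen2.5-vl-grpo", max_completion_length=512),
    train_dataset=dataset,
)
trainer.train()
```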

merve posted an update 28 days ago
massive releases and tons of FLUX.1 Krea LoRAs this past week!
here's some of the picks, find more models in the collection 🫡 merve/releases-august-2-6890c14248203522b7d0267f

LLMs πŸ’¬
> Tencent dropped tencent/Hunyuan-7B-Instruct
> Qwen released Qwen/Qwen3-Coder-30B-A3B-Instruct, 30B MoE with 3B active params for coding (OS)

vision/multimodal
> RedNote released rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text β†’ image+text) (OS)

sergiopaniego posted an update 29 days ago
Just included example scripts for aligning models using GSPO (including a VLM example) 🙆‍♂️🙆‍♂️

GSPO is the latest RL alignment algo by @Alibaba_Qwen and it's already supported in the latest TRL v0.20 release.

Super-easy-to-get-started example scripts below, GO run them!πŸ‘©β€πŸ’»πŸ‘©β€πŸ’»

πŸ§‘β€πŸŽ¨ Script: https://github.com/huggingface/trl/blob/main/examples/scripts/gspo.py
πŸ¦„ VLM script: https://github.com/huggingface/trl/blob/main/examples/scripts/gspo_vlm.py
🧩 More TRL examples: https://huggingface.co/docs/trl/main/en/example_overview
πŸ§™β€β™‚οΈ GSPO paper: Group Sequence Policy Optimization (2507.18071)

sergiopaniego posted an update about 1 month ago
Did you miss this? πŸ‘“

πŸ§™β€β™‚οΈvLLM + transformers integration just got upgraded with direct VLM support.

Select a VLM + model_impl=transformers and play via vLLM!
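
In practice that's one extra argument; a minimal sketch (the model id is just an example):

```python
# Serve a transformers-compatible VLM through vLLM's transformers backend.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", model_impl="transformers")
outputs = llm.generate(
    ["Describe your favorite animal in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```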

merve posted an update about 1 month ago
past week in open AI was insane 🔥 here's some of the picks, find more here merve/releases-july-25-688768ca47fe3693407e02d1

πŸ’¬ LLMs & VLMs
> Qwen/Qwen3-235B-A22B-Thinking-2507 had a new update (OS)
> Qwen/Qwen3-Coder-480B-A35B-Instruct is out with 480B total / 35B active params 🤯 (OS)
> AllenAI dropped an update to allenai/olmOCR-7B-0725 πŸ“
> InternLM released internlm/Intern-S1 - 235B Qwen3 MoE + 6B InternViT encoder (OS)
> OmniSVG/OmniSVG is a new SVG generation VLM (OS)

πŸ–ΌοΈ image/video/3D generation
> WanAI released Wan2.2 series - both T2V and I2V 14B models for high-quality video generation (OS) multimodalart/wan-22-688767e313337b434ed55112
> Tencent dropped tencent/HunyuanWorld-1 - image-to-3D scene generation model

sergiopaniego posted an update about 1 month ago
We just released TRL v0.20 with major multimodal upgrades!

πŸ‘οΈ VLM support for GRPO (highly requested by the community!)
🎞️ New GSPO trainer (from @Qwen, released last week, VLM-ready)
πŸ™ New MPO trainer (multimodal by design, as in the paper)

πŸ“ Full release notes here: https://github.com/huggingface/trl/releases/tag/v0.20.0

merve posted an update about 1 month ago
🀯 241B VLM with apache-2.0 license internlm/Intern-S1

internlm released Intern-S1: multimodal reasoning model based on 235B MoE Qwen3 and 6B InternViT 😍

benchmarks look great (πŸ‘‘ best model βœ… best open model)

sergiopaniego posted an update about 1 month ago
Yet Another New Multimodal Fine-Tuning Recipe πŸ₯§

πŸ§‘β€πŸ³ In this @HuggingFace Face Cookbook notebook, we demonstrate how to align a multimodal model (VLM) using Mixed Preference Optimization (MPO) using trl.

πŸ’‘ This recipe is powered by the new MPO support in trl, enabled through a recent upgrade to the DPO trainer!

We align the multimodal model using multiple optimization objectives (losses), guided by a preference dataset (chosen vs. rejected multimodal pairs).

Check it out! ➑️ https://huggingface.co/learn/cookbook/fine_tuning_vlm_mpo
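
The core of the MPO setup is a weighted mix of DPO losses; roughly, with weights as in the MPO paper (double-check the recipe for the full config):

```python
# MPO = mixture of preference (sigmoid), quality (BCO) and generation (SFT)
# losses in TRL's DPO trainer; weights follow the MPO paper.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="vlm-mpo",
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)
```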

andito posted an update about 1 month ago
Many VLMs claim to process hours of video. But can they follow the story? 🤔
Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳

We test three skills that matter for real-world use:
πŸ”Ž Localized Retrieval: Find a specific action.
🧩 Information Synthesis: Piece together scattered clues.
πŸƒ Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).

The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos.
Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking pointsβ€”now the community can start fixing them.πŸ“ˆ

Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.

πŸ“– Blog:
https://huggingface.co/blog/timescope-video-lmm-benchmark
πŸ‘©β€πŸ’» Leaderboard & Demo: Apollo-LMMs/TimeScope
πŸ“Š Dataset: Apollo-LMMs/TimeScope
βš™οΈ Eval Code: https://github.com/EvolvingLMMs-Lab/lmms-eval