Google just released PaliGemma 2 Mix: new versatile instruction vision language models 🔥
> Three new models: 3B, 10B, 28B with res 224, 448 💙 > Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything 🤯
Last year, their GOT-OCR 2.0 took the community by storm 🔥but many didn’t know they were also building some amazing models. Now, they’ve just dropped something huge on the hub!
📺 Step-Video-T2V: a 30B bilingual open video model that generates 204 frames (8-10s) at 540P resolution with high information density & consistency. stepfun-ai/stepvideo-t2v
Presenting a simple re-implementation of "Inference-time scaling diffusion models beyond denoising steps" by Ma et al.
I did the simplest random search strategy, but results can potentially be improved with better-guided search methods.
Supports Gemini 2 Flash & Qwen2.5 as verifiers for "LLMGrading" 🤗
The steps are simple:
For each round:
1> Starting by sampling 2 starting noises with different seeds. 2> Score the generations w.r.t a metric. 3> Obtain the best generation from the current round.
If you have more compute budget, go to the next search round. Scale the noise pool (2 ** search_round) and repeat 1 - 3.
This constitutes the random search method as done in the paper by Google DeepMind.
👀 Multimodal > OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context > AIDC released Ovis2 model family along with Ovis dataset, new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B), with video and OCR support > ColQwenStella-2b is a multilingual visual retrieval model that is sota in it's size > Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning, long context video understanding
💬 LLMs A lot of math models! > Open-R1 team released OpenR1-Math-220k large scale math reasoning dataset, along with Qwen2.5-220K-Math fine-tuned on the dataset, OpenR1-Qwen-7B > Nomic AI released new Nomic Embed multilingual retrieval model, a MoE with 500 params with 305M active params, outperforming other models > DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math > LIMO is a new fine-tune of Qwen2.5-32B-Instruct on Math
🗣️ Audio > Zonos-v0.1 is a new family of speech recognition models, which contains the model itself and embeddings
🖼️ Vision and Image Generation > We have ported DepthPro of Apple to transformers for your convenience! > illustrious-xl-v1.0 is a new illustration generation model
Ovis2 🔥 a multimodal LLM released by Alibaba AIDC team. AIDC-AI/ovis2-67ab36c7e497429034874464 ✨1B/2B/4B/8B/16B/34B ✨Strong CoT for deeper problem solving ✨Multilingual OCR – Expanded beyond English & Chinese, with better data extraction
🤖 Robotics > Pi0, first open-source foundation vision-language action model was released in Le Robot (Apache 2.0)
💬 LLMs > Groundbreaking: s1 is simpler approach to test-time scaling, the release comes with small s1K dataset of 1k question-reasoning trace pairs (from Gemini-Thinking Exp) they fine-tune Qwen2.5-32B-Instruct to get s1-32B, outperforming o1-preview on math 🤯 s1-32B and s1K is out! > Adyen released DABstep, a new benchmark along with it's leaderboard demo for agents doing data analysis > Krutrim released Krutrim-2 instruct, new 12B model based on NeMo12B trained and aligned on Indic languages, a new multilingual sentence embedding model (based on STSB-XLM-R), and a translation model for Indic languages
👀 Multimodal > PKU released Align-DS-V, a model aligned using their new technique called LLF for all modalities (image-text-audio), along with the dataset Align Anything > OLA-7B is a new any-to-any model by Tencent that can take text, image, video, audio data with context window of 32k tokens and output text and speech in English and Chinese > Krutrim released Chitrarth, a new vision language model for Indic languages and English
🖼️ Vision > BiRefNet_HR is a new higher resolution BiRefNet for background removal
🗣️ Audio > kyutai released Hibiki, it's a real-time speech-to-speech translation model 🤯 it's available for French-English translation > Krutrim released Dhwani, a new STT model for Indic languages > They also release a new dataset for STT-TTS
🖼️ Image Generation > Lumina released Lumina-Image-2.0, a 2B parameter-flow based DiT for text to image generation > Tencent released Hunyuan3D-2, a 3D asset generation model based on DiT and Hunyuan3D-Paint > boreal-hl-v1 is a new boring photorealistic image generation LoRA based on Hunyuan
The most difficult part was getting the model running in the first place, but the next steps are simple: ✂️ Implement sentence splitting, allowing for streamed responses 🌍 Multilingual support (only phonemization left)
Xwen 🔥 a series of open models based on Qwen2.5 models, developed by a brilliant research team of PhD students from the Chinese community. shenzhi-wang/xwen-chat-679e30ab1f4b90cfa7dbc49e ✨ 7B/72B ✨ Apache 2.0 ✨ Xwen-72B-Chat outperformed DeepSeek V3 on Arena Hard Auto
From ancient medical ethics to modern AI challenges, the journey of consent represents one of humanity's most fascinating ethical evolutions. In my latest blog post, I explore how we've moved from medical paternalism to a new frontier where AI capabilities force us to rethink consent.
The "consent gap" in AI is real: while we can approve initial data use, AI systems can generate countless unforeseen applications of our personal information. It's like signing a blank check without knowing all possible amounts that could be filled in.
Should we reimagine consent for the AI age? Perhaps we need dynamic consent systems that evolve alongside AI capabilities, similar to how healthcare transformed from physician-centered authority to patient autonomy.
Curious to hear your thoughts: how can we balance technological innovation with meaningful user sovereignty over digital identity?