smolagents can see! We just shipped vision support to smolagents. Agentic computers FTW!
You can now:
> let the agent fetch images dynamically (e.g. an agentic web browser)
> pass images when initializing the agent (e.g. chatting with documents, filling forms automatically, etc.) with only a few lines of code changed!
You can use transformers models locally (like Qwen2-VL) OR plug in your favorite multimodal inference provider (gpt-4o, Anthropic & co).
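Under the hood, hosted multimodal providers such as gpt-4o accept images inline in the chat payload as base64 data URLs. A minimal sketch of that message shape (the `image_to_message` helper and the placeholder PNG bytes are illustrative, not part of the smolagents API):

```python
import base64


def image_to_message(image_bytes: bytes, prompt: str) -> dict:
    # Encode raw image bytes as a base64 data URL, the format
    # OpenAI-compatible multimodal chat endpoints accept.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }


# Placeholder bytes stand in for a real screenshot or document scan.
msg = image_to_message(b"\x89PNG...", "What is in this image?")
```

An agent framework builds a message like this for each image it captures or is handed at init, then sends the list to the model endpoint.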
Multimodal
> ByteDance released Sa2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released Multimodal Textbook: 22k hours' worth of samples from instructional videos
> Dataset: SciCap, a captioning benchmark dataset for scientific documents, is released along with a challenge!
LLMs
> Microsoft released Phi-4, a SOTA open-source 14B language model
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-Preview
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of instruction-code pairs
> Dataset: the Qwen team is on a roll; they just released CodeElo, a dataset of code preferences
Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT Large
> KaLM is a new family of performant multilingual embedding models, MIT-licensed and built on Qwen2-0.5B
Image/Video Generation
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models that generate worlds from images, videos and text
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized version of OpenVid-1M with samples from OpenVid-1M
Others
> Prior Labs released TabPFNv2, the best tabular transformer, now out for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
Revolutionize Your Video Creation with Dokdo Multimodal AI: transform a single image into a stunning video with perfectly matched audio!
Superior Technology
> Advanced Flow Matching: smoother video transitions, surpassing Kling and Sora
> Intelligent Sound System: automatically generates fitting audio by analyzing the video's mood
> Multimodal Framework: advanced AI integrating image, text, and audio analysis

Outstanding Performance
> Ultra-High Resolution: 4K video quality with bfloat16 acceleration
> Real-Time Optimization: 3x faster processing with PyTorch GPU acceleration
> Smart Sound Matching: real-time audio effects based on scene transitions and motion

Exceptional Features
> Custom Audio Creation: natural soundtracks matching the video's tempo and rhythm
> Intelligent Watermarking: adaptive watermarks that adjust to video characteristics
> Multilingual Support: precise translation engine powered by Helsinki-NLP

Versatile Applications
> Social Media Marketing: create engaging shorts for Instagram and YouTube
> Product Promotion: dynamic promotional videos highlighting product features
> Educational Content: interactive learning materials with enhanced engagement
> Portfolio Enhancement: professional-grade videos showcasing your work

Experience the video revolution with Dokdo Multimodal, where anyone can create professional-quality content from a single image. Elevate your content with perfectly synchronized video and audio that captivates your audience!
Start creating stunning videos that stand out from the crowd - whether you're a marketer, educator, content creator, or business owner. Join the future of AI-powered video creation today!