Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2411.11922

MutiModal_Paper

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Paper • 2410.13861 • Published Oct 17, 2024 • 56
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Paper • 2411.07975 • Published Nov 12, 2024 • 30
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Paper • 2411.10442 • Published Nov 15, 2024 • 76
Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published Nov 21, 2024 • 43

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 114
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Paper • 2411.06469 • Published Nov 10, 2024 • 17
Sharingan: Extract User Action Sequence from Desktop Recordings

Paper • 2411.08768 • Published Nov 13, 2024 • 10
AnimateAnything: Consistent and Controllable Animation for Video Generation

Paper • 2411.10836 • Published Nov 16, 2024 • 22

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 33
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 26
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 123
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Paper • 2405.08748 • Published May 14, 2024 • 24
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Paper • 2405.10300 • Published May 16, 2024 • 29
Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16, 2024 • 131
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Paper • 2405.11143 • Published May 20, 2024 • 38

LocalMamba: Visual State Space Model with Windowed Selective Scan

Paper • 2403.09338 • Published Mar 14, 2024 • 9
GiT: Towards Generalist Vision Transformer through Universal Language Interface

Paper • 2403.09394 • Published Mar 14, 2024 • 27
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Paper • 2402.19479 • Published Feb 29, 2024 • 34
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Paper • 2405.10300 • Published May 16, 2024 • 29

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs