Collections

Collections including paper arxiv:2505.03739

about 6 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

VITA-MLLM/VITA-Audio-Boost

10B • Updated May 15 • 40 • 3
VITA-MLLM/VITA-Audio-Balance

10B • Updated Apr 28 • 31 • 3
VITA-MLLM/VITA-Audio-Plus-Boost

11B • Updated May 15 • 76 • 3
VITA-MLLM/VITA-Audio-Plus-Vanilla

8B • Updated May 6 • 536 • 5

about 8 hours ago

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

Paper • 2405.18503 • Published May 28, 2024 • 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Paper • 2405.20289 • Published May 30, 2024 • 11
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

Paper • 2406.02897 • Published Jun 5, 2024 • 16
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

Paper • 2406.03344 • Published Jun 5, 2024 • 21

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

fishaudio/fish-speech-1.4

Text-to-Speech • Updated Nov 5, 2024 • 180 • 451
stepfun-ai/GOT-OCR2_0

Image-Text-to-Text • 0.7B • Updated Feb 4 • 51.9k • 1.51k
OuteAI/OuteTTS-0.1-350M

Text-to-Speech • 0.4B • Updated Apr 17 • 1.24k • 302
hexgrad/Kokoro-82M

Text-to-Speech • Updated Apr 10 • 2.49M • 4.97k
