
AV LLMs
A collection of Audio, Video and Visual LLMs.
- Text-to-Speech • Updated • 468
- 1.09k
OpenVoice
🤗
dataautogpt3/ProteusV0.3
Text-to-Image • Updated • 231k • 94
ByteDance/SDXL-Lightning
Text-to-Image • Updated • 131k • 2.08k
openai/whisper-large-v3
Automatic Speech Recognition • 2B • Updated • 4.51M • 4.84k
stabilityai/TripoSR
Image-to-3D • Updated • 15.2k • 554
Efficient-Large-Model/VILA-7b
Text Generation • 7B • Updated • 47 • 26
google/paligemma-3b-pt-896
Image-Text-to-Text • 3B • Updated • 2.12k • 119
microsoft/Phi-3-vision-128k-instruct
Text Generation • 4B • Updated • 54.2k • 964
stabilityai/stable-audio-open-1.0
Text-to-Audio • Updated • 21.1k • 1.29k
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 42
aiola/whisper-medusa-v1
2B • Updated • 13 • 178
merve/idefics3llama-vqav2
Updated • 8
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 628k • 4.23k
- 115
Llama3.1 S V0.2 Checkpoint 2024 08 20
😻Convert text to audio and vice versa
gpt-omni/mini-omni
Text-to-Speech • Updated • 428
fishaudio/fish-speech-1.4
Text-to-Speech • Updated • 179 • 451
- 178
Tonic's GOT OCR
📲GOT - OCR (from : UCAS, Beijing)
stepfun-ai/GOT-OCR2_0
Image-Text-to-Text • 0.7B • Updated • 51.8k • 1.51k
apple/coreml-sam2-large
Mask Generation • Updated • 55 • 28
coreml-projects/sam-2-studio
Updated • 25
mistralai/Pixtral-12B-2409
Updated • 1.97k • 660
allenai/Molmo-72B-0924
Image-Text-to-Text • 73B • Updated • 4.01k • 291
openai/whisper-large-v3-turbo
Automatic Speech Recognition • 0.8B • Updated • 3.18M • 2.57k
Revai/reverb-asr
Automatic Speech Recognition • Updated • 11 • 87
- 359
GOT Online
💬Extract text from images using various OCR modes
facebook/vfusion3d
Image-to-3D • 0.5B • Updated • 79 • 65
facebook/cotracker
Updated • 1.4k • 35
rhymes-ai/Aria
Image-Text-to-Text • 25B • Updated • 55k • 633
SWivid/F5-TTS
Text-to-Speech • Updated • 925k • 1.1k
- 64
Ichigo Llama3.1 S Instruct
🏢Generate text from audio recordings
kyutai/moshiko-mlx-q4
Updated • 231 • 28
kyutai/moshiko-mlx-q8
Updated • 135 • 5
- 120
Open VLM Video Leaderboard
🌎VLMEvalKit Eval Results in video understanding benchmark
jimmycarter/LibreFLUX
Text-to-Image • Updated • 51 • 169
microsoft/OmniParser
Image-Text-to-Text • Updated • 1.69k • 1.69k
- 321
Aya Models
🌍Interact with the Aya family of models.
CohereLabs/aya-expanse-32b
Text Generation • 32B • Updated • 7.64k • 267
stabilityai/stable-diffusion-3.5-medium
Text-to-Image • Updated • 133k • 820
OuteAI/OuteTTS-0.1-350M
Text-to-Speech • 0.4B • Updated • 1.25k • 302
vidore/colpali
Visual Document Retrieval • Updated • 5.14k • 459
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 27.2k • 110
si-pbc/hertz-dev
Audio-to-Audio • Updated • 213
- 38
Talk To Ultravox
⚡Talk to Fixie.ai's Ultravox with WebRTC ⚡️
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 128
Xkev/Llama-3.2V-11B-cot
Image-Text-to-Text • 11B • Updated • 3.5k • 155
google/paligemma-3b-pt-224
Image-Text-to-Text • 3B • Updated • 30k • 351
apple/coreml-mobileclip
Updated • 537 • 47
InstantX/InstantIR
Image-to-Image • Updated • 179
- 86
InstantIR
🖼diffusion-based Image Restoration model
- 168
Flux IP Adapter
🖼Prompt with Images in flux[dev]
- 38
Image Preferences - Argilla annotation space
🖼A community project to create an image preferences dataset.
fishaudio/fish-speech-1.5
Text-to-Speech • Updated • 1.81k • 621
meta-llama/Llama-3.3-70B-Instruct
Text Generation • 71B • Updated • 521k • 2.48k
- 48
Paligemma2 Vqav2
🐨PaliGemma2 LoRA finetuned on VQAv2
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 119
fancyfeast/llama-joycaption-alpha-two-hf-llava
8B • Updated • 10.8k • 189
taohu/mask
Updated • 5
[MASK] is All You Need
Paper • 2412.06787 • Published • 2
- 870
Open VLM Leaderboard
🌎VLMEvalKit Evaluation Results Collection
microsoft/LLM2CLIP-Llama3.2-1B-EVA02-L-14-336
Zero-Shot Image Classification • Updated • 10
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 40
Generative Powers of Ten
Paper • 2312.02149 • Published • 8
- 24
StoryStar
💬Fantasy story generator
GoodiesHere/Apollo-LMMs-Apollo-7B-t32
Video-Text-to-Text • Updated • 11 • 56
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • 8B • Updated • 438k • 1.22k
XiaoduoAILab/Xmodel_VLM
Text Generation • 2B • Updated • 778 • 13
nvidia/Cosmos-1.0-Diffusion-14B-Text2World
Updated • 2.86k • 59
nvidia/Cosmos-1.0-Autoregressive-12B
Updated • 33 • 30
nvidia/Cosmos-1.0-Autoregressive-13B-Video2World
Updated • 33 • 31
nvidia/Cosmos-1.0-Diffusion-7B-Text2World
Text-to-Video • Updated • 17k • 226
nvidia/Cosmos-1.0-Diffusion-14B-Video2World
Updated • 361 • 56
- 455
Stable Point-Aware 3D
⚡Create 3D models from single images
hexgrad/Kokoro-82M
Text-to-Speech • Updated • 2.41M • 4.97k
- 2.9k
Kokoro TTS
❤Upgraded to v1.0!
openbmb/MiniCPM-o-2_6
Any-to-Any • 9B • Updated • 205k • 1.23k
- 424
TTS Spaces Arena
🤗Blind vote on HF TTS models!
google/paligemma2-10b-pt-896
Image-Text-to-Text • 10B • Updated • 2.25k • 31
NovaSky-AI/Sky-T1-32B-Preview
Text Generation • 33B • Updated • 1.45k • 549
MiniMaxAI/MiniMax-VL-01
Image-Text-to-Text • 456B • Updated • 99k • 277
- 65
SmolVLM
📊Generate descriptions from images and text prompts
HKUSTAudio/Llasa-3B
Text-to-Speech • 4B • Updated • 1.85k • 514
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text • 0.5B • Updated • 131k • 170
deepseek-ai/Janus-Pro-7B
Any-to-Any • Updated • 173k • 3.48k
- 309
Kokoro TTS Zero
🎴✨[With v1.0.0] Accelerated TTS on Kokoro-82M
kyutai/hibiki-2b-mlx-bf16
Translation • Updated • 41 • 21
kyutai/hibiki-2b-pytorch-bf16
Translation • Updated • 940 • 55
ARTPARK-IISc/Vaani
Updated • 4.61k • 63
Zyphra/Zonos-v0.1-hybrid
Text-to-Speech • Updated • 54.9k • 1.1k
Zyphra/Zonos-v0.1-transformer
Text-to-Speech • Updated • 28k • 411
microsoft/OmniParser-v2.0
Updated • 723 • 1.28k
- 91
Paligemma2 Mix
🌖Generate text and segment images using PaliGemma 2
google/paligemma2-3b-mix-448
Image-Text-to-Text • 3B • Updated • 4.46k • 48
google/paligemma2-3b-mix-224
Image-Text-to-Text • 3B • Updated • 12.6k • 34
google/paligemma2-28b-mix-224
Image-Text-to-Text • 28B • Updated • 2.29k • 4
google/paligemma2-28b-mix-448
Image-Text-to-Text • 28B • Updated • 2.22k • 27
google/paligemma2-10b-mix-224
Image-Text-to-Text • 10B • Updated • 2.38k • 9
google/paligemma2-10b-mix-448
Image-Text-to-Text • 10B • Updated • 3.97k • 31
stepfun-ai/stepvideo-t2v
Text-to-Video • Updated • 85 • 469
stepfun-ai/stepvideo-t2v-turbo
Updated • 96
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Paper • 2502.10248 • Published • 56
HuggingFaceTB/SmolVLM2-2.2B-Instruct
Image-Text-to-Text • 2B • Updated • 106k • 250
nvidia/canary-1b
Automatic Speech Recognition • Updated • 3.35k • 444
Wan-AI/Wan2.1-I2V-14B-720P
Image-to-Video • Updated • 13.7k • 516
fastrtc/kokoro-onnx
Updated • 11
- 2
Fastphone
🐠Download and run an app from a Hugging Face repository
microsoft/Phi-4-multimodal-instruct
Automatic Speech Recognition • 6B • Updated • 349k • 1.48k
microsoft/Magma-8B
Image-Text-to-Text • 9B • Updated • 2.36k • 407
- 45
Magma UI
📚Magma-8B model for UI Agents
- 625
Di♪♪Rhythm
🎶Blazingly Fast and Embarrassingly Simple Song Generation
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
Paper • 2503.01183 • Published • 28
ASLP-lab/DiffRhythm-vae
Updated • 41
ASLP-lab/DiffRhythm-base
Updated • 39 • 168
Large Language Diffusion Models
Paper • 2502.09992 • Published • 123
GSAI-ML/LLaDA-8B-Instruct
Text Generation • 8B • Updated • 298k • 313
unsloth/gemma-3-12b-pt
Image-Text-to-Text • 12B • Updated • 1.64k • 5
google/gemma-3-27b-it
Image-Text-to-Text • 27B • Updated • 669k • 1.58k
sesame/csm-1b
Text-to-Speech • Updated • 30.6k • 2.2k
unsloth/gemma-3-27b-it-GGUF
Image-Text-to-Text • 27B • Updated • 54.3k • 152
ds4sd/SmolDocling-256M-preview
Image-Text-to-Text • 0.3B • Updated • 28.5k • 1.56k
starvector/starvector-8b-im2svg
Text Generation • 8B • Updated • 1.66k • 500
starvector/starvector-1b-im2svg
Text Generation • 1B • Updated • 1.5k • 173
Tokenize Image as a Set
Paper • 2503.16425 • Published • 16
kyutai/moshika-vis-pytorch-bf16
Updated • 56
kyutai/Babillage
Viewer • Updated • 465k • 368 • 9
ByteDance/InfiniteYou
Text-to-Image • Updated • 3.33k • 629
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper • 2503.16418 • Published • 36
openfree/flux-chatgpt-ghibli-lora
Text-to-Image • Updated • 1.64k • 317
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
Paper • 2504.00595 • Published • 37
weizhiwang/Open-Qwen2VL
Image-Text-to-Text • Updated • 52 • 19
ostris/Flex.1-alpha-Redux
Text-to-Image • Updated • 2.02k • 114
unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit
Image-Text-to-Text • 57B • Updated • 3.78k • 80
unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-8bit
Image-Text-to-Text • 109B • Updated • 1.17k • 9
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 199
canopylabs/3b-hi-ft-research_release
Text-to-Speech • 3B • Updated • 507 • 22
canopylabs/3b-es_it-ft-research_release
Text-to-Speech • 3B • Updated • 1.85k • 15
nvidia/C-RADIOv2-g
Image Feature Extraction • 1B • Updated • 84 • 12
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper • 2504.10479 • Published • 281
OpenGVLab/InternVL3-1B
Image-Text-to-Text • 0.9B • Updated • 87.5k • 72
OpenGVLab/InternVL3-78B
Image-Text-to-Text • 78B • Updated • 426k • 216
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Paper • 2504.05303 • Published • 5
- 1.66k
Dia 1.6B
👯Generate realistic dialogue from a script, using Dia!
nari-labs/Dia-1.6B
Text-to-Speech • Updated • 112k • 2.72k
Describe Anything: Detailed Localized Image and Video Captioning
Paper • 2504.16072 • Published • 63
nvidia/DAM-3B-Self-Contained
Image-Text-to-Text • Updated • 7.3k • 23
nvidia/DAM-3B-Video
Image-Text-to-Text • Updated • 776 • 55
nvidia/DAM-3B
Image-Text-to-Text • Updated • 6.08k • 125
Qwen/Qwen2.5-Omni-3B
Any-to-Any • 6B • Updated • 332k • 280
MMaDA: Multimodal Large Diffusion Language Models
Paper • 2505.15809 • Published • 96
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper • 2505.18129 • Published • 60
- 115
PlayDiffusion
🎨Generate modified audio from text and voice
lerobot/smolvla_base
Robotics • Updated • 13.7k • 245
stockmark/Stockmark-2-VL-100B-beta
Image-Text-to-Text • 96B • Updated • 1.21k • 21
Qwen/Qwen2.5-Omni-7B
Any-to-Any • 11B • Updated • 171k • 1.77k
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 165
- 1.39k
Chatterbox TTS
🍿Expressive Zeroshot TTS
ResembleAI/chatterbox
Text-to-Speech • Updated • 882k • 1.04k
PrunaAI/FLUX.1-schnell-smashed
Text-to-Image • Updated • 5
ByteDance/Dolphin
Image-Text-to-Text • 0.4B • Updated • 87k • 472
nanonets/Nanonets-OCR-s
Image-Text-to-Text • 4B • Updated • 314k • 1.49k
- 34
Nanonets Ocr S
👁https://nanonets.com/research/nanonets-ocr-s/
calcuis/cosmos-predict2-gguf
Text-to-Image • 14B • Updated • 3k • 30
Arrexel/pattern-diffusion
Text-to-Image • Updated • 797 • 103
numind/NuMarkdown-8B-Thinking
Image-to-Text • 8B • Updated • 8.94k • 195
Qwen/Qwen-Image
Text-to-Image • Updated • 185k • 1.95k
rednote-hilab/dots.ocr
Image-Text-to-Text • 3B • Updated • 184k • 878
Runware/Qwen-Image-Edit
Image-to-Image • Updated • 160 • 15
- 442
Qwen Image Edit
✒Edit images based on user instructions
Qwen/Qwen-Image-Edit
Image-to-Image • Updated • 77.6k • 1.61k
zju-community/matchanything_eloftr
0.0B • Updated • 1.38k • 69
- 214
MatchAnything
🏢Find matching images based on input criteria
microsoft/VibeVoice-1.5B
Text-to-Speech • 3B • Updated • 107k • 1.22k
bytedance-research/USO
Text-to-Image • Updated • 228 • 109
- 252
FastVLM WebGPU
🍎Real-time video captioning powered by FastVLM
onnx-community/FastVLM-0.5B-ONNX
Image-Text-to-Text • Updated • 5.54k • 39
apple/FastVLM-0.5B
Text Generation • 0.8B • Updated • 3.11k • 121