- microsoft/VibeVoice-1.5B • Text-to-Speech • 3B • 107k downloads • 1.21k likes
- OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview • Image-Text-to-Text • 0.4B • 5.83k downloads • 56 likes
- apple/FastVLM-1.5B • Text Generation • 2B • 786 downloads • 21 likes
- stepfun-ai/Step-Audio-2-mini • 8B • 737 downloads • 152 likes

- openai/gpt-oss-120b • Text Generation • 120B • 2.43M downloads • 3.69k likes
- openai/gpt-oss-20b • Text Generation • 22B • 9.04M downloads • 3.36k likes
- openai/BrowseCompLongContext • Dataset • 295 rows • 3.39k downloads • 40 likes
- baichuan-inc/Baichuan-M2-32B • Text Generation • 33B • 99.9k downloads • 85 likes

- Wan-AI/Wan2.2-I2V-A14B • Image-to-Video • 10.7k downloads • 277 likes
- allenai/olmOCR-7B-0725 • Image-Text-to-Text • 8B • 11.7k downloads • 53 likes
- Wan-AI/Wan2.2-T2V-A14B • Text-to-Video • 12.9k downloads • 261 likes
- Qwen/Qwen3-235B-A22B-Thinking-2507 • Text Generation • 235B • 44k downloads • 334 likes

- HuggingFaceTB/SmolLM3-3B • Text Generation • 3B • 163k downloads • 678 likes
- moonshotai/Kimi-K2-Instruct • Text Generation • 410k downloads • 2.12k likes
- fal/Realism-Detailer-Kontext-Dev-LoRA • Image-to-Image • 1.51k downloads • 45 likes
- Alibaba-NLP/WebSailor-3B • 3B • 169 downloads • 71 likes
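
The text-generation checkpoints in this group can be tried with the transformers pipeline. A minimal sketch below uses SmolLM3-3B (the smallest of the group); the prompt is a placeholder and device_map="auto" assumes accelerate is installed.

```python
from transformers import pipeline

# Minimal text-generation sketch with a chat-style prompt (placeholder content).
pipe = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM3-3B",
    torch_dtype="auto",
    device_map="auto",  # requires accelerate
)

messages = [{"role": "user", "content": "Explain mixture-of-experts models in two sentences."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # last message is the assistant's reply
```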

- nari-labs/Dia-1.6B-0626 • Text-to-Speech • 2B • 52.2k downloads • 93 likes
- google/gemma-3n-E4B-it • Image-Text-to-Text • 8B • 140k downloads • 744 likes
- ByteDance/XVerse • Text-to-Image • 194 downloads • 90 likes
- nvidia/llama-nemoretriever-colembed-3b-v1 • Visual Document Retrieval • 4B • 2.17k downloads • 39 likes

- opendatalab/OmniDocBench • Dataset • 984 rows • 5.28k downloads • 34 likes
- nanonets/Nanonets-OCR-s • Image-Text-to-Text • 4B • 314k downloads • 1.49k likes
- echo840/MonkeyOCR • Image-Text-to-Text • 1.29k downloads • 505 likes
- OCR2 • Space (Running on Zero, MCP) • 126 likes • nanonets ocr / smoldocling / monkey ocr / typhoon ocr

- ByteDance-Seed/BAGEL-7B-MoT • Any-to-Any • 15B • 760 downloads • 1.12k likes
- mistralai/Devstral-Small-2505 • 24B • 13.2k downloads • 844 likes
- ByteDance/Dolphin • Image-Text-to-Text • 0.4B • 87k downloads • 471 likes
- moondream/moondream-2b-2025-04-14-4bit • Image-Text-to-Text • 1B • 18.7k downloads • 53 likes

- moonshotai/Kimi-VL-A3B-Thinking • Image-Text-to-Text • 16B • 4.94k downloads • 438 likes
- agentica-org/DeepCoder-14B-Preview • Text Generation • 15B • 45.4k downloads • 674 likes
- HiDream-ai/HiDream-I1-Full • Text-to-Image • 110k downloads • 962 likes
- OpenGVLab/InternVL3-78B • Image-Text-to-Text • 78B • 426k downloads • 216 likes

- OpenGVLab/InternVideo2_5_Chat_8B • Video-Text-to-Text • 8B • 29.3k downloads • 82 likes
- AIDC-AI/Ovis2-34B • Image-Text-to-Text • 35B • 17.9k downloads • 151 likes
- open-r1/OpenR1-Qwen-7B • Text Generation • 8B • 1.14k downloads • 53 likes
- nomic-ai/nomic-embed-text-v2-moe • Sentence Similarity • 0.5B • 237k downloads • 426 likes
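
For the sentence-similarity entry (nomic-embed-text-v2-moe), a minimal sentence-transformers sketch follows. It assumes the checkpoint ships a sentence-transformers config; the architecture is custom, so trust_remote_code=True is passed, the example sentences are placeholders, and the model card may additionally recommend query/passage prefixes.

```python
from sentence_transformers import SentenceTransformer

# Sketch only: custom MoE architecture, so remote code is trusted; see the model
# card for any recommended task prefixes.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

sentences = [
    "A kitten is sleeping on the sofa.",
    "A small cat naps on the couch.",
    "Quarterly revenue grew by 12%.",
]
embeddings = model.encode(sentences)
print(model.similarity(embeddings, embeddings))  # pairwise cosine similarity matrix
```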

- allenai/Llama-3.1-Tulu-3-405B • Text Generation • 406B • 81 downloads • 107 likes
- Qwen/Qwen2.5-VL-72B-Instruct • Image-Text-to-Text • 73B • 885k downloads • 534 likes
- mistralai/Mistral-Small-24B-Instruct-2501 • 24B • 193k downloads • 939 likes
- deepseek-ai/Janus-Pro-7B • Any-to-Any • 173k downloads • 3.48k likes

- ostris/Flex.1-alpha • Text-to-Image • 1.11k downloads • 473 likes
- Qwen/Qwen2.5-Math-PRM-72B • Text Classification • 73B • 1.36k downloads • 72 likes
- HuggingFaceTB/SmolVLM-500M-Instruct • Image-Text-to-Text • 0.5B • 131k downloads • 170 likes
- deepseek-ai/DeepSeek-R1 • Text Generation • 685B • 472k downloads • 12.7k likes

- HuggingFaceTB/SmolVLM-Instruct • Image-Text-to-Text • 2B • 58.1k downloads • 539 likes
- Qwen/QwQ-32B-Preview • Text Generation • 33B • 110k downloads • 1.74k likes
- nvidia/Hymba-1.5B-Base • Text Generation • 2B • 524 downloads • 147 likes
- vidore/colsmolvlm-v0.1 • Visual Document Retrieval • 1.44k downloads • 53 likes

- microsoft/LLM2CLIP-EVA02-L-14-336 • Zero-Shot Image Classification • 99 downloads • 58 likes
- microsoft/LLM2CLIP-EVA02-B-16 • 63 downloads • 10 likes
- PleIAs/common_corpus • Dataset • 470M rows • 14.7k downloads • 305 likes
- Qwen/Qwen2.5-Coder-32B-Instruct • Text Generation • 33B • 132k downloads • 1.92k likes

- NVLM: Open Frontier-Class Multimodal LLMs • Paper • arXiv:2409.11402 • 75 upvotes
- BRAVE: Broadening the visual encoding of vision-language models • Paper • arXiv:2404.07204 • 19 upvotes
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • Paper • arXiv:2403.18814 • 48 upvotes
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models • Paper • arXiv:2409.17146 • 122 upvotes

- LOTUS Normal • Space (Runtime error) • 102 likes • Generate high-quality predictions from images
- LOTUS Depth • Space (Runtime error) • 76 likes • Generate depth maps from images and videos
- jingheya/lotus-depth-g-v1-0 • Depth Estimation • 18.2k downloads • 24 likes
- jingheya/lotus-depth-d-v1-0 • Depth Estimation • 392 downloads • 5 likes

- facebook/dinov2-large • Image Feature Extraction • 0.3B • 780k downloads • 92 likes
- google/flan-t5-xl • 3B • 368k downloads • 517 likes
- google/siglip-large-patch16-384 • Zero-Shot Image Classification • 0.7B • 13.5k downloads • 8 likes
- google/vit-huge-patch14-224-in21k • Image Feature Extraction • 0.6B • 23.7k downloads • 21 likes

- facebook/deit-base-distilled-patch16-384 • Image Classification • 0.1B • 632 downloads • 6 likes
- facebook/convnextv2-base-1k-224 • Image Classification • 0.1B • 156 downloads • 3 likes
- facebook/deit-base-distilled-patch16-224 • Image Classification • 16k downloads • 27 likes
- google/vit-base-patch32-384 • Image Classification • 0.1B • 3.81k downloads • 23 likes
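
Any of these classifiers can be loaded with the transformers image-classification pipeline. A minimal sketch with the DeiT checkpoint above; "photo.jpg" is a placeholder path (URLs also work).

```python
from transformers import pipeline

# Top-3 ImageNet labels for a local image or URL.
classifier = pipeline("image-classification", model="facebook/deit-base-distilled-patch16-224")
for pred in classifier("photo.jpg", top_k=3):
    print(f"{pred['label']}: {pred['score']:.3f}")
```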

- facebook/maskformer-swin-large-coco • Image Segmentation • 0.2B • 2.5k downloads • 26 likes
- nvidia/segformer-b0-finetuned-ade-512-512 • Image Segmentation • 0.0B • 190k downloads • 164 likes
- facebook/detr-resnet-50-dc5-panoptic • Image Segmentation • 0.0B • 29 downloads • 3 likes
- nvidia/segformer-b5-finetuned-cityscapes-1024-1024 • Image Segmentation • 217k downloads • 30 likes
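
A minimal image-segmentation pipeline sketch with the SegFormer ADE20K checkpoint above; "street.jpg" is a placeholder path.

```python
from transformers import pipeline

# Semantic segmentation: the pipeline returns one PIL mask per predicted class.
segmenter = pipeline("image-segmentation", model="nvidia/segformer-b0-finetuned-ade-512-512")
for segment in segmenter("street.jpg"):
    print(segment["label"], segment["mask"].size)
```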

- timbrooks/instruct-pix2pix • Image-to-Image • 43.5k downloads • 1.14k likes
- TencentARC/t2i-adapter-canny-sdxl-1.0 • Image-to-Image • 3.97k downloads • 52 likes
- TencentARC/t2i-adapter-sketch-sdxl-1.0 • Image-to-Image • 4.41k downloads • 76 likes
- CrucibleAI/ControlNetMediaPipeFace • Image-to-Image • 959 downloads • 572 likes
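
The first entry, instruct-pix2pix, edits an existing image from a text instruction. Below is a minimal diffusers sketch assuming a CUDA GPU; the input path and instruction are placeholders.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Instruction-based image editing with timbrooks/instruct-pix2pix.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("input.jpg")  # placeholder path or URL
edited = pipe(
    "turn it into a watercolor painting",  # placeholder instruction
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited.save("edited.jpg")
```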

- Salesforce/blip-image-captioning-large • Image-to-Text • 0.5B • 1.2M downloads • 1.4k likes
- Salesforce/blip-image-captioning-base • Image-to-Text • 2.07M downloads • 771 likes
- microsoft/trocr-base-handwritten • Image-to-Text • 0.3B • 699k downloads • 434 likes
- microsoft/git-large-coco • Image-to-Text • 0.4B • 7.67k downloads • 104 likes
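
A minimal captioning sketch with the BLIP base checkpoint above via the image-to-text pipeline; "photo.jpg" is a placeholder (URLs also work).

```python
from transformers import pipeline

# Image captioning with BLIP.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg")[0]["generated_text"])
```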

- Grounding DINO Demo • Space (Running) • 89 likes • Cutting edge open-vocabulary object detection app
- Owlv2 • Space (Running) • 91 likes • State-of-the-art Zero-shot Object Detection
- BLIP2 with transformers • Space (Runtime error) • 41 likes • BLIP2 (cutting edge image captioning) in 🤗 transformers
- IDEFICS Playground • Space (Build error) • 377 likes

- Owlv2 • Space (Running) • 91 likes • State-of-the-art Zero-shot Object Detection
- Owl Tracking • Space (Running on Zero) • 64 likes • Powerful foundation model for zero-shot object tracking
- Search and Detect (CLIP/OWL-ViT) • Space (Running) • 25 likes • Search and detect objects in images using text queries
- OWLSAM • Space (Running on Zero) • 104 likes • State-of-the-art open-vocabulary image segmentation

- Improved Baselines with Visual Instruction Tuning • Paper • arXiv:2310.03744 • 39 upvotes
- DeepSeek-VL: Towards Real-World Vision-Language Understanding • Paper • arXiv:2403.05525 • 47 upvotes
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities • Paper • arXiv:2308.12966 • 9 upvotes
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model • Paper • arXiv:2404.01331 • 28 upvotes

- google/owlvit-base-patch32 • Zero-Shot Object Detection • 0.2B • 119k downloads • 137 likes
- google/owlvit-base-patch16 • Zero-Shot Object Detection • 6.99k downloads • 12 likes
- google/owlvit-large-patch14 • Zero-Shot Object Detection • 34.2k downloads • 26 likes
- google/owlv2-base-patch16 • Zero-Shot Object Detection • 0.2B • 17.6k downloads • 27 likes
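
OWL-ViT and OWLv2 detect classes described in free text, without fine-tuning. A minimal sketch with the zero-shot-object-detection pipeline; the image path and candidate labels are placeholders.

```python
from transformers import pipeline

# Open-vocabulary detection: labels are arbitrary text prompts.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
predictions = detector("street.jpg", candidate_labels=["a person", "a bicycle", "a traffic light"])
for pred in predictions:
    print(pred["label"], round(pred["score"], 3), pred["box"])  # box: xmin/ymin/xmax/ymax
```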

- depth-anything/Depth-Anything-V2-Small • Depth Estimation • 9.57k downloads • 71 likes
- depth-anything/Depth-Anything-V2-Large • Depth Estimation • 95.4k downloads • 119 likes
- Depth Anything V2 • Space (Running on Zero) • 515 likes • Generate depth maps from images
- depth-anything/DA-2K • Dataset • 1.04k rows • 477 downloads • 12 likes
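
A minimal depth-estimation sketch. The pipeline needs a transformers-compatible checkpoint, so the "-hf" conversion of the small model is assumed here (the repos above are the original releases); "room.jpg" is a placeholder path.

```python
from transformers import pipeline

# Monocular depth estimation; returns a rendered depth map plus the raw tensor.
depth_estimator = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
result = depth_estimator("room.jpg")
result["depth"].save("depth.png")       # rendered depth map (PIL image)
print(result["predicted_depth"].shape)  # raw depth tensor
```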

- Vidore Leaderboard • Space (Running) • 170 likes • Explore visual document retrieval benchmark results
- Open VLM Leaderboard • Space (Running on CPU Upgrade) • 869 likes • VLMEvalKit Evaluation Results Collection
- Vision Arena (Testing VLMs side-by-side) • Space (Running) • 557 likes • Display image analysis results
- SEED-Bench Leaderboard • Space (Running) • 85 likes • Submit model evaluation results to leaderboard

- vidore/colpali-v1.2 • Visual Document Retrieval • 27.2k downloads • 110 likes
- Qwen/Qwen2-VL-7B-Instruct • Image-Text-to-Text • 8B • 438k downloads • 1.22k likes
- Qwen/Qwen2-VL-2B-Instruct • Image-Text-to-Text • 2B • 2.89M downloads • 444 likes
- Qwen/Qwen2-72B-Instruct • Text Generation • 73B • 46.2k downloads • 716 likes

- stepfun-ai/step3 • Image-Text-to-Text • 321B • 24.9k downloads • 149 likes
- nunchaku-tech/nunchaku-flux.1-krea-dev • Text-to-Image • 42.1k downloads • 100 likes
- fdtn-ai/Foundation-Sec-8B-Instruct • Text Generation • 8B • 9.84k downloads • 36 likes
- Wan-AI/Wan2.2-TI2V-5B-Diffusers • Text-to-Video • 15.6k downloads • 71 likes

- nvidia/OpenReasoning-Nemotron-32B • Text Generation • 33B • 3.29k downloads • 110 likes
- ByteDance-Seed/Seed-X-RM-7B • Translation • 340 downloads • 30 likes
- LGAI-EXAONE/EXAONE-4.0-32B • Text Generation • 32B • 129k downloads • 249 likes
- vidore/colqwen-omni-v0.1 • Visual Document Retrieval • 7.27k downloads • 87 likes

- Qwen/WorldPM-72B • Text Classification • 73B • 1.29k downloads • 76 likes
- LTX Video Fast • Space (Running on Zero, MCP) • 1.18k likes • ultra-fast video model, LTX 0.9.8 13B distilled
- BLIP3o/BLIP3o-Pretrain-Long-Caption • Dataset • 27.2M rows • 13.6k downloads • 50 likes
- BLIP3o/BLIP3o-Model-8B • 14B • 1.42k downloads • 101 likes

- OpenGVLab/InternVL3-1B-hf • Image-Text-to-Text • 0.9B • 89.2k downloads • 5 likes
- OpenGVLab/InternVL3-2B-hf • Image-Text-to-Text • 2B • 11.6k downloads • 3 likes
- OpenGVLab/InternVL3-8B-hf • Image-Text-to-Text • 8B • 19.4k downloads • 9 likes
- OpenGVLab/InternVL3-14B-hf • Image-Text-to-Text • 15B • 3.17k downloads

- deepseek-ai/DeepSeek-V3-0324 • Text Generation • 685B • 345k downloads • 3.05k likes
- Qwen/Qwen2.5-Omni-7B • Any-to-Any • 11B • 171k downloads • 1.77k likes
- google/txgemma-27b-chat • Text Generation • 27B • 575 downloads • 54 likes
- Qwen2.5 Omni 7B Demo • Space (Running) • 343 likes • Generate text and speech from text, audio, images, and videos

- Qwen/Qwen2-VL-7B-Instruct • Image-Text-to-Text • 8B • 438k downloads • 1.22k likes
- Qwen/Qwen2-VL-2B-Instruct • Image-Text-to-Text • 2B • 2.89M downloads • 444 likes
- CohereLabs/aya-vision-8b • Image-Text-to-Text • 9B • 58.9k downloads • 309 likes
- CohereLabs/aya-vision-32b • Image-Text-to-Text • 33B • 180 downloads • 215 likes
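
A minimal chat-style inference sketch for the Qwen2-VL checkpoints above using transformers directly. The image URL is a placeholder, device_map="auto" assumes accelerate, and preprocessing details may differ slightly from the official model card.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder URL
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(answer)
```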

- Qwen2-VL-7B • Space (Running on Zero) • 257 likes • Generate text from an image and question
- UI-TARS • Space (Running) • 58 likes • Select coordinates on an image based on instructions
- Qwen2.5-1M Demo • Space (Running) • 88 likes • Upload documents and ask questions
- Qwen/Qwen2.5-14B-Instruct-1M • Text Generation • 15B • 16.3k downloads • 317 likes

- meta-llama/Llama-3.3-70B-Instruct • Text Generation • 71B • 521k downloads • 2.48k likes
- Qwen/Qwen2-VL-72B • Image-Text-to-Text • 73B • 1.15k downloads • 80 likes
- google/paligemma2-3b-pt-224 • Image-Text-to-Text • 3B • 391k downloads • 156 likes
- tencent/HunyuanVideo • Text-to-Video • 4.02k downloads • 2.03k likes

- mistralai/Pixtral-Large-Instruct-2411 • 51 downloads • 420 likes
- microsoft/orca-agentinstruct-1M-v1 • Dataset • 1.05M rows • 2.26k downloads • 450 likes
- Xkev/Llama-3.2V-11B-cot • Image-Text-to-Text • 11B • 3.5k downloads • 155 likes
- jinaai/jina-clip-v2 • Feature Extraction • 0.9B • 62.5k downloads • 276 likes

- ibm-granite/granite-3.0-8b-instruct • Text Generation • 8B • 29.9k downloads • 201 likes
- ibm-granite/granite-3.0-2b-instruct • Text Generation • 3B • 4.78k downloads • 46 likes
- CohereLabs/aya-expanse-8b • Text Generation • 8B • 15.4k downloads • 399 likes
- CohereLabs/aya-expanse-32b • Text Generation • 32B • 7.64k downloads • 267 likes

- microsoft/resnet-50 • Image Classification • 0.0B • 160k downloads • 436 likes
- google/vit-base-patch16-224-in21k • Image Feature Extraction • 0.1B • 5.68M downloads • 367 likes
- google/vit-base-patch32-224-in21k • Image Feature Extraction • 0.1B • 55.7k downloads • 19 likes
- facebook/dinov2-large • Image Feature Extraction • 0.3B • 780k downloads • 92 likes
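
DINOv2 here is a feature extractor rather than a classifier. A minimal sketch that pulls a pooled embedding for retrieval or linear probing; "photo.jpg" is a placeholder path.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
model = AutoModel.from_pretrained("facebook/dinov2-large")

image = Image.open("photo.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

embedding = outputs.last_hidden_state[:, 0]  # CLS token embedding
print(embedding.shape)
```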

- facebook/detr-resnet-50 • Object Detection • 0.0B • 454k downloads • 891 likes
- facebook/detr-resnet-101-dc5 • Object Detection • 0.1B • 2.35k downloads • 19 likes
- facebook/detr-resnet-50-dc5 • Object Detection • 0.0B • 1.87k downloads • 6 likes
- google/owlvit-base-patch32 • Zero-Shot Object Detection • 0.2B • 119k downloads • 137 likes
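
Unlike the OWL-ViT entry in the same group, DETR is closed-set (COCO labels). A minimal object-detection pipeline sketch with a placeholder image path:

```python
from transformers import pipeline

# COCO object detection with DETR; threshold filters low-confidence boxes.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
for pred in detector("street.jpg", threshold=0.9):
    print(pred["label"], round(pred["score"], 3), pred["box"])
```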

- openai/clip-vit-large-patch14 • Zero-Shot Image Classification • 0.4B • 9.14M downloads • 1.85k likes
- openai/clip-vit-base-patch32 • Zero-Shot Image Classification • 16.8M downloads • 752 likes
- laion/CLIP-ViT-bigG-14-laion2B-39B-b160k • Zero-Shot Image Classification • 143k downloads • 289 likes
- kakaobrain/align-base • Zero-Shot Image Classification • 23.5k downloads • 26 likes
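
A minimal zero-shot classification sketch with the CLIP base checkpoint above; the labels and image path are placeholders.

```python
from transformers import pipeline

# Zero-shot image classification: labels are free-form text.
classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
predictions = classifier("photo.jpg", candidate_labels=["a cat", "a dog", "a car"])
print(predictions[0])  # highest-scoring label with its score
```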

- microsoft/xclip-base-patch32 • Video Classification • 0.2B • 148k downloads • 98 likes
- facebook/timesformer-base-finetuned-k400 • Video Classification • 24.6k downloads • 42 likes
- facebook/timesformer-base-finetuned-k600 • Video Classification • 8.07k downloads • 12 likes
- google/vivit-b-16x2 • Video Classification • 344 downloads • 11 likes
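
A minimal video-classification sketch with the TimeSformer Kinetics-400 checkpoint; "clip.mp4" is a placeholder, and the pipeline assumes a video decoding backend (such as decord) is installed.

```python
from transformers import pipeline

# Kinetics-400 action recognition on a short clip.
classifier = pipeline("video-classification", model="facebook/timesformer-base-finetuned-k400")
print(classifier("clip.mp4", top_k=3))
```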

- stabilityai/stable-diffusion-xl-base-1.0 • Text-to-Image • 2.17M downloads • 6.89k likes
- warp-ai/wuerstchen • Text-to-Image • 148 downloads • 175 likes
- Deci/DeciDiffusion-v1-0 • Text-to-Image • 21 downloads • 139 likes
- stabilityai/stable-diffusion-xl-refiner-1.0 • Image-to-Image • 559k downloads • 1.96k likes
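
A minimal diffusers sketch for the SDXL base checkpoint above, assuming a CUDA GPU with enough memory; the prompt is a placeholder, and the refiner in the same group can optionally post-process the output.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dusk", num_inference_steps=30).images[0]
image.save("lighthouse.png")
```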

- Draw To Search Art • Space (Running on Zero) • 71 likes • Draw/upload image and search among WikiART using SigLIP
- Compare Clip Siglip • Space (Running on CPU Upgrade) • 23 likes • Compare strong zero-shot image classification models
- Multilingual Zero Shot Image Clf • Space (Running on Zero) • 13 likes • Comparing powerful multilingual zero-shot image clf models
- BAAI/bunny-phi-2-siglip-lora • Text Generation • 46 downloads • 48 likes

- google/owlvit-base-patch32 • Zero-Shot Object Detection • 0.2B • 119k downloads • 137 likes
- google/owlvit-base-patch16 • Zero-Shot Object Detection • 6.99k downloads • 12 likes
- google/owlvit-large-patch14 • Zero-Shot Object Detection • 34.2k downloads • 26 likes
- google/owlv2-base-patch16 • Zero-Shot Object Detection • 0.2B • 17.6k downloads • 27 likes

- Video Llava • Space (Running) • 21 likes • Generate descriptions by uploading images or videos
- llava-hf/LLaVA-NeXT-Video-7B-hf • Video-Text-to-Text • 7B • 106k downloads • 107 likes
- llava-hf/LLaVA-NeXT-Video-7B-DPO-hf • Video-Text-to-Text • 7B • 1.47k downloads • 9 likes
- llava-hf/LLaVA-NeXT-Video-7B-32K-hf • Image-Text-to-Text • 8B • 1.21k downloads • 7 likes

- NVEagle/Eagle-X5-13B • Image-Text-to-Text • 15B • 84 downloads • 15 likes
- NVEagle/Eagle-X5-13B-Chat • Image-Text-to-Text • 15B • 1.17k downloads • 28 likes
- NVEagle/Eagle-X5-7B • Image-Text-to-Text • 9B • 5.01k downloads • 26 likes
- Eagle X5 13B Chat • Space (Running on Zero) • 64 likes • Combine text and images to generate responses