-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 26 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 43 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 22
Collections
Discover the best community collections!
Collections including paper arxiv:2406.08407
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 16 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 89 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 32
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26 -
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 9 -
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Paper • 2405.07990 • Published • 20 -
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Paper • 2406.09411 • Published • 20
-
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 9 -
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Paper • 2406.08407 • Published • 28 -
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Paper • 2408.03361 • Published • 86
-
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper • 2404.07973 • Published • 32 -
RegionGPT: Towards Region Understanding Vision Language Model
Paper • 2403.02330 • Published • 2 -
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 30 -
Pegasus-v1 Technical Report
Paper • 2404.14687 • Published • 32
-
Can large language models explore in-context?
Paper • 2403.15371 • Published • 33 -
GaussianCube: Structuring Gaussian Splatting using Optimal Transport for 3D Generative Modeling
Paper • 2403.19655 • Published • 19 -
WavLLM: Towards Robust and Adaptive Speech Large Language Model
Paper • 2404.00656 • Published • 11 -
Enabling Memory Safety of C Programs using LLMs
Paper • 2404.01096 • Published • 1