- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
  Paper • 2404.01331 • Published • 25
- OmniFusion Technical Report
  Paper • 2404.06212 • Published • 75
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 13
- WildGaussians: 3D Gaussian Splatting in the Wild
  Paper • 2407.08447 • Published • 8
Collections
Collections including paper arxiv:2404.16030
- MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
  Paper • 2403.19888 • Published • 12
- Measuring Style Similarity in Diffusion Models
  Paper • 2404.01292 • Published • 17
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
  Paper • 1804.07461 • Published • 4
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 13
- The Curious Case of Neural Text Degeneration
  Paper • 1904.09751 • Published • 3
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models
  Paper • 2404.01197 • Published • 31
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
  Paper • 1905.10044 • Published • 1
- PIQA: Reasoning about Physical Commonsense in Natural Language
  Paper • 1911.11641 • Published • 2
- TextCraftor: Your Text Encoder Can be Image Quality Controller
  Paper • 2403.18978 • Published • 15
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
  Paper • 2404.02733 • Published • 21
- OmniFusion Technical Report
  Paper • 2404.06212 • Published • 75
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation
  Paper • 2404.07448 • Published • 12
- LIMA: Less Is More for Alignment
  Paper • 2305.11206 • Published • 23
- Garment3DGen: 3D Garment Stylization and Texture Generation
  Paper • 2403.18816 • Published • 23
- EgoLifter: Open-world 3D Segmentation for Egocentric Perception
  Paper • 2403.18118 • Published • 12
- The Unreasonable Ineffectiveness of the Deeper Layers
  Paper • 2403.17887 • Published • 79
- Demystifying CLIP Data
  Paper • 2309.16671 • Published • 20
- Model Stock: All we need is just a few fine-tuned models
  Paper • 2403.19522 • Published • 10
- Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
  Paper • 2404.01367 • Published • 21
- On the Scalability of Diffusion-based Text-to-Image Generation
  Paper • 2404.02883 • Published • 18
- Robust Mixture-of-Expert Training for Convolutional Neural Networks
  Paper • 2308.10110 • Published • 2
- HyperFormer: Enhancing Entity and Relation Interaction for Hyper-Relational Knowledge Graph Completion
  Paper • 2308.06512 • Published • 2
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
  Paper • 2309.04354 • Published • 14
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
  Paper • 2212.05055 • Published • 5
- Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
  Paper • 2009.10622 • Published • 1
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 51
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 70
- MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
  Paper • 2401.14361 • Published • 2
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 41
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22
- YOLO-World: Real-Time Open-Vocabulary Object Detection
  Paper • 2401.17270 • Published • 36
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 13
- Improving fine-grained understanding in image-text pre-training
  Paper • 2401.09865 • Published • 17
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 27