-
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper • 2412.15213 • Published • 26 -
No More Adam: Learning Rate Scaling at Initialization is All You Need
Paper • 2412.11768 • Published • 41 -
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Paper • 2412.13663 • Published • 126 -
Autoregressive Video Generation without Vector Quantization
Paper • 2412.14169 • Published • 14
Collections
Discover the best community collections!
Collections including paper arxiv:2401.14405
-
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Paper • 2401.14405 • Published • 13 -
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Paper • 2406.18521 • Published • 29 -
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Paper • 2408.12590 • Published • 36 -
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 93
-
Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
Paper • 2312.06134 • Published • 3 -
Efficient Monotonic Multihead Attention
Paper • 2312.04515 • Published • 7 -
Contrastive Decoding Improves Reasoning in Large Language Models
Paper • 2309.09117 • Published • 38 -
Exploring Format Consistency for Instruction Tuning
Paper • 2307.15504 • Published • 8
-
YOLO-World: Real-Time Open-Vocabulary Object Detection
Paper • 2401.17270 • Published • 36 -
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Paper • 2401.14405 • Published • 13 -
Improving fine-grained understanding in image-text pre-training
Paper • 2401.09865 • Published • 17 -
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Paper • 2404.15653 • Published • 27
-
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Paper • 2401.13919 • Published • 30 -
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Paper • 2401.14405 • Published • 13 -
Design2Code: How Far Are We From Automating Front-End Engineering?
Paper • 2403.03163 • Published • 94 -
LLM Agent Operating System
Paper • 2403.16971 • Published • 65
-
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Paper • 2401.14405 • Published • 13 -
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
Paper • 2407.06358 • Published • 19 -
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Paper • 2412.07825 • Published • 12
-
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Paper • 2312.08344 • Published • 10 -
Diffusion Priors for Dynamic View Synthesis from Monocular Videos
Paper • 2401.05583 • Published • 10 -
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Paper • 2401.14405 • Published • 13 -
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Paper • 2501.07301 • Published • 89