-
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper • 2412.15213 • Published • 26 -
No More Adam: Learning Rate Scaling at Initialization is All You Need
Paper • 2412.11768 • Published • 41 -
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Paper • 2412.13663 • Published • 135 -
Autoregressive Video Generation without Vector Quantization
Paper • 2412.14169 • Published • 14
Collections
Discover the best community collections!
Collections including paper arxiv:2410.02073
-
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
Paper • 2410.13218 • Published • 4 -
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Paper • 2410.10818 • Published • 17 -
Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
Paper • 2410.08049 • Published • 8 -
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Paper • 2410.05363 • Published • 45
-
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
Paper • 2410.01804 • Published • 7 -
DreamCinema: Cinematic Transfer with Free Camera and 3D Character
Paper • 2408.12601 • Published • 30 -
Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation
Paper • 2408.14819 • Published • 21 -
DressRecon: Freeform 4D Human Reconstruction from Monocular Video
Paper • 2409.20563 • Published • 9
-
DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
Paper • 2410.00201 • Published -
Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems
Paper • 2409.19804 • Published -
Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
Paper • 2409.15156 • Published -
Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
Paper • 2409.04927 • Published
-
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper • 2405.20340 • Published • 20 -
Spectrally Pruned Gaussian Fields with Neural Compensation
Paper • 2405.00676 • Published • 10 -
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Paper • 2404.18212 • Published • 29 -
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Paper • 2405.00732 • Published • 120
-
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Paper • 2405.08748 • Published • 24 -
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 29 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 131 -
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Paper • 2405.11143 • Published • 38
-
LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper • 2403.09338 • Published • 9 -
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Paper • 2403.09394 • Published • 27 -
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper • 2402.19479 • Published • 34 -
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 29
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 610 -
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published • 88 -
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Paper • 2402.19427 • Published • 55 -
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper • 2402.19479 • Published • 34