Submitted by zwq2018 100 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining · 9 authors 7
Submitted by xichenhku 51 VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control · 6 authors 3
Submitted by akhaliq 50 CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings · 17 authors 6
Submitted by CircleRadon 41 VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM · 12 authors 2
Submitted by xiazhi 37 Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models · 2 authors 2
Submitted by dongguanting 25 ProgCo: Program Helps Self-Correction of Large Language Models · 6 authors 2
Submitted by mahirlabibdihan 22 MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models · 8 authors 2
Submitted by tyleryzhu 21 Unifying Specialized Visual Encoders for Video Language Models · 6 authors 2
Submitted by orpatashnik 11 Nested Attention: Semantic-aware Attention Values for Concept Personalization · 6 authors 2
Submitted by Iceclear 11 SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration · 7 authors 2
Submitted by mahirlabibdihan 10 MapQaTor: A System for Efficient Annotation of Map Query Datasets · 3 authors 2
Submitted by peihaowang 7 Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing · 7 authors 2
Submitted by lanczos 6 Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding · 6 authors 4
Submitted by Harold328 5 SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization · 6 authors 2