Daily Papers

by AK and the research community

Mar 18

Submitted by

lixiaochuan

DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

·
13 authors

1

Submitted by

tellarin

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

·
9 authors

1

Submitted by

limuloo1999

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

·
4 authors

2

Submitted by

Orannue

Edit Transfer: Learning Image Editing via Vision In-Context Relations

·
4 authors

4

Submitted by

huanngzh

Personalize Anything for Free with Diffusion Transformer

·
5 authors

3

Submitted by

yyyyyyjjjjzzz

SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

·
8 authors

2

Submitted by

ZyZcuhk

BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

·
9 authors

1

Submitted by

akhaliq

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

·
7 authors

Submitted by

Lingaaaaaaa

WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes

·
8 authors

1

Submitted by

jmhb

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

·
23 authors

1

Submitted by

Gh0stAR

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

·
7 authors

1

Submitted by

ZhaofengWu

reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

·
6 authors

1

Submitted by

lwpyh

V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

·
6 authors

1

Submitted by

fpoiesi

Free-form language-based robotic reasoning and grasping

·
8 authors

1

Submitted by

KevinQHLin

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

·
4 authors

1

Submitted by

ysy31415926

MTV-Inpaint: Multi-Task Long Video Inpainting

·
7 authors

Submitted by

Luo-Yihong

Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation

·
5 authors

1

Submitted by

Buzz-lightyear

Long-Video Audio Synthesis with Multi-Agent Collaboration

·
5 authors

2

Submitted by

k-nick

Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework

·
8 authors

1

Submitted by

soarhigh

Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions

·
7 authors

1

Submitted by

zeeshanp

Training Video Foundation Models with NVIDIA NeMo

·
29 authors

Submitted by

JesseTNRoberts

Basic Category Usage in Vision Language Models

·
3 authors

1

Submitted by

JesseTNRoberts

Investigating Human-Aligned Large Language Model Uncertainty

·
4 authors

1

Submitted by

FQiao

GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching

·
4 authors

1

Submitted by

zxbsmk

WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation

·
12 authors

1

Submitted by

Sckathach

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

·
3 authors

1