TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Abstract
TalkVid, a large-scale, high-quality, and diverse dataset, improves audio-driven talking head synthesis by enhancing generalization across human diversity and revealing subgroup performance disparities.
Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they do not generalize to the full spectrum of human diversity in ethnicity, language, and age. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1,244 hours of video from 7,729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data are available at https://github.com/FreedomIntelligence/TalkVid
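To make the filtering stages more concrete, here is a minimal, hypothetical sketch of a multi-stage threshold filter of the kind the abstract describes. The scoring functions and cutoff values are illustrative assumptions, not the authors' actual pipeline or thresholds.

```python
# Illustrative sketch (not the authors' code) of a multi-stage filtering pipeline:
# a clip is kept only if it clears per-criterion thresholds for motion stability,
# aesthetic quality, and facial detail. Scores and cutoffs below are hypothetical.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClipScores:
    motion_stability: float   # e.g., 1 - normalized optical-flow jitter
    aesthetic_quality: float  # e.g., output of an aesthetic predictor
    facial_detail: float      # e.g., sharpness/resolution score of the face region


THRESHOLDS: Dict[str, float] = {
    "motion_stability": 0.7,   # hypothetical cutoff
    "aesthetic_quality": 0.5,  # hypothetical cutoff
    "facial_detail": 0.6,      # hypothetical cutoff
}


def passes_all_stages(scores: ClipScores) -> bool:
    """Return True only if the clip clears every stage's threshold."""
    return all(getattr(scores, name) >= cutoff for name, cutoff in THRESHOLDS.items())


def filter_clips(clips: List[ClipScores]) -> List[ClipScores]:
    """Keep only clips that survive all filtering stages."""
    return [clip for clip in clips if passes_all_stages(clip)]
```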
Community
TalkVid is a large-scale and diversified open-source dataset for audio-driven talking head synthesis, featuring:
- Scale: 7,729 unique speakers with over 1,244 hours of HD/4K footage
- Diversity: Covers 15 languages and a wide age range (0–60+ years)
- Quality: High-resolution videos (1080p & 2160p) with comprehensive quality filtering
- Rich Context: Full upper-body presence, unlike head-only datasets
- Annotations: High-quality captions and comprehensive metadata
Download Link: 🤗 Hugging Face
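Assuming the dataset is hosted on the Hugging Face Hub under the paper's GitHub organization (the repo id `FreedomIntelligence/TalkVid` is an assumption; consult the dataset card for the exact id and file layout), a minimal download sketch looks like this:

```python
# Minimal sketch for fetching the TalkVid files from the Hugging Face Hub.
# NOTE: the repo id below is an assumption inferred from the paper's GitHub
# organization; check the dataset card for the exact id before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="FreedomIntelligence/TalkVid",  # assumed repo id
    repo_type="dataset",
)
print(f"TalkVid files downloaded to: {local_dir}")
```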
The following related papers were recommended by the Semantic Scholar API:
- SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation (2025)
- Multi-human Interactive Talking Dataset (2025)
- Democratizing High-Fidelity Co-Speech Gesture Video Generation (2025)
- InfinityHuman: Towards Long-Term Audio-Driven Human (2025)
- JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 (2025)
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering (2025)
- AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations (2025)