Papers
arxiv:2508.13618

TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Published on Aug 19
· Submitted by Shunian on Sep 1
Abstract

TalkVid, a large-scale, high-quality, and diverse dataset, improves audio-driven talking head synthesis by enhancing generalization across human diversity and revealing subgroup performance disparities.

AI-generated summary

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid
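The multi-stage curation pipeline described in the abstract (sequential filters for motion stability, aesthetic quality, and facial detail) can be sketched roughly as follows. This is a minimal illustration only: the stage names, score fields, and threshold values are assumptions for the sketch, not the paper's actual criteria or numbers.

```python
# Illustrative sketch of a sequential multi-stage quality filter.
# Each clip carries per-stage scores; a clip survives only if it
# clears the threshold at every stage, applied in order.
# Stage names and thresholds are assumed for illustration.

def filter_clips(clips, thresholds):
    """Return the clips that pass every filtering stage in sequence."""
    stages = ["motion_stability", "aesthetic_quality", "facial_detail"]
    kept = clips
    for stage in stages:
        kept = [c for c in kept if c[stage] >= thresholds[stage]]
    return kept

thresholds = {"motion_stability": 0.8, "aesthetic_quality": 0.5, "facial_detail": 0.7}
clips = [
    {"id": "a", "motion_stability": 0.9, "aesthetic_quality": 0.6, "facial_detail": 0.8},
    {"id": "b", "motion_stability": 0.7, "aesthetic_quality": 0.9, "facial_detail": 0.9},
]
print([c["id"] for c in filter_clips(clips, thresholds)])  # prints ['a']
```

Applying the stages sequentially means each filter only sees clips that already passed the earlier, cheaper checks, which is the usual way large-scale video curation pipelines keep compute manageable.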

Community

Paper submitter

TalkVid is a large-scale and diversified open-source dataset for audio-driven talking head synthesis, featuring:

  • Scale: 7,729 unique speakers with over 1,244 hours of HD/4K footage
  • Diversity: Covers 15 languages and a wide age range (0–60+ years)
  • Quality: High-resolution videos (1080p & 2160p) with comprehensive quality filtering
  • Rich Context: Full upper-body presence unlike head-only datasets
  • Annotations: High-quality captions and comprehensive metadata

Download Link: 🤗 Hugging Face


Data samples:
[figure: Case_study_samples.png]

Comparison with existing datasets:
[figure]

Data filtering process:
[figure]

Dataset statistics:
[figure]

