TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Abstract
TalkVid, a large-scale, high-quality, and diverse dataset, improves audio-driven talking head synthesis by enhancing generalization across human diversity and revealing subgroup performance disparities.
Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they do not generalize to the full spectrum of human diversity in ethnicity, language, and age. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1,244 hours of video from 7,729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data are available at https://github.com/FreedomIntelligence/TalkVid
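To make the filtering stages more concrete, here is a minimal, hypothetical sketch of a multi-stage threshold filter of the kind the abstract describes. The scoring functions and cutoff values are illustrative assumptions, not the authors' actual pipeline or thresholds.

```python
# Illustrative sketch (not the authors' code) of a multi-stage filtering pipeline:
# a clip is kept only if it clears per-criterion thresholds for motion stability,
# aesthetic quality, and facial detail. Scores and cutoffs below are hypothetical.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClipScores:
    motion_stability: float   # e.g., 1 - normalized optical-flow jitter
    aesthetic_quality: float  # e.g., output of an aesthetic predictor
    facial_detail: float      # e.g., sharpness/resolution score of the face region


THRESHOLDS: Dict[str, float] = {
    "motion_stability": 0.7,   # hypothetical cutoff
    "aesthetic_quality": 0.5,  # hypothetical cutoff
    "facial_detail": 0.6,      # hypothetical cutoff
}


def passes_all_stages(scores: ClipScores) -> bool:
    """Return True only if the clip clears every stage's threshold."""
    return all(getattr(scores, name) >= cutoff for name, cutoff in THRESHOLDS.items())


def filter_clips(clips: List[ClipScores]) -> List[ClipScores]:
    """Keep only clips that survive all filtering stages."""
    return [clip for clip in clips if passes_all_stages(clip)]
```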
Community
TalkVid is a large-scale and diversified open-source dataset for audio-driven talking head synthesis, featuring:
- Scale: 7,729 unique speakers with over 1,244 hours of HD/4K footage
- Diversity: Covers 15 languages and a wide age range (0–60+ years)
- Quality: High-resolution videos (1080p & 2160p) with comprehensive quality filtering
- Rich Context: Full upper-body presence, unlike head-only datasets
- Annotations: High-quality captions and comprehensive metadata
Download Link: 🤗 Hugging Face
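Assuming the dataset is hosted on the Hugging Face Hub under the paper's GitHub organization (the repo id `FreedomIntelligence/TalkVid` is an assumption; consult the dataset card for the exact id and file layout), a minimal download sketch looks like this:

```python
# Minimal sketch for fetching the TalkVid files from the Hugging Face Hub.
# NOTE: the repo id below is an assumption inferred from the paper's GitHub
# organization; check the dataset card for the exact id before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="FreedomIntelligence/TalkVid",  # assumed repo id
    repo_type="dataset",
)
print(f"TalkVid files downloaded to: {local_dir}")
```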
The following related papers were recommended by the Semantic Scholar API:
- SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation (2025)
- Multi-human Interactive Talking Dataset (2025)
- Democratizing High-Fidelity Co-Speech Gesture Video Generation (2025)
- InfinityHuman: Towards Long-Term Audio-Driven Human (2025)
- JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 (2025)
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering (2025)
- AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations (2025)