| Name | Key Information | Year | Source | Reference | Owner |
| --- | --- | --- | --- | --- | --- |
| VoxBlink2 | An extensive audio-visual speaker recognition dataset featuring approximately 10 million utterances from over 110,000 speakers. | 2024 | Access Link | Cite | VoxBlink2 Team |
| CN-Celeb | A large-scale speaker recognition dataset collected “in the wild,” comprising two subsets: CN-Celeb1, with over 130,000 utterances from 1,000 Chinese celebrities, and CN-Celeb2, with over 520,000 utterances from 2,000 celebrities. Covers 11 different genres. | 2020 | Access Link | Cite | Chinese Academy of Sciences |
| Common Voice | A crowdsourced collection of over 30,000 hours of speech data covering more than 180 languages. | 2020 | Access Link | Cite | Mozilla Foundation |
| The VoxCeleb2 Dataset | Contains over 1 million utterances from 6,112 celebrities, all extracted from videos uploaded to YouTube. | 2018 | Access Link | Cite | University of Oxford |
| The VoxCeleb1 Dataset | Contains over 100,000 utterances from 1,251 celebrities, extracted from videos uploaded to YouTube. | 2017 | Access Link | Cite | University of Oxford |
| ASVspoof 2017 | Contains 4,800 speech samples simulating attack scenarios across 10 types of consumer devices. | 2017 | Access Link | Cite | University of Edinburgh |
| LibriSpeech ASR corpus | A large-scale corpus of read English speech, approximately 1,000 hours in duration, sampled at 16 kHz. | 2015 | Access Link | Cite | The Johns Hopkins University |
| ASVspoof 2015 | Comprises genuine speech from 106 speakers and nearly 200,000 spoofed speech samples. | 2015 | Access Link | Cite | University of Edinburgh |
| The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus | Contains 6,300 sentences: 10 read by each of 630 speakers from 8 major dialect regions of the United States. | 1993 | Access Link | Cite | NIST |