| Name | Key Information | Year | Source | Reference | Owner |
| --- | --- | --- | --- | --- | --- |
| VoxBlink2 | An extensive audio-visual speaker recognition dataset featuring approximately 10 million utterances from over 110,000 speakers. | 2024 | Access Link | Cite | VoxBlink2 Team |
| GigaSpeech 2 | Comprises about 30,000 hours of automatically transcribed speech. | 2024 | Access Link | Cite | SpeechColab |
| 3D-Speaker | Contains over 10,000 speakers, each recorded simultaneously by multiple Devices located at different Distances; some speakers also speak multiple Dialects. | 2023 | Access Link | Cite | Alibaba DAMO Academy |
| VoxTube | Contains more than 5,000 speakers and more than 4 million utterances spoken in more than 10 languages. | 2023 | Access Link | Cite | ID R&D Inc. |
| AliMeeting | Contains 118.75 hours of speech data in total. | 2022 | Access Link | Cite | Alibaba |
| MAGICDATA Mandarin Chinese Conversational Speech Corpus | Contains 180 hours of speech data, all recorded on mobile devices. | 2022 | Access Link | Cite | Magic Data Technology |
| CN-Celeb | A large-scale speaker recognition dataset collected “in the wild,” comprising two subsets: CN-Celeb1, with over 130,000 utterances from 1,000 Chinese celebrities, and CN-Celeb2, with over 520,000 utterances from 2,000 celebrities. Covers 11 different genres. | 2020 | Access Link | Cite | Chinese Academy of Sciences |
| Common Voice | Has collected over 30,000 hours of crowdsourced speech data covering more than 180 languages. | 2020 | Access Link | Cite | Mozilla Foundation |
| HI-MIA | Contains recordings of 340 people in rooms designed for the far-field scenario. | 2019 | Access Link | Cite | AIShell Tech |
| The VoxCeleb2 Dataset | Contains over 1 million utterances from 6,112 celebrities, all extracted from videos uploaded to YouTube. | 2018 | Access Link | Cite | University of Oxford |
| The VoxCeleb1 Dataset | Contains over 100,000 utterances from 1,251 celebrities, extracted from videos uploaded to YouTube. | 2017 | Access Link | Cite | University of Oxford |
| ASVspoof 2017 | Contains 4,800 speech samples simulating attack scenarios across 10 types of consumer devices. | 2017 | Access Link | Cite | University of Edinburgh |
| LibriSpeech ASR corpus | A large-scale corpus of read English speech, approximately 1,000 hours long, sampled at 16 kHz. | 2015 | Access Link | Cite | The Johns Hopkins University |
| ASVspoof 2015 | Comprises genuine speech from 106 speakers and nearly 200,000 spoofed speech samples. | 2015 | Access Link | Cite | University of Edinburgh |
| The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus | Contains 6,300 sentences, 10 spoken by each of 630 speakers from 8 major dialect regions of the United States. | 1993 | Access Link | Cite | NIST |