Voiceprint Datasets

Name	Key Information	Year	Source	Reference	Owner
VoxBlink2	An extensive audio-visual speaker recognition dataset featuring approximately 10 million utterances from over 110,000 speakers.	2024	Access Link	Cite	VoxBlink2 Team
GigaSpeech 2	GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech.	2024	Access Link	Cite	SpeechColab
3D-Speaker	Contains over 10,000 speakers, each recorded simultaneously by multiple Devices located at different Distances; some speakers also speak multiple Dialects.	2023	Access Link	Cite	Alibaba DAMO Academy
VoxTube	It contains more than 5,000 speakers with more than 4 million utterances spoken in more than 10 languages.	2023	Access Link	Cite	ID R&D Inc.
AliMeeting	The dataset contains 118.75 hours of speech data in total.	2022	Access Link	Cite	Alibaba
MAGICDATA Mandarin Chinese Conversational Speech Corpus	The corpus contains 180 hours of speech data, all of it recorded on mobile devices.	2022	Access Link	Cite	Magic Data Technology
CN-Celeb	A large-scale speaker recognition dataset collected “in the wild,” comprising two subsets: CN-Celeb1 with over 130,000 utterances from 1,000 Chinese celebrities, and CN-Celeb2 with over 520,000 utterances from 2,000 celebrities. Covers 11 different genres.	2020	Access Link	Cite	Chinese Academy of Sciences
Common Voice	A crowdsourced corpus with over 30,000 hours of speech data covering more than 180 languages.	2020	Access Link	Cite	Mozilla Foundation
HI-MIA	It contains recordings of 340 people in rooms designed for the far-field scenario.	2019	Access Link	Cite	AIShell Tech
The VoxCeleb2 Dataset	It contains over 1 million utterances from 6,112 celebrities. These utterances are all extracted from videos uploaded to YouTube.	2018	Access Link	Cite	University of Oxford
The VoxCeleb1 Dataset	It contains over 100,000 utterances from 1,251 celebrities, which are extracted from videos uploaded to YouTube.	2017	Access Link	Cite	University of Oxford
ASVspoof 2017	The dataset contains 4,800 speech samples, simulating attack scenarios across 10 types of consumer devices.	2017	Access Link	Cite	University of Edinburgh
LibriSpeech ASR corpus	It is a large-scale corpus of read English speech, with a duration of approximately 1,000 hours and a sampling rate of 16 kHz.	2015	Access Link	Cite	The Johns Hopkins University
ASVspoof 2015	The dataset comprises genuine speech from 106 speakers and nearly 200,000 spoofed speech samples.	2015	Access Link	Cite	University of Edinburgh
The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus	This dataset contains 6,300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions in the United States.	1993	Access Link	Cite	NIST