VoxBlink2 |
An extensive audio-visual speaker recognition dataset featuring approximately 10 million utterances from over 110,000 speakers. |
2024 |
Access Link |
Cite |
VoxBlink2 Team |
CN-Celeb |
A large-scale speaker recognition dataset collected “in the wild,” comprising two subsets: CN-Celeb1 with over 130,000 utterances from 1,000 Chinese celebrities, and CN-Celeb2 with over 520,000 utterances from 2,000 celebrities. Covers 11 different genres. |
2020 |
Access Link |
Cite |
Chinese Academy of Sciences |
Common Voice |
A crowdsourced multilingual speech corpus with over 30,000 hours of speech data in more than 180 languages. |
2020 |
Access Link |
Cite |
Mozilla Foundation |
The VoxCeleb2 Dataset |
Contains over 1 million utterances from 6,112 celebrities, all extracted from videos uploaded to YouTube. |
2018 |
Access Link |
Cite |
University of Oxford |
The VoxCeleb1 Dataset |
Contains over 100,000 utterances from 1,251 celebrities, extracted from videos uploaded to YouTube. |
2017 |
Access Link |
Cite |
University of Oxford |
ASVspoof 2017 |
The dataset contains 4,800 speech samples, simulating attack scenarios across 10 types of consumer devices. |
2017 |
Access Link |
Cite |
University of Edinburgh |
LibriSpeech ASR corpus |
A large-scale corpus of read English speech, approximately 1,000 hours in total, sampled at 16 kHz. |
2015 |
Access Link |
Cite |
The Johns Hopkins University |
ASVspoof 2015 |
The dataset comprises genuine speech from 106 speakers and nearly 200,000 spoofed speech samples. |
2015 |
Access Link |
Cite |
University of Edinburgh |
The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus |
Contains 6,300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. |
1993 |
Access Link |
Cite |
NIST |