# Awesome Speech Dataset

This repository provides information and resources related to open-source speech datasets. These datasets offer diverse, high-quality speech data covering domains such as conversational, academic, and political speech. They are widely used for tasks like automatic speech recognition (ASR), speaker identification, emotion recognition, and other speech processing applications.



## Table of Contents

- [Tables of Datasets](#tables-of-datasets)
- [Hub / Database / Library](#hub--database--library)
- [References](#references)
## Tables of Datasets

### 2023-25

| Index | Dataset | Download | Multilingual | Source | Version | Description |
|---|---|---|---|---|---|---|
| 1 | Casual Conversations | | | | | |
| 2 | Common Voice | Common Voice | True | Mozilla Foundation | 21 | Massive multilingual, crowd-sourced speech corpus with 20,408+ hours across 124 languages (CC0 licensed). |
| 3 | Emilia | | | | | |
| 4 | Vibravox | | | | | |
| 5 | VoxBlink | | | | | |
| 6 | VoxTube | | | | | |
| 7 | YODAS | | | | | |
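Common Voice releases can also be pulled from the Hugging Face Hub. The snippet below is a minimal sketch, assuming the `datasets` library, a logged-in account that has accepted the dataset's terms, and the release id `mozilla-foundation/common_voice_17_0` (one published release, not necessarily the version 21 cited above); the language config `"tr"` is likewise just an example.

```python
# Minimal sketch: stream a Common Voice release from the Hugging Face Hub.
# Assumptions: you are logged in (huggingface-cli login) and have accepted the
# dataset's terms; script-based releases need a `datasets` version that still
# supports loading scripts.
from datasets import load_dataset

cv = load_dataset(
    "mozilla-foundation/common_voice_17_0",  # assumed release id
    "tr",                                    # assumed language config
    split="train",
    streaming=True,                          # iterate without a full download
    trust_remote_code=True,                  # this release uses a loading script
)

for example in cv.take(3):
    audio = example["audio"]                 # dict with "array" and "sampling_rate"
    print(example["sentence"], audio["sampling_rate"])
```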

### Timeless

| Index | Dataset | Download | Multilingual | Source | Version | Paper | Interspeech | Description |
|---|---|---|---|---|---|---|---|---|
| 1 | AMI Corpus | AMI Corpus | | University of Edinburgh | | Recognition and Understanding of Meetings: The AMI and AMIDA Projects | | The AMI Corpus is a publicly available 100-hour multimodal dataset of English four-person meetings recorded in instrumented rooms with synchronized audio, video, and pen/whiteboard streams, richly annotated for orthographic transcripts, dialogue acts, topic segmentation, summarization, named entities, gestures, and more. |
| 2 | ATIS (Air Travel Information System) | | | | | | | |
| 3 | CHiME | CHiME-6 | | University of Sheffield | 6 | CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings | True | A series of datasets focusing on speech in noisy environments (streets, cafés, homes). Includes CHiME-4 and CHiME-5/6, used for robust, far-field ASR research. |
| 4 | Europarl (European Parliament Proceedings Parallel Corpus) | | | | | | | |
| 5 | IEMOCAP | | | | | | | |
| 6 | LibriSpeech | LibriSpeech ASR corpus | | Johns Hopkins University | | LibriSpeech: An ASR Corpus Based on Public Domain Audio Books | | LibriSpeech is a 1,000-hour read English speech corpus derived from public-domain audiobooks, freely available under a CC BY 4.0 license for training and evaluating automatic speech recognition systems. |
| 7 | LibriVox | The LibriVox Free Audiobook Collection | | Hugh McGuire & Worldwide Volunteers | | | | LibriVox is a volunteer-driven project founded in 2005 to make all public domain books freely available in audio format, with recordings read and shared by volunteers worldwide. |
| 8 | Speech Commands | torchaudio.datasets.SPEECHCOMMANDS | | Google | 2 | Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition | | The Speech Commands dataset is a publicly available collection of one-second English audio clips of 35 distinct spoken words, designed to train and benchmark small-footprint, on-device keyword-spotting models. |
| 9 | VoxCeleb | VoxCeleb | True | University of Oxford | 2 | VoxCeleb2: Deep Speaker Recognition | True | Over 1 million utterances from 6,112 speakers (~2,442 hours) for state-of-the-art speaker recognition research. |
| 10 | MUSAN | | | | | | | |
| 11 | VCTK (CSTR VCTK Corpus) | | | | | | | |
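Two of the Download entries above resolve to `torchaudio` built-ins. The sketch below assumes `torchaudio` is installed and uses `./data` as a hypothetical scratch directory:

```python
# Minimal sketch: load LibriSpeech and Speech Commands via torchaudio's
# built-in dataset classes (both fetch the data on first use into ./data).
import torchaudio

# LibriSpeech items: (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
librispeech = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)
waveform, sample_rate, transcript, *_ = librispeech[0]
print(sample_rate, transcript[:60])

# Speech Commands v2 items: (waveform, sample_rate, label, speaker_id, utterance_number)
commands = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True)
waveform, sample_rate, label, *_ = commands[0]
print(sample_rate, label)
```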

### Other (by year)

#### 1995-2000

#### 2000-2005

#### 2005-2010

#### 2010-2015

| Index | Dataset | Main Task | Sub-Task(s) | Multilingual | Source | Year | Derived |
|---|---|---|---|---|---|---|---|
| 1 | aGender | | | | Deutsche Telekom AG Laboratories | 2010 | |

1. **aGender**

   Published in 2010 by researchers at Deutsche Telekom Laboratories, this paper presents a 47-hour corpus of German telephone speech whose 954 speakers are carefully balanced across seven age-and-gender categories (children, young, middle-aged, and senior speakers; male/female where applicable). Collected through six mobile-phone calls per participant and augmented by a 659-speaker "VoiceClass" IVR set, the database provides short commands, dates, numbers, and free responses that mirror real customer-service dialogs. Its primary purpose is to train and evaluate automatic age and gender detection for voice portals, but the authors also foresee secondary uses such as persona adaptation in dialog systems, target-group advertising, market research, and game-style applications. As a purpose-built, monolingual German resource that is nevertheless methodologically compatible with SpeechDat, it supplies the balanced demographic coverage that earlier corpora lacked, enabling more reliable speaker-classification research and an open challenge for the community. Paper: *A Database of Age and Gender Annotated Telephone Speech*.

#### 2015-2020

| Index | Dataset | Main Task | Sub-Task(s) | Multilingual | Source | Year | Derived |
|---|---|---|---|---|---|---|---|
| 1 | Arabic Speech Corpus | Speech Synthesis | | | University of Southampton | 2016 | |
| 2 | AudioSet | | | | Google Inc. | 2017 | |
| 3 | AESDD | Speech Emotion Recognition (SER) | | | Aristotle University of Thessaloniki | 2018 | |

1. **Arabic Speech Corpus**

   The Arabic Speech Corpus (≈1.5 GB) is an original Modern Standard Arabic dataset created at the University of Southampton by PhD researcher Nawar Halabi for high-quality text-to-speech synthesis. Comprising over 3.7 hours of professionally recorded Damascene-accent speech, it supplies time-aligned phoneme-level labels and explicit word-stress annotations, allowing researchers not only to build natural-sounding Arabic TTS voices (as already demonstrated) but also to explore auxiliary tasks such as phoneme segmentation, stress and prosody modelling, and other Arabic speech studies. The recordings are newly produced in-house, so the corpus is not derived from any prior dataset, and it is monolingual Arabic. Paper: *Modern Standard Arabic Phonetics for Speech Synthesis*.

2. **AudioSet**

   Presented at ICASSP 2017 by researchers at Google, Audio Set delivers a 4,971-hour corpus of more than 1.78 million human-labeled 10-second YouTube excerpts that span 632 sound-event categories arranged in a carefully designed six-level ontology. Built expressly to push the state of large-scale audio-event recognition, the dataset provides a balanced train/test split and a baseline benchmark, while its hierarchical labels support research on multi-label classification, ontology-aware learning, and broader acoustic-scene analysis. Because the clips are drawn from public YouTube content, the collection is inherently language-agnostic (speech in many languages appears, but language itself is not annotated), making the resource suitable for any task where the acoustic signature, rather than linguistic content, is paramount. Audio Set therefore fills for sound the role that ImageNet played for vision, giving the community an open, comprehensive foundation from which to train and compare high-performance audio recognition systems (the labels ship as CSV segment lists; see the parsing sketch after this list). Paper: *Audio Set: An ontology and human-labeled dataset for audio events*.

3. **AESDD**

   Published in 2018 in the Journal of the Audio Engineering Society, the study from Aristotle University of Thessaloniki presents the Acted Emotional Speech Dynamic Database (AESDD), an acted-speech corpus recorded for core speech-emotion recognition (SER) research and benchmarked against the English-language SAVEE set. AESDD was initially built in Greek, but the authors stress that the "dynamic" repository is already being enriched with new recordings in both Greek and English, enabling broader cross-lingual experiments while remaining predominantly Greek at this stage. Beyond supplying training data, the paper demonstrates live SER that can drive emotion-aware stage lighting, interactive actor training, and audience-engagement tools, as well as automated archiving and retrieval of theatrical performances, illustrating the dataset's dual technical and creative value. Paper: *Speech Emotion Recognition for Performance Interaction*.
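AudioSet is distributed as CSV files of labeled YouTube segments rather than raw audio. The sketch below parses that layout; the file name comes from the official release, but treat the exact format (leading comment lines, then a quoted comma-separated label field) as an assumption to verify against the copy you download.

```python
# Sketch: parse an AudioSet segments CSV into (ytid, start, end, labels) tuples.
# Assumed row layout: YTID, start_seconds, end_seconds, "label_id,label_id,..."
import csv

def read_audioset_segments(path):
    """Yield (youtube_id, start_s, end_s, [label_ids]) from a segments CSV."""
    with open(path, newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith("#"):
                continue  # skip leading comment/header lines
            ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
            yield ytid, start, end, labels.split(",")

# Usage against the file shipped with the release:
# for ytid, start, end, labels in read_audioset_segments("balanced_train_segments.csv"):
#     print(ytid, start, end, labels)
```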

#### 2020-2025

| Index | Dataset | Main Task | Sub-Task(s) | Multilingual | Source | Year | Derived |
|---|---|---|---|---|---|---|---|
| 1 | Att-HACK | Speech Synthesis | Speech Emotion Recognition (SER) | | STMS-Lab (IRCAM / CNRS / Sorbonne Université) | 2020 | |
| 2 | AliMeeting | Automatic Speech Recognition (ASR) | Speaker Verification, Speech Enhancement, Speech Separation, Speech Segmentation, Overlap Handling | | Alibaba Group Speech Lab (China & Singapore), Beijing Shell Tech, AISHELL Foundation | 2022 | |
| 3 | Audiocite | Automatic Speech Recognition (ASR) | Speaker Verification | | Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG | 2024 | audiocite.net |

1. **Att-HACK**

   Att-HACK is a freely available French database released in 2020 by the STMS-Lab (IRCAM / CNRS / Sorbonne Université) to push research on expressive speech and social attitudes beyond the usual "basic-emotion" corpora. It contains roughly 30 hours of studio-quality speech in which 25 actors each read 100 short French sentences while deliberately portraying four interpersonal attitudes (friendly, seductive, dominant, and distant), with 3-5 prosodically varied repetitions per sentence, yielding more than 22,000 utterances so far. Besides waveform audio, the release includes orthographic text, forced phonetic alignments, and raw F0 tracks. While primarily intended for modelling and synthesising prosody in socially expressive TTS systems, the rich, repeated renditions also suit side tasks such as social-attitude or emotion recognition, prosody analysis, and neutral-to-expressive voice conversion. The corpus is entirely original (not derived from earlier data) and is monolingual French, distributed for academic use under a Creative Commons licence. Paper: *Att-HACK: An Expressive Speech Database with Social Attitudes*.

2. **AliMeeting**

   The M2MeT Challenge, introduced at ICASSP 2022, provides the first large-scale public benchmark for multi-channel, multi-speaker Mandarin meeting transcription. Developed jointly by Alibaba Group's Speech Labs, Beijing Shell Tech, and the AISHELL Foundation, the release centres on AliMeeting, a 120-hour corpus of real meetings recorded with an 8-microphone circular array plus parallel headset tracks, covering 2-4 participants, diverse rooms, and a high (≈42%) speech-overlap ratio (a small sketch of how such a ratio can be computed follows this list). The challenge defines two core tasks, speaker diarization and multi-speaker automatic speech recognition (ASR), with sub-tracks that either restrict or allow external data (notably AISHELL-4 and CN-Celeb). By supplying time-aligned transcriptions, per-speaker headset audio, and far-field array recordings, the dataset also supports front-end enhancement and separation research. Although comparable meeting corpora exist for English, M2MeT fills a critical gap for Mandarin, enabling reproducible research on real-world meeting processing and advancing "who-spoke-what-when" technology in Chinese. Paper: *M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge*.

3. **Audiocite**

   Published in 2024, *Audiocite.net: A Large Spoken Read Dataset in French* introduces a 6,682-hour corpus of volunteer-read audiobooks harvested from the Audiocité platform by researchers at Université Grenoble Alpes / CNRS / Grenoble INP / LIG. Created to fuel self-supervised pre-training for French speech models, the monolingual French dataset also supports downstream tasks such as ASR and speaker verification and, despite lacking transcriptions, is suitable for topic modelling, signal reconstruction, and speech synthesis research. Entirely sourced from Creative Commons licensed audiobooks, Audiocite.net fills the size gap between English and French resources and has already been used to boost the 14k-hour LeBenchmark models, demonstrating its practical impact on French speech technology. Paper: *Audiocite.net: A Large Spoken Read Dataset in French*.
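To make the AliMeeting overlap figure concrete, here is a generic sketch of one common definition of overlap ratio (overlapped speech time divided by total speech time), computed from diarization-style segments. It is illustrative only, not the M2MeT challenge's official tooling, and definitions vary across papers.

```python
# Generic sketch: fraction of speech time during which 2+ speakers are active.
def overlap_ratio(segments):
    """segments: iterable of (start_s, end_s, speaker) tuples."""
    events = []
    for start, end, _speaker in segments:
        events.append((start, 1))   # a speaker turns on
        events.append((end, -1))    # a speaker turns off
    events.sort()                   # ends sort before starts at equal times

    active = 0
    speech = overlap = 0.0
    prev_time = None
    for time, delta in events:
        if prev_time is not None and active > 0:
            span = time - prev_time
            speech += span          # someone was speaking in this span
            if active > 1:
                overlap += span     # two or more speakers at once
        active += delta
        prev_time = time
    return overlap / speech if speech else 0.0

# Toy example: A speaks 0-5 s, B speaks 3-8 s -> 2 s overlap / 8 s speech = 0.25
print(overlap_ratio([(0.0, 5.0, "A"), (3.0, 8.0, "B")]))
```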


  • AudioMNIST
  • BAVED
  • BibleTTS
  • CALLHOME American English Speech
  • Café
  • ClovaCall
  • CML-TTS
  • CMU-MOSEI
  • CN-CELEB
  • Common Phone
  • Coswara
  • CoVoST
  • CoVoST2
  • CVSS
  • DAPS
  • DCASE 2014
  • DEEP-VOICE
  • DEMoS
  • Earnings-21
  • EasyCom
  • Europarl-ST
  • EMOVO
  • Emo-DB
  • EmoSynth
  • EmoV-DB
  • EPIC-KITCHENS-100
  • EPIC-SOUNDS
  • EMNS
  • EmoFilm
  • eNTERFACE05
  • Fisher English Training Speech
  • Flickr Audio Caption Corpus
  • FMFCC-A
  • Free Spoken Digit Dataset
  • FSD50K
  • GEMEP corpus
  • GigaST
  • Golos
  • Hi-Fi TTS (Hi-Fi Multi-Speaker English TTS Dataset)
  • HowTo100M
  • Hume-VB
  • HumBug Zooniverse
  • IBM Voicemail Corpus
  • ICSI Corpus
  • IISc-MILE Kannada ASR Corpus
  • IISc-MILE Tamil ASR Corpus
  • InfantMarmosetsVox
  • Infobip AMD
  • Interview
  • ISOLET
  • JL corpus
  • KazakhTTS
  • KSC (Kazakh Speech Corpus)
  • Keio-ESD
  • Kosp2e
  • LEGO Spoken Dialogue Corpus
  • Libri-Adapt
  • Libri-Mixed-Speakers
  • LibriCSS
  • LibriMix
  • LibriTTS
  • LibriTTS-R
  • LJSpeech
  • LJSpeech-1.1
  • MaSS
  • MeerKAT: Meerkat Kalahari Audio Transcripts
  • Mini LibriSpeech
  • MobvoiHotwords
  • Mohammed
  • MOSI
  • MRDA (ICSI Meeting Recorder Dialog Act Corpus)
  • MSP Podcast Corpus
  • MSNER
  • Mudestreda (Mudestreda Multimodal Device State Recognition Dataset)
  • Multimodal PISA (Multimodal Piano Skills Assessment)
  • MuSe-CAR
  • Nepali Text-to-Speech Data (Male and Female)
  • NTIMIT
  • OGVC
  • ParlamentParla
  • PartialSpoof
  • PC-GITA
  • PCVC (Persian Consonant Vowel Combination)
  • PodcastFillers
  • PromptSpeech
  • PromptTTS
  • Puebla-Nahuatl
  • RECOLA
  • ReefSet
  • Respiratory and Drug Actuation Dataset
  • ReVerb
  • RuLS (Russian LibriSpeech)
  • Samrómur Mimic 22.09
  • SASPEECH
  • SAVEE
  • SEWA
  • SEMAINE
  • SEOUL CORPUS
  • SHALCAS22A
  • ShEMO
  • Silbo Gomero Speech Corpus
  • SINGA:PURA (SINGApore: Polyphonic URban Audio)
  • SingFake
  • SLUE
  • SparseLibriMix
  • SpeechMatrix
  • Speech Accent Archive
  • Speech Wikimedia
  • SPEECH-COCO
  • Speech-MASSIVE
  • Spiking Heidelberg Digits (SHD)
  • Spiking Speech Commands (SSC)
  • SPGISpeech
  • Spotify Podcast Dataset
  • Spoken-SQuAD
  • TAU Urban Acoustic Scenes 2019
  • TAU-NIGENS Spatial Sound
  • Tatoeba
  • TESS
  • THCHS-30
  • Thorsten-Voice
  • MC Speech Dataset
  • TIMIT
  • TUDA
  • UGIF
  • VCTK-2Mix
  • VGG-Sound
  • VGGSound-Sparse
  • VIVAE
  • VocalSound
  • VOICES
  • Yoloxóchitl-Mixtec
  • YouTube-8M
  • Wavix Voicemail
  • WHAMR!
  • Wikimedia Commons
  • XBMU-AMDO31
  • Zeroth-Korean
  • DeToxy
  • EasyCall
  • REAL-M
  • RTASC
  • ReMASC
  • Talking With Hands 16.2M
  • Timers and Such
  • ASR-GLUE
  • EMOVIE
  • LibriVoxDeEn
  • NusaCrowd
  • RESD
  • SpokenSTS
  • TaL Corpus (The Tongue and Lips Corpus)
  • AV Digits Database
  • BD-4SK-ASR
  • CI-AVSR
  • JVS-MuSiC
  • LaboroTVSpeech
  • MASRI-HEADSET
  • MAVS
  • NPSC
  • MultiSV
  • NeuroVoz
  • RyanSpeech
  • SDN (Situated Dialogue Navigation)
  • AVA-Speech
  • AVASpeech-SMAD
  • Arabic Speech Commands Dataset
  • DR-VCTK
  • EVI
  • EmoSpeech
  • FT Speech
  • Greek Parliament Proceedings
  • JSS Dataset (Jejueo Single Speaker Speech)
  • THVD (Talking Head Video Dataset)
  • Kinect-WSJ
  • LibriS2S
  • MediBeng
  • Persian Preschool Cognition Speech
  • Quechua-SER
  • RUSLAN
  • VedantaNY-10M
  • MCCSD (Mandarin Chinese Cued Speech Dataset)
  • TurkicASR
  • UrbanSound8K
  • CMUARCTIC
  • QUESST 2014
  • SNIPS
  • YESNO
  • AccentDB
  • Free Spoken Digit Dataset (FSDD)
  • Libri-Light
  • LRS3-TED
  • CAS-VSR-W1k (LRW-1000)
  • GLips
  • DIRHA
  • BERSt
  • CANDOR
  • MSP-Podcast
  • EmoDB
  • LSSED
  • Doc2Dial
  • Switchboard-1
  • CPED (Chinese Personalized and Emotional Dialogue)
  • LRW (Lip Reading in the Wild)
  • CSS10
  • iKala
  • FKD (Football Keywords Dataset)
  • mDRT
  • BABEL Speech Corpus
  • WiLI-2018
  • Common Language
  • NLI-PT
  • FUSS (Free Universal Sound Separation)
  • Auto-KWS
  • AVMIT (Audiovisual Moments in Time)
  • Lingala Read Speech Corpus
  • Congolese Speech Radio Corpus
  • Zambezi Voice
  • Friends-MMC
  • Laboro-ASV (LaboroTVSpeech-ASV)
  • CAVES (Cantonese Audio-Visual Emotional Speech)
  • BANSpEmo
  • MDER
  • EMOVOME
  • Spanish MEACorpus 2023
  • LibriheavyMix
  • Echo2Mix
  • RATS Low Speech Density
  • BhasaAnuvaad
  • AVMuST-TED
  • RoDia
  • NLSpeech
  • Balinese TTS
  • Rasa
  • IndicVoices-R
  • RASwDA (Re-Aligned Switchboard Dialog Act Corpus)
  • MOCKS
  • WenetPhrase
  • MDSC
  • LIP-RTVE
  • SlideAVSR
  • OLKAVS
  • AVA Datasets
  • DipCo (Dinner Party Corpus)
  • Samanantar
  • SEP-28k (Stuttering Events in Podcasts)
  • GUM
  • speechocean762
  • MagicData-RAMC
  • SwissDials
  • Europarl-ASR
  • Vāksañcayaḥ (Sanskrit Speech Corpus by IIT Bombay)
  • ADIMA
  • Samrómur L2 22.09
  • MediaSpeech
  • Totonac Resources
  • ASCEND
  • NISP
  • NISQA Speech Quality Corpus
  • Silent Speech EMG
  • VESUS
  • DDS (Device-Degraded Speech)
  • WSJ0-2mix
  • VoxForge
  • VOCASET
  • JVS corpus
  • GRID
  • CMU Wilderness Multilingual Speech Dataset
  • MuST-C
  • LRS2 (Lip Reading Sentences 2)
  • MELD (Multimodal EmotionLines Dataset)
  • MSP-IMPROV
  • CREMA-D
  • RAVDESS
  • AVA (Atomic Visual Actions)
  • Fluent Speech Commands
  • MIR Corpora
  • NIST SRE (SRE Data)
  • SITW
  • DIHARD
  • Voicebank DEMAND
  • SLURP
  • CMUDict
  • Switchboard Dialog Act Corpus (SwDA)
  • SGD (Schema-Guided Dialogue)
  • AVSpeech
  • MIT (Moments in Time Dataset)
  • Multilingual LibriSpeech (MLS)
  • AISHELL (4)
  • ESD (Emotional Speech Database)
  • WenetSpeech
  • BEAT (Body-Expression-Audio-Text)
  • BSTC (Baidu Speech Translation Corpus)
  • SOMOS
  • DAPS (Device and Produced Speech)
  • GigaSpeech
  • MS-SNSD (Microsoft Scalable Noisy Speech Dataset)
  • Multilingual TEDx
  • People's Speech
  • Spoken Wikipedia Corpora
  • TED-LIUM
  • VoxConverse
  • VoxPopuli
  • WHAM!
  • Clarin-PL EMU (Studio Corpus)
  • Turkish Speech Corpus
  • Multilingual Spoken Words Corpus
  • Turkish Neural Voice (turkishvoicedataset)
  • VOTE400
  • M-AILABS Speech Dataset
  • FLEURS
  • Czech Parliament Plenary
  • SIWIS French Speech Synthesis Database
  • MELD-ST
  • ETHOS
  • Skit-S2I
  • DailyTalk
  • RedPen
  • ASED (Amharic Speech Emotion Dataset)
  • GreThE
  • HERDPhobia
  • ASMDD (Arabic Speech Mispronunciation Detection Dataset)
  • TEET
  • PodcastMix
  • NHSS
  • HateXplain
  • KeSpeech
  • BembaSpeech
  • Crowd-Sourced Speech Corpora
  • EVBCorpus
  • Modality Corpus
  • SDS-200
  • Lahjoita Puhetta
  • MDCC (Multi-Domain Cantonese Corpus)
  • 3MASSIV
  • MGB
  • QASR
  • LRS2-BBC
  • LRS3-Lang
  • JSpeech
  • L2-ARCTIC
  • MyST Children's Conversational Speech
  • National Speech Corpus
  • DiDiSpeech
  • RVTE database
  • KsponSpeech
  • Fearless Steps
  • Bundestag
  • UserLibri
  • ReazonSpeech
  • Chinese Mandarin Lip Reading (CMLR)
  • ParlaSpeech-HR
  • VoxLingua107
  • JTubeSpeech
  • Primewords
  • ST-CMDS
  • NST Danish ASR Database
  • NST Swedish ASR Database
  • NST Norwegian ASR Database
  • NorGovPCC (The Norwegian Government Press Conference Speech Corpus)
  • ARU Speech Corpus
  • Althingi Parliamentary Speech Corpus
  • Pansori
  • ALFFA (African Languages in the Field: speech Fundamentals and Automation)
  • Hey Snips
  • ACAV100M
  • Mead
  • PACS
  • MAD
  • Speech2Gesture
  • VideoCC
  • DeepMine
  • BookTubeSpeech
  • CSSD
  • Carnatic Varnam Dataset
  • Clotho
  • CFAD: A Chinese Dataset for Fake Audio Detection
  • FestCat
  • USPDATRO
  • FPT Open Speech Dataset (FOSD) - Vietnamese
  • FOSD Female Speech Dataset
  • How2
  • KdConv
  • Libriheavy
  • MuAViC
  • RealMAN
  • WaveFake
  • DECRO
  • Chichewa
  • Middle East Technical University Turkish Microphone Speech
  • Turkish Broadcast News Speech and Transcripts
  • Apollo Corpus
  • Half-Truth
  • LaFresCat
  • Sagalee
  • SMIIP-TV dataset
  • Pragmatic Similarity Judgments
  • Kallaama
  • VietMed
  • Neural Audio Fingerprint Dataset
  • Jam-ALT
  • CAS-VSR-S101
  • CUCO Database
  • Emozionalmente
  • DreamVoice
  • AnglistikVoices
  • SpeechBrown
  • United-MedSyn
  • Watch Your Mouth: Point Clouds based Speech Recognition Dataset
  • InaGVAD
  • SONICS
  • FakeMusicCaps
  • Granary
  • OpenLID
  • GlotLID
  • MSR-86K
  • KazEmoTTS
  • KBES
  • Dusha
  • M5SER
  • Divide and Remaster v3 (DnR v3)
  • ITALIC
  • FalAI
  • TextrolSpeech
  • MMCSG
  • Lombard-GRID-2mix
  • MCAS
  • TextrolMix
  • Hi, KIA
  • KAN-AV
  • Facestar
  • RVTALL
  • AVE-Speech
  • MSceneSpeech
  • StoryTTS
  • Speak & Improve Corpus
  • Unsupervised People’s Speech
  • Helsinki Speech Challenge 2024 open audio dataset
  • nEMO
  • ODSS
  • TIMIT-TTS
  • BBS-S2T
  • SIFT-50M
  • MIVIA Speech Command
  • TunSwitch
  • DiffSSD
  • OOD-Speech
  • AS-70
  • DisfluencySpeech
  • Boli
  • SPIRE-SIES
  • NaturalVoices
  • ArmanTTS
  • KSoF (Kassel State of Fluency)
  • RIRs (Room Impulse Responses)
  • STAIR Captions
  • EmoSeC
  • RescueSpeech
  • ClArTTS
  • CORAAL
  • Audio-FLAN
  • VocalMind
  • GTSinger
  • Fair-speech Dataset
  • 3D-Speaker
  • EARS
  • EdAcc (Edinburgh International Accents of English Corpus)
  • ShiftySpeech
  • SlideSpeech
  • SpeechCraft
  • COVYT
  • CitySpeechMix



## Hub / Database / Library

### Open Source

### Closed Source

### Both Open and Closed Sources


## References

Note: Please download the Markdown file to view the full list of references
