New Speech Datasets: The Road to Robust Speech Recognition
Introducing new speech corpora for ML that can contribute to building an environment-invariant speech recognition model
Since the rise of artificial intelligence, research communities working on tasks such as speech recognition and speaker diarization have benefited greatly from the growing number of speech/audio datasets released every year.
However, a large portion of them overlook essential properties of sound itself, such as reverberation and distance from the source, so a speech recognition model trained only on such data is subject to substantial deterioration in the real world. Reverberant and far-field speech are the most typical sources of this performance loss.
To help you cope with these robustness issues in speech/audio tasks, Deeply Inc., a sound AI start-up, has collected new speech datasets with unique characteristics:
0. Considerations common to both datasets
- Reverberation in various types of rooms
- Distance between the receiver and the source
- Recording device (receiver)
1. Parent-Child Interaction Dataset [GitHub Link]
- Unique subject: Vocal interaction between a parent and his/her child
- Unique features:
- Developmental linguistics
- Spontaneous utterances
2. Korean Read Speech Corpus [GitHub Link]
- Unique subject: Utterances with differing sentiment in the text and the voice
- Unique features:
- Accordance and discordance of the text and voice sentiments
The Recording Environment
The recording environments common to both datasets, designed with robustness in mind, are as follows.
The recordings took place in 3 different types of rooms with differing levels of reverberation: an anechoic chamber, a studio apartment, and a dance studio. To examine the effect of the receiving device and the distance between the receiver and the source, every recording was captured with 6 different receivers: 2 types of smartphones (44100 Hz, 16-bit, linear PCM, 1 ch) placed at 3 different distances from the source.
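The recording-condition grid can be summarized with a minimal sketch like the one below. The room, device, and distance labels are illustrative assumptions; the released metadata may use different names.

```python
from itertools import product

# Illustrative labels for the conditions described above (not the official ones).
ROOMS = ["anechoic_chamber", "studio_apartment", "dance_studio"]  # differing reverberation
DEVICES = ["smartphone_a", "smartphone_b"]                        # 44100 Hz, 16-bit linear PCM, 1 ch
DISTANCES = ["near", "mid", "far"]                                # 3 source-receiver distances

# Within a single room, each utterance is captured by 6 receivers
# (2 devices x 3 distances); across the 3 room types this gives 18 conditions.
receivers_per_room = list(product(DEVICES, DISTANCES))
all_conditions = list(product(ROOMS, DEVICES, DISTANCES))
print(len(receivers_per_room), "receivers per room,",
      len(all_conditions), "room/receiver conditions in total")
```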
Parent-Child Interaction Dataset
The Parent-Child Interaction Dataset consists of almost 300 hours of audio clips covering various types of vocal interaction between a parent and his/her child: reading fairy tales, singing children's songs, lullabies, conversing, crying, and others. The dataset also includes metadata such as the sex and age of the speakers and the presence of noise.
The child participants were limited to those aged 6 or younger; as a result, the majority of the parents are in their 30s. Due to time constraints, the participants were not given any specific instructions about the interaction other than its total duration.
The dataset is expected to be useful for speaker diarization between parents and children, child development studies, and much more.
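As a minimal sketch of how the metadata could be used, the snippet below filters clips down to a child-speech subset. The file name and column names (speaker_role, age, noise, clip_id, interaction_type) are assumptions for illustration; check the GitHub repository for the actual schema.

```python
import pandas as pd

# Hypothetical per-clip metadata table; the released dataset may differ.
meta = pd.read_csv("parent_child_metadata.csv")

# Keep clips spoken by children aged 6 or under and recorded without noise,
# e.g. to build a clean child-speech subset for diarization experiments.
child_clips = meta[(meta["speaker_role"] == "child")
                   & (meta["age"] <= 6)
                   & (meta["noise"] == "none")]
print(child_clips[["clip_id", "interaction_type"]].head())
```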
Korean Read Speech Corpus
The Korean Read Speech Corpus consists of almost 300 hours of audio clips of read speech, with scripts categorized into 3 text sentiments and recordings delivered in 3 voice sentiments. The sentiments were grouped into 'negative', 'neutral', and 'positive'. Each line of the script, carrying its own text sentiment, was vocalized three times, each time with a different voice sentiment.
As a result, you can observe 9 combinations of text and voice sentiment, which allow you to investigate the discrepancy between these two orthogonal dimensions. For example, you can characterize sarcasm by taking a closer look at samples with positive text sentiment and negative voice sentiment.
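A minimal sketch of selecting one of the nine combinations is shown below. The file name and the columns text_sentiment and voice_sentiment are assumptions; the released corpus may label them differently.

```python
import pandas as pd

# Hypothetical per-clip metadata table; see the GitHub repository for the real schema.
meta = pd.read_csv("korean_read_speech_metadata.csv")

# Discordant pair: positive wording delivered with a negative voice,
# the combination the text above associates with sarcasm-like speech.
sarcastic_like = meta[(meta["text_sentiment"] == "positive")
                      & (meta["voice_sentiment"] == "negative")]
print(len(sarcastic_like), "clips with positive text / negative voice")
```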
Detailed information, including more statistics and audio samples, is available in the official GitHub repositories:
- Parent-Child Interaction Dataset
- Korean Read Speech Corpus
Journalist: Hongseok Oh, Deeply Inc.
We give meaning to sound, Deeply Inc.