New Speech Datasets: The Road to Robust Speech Recognition

Introducing new speech corpora for ML that can contribute to building an environment-invariant speech recognition model

Deeply
4 min read · Apr 16, 2021

Since the rise of artificial intelligence, research communities working on tasks such as speech recognition and speaker diarization have benefited greatly from the growing number of speech/audio datasets released every year.

However, a large portion of these datasets loses sight of essential properties of sound itself, such as reverberation and the distance from the source, so a speech recognition model trained solely on them is subject to substantial deterioration in the real world. Reverberant and far-field speech are the most typical sources of performance loss.
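To make that concrete, here is a minimal sketch (my own illustration, not part of the datasets) of how a clean recording can be turned into a reverberant, far-field-like one, assuming you have a mono clean waveform and a measured room impulse response; the file names are hypothetical.

```python
# Minimal sketch: simulate reverberation and distance on clean speech.
# Assumes mono 1-D signals at the same sample rate; file names are hypothetical.
import numpy as np
import soundfile as sf                  # pip install soundfile
from scipy.signal import fftconvolve    # pip install scipy

clean, sr = sf.read("clean_speech.wav")            # hypothetical clean recording
rir, _ = sf.read("room_impulse_response.wav")      # hypothetical measured RIR

# Reverberation: convolve the dry signal with the room impulse response.
reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]

# Crude distance effect: attenuate following an inverse-distance law.
distance_m = 3.0
far_field = reverberant / distance_m

# Normalize to avoid clipping, then save the degraded version.
far_field /= np.max(np.abs(far_field)) + 1e-9
sf.write("simulated_far_field.wav", far_field, sr)
```

A model that holds up on audio like this is far more likely to survive a living room than one evaluated only on close-talk studio recordings.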

Deploying a machine learning model without robustness is like driving a car on thin ice.

To help you cope with these robustness issues in your speech/audio tasks, Deeply Inc., a sound AI start-up, has collected new speech datasets with some unique characteristics:

0. Things that were considered for both datasets

  • Reverberation in various types of rooms
  • Distance between the receiver and the source
  • Recording device (receiver)

1. Parent-Child Interaction Dataset [GitHub Link]

  • Unique subject: Vocal interaction between a parent and his/her child
  • Unique features:
    - Developmental linguistics
    - Spontaneous utterances

2. Korean Read Speech Corpus [GitHub Link]

  • Unique subject: Utterances with differing valence in text and voice
  • Unique features:
    - Agreement and disagreement between text sentiment and voice sentiment

The Recording Environment

The recording environment shared by both datasets, designed with robustness in mind, is as follows.

The recordings took place in 3 different types of room, an anechoic chamber, a studio apartment, and a dance studio, whose levels of reverberation differ. To examine the effect of the receiving device and of the distance between the receiver and the source, every session was recorded with 6 different receivers: 2 types of smartphones (44,100 Hz, 16-bit, linear PCM, 1 channel) placed at 3 different distances from the source.
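For illustration only, the condition matrix described above can be laid out as follows; the room, device, and distance labels are hypothetical placeholders, not the identifiers used in the released files.

```python
# Minimal sketch of the recording-condition matrix: 3 rooms x 2 smartphone
# models x 3 source-receiver distances = 18 conditions per utterance.
# Labels are hypothetical placeholders.
from itertools import product

rooms = ["anechoic_chamber", "studio_apartment", "dance_studio"]
devices = ["smartphone_a", "smartphone_b"]   # 44,100 Hz, 16-bit, linear PCM, mono
distances = ["near", "middle", "far"]        # 3 distances from the source

conditions = list(product(rooms, devices, distances))
print(len(conditions))                       # 18
for room, device, distance in conditions[:3]:
    print(room, device, distance)
```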

Actual recording environments: Anechoic chamber, dance studio, studio apartment, respectively.

Parent-Child Interaction Dataset

The Parent-Child Interaction Dataset consists of almost 300 hours of audio clips of various types of vocal interactions between a parent and his/her child: reading fairy tales, singing children's songs, lullabies, conversing, crying, and others. The dataset also includes metadata such as the sex and age of the speakers and the presence of noise.

Child subjects were limited to 6 years old or younger; as a result, the majority of the parents are in their 30s. Due to the limited recording time, the families were given no specific instructions about the interaction other than its total duration.
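As a rough example of how such metadata might be used, the sketch below filters clips by speaker and noise labels; the file name, JSON layout, and field names ("speaker", "age", "noise") are assumptions made for illustration, so check the GitHub repository for the actual schema.

```python
# Minimal sketch: select clips via (hypothetical) metadata fields.
import json

with open("parent_child_metadata.json", encoding="utf-8") as f:   # hypothetical file
    metadata = json.load(f)   # assumed: {clip_id: {"speaker": ..., "age": ..., "noise": ...}}

# e.g. keep noise-free clips in which the child (age 4 or younger) is speaking
selected = [
    clip_id
    for clip_id, info in metadata.items()
    if info["speaker"] == "child" and info["age"] <= 4 and not info["noise"]
]
print(f"{len(selected)} clips selected")
```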

The dataset is expected to be useful for speaker diarization between parents and children, child development studies, and much more.

Korean Read Speech Corpus

The Korean Read Speech Corpus consists of almost 300 hours of audio clips of read speech, with scripts categorized into 3 text sentiments and readings performed in 3 voice sentiments. The sentiments are grouped into 'negative', 'neutral', and 'positive'. Each line of the script, carrying its own text sentiment, was vocalized three times, each time with a different voice sentiment.

As a result, you can observe 9 combinations of text and voice sentiment, which lets you investigate the discrepancy between these two orthogonal dimensions. For example, you could characterize a speaker as sarcastic by taking a closer look at samples with a positive text sentiment and a negative voice sentiment.
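As a sketch of how one might slice the corpus along those two dimensions, the snippet below counts clips per (text sentiment, voice sentiment) pair; the metadata layout and field names ("text_sentiment", "voice_sentiment") are assumptions for illustration rather than the corpus's documented schema.

```python
# Minimal sketch: tally the 9 text/voice sentiment combinations.
import json
from collections import Counter

with open("read_speech_metadata.json", encoding="utf-8") as f:    # hypothetical file
    metadata = json.load(f)   # assumed: {clip_id: {"text_sentiment": ..., "voice_sentiment": ...}}

pair_counts = Counter(
    (info["text_sentiment"], info["voice_sentiment"]) for info in metadata.values()
)

# Positive text read with a negative voice: a rough proxy for sarcasm.
print(pair_counts[("positive", "negative")])
```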

Detailed information, including further statistics and audio samples, is available in the official GitHub repositories:
- Parent-Child Interaction Dataset
- Korean Read Speech Corpus

Journalist: Hongseok Oh, Deeply Inc.
We give meaning to sound, Deeply Inc.
