A New Audio Dataset: Nonverbal Vocalization, the Key to Communication

Introducing a brand-new audio dataset of human nonverbal vocalizations for machine learning. Nonverbal vocalization has often been neglected, yet it turns out to be the hidden gem of communication.

Apr 16, 2021

In an endeavor to make machines understand human language, the research community has devoted itself to solving many problems, including speech recognition, spoken language understanding, and natural language understanding (e.g. SQuAD, CHiME-6). Every year, it sets new state-of-the-art records on many of them.

Although we’re on the cutting edge of machine listening and reading comprehension, verbal content explains only a limited part of interpersonal communication. Most of the time, nonverbal cues are the hidden gem of communication.

The nonverbal cue is the key to success!

However, nonverbal communication has so many components that it’s almost impossible to list them all: tone of voice, eye contact, gestures, posture, facial expression, and the list goes on.

Since we’re an AI start-up focusing on sound, we decided to concentrate on the vocal components of nonverbal communication, and we realized that there just wasn’t enough data. Most existing datasets were either too small or of questionable validity, because they were web-scraped and their labels were never verified.

So, we went through all the hassle for you! We collected a dataset of human nonverbal vocalization sounds[GitHub Link], covering 16 sounds, such as coughing and laughing, that we expected to carry a lot of information. Here is what sets the nonverbal vocalization dataset apart from other datasets:

  • Large volume: ~60 hours
  • Various types of sound: 16 types of human nonverbal vocalization
  • Language-independent subject matter
  • 100% human validation

What’s Inside the Dataset?

The Nonverbal Vocalization Dataset consists of almost 60 hours of short clips recorded by 1,419 members of the general public in South Korea. The 16 classes are ‘teeth-chattering’, ‘teeth-grinding’, ‘tongue-clicking’, ‘nose-blowing’, ‘coughing’, ‘yawning’, ‘throat clearing’, ‘sighing’, ‘lip-popping’, ‘lip-smacking’, ‘panting’, ‘crying’, ‘laughing’, ‘sneezing’, ‘moaning’, and ‘screaming’. The dataset also contains metadata such as the age and sex of the speakers.
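If you want to train a classifier on these 16 classes, a first step is usually to map each class name to an integer label. Here is a minimal Python sketch of that idea. The class names are taken from the list above, but their exact spelling and order in the released files are assumptions, and the names `CLASSES`, `LABEL_TO_INDEX`, `encode_label`, and `decode_label` are hypothetical helpers made up for this example, so check the official GitHub for the actual label format.

```python
# The 16 classes as listed in this post. The exact spelling and
# ordering used in the released dataset files is an assumption.
CLASSES = [
    "teeth-chattering", "teeth-grinding", "tongue-clicking", "nose-blowing",
    "coughing", "yawning", "throat clearing", "sighing",
    "lip-popping", "lip-smacking", "panting", "crying",
    "laughing", "sneezing", "moaning", "screaming",
]

# Class name -> integer index, in the order listed above.
LABEL_TO_INDEX = {name: index for index, name in enumerate(CLASSES)}


def encode_label(name: str) -> int:
    """Return the integer index for a class name (case-insensitive)."""
    key = name.strip().lower()
    if key not in LABEL_TO_INDEX:
        raise ValueError(f"Unknown class name: {name!r}")
    return LABEL_TO_INDEX[key]


def decode_label(index: int) -> str:
    """Return the class name for an integer index."""
    if not 0 <= index < len(CLASSES):
        raise ValueError(f"Index out of range: {index}")
    return CLASSES[index]
```

With a mapping like this, `encode_label("Coughing")` and `decode_label(...)` let you move between the human-readable names and the integer targets that most training code expects.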

More detailed information and statistics are available on our official GitHub.


Here are some audio samples of coughing, sighing, and yawning. Enjoy listening!

With sounds like these, you can build machine learning projects such as tracking someone’s respiratory condition or detecting signs of depression or a sleep disorder. And there are many more possibilities across all 16 types of human nonverbal vocalization.

The detailed information, including more statistics and the audio samples, is ready to use on our official GitHub:
- Nonverbal Vocalization Dataset

Journalist: Hongseok Oh, Deeply Inc.
We give meaning to sound, Deeply Inc.