tinyML Talks: The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset



Mark Mazumder
PhD Student, Harvard University

This talk will present the Multilingual Spoken Words Corpus (MSWC), a speech dataset of more than 340,000 distinct spoken words in 50 languages, totaling over 23 million audio examples. MSWC has many use cases, ranging from voice-enabled consumer devices to call center automation. The dataset is CC-BY licensed and free for both academic research and commercial use. We will introduce applications of MSWC to few-shot keyword spotting and spoken term search in low-resource languages, and share a brief tutorial on getting started with the dataset. We will also discuss how we automated the construction of the dataset and describe our self-supervised approach to detecting outlier samples.

