VentureBeat | MLCommons releases open source datasets for speech recognition

By Kyle Wiggers

MLCommons, the nonprofit consortium dedicated to creating open AI development tools and resources, today announced the release of the People’s Speech Dataset and the Multilingual Spoken Words Corpus. The consortium claims that the People’s Speech Dataset is among the world’s most comprehensive English speech datasets licensed for academic and commercial usage, with tens of thousands of hours of recordings, and that the Multilingual Spoken Words Corpus (MSWC) is one of the largest audio speech datasets with keywords in 50 languages.

No-cost datasets such as TED-LIUM and LibriSpeech have long been available for developers to train, test, and benchmark speech recognition systems. But some, like Fisher and Switchboard, require licensing or relatively high one-time payments. This puts even well-resourced organizations at a disadvantage compared with tech giants such as Google, Apple, and Amazon, which can gather large amounts of training data through devices like smartphones and smart speakers. For example, four years ago, when researchers at Mozilla began developing the English-language speech recognition system DeepSpeech, the team had to reach out to TV and radio stations and language departments at universities to supplement the public speech data that they were able to find.

With the release of the People’s Speech Dataset and the MSWC, the hope is that more developers will be able to build their own speech recognition systems with fewer budgetary and logistical constraints than before, according to Keith Achorn. Achorn, a machine learning engineer at Intel, is one of the researchers who have overseen the curation of the People’s Speech Dataset and the MSWC over the past several years.

“Modern machine learning models rely on vast quantities of data to train. Both ‘The People’s Speech’ and ‘MSWC’ are among the largest datasets in their respective classes. MSWC is of particular interest for its inclusion of 50 languages,” Achorn told VentureBeat via email. “In our research, most of these 50 languages had no keyword-spotting speech datasets publicly available until now, and even those which did had very limited vocabularies.”
