siliconANGLE | MLCommons releases open-source datasets for training speech recognition models

December 14, 2021 Liz Bazini

By Mike Wheatley

The MLCommons Association, a nonprofit consortium that aims to improve machine learning for the public good, today announced the release of two key new datasets that it says can be leveraged by organizations to develop superior artificial intelligence models.

The consortium said the People’s Speech Dataset is one of the world’s most comprehensive collections of English-language speech that is licensed for academic and commercial use. Meanwhile, the Multilingual Spoken Words Corpus is said to be one of the largest audio speech datasets in the world, with keywords spoken in 50 languages.

MLCommons is trying to level the playing field in AI development. It notes that smaller organizations are at a distinct disadvantage when trying to develop models for speech recognition, because the most comprehensive datasets available have always carried high licensing costs. In addition, tech giants such as Google LLC and Apple Inc. can gather large amounts of free training data through devices such as smartphones.

MLCommons points out that when researchers from the Mozilla Foundation began developing their DeepSpeech English-language speech recognition tool, they were forced to reach out to TV and radio stations to acquire enough public speech data to train it.

The People’s Speech Dataset is meant to remedy that problem. It provides more than 30,000 hours of supervised conversational audio released under a Creative Commons license, meaning it can be used to create voice recognition models that power voice assistants and transcription software.

As for the Multilingual Spoken Words Corpus, or MSWC, it has more than 340,000 keywords with upwards of 23.4 million examples, spanning languages spoken by more than 5 billion people. MLCommons said it can be used to train machine learning models for applications such as call centers and smart devices.

Constellation Research Inc. analyst Holger Mueller said MLCommons’ datasets will be welcomed by a developer community that struggles to obtain the high-quality training data it needs to build effective AI models in speech recognition. Speech data, he said, is very hard to capture due to matters around privacy and consent.

“A standardized dataset also opens things up for performance benchmarks as well, so we will see what these two datasets can do to improve the quality of AI models,” Mueller said. “Nothing improves AI quality more than competitions based on standardized datasets.”

Both of the datasets come with permissive licensing terms, including commercial fair use, which is not allowed with many other speech training libraries.

Keith Achorn, a machine learning engineer at Intel Corp. who helped oversee the curation of the datasets, said the hope is that they will help more developers build speech recognition systems without budgetary constraints.