MLCommons And Hugging Face Launch Huge Speech Dataset With More Than A Million Hours Of Audio

An extensive multilingual speech dataset from MLCommons and Hugging Face offers over one million hours of audio, setting a new standard for AI-driven speech innovation.

MLCommons, in partnership with Hugging Face, has released an extensive collection of over one million hours of public domain voice recordings spanning at least 89 languages.

The dataset, titled Unsupervised People’s Speech, was compiled from audio files on Archive.org and is designed to advance research in speech recognition, voice synthesis, and language modeling.

In the official announcement, the organization explained, “Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,” and added,

“We anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis.”

These statements establish the project’s objective. MLCommons also notes that, because of where the recordings were sourced, the dataset predominantly features American-accented English, a factor that may affect model performance when processing other dialects.

Achievements and Challenges

The Unsupervised People’s Speech project addressed significant technical obstacles in managing and processing a vast volume of data.

The MLCommons team engineered custom scripts and employed a Git Large File Storage (Git LFS)–backed upload process to transfer over 48 terabytes of data to cloud storage efficiently. Git LFS replaces large files with text pointers, allowing efficient version control for high-volume assets.
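MLCommons has not published the upload scripts themselves, but a minimal sketch of how a single multi-gigabyte shard could be pushed to a Git LFS–backed Hugging Face dataset repository might look like the following; the repository ID and shard name here are placeholders, not MLCommons’ actual setup.

```python
# Minimal sketch: uploading one large tar shard to a Git LFS-backed
# Hugging Face dataset repo. Repo ID and file names are hypothetical.
from huggingface_hub import HfApi

api = HfApi()

# Large binary files (e.g. multi-GB tar archives) are stored via Git LFS,
# which keeps a small text pointer in Git and the payload in LFS storage.
api.upload_file(
    path_or_fileobj="audio/shard_00001.tar",          # hypothetical local shard
    path_in_repo="audio/shard_00001.tar",
    repo_id="your-org/unsupervised-peoples-speech",   # hypothetical repo ID
    repo_type="dataset",
)
```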

Advanced data pipelines that integrate Silero’s Voice Activity Detection (VAD) and Nvidia’s adaptation of OpenAI’s Whisper model were implemented to extract approximately 821,412 hours of clear speech. Voice Activity Detection is a method that identifies segments containing human speech, filtering out silence and background noise to optimize data processing.
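The full pipeline is more involved, but a rough sketch of the voice activity detection step, using the openly available Silero VAD model to keep only the speech segments of a recording, could look like this; the input file name is illustrative.

```python
# Rough sketch: using Silero VAD to keep only segments that contain speech.
# The input file is illustrative; this is not the MLCommons pipeline itself.
import torch

# Load the pretrained Silero VAD model and its helper utilities from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

SAMPLE_RATE = 16000  # Silero VAD expects 8 kHz or 16 kHz input

# Read the audio (resampled to 16 kHz) and detect the regions containing speech.
wav = read_audio("example_recording.wav", sampling_rate=SAMPLE_RATE)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)

# Concatenate the detected speech chunks and write out a cleaned file.
speech_only = collect_chunks(speech_timestamps, wav)
save_audio("example_recording_speech_only.wav", speech_only, sampling_rate=SAMPLE_RATE)

total_speech_sec = sum(t["end"] - t["start"] for t in speech_timestamps) / SAMPLE_RATE
print(f"Kept {total_speech_sec:.1f} seconds of speech")
```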

These refined techniques illustrate the rigorous processing required to transform raw, user-uploaded data into a structured resource.

Despite these successes, reliance on uncurated uploads introduces challenges such as inherent data bias and potential licensing discrepancies, a concern also noted in an MIT analysis of dataset transparency.

Technical Details of The Unsupervised People’s Speech Dataset

The accompanying dataset card on Hugging Face outlines a robust file organization that enhances reproducibility and legal compliance. Audio files are stored in tar archives—each averaging about 5GB—and organized into two directories (“audio” and “audio2”).

A licenses.jsonl file accompanies the dataset to document the licensing terms (CC-BY and CC-BY-SA, with the dataset licensed under Creative Commons BY-SA 4.0) for each audio clip.

Most recordings last between 1 and 10 minutes, with only 14 files exceeding 100 hours, and 99% of the audio is sampled at 44.1kHz while the remaining files use alternative sample rates such as 16kHz, 24kHz, or 48kHz.
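As a rough illustration of working with that layout, the snippet below reads the licensing metadata and inspects one downloaded shard; the local file names and JSON field names are assumptions rather than the published schema, which the dataset card documents.

```python
# Sketch of working with the published layout: a licenses.jsonl file mapping
# each clip to its license terms, plus ~5 GB tar shards under "audio"/"audio2".
# Local file names and JSON field names here are assumptions.
import json
import tarfile

# Map each audio file to its license terms (CC-BY or CC-BY-SA per clip).
licenses = {}
with open("licenses.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Field names are an assumption; check the dataset card for the exact schema.
        licenses[record.get("file")] = record.get("license")

# List the members of one downloaded shard without extracting it fully.
with tarfile.open("audio/000000.tar") as tar:
    for member in tar.getmembers()[:10]:
        print(member.name, licenses.get(member.name, "license not found"))
```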

To maximize the utility of the dataset, MLCommons has provided a training pipeline designed to facilitate self-supervised learning using models like Wav2Vec2.

This approach employs techniques where segments of the audio are masked and the model is trained using contrastive loss to learn robust latent representations.

Self-supervised learning allows models to identify patterns in raw, unlabeled data, reducing the need for extensive manual annotations—a crucial advantage for low-resource languages. For those seeking further technical details, the Transformers documentation for Wav2Vec2 offers comprehensive guidance. The availability of this training pipeline reinforces the dataset’s potential to drive advances in speech recognition technology and facilitate fine-tuning across diverse linguistic settings.
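As a simplified illustration of that masking-and-contrastive-loss setup (not the MLCommons pipeline itself), the Transformers library’s Wav2Vec2ForPreTraining class can be exercised on a placeholder batch of audio roughly as follows.

```python
# Simplified sketch of Wav2Vec2 self-supervised pretraining with Hugging Face
# Transformers: spans of the latent features are masked and the model is
# trained with a contrastive loss over sampled negatives.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")

# Placeholder batch: one second of silence at 16 kHz stands in for real audio.
raw_audio = [np.zeros(16000, dtype=np.float32)]
inputs = feature_extractor(raw_audio, sampling_rate=16000, return_tensors="pt")

batch_size, raw_length = inputs.input_values.shape
seq_length = model._get_feat_extract_output_lengths(raw_length).item()

# Randomly mask spans of the latent feature sequence and sample negative
# examples for the contrastive objective.
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_length), mask_prob=0.65, mask_length=10
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_length),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)

outputs = model(
    inputs.input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(sampled_negative_indices, dtype=torch.long),
)
print("pretraining loss (contrastive + diversity):", outputs.loss.item())
```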

Ethical Considerations and Community Engagement

The dataset’s reliance on publicly available, user-uploaded content raises important ethical and licensing concerns. Ed Newton-Rex, the CEO of Fairly Trained, a non-profit certifying generative AI companies for fairer training data practices, highlighted these challenges last year, stating,

“Creators should not be required to opt out of gen AI training. Many creators (e.g. Squarespace users) have no meaningful way of opting out. For creators who ‘can’ opt out, there are multiple overlapping opt-out methods, which are (i) incredibly confusing and (ii) woefully incomplete in their coverage.

Even if a perfect universal opt-out existed (nowhere close), it would be hugely unfair to put the opt out burden on creators, given that gen AI uses their work to compete with them – many would simply not realise they could opt out. And, of course, a lack of transparency / audit requirements means AI companies can simply ignore opt-outs.”

This ethical perspective is critical in understanding the broader implications of using such datasets.

What’s Next?

MLCommons invites collaboration from researchers worldwide, including experts fluent in over 130 languages, to contribute to ongoing benchmarks and validation efforts. This community-driven approach promises to enhance dataset diversity and improve model performance over time.

Looking ahead, the Unsupervised People’s Speech dataset is positioned to accelerate progress in unsupervised speech representation learning and robust model development.

The integration of advanced training pipelines and community collaboration lays a foundation for refining models to capture linguistic diversity and cultural nuances more effectively.

While current challenges—such as data bias and ethical considerations—remain, the dynamic nature of this initiative offers a pathway to continuous improvement. Future iterations may incorporate enhanced preprocessing methods, more comprehensive licensing audits, and adaptive training strategies that combine adversarial and naturally collected data.

