Meta Platforms has introduced Audiobox, a new research model for voice replication and audio generation. Described as a foundation model for audio generation, Audiobox can imitate a person's vocal traits and create sound effects from natural language prompts. Developed by the Facebook AI Research (FAIR) lab, the model builds on the lab's prior project, Voicebox.
Innovative Technology for Voice Cloning
Produced by Meta's FAIR lab, Audiobox represents a notable step in voice cloning technology. It generates lifelike voices and soundscapes from input voices and text prompts. After recording a sample of their voice, users can type sentences they wish to hear, and Audiobox speaks them in the cloned voice. New voice styles can also be generated simply by describing the desired vocal characteristics in text.
The initiative reflects an ongoing interest in AI-generated sound across the industry, with companies like ElevenLabs securing significant investment for their work in the sector. Audiobox stands out for its self-supervised learning (SSL) foundation, a technique in which a model learns from unlabeled audio rather than relying on human-provided labels.
Meta's Approach to Self-Supervised Learning
The SSL model underpinning Audiobox avoids the need for labeled data, such as transcripts or captions, by learning from vast amounts of unlabeled audio instead. The FAIR team trained the model on more than 160,000 hours of speech, primarily in English, spanning a broad range of recordings, including audiobooks, podcasts, and in-the-wild captures. The speech database draws from over 150 countries and encompasses over 200 different languages, with the aim of making the generated outputs more inclusive and representative.
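To make the idea of learning without labels concrete, here is a minimal sketch of one common form of self-supervision: hiding part of a recording and treating the hidden part as the prediction target. This is an illustration only, not Meta's implementation; the function name `make_infilling_example`, the `mask_fraction` parameter, and the use of zeros to mark the hidden span are all assumptions made for this example.

```python
import random


def make_infilling_example(waveform, mask_fraction=0.3, rng=None):
    """Build one self-supervised training pair from an unlabeled signal.

    Hypothetical helper: hides one contiguous span of the samples and
    returns (masked_input, (start, end), target). No transcript or
    caption is needed, because the target comes from the audio itself.
    """
    if not waveform:
        raise ValueError("waveform must contain at least one sample")
    rng = rng or random.Random()
    n = len(waveform)
    # Length of the hidden span, clamped to the range [1, n].
    span_len = min(n, max(1, int(n * mask_fraction)))
    start = rng.randrange(0, n - span_len + 1)
    end = start + span_len

    masked = list(waveform)
    masked[start:end] = [0.0] * span_len   # hide the span (assumed marker: zeros)
    target = list(waveform[start:end])     # what a model would learn to reconstruct
    return masked, (start, end), target


# Usage: a toy "recording" of ten samples stands in for real audio.
samples = [float(i) for i in range(1, 11)]
masked, (start, end), target = make_infilling_example(samples, 0.3, random.Random(0))
```

A model trained on many such pairs is pushed to capture the structure of speech and sound in order to fill the gaps, which is why no human annotation is required.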
Despite the breadth of the dataset, its provenance remains a critical point of consideration, especially as issues around consent and copyright have led to litigation against AI companies over unauthorized use of training material. Meta has been contacted for clarification on this point, and this article will be updated with any response.
Current Limitations and Future Prospects
Audiobox's launch includes a variety of interactive demos that showcase the technology's current capabilities. Users are invited to record and clone their voices, generate new voice styles, and even generate sound effects, such as dogs barking. These demos come with a disclaimer, however: they are not for commercial use and are unavailable to residents of Illinois or Texas due to specific state laws.
Unlike previous AI tools from Meta, Audiobox is not an open-source offering, and Meta has not yet responded to inquiries about whether it may be released as open source. While use cases are limited by the current restrictions, the rapid progression of AI technology suggests that a commercial version may become available in the foreseeable future, either from Meta or from other companies in the space.