ElevenLabs, a leading developer of AI voice technology, today announced the launch of its AI Speech Classifier, a first-of-its-kind verification mechanism that lets users upload any audio sample to identify if it contains AI-generated audio.
The AI Speech Classifier is a critical step forward in ElevenLabs' mission to develop efficient tracking for AI-generated media.
The AI Speech Classifier is available now to all of the company's users. To use the tool, simply upload an audio sample on the ElevenLabs website. The AI Speech Classifier will then analyze the sample and return a verdict on whether it contains ElevenLabs AI-generated audio.
ElevenLabs says the AI Speech Classifier is up to 99% accurate when analyzing a single, unmodified audio sample, because the algorithm scans the characteristics of that one sample. However, the company admits that codec or reverb transformations reduce the accuracy, although it remains above 90%. Accuracy drops further the more the content is post-processed, for example by layering additional audio tracks.
As it develops the AI Speech Classifier, ElevenLabs says it is committed to producing “safe tools that can create remarkable content. We believe that our status as an organization gives us the ability to build and enforce the safeguards which are often lacking in open source models.”
Understanding AI Speech Recognition
AI speech recognition is the task of assigning labels to audio signals based on their content, such as the words spoken, the speaker's identity, the emotion conveyed or the intent expressed. It is a subfield of audio understanding and natural language processing with many applications in domains such as customer service, health care, education and entertainment.
Steps involved in classifying speech with AI
- Preprocessing the audio signals to extract relevant features, such as spectrograms, that represent the frequency and intensity of the sound over time.
- Building and training a machine learning model, such as a deep neural network, that can learn from the features and labels of a large dataset of audio signals.
- Evaluating the model's performance on unseen audio signals and improving it by tuning the parameters or using different architectures.
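
The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not ElevenLabs' method: the "audio" is synthetic (noisy low-pitched versus high-pitched tones), the features are a time-averaged log spectrogram, and a plain logistic regression stands in for the deep neural network. All function names, sample rates and hyperparameters here are assumptions chosen for the example.

```python
import cmath
import math
import random


def spectrogram(signal, frame_size=64, hop=32):
    """Step 1: magnitude spectrogram via a naive DFT on Hann-windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    bins = frame_size // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(bins)
        ])
    return frames  # frames[t][k]: intensity of frequency bin k at time step t


def features(signal):
    """Collapse the spectrogram to one log-scaled average per frequency bin."""
    spec = spectrogram(signal)
    return [math.log1p(sum(f[k] for f in spec) / len(spec))
            for k in range(len(spec[0]))]


def make_clip(label, rng, length=256, rate=8000):
    """Hypothetical stand-in for labeled audio: class 0 is low-pitched, class 1 high-pitched."""
    freq = rng.uniform(300, 600) if label == 0 else rng.uniform(2000, 3000)
    return [math.sin(2 * math.pi * freq * n / rate) + rng.gauss(0, 0.3)
            for n in range(length)]


def predict_proba(weights, bias, x):
    z = bias + sum(w * v for w, v in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well within range
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=200, lr=0.05):
    """Step 2: fit a logistic regression with stochastic gradient descent."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, samples, labels):
    """Step 3: share of unseen clips that are classified correctly."""
    hits = sum((predict_proba(weights, bias, x) >= 0.5) == bool(y)
               for x, y in zip(samples, labels))
    return hits / len(labels)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_labels = [i % 2 for i in range(40)]
test_labels = [i % 2 for i in range(20)]
train_x = [features(make_clip(y, rng)) for y in train_labels]
test_x = [features(make_clip(y, rng)) for y in test_labels]
weights, bias = train(train_x, train_labels)
print("held-out accuracy:", accuracy(weights, bias, test_x, test_labels))
```

A real classifier would replace the synthetic clips with a large labeled dataset and the logistic regression with a deeper model, but the flow of extracting features, training, and evaluating on held-out audio stays the same.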
How AI Speech Classification will Impact Services Like Microsoft's VALL-E
Earlier this year, Microsoft introduced its new VALL-E model, which could transform AI speech accuracy and quality. VALL-E is essentially an AI synthesis tool that aims to remove the uncanny valley from AI speech while also being able to mimic human speech. Microsoft used 60,000 hours of English speech data to train the AI and has presented the results in a research paper published on arXiv, the preprint server hosted by Cornell University.
A demonstration of VALL-E on GitHub shows audio samples that range in quality from unnatural to near perfect. The system also needs very little input to produce convincing results and may one day be able to learn how to mimic human voices. As AI becomes more sophisticated, tools like the AI Speech Classifier will become increasingly important for telling users whether content is AI-generated.
This extends to all branches of generative AI. Microsoft this week started putting watermarks on images created by its Bing Image Creator model. These images are very realistic and hard to recognize as artificial. The watermark also acknowledges Microsoft's role in creating them. Microsoft announced this watermarking feature at its Build 2023 event last month.