Meta AI's New Spirit LM Model Can Detect Emotions in Speech – Here's How It Works

Spirit LM by Meta brings emotional intelligence to AI, enabling machines to detect and replicate human emotions like happiness and anger in speech.

Meta’s latest AI model, Spirit LM, aims to change how computers talk by making them sound less like robots. Created by Meta’s Fundamental AI Research (FAIR) team, this new release brings text and speech together into one model. What sets Spirit LM apart is its ability to understand and express human emotions, like happiness or anger, making AI conversations feel more natural.

The release, available to researchers under a non-commercial license, could push AI-powered customer service and virtual assistants into a whole new realm. Meta's been pushing for better speech models, and Spirit LM might be the closest we've come to having AI that sounds genuinely human. Alongside the launch, Meta has also published a paper on the technology underpinning the model.

Two Versions of Spirit LM Available

Meta’s latest model comes in two flavors. Spirit LM Base focuses on the fundamentals of speech processing using phonetic tokens to recognize and generate speech. For something with more emotional depth, there’s Spirit LM Expressive, which adds layers of pitch and tone recognition. The second version can recognize whether the person speaking is excited or sad and mirror that in its speech.

What makes Spirit LM stand out is how it blends text and speech in a single token stream, unlike older pipelines that transcribe speech into text, process it, and then convert the response back into speech. This could make it better at conversations that require a lot of emotional nuance, such as customer service bots that need to empathize with frustrated users. In July, Benjamin Muller, Tu-Anh Nguyen, and Bokai Yu from Meta AI detailed Spirit LM in an interview with Twelve Labs.
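A toy example makes the interleaving idea easier to picture. The sketch below is purely illustrative and does not use Spirit LM's actual API; the token names and the `interleave` helper are hypothetical, but they mirror the approach described in Meta's paper of mixing text tokens with phonetic, pitch, and style tokens in one sequence.

```python
# Purely illustrative sketch of how a speech-text model might interleave
# modalities in a single token stream; names and helpers are hypothetical,
# not Spirit LM's real API.

TEXT = "[TEXT]"      # marker switching the stream to text tokens
SPEECH = "[SPEECH]"  # marker switching the stream to speech tokens

def interleave(text_tokens, speech_tokens):
    """Build one training sequence that alternates between modalities."""
    return [TEXT, *text_tokens, SPEECH, *speech_tokens]

# Hypothetical example: a written prompt followed by expressive speech units.
text_tokens = ["how", "can", "i", "help", "you", "?"]
speech_tokens = [
    "phone_41", "pitch_12", "style_calm",   # phonetic + expressive tokens
    "phone_07", "pitch_15", "style_calm",
]

sequence = interleave(text_tokens, speech_tokens)
print(sequence)
# A language model trained on mixed sequences like this can continue in
# either modality, which is what lets Spirit LM move between text and speech.
```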

Microsoft’s AI Speech Updates in Comparison

Meta isn’t the only one playing around with advanced speech models. Over at Microsoft, updates to its Azure AI Speech platform have also introduced voices that respond to emotional cues. Just a few weeks ago, Microsoft rolled out HD neural voices designed to sound more lifelike, adapting to the context of the conversation in real-time.

Microsoft's new voice models are markedly better at picking up on and responding to the emotional nuances of the text they read. Built on autoregressive transformer architectures, they analyze the sentiment of the input text and adjust their vocal tone to match the emotional context they detect.
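Developers typically steer that behavior through SSML speaking styles. The snippet below is a minimal sketch assuming the Azure Speech SDK for Python (`azure-cognitiveservices-speech`) and a neural voice that supports the `mstts:express-as` element; the key, region, voice, and style used here are placeholder assumptions, not values from the article.

```python
# Minimal sketch: asking an Azure neural voice to speak in an emotional
# style via SSML. Requires the azure-cognitiveservices-speech package;
# the key, region, voice, and style below are placeholder assumptions.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="empathetic">
      I understand how frustrating that must be. Let's sort it out together.
    </mstts:express-as>
  </voice>
</speak>
"""

# Synthesize the SSML; the chosen style shapes the pitch and tone of the output.
result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized with the requested expressive style.")
```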

While Meta’s model brings a unique blend of text and speech, Microsoft’s focus has been on improving how AI voices adapt their tone based on the sentiment of the conversation. In July 2024, Microsoft added new multilingual voices and avatars to Azure AI, aiming to give businesses more realistic voices for things like call centers. Both companies are racing to make AI speech more dynamic, and it’s interesting to see how different their approaches are.

Meta’s Growing Focus on Multimodal AI Models

Meta has been experimenting with AI models that handle more than just text or speech for a while now. Back in June 2024, Meta introduced JASCO, a text-to-music model that lets users generate music based on written descriptions. JASCO, short for Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation, lets users compose and edit music through text prompts.

The company also launched AudioSeal, which helps verify whether a piece of audio was created by AI or a human. The AudioSeal model embeds watermarks into AI-generated speech, making it easier to detect synthetic audio. Spirit LM, with its mix of text and speech, feels like a natural evolution of Meta's earlier projects. It's another step in the company's plan to create AI that handles multiple types of input at once, from text to music, and now, human-like speech.
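For context, AudioSeal is available as an open-source package with a generator/detector pair. The snippet below is a sketch based on the usage documented in the project's repository; the checkpoint names and method signatures are recalled from its README rather than verified here, so treat them as assumptions.

```python
# Sketch of AudioSeal's watermark-and-detect flow, based on usage described
# in Meta's AudioSeal repository; checkpoint names and method signatures
# are assumptions recalled from its README.
import torch
from audioseal import AudioSeal

sample_rate = 16_000
# Stand-in for real speech: one second of audio, shape (batch, channels, samples).
audio = torch.randn(1, 1, sample_rate)

# Embed an imperceptible watermark into the waveform.
generator = AudioSeal.load_generator("audioseal_wm_16bits")
watermark = generator.get_watermark(audio, sample_rate)
watermarked = audio + watermark

# Later, a detector estimates whether a clip carries the watermark.
detector = AudioSeal.load_detector("audioseal_detector_16bits")
probability, message = detector.detect_watermark(watermarked, sample_rate)
print(f"Probability the clip is watermarked (AI-generated): {probability:.2f}")
```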

Applications and Future Potential

Spirit LM’s potential goes far beyond customer service. It could enhance virtual assistants, allowing them to carry on conversations that feel more real, thanks to its ability to pick up on and convey emotions. Imagine a virtual assistant that can detect if you’re frustrated and adjust its tone to be more understanding. This capability can also benefit other sectors like healthcare, where empathetic AI could improve patient interactions.

