Meta’s MoCha AI Animates Characters From Just Voice and Text

Meta has introduced MoCha, an AI system that generates full-body character videos from just voice and text, with no need for reference footage.

Meta, in collaboration with researchers at the University of Waterloo, has introduced an AI system that generates fully animated, speech-synchronized characters without requiring a camera, reference images, or motion capture.

The system, called MoCha, short for “movie-grade character animation,” constructs entire scenes—facial expressions, gestures, and turn-taking dialogue—from just a voice recording and a script. The model was introduced in a research paper published March 30.

MoCha defines a new benchmark task it calls Talking Characters: generating full-body performances from audio and text. The model features a module known as Speech-Video Window Attention, which ensures synchronization between audio and animation by aligning keyframes to speech rhythm. It also uses a joint speech-text training strategy to capture emotion and character context across multiple speakers in a scene.
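
The paper does not ship reference code, but the intuition behind windowed speech-video attention can be illustrated briefly: each video frame attends only to the speech tokens near its position in time, which keeps lip and gesture motion locked to the local speech rhythm. The sketch below is a minimal illustration in PyTorch; the window size, the linear alignment rule, and the tensor shapes are assumptions for demonstration, not MoCha's actual implementation.

```python
# Illustrative windowed speech-to-video cross-attention, loosely inspired by the
# paper's Speech-Video Window Attention. Shapes, window size, and the alignment
# rule are placeholder assumptions, not the published method.
import torch
import torch.nn.functional as F

def windowed_cross_attention(video_q, speech_k, speech_v, window=4):
    """video_q: (T_v, d) frame queries; speech_k, speech_v: (T_a, d) speech tokens."""
    T_v, d = video_q.shape
    T_a = speech_k.shape[0]

    # Align each video frame with its nearest speech-token index (simple linear mapping).
    centers = torch.round(torch.arange(T_v) * (T_a - 1) / max(T_v - 1, 1)).long()

    # Each frame may attend only to speech tokens within +/- `window` of its center.
    idx = torch.arange(T_a).unsqueeze(0)                   # (1, T_a)
    blocked = (idx - centers.unsqueeze(1)).abs() > window  # (T_v, T_a), True = masked out

    scores = video_q @ speech_k.T / d ** 0.5
    scores = scores.masked_fill(blocked, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ speech_v                              # (T_v, d) audio-conditioned features

# Example: 24 video frames attending to 96 speech tokens, feature width 64.
out = windowed_cross_attention(torch.randn(24, 64), torch.randn(96, 64), torch.randn(96, 64))
```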

MoCha is designed for narrative flow rather than isolated speaking clips. Its multi-character support enables back-and-forth conversations, where each character’s timing and gesture are informed by turn-taking logic. To evaluate its performance, the team developed MoCha-Bench, a benchmark suite testing sync accuracy, expressive motion, and emotional fidelity.

The model outperforms previous methods such as EMO and Hallo-3 across metrics like Sync-C (sync confidence), FID (Fréchet Inception Distance), and emotional classification accuracy.
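
Sync-C and the emotion metrics are specific to this evaluation setup, but FID is a standard statistic: it compares the mean and covariance of feature embeddings extracted from real and generated frames, with lower scores indicating closer distributions. The snippet below is a generic reference sketch of that computation, assuming `real_feats` and `gen_feats` are precomputed (N, D) embedding arrays; it is not the paper's evaluation code.

```python
# Generic Fréchet Inception Distance (FID) computation over precomputed embeddings.
# Feature extraction (e.g., an Inception-style network) is assumed and omitted.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```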

Prompt-Based Storytelling, No Reference Input Needed

Where models like ByteDance’s OmniHuman-1 use a reference image, pose data, and audio to generate animation, MoCha skips visual inputs altogether. OmniHuman-1, launched February 4, applies a Diffusion Transformer and pose-guided animation system.

It combines audio with pose heatmaps and a 3D Variational Autoencoder (VAE), offering fine-grained gesture control. The system was trained on over 19,000 hours of video and applies classifier-free guidance to improve realism and diversity.
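
Classifier-free guidance is a general diffusion technique rather than anything unique to OmniHuman-1: the model is run once with its conditioning signal and once without, and the two predictions are blended to push samples toward the conditioned direction. The sketch below shows only that blending step; `model`, its call signature, and the guidance scale are assumptions for illustration, not ByteDance's code.

```python
# Minimal sketch of classifier-free guidance at a single diffusion denoising step.
# `model(x_t, t, cond)` is a hypothetical interface returning a noise prediction.
import torch

def cfg_denoise(model, x_t, t, cond, guidance_scale=3.0):
    eps_cond = model(x_t, t, cond)    # prediction with audio/pose conditioning
    eps_uncond = model(x_t, t, None)  # prediction with conditioning dropped
    # Extrapolate away from the unconditional prediction toward the conditioned one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```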

MoCha, in contrast, uses a fully generative pipeline. It handles both body and facial motion using only speech and text conditioning, with no external visual anchors. This reference-free design removes the need for complex camera setups or detailed motion scripting, offering creators a streamlined path to synthetic storytelling. The model also features non-autoregressive decoding, improving efficiency by predicting full motion frames in parallel instead of one step at a time.
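
The difference between step-by-step and parallel decoding can be shown schematically. In the sketch below, `predict_next` and `predict_all` are hypothetical stand-ins rather than MoCha's interfaces; the point is only that the non-autoregressive path emits every frame in a single forward pass instead of looping over previous outputs.

```python
# Schematic contrast between autoregressive and non-autoregressive frame decoding.
import torch

def decode_autoregressive(predict_next, start, num_frames):
    """Generate frames one at a time; each step waits on all previous frames."""
    frames = [start]
    for _ in range(num_frames - 1):
        frames.append(predict_next(torch.stack(frames)))
    return torch.stack(frames)

def decode_parallel(predict_all, conditioning, num_frames):
    """Non-autoregressive: every frame is predicted in one parallel forward pass."""
    return predict_all(conditioning, num_frames)
```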

Notably, the MoCha paper does not disclose the size of its training data, unlike OmniHuman’s extensive dataset. This leaves questions about its generalization capacity, though performance benchmarks suggest high-quality results even with unseen data.

Facial Realism via Smartphones: Runway’s Alternate Route

While MoCha constructs entire scenes from scratch, other systems are betting on creator-driven realism. In October 2024, Runway released Act-One, a feature that allows users to record their own facial expressions using a smartphone, then map those performances onto animated characters. This bypasses traditional motion capture and is integrated into Runway’s video generation models.

Act-One supports a variety of animation styles and allows creators to animate micro-expressions, eye movements, and emotional subtleties without professional gear. However, it assumes the user is willing to perform the scene. MoCha requires no performance. It generates expression and movement from text prompts alone.

This distinction matters. Runway’s tools are optimized for creative control and realism rooted in physical inputs. MoCha automates performance, creating characters that can act out scripts independently. It’s especially suited for narrative-heavy content like explainer videos, digital dialogue scenes, and voice-driven storytelling where camera setups are impractical.

Positioning MoCha in the AI Video Landscape

On March 31—just one day after the MoCha paper was released—Runway introduced its Gen-4 model, expanding its cinematic control tools. Gen-4 supports scene-level prompting, dynamic camera paths, lighting control, and real-time feedback for visual edits. These features allow creators to build scenes with more precision, but they also raise hardware demands for high-resolution rendering.

Gen-4 streamlines how users coordinate different scene components and merges prior tools like Act-One into a single workflow. For creators aiming to replicate studio-level cinematography, Gen-4 offers detailed visual control—but requires GPU power to match. MoCha, in contrast, prioritizes low-friction creation. It doesn’t offer camera tuning or lighting, but delivers narrative cohesion without extensive prompt engineering.

Other players are also building video tools at scale. OpenAI launched Sora in December 2024, bringing text-to-video generation to ChatGPT subscribers. Google followed with Veo 2, which adds 4K resolution and invisible watermarks. In February 2025, Alibaba launched Wan 2.1, an open-source video model designed to increase accessibility for developers and smaller studios.

MoCha distinguishes itself by focusing on performance and dialogue. Rather than building environments or cinematic polish, it concentrates on character behavior, delivery, and emotional expression—all from a script and voice.

MoCha’s Role in Meta’s Wider AI Strategy

MoCha’s development reflects Meta’s expanding focus on generative content tools. In September 2024, the company introduced an AI dubbing tool that automatically translates videos while preserving the speaker’s original voice and keeping lip movements in sync across languages.

Meta is also exploring the integration of AI-generated personas on its social platforms. These virtual profiles could post content, interact with users, and simulate influencer activity. The idea is to populate platforms with AI-driven characters that blur the line between entertainment and user engagement.

Meanwhile, leadership is shifting. Joelle Pineau, Meta’s head of AI research and a key figure behind open-source models like LLaMA, will step down at the end of May. During her tenure, Meta advanced generative AI for both research and commercial use, including models now powering Meta AI features across platforms.

Despite MoCha’s public release as a research paper, the team has not announced whether the model will become openly available or integrated into Meta’s consumer-facing tools. For now, it stands as a prototype of what script-based character animation could look like in the near future—fully generated performances, no actors or cameras involved.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.