Better Fake, Better Make

Stylized Animation via Generative Data

arXiv 2026
¹Google, ²University of Rochester

[PLEASE TURN ON AUDIO]

Our system generates co-speech 3D facial animation and head motion under multi-modal conditions, such as personalized style codes and text embeddings. At its core is a face fly-wheel engine that synthesizes lip-synced facial videos from predefined prompts spanning diverse emotions and head poses, obviating costly lab capture. We will release about 60 hours of synthesized footage with corresponding text prompts to support training and evaluation (shown in the background).

Overview of the Synthesized Dataset [Selected]

S = gentleman | E = frown | I = Nan

S = female | E = sad | I = medium

S = young woman | E = smile | I = vivid

S = little boy | E = cry | I = Nan

S = female | E = sad | I = medium

S = young woman | E = smile | I = vivid

S = woman | E = angry | I = strong

S = man | E = fearful | I = medium

S = male professor | E = sad | I = subtle

S = young lady | E = singing | I = Nan

S = lady | E = happy | I = strong

S = woman | E = angry | I = intense



Overview of Generated Results

[We use different speech (and music) inputs and different prompts for style control]

[We use the same speech and different prompts for style control]

[We use the same prompt and different speech inputs for animation]

Abstract

Multi-modality guidance for 3D facial animation has drawn growing interest, yet progress is constrained by the scarcity of high-quality, richly annotated, and style-balanced facial video datasets. As a result, prior methods often trade off realism against flexible conditioning. We tackle these issues with a distillation-based pipeline that leverages generative world models. Our approach has two components: a data engine and a style generator. The data engine constructs prompts balanced across facial emotions and head motions and synthesizes about 60 hours of facial videos with multiple foundation text-to-video (T2V) models, exploiting their powerful generative capabilities. The style generator learns multi-modal style embeddings (person-specific or language-specific) that are aligned with the text embeddings of the constructed prompts and uses them as conditioning signals for a diffusion model that produces stylized 3D facial animation with audio-lip sync. Extensive experiments show that our method delivers high-fidelity, controllable, and style-adaptive facial animation, substantially expanding expressiveness while retaining precise conditional control through the distillation of generative models.
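For concreteness, the sketch below illustrates one way the style generator's alignment objective could look: each learned style embedding is pulled toward the text embedding of the prompt its clip was synthesized from. The module names and the CLIP-style contrastive loss are our own illustrative assumptions, not the exact formulation in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    # Maps pooled per-clip features (e.g., reconstructed motion statistics)
    # to a unit-norm style embedding. Dimensions are placeholders.
    def __init__(self, in_dim: int, style_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, style_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def style_text_alignment_loss(style_emb, text_emb, temperature: float = 0.07):
    # Symmetric contrastive loss: each style code matches the text embedding
    # of its own prompt and is repelled from the other prompts in the batch.
    text_emb = F.normalize(text_emb, dim=-1)
    logits = style_emb @ text_emb.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(style_emb.size(0), device=style_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# The aligned style embedding is then passed, together with audio features,
# as a conditioning signal to the animation diffusion model.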

Dataset

Upper part: dataset construction overview. Top three rows: visualizations stratified by prompt attributes (subject, intensity, emotion), corresponding to Stages I and II in Section 3.1. Row 4: 3D face reconstruction from video (Stage IV). Row 5: 3D facial motion augmentation via pitch (nod) and yaw (shake); the prompt metadata is updated with nod or shake accordingly (Stage V). Bottom part: overview of the prompt meta-attribute distribution. We meta-segment the prompts so that each prompt is composed of three attributes (Subject: S, Emotion: E, and Intensity: I), and under each attribute we include as many descriptions as possible. We provide more prompts for visualization here.
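As a concrete illustration of the (Subject, Emotion, Intensity) meta-attributes, the snippet below sketches how such a record could be expanded into a T2V prompt and how the attribute grid keeps the dataset balanced. The sentence template, attribute vocabularies, and the nod/shake tags are illustrative assumptions, not the exact prompts we release.

import itertools
import random
from dataclasses import dataclass

@dataclass
class PromptMeta:
    subject: str               # S, e.g. "young woman", "male professor"
    emotion: str               # E, e.g. "smile", "sad", "angry"
    intensity: str             # I, e.g. "subtle", "medium", "strong" ("NaN" if unspecified)
    head_motion: str = "none"  # updated to "nod" or "shake" after the Stage V augmentation

    def to_text(self) -> str:
        # Expand the meta-attributes into a natural-language prompt for the T2V model.
        intensity = "" if self.intensity in ("NaN", "") else f"{self.intensity} "
        motion = {"nod": ", nodding the head", "shake": ", shaking the head"}.get(self.head_motion, "")
        return (f"A close-up video of a {self.subject} talking to the camera "
                f"with a {intensity}{self.emotion} expression{motion}.")

# Enumerating the full attribute grid keeps emotions and intensities evenly
# represented before the videos are synthesized by the T2V models.
subjects    = ["young woman", "little boy", "male professor", "lady"]
emotions    = ["smile", "sad", "angry", "fearful"]
intensities = ["subtle", "medium", "strong"]

prompts = [PromptMeta(s, e, i) for s, e, i in itertools.product(subjects, emotions, intensities)]
print(random.choice(prompts).to_text())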


Generated-Results Overview




Comparison with Baselines

We provide a visual comparison against baseline methods, with all results reported in the main paper. EmoTalk exhibits the weakest audio–lip synchronization, likely due to limited training data, whereas our approach delivers the most faithful text-style preservation and the strongest audio-lip synchronization among all compared methods.



Ablation Studies

We present visualizations from a series of ablation studies, including comparisons of models trained on different datasets, analyses of style variation induced by modifying a single word, and ablations over text–audio configurations and model scales. Across these settings, our method consistently exhibits precise audio-text synchronization on our dataset and achieves the most faithful text-style mapping among the evaluated variants.



Limitations

We also visualize the limitations of our framework in the main paper, highlighting our enhancements as well as the remaining constraints: head-motion limits inherited from the underlying T2V model and the imperfect coupling between music and semantic labels.



Application [Follow-Speaking-Style]

[We encode different reference videos into style codes and use them as conditions to obtain speech-driven animations.]

We demonstrate style transfer from reference facial videos by directly conditioning the diffusion model on their VQ-encoded style codes. In this setting, we select reference faces with markedly different speaking styles to drive the animation, while keeping the input audio fixed to enable a clearer, controlled comparison.
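A minimal sketch of this setting is given below: a reference clip is encoded into discrete style codes, which are then supplied alongside fixed audio features as the condition when sampling the diffusion model. All module names (vq_encoder, audio_encoder, diffusion.sample) are hypothetical placeholders for the components described above, not a published API.

import torch

@torch.no_grad()
def animate_with_reference_style(diffusion, vq_encoder, audio_encoder,
                                 reference_video: torch.Tensor,
                                 audio: torch.Tensor) -> torch.Tensor:
    # Borrow the speaking style of a reference clip while keeping the driving audio fixed.
    style_codes = vq_encoder.encode(reference_video)     # quantized codes, (T_ref, code_dim)
    style_cond = style_codes.mean(dim=0, keepdim=True)   # pooled clip-level style condition

    audio_feats = audio_encoder(audio)                   # (T_audio, audio_dim)

    # Sample 3D facial motion conditioned on both the style codes and the audio.
    return diffusion.sample(cond={"style": style_cond, "audio": audio_feats})

# Swapping reference_video while holding audio constant yields the controlled
# comparison shown in the videos above.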


BibTeX

@misc{
      Arxiv 2026 Submission,
      #TBD
}