Better Fake, Better Make

Stylized Animation via Generative Data

arXiv 2026
¹Google, ²University of Rochester

[PLEASE TURN ON AUDIO]

Our system generates co-speech 3D facial animation and head motion under multi-modal conditions, such as personalized style codes and text embeddings. At its core is a face fly-wheel engine that synthesizes lip-synced facial videos from predefined prompts spanning diverse emotions and head poses, obviating costly lab capture. We will release about 60 hours of synthesized footage with corresponding text prompts to support training and evaluation (shown in the background).

Overview of the Synthesized Dataset [Selected]

S = gentleman | E = frown | I = Nan

S = female | E = sad | I = medium

S = young woman | E = smile | I = vivid

S = little boy | E = cry | I = Nan

S = female | E = sad | I = medium

S = young woman | E = smile | I = vivid

S = woman | E = angry | I = strong

S = man | E = fearful | I = medium

S = male professor | E = sad | I = subtle

S = young lady | E = singing | I = Nan

S = lady | E = happy | I = strong

S = woman | E = angry | I = intense



Overview of Generated Results

[We use different speech (and music) inputs and different prompts for style control]

[We use the same speech and different prompts for style control]

[We use the same prompt and different speech inputs for animation]

Abstract

Multi-modality guidance for 3D facial animation has drawn growing interest, yet progress is constrained by the scarcity of high-quality, richly annotated, and style-balanced facial video datasets. As a result, prior methods often trade off realism against flexible conditioning. We tackle these issues with a distillation-based pipeline that leverages generative world models. Our approach has two components: a data engine and a style generator. The data engine constructs prompts balanced across facial emotions and head motions and synthesizes about 60 hours of facial videos with multiple foundation text-to-video (T2V) models, exploiting their powerful generative capabilities. The style generator learns multi-modal style embeddings (person-specific or language-specific) that are aligned with the text embeddings of the constructed prompts and uses them as conditioning signals for a diffusion model that produces stylized 3D facial animation with audio-lip sync. Extensive experiments show that our method delivers high-fidelity, controllable, and style-adaptive facial animation, substantially expanding expressiveness while retaining precise conditional control through the distillation of generative models.
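For concreteness, the sketch below illustrates one way the style generator's alignment objective could look: each learned style embedding is pulled toward the text embedding of the prompt its clip was synthesized from. The module names and the CLIP-style contrastive loss are our own illustrative assumptions, not the exact formulation in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    # Maps pooled per-clip features (e.g., reconstructed motion statistics)
    # to a unit-norm style embedding. Dimensions are placeholders.
    def __init__(self, in_dim: int, style_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, style_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def style_text_alignment_loss(style_emb, text_emb, temperature: float = 0.07):
    # Symmetric contrastive loss: each style code matches the text embedding
    # of its own prompt and is repelled from the other prompts in the batch.
    text_emb = F.normalize(text_emb, dim=-1)
    logits = style_emb @ text_emb.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(style_emb.size(0), device=style_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# The aligned style embedding is then passed, together with audio features,
# as a conditioning signal to the animation diffusion model.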

Dataset

Upper part: dataset construction overview. Top three rows: visualizations stratified by prompt attributes (subject, intensity, emotion), corresponding to Stages I and II in Section 3.1. Row 4: 3D face reconstruction from video (Stage IV). Row 5: 3D facial motion augmentation via pitch (nod) and yaw (shake); the prompt metadata is updated with nod or shake accordingly (Stage V). Bottom part: overview of the prompt meta-attribute distribution. We meta-segment the prompts so that each prompt is composed of three attributes (Subject: S, Emotion: E, and Intensity: I), and under each attribute we include as many descriptions as possible. We provide more prompts for visualization here.
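As a concrete illustration of the (Subject, Emotion, Intensity) meta-attributes, the snippet below sketches how such a record could be expanded into a T2V prompt and how the attribute grid keeps the dataset balanced. The sentence template, attribute vocabularies, and the nod/shake tags are illustrative assumptions, not the exact prompts we release.

import itertools
import random
from dataclasses import dataclass

@dataclass
class PromptMeta:
    subject: str               # S, e.g. "young woman", "male professor"
    emotion: str               # E, e.g. "smile", "sad", "angry"
    intensity: str             # I, e.g. "subtle", "medium", "strong" ("NaN" if unspecified)
    head_motion: str = "none"  # updated to "nod" or "shake" after the Stage V augmentation

    def to_text(self) -> str:
        # Expand the meta-attributes into a natural-language prompt for the T2V model.
        intensity = "" if self.intensity in ("NaN", "") else f"{self.intensity} "
        motion = {"nod": ", nodding the head", "shake": ", shaking the head"}.get(self.head_motion, "")
        return (f"A close-up video of a {self.subject} talking to the camera "
                f"with a {intensity}{self.emotion} expression{motion}.")

# Enumerating the full attribute grid keeps emotions and intensities evenly
# represented before the videos are synthesized by the T2V models.
subjects    = ["young woman", "little boy", "male professor", "lady"]
emotions    = ["smile", "sad", "angry", "fearful"]
intensities = ["subtle", "medium", "strong"]

prompts = [PromptMeta(s, e, i) for s, e, i in itertools.product(subjects, emotions, intensities)]
print(random.choice(prompts).to_text())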


Generated-Results Overview




Comparison with Baselines

We provide a visual comparison against baseline methods, with all results reported in the main paper. EmoTalk exhibits the weakest audio–lip synchronization, likely due to limited training data, whereas our approach delivers the most faithful text-style preservation and the strongest audio-lip synchronization among all compared methods.



Ablation Studies

We present visualizations from a series of ablation studies, including comparisons of models trained on different datasets, analyses of style variation induced by modifying a single word, and ablations over text–audio configurations and model scales. Across these settings, our method consistently exhibits precise audio-text synchronization on our dataset and achieves the most faithful text-style mapping among the evaluated variants.



Limitations

We also visualize the limitations of our framework in the main paper, highlighting our enhancements as well as the remaining constraints: head-motion limits inherited from the underlying T2V model and the imperfect coupling between music and semantic labels.



Application [Follow-Speaking-Style]

[We encode different reference videos into style codes and use them as conditions to obtain speech-driven animations.]

We demonstrate style transfer from reference facial videos by directly conditioning the diffusion model on their VQ-encoded style codes. In this setting, we select reference faces with markedly different speaking styles to drive the animation, while keeping the input audio fixed to enable a clearer, controlled comparison.
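A minimal sketch of this setting is given below: a reference clip is encoded into discrete style codes, which are then supplied alongside fixed audio features as the condition when sampling the diffusion model. All module names (vq_encoder, audio_encoder, diffusion.sample) are hypothetical placeholders for the components described above, not a published API.

import torch

@torch.no_grad()
def animate_with_reference_style(diffusion, vq_encoder, audio_encoder,
                                 reference_video: torch.Tensor,
                                 audio: torch.Tensor) -> torch.Tensor:
    # Borrow the speaking style of a reference clip while keeping the driving audio fixed.
    style_codes = vq_encoder.encode(reference_video)     # quantized codes, (T_ref, code_dim)
    style_cond = style_codes.mean(dim=0, keepdim=True)   # pooled clip-level style condition

    audio_feats = audio_encoder(audio)                   # (T_audio, audio_dim)

    # Sample 3D facial motion conditioned on both the style codes and the audio.
    return diffusion.sample(cond={"style": style_cond, "audio": audio_feats})

# Swapping reference_video while holding audio constant yields the controlled
# comparison shown in the videos above.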


BibTeX

@misc{
      Arxiv 2026 Submission,
      #TBD
}