S = gentleman | E = frown | I = N/A
S = female | E = sad | I = medium
S = young woman | E = smile | I = vivid
S = little boy | E = cry | I = N/A
S = female | E = sad | I = medium
S = young woman | E = smile | I = vivid
S = woman | E = angry | I = strong
S = man | E = fearful | I = medium
S = male professor | E = sad | I = subtle
S = young lady | E = singing | I = N/A
S = lady | E = happy | I = strong
S = woman | E = angry | I = intense
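The labels above are (Subject, Emotion, Intensity) triples drawn from the prompt metadata. As a minimal sketch of how such triples could be composed into emotion-balanced generation prompts, consider the following; the attribute pools, template, and function names are our illustrative assumptions, not the paper's actual ones:

```python
import random

# Hypothetical attribute pools; the paper's actual pools are larger.
SUBJECTS = ["gentleman", "little boy", "woman", "man", "young lady"]
EMOTIONS = ["frown", "sad", "smile", "cry", "angry", "fearful", "happy"]
INTENSITIES = [None, "subtle", "medium", "strong", "vivid", "intense"]  # None -> intensity unspecified (N/A)

def compose_prompt(subject, emotion, intensity):
    """Render one (S, E, I) triple as a natural-language prompt."""
    if intensity is None:
        return f"A {subject} speaks with a {emotion} expression."
    return f"A {subject} speaks with a {intensity} {emotion} expression."

def balanced_prompts(n_per_emotion, seed=0):
    """Sample an equal number of prompts per emotion so the synthesized
    dataset stays balanced across facial expressions."""
    rng = random.Random(seed)
    return [
        compose_prompt(rng.choice(SUBJECTS), emotion, rng.choice(INTENSITIES))
        for emotion in EMOTIONS
        for _ in range(n_per_emotion)
    ]
```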
[We use different speech (and music) clips and different prompts for style control]
A male teacher speaks with lips curling into a well-defined smile between phrases.
A male gamer speaks in short bursts, a strong surprise expression on the face.
A man speaks quietly, tears gathering again, the crying expression plain to see.
A man exhibits a subtle zest expression, strong smile present and cheeks rounded.
A man speaks with a low happy expression, keeping his gaze forward and steady.
A female singer sings with crisp phrasing over a gentle pulse.
A female athlete paired with a subtle surprise look.
A male doctor shares a thought, under a clear disappointed expression.
A teen girl sings with delicate head tone on the high notes.
A female student sings while tapping the meter with fingertips.
A male presenter sings a clipped tag to end the phrase cleanly.
[We use the same speech and different prompts for style control]
A man talks with a strong delighted expression.
A person speaks with a faint cheerful expression.
He speaks with a strong fearful demeanor.
A male presenter talks with a high stressed demeanor.
Her eyebrows lower into a clear sad expression.
A girl continues speaking with a calm expression.
A person is speaking with a clear shock expression.
He has a slight laugh expression on the face.
He is holding a clear jealous expression while speaking.
A woman talks on with eyes wide.
He speaks with measured tempo and natural expression.
A woman is carrying a noticeable unhappy expression.
She speaks with an exuberant smile expression.
She is carrying a noticeable annoyed tone.
He is showing a slight irritated expression.
She speaks with a high-energy resentful expression.
[We use the same prompt and different speech clips for animation]
A person speaks with a low happy expression, keeping his gaze forward and steady.
A person speaks with a low happy expression, keeping his gaze forward and steady.
A person speaks with a low happy expression, keeping his gaze forward and steady.
A person speaks with a low happy expression, keeping his gaze forward and steady.
A person speaks with a low happy expression, keeping his gaze forward and steady.
A person speaks with a low happy expression, keeping his gaze forward and steady.
A person speaks with a low happy expression, keeping his gaze forward and steady.
A person speaks with a low happy expression, keeping his gaze forward and steady.
Multi-modality guidance for 3D facial animation has drawn growing interest, yet progress is constrained by the scarcity of high-quality, richly annotated, and style-balanced facial video datasets. As a result, prior methods often trade off realism against flexible conditioning. We tackle these issues with a distillation-based pipeline that leverages world generative models. Our approach has two components: a data engine and a style generator. In the data engine, we construct prompts balanced across facial emotions and head motions and synthesize about 60 hours of facial videos with multiple foundation Text-to-Video (T2V) models, leveraging their powerful generative capabilities. In the style generator, we learn multi-modality style embeddings (person-specific or language-specific) that are aligned with the text embeddings of the constructed prompts, and we use them as conditioning signals for a diffusion model that produces stylized 3D facial animation with audio-lip sync. Extensive experiments show that our method delivers high-fidelity, controllable, and style-adaptive facial animation, substantially expanding expressiveness while retaining precise conditional control through generative model distillation.
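The abstract does not spell out the objective that aligns style embeddings with text embeddings; a CLIP-style contrastive loss is one plausible reading. Below is a minimal sketch under that assumption; the function name, batching, and temperature are ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def style_text_alignment_loss(style_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: each style embedding is pulled toward the
    text embedding of its own prompt and pushed away from the other
    prompts in the batch. Both inputs are (B, D) tensors."""
    style = F.normalize(style_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = style @ text.t() / temperature                # (B, B) cosine similarities
    targets = torch.arange(style.size(0), device=style.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```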
Top: dataset construction overview. Top three rows: visualizations stratified by prompt attributes (subject, emotion, intensity), corresponding to Stages I and II in Section 3.1. Row 4: 3D face reconstruction from video (Stage IV). Row 5: 3D facial motion augmentation via pitch (nod) and yaw (shake); the prompt metadata is updated with nod or shake accordingly (Stage V). Bottom: prompt meta-attribute distribution overview. We meta-segment the prompts so that each prompt is composed of three attributes (Subject: S, Emotion: E, Intensity: I), and under each attribute we include as many distinct descriptions as possible.
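As a rough sketch of the Stage V augmentation described above, one could overlay a periodic pitch or yaw offset on the reconstructed head pose and tag the prompt metadata accordingly. The Euler-angle representation, the sinusoidal motion, and the amplitude/frequency defaults below are our assumptions for illustration:

```python
import numpy as np

def augment_head_motion(pose, meta, kind="nod", amplitude_deg=10.0, freq_hz=1.0, fps=25):
    """Overlay a sinusoidal nod (pitch) or shake (yaw) on a reconstructed
    head-pose sequence and record it in the prompt metadata.

    pose: (T, 3) Euler angles in degrees, ordered (pitch, yaw, roll).
    """
    t = np.arange(pose.shape[0]) / fps
    offset = amplitude_deg * np.sin(2.0 * np.pi * freq_hz * t)
    axis = {"nod": 0, "shake": 1}[kind]       # nod rotates pitch, shake rotates yaw
    out = pose.copy()
    out[:, axis] += offset
    new_meta = dict(meta, head_motion=kind)   # update prompt metadata with "nod"/"shake"
    return out, new_meta
```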
We provide a visual comparison against baseline methods, with all results reported in the main paper. EmoTalk exhibits the weakest audio-lip synchronization, likely due to its limited training data, whereas our approach delivers the most faithful text-style preservation and the strongest audio-lip synchronization among all compared methods.
We present visualizations from a series of ablation studies, including comparisons of models trained on different datasets, analyses of style variation induced by modifying a single word, and ablations over text–audio configurations and model scales. Across these settings, our method consistently exhibits precise audio-text synchronization on our dataset and achieves the most faithful text-style mapping among the evaluated variants.
We also visualize the limitations of our framework noted in the main paper, highlighting both our improvements and the remaining constraints: restricted head motion (inherited from the underlying T2V models) and the imperfect coupling between music and semantic labels.
[We embed the VQ codes of different reference videos as conditioning signals to obtain speech-driven animations.]
We demonstrate style transfer from reference facial videos by directly conditioning the diffusion model on VQ-encoded codes. In this setting, we select reference faces with markedly different speaking styles to drive the animation, while keeping the input audio fixed to enable a clearer, controlled comparison.
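A minimal sketch of this conditioning path is given below. The names `vq_encoder`, `codebook` (assumed to be an embedding table), and `diffusion_model.sample` are hypothetical stand-ins for the paper's actual components:

```python
import torch

@torch.no_grad()
def encode_style_codes(reference_video, vq_encoder, codebook):
    """Quantize a reference facial video into discrete style codes.
    reference_video: (T, C, H, W) frames -> (T', D) code embeddings."""
    z = vq_encoder(reference_video)            # (T', D) continuous latents
    dists = torch.cdist(z, codebook.weight)    # distance to every codebook entry
    indices = dists.argmin(dim=-1)             # nearest code per latent
    return codebook(indices)                   # quantized style condition

def transfer_style(audio_feats, reference_video, vq_encoder, codebook, diffusion_model):
    """Fixed audio plus video-derived style codes -> stylized animation."""
    style = encode_style_codes(reference_video, vq_encoder, codebook)
    return diffusion_model.sample(audio=audio_feats, style=style)
```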
@misc{arxiv2026submission,
#TBD
}