The prompt attributions for each video are shown above.
The man faces the camera in tight framing, his head filling the frame, and speaks with barely noticeable grief.
A man confronts the camera in a head-filling close-up and speaks with a strong nervous energy.
The lady has a slight smile that forms and lingers, and her expression is calm.
An old female customer support agent with a moderate disgust expression, her head filling the frame.
He keeps looking at the camera with muted contempt in his gaze, then a pronounced happy look.
He centers himself in front of the camera and carries a measured sad look, then an intense angry look.
The man holds clear surprise on his face, then shifts to a marked happy look.
A woman speaks in a warm tone, punctuating her phrases with a confident expression.
We also include some videos with synchronized audio [please turn on the audio].
His expression starts from confusion, then shifts to surprise as he begins to speak.
The man begins with a soft, closed-mouth smile. As he starts to speak, his smile widens slightly.
The young girl has a pleasant and engaged expression with wide, curious eyes and a slightly open mouth.
The woman has a pleasant and engaged expression, smiling warmly. Her head moves subtly and naturally.
Text-guided human body animation has advanced rapidly, yet facial animation lags behind due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt, parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that, in this setting, language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
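To make the quantized motion-token interface concrete, here is a minimal sketch of how per-frame 3D facial parameters might be discretized against a codebook and mapped back for animation. Everything here is an illustrative assumption, not the paper's implementation: the codebook is random (in practice it would be learned, e.g., with a VQ-VAE), and the shapes, sizes, and function names (`quantize`, `dequantize`) are hypothetical.

```python
# Hypothetical sketch: nearest-neighbor quantization of per-frame 3D facial
# parameters into discrete motion tokens, and the inverse lookup. A language
# model could then consume these token ids (Motion2Language) or emit them
# (Language2Motion). Codebook is random here purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 100 frames of 64-dim facial parameters (e.g., expression
# and pose coefficients) and a 512-entry codebook.
num_frames, param_dim, codebook_size = 100, 64, 512
frames = rng.normal(size=(num_frames, param_dim))       # fitted parameters
codebook = rng.normal(size=(codebook_size, param_dim))  # learned in practice

def quantize(frames, codebook):
    """Assign each frame to its nearest codebook entry (motion token id)."""
    # Pairwise squared distances: (num_frames, codebook_size).
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def dequantize(tokens, codebook):
    """Map token ids back to approximate facial parameters for animation."""
    return codebook[tokens]

tokens = quantize(frames, codebook)   # token sequence a model could read
recon = dequantize(tokens, codebook)  # parameters a model's output yields
print(tokens[:10], recon.shape)       # first 10 token ids, then (100, 64)
```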
@misc{cvpr2026,
  note = {CVPR 2026, full citation TBD}
}