3DMM Language Model

Bridging Facial Understanding and Animation via Language Models

CVPR 2026
1University of Rochester, 2University of Tokyo, 3University of Michigan, 4Voxel51

Figure 1. Overview of the proposed Open3DFaceVid dataset and the 3D facial understanding/animation pipeline. The left panel visualizes the Open3DFaceVid corpus, which covers a wide range of identities, emotions, and speaking styles generated via text-to-video (T2V) models. The right panel illustrates our interactive 3D facial interface: given a 3DMM sequence, the user prompts the agent to describe expressions and head motion in natural language, and the agent returns fine-grained, parameter-based interpretations. In the reverse direction, the agent conditions on user prompts to generate new 3DMM trajectories with controllable emotion and pose.

Overview of Open3DFaceVid

The prompt attributes for each video are shown above.

We also include videos with synchronized audio [please turn on the audio].

Abstract

Text-guided human body animation has advanced rapidly, yet facial animation lags behind due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt, parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that, in this setting, language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
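The abstract's "quantized motion tokens" step can be illustrated with a minimal sketch: each frame of 3D facial parameters is mapped to the index of its nearest codebook entry, as in vector quantization. The codebook size, parameter dimension, and random data below are illustrative assumptions, not values from the paper.

```python
# Hypothetical VQ sketch: per-frame 3DMM parameter vectors are mapped
# to discrete token ids via nearest-neighbor codebook lookup.
import numpy as np

rng = np.random.default_rng(0)
K, D = 512, 156                      # codebook size and parameter dim (assumed)
codebook = rng.standard_normal((K, D))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each frame in a (T, D) sequence to its nearest codebook index."""
    # (T, K) matrix of squared distances via broadcasting
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

frames = rng.standard_normal((30, D))  # 30 frames of 3DMM parameters
tokens = quantize(frames)              # shape (30,), values in [0, K)
```

In practice the codebook is learned (e.g., with a VQ-VAE-style encoder), but the discretization step itself reduces to this lookup, which is what lets a language model treat motion as a token sequence.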

Open3DFaceVid Dataset

Figure 2. Analysis and overview of the Open3DFaceVid dataset (including the corresponding FLAME geometry). We summarize the control categories induced by the prompts and their corresponding video counts, broken down by the underlying T2V backbones. We further visualize the vocabulary with word clouds, separately for emotion-related terms and for full-text prompts, to highlight the diversity and saliency of affective descriptors.


Method

Figure 3. (a) Motion2Language. Geometry sequences are encoded into discrete facial tokens by the geometry encoder and fed, together with text tokens from the user prompt, into an LLM. Conditioned only on these geometry tokens, the agent generates natural-language descriptions of expression/head motion, enabling interactive question answering about 3D facial behavior. (b) Language2Motion. The user provides a natural-language description of the desired facial behavior (top left). The text tokenizer converts the prompt into word-level tokens, while a paired 3D facial sequence is encoded into discrete geometry tokens by the geometry encoder. The autoregressive transformer predicts future geometry tokens conditioned on the text prefix.
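The Language2Motion sequence layout in Figure 3(b) — text prefix followed by geometry tokens — can be sketched as below. All names (separator/EOS ids, vocabulary size, offset scheme) are assumptions for illustration; the paper does not specify them.

```python
# Illustrative sketch of assembling one Language2Motion training sequence:
# prompt tokens form the prefix, then a separator, then geometry tokens
# shifted into their own id range so the two vocabularies never collide.
TEXT_VOCAB = 32000                        # assumed text vocabulary size
SEP, EOS = TEXT_VOCAB, TEXT_VOCAB + 1     # special ids placed after the text vocab
GEO_OFFSET = TEXT_VOCAB + 2               # geometry token ids start here

def build_sequence(text_ids, geo_ids):
    """Concatenate prompt tokens, separator, offset geometry tokens, and EOS."""
    return list(text_ids) + [SEP] + [GEO_OFFSET + g for g in geo_ids] + [EOS]

seq = build_sequence([17, 42, 99], [3, 3, 7])
# The autoregressive transformer is trained to predict each token after SEP,
# so at inference it extends the text prefix with geometry tokens until EOS.
```

Offsetting into disjoint id ranges is one common way to host motion tokens inside a text LLM's vocabulary; an alternative is a separate embedding table per modality.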

Language2Motion Results

Motion2Language Results

Language2Motion Results [with Audio Condition]



BibTeX

@misc{
      % CVPR 2026, full citation TBD
}