3DMM Language Model

Bridging Facial Understanding and Animation via Language Models

CVPR 2026
1University of Rochester, 2University of Tokyo, 3University of Michigan, 4Voxel51

Figure 1. Overview of the proposed Open3DFaceVid dataset and the 3D facial understanding/animation pipeline. The left panel visualizes the Open3DFaceVid corpus, which covers a wide range of identities, emotions, and speaking styles generated via text-to-video (T2V) models. The right panel illustrates our interactive 3D facial interface: given a 3DMM sequence, the user prompts the agent to describe expressions and head motion in natural language, and the agent returns fine-grained, parameter-based interpretations. In the reverse direction, the agent conditions on user prompts to generate new 3DMM trajectories with controllable emotion and pose.

Overview of Open3DFaceVid

The prompt attributes for each video are shown above.

We also include videos with synchronized audio [please turn on the audio].

Abstract

Text-guided human body animation has advanced rapidly, yet facial animation lags behind due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt, parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that, in this setting, language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
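The abstract's "quantized motion tokens" step can be illustrated with a minimal sketch: each frame of 3D facial parameters is mapped to the index of its nearest codebook entry, as in vector quantization. The codebook size, parameter dimension, and random data below are illustrative assumptions, not values from the paper.

```python
# Hypothetical VQ sketch: per-frame 3DMM parameter vectors are mapped
# to discrete token ids via nearest-neighbor codebook lookup.
import numpy as np

rng = np.random.default_rng(0)
K, D = 512, 156                      # codebook size and parameter dim (assumed)
codebook = rng.standard_normal((K, D))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each frame in a (T, D) sequence to its nearest codebook index."""
    # (T, K) matrix of squared distances via broadcasting
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

frames = rng.standard_normal((30, D))  # 30 frames of 3DMM parameters
tokens = quantize(frames)              # shape (30,), values in [0, K)
```

In practice the codebook is learned (e.g., with a VQ-VAE-style encoder), but the discretization step itself reduces to this lookup, which is what lets a language model treat motion as a token sequence.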

Open3DFaceVid Dataset

Figure 2. Analysis and overview of the Open3DFaceVid dataset (including the corresponding FLAME geometry). We summarize the control categories induced by the prompts and their corresponding video counts, broken down by the underlying T2V backbones. We further visualize the vocabulary with word clouds, separately for emotion-related terms and for full-text prompts, to highlight the diversity and saliency of affective descriptors.


Method

Figure 3. (a) Motion2Language. Geometry sequences are encoded into discrete facial tokens by the geometry encoder and fed, together with text tokens from the user prompt, into an LLM. Conditioned only on these geometry tokens, the agent generates natural-language descriptions of expression/head motion, enabling interactive question answering about 3D facial behavior. (b) Language2Motion. The user provides a natural-language description of the desired facial behavior (top left). The text tokenizer converts the prompt into word-level tokens, while a paired 3D facial sequence is encoded into discrete geometry tokens by the geometry encoder. The autoregressive transformer predicts future geometry tokens conditioned on the text prefix.
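The Language2Motion sequence layout in Figure 3(b) — text prefix followed by geometry tokens — can be sketched as below. All names (separator/EOS ids, vocabulary size, offset scheme) are assumptions for illustration; the paper does not specify them.

```python
# Illustrative sketch of assembling one Language2Motion training sequence:
# prompt tokens form the prefix, then a separator, then geometry tokens
# shifted into their own id range so the two vocabularies never collide.
TEXT_VOCAB = 32000                        # assumed text vocabulary size
SEP, EOS = TEXT_VOCAB, TEXT_VOCAB + 1     # special ids placed after the text vocab
GEO_OFFSET = TEXT_VOCAB + 2               # geometry token ids start here

def build_sequence(text_ids, geo_ids):
    """Concatenate prompt tokens, separator, offset geometry tokens, and EOS."""
    return list(text_ids) + [SEP] + [GEO_OFFSET + g for g in geo_ids] + [EOS]

seq = build_sequence([17, 42, 99], [3, 3, 7])
# The autoregressive transformer is trained to predict each token after SEP,
# so at inference it extends the text prefix with geometry tokens until EOS.
```

Offsetting into disjoint id ranges is one common way to host motion tokens inside a text LLM's vocabulary; an alternative is a separate embedding table per modality.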

Language2Motion Results

Motion2Language Results

Language2Motion Results [with Audio Condition]



BibTeX

@misc{
      % CVPR 2026, full citation TBD
}