Tri2-plane: Thinking Head Avatar via Feature Pyramid

University of Rochester, Sony AI, USTC

Figure 1. We present Tri2-plane, a method designed for high-fidelity head avatar reconstruction from a short monocular video. The top row illustrates novel-view avatar synthesis (viewpoint interpolation over [-40°, +40°]) with facial expressions, and the bottom row displays the canonical appearance at the corresponding viewpoints.

Abstract

Recent years have witnessed considerable achievements in facial avatar reconstruction with neural volume rendering. Despite these advances, reconstructing complex and dynamic head movements from monocular videos still struggles with capturing and restoring fine-grained details. In this work, we propose a novel approach, named Tri2-plane, for monocular photo-realistic volumetric head avatar reconstruction. Distinct from existing works that rely on a single tri-plane deformation field for dynamic facial modeling, the proposed Tri2-plane leverages the principle of feature pyramids: three tri-planes linked by top-down lateral connections improve fine-grained detail. It samples and renders facial details at multiple scales, transitioning from the entire face to specific local regions and then to even more refined sub-regions. Moreover, we incorporate a camera-based, geometry-aware sliding-window method as a training augmentation, which improves robustness beyond the canonical space, with a particular improvement in cross-identity generation. Experimental results show that Tri2-plane surpasses existing methods in both quantitative and qualitative assessments.
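To make the multi-scale idea concrete, below is a minimal PyTorch sketch of three tri-planes linked by top-down lateral connections, queried from the whole face down to a sub-region and feeding a shared decoder. All tensor shapes, the names `CascadedTriplanes` and `sample_triplane`, the 1x1-conv lateral connections, and the halving spatial extents are illustrative assumptions, not the paper's implementation.

```python
# Sketch of feature-pyramid-style tri-plane sampling at three scales
# (whole face -> local region -> sub-region). Shapes, module names, and
# the lateral-connection placement are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """Bilinearly sample a tri-plane (3, C, H, W) at 3D points in [-1, 1].

    planes: (3, C, H, W) feature planes for the xy, xz, yz projections.
    pts:    (N, 3) query points.
    Returns (N, C) features, summed over the three projections.
    """
    coords = torch.stack(
        [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]], dim=0
    )                                                 # (3, N, 2)
    grid = coords.unsqueeze(1)                        # (3, 1, N, 2)
    feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, 1, N)
    return feats.squeeze(2).sum(dim=0).t()            # (N, C)

class CascadedTriplanes(nn.Module):
    """Three tri-planes over progressively smaller spatial extents,
    linked by top-down lateral connections (assumed 1x1 convs)."""

    def __init__(self, channels=32, resolution=256):
        super().__init__()
        self.planes = nn.ParameterList(
            [nn.Parameter(torch.randn(3, channels, resolution, resolution) * 0.01)
             for _ in range(3)]
        )
        self.lateral = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(2)]
        )
        # Each level covers half the spatial extent of the one above.
        self.scales = [1.0, 0.5, 0.25]

    def forward(self, pts):
        feats, prev = 0.0, None
        for i, scale in enumerate(self.scales):
            planes = self.planes[i]
            if prev is not None:
                # Top-down lateral connection: coarser features refine
                # the finer level before it is sampled.
                planes = planes + self.lateral[i - 1](prev)
            # Points outside a finer level's extent fall off its planes
            # and receive grid_sample's default zero padding.
            feats = feats + sample_triplane(planes, pts / scale)
            prev = planes
        return feats  # fed to the shared MLP decoder
```

Sharing the decoder MLP across the three levels, as the pipeline overview describes, keeps the per-scale capacity in the feature planes themselves.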

Method

Figure 2. Overview of Tri2-plane. The pipeline comprises four components: (1) parametric facial tracking and zero-pose rendering generate the mean texture and normal maps (shown as Front-View); (2) a facial condition embedding is built from the inputs (βt, γt, and the encoded It); (3) multiple tri-planes perform voxel rendering (the Tri2-plane), accommodating various facial scales while sharing MLP weights; and (4) the resulting images are refined with a super-resolution model (not depicted in the figure). Furthermore, we introduce the geometry-aware sliding window for training-data augmentation to improve robustness, which combines the camera parameters (cI, cE) with the tracked translation values to form the training pair.
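The geometry-aware sliding window can be pictured as a random crop whose camera intrinsics are updated in lockstep, so the projection stays consistent with the tracked head translation. The sketch below is an assumption of how such an augmentation might look: `sliding_window_pair`, the window size, and the shift range are hypothetical, while the principal-point adjustment itself is standard camera geometry.

```python
# Hedged sketch of a geometry-aware sliding-window augmentation: slide
# the training crop in image space and shift the intrinsics' principal
# point to match, keeping the crop geometrically consistent with the
# tracked pose. Names and defaults are assumptions for illustration.
import numpy as np

def sliding_window_pair(image, K, window=512, max_shift=64, rng=None):
    """Randomly slide a crop window and adjust intrinsics to match.

    image: (H, W, 3) full training frame.
    K:     (3, 3) camera intrinsic matrix.
    Returns the shifted crop and the intrinsics in crop coordinates.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Nominal top-left corner of a centered window, plus a random shift.
    x0 = (w - window) // 2 + int(rng.integers(-max_shift, max_shift + 1))
    y0 = (h - window) // 2 + int(rng.integers(-max_shift, max_shift + 1))
    x0 = int(np.clip(x0, 0, w - window))
    y0 = int(np.clip(y0, 0, h - window))

    crop = image[y0:y0 + window, x0:x0 + window]

    # Express the principal point in the crop's coordinate frame, so the
    # renderer's projection matches the shifted ground-truth crop.
    K_crop = K.copy()
    K_crop[0, 2] -= x0
    K_crop[1, 2] -= y0
    return crop, K_crop
```

Because only the principal point changes, the same tracked head translation remains valid for every shifted window, which is what lets the shifted crops serve as extra training pairs.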

Demo Video

Most of the videos in this presentation come from public data or datasets. We are very grateful to the authors who provided them.

[Please turn on the volume for voice over]

Demo Video [Baselines in Appendix]

BibTeX

@article{song2024tri,
  title={Tri$^{2}$-plane: Volumetric Avatar Reconstruction with Feature Pyramid},
  author={Song, Luchuan and Liu, Pinxin and Chen, Lele and Yin, Guojun and Xu, Chenliang},
  journal={arXiv preprint arXiv:2401.09386},
  year={2024}
}