GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction

ACMMM 2023

1The University of Sydney, Australia 2JD Explore Academy, China 3Max Planck Institute for Intelligent Systems, Germany

GraMMaR: 3D motion in the camera view can be misleading. A representative optimization method, HuMoR, produces poses that look correct in the camera view but are physically implausible in the world view when faced with ambiguity (Row 1) and noise (Row 2). In contrast, our method produces ground-aware motion, ensuring physical plausibility across all views. Body torso direction and contacts are highlighted for both HuMoR and our method. GT in Row 1 is reconstructed from multi-view images.

Abstract

Demystifying complex human-ground interactions is essential for accurate and realistic 3D human motion reconstruction from RGB videos, as it ensures consistency between the humans and the ground plane. Prior methods have modeled human-ground interactions either implicitly or in a sparse manner, often resulting in unrealistic and incorrect motions when faced with noise and uncertainty. In contrast, our approach explicitly represents these interactions in a dense and continuous manner. To this end, we propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR, which jointly learns the distribution of transitions in both pose and interaction between every joint and the ground plane at each time step of a motion sequence. It is trained to explicitly promote consistency between the motion and the change in distance towards the ground. After training, we establish a joint optimization strategy that utilizes GraMMaR as a dual prior, regularizing the optimization towards the space of plausible ground-aware motions. This leads to realistic and coherent motion reconstruction, irrespective of the assumed or learned ground plane. Through extensive evaluation on the AMASS and AIST++ datasets, our model demonstrates strong generalization and discrimination abilities in challenging cases, including complex and ambiguous human-ground interactions.
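To make the dense interaction representation concrete, here is a minimal sketch of how a per-joint ground interaction state could be computed. The abstract specifies only that the interaction is modeled between every joint and the ground plane; the signed-distance parameterization, function name, and example values below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def interaction_state(joints, ground_normal, ground_offset):
    """Per-joint signed distance to a ground plane n.p + d = 0.

    joints:        (J, 3) array of 3D joint positions
    ground_normal: (3,) normal of the ground plane
    ground_offset: scalar plane offset d
    Returns a (J,) vector of signed distances: a dense,
    continuous interaction state (negative = penetration).
    NOTE: this parameterization is an assumption for illustration.
    """
    n = ground_normal / np.linalg.norm(ground_normal)
    return joints @ n + ground_offset

# Example: two joints above a horizontal ground plane y = 0
joints = np.array([[0.0, 0.90, 0.0],    # hip, well above ground
                   [0.1, 0.02, 0.1]])   # ankle, near contact
g = interaction_state(joints, np.array([0.0, 1.0, 0.0]), 0.0)
```

Unlike a binary contact label on a few selected joints, such a state is defined for every joint at every frame, which is what makes the representation dense and continuous.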

Video

Method

The GraMMaR architecture. During training, given the previous state I_{t-1} and current state I_t, we obtain the motion states x_{t-1}, x_t and the interaction states g_{t-1}, g_t. The model learns the transitions of the motion and interaction states separately through two priors, and reconstructs x̂_t, ĝ_t by sampling from the two distributions and decoding, conditioned on both x_{t-1} and g_{t-1}.
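The transition-and-decode step above can be sketched as a small dual-prior module. This is a hypothetical PyTorch layout under stated assumptions: Gaussian conditional priors with the reparameterization trick, a shared decoder, and arbitrary dimensions (x_dim, g_dim, z_dim); the paper's actual network sizes and layer choices are not specified here.

```python
import torch
import torch.nn as nn

class DualPrior(nn.Module):
    """Sketch of a dual-prior transition model (hypothetical layout).

    One conditional prior models the motion-state transition given
    x_{t-1}; the other models the interaction-state transition given
    g_{t-1}. A shared decoder reconstructs both states from the two
    sampled latents, conditioned on x_{t-1} and g_{t-1}.
    """
    def __init__(self, x_dim, g_dim, z_dim=16, hidden=64):
        super().__init__()
        # Each prior outputs (mu, logvar) for its latent code.
        self.motion_prior = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * z_dim))
        self.inter_prior = nn.Sequential(
            nn.Linear(g_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * z_dim))
        # Decoder conditioned on both previous states and both latents.
        self.decoder = nn.Sequential(
            nn.Linear(2 * z_dim + x_dim + g_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim + g_dim))
        self.x_dim, self.g_dim = x_dim, g_dim

    def forward(self, x_prev, g_prev):
        mu_x, logvar_x = self.motion_prior(x_prev).chunk(2, dim=-1)
        mu_g, logvar_g = self.inter_prior(g_prev).chunk(2, dim=-1)
        # Reparameterized samples from the two transition distributions.
        z_x = mu_x + torch.randn_like(mu_x) * (0.5 * logvar_x).exp()
        z_g = mu_g + torch.randn_like(mu_g) * (0.5 * logvar_g).exp()
        out = self.decoder(torch.cat([z_x, z_g, x_prev, g_prev], dim=-1))
        return out.split([self.x_dim, self.g_dim], dim=-1)

# Illustrative dimensions only (e.g. SMPL-like pose and 22 joints).
model = DualPrior(x_dim=69, g_dim=22)
x_hat, g_hat = model(torch.zeros(1, 69), torch.zeros(1, 22))
```

Keeping the two priors separate lets the training loss supervise motion and ground-interaction transitions independently, while the shared decoder ties them together so the reconstructed motion stays consistent with the predicted distance change towards the ground.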

BibTeX


      @inproceedings{ma2023grammar,
        title={GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction},
        author={Ma, Sihan and Cao, Qiong and Yi, Hongwei and Zhang, Jing and Tao, Dacheng},
        booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
        pages={2817--2828},
        year={2023}
      }