Learning 3D Human Dynamics from Video


Jason Zhang, in collaboration with:
Angjoo Kanazawa, Panna Felsen, and Jitendra Malik

[Project Webpage]
[Paper]
[Video]



From a video of a human, our model (blue) can predict 3D meshes that are more temporally consistent than a method that only uses a single frame (pink).
From a single image (purple), our model can recover the current 3D mesh as well as the past and future 3D poses.


Abstract

From an image of a person in action, we can easily guess the 3D motion of the person in the immediate past and future. This is because we have a mental model of 3D human dynamics that we have acquired from observing visual sequences of humans in motion. We present a framework that can similarly learn a representation of 3D dynamics of humans from video via a simple but effective temporal encoding of image features. At test time, from video, the learned temporal representation can recover smooth 3D mesh predictions. From a single image, our model can recover the current 3D mesh as well as its 3D past and future motion. Our approach is designed so it can learn from videos with 2D pose annotations in a semi-supervised manner. However, annotated data is always limited. On the other hand, millions of videos are uploaded to the Internet daily. In this work, we harvest this Internet-scale source of unlabeled data by training our model on these videos with pseudo-ground truth 2D pose obtained from an off-the-shelf 2D pose detector. Our experiments show that adding more videos with pseudo-ground truth 2D pose monotonically improves 3D prediction performance. We evaluate our model on the recent, challenging 3D Poses in the Wild dataset and obtain state-of-the-art performance on the 3D prediction task without any fine-tuning.
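For intuition on how pseudo-ground truth 2D pose can supervise the model, the sketch below shows a confidence-weighted 2D reprojection loss, where detector confidences down-weight noisy keypoints. The tensor shapes, the L1 penalty, and the function name are illustrative assumptions rather than the exact loss used in the paper.

```python
# Hedged sketch: a confidence-weighted 2D reprojection loss for pseudo-ground
# truth keypoints from an off-the-shelf 2D detector. Shapes and the choice of
# an L1 penalty are assumptions for illustration only.
import torch


def reprojection_loss(pred_kp2d, pseudo_kp2d, confidence):
    """
    pred_kp2d:   (B, K, 2) 2D keypoints projected from the predicted 3D mesh
    pseudo_kp2d: (B, K, 2) 2D keypoints from an off-the-shelf detector
    confidence:  (B, K)    detector confidence, used to down-weight noisy joints
    """
    residual = (pred_kp2d - pseudo_kp2d).abs().sum(dim=-1)  # (B, K) per-joint L1
    return (confidence * residual).mean()


# Example: joints with low confidence contribute little to the loss.
pred = torch.randn(2, 25, 2)
pseudo = torch.randn(2, 25, 2)
conf = torch.rand(2, 25)
print(reprojection_loss(pred, pseudo, conf))
```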


Overview


Given a temporal sequence of images, we first extract per-image features $\phi_t$. We train a temporal encoder $f_{\text{movie}}$ that learns a representation of 3D human dynamics $\Phi_t$ over the temporal window centered at frame $t$, illustrated in the blue region. From $\Phi_t$, we predict the 3D human pose and shape $\Theta_t$, as well as the change in pose in the nearby $\pm \Delta t$ frames. The primary loss is 2D reprojection error, with an adversarial prior to ensure that the recovered poses are valid. We incorporate 3D losses when 3D annotations are available. We also train a hallucinator $h$ that takes a single image feature $\phi_t$ and learns to hallucinate its temporal representation $\tilde{\Phi}_t$. At test time, the hallucinator can be used to predict dynamics from a single image. A schematic sketch of these components appears below.
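The PyTorch sketch below mirrors the components named above: a temporal encoder $f_{\text{movie}}$, a hallucinator $h$, and heads that predict $\Theta_t$ and the pose changes at $\pm \Delta t$ frames. The layer sizes, module names, and the specific choice of two stacked 1D convolutions are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the overview, under assumed feature and output dimensions.
import torch
import torch.nn as nn

FEAT_DIM = 2048   # per-image feature phi_t (e.g., from a CNN backbone); assumed size
REP_DIM = 2048    # temporal representation Phi_t; assumed size
THETA_DIM = 85    # pose/shape/camera parameters Theta_t; assumed size


class TemporalEncoder(nn.Module):
    """f_movie: 1D temporal convolutions over per-frame features phi_t -> Phi_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(FEAT_DIM, REP_DIM, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(REP_DIM, REP_DIM, kernel_size=3, padding=1),
        )

    def forward(self, phi):                  # phi: (B, T, FEAT_DIM)
        x = phi.transpose(1, 2)              # (B, FEAT_DIM, T) for Conv1d
        return self.net(x).transpose(1, 2)   # Phi: (B, T, REP_DIM)


class Hallucinator(nn.Module):
    """h: maps a single-frame feature phi_t to a hallucinated representation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, REP_DIM), nn.ReLU(), nn.Linear(REP_DIM, REP_DIM)
        )

    def forward(self, phi_t):                # (B, FEAT_DIM) -> (B, REP_DIM)
        return self.net(phi_t)


class DynamicsHeads(nn.Module):
    """Predicts Theta_t from Phi_t, plus pose changes at +/- delta_t frames."""
    def __init__(self, deltas=(-5, 5)):      # delta_t offsets are an assumption
        super().__init__()
        self.current = nn.Linear(REP_DIM, THETA_DIM)
        self.deltas = nn.ModuleDict(
            {str(d): nn.Linear(REP_DIM, THETA_DIM) for d in deltas}
        )

    def forward(self, Phi_t):                # (B, REP_DIM) or (B, T, REP_DIM)
        out = {"theta": self.current(Phi_t)}
        for d, head in self.deltas.items():
            out[f"delta_{d}"] = head(Phi_t)
        return out
```

At training time, the hallucinated representation would be encouraged to match the true $\Phi_t$ (e.g., with a regression loss) so that, at test time, a single image feature can stand in for the full temporal window.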


Paper

Kanazawa*, Zhang*, Felsen*, and Malik.

Learning 3D Human Dynamics from Video.

arXiv, 2018.

[pdf]     [Bibtex]



Project Video




Acknowledgements

Please see the main project webpage for full acknowledgements. This webpage template was borrowed from some colorful folks.