Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos

CVPR 2021

Oral Presentation

Best Paper Honorable Mention Award

University of Minnesota

Abstract

A key challenge in learning the geometry of dressed humans lies in the limited availability of ground truth data (e.g., 3D scanned models), which degrades the performance of 3D human reconstruction when applied to real-world imagery. We address this challenge by leveraging a new data resource: a number of social media dance videos that span diverse appearances, clothing styles, performances, and identities. Each video depicts dynamic movements of the body and clothes of a single person while lacking ground truth 3D geometry. To utilize these videos, we present a new method that uses a local transformation to warp the predicted local geometry of the person in one image to that of another image taken at a different time instant. With this transformation, the predicted geometry can be self-supervised by the warped geometry from the other image. In addition, we jointly learn the depth along with the surface normals, which are highly responsive to local texture, wrinkles, and shading, by maximizing their geometric consistency. Our method is end-to-end trainable, resulting in high fidelity depth estimation that predicts fine geometry faithful to the input real image. We demonstrate that our method outperforms state-of-the-art human depth estimation and human shape recovery approaches on both real and rendered images.
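For intuition, the two consistency terms described above can be sketched as follows; the notation (depth maps D, warp W, normal maps N) and the exact loss forms are our own simplification, not taken verbatim from the paper.

% Sketch of the self-supervision described above (notation is ours, not the paper's).
% D_t, D_{t'} : predicted depth maps of the person at time instants t and t'
% W_{t' -> t} : local transformation warping the predicted geometry of frame t' into frame t
% N_t : predicted surface normal map;  n(D_t) : normals derived from the depth D_t
\mathcal{L}_{\mathrm{warp}}   = \sum_{\mathbf{x}} \big\| D_t(\mathbf{x}) - \big(\mathcal{W}_{t' \to t} \circ D_{t'}\big)(\mathbf{x}) \big\|_1
\mathcal{L}_{\mathrm{normal}} = \sum_{\mathbf{x}} \Big( 1 - \big\langle N_t(\mathbf{x}),\, n\big(D_t\big)(\mathbf{x}) \big\rangle \Big)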

TikTok Dataset

We learn high fidelity human depths by leveraging a collection of social media dance videos scraped from the TikTok mobile social networking application. TikTok is by far one of the most popular video sharing applications across generations, hosting short videos (10-15 seconds) of diverse dance challenges, as shown above. From monthly TikTok dance challenge compilations, we manually select more than 300 videos covering a variety of dance types, each capturing a single person performing moderate movements that do not generate excessive motion blur. For each video, we extract RGB images at 30 frames per second, resulting in more than 100K images. We segment these images using the remove.bg application and compute the UV coordinates with DensePose.
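As a rough illustration of the frame-extraction step above, the snippet below samples a raw video at 30 fps with ffmpeg. The paths, output directory, and helper name are hypothetical (one raw video actually covers several sequences), and the remove.bg segmentation and DensePose UV estimation are separate steps run with their own tools.

# Minimal sketch of the frame-extraction step (hypothetical paths and helper name).
# Segmentation (remove.bg) and DensePose UV estimation are run separately afterwards.
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 30) -> None:
    """Extract RGB frames from a dance video at a fixed frame rate using ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"fps={fps}",                   # sample at 30 frames per second
         str(Path(out_dir) / "%04d.png")],      # 0001.png, 0002.png, ...
        check=True)

extract_frames("TikTok_Raw_Videos/seq_00001_00009/YouTube.mp4",
               "TikTok_dataset/00001/images")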

Download TikTok Dataset:

  • The dataset can be viewed and downloaded from the Kaggle page. (A free Kaggle account is required to download the data.)

  • Alternatively, you can download it directly from Google Drive:

    • The dataset can be downloaded from here (42 GB). The image resolution is 1080 x 604.

    • The original YouTube videos corresponding to each sequence, along with the dance names, can be downloaded from here (2.6 GB).

    • Please contact me (yasamin@umn.edu) if the links are broken or you cannot access the Kaggle page.

TikTok Dataset Directory Structure:

TikTok_dataset

|_ 00001 (Sequence#)

| |_ images

| | |_ 0001.png (Frame#)

| | |_ 0002.png

| | |_ ....

| |_ masks

| | |_ 0001.png

| | |_ ....

| |_ densepose

| | |_ 0001.png

| | |_ ....

|_ 00002

|_ ...

|_ 00340


TikTok_Raw_Videos

|_ seq_00001_00009 (first sequence#_last sequence#)

| |_ dance_name.txt

| |_ video_link.txt

| |_ YouTube.mp4

|_ ...

|_ seq_00329_00340

Terms of usage and License:

  • The code and the TikTok dataset are supplied with no warranty, and neither the University of Minnesota nor the authors will be held responsible for the correctness of the code and data.

  • The code and the data will not be transferred to outside parties without the authors' permission and will be used only for research purposes. In particular, the code or TikTok dataset will not be included as part of any commercial software package or product of this institution.

More Results

[Qualitative results: input images alongside colored reconstructions and surface reconstructions from different views, plus a close-up of the Mona Lisa reconstruction's smile.]

News:

Aug 17th 2021: The Google Colab version of the code is added to the GitHub page.

June 21st 2021: The paper won the Best Paper Honorable Mention.

June 16th 2021: The TikTok dataset is added to the Kaggle page.

June 15th 2021: The MATLAB visualization code is added to the GitHub page.

June 12th 2021: The paper is chosen as a CVPR best paper candidate.

June 8th 2021: The training code is added to the GitHub page.

Apr 9th 2021: More results on web images are added to the project page.

Mar 11th 2021: The problem with TikTok dataset sequences 231-240 is fixed, and the link above is updated.

Mar 9th 2021: The inference code for the paper is added to the GitHub page.

Mar 3rd 2021: The paper is accepted for oral presentation at CVPR 2021.

Citation

If you find this work useful, please consider citing:

@InProceedings{Jafarian_2021_CVPR_TikTok,
  author    = {Jafarian, Yasamin and Park, Hyun Soo},
  title     = {Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2021},
  pages     = {12753-12762}
}


@Article{Jafarian_Self-supervised_3D_Representation,
  author    = {Jafarian, Yasamin and Park, Hyun Soo},
  title     = {Self-supervised 3D Representation Learning of Dressed Humans from Social Media Videos},
  journal   = {IEEE Transactions on Pattern Analysis \& Machine Intelligence},
  year      = {2022},
  doi       = {10.1109/TPAMI.2022.3231558},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA}
}