Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos

CVPR 2021

Oral Presentation 

Best Paper Honorable Mention Award

University of Minnesota


A key challenge in learning the geometry of dressed humans is the limited availability of ground-truth data (e.g., 3D scanned models), which degrades the performance of 3D human reconstruction when applied to real-world imagery. We address this challenge by leveraging a new data resource: a large collection of social media dance videos spanning diverse appearances, clothing styles, performances, and identities. Each video depicts the dynamic movements of a single person's body and clothes but lacks ground-truth 3D geometry. To utilize these videos, we present a new method that uses a local transformation to warp the predicted local geometry of the person from one image to that of another image taken at a different time instant. With this transformation, the predicted geometry can be self-supervised by the warped geometry from the other image. In addition, we jointly learn the depth along with the surface normals, which are highly responsive to local texture, wrinkles, and shading, by maximizing their geometric consistency. Our method is end-to-end trainable, yielding high-fidelity depth estimation that predicts fine geometry faithful to the real input image. We demonstrate that our method outperforms state-of-the-art human depth estimation and human shape recovery approaches on both real and rendered images.
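As a rough, simplified illustration of the depth-normal consistency idea (not the authors' actual loss; camera intrinsics and the warping-based self-supervision are omitted), a normal map can be derived from a predicted depth map via finite differences and compared against a separately predicted normal map:

```python
import numpy as np

def normals_from_depth(depth):
    """Approximate per-pixel surface normals from a depth map via finite
    differences (camera-space approximation; intrinsics omitted)."""
    dz_dv, dz_du = np.gradient(depth)  # vertical, horizontal depth gradients
    normals = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    # Normalize each normal vector to unit length.
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals

def consistency_loss(depth, pred_normals):
    """Mean cosine distance between depth-derived and predicted normals:
    zero when the two geometry predictions agree everywhere."""
    n = normals_from_depth(depth)
    cos = np.sum(n * pred_normals, axis=2)
    return float(np.mean(1.0 - cos))
```

For a fronto-parallel plane (constant depth), the derived normals all point toward the camera, and the loss against a matching normal map is zero; disagreement between the two predictions increases the loss, which can then be minimized during training.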

TikTok Dataset

We learn high-fidelity human depths by leveraging a collection of social media dance videos scraped from the TikTok mobile social networking application. TikTok is by far one of the most popular video sharing applications across generations, featuring short clips (10-15 seconds) of diverse dance challenges as shown above. From monthly TikTok dance challenge compilations covering a variety of dance types, we manually select more than 300 videos, each capturing a single person performing moderate dance moves that do not generate excessive motion blur. For each video, we extract RGB images at 30 frames per second, resulting in more than 100K images. We segment these images using the Removebg application and compute the UV coordinates using DensePose.
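A minimal sketch of the frame-subsampling step described above (the function name and signature are illustrative, not part of the released pipeline; segmentation with Removebg and DensePose UV extraction are external tools not shown):

```python
def frame_plan(n_frames, native_fps, target_fps=30):
    """Which source-frame indices to keep, and the zero-padded file names
    (0001.png, 0002.png, ...) they are saved under, when subsampling a
    clip from native_fps down to target_fps."""
    step = max(1, round(native_fps / target_fps))
    kept = range(0, n_frames, step)
    return [(i, f"{k + 1:04d}.png") for k, i in enumerate(kept)]
```

At 30 fps, a 10-15 second clip yields roughly 300-450 frames, so 300+ videos produce on the order of 100K images, consistent with the dataset size above.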

Download TikTok Dataset:

TikTok Dataset Directory Structure:


|_ 00001                         (sequence #)
|  |_ images
|  |  |_ 0001.png                (frame #)
|  |  |_ 0002.png
|  |  |_ ...
|  |_ masks
|  |  |_ 0001.png
|  |  |_ ...
|  |_ densepose
|     |_ 0001.png
|     |_ ...
|_ 00002
|_ ...
|_ 00340

|_ seq_00001_00009               (first sequence #_last sequence #)
|  |_ dance_name.txt
|  |_ video_link.txt
|  |_ YouTube.mp4
|_ ...
|_ seq_00329_00340
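Given the layout above, the image, mask, and DensePose map for one frame can be located as follows (a hypothetical helper for illustration; `root` is wherever the dataset is extracted):

```python
import os

def frame_paths(root, seq, frame):
    """Paths to the RGB image, person mask, and DensePose map for one
    frame, following the dataset directory layout (5-digit sequence
    folders, 4-digit frame file names)."""
    name = f"{frame:04d}.png"
    seq_dir = os.path.join(root, f"{seq:05d}")
    return {
        "image": os.path.join(seq_dir, "images", name),
        "mask": os.path.join(seq_dir, "masks", name),
        "densepose": os.path.join(seq_dir, "densepose", name),
    }
```

For example, `frame_paths("TikTok_dataset", 1, 2)["image"]` points at `00001/images/0002.png` under the dataset root.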

Terms of usage and License:

More Results

Input Image

Colored Reconstruction from Different Views

Surface Reconstruction from Different Views

More Details of the Mona Lisa Smile Reconstruction


Aug 17th 2021: The Google Colab version of the code is added to the GitHub page.

June 21st 2021: The paper won the Best Paper Honorable Mention.

June 16th 2021: The TikTok dataset is added to the Kaggle page.

June 15th 2021: The MATLAB visualization code is added to the GitHub page.

June 12th 2021: The paper is chosen as a CVPR best paper candidate.

June 8th 2021: The training code is added to the GitHub page.

Apr 9th 2021: More results on web images are added to the project page.

Mar 11th 2021: The problem with TikTok dataset sequences 231-240 is fixed and the link above is updated.

Mar 9th 2021: The inference code for the paper is added to the GitHub page.

Mar 3rd 2021: The paper is accepted for oral presentation at CVPR 2021.


If you find this work useful, please consider citing it: 


@InProceedings{Jafarian_2021_CVPR,
    author    = {Jafarian, Yasamin and Park, Hyun Soo},
    title     = {Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {12753-12762}
}


    title={Self-supervised 3D Representation Learning of Dressed Humans from Social Media Videos}, 

    author={Y. Jafarian and H. Park},

    journal = {IEEE Transactions on Pattern Analysis & Machine Intelligence},


    doi = {10.1109/TPAMI.2022.3231558},

    publisher = {IEEE Computer Society}, 

    address = {Los Alamitos, CA, USA}}