Yasamin Jafarian

Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos

CVPR 2021

Oral Presentation

Best Paper Honorable Mention Award

University of Minnesota

Yasamin Jafarian

Hyun Soo Park

[Paper]

[Code]

[Dataset]

Abstract

A key challenge of learning the geometry of dressed humans lies in the limited availability of ground truth data (e.g., 3D scanned models), which degrades the performance of 3D human reconstruction when applied to real-world imagery. We address this challenge by leveraging a new data resource: a number of social media dance videos that span diverse appearance, clothing styles, performances, and identities. Each video depicts dynamic movements of the body and clothes of a single person but lacks 3D ground truth geometry. To utilize these videos, we present a new method that uses a local transformation to warp the predicted local geometry of the person from one image to that of another image at a different time instant. With this transformation, the predicted geometry can be self-supervised by the warped geometry from the other image. In addition, we jointly learn the depth along with the surface normals, which are highly responsive to local texture, wrinkles, and shading, by maximizing their geometric consistency. Our method is end-to-end trainable, resulting in high fidelity depth estimation that predicts fine geometry faithful to the input real image. We demonstrate that our method outperforms state-of-the-art human depth estimation and human shape recovery approaches on both real and rendered images.

TikTok Dataset

We learn high fidelity human depths by leveraging a collection of social media dance videos scraped from the TikTok mobile social networking application. TikTok is one of the most popular video sharing applications across generations and hosts short videos (10-15 seconds) of diverse dance challenges, as shown above. From TikTok dance challenge compilations covering each month, variety, and type of dance, we manually selected more than 300 videos that each capture a single person performing moderate dance movements that do not generate excessive motion blur. For each video, we extracted RGB images at 30 frames per second, resulting in more than 100K images. We segmented these images using the Removebg application and computed the UV coordinates with DensePose.

Download TikTok Dataset:

The dataset can be viewed and downloaded from the Kaggle page. (You need a Kaggle account to download the data. Creating one is free.)

TikTok Dataset Directory Structure:

TikTok_dataset
|_ 00001 (Sequence#)
|  |_ images
|  |  |_ 0001.png (Frame#)
|  |  |_ 0002.png
|  |  |_ ....
|  |_ masks
|  |  |_ 0001.png
|  |  |_ ....
|  |_ densepose
|  |  |_ 0001.png
|  |  |_ ....
|_ 00002
|_ ...
|_ 00340

TikTok_Raw_Videos
|_ seq_00001_00009 (first sequence#_last sequence#)
|  |_ dance_name.txt
|  |_ video_link.txt
|  |_ YouTube.mp4
|_ ...
|_ seq_00329_00340
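
The layout above can be traversed with a few lines of Python. The following is a minimal sketch, not part of the released code: the names `FrameTriplet`, `iter_frames`, and `parse_raw_video_range` are hypothetical, and it assumes that the dataset has been extracted locally and that a frame's mask and DensePose map share the file name of its image, as the listing suggests.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class FrameTriplet(NamedTuple):
    """Paths belonging to one frame of one sequence (hypothetical helper type)."""
    sequence: str   # e.g. "00001"
    frame: str      # e.g. "0001"
    image: Path
    mask: Path
    densepose: Path


def iter_frames(root: Path) -> Iterator[FrameTriplet]:
    """Yield every frame under `root` that has an image, a mask, and a DensePose map.

    Assumption: the three files of a frame share the same file name.
    Frames with a missing mask or DensePose map are skipped.
    """
    for seq_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images_dir = seq_dir / "images"
        if not images_dir.is_dir():
            continue
        for image in sorted(images_dir.glob("*.png")):
            mask = seq_dir / "masks" / image.name
            densepose = seq_dir / "densepose" / image.name
            if mask.is_file() and densepose.is_file():
                yield FrameTriplet(seq_dir.name, image.stem, image, mask, densepose)


def parse_raw_video_range(folder_name: str) -> range:
    """Map a raw-video folder name such as 'seq_00001_00009' to its sequence numbers."""
    prefix, first, last = folder_name.split("_")
    if prefix != "seq":
        raise ValueError(f"unexpected folder name: {folder_name!r}")
    return range(int(first), int(last) + 1)
```

For example, `iter_frames(Path("TikTok_dataset"))` would yield one triplet per complete frame, and `parse_raw_video_range("seq_00001_00009")` covers sequences 1 through 9, which identifies the raw video that those sequence folders were cut from.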

Terms of usage and License:

The code and the TikTok dataset are supplied with no warranty, and neither the University of Minnesota nor the authors will be held responsible for the correctness of the code and data.
The code and the data will not be transferred to outside parties without the authors' permission and will be used only for research purposes. In particular, the code or TikTok dataset will not be included as part of any commercial software package or product of this institution.