
Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng

In this paper, we explore the generation of consistent human-centric visual content through a spatial conditioning strategy. We frame consistent reference-based controllable human image and video synthesis as a spatial inpainting task, where the desired content is spatially inpainted under the conditioning of a reference human image. Additionally, we propose a causal spatial conditioning strategy that constrains the interaction between reference and target features causally, thereby preserving the appearance information of the reference images for enhanced consistency. By leveraging the inherent capabilities of the denoising network for appearance detail extraction and conditioned generation, our approach is both straightforward and effective in maintaining fine-grained appearance details and the identity of the reference human image.
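A minimal sketch of what the causal spatial conditioning could look like as an attention mask, assuming reference and target features are flattened into one token sequence (the function name and token layout are illustrative assumptions, not the repository's actual code): reference tokens attend only to themselves, while target tokens attend to both, so appearance information flows one way from the reference to the target.

```python
import torch

def causal_spatial_attention_mask(n_ref: int, n_tgt: int) -> torch.Tensor:
    """Boolean mask of shape (n_ref + n_tgt, n_ref + n_tgt) where True marks
    blocked query/key pairs: reference queries may not attend to target keys,
    while target queries may attend to all tokens."""
    n = n_ref + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_ref, n_ref:] = True  # block reference -> target attention
    return mask

# Tiny example: 3 reference tokens followed by 2 target tokens.
print(causal_spatial_attention_mask(3, 2).int())
```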

Main Architecture

Core idea: use a single denoising U-Net for both reference feature extraction and target image synthesis to ensure content consistency.
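A rough sketch of one denoising step under this idea, assuming the clean reference latent and the noisy target latent are concatenated along the width and processed by the same U-Net; the `unet` callable and its signature are hypothetical placeholders, not the repository's API.

```python
import torch

def spatially_conditioned_step(unet, ref_latent, noisy_tgt_latent, t, attn_mask=None):
    """Concatenate the clean reference latent and the noisy target latent
    along the width and run them through the same denoising U-Net, so the
    network both extracts reference appearance features and synthesizes the
    target. Only the target half of the prediction is returned."""
    x = torch.cat([ref_latent, noisy_tgt_latent], dim=-1)  # (B, C, H, W_ref + W_tgt)
    pred = unet(x, t, attn_mask=attn_mask)                 # hypothetical signature
    return pred[..., ref_latent.shape[-1]:]                # keep the target half only
```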

Results of Human Animation

The animation model is trained on the TikTok dataset (350 videos), UBCFashion (500 videos), and a self-collected dance video dataset (3,500 dance videos featuring about 200 individuals).

More Applications

Our method can also be applied to the visual try-on task to generate garment-consistent human images. During training, noise is added only to the garment region of the human image:

Correspondingly, a regional loss is applied to the denoising U-Net's prediction (a minimal training-step sketch follows the results below). Results of the model trained on the VITON-HD dataset:

Paired setting
Unpaired setting
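A minimal training-step sketch of the regional noising and regional loss described above; `unet` is a hypothetical noise-prediction callable and `add_noise` follows a diffusers-style scheduler interface, both assumptions rather than the repository's actual API.

```python
import torch

def tryon_training_loss(unet, scheduler, latents, garment_mask, t):
    """Add noise only inside the garment region (garment_mask: 1 = garment,
    0 = keep clean) and compute the denoising loss only on that region."""
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)          # diffusers-style call (assumed)
    model_input = garment_mask * noisy + (1.0 - garment_mask) * latents
    pred = unet(model_input, t)                             # hypothetical noise prediction
    sq_err = (pred - noise) ** 2 * garment_mask             # regional loss on the garment area
    return sq_err.sum() / garment_mask.sum().clamp(min=1.0)
```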

Results with Diffusion Transformer-based Model

Our proposed method can also be integrated into diffusion-Transformer-based models, such as SD3 and FLUX, to enhance synthesis quality. In this variant, the reference image is used as additional tokens during training, and the loss is computed only on the noisy (target) tokens. To demonstrate the effectiveness of our method, we present results of FLUX trained on the VITON-HD dataset with the SCD framework:

Try-on results of different base models
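For the Transformer-based variant, the token-level conditioning and target-only loss could look roughly like the sketch below; the `dit` callable and tensor shapes are illustrative assumptions.

```python
import torch

def dit_reference_conditioned_loss(dit, ref_tokens, noisy_tgt_tokens, target, t):
    """Append reference-image tokens to the noisy target tokens along the
    sequence dimension and compute the loss only on the target tokens."""
    n_tgt = noisy_tgt_tokens.shape[1]
    tokens = torch.cat([ref_tokens, noisy_tgt_tokens], dim=1)  # (B, N_ref + N_tgt, D)
    pred = dit(tokens, t)                                      # hypothetical signature
    return torch.mean((pred[:, -n_tgt:] - target) ** 2)        # loss on noisy tokens only
```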

Contact

If you have any comments or questions, please feel free to contact Mingdeng Cao.
