Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng
In this paper, we explore the generation of consistent human-centric visual content through a spatial conditioning strategy. We frame consistent reference-based controllable human image and video synthesis as a spatial inpainting task, where the desired content is spatially inpainted conditioned on a reference human image. Additionally, we propose a causal spatial conditioning strategy that constrains the interaction between reference and target features causally, thereby preserving the appearance information of the reference image for enhanced consistency. By leveraging the inherent capabilities of the denoising network for appearance detail extraction and conditional generation, our approach is both straightforward and effective in maintaining fine-grained appearance details and the identity of the reference human image.
Core idea: Utilizing the denoising U-Net for reference feature extraction and target image synthesis to ensure content consistency.
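Below is a minimal sketch of one way the causal constraint could be realized inside self-attention, assuming the reference and target latents are concatenated along the token dimension so that target tokens can attend to reference tokens but not the reverse. The function name `build_causal_spatial_mask` and the token counts are illustrative assumptions, not a released API.

```python
import torch

def build_causal_spatial_mask(n_ref: int, n_tgt: int) -> torch.Tensor:
    """Boolean attention mask for a [reference | target] token sequence.

    True = attention allowed. Reference tokens attend only within the
    reference block, so their features are not polluted by the noisy
    target; target tokens attend to both blocks, letting appearance
    details flow from the reference into the generated content.
    """
    n = n_ref + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_ref, :n_ref] = True   # reference -> reference only
    mask[n_ref:, :] = True        # target -> reference and target
    return mask

# Example: 256 reference tokens and 256 target tokens.
attn_mask = build_causal_spatial_mask(256, 256)
# The mask can be passed as `attn_mask` to
# torch.nn.functional.scaled_dot_product_attention inside each
# self-attention layer of the denoising network.
```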
Trained on the TikTok dataset (350 videos), the UBCFashion dataset (500 videos), and a self-collected dance video dataset (3,500 dance videos featuring about 200 subjects).
Our method can also be applied to the visual try-on task to generate garment-consistent human images. During training, we add noise only to the garment region of the human image, as sketched below:
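The following is a rough sketch of this regional noising, together with the regional loss described next, assuming a standard diffusers-style scheduler, an epsilon-prediction U-Net, and a binary garment mask. The names `unet`, `scheduler`, `cond`, and `tryon_training_step` are placeholders, not the released training code.

```python
import torch
import torch.nn.functional as F

def tryon_training_step(unet, scheduler, x0, garment_mask, cond):
    """x0: clean human-image latents (B, C, H, W);
    garment_mask: 1 inside the garment region, 0 elsewhere (B, 1, H, W)."""
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)

    # Add noise only inside the garment region; the rest of the image
    # keeps its clean latents and serves as spatial context.
    noisy = scheduler.add_noise(x0, noise, t)
    x_t = garment_mask * noisy + (1.0 - garment_mask) * x0

    pred = unet(x_t, t, cond).sample

    # Regional loss: supervise the prediction only where noise was added.
    err = F.mse_loss(pred, noise, reduction="none")
    loss = (err * garment_mask).sum() / (garment_mask.sum() * x0.shape[1] + 1e-8)
    return loss
```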
Correspondingly, a regional loss restricted to the garment region is applied to the denoising U-Net's prediction. Results of the model trained on the VTON-HD dataset:
Our proposed method can also be integrated into diffusion Transformer-based models such as SD3 and FLUX to enhance synthesis quality. Here, the reference image is used as additional tokens during training, and the loss is computed only on the noisy tokens. To demonstrate the effectiveness of this design, we present results of FLUX trained on the VTON-HD dataset with the SCD framework:
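A sketch of this token-level conditioning for a generic DiT-style backbone is given below, assuming patchified reference latents are simply appended to the noisy target tokens and an epsilon-prediction objective is used for simplicity (SD3 and FLUX use flow-matching objectives in practice). The callable `dit` and the helper `scd_dit_step` are illustrative assumptions rather than the actual SD3/FLUX training code.

```python
import torch
import torch.nn.functional as F

def scd_dit_step(dit, scheduler, ref_tokens, tgt_tokens, text_emb):
    """ref_tokens: patchified reference-image latents (B, N_ref, D);
    tgt_tokens:  patchified clean target latents      (B, N_tgt, D)."""
    noise = torch.randn_like(tgt_tokens)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (tgt_tokens.shape[0],), device=tgt_tokens.device)
    noisy_tgt = scheduler.add_noise(tgt_tokens, noise, t)

    # Reference tokens are appended as clean, extra tokens;
    # only the target positions carry noise.
    seq = torch.cat([ref_tokens, noisy_tgt], dim=1)
    pred = dit(seq, t, text_emb)          # (B, N_ref + N_tgt, D)

    # Restrict the loss to the noisy (target) tokens.
    pred_tgt = pred[:, ref_tokens.shape[1]:, :]
    return F.mse_loss(pred_tgt, noise)
```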
If you have any comments or questions, please feel free to contact Mingdeng Cao.