The official implementation of "Dynamic Diffusion Transformer" (ICLR 2025).
Wangbo Zhao<sup>1</sup>, Yizeng Han<sup>2</sup>, Jiasheng Tang<sup>2,3</sup>, Kai Wang<sup>1</sup>, Yibing Song<sup>2,3</sup>, Gao Huang<sup>4</sup>, Fan Wang<sup>2</sup>, Yang You<sup>1</sup>

<sup>1</sup>National University of Singapore, <sup>2</sup>DAMO Academy, Alibaba Group, <sup>3</sup>Hupan Lab, <sup>4</sup>Tsinghua University
[Video: DiT.vs.DyDiT.mp4, comparing the generation speed of the original DiT and the proposed DyDiT, together with images generated by DyDiT.]
Abstract: Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet.
- 2025.01.23: DyDiT is accepted by ICLR 2025! We will update the code and paper soon.
- 2024.12.19: We release the code for inference.
- 2024.10.04: Our paper is released.
- Release the code for inference.
- Release the code for training.
- Release the code for applying our method to additional models (e.g., U-ViT, SiT).
- Release the code for applying our method to text-to-image and text-to-video generation diffusion models.
(a) The loss difference between DiT-S and DiT-XL across all diffusion timesteps (T = 1000). The difference is slight at most timesteps.
(b) Loss maps (normalized to the range [0, 1]) at different timesteps show that the noise in different patches varies in how difficult it is to predict.
(c) Difference in the inference paradigm between the static DiT and the proposed DyDiT.
Overview of the proposed Dynamic Diffusion Transformer (DyDiT). It reduces the computational redundancy in DiT along both the timestep and spatial dimensions.
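To make the two mechanisms concrete, below is a minimal PyTorch sketch of how a DyDiT-style block could wire TDW and SDT together. This is not the released implementation: the module names, router designs, and hyperparameters are illustrative assumptions, and it zeroes out heads, channel groups, and tokens with masks instead of physically slicing them, so it shows the routing logic but not the actual wall-clock savings.

```python
# Illustrative sketch only (assumed names and hyperparameters), not the released DyDiT code.
import torch
import torch.nn as nn


class TDWRouter(nn.Module):
    """Timestep-wise Dynamic Width: predicts keep-masks for attention heads and MLP channel groups."""

    def __init__(self, t_dim, num_heads, num_groups):
        super().__init__()
        self.head_gate = nn.Linear(t_dim, num_heads)
        self.group_gate = nn.Linear(t_dim, num_groups)

    def forward(self, t_emb):
        # Hard 0/1 masks with a straight-through estimator so the routers stay trainable.
        hl, gl = self.head_gate(t_emb), self.group_gate(t_emb)
        head_mask = (hl > 0).float() + torch.sigmoid(hl) - torch.sigmoid(hl).detach()
        group_mask = (gl > 0).float() + torch.sigmoid(gl) - torch.sigmoid(gl).detach()
        return head_mask, group_mask


class SDTRouter(nn.Module):
    """Spatial-wise Dynamic Token: predicts a per-token keep-mask so easy tokens can skip the MLP."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x):
        probs = torch.sigmoid(self.gate(x).squeeze(-1))          # (B, N)
        return (probs > 0.5).float() + probs - probs.detach()


class DynamicBlock(nn.Module):
    """One transformer block with TDW on attention heads / MLP channels and SDT on the MLP tokens."""

    def __init__(self, dim=384, num_heads=6, mlp_ratio=4, t_dim=384, num_groups=4):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.num_groups, self.hidden = num_groups, dim * mlp_ratio
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv, self.proj = nn.Linear(dim, dim * 3), nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, self.hidden), nn.GELU(), nn.Linear(self.hidden, dim))
        self.tdw, self.sdt = TDWRouter(t_dim, num_heads, num_groups), SDTRouter(dim)

    def forward(self, x, t_emb):
        B, N, D = x.shape
        head_mask, group_mask = self.tdw(t_emb)                   # (B, H), (B, G)

        # Attention with timestep-wise head masking.
        qkv = self.qkv(self.norm1(x)).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = (t.transpose(1, 2) for t in qkv.unbind(dim=2))  # each (B, H, N, d)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v) * head_mask[:, :, None, None]            # zero out deactivated heads
        x = x + self.proj(out.transpose(1, 2).reshape(B, N, D))

        # MLP with spatial-wise token masking and channel-group masking.
        token_mask = self.sdt(x)                                  # (B, N), 1 = keep token
        h = self.mlp[0](self.norm2(x))                            # (B, N, hidden)
        g = group_mask.repeat_interleave(self.hidden // self.num_groups, dim=1)
        h = self.mlp[2](self.mlp[1](h * g[:, None, :]))           # mask hidden channel groups
        return x + h * token_mask[..., None]                      # skipped tokens bypass the MLP


if __name__ == "__main__":
    block = DynamicBlock()
    x, t_emb = torch.randn(2, 16, 384), torch.randn(2, 384)      # (batch, tokens, dim), timestep embedding
    print(block(x, t_emb).shape)                                  # torch.Size([2, 16, 384])
```

In practice the FLOP and latency reductions come from actually skipping the masked computation (gathering only the kept heads, channels, and tokens) rather than multiplying by zero as in this sketch.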
We provide an environment.yml file to help create the Conda environment used in our experiments. Other environments may also work well.
```bash
git clone https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer.git
conda env create -f environment.yml
conda activate DyDiT
```
Currently, we provide a pre-trained checkpoint of DyDiT:
| model | FLOPs (G) | FID | download |
|---|---|---|---|
| DiT | 118.69 | 2.27 | - |
| DyDiT | 84.33 | 2.12 | 🤗 |
| DyDiT | - | - | in progress |
Run `sample_0.7.sh` to sample images and evaluate the performance.

```bash
bash sample_0.7.sh
```
The `sample_ddp.py` script samples 50,000 images in parallel. It generates a folder of samples as well as a .npz file that can be used directly with ADM's TensorFlow evaluation suite to compute FID, Inception Score, and other metrics. Please follow its instructions to download the reference batch VIRTUAL_imagenet256_labeled.npz.
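If you need to assemble such an .npz yourself (for example, from a folder of already-saved PNG samples), a small helper along the following lines can do it. This is an illustrative sketch rather than part of the repository: the folder name, file pattern, and the `arr_0` key are assumptions that mirror the convention the ADM evaluation suite typically reads, so please verify them against its README.

```python
# Hypothetical helper, not part of the repository: pack a folder of generated PNGs
# into a single .npz of uint8 images with shape (N, H, W, 3).
import os
import numpy as np
from PIL import Image


def pack_samples(sample_dir: str, out_path: str) -> None:
    images = []
    for name in sorted(os.listdir(sample_dir)):
        if name.endswith(".png"):
            images.append(np.asarray(Image.open(os.path.join(sample_dir, name)).convert("RGB")))
    arr = np.stack(images)            # (N, H, W, 3), dtype=uint8
    np.savez(out_path, arr_0=arr)     # 'arr_0' is the key conventionally read by the ADM evaluator
    print(f"Saved {arr.shape[0]} samples to {out_path}")


if __name__ == "__main__":
    pack_samples("samples", "samples.npz")   # assumed folder and output names
```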
If you find our work useful, please consider citing us:
```bibtex
@article{zhao2024dynamic,
  title={Dynamic diffusion transformer},
  author={Zhao, Wangbo and Han, Yizeng and Tang, Jiasheng and Wang, Kai and Song, Yibing and Huang, Gao and Wang, Fan and You, Yang},
  journal={arXiv preprint arXiv:2410.03456},
  year={2024}
}
```
If you're interested in collaborating with us, feel free to reach out via email at wangbo.zhao96@gmail.com.