[doc] add hunyuandit performance #235

Merged: 1 commit, Aug 27, 2024
4 changes: 3 additions & 1 deletion README.md
```diff
@@ -111,7 +111,9 @@ The overview of xDiT is shown as follows.
 
 1. [Flux Performance](./docs/performance/flux.md)
 
-2. [Pixart-Alpha Legacy Performance](./docs/performance/pixart_alpha_legacy.md)
+2. [HunyuanDiT Performance](./docs/performance/hunyuandit.md)
+
+3. [Pixart-Alpha Legacy Performance](./docs/performance/pixart_alpha_legacy.md)
 
 
 <h2 id="QuickStart">🚀 QuickStart</h2>
```
Binary file added assets/performance/hunuyuandit/L40-HunyuanDiT.png
Binary file added assets/performance/hunuyuandit/T4-HunyuanDiT.png
34 changes: 34 additions & 0 deletions docs/performance/hunyuandit.md
@@ -0,0 +1,34 @@
## HunyuanDiT Performance
[Chinese Version](./hunyuandit_zh.md)

On an 8xA100 (NVLink) machine, the optimal parallelization scheme varies with the number of GPUs used, highlighting the importance of diverse and hybrid parallelism. The best parallel strategies for different GPU scales are as follows: with 2 GPUs, use `ulysses_degree=2`; with 4 GPUs, use `ulysses_degree=2, cfg_parallel=2`; with 8 GPUs, use `pipefusion_parallel=8`.
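The per-GPU-count recommendations above can be sketched as a small lookup helper. This is illustrative only: the key names mirror xDiT-style parallel degrees from the text, and the mapping is just the table reported in this paragraph for 8xA100 (NVLink), not a general tuning rule.

```python
# Illustrative helper: returns the parallel strategy this document reports
# as fastest for HunyuanDiT on an 8xA100 (NVLink) machine, keyed by the
# number of GPUs used. Key names follow the flags quoted in the text.
def best_a100_strategy(num_gpus: int) -> dict:
    strategies = {
        2: {"ulysses_degree": 2},
        4: {"ulysses_degree": 2, "cfg_parallel": 2},
        8: {"pipefusion_parallel": 8},
    }
    if num_gpus not in strategies:
        raise ValueError(f"no measured strategy for {num_gpus} GPUs")
    return strategies[num_gpus]

print(best_a100_strategy(4))  # {'ulysses_degree': 2, 'cfg_parallel': 2}
```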

<div align="center">
<img src="../../assets/performance/hunuyuandit/A100-HunyuanDiT.png"
alt="latency-hunyuandit_a100">
</div>

The latency on 8xL40 (PCIe) is shown in the figure below. Similarly, the optimal parallel strategy differs for different GPU scales.

<div align="center">
<img src="../../assets/performance/hunuyuandit/L40-HunyuanDiT.png"
alt="latency-hunyuandit_l40">
</div>

On both A100 and L40, using `torch.compile` significantly enhances computational performance.
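As a minimal sketch of how `torch.compile` is typically applied (assuming PyTorch >= 2.0): the module is wrapped once and behaves identically to eager mode. The tiny module below is a stand-in for the pipeline's DiT; the `backend="eager"` argument is used here only so the sketch runs without a compiler toolchain, whereas the measurements above would use the default backend.

```python
import torch

# A tiny stand-in module; in practice one would wrap the pipeline's DiT,
# e.g. pipe.transformer = torch.compile(pipe.transformer)  # hypothetical
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyBlock()
# backend="eager" keeps this sketch portable; compilation preserves semantics.
compiled = torch.compile(model, backend="eager")

x = torch.randn(2, 8)
# Eager and compiled outputs match; only the execution speed changes.
assert torch.allclose(model(x), compiled(x), atol=1e-6)
```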

The acceleration on 8xV100 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/V100-HunyuanDiT.png"
alt="latency-hunyuandit_v100">
</div>

The acceleration on 4xT4 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/T4-HunyuanDiT.png"
alt="latency-hunyuandit_t4">
</div>

⚠️ We have not tested `torch.compile` on T4 and V100.
35 changes: 35 additions & 0 deletions docs/performance/hunyuandit_zh.md
@@ -0,0 +1,35 @@
## HunyuanDiT Performance

On an 8xA100 (NVLink) machine, the optimal parallel scheme differs depending on the number of GPUs used, which shows the importance of diverse and hybrid parallelism.
The best parallel strategies at different GPU scales are: with 2 GPUs, use `ulysses_degree=2`; with 4 GPUs, use `ulysses_degree=2, cfg_parallel=2`; with 8 GPUs, use `pipefusion_parallel=8`.


<div align="center">
<img src="../../assets/performance/hunuyuandit/A100-HunyuanDiT.png"
alt="latency-hunyuandit_a100">
</div>

The latency on 8xL40 (PCIe) is shown in the figure below. Likewise, the optimal parallel strategy differs across GPU scales.

<div align="center">
<img src="../../assets/performance/hunuyuandit/L40-HunyuanDiT.png"
alt="latency-hunyuandit_l40">
</div>

On both A100 and L40, using `torch.compile` brings a significant improvement in computational performance.

The speedup on 8xV100 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/V100-HunyuanDiT.png"
alt="latency-hunyuandit_v100">
</div>

The speedup on 4xT4 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/T4-HunyuanDiT.png"
alt="latency-hunyuandit_t4">
</div>

⚠️ We have not yet tested the effect of `torch.compile` on V100 and T4.
2 changes: 1 addition & 1 deletion examples/flux_example.py
```diff
@@ -67,7 +67,7 @@ def main():
             image_rank = dp_group_index * dp_batch_size + i
             image.save(f"./results/flux_result_{parallel_info}_{image_rank}.png")
             print(
-                f"image {i} saved to ./results/flux_result_{parallel_info}_{image_rank}.png"
+                f"image {i} saved to ./results/flux_result_{parallel_info}_{image_rank}_tc_{engine_args.use_torch_compile}.png"
             )
 
     if get_world_group().rank == get_world_group().world_size - 1:
```
4 changes: 2 additions & 2 deletions examples/hunyuandit_example.py
```diff
@@ -57,10 +57,10 @@ def main():
         for i, image in enumerate(output.images):
             image_rank = dp_group_index * dp_batch_size + i
             image.save(
-                f"./results/hunyuandit_result_{parallel_info}_{image_rank}.png"
+                f"./results/hunyuandit_result_{parallel_info}_{image_rank}_tc_{engine_args.use_torch_compile}.png"
             )
             print(
-                f"image {i} saved to ./results/hunyuandit_result_{parallel_info}_{image_rank}.png"
+                f"image {i} saved to ./results/hunyuandit_result_{parallel_info}_{image_rank}_tc_{engine_args.use_torch_compile}.png"
             )
 
     if get_world_group().rank == get_world_group().world_size - 1:
```
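The filename change common to both examples can be illustrated in isolation: the `use_torch_compile` flag is embedded in the output name so compiled and eager runs do not overwrite each other's images. This is a sketch; the `parallel_info` value shown is hypothetical.

```python
# Sketch of the new result-filename scheme: the torch.compile flag is
# appended as a "_tc_<bool>" suffix, mirroring the f-strings in the diff.
def result_filename(model: str, parallel_info: str, image_rank: int,
                    use_torch_compile: bool) -> str:
    return (f"./results/{model}_result_{parallel_info}_"
            f"{image_rank}_tc_{use_torch_compile}.png")

# "dp1_cfg2" is a made-up parallel_info string for illustration.
print(result_filename("hunyuandit", "dp1_cfg2", 0, True))
# ./results/hunyuandit_result_dp1_cfg2_0_tc_True.png
```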