[doc] add hunyuandit performance #235

Merged: 1 commit, Aug 27, 2024
4 changes: 3 additions & 1 deletion README.md
```diff
@@ -111,7 +111,9 @@ The overview of xDiT is shown as follows.
 
 1. [Flux Performance](./docs/performance/flux.md)
 
-2. [Pixart-Alpha Legacy Performance](./docs/performance/pixart_alpha_legacy.md)
+2. [HunyuanDiT Performance](./docs/performance/hunyuandit.md)
+
+3. [Pixart-Alpha Legacy Performance](./docs/performance/pixart_alpha_legacy.md)
 
 
 <h2 id="QuickStart">🚀 QuickStart</h2>
```
Binary file added assets/performance/hunuyuandit/L40-HunyuanDiT.png
Binary file added assets/performance/hunuyuandit/T4-HunyuanDiT.png
34 changes: 34 additions & 0 deletions docs/performance/hunyuandit.md
@@ -0,0 +1,34 @@
## HunyuanDiT Performance
[Chinese Version](./hunyuandit_zh.md)

On an 8xA100 (NVLink) machine, the optimal parallelization scheme varies with the number of GPUs used, highlighting the importance of diverse and hybrid parallelism. The best parallel strategies for different GPU scales are as follows: with 2 GPUs, use `ulysses_degree=2`; with 4 GPUs, use `ulysses_degree=2, cfg_parallel=2`; with 8 GPUs, use `pipefusion_parallel=8`.
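The per-GPU-count recommendations above can be sketched as a small lookup helper. This is illustrative only: the key names mirror xDiT-style parallel degrees from the text, and the mapping is just the table reported in this paragraph for 8xA100 (NVLink), not a general tuning rule.

```python
# Illustrative helper: returns the parallel strategy this document reports
# as fastest for HunyuanDiT on an 8xA100 (NVLink) machine, keyed by the
# number of GPUs used. Key names follow the flags quoted in the text.
def best_a100_strategy(num_gpus: int) -> dict:
    strategies = {
        2: {"ulysses_degree": 2},
        4: {"ulysses_degree": 2, "cfg_parallel": 2},
        8: {"pipefusion_parallel": 8},
    }
    if num_gpus not in strategies:
        raise ValueError(f"no measured strategy for {num_gpus} GPUs")
    return strategies[num_gpus]

print(best_a100_strategy(4))  # {'ulysses_degree': 2, 'cfg_parallel': 2}
```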

<div align="center">
<img src="../../assets/performance/hunuyuandit/A100-HunyuanDiT.png"
alt="latency-hunyuandit_a100">
</div>

The latency on 8xL40 (PCIe) is shown in the figure below. Similarly, the optimal parallel strategy differs for different GPU scales.

<div align="center">
<img src="../../assets/performance/hunuyuandit/L40-HunyuanDiT.png"
alt="latency-hunyuandit_l40">
</div>

On both A100 and L40, using `torch.compile` significantly enhances computational performance.
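As a minimal sketch of how `torch.compile` is typically applied (assuming PyTorch >= 2.0): the module is wrapped once and behaves identically to eager mode. The tiny module below is a stand-in for the pipeline's DiT; the `backend="eager"` argument is used here only so the sketch runs without a compiler toolchain, whereas the measurements above would use the default backend.

```python
import torch

# A tiny stand-in module; in practice one would wrap the pipeline's DiT,
# e.g. pipe.transformer = torch.compile(pipe.transformer)  # hypothetical
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyBlock()
# backend="eager" keeps this sketch portable; compilation preserves semantics.
compiled = torch.compile(model, backend="eager")

x = torch.randn(2, 8)
# Eager and compiled outputs match; only the execution speed changes.
assert torch.allclose(model(x), compiled(x), atol=1e-6)
```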

The acceleration on 8xV100 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/V100-HunyuanDiT.png"
alt="latency-hunyuandit_v100">
</div>

The acceleration on 4xT4 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/T4-HunyuanDiT.png"
alt="latency-hunyuandit_t4">
</div>

⚠️ We have not tested `torch.compile` on T4 and V100.
35 changes: 35 additions & 0 deletions docs/performance/hunyuandit_zh.md
@@ -0,0 +1,35 @@
## HunyuanDiT Performance

On an 8xA100 (NVLink) machine, the optimal parallel scheme differs depending on the number of GPUs used, which shows the importance of diverse and hybrid parallelism.
The best parallel strategies at different GPU scales are: with 2 GPUs, use `ulysses_degree=2`; with 4 GPUs, use `ulysses_degree=2, cfg_parallel=2`; with 8 GPUs, use `pipefusion_parallel=8`.


<div align="center">
<img src="../../assets/performance/hunuyuandit/A100-HunyuanDiT.png"
alt="latency-hunyuandit_a100">
</div>

The latency on 8xL40 (PCIe) is shown in the figure below. Likewise, the optimal parallel strategy differs across GPU scales.

<div align="center">
<img src="../../assets/performance/hunuyuandit/L40-HunyuanDiT.png"
alt="latency-hunyuandit_l40">
</div>

On both A100 and L40, using `torch.compile` brings a significant improvement in computational performance.

The speedup on 8xV100 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/V100-HunyuanDiT.png"
alt="latency-hunyuandit_v100">
</div>

The speedup on 4xT4 is shown in the figure below.

<div align="center">
<img src="../../assets/performance/hunuyuandit/T4-HunyuanDiT.png"
alt="latency-hunyuandit_t4">
</div>

⚠️ We have not yet tested the effect of `torch.compile` on V100 and T4.
2 changes: 1 addition & 1 deletion examples/flux_example.py
```diff
@@ -67,7 +67,7 @@ def main():
             image_rank = dp_group_index * dp_batch_size + i
             image.save(f"./results/flux_result_{parallel_info}_{image_rank}.png")
             print(
-                f"image {i} saved to ./results/flux_result_{parallel_info}_{image_rank}.png"
+                f"image {i} saved to ./results/flux_result_{parallel_info}_{image_rank}_tc_{engine_args.use_torch_compile}.png"
             )
 
     if get_world_group().rank == get_world_group().world_size - 1:
```
4 changes: 2 additions & 2 deletions examples/hunyuandit_example.py
```diff
@@ -57,10 +57,10 @@ def main():
         for i, image in enumerate(output.images):
             image_rank = dp_group_index * dp_batch_size + i
             image.save(
-                f"./results/hunyuandit_result_{parallel_info}_{image_rank}.png"
+                f"./results/hunyuandit_result_{parallel_info}_{image_rank}_tc_{engine_args.use_torch_compile}.png"
             )
             print(
-                f"image {i} saved to ./results/hunyuandit_result_{parallel_info}_{image_rank}.png"
+                f"image {i} saved to ./results/hunyuandit_result_{parallel_info}_{image_rank}_tc_{engine_args.use_torch_compile}.png"
             )
 
     if get_world_group().rank == get_world_group().world_size - 1:
```
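The filename change common to both examples can be illustrated in isolation: the `use_torch_compile` flag is embedded in the output name so compiled and eager runs do not overwrite each other's images. This is a sketch; the `parallel_info` value shown is hypothetical.

```python
# Sketch of the new result-filename scheme: the torch.compile flag is
# appended as a "_tc_<bool>" suffix, mirroring the f-strings in the diff.
def result_filename(model: str, parallel_info: str, image_rank: int,
                    use_torch_compile: bool) -> str:
    return (f"./results/{model}_result_{parallel_info}_"
            f"{image_rank}_tc_{use_torch_compile}.png")

# "dp1_cfg2" is a made-up parallel_info string for illustration.
print(result_filename("hunyuandit", "dp1_cfg2", 0, True))
# ./results/hunyuandit_result_dp1_cfg2_0_tc_True.png
```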