
Question about varlen ring attention #6

Closed
TechxGenus opened this issue Jan 15, 2025 · 6 comments

@TechxGenus

Thanks for sharing this great resource.
I am studying the varlen form of ring attention mentioned in the paper, which does avoid the padding problem in the TE implementation. However, it seems to me that it introduces load-imbalance problems, where some CP ranks may require much more computation than others. How do you handle this problem?

@MiniMax-AI-Dev
Contributor

Yes, we have noticed that load imbalance can occur in Varlen Ring Attention. However, this issue is not specific to Varlen Ring Attention alone but is rather a result of the "data-packing + varlen" approach. Even when ring attention is not used, this approach can lead to load imbalance in data parallelism (DP), where some DP ranks may end up with concatenations of short sequences while others hold complete long sequences. This forces the short-sequence DP ranks to wait during synchronization.

In the case of Varlen Ring Attention, this impact extends to the synchronization communication of context parallelism (CP). To address this problem, it is necessary to avoid mixing long and short sequences within the same micro-batch training process. Theoretically, if needed, one could manually adjust the training order of samples with different sequence lengths within the global batch to prevent load imbalance. However, in practice, this adjustment can be challenging because the total number of tokens in the global batch is fixed. In scenarios with long sequences, the number of samples is very small, leaving little room to adjust the load. Therefore, solving this issue requires collaboration with the data side as well.
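A minimal sketch (not the MiniMax implementation) of why "data-packing + varlen" causes this imbalance: causal-attention compute on a rank scales with the sum of squared lengths of the sequences it holds, so a rank packed with one long sequence does far more work than a rank packed with many short ones, even when both hold the same number of tokens.

```python
def causal_attn_cost(seq_lens):
    # Causal attention over a sequence of length L touches ~L*(L+1)/2 query-key pairs.
    return sum(L * (L + 1) // 2 for L in seq_lens)

# Two hypothetical DP ranks, each packed to 8192 tokens.
rank0 = [8192]       # one long sequence
rank1 = [512] * 16   # sixteen short sequences

print(causal_attn_cost(rank0))  # ~33.6M pair interactions
print(causal_attn_cost(rank1))  # ~2.1M  -> rank1 waits on rank0 at sync
```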

@TechxGenus
Author

Thanks, it does help to unify the distribution of sequence lengths across mini-batches.
However, my question is about the imbalance caused by the causal mask when training on longer sequences, which is a separate issue with the ring attention mechanism (ref: zhuzilin/ring-flash-attention#2) and becomes difficult to handle when combined with packing.
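To make the causal-mask imbalance concrete, here is a rough sketch (assumed setup, not from this repo) of a naive contiguous split of one long sequence across CP ranks: rank r's queries attend to all keys held by ranks 0..r, so later ranks do far more work.

```python
def naive_split_cost(seq_len, cp_size):
    chunk = seq_len // cp_size
    costs = []
    for r in range(cp_size):
        # Queries on rank r occupy positions [r*chunk, (r+1)*chunk).
        # Under the causal mask, position q attends to q+1 keys in total.
        q_lo = r * chunk
        costs.append(sum(q + 1 for q in range(q_lo, q_lo + chunk)))
    return costs

print(naive_split_cost(8192, 4))
# [2098176, 6292480, 10486784, 14681088] -> the last rank does ~7x the work of the first
```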

@MiniMax-AI-Dev
Contributor

In this context, the implementation approach we are referring to is the Zig-Zag method, which is also used in TransformerEngine.

Implementing this method in the context of data-packing is indeed troublesome.

[image: illustration of the Zig-Zag partitioning]
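A minimal sketch of the Zig-Zag assignment (as used in ring-flash-attention and TransformerEngine): split the sequence into 2 * cp_size chunks and give rank r chunk r plus chunk (2 * cp_size - 1 - r), pairing an "early" (cheap) chunk with a "late" (expensive) one so every rank sees a similar causal workload. The function name below is illustrative, not from this repo.

```python
def zigzag_chunks(seq_len, cp_size):
    n_chunks = 2 * cp_size
    chunk = seq_len // n_chunks
    assignment = {}
    for r in range(cp_size):
        # Pair the r-th chunk with its mirror from the end of the sequence.
        first = (r * chunk, (r + 1) * chunk)
        last = ((n_chunks - 1 - r) * chunk, (n_chunks - r) * chunk)
        assignment[r] = [first, last]
    return assignment

print(zigzag_chunks(8192, 4))
# rank 0 gets [0, 1024) and [7168, 8192); rank 3 gets [3072, 4096) and [4096, 5120)
```

With packing (varlen), this split has to be applied per sub-sequence inside each packed batch, which is what makes the combination troublesome.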

@TechxGenus
Author

Got it. Thanks for the detailed answer.

@hhaAndroid

Does implementing this feature require directly modifying the source code in Flash Attention, or can it be achieved by calling internal interfaces? Thank you

@Infi-zc

Infi-zc commented Feb 12, 2025

> Does implementing this feature require directly modifying the source code in Flash Attention, or can it be achieved by calling internal interfaces? Thank you

Hi @hhaAndroid, have you reproduced this scheduling algorithm? How well does it work?
