I would like to ask why bidirectional (two-way) self-attention is necessary here. As far as I know, in the Transformer the encoder's MHSA attends in both directions (as in BERT), while the masked MHSA in the decoder masks out future positions so that each position only attends to past information (as in GPT). Also, in your code the input to the Transformer is not transposed, whereas the standard practice is to permute [batch, channel, length] to [batch, length, channel].
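To make the two points concrete, here is a minimal PyTorch sketch (not the repository's actual code; shapes and hyperparameters are illustrative assumptions). It shows the [batch, channel, length] → [batch, length, channel] permute, and the difference between bidirectional (encoder/BERT-style) and causal (decoder/GPT-style) self-attention via an attention mask:

```python
import torch
import torch.nn as nn

# Illustrative shapes (assumptions, not taken from the repo)
batch, channel, length = 8, 64, 100

x = torch.randn(batch, channel, length)   # conv-style layout: [B, C, L]
x = x.permute(0, 2, 1)                    # Transformer layout: [B, L, C]

layer = nn.TransformerEncoderLayer(d_model=channel, nhead=4, batch_first=True)

# Bidirectional (BERT-style) self-attention: every position attends to all
# positions, so no attention mask is supplied.
out_bidirectional = layer(x)

# Causal (GPT-style) self-attention: an upper-triangular mask hides future
# positions, so each position attends only to itself and the past.
causal_mask = nn.Transformer.generate_square_subsequent_mask(length)
out_causal = layer(x, src_mask=causal_mask)

print(out_bidirectional.shape, out_causal.shape)  # both [B, L, C]
```

With `batch_first=True`, the layer expects [batch, length, channel], which is why the permute is needed when the upstream features come out of a 1-D convolution in [batch, channel, length] order.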