[AttentionMaskConverter] Fix-mask-inf #27114
Conversation
The documentation is not available anymore as the PR was closed or merged.
Looks good to me, thanks for the offline explanation
Hello! We pulled this PR from the main branch yesterday. We were having NaN issues with ppo_trainer and llama2-7b-chat. After investigation, we found that the NaN can be reproduced just by generating the 1st token for a batch of 4 sentences, and it depends on how we form the batch (i.e. the sentences that fail depend on the size of the batch and which sentences we include in it). Including different sentences in the batch changes the padding structure, and it seems that the moment you get padding, your risk of NaN increases. Nevertheless, we have also seen NaN with batch=1 (no padding) and float16, so padding does not seem to be the only root of the problem. We have observed that the NaN appears in the 31st layer and subsequently in the logits, not in earlier layers. The input_ids and attention mask that `generate` gets seem correct. The example code uses bfloat16 because it seems to alleviate the issue, which is more frequent with float16.
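A rough sketch of that kind of reproduction (the model id, prompts, and pad-token choice below are illustrative, not the exact failing inputs):

```python
# Hypothetical reproduction sketch: compute logits for the first generated token of a
# padded batch in half precision and check for NaNs. Model id and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumption: same family as in the report
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # llama2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

sentences = [
    "Hello, how are you?",
    "A much longer sentence that forces padding for the shorter ones in the batch.",
]
inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits  # logits used to pick the 1st generated token
print("NaNs in logits:", torch.isnan(logits).any().item())
```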
Hi @toritospartan -- any chance you could reproduce the issue with an open-access model OR privately share your model with us? Otherwise, it will be very challenging for us to nail the cause :)
We have been able to reproduce the issue with just public data. The reason seems to be that, due to the lack of a padding token in llama2, we added our own pad token (we added 128 tokens to keep the model efficient, as the warning suggested), thinking this token would be ignored anyway. However, this seems to produce those NaN on some occasions. We checked the range of the embedding of token 0 (maybe this is the pad token Meta used, even if it is not clear in their code or in the export script from HF?). The std of this embedding is 0, with mean 0. Our pad token embedding had the std of the `_init_weights` of the Transformer model (this is expected). The thing is that it is this range that seems to make llama overflow. We have written a script that makes this happen very often by generating that weight with an exaggerated std. Clear advice on how to manage this situation would make people less confused, because a further question is: are we creating a bias in the model depending on which pad token we use?
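For reference, a small sketch of the kind of check described above (the model id, the `"<pad>"` token string, and the resize step are assumptions):

```python
# Sketch: compare the statistics of an existing embedding row (e.g. token 0) with the
# row of a newly added pad token. Model id and the "<pad>" token are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

tokenizer.add_special_tokens({"pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))  # the report padded the vocab by 128 tokens

emb = model.get_input_embeddings().weight
print("token 0  : mean", emb[0].mean().item(), "std", emb[0].std().item())
pad_row = emb[tokenizer.pad_token_id]
print("pad token: mean", pad_row.mean().item(), "std", pad_row.std().item())
```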
(cc @ArthurZucker as I have no idea how adding extra tokens works internally :D)
* fix?
* actual fix
* fixups
* add dataclass to the attention mask converter
* refine testing suite
* make sure there are no overflows
* update the test
Hey! Thanks both, when adding a new token it is recommended to initialize its embedding to the average of all the embeddings in the embedding layer! This explains it best: https://nlp.stanford.edu/~johnhew/vocab-expansion.html.
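A minimal sketch of that recommendation, meant to run right after `tokenizer.add_special_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))` (the helper name is mine, and it uses a plain mean rather than the sampled variant from the linked post):

```python
# Sketch: set the rows of newly added tokens to the mean of the pre-existing rows,
# for the input embeddings and (when present) the output projection.
import torch

def init_new_token_embeddings_to_mean(model, num_new_tokens: int):
    with torch.no_grad():
        input_emb = model.get_input_embeddings().weight
        input_emb[-num_new_tokens:] = input_emb[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_emb = model.get_output_embeddings()
        if output_emb is not None:  # some models tie or drop the output projection
            out_w = output_emb.weight
            out_w[-num_new_tokens:] = out_w[:-num_new_tokens].mean(dim=0, keepdim=True)
```

With an init like this, the added pad token stays in the same range as the rest of the embedding matrix, which is exactly the range concern raised above.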
@toritospartan, the LLaMA models are unaffected by this PR as they do masking by hand instead of relying on `AttentionMaskConverter`.

As an alternative solution you can use while waiting for existing models to be fixed, I'd suggest adding the following after line 425 of modeling_llama.py (before the masking): `attn_weights = attn_weights - attn_weights.max(dim=-1, keepdim=True)[0]`
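For context, a sketch of where that suggested line sits relative to the masking and the softmax (shapes and names are illustrative, not the actual modeling_llama code):

```python
# Illustrative attention step in fp16; the suggested stabilization shifts each row
# so its maximum is 0 before the additive mask is applied.
import torch

bsz, num_heads, q_len, kv_len, head_dim = 2, 4, 8, 8, 16
query = torch.randn(bsz, num_heads, q_len, head_dim, dtype=torch.float16)
key = torch.randn(bsz, num_heads, kv_len, head_dim, dtype=torch.float16)
attention_mask = torch.zeros(bsz, 1, q_len, kv_len, dtype=torch.float16)
attention_mask[..., :2] = torch.finfo(torch.float16).min  # pretend the first 2 keys are padding

attn_weights = torch.matmul(query, key.transpose(2, 3)) / (head_dim ** 0.5)
attn_weights = attn_weights - attn_weights.max(dim=-1, keepdim=True)[0]  # the suggested line
attn_weights = attn_weights + attention_mask  # the masking mentioned above
attn_weights = torch.nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
```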
Hi @artsobolev, since the code for modeling_llama has changed and I'm not sure exactly which line you're referring to, I've put it in at the `if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):` check.
Llama, Mistral and so on do use `AttentionMaskConverter`.
What does this PR do?
Fixes the `-inf` appearing in the padding mask from the way we create them. Adds the `@dataclass` decorator to `AttentionMaskConverter`, as well as an example in the doc. The pad tokens are still attended to in some specific cases, which produces different outputs for flash attention / non flash attention. Might fix #27050, but also related to other Llama issues.
FYI @gante for visibility 😉
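For reference, a rough sketch of how the converter is used after this change (the call follows my reading of the `to_4d` docstring example; argument names may differ slightly between versions):

```python
# Sketch: build a 4D causal mask from a 2D padding mask with AttentionMaskConverter.
# Keyword names are an assumption based on the to_4d docstring example.
import torch
from transformers.modeling_attn_mask_utils import AttentionMaskConverter

converter = AttentionMaskConverter(is_causal=True)
mask_2d = torch.tensor([[0, 0, 0, 1, 1]])  # 1 = attend, 0 = padding
mask_4d = converter.to_4d(mask_2d, query_length=5, key_value_length=5, dtype=torch.float32)
print(mask_4d.shape)  # (1, 1, 5, 5); masked positions hold torch.finfo(torch.float32).min, not -inf
```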
Basically, instead of adding the two masks together, we just masked-fill the second with the first. This way we are sure that the mask does not overflow.

Before / after: see the code snippets in the PR diff.
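A minimal sketch of the idea, assuming two additive float masks of broadcastable shapes (not the exact code from the diff):

```python
# Sketch of the fix described above: instead of adding the expanded padding mask to the
# causal mask (min + min overflows to -inf in fp16), masked-fill the causal mask with the
# dtype minimum wherever the padding mask blocks a position.
import torch

dtype = torch.float16
min_value = torch.finfo(dtype).min

causal_mask = torch.triu(torch.full((1, 1, 4, 4), min_value, dtype=dtype), diagonal=1)
padding_mask_2d = torch.tensor([[0, 1, 1, 1]])           # 0 = padding
padding_bool = padding_mask_2d[:, None, None, :].eq(0)   # True where the key is padded

# before: causal_mask + padding_bool.to(dtype) * min_value  -> -inf where both masks hit
# after:  every masked entry stays at exactly the dtype minimum
combined = causal_mask.masked_fill(padding_bool, min_value)
print(combined)
```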