Fix contrastive search to correctly handle input with padding #33507
Conversation
@ducviet00 thank you for identifying the underlying numerical issue, proposing a very reasonable fix, and implementing it with a test 💛
I agree with the suggested fix: by masking `cosine_matrix` with a large negative value when the corresponding tokens are masked, `degeneration_penalty` will never be related to the masked tokens and, therefore, masked tokens will not have an impact on `contrastive_score`.

I've added a few minor suggestions.
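For illustration only, here is a minimal standalone sketch of that masking idea; the function name, tensor names, and shapes are assumptions for the example, not the exact code in this PR:

```python
import torch

# context_hidden: (batch * top_k, seq_len, hidden), next_hidden: (batch * top_k, 1, hidden)
# cosine_matrix_mask: (batch * top_k, seq_len), 1 for real tokens and 0 for padding
def masked_degeneration_penalty(context_hidden, next_hidden, cosine_matrix_mask):
    norm_context = context_hidden / context_hidden.norm(dim=2, keepdim=True)
    norm_next = next_hidden / next_hidden.norm(dim=2, keepdim=True)
    # cosine similarity between each candidate token and every context token
    cosine_matrix = torch.matmul(norm_context, norm_next.transpose(1, 2)).squeeze(-1)
    # push padding positions to a large negative value so they can never win the max
    penalty = (1 - cosine_matrix_mask.to(cosine_matrix.dtype)) * torch.finfo(cosine_matrix.dtype).min
    cosine_matrix = cosine_matrix + penalty
    degeneration_penalty, _ = torch.max(cosine_matrix, dim=-1)
    return degeneration_penalty
```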
@gante Thanks for your feedback! I initially thought encoder-decoder models wouldn't be affected. I'll push an update with support for encoder-decoder models based on your suggestions soon.
@gante
It should be:
You can check the code inside
```python
# Create cosine_matrix_mask based on the attention_mask
cosine_matrix_mask = torch.ones_like(input_ids, dtype=torch.long)
if self.config.is_encoder_decoder:
    if "decoder_attention_mask" in model_kwargs and model_kwargs["decoder_attention_mask"] is not None:
        cosine_matrix_mask = model_kwargs["decoder_attention_mask"]
else:
    cosine_matrix_mask = model_kwargs["attention_mask"]
cosine_matrix_mask = cosine_matrix_mask.repeat_interleave(top_k, dim=0)
```
This code initializes a default mask and then updates it based on the model type.

- For encoder-decoder models, if `model_kwargs` contains a `decoder_attention_mask` (and it is not None), `cosine_matrix_mask` is set to this mask. If `decoder_attention_mask` is missing, it falls back to the default mask.
- For decoder-only models, `cosine_matrix_mask` is set to the `attention_mask` from `model_kwargs`.

Please let me know if there are any additional logic checks that need to be added.
Hi @gante
Thank you for iterating 💛
Very impressive change and tests! Thanks for your PR @ducviet00
@ducviet00 thank you for making
…gface#33507)

* fix: handle padding in contrastive search for decoder-only models
* fix: handle padding in contrastive search for encoder-decoder models
* tests: move padding contrastive test to test_util, add t5 test
* fix: handle if model_kwargs["decoder_attention_mask"] is None
* refactor: improve padding input contrastive search generation tests
* chore: _ranking_fast to use LongTensor for cosine_matrix_mask
What does this PR do?
This PR fixes contrastive search to correctly handle input padding in decoder-only & encoder-decoder models.
Details
I encountered the issue and discovered that the Contrastive Search implementation is highly sensitive to padded tokens.
For example, with the prompt `The whispered legends of the haunted mansion spoke`, the output from Hugging Face without padding is:

`The whispered legends of the haunted mansion spoke of the "souls of the dead" who were "falling out of the sky" and "falling into the sea."\n`

However, with padding tokens, the output becomes:

`The whispered legends of the haunted mansion spoke of the "soul of the dead" and the "blood of the dead."\nThe ghost of Dr. H. P. Lovecraft was a man`

You can check the Colab notebook that demonstrates the issue here.
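For reference, a padded-input setup like the one above can be reproduced roughly as follows; this is only a sketch, and the model, padding length, and generation settings are assumptions (`penalty_alpha` together with `top_k` is what switches `generate()` into contrastive search):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: gpt2 with left padding; any decoder-only model shows the same effect.
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The whispered legends of the haunted mansion spoke"
# Pad the prompt to a fixed length so the input contains padding tokens.
inputs = tokenizer(prompt, return_tensors="pt", padding="max_length", max_length=32)

# penalty_alpha + top_k enable contrastive search in generate().
outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```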
How did I fix the issue
The issue arises when `input_ids` contain padded tokens: the `hidden_states` of the model then include embeddings for these padded tokens if the model doesn't eliminate them during processing. The `_ranking_fast` function also calculates values based on these padded tokens, leading to incorrect outputs. This is critical because it significantly degrades the model's performance.

To fix this, I created a `cosine_matrix_mask` based on the `attention_mask` and penalized the `cosine_matrix` using this mask (ignoring padding positions) by applying large negative values. After that, I padded the `cosine_matrix_mask` with ones to match the output length.
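As a rough illustration of that last step, the mask can be extended with ones as new tokens are generated; this is a standalone sketch, and the concrete tensors and the use of `torch.nn.functional.pad` are assumptions rather than the PR's exact code:

```python
import torch
import torch.nn.functional as F

# cosine_matrix_mask starts from the attention mask: 1 for real tokens, 0 for padding.
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])  # one left-padded sequence
top_k = 4
cosine_matrix_mask = attention_mask.repeat_interleave(top_k, dim=0)

# Every newly generated token is a real token, so after each generation step
# the mask is padded with ones to keep matching the output length.
for _ in range(3):  # pretend we generate 3 new tokens
    cosine_matrix_mask = F.pad(cosine_matrix_mask, (0, 1), value=1)

print(cosine_matrix_mask.shape)  # torch.Size([4, 8])
```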
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@gante @ArthurZucker @amyeroberts @Rocketknight1