Inconsistent SequenceClassification Behavior for Padding Tokens #27502
Comments
I tested the model with bfloat16, 4-bit (nf4), and the original precision (float16?), and for all of these data types the predicted logits are inconsistent. Is there a way I can avoid this? I have seen this issue with …
Hi @Ali1858 👋 Long story short, there is no way to avoid this effect; it is a matter of numerical precision and shape-dependent matmul order of operations. You can read more about it in this comment or in this Twitter thread :) The comment is mostly about KV caches, but it applies whenever we modify the shape of the input (e.g. add more padding).
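As a toy illustration of the precision point (hypothetical, not from the thread): in float16, the same values combined in a different order need not give the same result, and changing the amount of padding changes tensor shapes, which can change the order in which matmul kernels reduce.

```python
import torch

# Floating-point addition is not associative: the same values combined in a
# different order can round differently in half precision.
a = torch.tensor(1e4, dtype=torch.float16)
b = torch.tensor(-1e4, dtype=torch.float16)
c = torch.tensor(1e-2, dtype=torch.float16)

print((a + b) + c)  # tensor(0.0100, dtype=torch.float16)
print(a + (b + c))  # tensor(0., dtype=torch.float16) -- b + c rounds back to -1e4

# Changing the padding length changes the matmul shapes inside the model,
# which can change the reduction order and hence the last few bits of the logits.
```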
Hi @gante Is there something I can do, such as retraining the model, to minimize this inconsistency?
Hi @gante I tried running inference with padding_side="left" and the predictions are less inconsistent than with padding_side="right". They are still not identical, but the values are not way off. Should I retrain the model with padding_side="left"?
@Ali1858 All tasks should train with right-padding, as you can see in our references and examples. Sequence classification should do its inference with right-padding, yes -- only generative models should use left-padding at inference time (see here why). Changing the model's data type will have a major effect on the model predictions, so it's natural that the results are not consistent across types.
I'm afraid the issues you're describing are not bugs, but rather modeling challenges. Following our issues guidelines, we reserve GitHub issues for bugs in the repository and/or feature requests. For any other matters, we'd like to invite you to use our forum or our Discord 🤗
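For reference, a minimal sketch of where the padding side is configured on the tokenizer (the base checkpoint is assumed for illustration; Llama-2 ships without a pad token, so one is assigned here):

```python
from transformers import AutoTokenizer

# Assumed base checkpoint for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default

# Sequence classification: right-padding for both training and inference.
tokenizer.padding_side = "right"
# Generative models would instead use left-padding at inference time:
# tokenizer.padding_side = "left"

batch = tokenizer(
    ["short example", "a noticeably longer second example"],
    padding=True,
    return_tensors="pt",
)
```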
System Info
[2023-11-15 00:56:45,576] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
transformers version: 4.31.0
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 16, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@ArthurZucker @gante
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am encountering an issue with the Llama-2 model. I trained a 4-bit LoRA sequence classification model using Llama-2 with padding_side="right" (the default), and during inference I noticed that the model produces inconsistent logits for the same input text when the number of padding tokens varies. I suspect the attention mask is not working.
Here's the specific scenario:
I have a model that takes input text sequences with corresponding attention masks. The attention mask is correctly set to 1 for content tokens and 0 for padding tokens to ensure that the model ignores padding tokens when calculating logits.
However, when I provide the same input text with different numbers of padding tokens, the model gives different logits, which is unexpected.
Example:
Input 1 (fewer padding tokens): logits [-2.3750]
Input 2 (more padding tokens): logits [-1.7344]
In the example above, Input 1 and Input 2 have the same content text with different numbers of padding tokens. However, the model produces different logits for these inputs, which should not be the case.
I have verified that the attention mask is correctly set to 1 for content tokens and 0 for padding tokens in both cases, so the model should ignore the padding tokens when calculating logits.
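A minimal way to reproduce the symptom, assuming `model` and `tokenizer` are the fine-tuned classifier and its tokenizer (the text and the padded lengths are placeholders):

```python
import torch

text = "some input text"  # placeholder

# Same text, padded to two different lengths; the attention mask marks the
# extra positions as 0 in both cases.
short = tokenizer(text, padding="max_length", max_length=32, return_tensors="pt")
long = tokenizer(text, padding="max_length", max_length=128, return_tensors="pt")

with torch.no_grad():
    logits_short = model(**short.to(model.device)).logits
    logits_long = model(**long.to(model.device)).logits

# Expected to be (nearly) identical; in practice they can differ slightly in
# reduced precision, and differ more than expected in the case reported here.
print(logits_short, logits_long)
```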
I would appreciate any guidance or assistance in understanding and resolving this problem.
This is how I load the model:
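A minimal sketch of such a setup (the base checkpoint, adapter path, and num_labels=1 are assumptions for illustration, not the exact code from this issue):

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import PeftModel

base_model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
adapter_path = "path/to/lora-adapter"       # placeholder LoRA adapter

# 4-bit nf4 quantization, matching the setup described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=1,                      # single-value logits, as in the example above
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
```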
This is how I get predictions during training and during inference:
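Again as a sketch rather than the exact code from the issue, predictions could be obtained along these lines (max_length is a placeholder):

```python
import torch

def predict(texts, model, tokenizer, max_length=512):
    # Right-padded batch; the attention mask is 1 for content tokens and 0
    # for padding tokens.
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        logits = model(**enc).logits  # shape: (batch_size, num_labels)
    return logits

# Usage:
# scores = predict(["some input text"], model, tokenizer)
```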
Expected behavior
When using attention masks and padding tokens, I expect the model to produce consistent logits for the same input text, regardless of the number of padding tokens. The attention mask is correctly set to 1 for content tokens and 0 for padding tokens to ensure that the model ignores padding tokens when calculating logits. Therefore, the model should not be affected by the presence or absence of padding tokens.