
Inconsistent SequenceClassification Behavior for Padding Tokens #27502

Closed
Ali1858 opened this issue Nov 15, 2023 · 5 comments

Comments

Ali1858 commented Nov 15, 2023

System Info

[2023-11-15 00:56:45,576] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)


  • transformers version: 4.31.0
  • Platform: Linux-5.15.0-1046-kvm-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.2
  • Accelerate version: 0.21.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 16, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am encountering an issue with the Llama-2 model. I trained a 4-bit LoRA sequence classification model using Llama-2 with padding_side="right" (the default value), and during inference I noticed that the model produces inconsistent logits for the same input text when the number of padding tokens varies. I suspect the attention mask is not working.

Here's the specific scenario:

I have a model that takes input text sequences with corresponding attention masks. The attention mask is set to 1 for content tokens and 0 for padding tokens, so the model should ignore padding tokens when calculating logits.
However, when I provide the same input text with different numbers of padding tokens, the model produces different logits, which is unexpected.

Example:

Input 1 (fewer padding tokens):
Input Text: "Hi how are you?"
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Input 2 (more padding tokens; the same text batched with longer texts):
Input Text: "Hi how are you?", "Additional text", "More text", "Even more text"
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

Logits for Input 1: [-2.3750]
Logits for Input 2: [-1.7344]

In the example above, Input 1 and Input 2 contain the same text, only with different numbers of padding tokens. However, the model produces different logits for the two inputs, which should not be the case.

I have verified that the attention mask is correctly set to 1 for content tokens and 0 for padding tokens in both cases, so the model should ignore the padding tokens when calculating logits.
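
For reference, a minimal sketch of this comparison (reusing the tokenizer and base_reward_model objects from the snippets below; the pad-token line is an assumption, since Llama ships without one):

import torch

# Hypothetical repro: the same text padded to two different lengths.
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the pad token

text = "Hi how are you?"
short = tokenizer(text, padding="max_length", max_length=44, return_tensors="pt")
longer = tokenizer(text, padding="max_length", max_length=48, return_tensors="pt")

with torch.no_grad():
    logits_short = base_reward_model(**short).logits
    logits_longer = base_reward_model(**longer).logits

# Expected: identical values. Observed: they differ.
print(logits_short, logits_longer)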

I would appreciate any guidance or assistance in understanding and resolving this problem.

This is how I load the model:

import torch
import transformers
from peft import PeftModel
from transformers import BitsAndBytesConfig

model_args = {
    "torch_dtype": torch.bfloat16,
    "quantization_config": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    ),
    "cache_dir": "cache",
    "device_map": "auto",
}

# Since the reward models are trained from the same base model, we reuse one base model
base_reward_model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, **model_args
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, cache_dir="cache")

base_reward_model = PeftModel.from_pretrained(
    base_reward_model,
    ranking_adapter_name2,
    adapter_name="rank_ep1",
    is_trainable=False,
)
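
One detail worth double-checking in this setup: LlamaForSequenceClassification locates the last non-padding token of each sequence via config.pad_token_id before feeding its hidden state to the classification head, and Llama defines no pad token by default. A sketch of the usual fix (assumption: the pad token must match whatever was used during training):

# The classification head needs config.pad_token_id to find the last real
# token of each right-padded sequence; Llama ships without a pad token.
tokenizer.pad_token = tokenizer.eos_token
base_reward_model.config.pad_token_id = tokenizer.pad_token_id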

This is how I get predictions during training and during inference:

## During training
logits = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    use_cache=False,
).logits
# loss_fct and cu_lens are defined in the surrounding trainer code
loss = self.loss_fct(logits, cu_lens)

## During inference
base_reward_model.eval()
base_reward_model.set_adapter(adapter_name)
logits = base_reward_model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    use_cache=False,
).logits
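
(As a side note, the inference forward pass can also be wrapped in torch.no_grad() to skip gradient bookkeeping; a minimal sketch:)

# No gradients are needed at inference time.
with torch.no_grad():
    logits = base_reward_model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        use_cache=False,
    ).logits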

Expected behavior

When using attention masks and padding tokens, I expect the model to produce consistent logits for the same input text, regardless of the number of padding tokens. Since the attention mask is set to 1 for content tokens and 0 for padding tokens, the model should ignore padding tokens when calculating logits, and its output should not depend on how much padding is present.


Ali1858 commented Nov 15, 2023

I tested the model with bfloat16, 4-bit (nf4), and the original precision (float16?); for all of these data types the predicted logits are inconsistent. Is there a way I can avoid this?

I have seen this issue with causal LM as well, but with sequence classification it is even more noticeable and significantly hurts accuracy. Moreover, it creates instability during RLHF, because the reward signal is not consistent.


gante commented Nov 17, 2023

Hi @Ali1858 👋

Long story short, there is no way to avoid this effect: it is a matter of numerical precision and of the shape-dependent order of operations in matmuls. You can read more about it in this comment or in this twitter thread :)

The comment is mostly about KV caches, but the same applies whenever the shape of the input is modified (e.g. by adding more padding).
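
A self-contained sketch of the underlying effect -- floating-point addition is not associative, so a shape-dependent reduction order changes results:

import torch

# Grouping changes the result: floating-point addition is not associative.
a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

print((a + b) + c)  # tensor(1.)
print(a + (b + c))  # tensor(0.) -- 1.0 is lost below float32 resolution at 1e8

# A matmul over a differently padded batch may reduce its terms in a different
# order, producing tiny differences that later layers can amplify.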


Ali1858 commented Nov 17, 2023

Hi @gante
Thanks for your response and the explanation. I see this is a common problem. One thing I would like to point out is that I trained the sequence classification model with padding_side="right" (the default value), not "left", and I am also using padding_side="right" at inference. I have also tested the model with bfloat16, 4-bit (nf4), and the original precision (float16?); the predicted logits are inconsistent for all of these data types.

Is there something I can do, such as retraining the model differently, to minimize this inconsistency?


Ali1858 commented Nov 17, 2023

Hi @gante

I tried running inference with padding_side="left", and the predictions are less inconsistent than with padding_side="right". They are still not identical, but the values are not as far off. Should I retrain the model with padding_side="left"?
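
(For completeness, the padding side can be switched directly on the tokenizer; a sketch of the comparison, reusing the names from the earlier snippets:)

# Rerun the same comparison with both padding sides.
tokenizer.padding_side = "left"
left_inputs = tokenizer(text, padding="max_length", max_length=48, return_tensors="pt")

tokenizer.padding_side = "right"
right_inputs = tokenizer(text, padding="max_length", max_length=48, return_tensors="pt")

with torch.no_grad():
    print(base_reward_model(**left_inputs).logits)
    print(base_reward_model(**right_inputs).logits)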


gante commented Nov 21, 2023

@Ali1858 All tasks should train with right-padding, as you can see in our reference scripts and examples. Sequence classification should do its inference with right-padding, yes -- only generative models should use left-padding at inference time (see here why).

Changing the model's variable type will have a major effect on the model's predictions, so it's natural that the results are not consistent across types.

I'm afraid the issues you're describing are not bugs, but rather modeling challenges. Following our issues guidelines, we reserve GitHub issues for bugs in the repository and/or feature requests. For any other matters, we'd like to invite you to use our forum or our discord 🤗

Ali1858 closed this as completed Nov 24, 2023