Inconsistent SequenceClassification Behavior for Padding Tokens #27502
Comments
I tested the model with bfloat16, 4-bit (nf4), and the original precision (float16?), and for all of these data types the predicted logits are inconsistent. Is there a way I can avoid this? I have seen this issue with …
Hi @Ali1858 👋 Long story short, there is no way to avoid this effect; it is a matter of numerical precision and shape-dependent matmul order of operations. You can read more about it in this comment or in this Twitter thread :) The comment is mostly about KV caches, but it applies whenever we modify the shape of the input (e.g. add more padding).
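As a toy illustration of the precision point (hypothetical, not from the thread): in float16, the same values combined in a different order need not give the same result, and changing the amount of padding changes tensor shapes, which can change the order in which matmul kernels reduce.

```python
import torch

# Floating-point addition is not associative: the same values combined in a
# different order can round differently in half precision.
a = torch.tensor(1e4, dtype=torch.float16)
b = torch.tensor(-1e4, dtype=torch.float16)
c = torch.tensor(1e-2, dtype=torch.float16)

print((a + b) + c)  # tensor(0.0100, dtype=torch.float16)
print(a + (b + c))  # tensor(0., dtype=torch.float16) -- b + c rounds back to -1e4

# Changing the padding length changes the matmul shapes inside the model,
# which can change the reduction order and hence the last few bits of the logits.
```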
Hi @gante Is there something I can do, such as retraining the model, to minimize this inconsistency?
Hi @gante I tried running inference with padding_side="left" and the predictions are less inconsistent than with padding_side="right". They are still not identical, but the values are not way off. Should I retrain the model with padding_side="left"?
@Ali1858 All tasks should train with right-padding, as you can see in our references and examples. Sequence classification should do its inference with right-padding, yes -- only generative models should use left-padding at inference time (see here why). Changing the model's data type will have a major effect on the model predictions, so it's natural that the results are not consistent across types.
I'm afraid the issues you're describing are not bugs, but rather modeling challenges. Following our issues guidelines, we reserve GitHub issues for bugs in the repository and/or feature requests. For any other matters, we'd like to invite you to use our forum or our Discord 🤗
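For reference, a minimal sketch of where the padding side is configured on the tokenizer (the base checkpoint is assumed for illustration; Llama-2 ships without a pad token, so one is assigned here):

```python
from transformers import AutoTokenizer

# Assumed base checkpoint for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default

# Sequence classification: right-padding for both training and inference.
tokenizer.padding_side = "right"
# Generative models would instead use left-padding at inference time:
# tokenizer.padding_side = "left"

batch = tokenizer(
    ["short example", "a noticeably longer second example"],
    padding=True,
    return_tensors="pt",
)
```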
System Info
[2023-11-15 00:56:45,576] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
transformers version: 4.31.0
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 16, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@ArthurZucker @gante
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am encountering an issue with the Llama-2 model. I trained a 4-bit LoRA sequence classification model using Llama-2 with padding_side="right" (the default), and during inference I noticed that the model produces inconsistent logits for the same input text when the number of padding tokens varies. I suspect the attention mask is not working.
Here's the specific scenario:
I have a model that takes input text sequences with corresponding attention masks. The attention mask is correctly set to 1 for content tokens and 0 for padding tokens to ensure that the model ignores padding tokens when calculating logits.
However, when I provide the same input text with different numbers of padding tokens, the model gives different logits, which is unexpected.
Example:
Input 1 (fewer padding tokens): logits [-2.3750]
Input 2 (more padding tokens): logits [-1.7344]
In the example above, Input 1 and Input 2 have the same content text with different numbers of padding tokens. However, the model produces different logits for these inputs, which should not be the case.
I have verified that the attention mask is correctly set to 1 for content tokens and 0 for padding tokens in both cases, so the model should ignore the padding tokens when calculating logits.
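A minimal way to reproduce the symptom, assuming `model` and `tokenizer` are the fine-tuned classifier and its tokenizer (the text and the padded lengths are placeholders):

```python
import torch

text = "some input text"  # placeholder

# Same text, padded to two different lengths; the attention mask marks the
# extra positions as 0 in both cases.
short = tokenizer(text, padding="max_length", max_length=32, return_tensors="pt")
long = tokenizer(text, padding="max_length", max_length=128, return_tensors="pt")

with torch.no_grad():
    logits_short = model(**short.to(model.device)).logits
    logits_long = model(**long.to(model.device)).logits

# Expected to be (nearly) identical; in practice they can differ slightly in
# reduced precision, and differ more than expected in the case reported here.
print(logits_short, logits_long)
```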
I would appreciate any guidance or assistance in understanding and resolving this problem.
This is how I load the model:
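A minimal sketch of such a setup (the base checkpoint, adapter path, and num_labels=1 are assumptions for illustration, not the exact code from this issue):

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import PeftModel

base_model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
adapter_path = "path/to/lora-adapter"       # placeholder LoRA adapter

# 4-bit nf4 quantization, matching the setup described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=1,                      # single-value logits, as in the example above
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
```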
This is how I get predictions during training and during inference:
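Again as a sketch rather than the exact code from the issue, predictions could be obtained along these lines (max_length is a placeholder):

```python
import torch

def predict(texts, model, tokenizer, max_length=512):
    # Right-padded batch; the attention mask is 1 for content tokens and 0
    # for padding tokens.
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        logits = model(**enc).logits  # shape: (batch_size, num_labels)
    return logits

# Usage:
# scores = predict(["some input text"], model, tokenizer)
```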
Expected behavior
When using attention masks and padding tokens, I expect the model to produce consistent logits for the same input text, regardless of the number of padding tokens. The attention mask is correctly set to 1 for content tokens and 0 for padding tokens to ensure that the model ignores padding tokens when calculating logits. Therefore, the model should not be affected by the presence or absence of padding tokens.