
Only the last element has the expected output when doing batch inference #32848

Closed

HuangBugWei opened this issue Aug 16, 2024 · 5 comments

@HuangBugWei

System Info

  • transformers version: 4.44.0
  • Platform: Linux-4.18.0-372.32.1.0.1.el8_6.x86_64-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.3
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 3090

Who can help?

@ArthurZucker
@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-9b-it"
tokenizer_name = name
llm_model_name = name
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
llm_model = AutoModelForCausalLM.from_pretrained(
        llm_model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        # attn_implementation="flash_attention_2",
    )
llm_model.eval()

def chatWithLLM(model: AutoModelForCausalLM, tokenizer: AutoTokenizer):
    messages = [[
        {"role": "user", "content": "laugh " * (idx + 1) + " How many laugh are there?"},
    ] for idx in range(5)]
    input_ids = tokenizer.apply_chat_template(
        messages, 
        padding=True, 
        add_generation_prompt=True, 
        return_tensors="pt", 
        return_dict=True
    ).to(model.device)
    
    outputs = model.generate(
        **input_ids,
        negative_prompt_attention_mask=input_ids["attention_mask"],
        do_sample=False,
        max_new_tokens=500,
        temperature=0.1,
    )

    # ugly code to strip the input prompt from the decoded output, but probably not related to the bug
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    input_msg = tokenizer.batch_decode(input_ids["input_ids"], skip_special_tokens=True)
    for idx, im in enumerate(input_msg):
        response[idx] = response[idx][len(im):]
    
    return response

print(chatWithLLM(model=llm_model, tokenizer=tokenizer))
# ['', '', '', '', 'There are **5** laughs.  😄 \n']

Expected behavior

['There are 1 laughs. 😄 \n', 'There are 2 laughs. 😄 \n', 'There are 3 laughs. 😄 \n', 'There are 4 laughs. 😄 \n', 'There are 5 laughs. 😄 \n']

It is probably not an issue with apply_chat_template, since using

messages = []
for idx in range(5):
    messages.append("laugh " * (idx + 1) + " How many laugh are there?")
print(messages)
input_ids = tokenizer(
    messages, 
    padding=True, 
    return_tensors="pt", 
).to(model.device)

to create the batched messages also reproduces the issue.
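For completeness, a minimal sketch of the corresponding generate call (assuming this snippet replaces the apply_chat_template call inside chatWithLLM above, so model and tokenizer are in scope):

outputs = model.generate(
    **input_ids,  # input_ids and attention_mask from the tokenizer call above
    do_sample=False,
    max_new_tokens=500,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))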

@gante
Member

gante commented Aug 16, 2024

@HuangBugWei Thank you for opening this issue 🤗

It doesn't seem to be a bug, but rather an undesired output of the model for this prompt. The script you provided is the intended usage -- the only bit missing is setting the padding side when initializing the tokenizer, which improves generation quality with padded inputs, although it is still not enough in this case. See this doc for more info on the padding side.

Consider the script below, adapted from yours. If we use Llama 3.1, it gives the answer we expect :) I suggest playing around with the prompt if you want to use Gemma 2.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer_name = name
llm_model_name = name
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
llm_model = AutoModelForCausalLM.from_pretrained(
        llm_model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
llm_model.eval()

def chatWithLLM(model: AutoModelForCausalLM, tokenizer: AutoTokenizer):
    messages = [[
        {"role": "user", "content": "laugh " * (idx + 1) + " How many laugh are there?"},
    ] for idx in range(5)]
    input_ids = tokenizer.apply_chat_template(
        messages,
        padding=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True
    ).to(model.device)

    outputs = model.generate(
        **input_ids,
        do_sample=False,
        max_new_tokens=500,
    )

    # decode and print the full outputs (prompt + completion) for inspection
    print(outputs)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(response)

chatWithLLM(model=llm_model, tokenizer=tokenizer)

@HuangBugWei
Author

@gante Thanks for your reply.
But I'm sure that the default Gemma 2 tokenizer pads on the left side.
Below is what the attention mask looks like.
It's a bit strange, haha, since the exact same code snippet works fine with the Llama 3.1 8B Instruct model.

'attention_mask': tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}
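A quick way to double-check this is to inspect the tokenizer directly; a minimal sketch, assuming the same tokenizer object as in the reproduction script:

print(tokenizer.padding_side)  # reported to be "left" by default for the Gemma 2 tokenizer
print(tokenizer.pad_token)     # the token used to pad the shorter sequences in the batch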

@gante
Member

gante commented Aug 16, 2024

But I'm sure that the default Gemma 2 tokenizer pads on the left side.

Hehe, you're right! It's a reflex on my end; most models don't pad on the left by default 🤗

@HuangBugWei
Author

OK, I found that it might be an issue with the current attention implementation.
In transformers 4.44.0, everything works fine if you explicitly set attn_implementation="eager".
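In other words, the workaround is to load the model with eager attention; a minimal sketch based on the reproduction script above:

llm_model = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # work around the misbehaving default attention backend
)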

@gante
Member

gante commented Aug 17, 2024

In transformers 4.44.0, everything works fine if you explicitly set attn_implementation="eager".

Yeah, Gemma 2 doesn't work well with SDPA. Let me fix the default.
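For anyone debugging this, one way to see which attention backend a loaded model ended up with is to look at its config; note that _attn_implementation is an internal attribute, so treat this as a debugging sketch only:

print(llm_model.config._attn_implementation)  # e.g. "sdpa" (the default here) or "eager"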
