
Only the last element has the expected output when doing batch inference #32848

Closed

HuangBugWei opened this issue Aug 16, 2024 · 5 comments

@HuangBugWei

System Info

  • transformers version: 4.44.0
  • Platform: Linux-4.18.0-372.32.1.0.1.el8_6.x86_64-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.3
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 3090

Who can help?

@ArthurZucker
@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-9b-it"
tokenizer_name = name
llm_model_name = name
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
llm_model = AutoModelForCausalLM.from_pretrained(
        llm_model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        # attn_implementation="flash_attention_2",
    )
llm_model.eval()

def chatWithLLM(model: AutoModelForCausalLM, tokenizer: AutoTokenizer):
    messages = [[
        {"role": "user", "content": "laugh " * (idx + 1) + " How many laugh are there?"},
    ] for idx in range(5)]
    input_ids = tokenizer.apply_chat_template(
        messages, 
        padding=True, 
        add_generation_prompt=True, 
        return_tensors="pt", 
        return_dict=True
    ).to(model.device)
    
    outputs = model.generate(
        **input_ids,
        negative_prompt_attention_mask=input_ids["attention_mask"],
        do_sample=False,
        max_new_tokens=500,
        temperature=0.1,
    )

    # ugly code to strip the input prompt from the decoded output, but probably not related to the bug
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    input_msg = tokenizer.batch_decode(input_ids["input_ids"], skip_special_tokens=True)
    for idx, im in enumerate(input_msg):
        response[idx] = response[idx][len(im):]
    
    return response

print(chatWithLLM(model=llm_model, tokenizer=tokenizer))
# ['', '', '', '', 'There are **5** laughs.  😄 \n']

Expected behavior

['There are 1 laughs. 😄 \n', 'There are 2 laughs. 😄 \n', 'There are 3 laughs. 😄 \n', 'There are 4 laughs. 😄 \n', 'There are 5 laughs. 😄 \n']

It is probably not an issue with apply_chat_template, since using

messages = []
for idx in range(5):
    messages.append("laugh " * (idx + 1) + " How many laugh are there?")
print(messages)
input_ids = tokenizer(
    messages, 
    padding=True, 
    return_tensors="pt", 
).to(model.device)

to create the batched messages also reproduces the issue.
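For completeness, a minimal sketch of the corresponding generate call (assuming this snippet replaces the apply_chat_template call inside chatWithLLM above, so model and tokenizer are in scope):

outputs = model.generate(
    **input_ids,  # input_ids and attention_mask from the tokenizer call above
    do_sample=False,
    max_new_tokens=500,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))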

@gante
Member

gante commented Aug 16, 2024

@HuangBugWei Thank you for opening this issue 🤗

It doesn't seem to be a bug, but rather an undesired output of the model for this prompt. The script you provided is the intended usage -- the only bit missing is setting the padding side when initializing the tokenizer, which improves generation quality with padded inputs, although it is still not enough in this case. See this doc for more info on the padding side.

Consider the script below, adapted from yours. If we use Llama 3.1, it gives the answer we expect :) I suggest playing around with the prompt if you want to use Gemma 2.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer_name = name
llm_model_name = name
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
llm_model = AutoModelForCausalLM.from_pretrained(
        llm_model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
llm_model.eval()

def chatWithLLM(model: AutoModelForCausalLM, tokenizer: AutoTokenizer):
    messages = [[
        {"role": "user", "content": "laugh " * (idx + 1) + " How many laugh are there?"},
    ] for idx in range(5)]
    input_ids = tokenizer.apply_chat_template(
        messages,
        padding=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True
    ).to(model.device)

    outputs = model.generate(
        **input_ids,
        do_sample=False,
        max_new_tokens=500,
    )

    # decode and print the full outputs (prompt + completion) for inspection
    print(outputs)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(response)

chatWithLLM(model=llm_model, tokenizer=tokenizer)

@HuangBugWei
Author

@gante Thanks for your reply.
But I'm sure that the default Gemma 2 tokenizer pads on the left side.
Below is what the attention mask looks like.
It's a bit strange, haha, since the exact same code snippet works fine with the Llama 3.1 8B Instruct model.

'attention_mask': tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}
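A quick way to double-check this is to inspect the tokenizer directly; a minimal sketch, assuming the same tokenizer object as in the reproduction script:

print(tokenizer.padding_side)  # reported to be "left" by default for the Gemma 2 tokenizer
print(tokenizer.pad_token)     # the token used to pad the shorter sequences in the batch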

@gante
Member

gante commented Aug 16, 2024

But I'm sure that the default Gemma 2 tokenizer pads on the left side.

Hehe, you're right! It's a reflex on my end; most models don't pad on the left by default 🤗

@HuangBugWei
Author

OK, I found that it might be an issue with the current attention implementation.
In transformers 4.44.0, everything works fine if you explicitly set attn_implementation="eager".
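In other words, the workaround is to load the model with eager attention; a minimal sketch based on the reproduction script above:

llm_model = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # work around the misbehaving default attention backend
)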

@gante
Member

gante commented Aug 17, 2024

In transformers 4.44.0, everything works fine if you explicitly set attn_implementation="eager".

Yeah, Gemma 2 doesn't work well with SDPA. Let me fix the default.
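For anyone debugging this, one way to see which attention backend a loaded model ended up with is to look at its config; note that _attn_implementation is an internal attribute, so treat this as a debugging sketch only:

print(llm_model.config._attn_implementation)  # e.g. "sdpa" (the default here) or "eager"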
