Only last elements have expected outputs when doing batch inference #32848
@HuangBugWei Thank you for opening this issue 🤗 It doesn't seem to be a bug, but rather an undesired output of the model given the prompt. The script you provided is the intended usage -- the only bit missing is the padding side when initializing the tokenizer, which improves generation quality when padding is used, but it is still not enough in this case. See this doc for more info on the padding side. Consider the script below, adapted from yours:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer_name = name
llm_model_name = name

# Decoder-only models should be padded on the left for batched generation.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

llm_model = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
llm_model.eval()


def chatWithLLM(model: AutoModelForCausalLM, tokenizer: AutoTokenizer):
    # Build a batch of 5 prompts with 1..5 repetitions of "laugh".
    messages = [
        [{"role": "user", "content": "laugh " * (idx + 1) + " How many laugh are there?"}]
        for idx in range(5)
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        padding=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    outputs = model.generate(
        **input_ids,
        do_sample=False,
        max_new_tokens=500,
    )
    # this is ugly code to isolate input message, but not related to the bug I guess
    print(outputs)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(response)


chatWithLLM(model=llm_model, tokenizer=tokenizer)
```
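As a side note on the "ugly code to isolate input message" comment: a minimal sketch (not from the original report) of decoding only the newly generated tokens, using the input_ids and outputs variables from inside chatWithLLM above, could look like this:

```python
# With left padding, every row in the batch shares the same prompt length,
# so the prompt can be sliced off before decoding.
prompt_length = input_ids["input_ids"].shape[1]
new_tokens = outputs[:, prompt_length:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
```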
@gante Thanks for your reply.
Hehe, you're right! It's a reflex on my end -- most models don't add padding on the left by default 🤗
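For reference, a small sketch of what that default looks like in practice (the checkpoint is the one from the script above; the printed value is an assumption, not something verified in this thread):

```python
# Tokenizers typically pad on the right unless configured otherwise, which is why
# the explicit padding_side="left" matters for batched generation.
default_tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
print(default_tokenizer.padding_side)    # often "right" by default
default_tokenizer.padding_side = "left"  # equivalent to padding_side="left" at load time
```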
OK, I found that it might be an issue with the current implementation of attention.
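One way to probe that would be to reload the same checkpoint with a different attention backend and rerun the batch; this is a hedged sketch reusing the names from the script above (the choice of "eager" as the comparison backend is an assumption, not something stated in the thread):

```python
# Compare against the default attention backend by forcing the eager implementation.
eager_model = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)
eager_model.eval()
chatWithLLM(model=eager_model, tokenizer=tokenizer)
```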
Yeah,
System Info
transformers version: 4.44.0

Who can help?
@ArthurZucker
@gante

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
['There are 1 laughs. 😄 \n', 'There are 2 laughs. 😄 \n', 'There are 3 laughs. 😄 \n', 'There are 4 laughs. 😄 \n', 'There are 5 laughs. 😄 \n']
It is probably not an issue with apply_chat_template, since building the batched inputs a different way also reproduces the issue.
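If it helps, a rough sketch of one such alternative path (an assumption about what "a different way" might look like, reusing the tokenizer, the model, and a messages list built as in the script above) is to render the chat template to plain strings first, then tokenize the strings as a padded batch:

```python
# Render the prompts to strings, then tokenize them as a regular padded batch.
texts = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# The rendered strings already contain the special tokens, so don't add them again.
inputs = tokenizer(
    texts, padding=True, add_special_tokens=False, return_tensors="pt"
).to(llm_model.device)
outputs = llm_model.generate(**inputs, do_sample=False, max_new_tokens=500)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```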