Batch inputs get different results than single input for llama model #30378

Closed

liangan1 opened this issue Apr 22, 2024 · 8 comments

Comments

@liangan1

System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.1
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0.dev20240325+cpu-cxx11-abi (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
#import intel_extension_for_pytorch as ipex
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", padding_side='right')
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
input_ids = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True).input_ids
#model = ipex.llm.optimize(model, deployment_mode=False)
output = model.generate(input_ids, max_new_tokens=10)
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)

input_ids = tokenizer(["how are you?"], return_tensors="pt", padding=True).input_ids
output = model.generate(input_ids, max_new_tokens=10)
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)

Expected behavior

For batch=2, the output should match what each prompt produces on its own, i.e. ["how are you? I'm doing well, thanks for asking!", 'what is the best the AI algorithm?\n\nThere is no single "best" A'].

@liangan1
Author

From my debugging, the generated attention mask is not correct: the padded positions should contain large negative values, but the mask is all zeros.
tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]],

        [[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]])

@liangan1
Author

@gante @jianan-gu

@zucchini-nlp
Member

Hey @liangan1 !

There are a few things that have to be changed for generation to work properly in batched form. First, it is recommended to use left padding for generation with decoder-only models. Also, the attention mask needs to be passed to generate when the input is batched. Please try the code below to verify that it works :)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", padding_side='left')  # left padding for generation
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
inputs = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True)
output = model.generate(**inputs, max_new_tokens=10)  # pass in not only input ids, but also attention mask
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)

input_ids = tokenizer(["how are you?"], return_tensors="pt", padding=True).input_ids
output = model.generate(input_ids, max_new_tokens=10)
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)
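
As a side note, and an assumption on my part rather than something stated in this thread: if the tokenizer checkpoint defines no pad token, padding=True raises an error, and a common workaround is to reuse the EOS token for padding before tokenizing the batch.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", padding_side='left')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS as the padding token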

@liangan1
Author

@zucchini-nlp thanks. Why does the user need to create the mask themselves?

@zucchini-nlp
Member

@liangan1 you don't have to create it manually. The tokenizer returns the attention mask, which should be passed into generate.

inputs = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True)
print(inputs.attention_mask)
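
For illustration only (this output is my sketch, not from the thread): the mask uses 1 for real tokens and 0 for padded positions, so with left padding the shorter prompt's row begins with zeros; exact lengths depend on the tokenizer.

# illustrative shape only; actual token counts depend on the tokenizer
# tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1]])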

I will close the issue as resolved. For any further questions it is recommended to ask in the forum 🤗

@liangan1
Author

Thanks for your help.

@gante
Member

gante commented May 3, 2024

@liangan1 to complement the answer above: there are a few seemingly innocuous differences that may result in slightly different LLM outputs, such as batching. To understand why it happens (and why it is unavoidable), have a look at this comment :)
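
To make that concrete, here is a minimal sketch of my own (reusing the model and the left-padding tokenizer from the snippets above) that compares the next-token logits of the same prompt run alone versus inside a padded batch; a small but non-zero difference is expected and is not a bug.

import torch

single = tokenizer(["how are you?"], return_tensors="pt")
batch = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits_single = model(**single).logits[0, -1]  # last-token logits, prompt run on its own
    logits_batch = model(**batch).logits[0, -1]    # same prompt, left-padded inside a batch of two
print((logits_single - logits_batch).abs().max())  # typically small but non-zero, due to padding and batched kernels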

@liangan1
Author

liangan1 commented May 6, 2024

Thanks for your info.
