Batch inputs get different results than single input for llama model #30378

Closed

liangan1 opened this issue Apr 22, 2024 · 8 comments

Comments

@liangan1

System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.1
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0.dev20240325+cpu-cxx11-abi (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
#import intel_extension_for_pytorch as ipex
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", padding_side='right')
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
input_ids = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True).input_ids
#model = ipex.llm.optimize(model, deployment_mode=False)
output = model.generate(input_ids, max_new_tokens=10)
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)

input_ids = tokenizer(["how are you?"], return_tensors="pt", padding=True).input_ids
output = model.generate(input_ids, max_new_tokens=10)
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)

Expected behavior

For batch=2, the output should match what each prompt produces on its own, i.e. ["how are you? I'm doing well, thanks for asking!", 'what is the best the AI algorithm?\n\nThere is no single "best" A'].

@liangan1
Author

From my debugging, the generated attention mask is not correct: the padded positions should contain large negative values, but the mask is all zeros.
tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]],

        [[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]])

@liangan1
Author

@gante @jianan-gu

@zucchini-nlp
Member

Hey @liangan1 !

There are a few things that have to be changed for generation to work properly in batched form. First, it is recommended to use left padding for generation with decoder-only models. Also, the attention mask needs to be passed to generate when the input is batched. Please try the code below to verify that it works :)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", padding_side='left')  # left padding for generation
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
inputs = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True)
output = model.generate(**inputs, max_new_tokens=10)  # pass in not only input ids, but also attention mask
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)

input_ids = tokenizer(["how are you?"], return_tensors="pt", padding=True).input_ids
output = model.generate(input_ids, max_new_tokens=10)
out_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(out_text)
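
As a side note, and an assumption on my part rather than something stated in this thread: if the tokenizer checkpoint defines no pad token, padding=True raises an error, and a common workaround is to reuse the EOS token for padding before tokenizing the batch.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", padding_side='left')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS as the padding token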

@liangan1
Author

@zucchini-nlp thanks. Why does the user need to create the mask themselves?

@zucchini-nlp
Member

@liangan1 you don't have to create it manually. The tokenizer returns the attention mask, which should be passed into generate.

inputs = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True)
print(inputs.attention_mask)
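
For illustration only (this output is my sketch, not from the thread): the mask uses 1 for real tokens and 0 for padded positions, so with left padding the shorter prompt's row begins with zeros; exact lengths depend on the tokenizer.

# illustrative shape only; actual token counts depend on the tokenizer
# tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1]])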

I will close the issue as resolved. For any further questions it is recommended to ask in the forum 🤗

@liangan1
Author

Thanks for your help.

@gante
Member

gante commented May 3, 2024

@liangan1 to complement the answer above: there are a few seemingly innocuous differences that may result in slightly different LLM outputs, such as batching. To understand why it happens (and why it is unavoidable), have a look at this comment :)
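
To make that concrete, here is a minimal sketch of my own (reusing the model and the left-padding tokenizer from the snippets above) that compares the next-token logits of the same prompt run alone versus inside a padded batch; a small but non-zero difference is expected and is not a bug.

import torch

single = tokenizer(["how are you?"], return_tensors="pt")
batch = tokenizer(["how are you?", "what is the best the AI algorithm?"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits_single = model(**single).logits[0, -1]  # last-token logits, prompt run on its own
    logits_batch = model(**batch).logits[0, -1]    # same prompt, left-padded inside a batch of two
print((logits_single - logits_batch).abs().max())  # typically small but non-zero, due to padding and batched kernels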

@liangan1
Author

liangan1 commented May 6, 2024

Thanks for your info.
