Assisted decoding results are not correct #30413

jiqing-feng · 2024-04-23T09:24:32Z

System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

transformers version: 4.40.0.dev0
Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.21.4
Safetensors version: 0.4.2
Accelerate version: 0.28.0
Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_CPU
- mixed_precision: bf16
- use_cpu: True
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- ipex_config: {'ipex': False}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
PyTorch version (GPU?): 2.2.2+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@gante

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer

promtpt = """
You are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".
conversation history:```human: How do I create a civil @@@ gpt: I'm sorry, but I'm not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr
eate a block in AutoCAD using python?```
"""

device = "cuda:1"
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(promtpt, return_tensors="pt").to(device)

generate_kwargs = {"do_sample": False, "num_beams": 1, "max_new_tokens": 128}

model.generation_config.num_assistant_tokens=1

print("greedy search")
outputs = model.generate(**inputs, **generate_kwargs)
print(outputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

print("assisted decoding")
outputs = model.generate(**inputs, assistant_model=model, **generate_kwargs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
print(outputs)

The outputs:

greedy search
tensor([[    1, 29871,    13,  3492,   526, 13563,  7451, 29889,   450, 11301,
          4955,   338,  2183, 14811,  7521,  4954,  1412,  7806,  1006,  2029,
          3406,  8665,   411,   376, 29887,   415, 29901,   376,   470,   376,
         26029, 29901,   376,   322, 10614,   411, 17962, 25380,  1642,   887,
          1708,   376, 29887,   415,  1642,   887,   817,   304,  8908,   304,
           376, 26029,  1642,    13,   535,   874,   362,  4955, 29901, 28956,
         26029, 29901,  1128,   437,   306,  1653,   263,  7631,   732, 25380,
           330,   415, 29901,   306, 29915, 29885,  7423, 29892,   541,   306,
         29915, 29885,   451,  1854,   825,   366,  2099,   491,   376,  3258,
           263,  7631,  1213,  6527,   366,  3113,  3867,   901,  3030,   470,
          7542,  2450, 29973,   732, 25380,  5199, 29901,   920,   437,   306,
          2181,    13, 29872,   403,   263,  2908,   297, 11133, 29907,  3035,
           773,  3017, 29973, 28956,    13,    13,  3492,   526, 13563,  7451,
         29889,   450, 14983,  4955,   338,  2183,  1546,  7521,  7521,  1412,
          7806,  1006,  2029,  3406,  8665,   411,   376, 29887,   415, 29901,
           376,   470,   376, 26029, 29901,   376,   322, 10614,   411, 17962,
         25380,  1642,   887,  1708,   376, 29887,   415,  1642,   887,   817,
           304,  8908,   304,   376, 26029,  1642,    13,    13,   535,   874,
           362,  4955, 29901,    13, 28956, 26029, 29901,  1128,   437,   306,
          1653,   263,  7631,   732, 25380,   330,   415, 29901,   306, 29915,
         29885,  7423, 29892,   541,   306, 29915, 29885,   451,  1854,   825,
           366,  2099,   491,   376,  3258,   263,  7631,  1213,  6527,   366,
          3113,  3867,   901,  3030,   470,  7542,  2450, 29973,   732, 25380,
          5199, 29901,   920,   437,   306,  1653,   263,  2908,   297, 11133,
         29907,  3035,   773,  3017, 29973, 28956,    13,    13,  3492,   508,
          8908,   304,   278]], device='cuda:1')
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation hist
ory:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block
in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply
 to "human".\n\nconversation history:\n```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@
@ human: how do I create a block in AutoCAD using python?```\n\nYou can reply to the']



assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block
in AutoCAD using python?```\n\nYou are ae "gpt chatbot". You need to bot". You play "gpt" to reply to "human". You play "gpt: reply to "human".\n\nHere is given the "human".\nreplyly "replyly. You are are
you "human".\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nreplylylylylylylylylylylylylylylylylylylylylylylylylylylylylylylylyly']
tensor([[    1, 29871,    13,  3492,   526, 13563,  7451, 29889,   450, 11301,
          4955,   338,  2183, 14811,  7521,  4954,  1412,  7806,  1006,  2029,
          3406,  8665,   411,   376, 29887,   415, 29901,   376,   470,   376,
         26029, 29901,   376,   322, 10614,   411, 17962, 25380,  1642,   887,
          1708,   376, 29887,   415,  1642,   887,   817,   304,  8908,   304,
           376, 26029,  1642,    13,   535,   874,   362,  4955, 29901, 28956,
         26029, 29901,  1128,   437,   306,  1653,   263,  7631,   732, 25380,
           330,   415, 29901,   306, 29915, 29885,  7423, 29892,   541,   306,
         29915, 29885,   451,  1854,   825,   366,  2099,   491,   376,  3258,
           263,  7631,  1213,  6527,   366,  3113,  3867,   901,  3030,   470,
          7542,  2450, 29973,   732, 25380,  5199, 29901,   920,   437,   306,
          2181,    13, 29872,   403,   263,  2908,   297, 11133, 29907,  3035,
           773,  3017, 29973, 28956,    13,    13,  3492,   526,   263, 29872,
           376, 29887,   415, 13563,  7451,  1642,   887,   817,   304,  9225,
          1642,   887,  1708,   376, 29887,   415, 29908,   304,  8908,   304,
           376, 26029,  1642,   887,  1708,   376, 29887,   415, 29901,  8908,
           304,   376, 26029,  1642,    13,    13, 10605,   338,  2183,   278,
           376, 26029,  1642,    13,  3445,   368,   368,   376,  3445,   368,
           368, 29889,   887,   526,   526,   366,   376, 26029,  1642,    13,
            13,    13,    13,    13,    13,    13,    13,    13,    13,    13,
            13,    13,    13,    13,    13,    13,    13,    13,    13,    13,
            13,    13,    13,    13,    13,    13,    13,    13,    13,  3445,
           368,   368,   368,   368,   368,   368,   368,   368,   368,   368,
           368,   368,   368,   368,   368,   368,   368,   368,   368,   368,
           368,   368,   368,   368,   368,   368,   368,   368,   368,   368,
           368,   368,   368]], device='cuda:1')

Expected behavior

Hi @gante

The outputs should be the same, but the assisted decoding is incorrect. I suppose there are some arguments mistake caused this issue, I've checked it and found the candidate generator has the same output as greedy search but the target model (self) forward results are incorrect. Would you please help me to figure out the issue? Thx!

BTW, I see that the cache_position is inconsistent, but I don't know the correct format.

The text was updated successfully, but these errors were encountered:

zucchini-nlp · 2024-04-23T09:36:25Z

Related to (#30042)

zucchini-nlp · 2024-04-23T13:34:25Z

@jiqing-feng , the fix was merged on main.

You can update transformers with !pip install --upgrade git+https://github.com/huggingface/transformers.git to get the correct behavior. Tested with the script you provided and can confirm that generations match

Closing issue as resolved :)

jiqing-feng · 2024-04-24T00:39:28Z

greedy search
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\n\nconversation history:\n```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I create a block in AutoCAD using python?```\n\nYou can reply to the']


assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\n\nconversation history:\n\nhuman: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I create a block in AutoCAD using python?\n\nPlease provide a response as "']

It's not exactly the same in the last few tokens, but better. Is it reasonable with a little difference?

jiqing-feng · 2024-04-24T01:35:07Z

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

promtpt = """
You are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".                       conversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade tra
"""

device = "cuda:1"
model_id = "meta-llama/Llama-2-7b-chat-hf"
as_model_id = "Felladrin/Llama-68M-Chat-v1"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to(device)
as_model = AutoModelForCausalLM.from_pretrained(as_model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(promtpt, return_tensors="pt").to(device)

generate_kwargs = {"do_sample": False, "num_beams": 1, "max_new_tokens": 256}

print("greedy search")
outputs = model.generate(**inputs, **generate_kwargs)
print(outputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

print("assisted decoding")
outputs = model.generate(**inputs, assistant_model=as_model, **generate_kwargs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
print(outputs)

output:

greedy search
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".
    conversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade
 tra\ngpt: Sure, I\'d be happy to help you create a travel plan for a family with small kids from London to Belgrade! Can you please provide me with some details such as the age of the children, the travel
 dates, and any specific interests or preferences? @@@ human: Sure! The kids are 7 and 9 years old. We are planning to travel on July 15th and will be in Belgrade for 4 days. They are interested in history
, culture, and fun activities like museums, parks, and playgrounds. @@@ gpt: Great! Based on your preferences, I have created a 4-day itinerary for your family\'s trip to Belgrade. Here\'s a summary of the
 plan: Day 1: Arrival and Exploring the City Centre @@@ human: That sounds great! Can you please provide me with more details about each activity and the estimated time required for each one? @@@ gpt: Of c
ourse! Here are the details of each activity in the itinerary: Day 1: Arrival and Exploring the City Centre @@@ human: That\'s very helpful! Can you please provide me with some']


assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".
    conversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade
 tra\ngpt: Sure, I\'d be happy to help you create a travel plan for a family with small kids from London to Belgrade! Can you please provide me with some details such as the age of the children, the travel
 dates, and any specific interests or preferences? @@@ human: Sure! The kids are 7 and 9 years old. We are planning to travel on July 10th and return on July 17th. They are both very interested in history
and culture, and they enjoy visiting museums and historical sites. Do you have any recommendations for places to visit in Belgrade? gpt: Great! Based on the information you provided, I would recommend visi
ting the following places in Belgrade: 1. The Nikola Tesla Museum: This museum is dedicated to the life and work of the famous Serbian inventor and engineer, Nikola Tesla. It\'s a great place for kids to l
earn about science and technology. 2. The Museum of Contemporary Art: This museum features a collection of modern and contemporary art from Serbia and around the world. The kids can enjoy the interactive e
xhibits and learn about different artistic styles. 3. The']

Found mismatch when output length is long.

zucchini-nlp · 2024-04-24T10:18:31Z

@jiqing-feng After a bit of exploration I do not see any bugs in the way assisted decoding is passing in arguments. My guess is that the problem comes from small numerical precision errors that are accumulated over generation timesteps. In other words, for greedy decoding we always have 1 more token when generating, so the calculation of key/value is actually a vector-matrix multiplication. However for assisted generation it's always a matrix-matrix multiplication due to having large number of candidate tokens verified. So my opinion is that torch internally handles those differently with slightly different operation's order, which leads to error accumulation.

cc @gante do you have any other ideas why this happens?

jiqing-feng · 2024-04-25T00:54:43Z

@jiqing-feng After a bit of exploration I do not see any bugs in the way assisted decoding is passing in arguments. My guess is that the problem comes from small numerical precision errors that are accumulated over generation timesteps. In other words, for greedy decoding we always have 1 more token when generating, so the calculation of key/value is actually a vector-matrix multiplication. However for assisted generation it's always a matrix-matrix multiplication due to having large number of candidate tokens verified. So my opinion is that torch internally handles those differently with slightly different operation's order, which leads to error accumulation.

cc @gante do you have any other ideas why this happens?

It is reasonable, thanks : )

gante · 2024-05-03T14:03:15Z

@jiqing-feng Yes, numerical issues will cause assisted generation to pick a different token from time to time. It's the exact same issue as with batched generation or the use of KV caches :)

👉 you can read more about the issue here

amyeroberts added the Generation label Apr 23, 2024

zucchini-nlp closed this as completed Apr 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assisted decoding results are not correct #30413

Assisted decoding results are not correct #30413

jiqing-feng commented Apr 23, 2024 •

edited

Loading

zucchini-nlp commented Apr 23, 2024

zucchini-nlp commented Apr 23, 2024

jiqing-feng commented Apr 24, 2024 •

edited

Loading

jiqing-feng commented Apr 24, 2024

zucchini-nlp commented Apr 24, 2024

jiqing-feng commented Apr 25, 2024

gante commented May 3, 2024

Assisted decoding results are not correct #30413

Assisted decoding results are not correct #30413

Comments

jiqing-feng commented Apr 23, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

zucchini-nlp commented Apr 23, 2024

zucchini-nlp commented Apr 23, 2024

jiqing-feng commented Apr 24, 2024 • edited Loading

jiqing-feng commented Apr 24, 2024

zucchini-nlp commented Apr 24, 2024

jiqing-feng commented Apr 25, 2024

gante commented May 3, 2024

jiqing-feng commented Apr 23, 2024 •

edited

Loading

jiqing-feng commented Apr 24, 2024 •

edited

Loading