
Assisted decoding results are not correct #30413

Closed
2 of 4 tasks
jiqing-feng opened this issue Apr 23, 2024 · 7 comments

@jiqing-feng (Contributor) commented Apr 23, 2024

System Info


  • transformers version: 4.40.0.dev0
  • Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_CPU
    - mixed_precision: bf16
    - use_cpu: True
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - ipex_config: {'ipex': False}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer

promtpt = """
You are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".
conversation history:```human: How do I create a civil @@@ gpt: I'm sorry, but I'm not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr
eate a block in AutoCAD using python?```
"""

device = "cuda:1"
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(promtpt, return_tensors="pt").to(device)

generate_kwargs = {"do_sample": False, "num_beams": 1, "max_new_tokens": 128}

model.generation_config.num_assistant_tokens=1

print("greedy search")
outputs = model.generate(**inputs, **generate_kwargs)
print(outputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

print("assisted decoding")
outputs = model.generate(**inputs, assistant_model=model, **generate_kwargs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
print(outputs)

The outputs:

greedy search
tensor([[    1, 29871,    13,  3492,   526, 13563,  7451, 29889,   450, 11301,
          4955,   338,  2183, 14811,  7521,  4954,  1412,  7806,  1006,  2029,
          3406,  8665,   411,   376, 29887,   415, 29901,   376,   470,   376,
         26029, 29901,   376,   322, 10614,   411, 17962, 25380,  1642,   887,
          1708,   376, 29887,   415,  1642,   887,   817,   304,  8908,   304,
           376, 26029,  1642,    13,   535,   874,   362,  4955, 29901, 28956,
         26029, 29901,  1128,   437,   306,  1653,   263,  7631,   732, 25380,
           330,   415, 29901,   306, 29915, 29885,  7423, 29892,   541,   306,
         29915, 29885,   451,  1854,   825,   366,  2099,   491,   376,  3258,
           263,  7631,  1213,  6527,   366,  3113,  3867,   901,  3030,   470,
          7542,  2450, 29973,   732, 25380,  5199, 29901,   920,   437,   306,
          2181,    13, 29872,   403,   263,  2908,   297, 11133, 29907,  3035,
           773,  3017, 29973, 28956,    13,    13,  3492,   526, 13563,  7451,
         29889,   450, 14983,  4955,   338,  2183,  1546,  7521,  7521,  1412,
          7806,  1006,  2029,  3406,  8665,   411,   376, 29887,   415, 29901,
           376,   470,   376, 26029, 29901,   376,   322, 10614,   411, 17962,
         25380,  1642,   887,  1708,   376, 29887,   415,  1642,   887,   817,
           304,  8908,   304,   376, 26029,  1642,    13,    13,   535,   874,
           362,  4955, 29901,    13, 28956, 26029, 29901,  1128,   437,   306,
          1653,   263,  7631,   732, 25380,   330,   415, 29901,   306, 29915,
         29885,  7423, 29892,   541,   306, 29915, 29885,   451,  1854,   825,
           366,  2099,   491,   376,  3258,   263,  7631,  1213,  6527,   366,
          3113,  3867,   901,  3030,   470,  7542,  2450, 29973,   732, 25380,
          5199, 29901,   920,   437,   306,  1653,   263,  2908,   297, 11133,
         29907,  3035,   773,  3017, 29973, 28956,    13,    13,  3492,   508,
          8908,   304,   278]], device='cuda:1')
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation hist
ory:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block
in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply
 to "human".\n\nconversation history:\n```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@
@ human: how do I create a block in AutoCAD using python?```\n\nYou can reply to the']



assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block
in AutoCAD using python?```\n\nYou are ae "gpt chatbot". You need to bot". You play "gpt" to reply to "human". You play "gpt: reply to "human".\n\nHere is given the "human".\nreplyly "replyly. You are are
you "human".\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nreplylylylylylylylylylylylylylylylylylylylylylylylylylylylylylylylyly']
tensor([[    1, 29871,    13,  3492,   526, 13563,  7451, 29889,   450, 11301,
          4955,   338,  2183, 14811,  7521,  4954,  1412,  7806,  1006,  2029,
          3406,  8665,   411,   376, 29887,   415, 29901,   376,   470,   376,
         26029, 29901,   376,   322, 10614,   411, 17962, 25380,  1642,   887,
          1708,   376, 29887,   415,  1642,   887,   817,   304,  8908,   304,
           376, 26029,  1642,    13,   535,   874,   362,  4955, 29901, 28956,
         26029, 29901,  1128,   437,   306,  1653,   263,  7631,   732, 25380,
           330,   415, 29901,   306, 29915, 29885,  7423, 29892,   541,   306,
         29915, 29885,   451,  1854,   825,   366,  2099,   491,   376,  3258,
           263,  7631,  1213,  6527,   366,  3113,  3867,   901,  3030,   470,
          7542,  2450, 29973,   732, 25380,  5199, 29901,   920,   437,   306,
          2181,    13, 29872,   403,   263,  2908,   297, 11133, 29907,  3035,
           773,  3017, 29973, 28956,    13,    13,  3492,   526,   263, 29872,
           376, 29887,   415, 13563,  7451,  1642,   887,   817,   304,  9225,
          1642,   887,  1708,   376, 29887,   415, 29908,   304,  8908,   304,
           376, 26029,  1642,   887,  1708,   376, 29887,   415, 29901,  8908,
           304,   376, 26029,  1642,    13,    13, 10605,   338,  2183,   278,
           376, 26029,  1642,    13,  3445,   368,   368,   376,  3445,   368,
           368, 29889,   887,   526,   526,   366,   376, 26029,  1642,    13,
            13,    13,    13,    13,    13,    13,    13,    13,    13,    13,
            13,    13,    13,    13,    13,    13,    13,    13,    13,    13,
            13,    13,    13,    13,    13,    13,    13,    13,    13,  3445,
           368,   368,   368,   368,   368,   368,   368,   368,   368,   368,
           368,   368,   368,   368,   368,   368,   368,   368,   368,   368,
           368,   368,   368,   368,   368,   368,   368,   368,   368,   368,
           368,   368,   368]], device='cuda:1')

Expected behavior

Hi @gante

The outputs should be the same, but the assisted decoding result is incorrect. I suspect a mistake in the arguments caused this issue. I've checked and found that the candidate generator produces the same output as greedy search, but the target model's (self) forward results are incorrect. Could you please help me figure out the issue? Thanks!

By the way, I see that cache_position is inconsistent between the two decoding paths, but I don't know what the correct format is.

@zucchini-nlp (Member)

Related to #30042.

@zucchini-nlp (Member)

@jiqing-feng, the fix was merged into main.

You can update transformers with `pip install --upgrade git+https://github.com/huggingface/transformers.git` to get the correct behavior. I tested with the script you provided and can confirm that the generations match.

Closing issue as resolved :)

@jiqing-feng (Contributor, Author) commented Apr 24, 2024

greedy search
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\n\nconversation history:\n```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I create a block in AutoCAD using python?```\n\nYou can reply to the']


assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\nconversation history:```human: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I cr\neate a block in AutoCAD using python?```\n\nYou are chatbot. The conversation history is given between ``` ````. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".\n\nconversation history:\n\nhuman: How do I create a civil @@@ gpt: I\'m sorry, but I\'m not sure what you mean by "create a civil." Could you please provide more context or clarification? @@@ human: how do I create a block in AutoCAD using python?\n\nPlease provide a response as "']

The last few tokens are not exactly the same, but it is much better. Is such a small difference expected?
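To quantify how far two decodes agree, a small helper can report the first position at which two token-id sequences diverge. This is a hypothetical utility for debugging, not part of transformers, and the ids below are illustrative rather than taken from the run above:

```python
def first_divergence(a, b):
    """Return the index of the first position where two token-id
    sequences differ, or None if they agree on the shared prefix."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None

# The two sequences agree on the first three positions
# and diverge at index 3.
greedy_ids = [1, 29871, 13, 3492, 526]
assisted_ids = [1, 29871, 13, 3445, 368]
print(first_divergence(greedy_ids, assisted_ids))  # → 3
```

Running this on the full `outputs` tensors from both calls shows exactly where the generations part ways.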

@jiqing-feng (Contributor, Author)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

promtpt = """
You are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".                       conversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade tra
"""

device = "cuda:1"
model_id = "meta-llama/Llama-2-7b-chat-hf"
as_model_id = "Felladrin/Llama-68M-Chat-v1"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to(device)
as_model = AutoModelForCausalLM.from_pretrained(as_model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(promtpt, return_tensors="pt").to(device)

generate_kwargs = {"do_sample": False, "num_beams": 1, "max_new_tokens": 256}

print("greedy search")
outputs = model.generate(**inputs, **generate_kwargs)
print(outputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

print("assisted decoding")
outputs = model.generate(**inputs, assistant_model=as_model, **generate_kwargs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
print(outputs)

output:

greedy search
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".
    conversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade
 tra\ngpt: Sure, I\'d be happy to help you create a travel plan for a family with small kids from London to Belgrade! Can you please provide me with some details such as the age of the children, the travel
 dates, and any specific interests or preferences? @@@ human: Sure! The kids are 7 and 9 years old. We are planning to travel on July 15th and will be in Belgrade for 4 days. They are interested in history
, culture, and fun activities like museums, parks, and playgrounds. @@@ gpt: Great! Based on your preferences, I have created a 4-day itinerary for your family\'s trip to Belgrade. Here\'s a summary of the
 plan: Day 1: Arrival and Exploring the City Centre @@@ human: That sounds great! Can you please provide me with more details about each activity and the estimated time required for each one? @@@ gpt: Of c
ourse! Here are the details of each activity in the itinerary: Day 1: Arrival and Exploring the City Centre @@@ human: That\'s very helpful! Can you please provide me with some']


assisted decoding
['\nYou are chatbot. The conversion history is givenbetween ``` ```. Each interlocutor starts with "gpt: " or "human: " and ends with "@@@". You play "gpt". You need to reply to "human".
    conversation history:```system: *This chat conversation is shared from [**TypingMind.com**](https://typingmind.com)* @@@ human: Create a travel plan for a Family with small kids from London to Belgrade
 tra\ngpt: Sure, I\'d be happy to help you create a travel plan for a family with small kids from London to Belgrade! Can you please provide me with some details such as the age of the children, the travel
 dates, and any specific interests or preferences? @@@ human: Sure! The kids are 7 and 9 years old. We are planning to travel on July 10th and return on July 17th. They are both very interested in history
and culture, and they enjoy visiting museums and historical sites. Do you have any recommendations for places to visit in Belgrade? gpt: Great! Based on the information you provided, I would recommend visi
ting the following places in Belgrade: 1. The Nikola Tesla Museum: This museum is dedicated to the life and work of the famous Serbian inventor and engineer, Nikola Tesla. It\'s a great place for kids to l
earn about science and technology. 2. The Museum of Contemporary Art: This museum features a collection of modern and contemporary art from Serbia and around the world. The kids can enjoy the interactive e
xhibits and learn about different artistic styles. 3. The']

I found a mismatch when the output length is long.

@zucchini-nlp (Member)

@jiqing-feng After a bit of exploration I do not see any bugs in the way assisted decoding passes its arguments. My guess is that the problem comes from small numerical precision errors that accumulate over generation timesteps. In other words, in greedy decoding we always generate one more token at a time, so the key/value calculation is a vector-matrix multiplication. In assisted generation it is always a matrix-matrix multiplication, because a large number of candidate tokens is verified at once. My opinion is that torch internally handles these cases with a slightly different operation order, which leads to error accumulation.

cc @gante do you have any other ideas why this happens?
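The reduction-order hypothesis can be illustrated in isolation, without any model. This is a plain-Python analogy for the vector-matrix vs. matrix-matrix kernel difference, not the actual torch code path: floating-point addition is not associative, so two mathematically equivalent accumulation orders can produce different results.

```python
# Floating-point addition is not associative: the same three numbers
# summed in two different orders give two different answers.
vals = [1e16, 1.0, -1e16]

left_to_right = (vals[0] + vals[1]) + vals[2]  # the 1.0 is absorbed into 1e16
reordered = (vals[0] + vals[2]) + vals[1]      # the large terms cancel first

print(left_to_right)  # → 0.0
print(reordered)      # → 1.0
```

This toy example is exaggerated for clarity; in bf16 matmul kernels the per-step discrepancy is tiny, but over hundreds of generation steps it can flip an argmax.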

@jiqing-feng (Contributor, Author)


It is reasonable, thanks :)

@gante (Member) commented May 3, 2024

@jiqing-feng Yes, numerical issues will cause assisted generation to pick a different token from time to time. It's the exact same issue as with batched generation or the use of KV caches :)

👉 You can read more about the issue here.
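As a sketch of why a tiny numerical difference can change the generated text: when the top-2 logits are nearly tied, a perturbation smaller than their gap flips the greedy argmax, and every later token then conditions on a different prefix. The numbers below are illustrative, not taken from the model above:

```python
# Two near-tied logits: a perturbation of 2e-6 (larger than the 1e-6 gap)
# is enough to change which token greedy search picks.
logits = [2.500000, 2.499999, 1.0]
perturbed = [v + e for v, e in zip(logits, [0.0, 2e-6, 0.0])]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

print(argmax(logits))     # → 0
print(argmax(perturbed))  # → 1
```

Once one token differs, the two decodes diverge for the rest of the sequence, which is why the mismatch tends to appear only when the output is long.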
