StaticCache Bad generation results with Llama after v4.39.0 #30417

Closed
2 of 4 tasks
mobicham opened this issue Apr 23, 2024 · 2 comments · Fixed by #30476
Comments

@mobicham
Contributor

System Info

transformers version: 4.41.0.dev0
Platform: Linux-5.15.0-89-generic-x86_64-with-glibc2.
Python version: 3.10.
Huggingface_hub version: 0.20.
Safetensors version: 0.4.
Accelerate version: 0.21.0

Who can help?

@ArthurZucker @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

With the static cache, generation quality on the current 4.41.0.dev0 version is much worse than on the previous 4.39.0 release, at least with Llama. With quantized models it outputs complete gibberish. The same code works fine with 4.39.0:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id  = "meta-llama/Llama-2-7b-chat-hf"
model     = AutoModelForCausalLM.from_pretrained(
    model_id, cache_dir=".", torch_dtype=torch.float16, attn_implementation="sdpa"
).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=".")

tokenizer.add_bos_token = False
tokenizer.add_eos_token = False

prompt = "<s>[INST] How do I build a car? [/INST]"

gen_out = model.generate(
    **tokenizer([prompt], return_tensors="pt").to(model.device),
    do_sample=False,
    cache_implementation="static",
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    temperature=None,
    top_p=None,
    use_cache=False,
)

print()
print(tokenizer.decode(gen_out[0]))
# version: 4.39.0 - works as expected
<s> [INST] How do I build a car? [/INST]  Building a car is a complex and challenging project that requires a significant amount of time, money, and expertise. Here are some general steps that you might consider when building a car:

1. Define your goals: What kind of car do you want to build? What features do you want to include? What is your budget? Answering these questions will help you determine the scope of your project and what you need to do to get started.
2. Research and plan:
# version: 4.41.0 - bad output, outputs gibberish 
<s> [INST] How do I build a car? [/INST]  I's (the 0-2) are dots d's the traveling 4 v5 8 out the9 of the9 1t 1 ch do not always and the9 10 11 is-not 1 rt 1 c0 the0.

To build a car, you will need to have a good understanding of mechanical systems, electrical systems, and fabrication techniques. You will also need to have a

Expected behavior

The output should be the same as with the previous 4.39.0 release.
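
As an additional sanity check (a minimal sketch, not part of the original report, reusing the same model and prompt): running the same greedy generation twice on the same version, once with the default dynamic cache and once with cache_implementation="static", should produce token-for-token identical outputs, so any mismatch isolates the static-cache path.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id  = "meta-llama/Llama-2-7b-chat-hf"
model     = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, attn_implementation="sdpa"
).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_bos_token = False

inputs = tokenizer(["<s>[INST] How do I build a car? [/INST]"], return_tensors="pt").to(model.device)

# Greedy generation with the default (dynamic) cache
out_dynamic = model.generate(**inputs, do_sample=False, max_new_tokens=50,
                             pad_token_id=tokenizer.eos_token_id)

# Greedy generation with the static cache
out_static = model.generate(**inputs, do_sample=False, max_new_tokens=50,
                            cache_implementation="static",
                            pad_token_id=tokenizer.eos_token_id)

# With do_sample=False both runs should decode to the same text;
# a difference points at the static-cache code path.
print(tokenizer.decode(out_dynamic[0]))
print(tokenizer.decode(out_static[0]))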

@ArthurZucker ArthurZucker changed the title Bad generation results with Llama after v4.39.0 StaticCache Bad generation results with Llama after v4.39.0 Apr 23, 2024
@ArthurZucker
Collaborator

Mmm, that's super weird. It's most probably generate, since the test_torch_compile test here is all green:

def test_compile_static_cache(self):
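
To separate generate() from the modeling and cache code, one option is a manual greedy loop with an explicit StaticCache (a minimal sketch, assuming the StaticCache constructor and cache_position keyword documented around this release; not code from this thread). If this also produces gibberish, the problem is unlikely to be in generate() itself.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id  = "meta-llama/Llama-2-7b-chat-hf"
model     = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, attn_implementation="sdpa"
).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_bos_token = False

input_ids = tokenizer("<s>[INST] How do I build a car? [/INST]",
                      return_tensors="pt").input_ids.to(model.device)

# Pre-allocated static cache, sized for the prompt plus 100 new tokens
past_key_values = StaticCache(config=model.config, max_batch_size=1,
                              max_cache_len=512, device=model.device, dtype=model.dtype)

cache_position = torch.arange(input_ids.shape[1], device=model.device)
generated = [input_ids]
with torch.no_grad():
    # Prefill the cache with the prompt
    logits = model(input_ids, past_key_values=past_key_values,
                   cache_position=cache_position, use_cache=True).logits
    next_token = logits[:, -1:].argmax(dim=-1)
    generated.append(next_token)
    # Decode greedily, one token at a time, bypassing generate()
    for _ in range(99):
        cache_position = cache_position[-1:] + 1
        logits = model(next_token, past_key_values=past_key_values,
                       cache_position=cache_position, use_cache=True).logits
        next_token = logits[:, -1:].argmax(dim=-1)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))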

@gante
Member

gante commented Apr 23, 2024

having a look
