Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove bug in convert.py permute function #3364

Merged
merged 1 commit into from
Sep 27, 2023

Conversation

jzhang38
Copy link
Contributor

@jzhang38 jzhang38 commented Sep 27, 2023

This bug will only be triggered by HuggingFace GQA models. Nobody realized it because we never used convert.py to convert the HF llama2 70B model Llama 2 70B has 64 heads and 8 num_key_value_heads. 64 / 8 = 8.

This bug has caused models from the [TinyLlama] (https://github.com/jzhang38/TinyLlama) projects to not be able to convert correctly. (TinyLlama is a 1.1B model that uses GQA)

jzhang38/TinyLlama#24

@Green-Sky
Copy link
Collaborator

@TheBloke did you ever have issues with GQA 70B and hf models?

@TheBloke
Copy link
Contributor

No I haven't had issues with those for ages - not since the issues were fixed shortly after GGUF released. I've done loads in the last few weeks, all work fine.

@jzhang38 it's not true to say that all 70B models come from Meta PTH weights. 99% of 70B conversions now are done from HF weights in pytorch_model.bin or model.safetensors format - because they're fine tuned models.

Do you want me to test this updated script with a 70B HF model? I have one to convert in a minute actually

@Green-Sky
Copy link
Collaborator

Do you want me to test this updated script with a 70B HF model? I have one to convert in a minute actually

yea please do, it's kind of hard to "just" convert one of those for me 😅

@Green-Sky
Copy link
Collaborator

@jzhang38 I can indeed confirm that this fixes the converted tinyllama models 👍

@jzhang38
Copy link
Contributor Author

jzhang38 commented Sep 27, 2023

@TheBloke Yeah the actual reason would be Llama 2 70B use 64 heads and 8 key-value heads, which makes

n_head //= n_head_kv

the same as

n_head = n_head_kv

So the bug is not triggered.

@Green-Sky
Copy link
Collaborator

Green-Sky commented Sep 27, 2023

ran perplexity on the first 300 chunks(batch 512) of wikitest on the f32 models
105b: 11.8159 +/- 0.11470
503b: 11.3646 +/- 0.10860

q8_0 (gpu):
105b: 11.8219 +/- 0.11475
503b: 11.3588 +/- 0.10852

it is safe to say that it works with this pr

@Green-Sky
Copy link
Collaborator

are there other GQA/MQA models we can test?

@TheBloke
Copy link
Contributor

The 70B Llama 2 model worked fine BTW

@slaren
Copy link
Collaborator

slaren commented Sep 27, 2023

are there other GQA/MQA models we can test?

Mistral 7B (#3362) seems to be GQA, but I don't know if there is an HF conversion already.

@Ar57m
Copy link

Ar57m commented Sep 27, 2023

This bug will only be triggered by HuggingFace GQA models. Nobody realized it because we never used convert.py to convert the HF llama2 70B model Llama 2 70B has 64 heads and 8 num_key_value_heads. 64 / 8 = 8.

This bug has caused models from the [TinyLlama] (https://github.com/jzhang38/TinyLlama) projects to not be able to convert correctly. (TinyLlama is a 1.1B model that uses GQA)

jzhang38/TinyLlama#24

I've notice https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1/discussions/4#651432a05d12b3abdd5d16bd
but I thought it was my fault or something wrong with the model.

@Green-Sky
Copy link
Collaborator

Green-Sky commented Sep 27, 2023

Mistral 7B (#3362) seems to be GQA, but I don't know if there is an HF conversion already.

this one has a different context management. "sliding window context", so the trained context of 32768 is gonna result in a wrong user experience. Window size should be 4096.

@Green-Sky
Copy link
Collaborator

@TheBloke how did you convert Mistral-7B-v0.1 ?

@TheBloke
Copy link
Contributor

I applied the GQA fix from this PR, and then I deleted added_tokens.json which probably shouldn't be there, and breaks convert.py

Then I just ran convert.py as normal.

Same with Mistral-7B-Instruct-v0.1, except I didn't need to delete added_tokens.json there, so I guess they realised it wasn't meant to be there.

@Green-Sky
Copy link
Collaborator

Oh, that easy... can you add a note that llama.cpp does currently not perform sliding window context, and that max context should be set to 4096 (sliding_window).
since it should work the same before the window starts sliding.

@TheBloke
Copy link
Contributor

OK sure. Someone on the other thread said it seemed to work at 8192? But I'll say it's not yet supported

@Green-Sky
Copy link
Collaborator

OK sure. Someone on the other thread said it seemed to work at 8192? But I'll say it's not yet supported

this might be just like llama2 where, contrary to llama1, it does not immediately deteriorate when going past the trained size.

@Green-Sky
Copy link
Collaborator

from #3362

Actually no my quants don't work fine! I needed that permute fix. Re-making now

so you used this pr?

@TheBloke
Copy link
Contributor

Yes, changing \= to =. Before I applied that the ggufs produced gibberish after a few words

@Green-Sky Green-Sky merged commit e519621 into ggerganov:master Sep 27, 2023
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Sep 27, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  convert : remove bug in convert.py permute function (ggerganov#3364)
  make-ggml.py : compatibility with more models and GGUF (ggerganov#3290)
  gguf : fix a few general keys (ggerganov#3341)
  metal : reusing llama.cpp logging (ggerganov#3152)
  build : add ACCELERATE_NEW_LAPACK to fix warning on macOS Sonoma (ggerganov#3342)
  readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (ggerganov#3340)
  cmake : fix build-info.h on MSVC (ggerganov#3309)
  docs: Fix typo CLBlast_DIR var. (ggerganov#3330)
  nix : add cuda, use a symlinked toolkit for cmake (ggerganov#3202)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trying to build a model of PY007/TinyLlama-1.1B-step-50K-105b
5 participants