Llama 2 70B: Update needed to convert.py to support 70B HF format model files #2376
Comments
It would indeed be very nice to be able to convert the 70B HF models. I tried to look into it and figure out how it all works, but the skills needed are beyond me. I think the permute() function in the Transformers conversion script is getting reversed by the permute() in llama.cpp's convert.py, but that function is not yet compatible with Llama v2, which uses GQA. |
Have you tried just temporarily disabling it? Basically just changing

```python
def permute(weights: NDArray, n_head: int) -> NDArray:
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))
```

to

```python
def permute(weights: NDArray, n_head: int) -> NDArray:
    return weights
```

It may not work (or some kind of reshape might still be needed), but this should be a pretty easy one to at least try. I'd test it myself but I don't have the 16-bit 70B at hand. |
No, I don't think that would work, since all pth models, including the 70B, are transformed by the HF permute(). I guess we need a new permute() in convert.py to reverse it. |
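To make the relationship between the two permute() functions concrete, here is a minimal self-contained sketch (my own illustration under stated assumptions, not code from either repository; `hf_style_permute` is just a stand-in for the effect of the Transformers export, and the toy sizes are arbitrary):

```python
# Sketch only: shows that convert.py's permute is the inverse of an HF-style
# permute when the head count matches, and why a GQA K/V tensor needs its own
# head count when reversing.
import numpy as np

n_head, head_dim, dim = 4, 8, 32   # toy sizes; Llama 2 70B uses n_head=64, n_kv_head=8

def hf_style_permute(w: np.ndarray, n_head: int) -> np.ndarray:
    # Stand-in for the HF export: reorder each head's rows from (half, 2) to (2, half).
    return (w.reshape(n_head, w.shape[0] // n_head // 2, 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

def reverse_permute(w: np.ndarray, n_head: int) -> np.ndarray:
    # The inverse: reorder each head's rows from (2, half) back to (half, 2).
    return (w.reshape(n_head, 2, w.shape[0] // n_head // 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

wq = np.random.rand(n_head * head_dim, dim)
assert np.allclose(reverse_permute(hf_style_permute(wq, n_head), n_head), wq)

# A GQA key/value projection only has n_kv_head * head_dim rows, so reversing it
# with the full n_head uses the wrong block structure and scrambles the rows.
n_kv_head = 2
wk = np.random.rand(n_kv_head * head_dim, dim)
wk_hf = hf_style_permute(wk, n_kv_head)
assert np.allclose(reverse_permute(wk_hf, n_kv_head), wk)    # right head count
assert not np.allclose(reverse_permute(wk_hf, n_head), wk)   # wrong head count -> gibberish
```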
I can try it out later today; I have the Airoboros 70B finetune in HF format on my desktop, and I think the base Llama 2 as well. |
What's the level of effort on porting the permutation changes? Does it require a lot of knowledge, or is it a straight transposition? |
Here is the new HF conversion script that converts the original pth models to HF format: This is the old HF conversion script: Now, llama.cpp wants the tensors in the pth layout, so any transformations made to the HF tensors need to be reversed in convert.py. |
Hello, I think I've managed to alter the conversion script so that the converted model does not produce gibberish any more: I am not sure if it is correct, but the converted Huggingface model produces exactly the same outputs as the converted pth model when sampling is disabled. @TheBloke Could you perhaps try that out? I'm sorry for any mistakes; this is my first time contributing here. |
Wonderful, thank you! I am having dinner now but will check as soon as I am at my pc |
Nice, I was working on this as well, and what you did is very close to my approach. So, theoretically, if I'm not an idiot, that is a good sign. I'd suggest making the type for `n_kv_head` `Optional[int]` so it can default to `None`. |
Thanks for the suggestions. I just updated the code accordingly. Although I also think that the similarities are a good sign, I'm sorry that I caused duplicated work. |
Absolutely no need to apologize. I'm happy to see someone else got it done! Assuming it works, you should make a pull request with these changes! I can't test whether it actually works, but for whatever my opinion is worth, the code looks very reasonable. If you wanted to reduce duplication a bit, you could try something like:

```python
def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    dim1 = weights.shape[0]
    if n_kv_head is not None and n_kv_head != n_head:
        dim1 *= n_kv_head
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, dim1 // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))
```
|
Thank you! I should probably wait for @TheBloke's result before making a pull request, shouldn't I? I have only tested it with a Huggingface model that I converted locally from the 70B pth model using the official Huggingface script. |
if n_kv_head is used, n_head should not be divided by n_kv_head in the third parameter. Something like this looks correct to me:
|
I don't think either approach is wrong, so if you're more comfortable with that then it's perfectly fine. Pull requests have to be approved by someone before they're merged, so as long as you added a note that it was still being tested, there wouldn't be a danger of it instantly getting merged with problems. It would also be possible to create a draft pull request that can't be merged until you mark it ready for review. It's generally easier to discuss actual code changes when they're in a pull request, so if it were me I'd probably just go ahead and create one (even a draft). I finally finished downloading an HF version ( https://huggingface.co/stabilityai/StableBeluga2 - previously known as FreeWilly2; they just randomly renamed it recently ) and am converting it, but it's going to take a while and then it needs to be quantized. I'll let you know the results once it's completed and I can try to run inference. TheBloke might beat me to it even if he starts much later, though. :) |
@klosax Yes, but I think you can even remove the `dim1` variable:

```python
def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))
```

@KerfuffleV2 That's very cool! I'll make a pull request. |
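For context, here is a hedged sketch of how the reversed permute above might be applied to the Llama 2 70B attention weights. The shapes match the 70B geometry, but the variable names and call sites are illustrative assumptions rather than the actual convert.py code:

```python
# Hedged usage sketch -- names, shapes, and call sites are my own assumptions
# for illustration, not the exact keys or code used by convert.py.
import numpy as np
from typing import Optional

NDArray = np.ndarray

def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    # Same reversed permute as quoted above.
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))

# Llama 2 70B attention geometry: 64 query heads, 8 key/value heads, head_dim 128.
n_head, n_kv_head, head_dim, hidden = 64, 8, 128, 8192

wq = np.zeros((n_head * head_dim, hidden), dtype=np.float16)     # query proj: (8192, 8192)
wk = np.zeros((n_kv_head * head_dim, hidden), dtype=np.float16)  # key proj:   (1024, 8192)

# As I understand the HF export, only the query and key projections are permuted
# (for the rotary-embedding layout), so only those need reversing; the key uses
# the KV head count, which is what the n_kv_head parameter achieves.
wq_pth = permute(wq, n_head)             # reshapes as (64, 2, 64, 8192), same as before
wk_pth = permute(wk, n_head, n_kv_head)  # n_head //= 8 -> reshapes as (8, 2, 64, 8192)
```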
I'd wager the renaming was due to trademark issues. I've converted it with https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py and quantized it to Q4_K_S, and have uploaded it. It is available here (for an unknown amount of time, but at least for as long as this PR): https://storage.labs.rwx.im/llm/stable-beluga-2/stable-beluga-2-ggml/stable-beluga-2-q4_k_s-ggml.bin It'll be a bit before I can test it myself, but feel free to try the link if it's faster. It's ~36.2 GiB. SHASUMs:
|
@mkroman Your converted model works with GGML!
|
Yeah, just confirmed it myself. It looks very promising - thanks for the patch :) Output
|
Wonderful work @mj-shifu! My Stable Beluga 2 GGMLs are uploading, and I will soon do Airoboros 70B and Guanaco 70B. Thanks so much for getting this done. Amazing first contribution! :) |
Following on from discussions in the Llama 2 70B PR (#2276):
Since that PR, converting Llama 2 70B models from Meta's original PTH format files works great.
But it is not possible to make usable Llama 2 70B models from HF format. The models convert and quantise fine, but always produce gibberish, as in this example:
@klosax reports:
For reference, here are all the changes that happened in Transformers' convert_llama_weights_to_hf.py for the Llama 2 release: huggingface/transformers@07360b6#diff-110a445233a8b15a0875998eeaf75cb8607b38a5daa736291dd058766879bbdd

Would anyone be able to look into this? It's a bit beyond my experience.
I'm getting multiple requests a day for 70B fine-tune quants for FreeWilly 2, Llama2-Guanaco, and the newly released Airoboros 1.4.1 70B, and I would love to be able to provide them for people.
Thanks in advance.