
Llama 2 70B: Update needed to convert.py to support 70B HF format model files #2376

Closed
TheBloke opened this issue Jul 24, 2023 · 19 comments · Fixed by #2427
Labels: help wanted, high priority

Comments

@TheBloke
Contributor

Following on from discussions in the Llama 2 70B PR (#2276):

Since that PR, converting Llama 2 70B models from Meta's original PTH format files works great.

But it is not possible to make usable Llama 2 70B models from HF format. The models convert and quantise fine, but always produce gibberish, as in this example:

 ### Human: write a story about llamas\n### Assistant:20 300202000 B00A0

@klosax reports:

It looks like the tensors get transformed by the new permute using the GQA parameters num_local_key_value_heads and num_key_value_heads somehow:
https://github.com/huggingface/transformers/blob/b257c46a075419c09e5ce5c5aa39bc346ecdb9a5/src/transformers/models/llama/convert_llama_weights_to_hf.py#L173-L195

For reference, here are all the changes that happened in Transformers' convert_llama_weights_to_hf.py for the Llama 2 release: huggingface/transformers@07360b6#diff-110a445233a8b15a0875998eeaf75cb8607b38a5daa736291dd058766879bbdd

Would anyone be able to look into this? It's a bit beyond my experience.

I'm getting multiple requests a day for 70B fine-tune quants of FreeWilly 2, Llama2-Guanaco, and the newly released Airoboros 1.4.1 70B, and would love to be able to provide them for people.

Thanks in advance.

@klosax
Contributor

klosax commented Jul 24, 2023

It would indeed be very nice to be able to convert the 70b HF models.

I tried to look into it and figure out how it all works, but the skills needed are beyond me.

I think the permute() function in the Transformers conversion script is supposed to be reversed by the permute() in llama.cpp's convert.py, but that function is not yet compatible with Llama v2, which uses GQA.

@klosax added the help wanted and high priority labels on Jul 24, 2023
@KerfuffleV2
Collaborator

I think the permute() function in the Transformers conversion script is supposed to be reversed by the permute() in llama.cpp's convert.py

Have you tried just temporarily disabling permute() in the llama.cpp convert.py?

Basically just

def permute(weights: NDArray, n_head: int) -> NDArray:
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

to

def permute(weights: NDArray, n_head: int) -> NDArray:
    return weights

It may not work (or some kind of reshape might still be needed) but this should be a pretty easy one to at least try. I'd test it myself but I don't have the 16bit 70B at hand.

@klosax
Contributor

klosax commented Jul 26, 2023

Have you tried just temporarily disabling permute() in the llama.cpp convert.py?

No, I don't think that would work, since all pth models, including the 70B, are transformed by the HF permute(). I guess we need a new permute() in convert.py to reverse it.

@7erminalVelociraptor

Have you tried just temporarily disabling permute() in the llama.cpp convert.py? [...] It may not work (or some kind of reshape might still be needed) but this should be a pretty easy one to at least try. I'd test it myself but I don't have the 16bit 70B at hand.

I can try it out later today; I have the Airoboros 70B fine-tune in HF format on my desktop, and I think the base Llama 2 as well.

@MrJackSpade

What's the level of effort on porting the permutation changes?

Does it require a lot of knowledge, or is it a straight transposition?

@klosax
Contributor

klosax commented Jul 26, 2023

What's the level of effort on porting the permutation changes?

Here is the new HF conversion script that converts the original pth models to HF format:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

This is the old HF conversion script:

https://github.com/huggingface/transformers/blob/feb83521eca849731573dd40da89a02e4f370e5a/src/transformers/models/llama/convert_llama_weights_to_hf.py

Now, llama.cpp wants the tensors in the pth layout, so any transformation made to the HF tensors needs to be reversed in convert.py.
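
For illustration, the HF-side transformation looks roughly like this in numpy (a sketch, not the actual transformers code, which works on torch tensors with view()/transpose(); the name hf_permute is mine). For Llama 2 70B the script applies it to k_proj with n_heads set to num_key_value_heads, which is why the old convert.py permute, which only knows n_head, no longer lines up.

import numpy as np

def hf_permute(w: np.ndarray, n_heads: int) -> np.ndarray:
    # Within each head, split the rows into (head_dim // 2, 2) and swap those
    # two axes, matching the rotary-embedding layout the HF model expects.
    dim1, dim2 = w.shape
    return (w.reshape(n_heads, dim1 // n_heads // 2, 2, dim2)
             .swapaxes(1, 2)
             .reshape(dim1, dim2))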

@mj-shifu
Contributor

Hello,

I think I've managed to alter the conversion script so that the converted model does not produce gibberish any more:

https://github.com/mj-shifu/llama.cpp/blob/e15a67d6b21c10326a5cc74ab6d6ce9b8d7702bb/convert.py

I am not sure if it is correct, but the converted Hugging Face model produces exactly the same outputs as the converted pth model when sampling is disabled.

@TheBloke Could you possibly try that out?

I'm sorry for any mistakes, this is my first time contributing here.

@TheBloke
Contributor Author

Wonderful, thank you! I am having dinner now but will check as soon as I am at my PC.

@KerfuffleV2
Collaborator

I think I've managed to alter the conversion script so that the converted model does not produce gibberish any more

Nice, I was working on this as well, and what you did is very close to my approach. So, theoretically, if I'm not an idiot, that is a good sign.

I'd suggest making the type for n_kv_head Optional[int] since it gets set to None. Also, then you don't even need a conditional to populate it, you can just do n_kv_head = config.get("num_key_value_heads")
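
Something like this, as a sketch (config here stands in for the parsed config.json dict that convert.py already loads, shown with just the field that matters):

from typing import Any, Dict, Optional

# Example stand-in for the parsed config.json; 8 is the 70B value.
config: Dict[str, Any] = {"num_key_value_heads": 8}

# .get() returns None for older models whose config.json lacks the key,
# so no conditional is needed to populate it.
n_kv_head: Optional[int] = config.get("num_key_value_heads")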

@mj-shifu
Contributor

Thanks for the suggestions. I just updated the code accordingly. Although I also think the similarities are a good sign, I'm sorry for causing duplicated work.

https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py

@KerfuffleV2
Collaborator

Absolutely no need to apologize. I'm happy to see someone else got it done! Assuming it works, you should make a pull request with these changes! I can't test whether it actually works, but for whatever my opinion is worth, the code looks very reasonable.

If you wanted to reduce duplication a bit, you could try something like:

def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    dim1 = weights.shape[0]
    if n_kv_head is not None and n_kv_head != n_head:
        dim1 *= n_kv_head
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, dim1 // n_head // 2, *weights.shape[1:])
                .swapaxes(1, 2)
                .reshape(weights.shape))

@mj-shifu
Contributor

Thank you! I should probably wait for @TheBloke's result before I make a pull request, shouldn't I? I have only tested it with a Hugging Face model that I converted locally from the 70B pth model using the official Hugging Face script.

@klosax
Contributor

klosax commented Jul 27, 2023

If you wanted to reduce duplication a bit you could try something like:

If n_kv_head is used, n_head should not be divided by n_kv_head in the third parameter.

Something like this looks correct to me:

def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    dim1 = weights.shape[0] // n_head // 2

    if n_kv_head is not None and n_kv_head != n_head:
    dim1 *= n_kv_head
        n_head //= n_kv_head

    return (weights.reshape(n_head, 2, dim1, *weights.shape[1:])
                    .swapaxes(1, 2)
                    .reshape(weights.shape))

@KerfuffleV2
Collaborator

I should probably wait for @TheBloke's result before I make a pull request, shouldn't I?

I don't think either approach is wrong, so if you're more comfortable with that then it's perfectly fine.

Pull requests have to get approved by someone before they're merged, so as long as you add a note that it is still being tested, there wouldn't be a danger of it instantly getting merged with problems. It would also be possible to create a draft pull request that can't be merged until you set it ready for review.

It's generally easier to discuss actual code changes when they're in a pull request, so if it were me I'd probably just go ahead and create one (even a draft).

I finally finished downloading a HF version (https://huggingface.co/stabilityai/StableBeluga2, previously known as FreeWilly2; they just renamed it recently) and am converting it, but it's going to take a while and then it needs to be quantized. I'll let you know the results once it's completed and I can try to run inference. TheBloke might beat me to it even if he starts much later though. :)

@mj-shifu
Contributor

@klosax Yes, but I think you can even remove dim1, because dividing by (n_head // n_kv_head) is the same as dividing by n_head and multiplying by n_kv_head, which is what we want.

def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                .swapaxes(1, 2)
                .reshape(weights.shape))
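
For what it's worth, a quick numpy round-trip check with the 70B shapes (my own sketch; hf_permute below stands in for the permute in the HF conversion script rather than being the real torch code) suggests this version does undo the HF transformation:

import numpy as np
from typing import Optional

def hf_permute(w: np.ndarray, n_heads: int) -> np.ndarray:
    # Stand-in for the transformers-side permute (the real one uses torch view/transpose).
    dim1, dim2 = w.shape
    return (w.reshape(n_heads, dim1 // n_heads // 2, 2, dim2)
             .swapaxes(1, 2)
             .reshape(dim1, dim2))

def permute(weights: np.ndarray, n_head: int, n_kv_head: Optional[int] = None) -> np.ndarray:
    # The version proposed above, with NDArray spelled as np.ndarray.
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

# 70B shapes: n_head=64, n_kv_head=8, head_dim=128, n_embd=8192, so k_proj is (1024, 8192).
n_head, n_kv_head, head_dim, n_embd = 64, 8, 128, 8192
wk = np.arange(n_kv_head * head_dim * n_embd, dtype=np.float32).reshape(n_kv_head * head_dim, n_embd)

# The HF script permutes k_proj with num_key_value_heads; applying the proposed
# permute afterwards should give back the original pth-layout tensor.
assert np.array_equal(permute(hf_permute(wk, n_kv_head), n_head, n_kv_head), wk)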

@KerfuffleV2 That's very cool! I'll make a pull request.

@mkroman

mkroman commented Jul 27, 2023

I finally finished downloading a HF version (https://huggingface.co/stabilityai/StableBeluga2, previously known as FreeWilly2; they just renamed it recently) and am converting it, but it's going to take a while and then it needs to be quantized. I'll let you know the results once it's completed and I can try to run inference. TheBloke might beat me to it even if he starts much later though. :)

I'd wager the renaming was due to trademark issues.

I've converted it with https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py and quantized it for Q4_K_S and have uploaded it.

It is available here (for an unknown amount of time, but at least for as long as this PR):

https://storage.labs.rwx.im/llm/stable-beluga-2/stable-beluga-2-ggml/stable-beluga-2-q4_k_s-ggml.bin

It'll be a bit before I can test it myself, but feel free to try the link if it's faster. It's ~36.2 GiB.

SHASUMs:

a65f23c4a43fc18e2bde619d3792599e090e3a6b  stable-beluga-2-q4_k_s-ggml.bin

@mj-shifu
Contributor

@mkroman Your converted model works with GGML!

llama.cpp/build/bin/main -m stable-beluga-2-q4_k_s-ggml.bin -t 18 -gqa 8 -f prompt.txt
main: build = 917 (1a94186)
main: seed  = 1690488588
llama.cpp: loading model from stable-beluga-2-q4_k_s-ggml.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 14 (mostly Q4_K - Small)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 37635.96 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB

system_info: n_threads = 18 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 ### System:
This is a system prompt, please behave and help the user.

### User:
Write me a poem.

### Assistant:
 A gentle breeze caresses my face,
As I walk through the verdant maze,
Nature's beauty surrounds me here,
In this tranquil moment, away from fear.

The songbirds sing their harmonious tune,
And the sun casts its warm golden hue,
Dancing on leaves of lush green,
Creating magic so serene.

My heart is filled with joy and awe,
For this splendid paradise I strawl,
Here, I find love unfurled and true,
In this garden of peace, where my spirit flew. [end of text]

llama_print_timings:        load time =  1597.66 ms
llama_print_timings:      sample time =    60.97 ms /   138 runs   (    0.44 ms per token,  2263.30 tokens per second)
llama_print_timings: prompt eval time = 17151.98 ms /    37 tokens (  463.57 ms per token,     2.16 tokens per second)
llama_print_timings:        eval time = 75291.60 ms /   137 runs   (  549.57 ms per token,     1.82 tokens per second)
llama_print_timings:       total time = 92530.96 ms

@mkroman

mkroman commented Jul 27, 2023

@mkroman Your converted model works with GGML!

Yeah, just confirmed it myself. It looks very promising - thanks for the patch :)

Output
% ./main -m ./models/stable-beluga-2-q4_k_s-ggml.bin -p "$(echo -en "### System:\nYou are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal.\n\n### User:\nWhat is the most common way of transportation in Amsterdam?\n\n### Assistant:\n")" --no-mmap -n 400 -t 37 --gqa 8
main: build = 917 (1a94186)
main: seed  = 1690488472
llama.cpp: loading model from ./models/stable-beluga-2-q4_k_s-ggml.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 14 (mostly Q4_K - Small)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 37070.96 MB
llama_model_load_internal: mem required  = 37635.96 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB

system_info: n_threads = 37 / 40 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 ### System:
You are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal.

### User:
What is the most common way of transportation in Amsterdam?

### Assistant:
 The most common way of transportation in Amsterdam is cycling. The city has an extensive network of bicycle paths and a large number of cyclists, making it one of the most bike-friendly cities in the world. [end of text]

llama_print_timings:        load time = 107335.19 ms
llama_print_timings:      sample time =    41.41 ms /    50 runs   (    0.83 ms per token,  1207.32 tokens per second)
llama_print_timings: prompt eval time = 38123.01 ms /    67 tokens (  569.00 ms per token,     1.76 tokens per second)
llama_print_timings:        eval time = 33838.50 ms /    49 runs   (  690.58 ms per token,     1.45 tokens per second)
llama_print_timings:       total time = 72018.47 ms

@TheBloke
Contributor Author

Wonderful work @mj-shifu! My Stable Beluga 2 GGMLs are uploading, and I will soon do Airoboros 70B and Guanaco 70B.

Thanks so much for getting this done. Amazing first contribution! :)
