
Llama 2 70B: Update needed to convert.py to support 70B HF format model files #2376

Closed
TheBloke opened this issue Jul 24, 2023 · 19 comments · Fixed by #2427
Labels: help wanted, high priority

Comments

@TheBloke
Contributor

Following on from discussions in the Llama 2 70B PR (#2276):

Since that PR, converting Llama 2 70B models from Meta's original PTH format files works great.

But it is not possible to make usable Llama 2 70B models from HF format. The models convert and quantise fine, but always produce gibberish, as in this example:

 ### Human: write a story about llamas\n### Assistant:20 300202000 B00A0

@klosax reports:

It looks like the tensors get transformed by the new permute using the GQA parameters num_local_key_value_heads and num_key_value_heads somehow:
https://github.com/huggingface/transformers/blob/b257c46a075419c09e5ce5c5aa39bc346ecdb9a5/src/transformers/models/llama/convert_llama_weights_to_hf.py#L173-L195

For reference, here are all the changes that happened in Transformers' convert_llama_weights_to_hf.py for the Llama 2 release: huggingface/transformers@07360b6#diff-110a445233a8b15a0875998eeaf75cb8607b38a5daa736291dd058766879bbdd

Would anyone be able to look into this? It's a bit beyond my experience.

I'm getting multiple requests a day for 70B fine-tune quants of FreeWilly 2, Llama2-Guanaco, and the newly released Airoboros 1.4.1 70B, and would love to be able to provide them for people.

Thanks in advance.

@klosax
Contributor

klosax commented Jul 24, 2023

It would indeed be very nice to be able to convert the 70b HF models.

I tried to look into it and figure out how it all works, but the skills needed are beyond me.

I think the permute() function in the Transformers conversion script is supposed to be reversed by the permute() in llama.cpp's convert.py, but that function is not yet compatible with Llama v2, which uses GQA.

@klosax added the help wanted and high priority labels on Jul 24, 2023
@KerfuffleV2
Collaborator

I think the permute() function in the Transformers conversion script is supposed to be reversed by the permute() in llama.cpp's convert.py

Have you tried just temporarily disabling permute() in the llama.cpp convert.py?

Basically just

def permute(weights: NDArray, n_head: int) -> NDArray:
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

to

def permute(weights: NDArray, n_head: int) -> NDArray:
    return weights

It may not work (or some kind of reshape might still be needed) but this should be a pretty easy one to at least try. I'd test it myself but I don't have the 16bit 70B at hand.

@klosax
Contributor

klosax commented Jul 26, 2023

Have you tried just temporarily disabling permute() in the llama.cpp convert.py?

No, I don't think that would work, since all pth models, including the 70B, are transformed by the HF permute(). I guess we need a new permute() in convert.py to reverse it.

@7erminalVelociraptor

Have you tried just temporarily disabling permute() in the llama.cpp convert.py? [...] It may not work (or some kind of reshape might still be needed) but this should be a pretty easy one to at least try. I'd test it myself but I don't have the 16bit 70B at hand.

I can try it out later today; I have the Airoboros 70B fine-tune in HF format on my desktop, and I think the base Llama 2 as well.

@MrJackSpade

What's the level of effort on porting the permutation changes?

Does it require a lot of knowledge, or is it a straight transposition?

@klosax
Contributor

klosax commented Jul 26, 2023

What's the level of effort on porting the permutation changes?

Here is the new HF conversion script that converts the original pth models to HF format:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

This is the old HF conversion script:

https://github.com/huggingface/transformers/blob/feb83521eca849731573dd40da89a02e4f370e5a/src/transformers/models/llama/convert_llama_weights_to_hf.py

Now, llama.cpp wants the tensors in the pth layout, so any transformation made to the HF tensors needs to be reversed in convert.py.
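
For illustration, the HF-side transformation looks roughly like this in numpy (a sketch, not the actual transformers code, which works on torch tensors with view()/transpose(); the name hf_permute is mine). For Llama 2 70B the script applies it to k_proj with n_heads set to num_key_value_heads, which is why the old convert.py permute, which only knows n_head, no longer lines up.

import numpy as np

def hf_permute(w: np.ndarray, n_heads: int) -> np.ndarray:
    # Within each head, split the rows into (head_dim // 2, 2) and swap those
    # two axes, matching the rotary-embedding layout the HF model expects.
    dim1, dim2 = w.shape
    return (w.reshape(n_heads, dim1 // n_heads // 2, 2, dim2)
             .swapaxes(1, 2)
             .reshape(dim1, dim2))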

@mj-shifu
Contributor

Hello,

I think I've managed to alter the conversion script so that the converted model does not produce gibberish any more:

https://github.com/mj-shifu/llama.cpp/blob/e15a67d6b21c10326a5cc74ab6d6ce9b8d7702bb/convert.py

I am not sure if it is correct, but the converted Hugging Face model produces exactly the same outputs as the converted pth model when sampling is disabled.

@TheBloke Could you possibly try that out?

I'm sorry for any mistakes, this is my first time contributing here.

@TheBloke
Contributor Author

Wonderful, thank you! I am having dinner now but will check as soon as I am at my PC.

@KerfuffleV2
Collaborator

I think I've managed to alter the conversion script so that the converted model does not produce gibberish any more

Nice, I was working on this as well, and what you did is very close to my approach. So, theoretically, if I'm not an idiot, that is a good sign.

I'd suggest making the type for n_kv_head Optional[int] since it gets set to None. Also, then you don't even need a conditional to populate it, you can just do n_kv_head = config.get("num_key_value_heads")
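
Something like this, as a sketch (config here stands in for the parsed config.json dict that convert.py already loads, shown with just the field that matters):

from typing import Any, Dict, Optional

# Example stand-in for the parsed config.json; 8 is the 70B value.
config: Dict[str, Any] = {"num_key_value_heads": 8}

# .get() returns None for older models whose config.json lacks the key,
# so no conditional is needed to populate it.
n_kv_head: Optional[int] = config.get("num_key_value_heads")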

@mj-shifu
Contributor

Thanks for the suggestions. I just updated the code accordingly. Although I also think the similarities are a good sign, I'm sorry for causing duplicated work.

https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py

@KerfuffleV2
Collaborator

Absolutely no need to apologize. I'm happy to see someone else got it done! Assuming it works, you should make a pull request with these changes! I can't test whether it actually works, but for whatever my opinion is worth, the code looks very reasonable.

If you wanted to reduce duplication a bit, you could try something like:

def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    dim1 = weights.shape[0]
    if n_kv_head is not None and n_kv_head != n_head:
        dim1 *= n_kv_head
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, dim1 // n_head // 2, *weights.shape[1:])
                .swapaxes(1, 2)
                .reshape(weights.shape))

@mj-shifu
Contributor

Thank you! I should probably wait for @TheBloke's result before I make a pull request, shouldn't I? I have only tested it with a Hugging Face model that I converted locally from the 70B pth model using the official Hugging Face script.

@klosax
Contributor

klosax commented Jul 27, 2023

If you wanted to reduce duplication a bit you could try something like:

If n_kv_head is used, n_head should not be divided by n_kv_head in the third parameter.

Something like this looks correct to me:

def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    dim1 = weights.shape[0] // n_head // 2

    if n_kv_head is not None and n_kv_head != n_head:
    dim1 *= n_kv_head
        n_head //= n_kv_head

    return (weights.reshape(n_head, 2, dim1, *weights.shape[1:])
                    .swapaxes(1, 2)
                    .reshape(weights.shape))

@KerfuffleV2
Collaborator

I should probably wait for @TheBloke's result before I make a pull request, shouldn't I?

I don't think either approach is wrong, so if you're more comfortable with that then it's perfectly fine.

Pull requests have to get approved by someone before they're merged, so as long as you add a note that it is still being tested, there wouldn't be a danger of it instantly getting merged with problems. It would also be possible to create a draft pull request that can't be merged until you set it ready for review.

It's generally easier to discuss actual code changes when they're in a pull request, so if it were me I'd probably just go ahead and create one (even a draft).

I finally finished downloading a HF version (https://huggingface.co/stabilityai/StableBeluga2, previously known as FreeWilly2; they just renamed it recently) and am converting it, but it's going to take a while and then it needs to be quantized. I'll let you know the results once it's completed and I can try to run inference. TheBloke might beat me to it even if he starts much later though. :)

@mj-shifu
Contributor

@klosax Yes, but I think you can even remove dim1, because dividing by (n_head // n_kv_head) is the same as dividing by n_head and multiplying by n_kv_head, which is what we want.

def permute(weights: NDArray, n_head: int, n_kv_head: Optional[int] = None) -> NDArray:
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                .swapaxes(1, 2)
                .reshape(weights.shape))
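
For what it's worth, a quick numpy round-trip check with the 70B shapes (my own sketch; hf_permute below stands in for the permute in the HF conversion script rather than being the real torch code) suggests this version does undo the HF transformation:

import numpy as np
from typing import Optional

def hf_permute(w: np.ndarray, n_heads: int) -> np.ndarray:
    # Stand-in for the transformers-side permute (the real one uses torch view/transpose).
    dim1, dim2 = w.shape
    return (w.reshape(n_heads, dim1 // n_heads // 2, 2, dim2)
             .swapaxes(1, 2)
             .reshape(dim1, dim2))

def permute(weights: np.ndarray, n_head: int, n_kv_head: Optional[int] = None) -> np.ndarray:
    # The version proposed above, with NDArray spelled as np.ndarray.
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

# 70B shapes: n_head=64, n_kv_head=8, head_dim=128, n_embd=8192, so k_proj is (1024, 8192).
n_head, n_kv_head, head_dim, n_embd = 64, 8, 128, 8192
wk = np.arange(n_kv_head * head_dim * n_embd, dtype=np.float32).reshape(n_kv_head * head_dim, n_embd)

# The HF script permutes k_proj with num_key_value_heads; applying the proposed
# permute afterwards should give back the original pth-layout tensor.
assert np.array_equal(permute(hf_permute(wk, n_kv_head), n_head, n_kv_head), wk)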

@KerfuffleV2 That's very cool! I'll make a pull request.

@mkroman

mkroman commented Jul 27, 2023

I finally finished downloading a HF version (https://huggingface.co/stabilityai/StableBeluga2, previously known as FreeWilly2; they just renamed it recently) and am converting it, but it's going to take a while and then it needs to be quantized. I'll let you know the results once it's completed and I can try to run inference. TheBloke might beat me to it even if he starts much later though. :)

I'd wager the renaming was due to trademark issues.

I've converted it with https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py and quantized it for Q4_K_S and have uploaded it.

It is available here (for an unknown amount of time, but at least for as long as this PR):

https://storage.labs.rwx.im/llm/stable-beluga-2/stable-beluga-2-ggml/stable-beluga-2-q4_k_s-ggml.bin

It'll be a bit before I can test it myself, but feel free to try the link if it's faster. It's ~36.2 GiB.

SHASUMs:

a65f23c4a43fc18e2bde619d3792599e090e3a6b  stable-beluga-2-q4_k_s-ggml.bin

@mj-shifu
Contributor

@mkroman Your converted model works with GGML!

llama.cpp/build/bin/main -m stable-beluga-2-q4_k_s-ggml.bin -t 18 -gqa 8 -f prompt.txt
main: build = 917 (1a94186)
main: seed  = 1690488588
llama.cpp: loading model from stable-beluga-2-q4_k_s-ggml.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 14 (mostly Q4_K - Small)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 37635.96 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB

system_info: n_threads = 18 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 ### System:
This is a system prompt, please behave and help the user.

### User:
Write me a poem.

### Assistant:
 A gentle breeze caresses my face,
As I walk through the verdant maze,
Nature's beauty surrounds me here,
In this tranquil moment, away from fear.

The songbirds sing their harmonious tune,
And the sun casts its warm golden hue,
Dancing on leaves of lush green,
Creating magic so serene.

My heart is filled with joy and awe,
For this splendid paradise I strawl,
Here, I find love unfurled and true,
In this garden of peace, where my spirit flew. [end of text]

llama_print_timings:        load time =  1597.66 ms
llama_print_timings:      sample time =    60.97 ms /   138 runs   (    0.44 ms per token,  2263.30 tokens per second)
llama_print_timings: prompt eval time = 17151.98 ms /    37 tokens (  463.57 ms per token,     2.16 tokens per second)
llama_print_timings:        eval time = 75291.60 ms /   137 runs   (  549.57 ms per token,     1.82 tokens per second)
llama_print_timings:       total time = 92530.96 ms

@mkroman

mkroman commented Jul 27, 2023

@mkroman Your converted model works with GGML!

Yeah, just confirmed it myself. It looks very promising - thanks for the patch :)

Output
% ./main -m ./models/stable-beluga-2-q4_k_s-ggml.bin -p "$(echo -en "### System:\nYou are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal.\n\n### User:\nWhat is the most common way of transportation in Amsterdam?\n\n### Assistant:\n")" --no-mmap -n 400 -t 37 --gqa 8
main: build = 917 (1a94186)
main: seed  = 1690488472
llama.cpp: loading model from ./models/stable-beluga-2-q4_k_s-ggml.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 7168
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 14 (mostly Q4_K - Small)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 37070.96 MB
llama_model_load_internal: mem required  = 37635.96 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB

system_info: n_threads = 37 / 40 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 ### System:
You are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal.

### User:
What is the most common way of transportation in Amsterdam?

### Assistant:
 The most common way of transportation in Amsterdam is cycling. The city has an extensive network of bicycle paths and a large number of cyclists, making it one of the most bike-friendly cities in the world. [end of text]

llama_print_timings:        load time = 107335.19 ms
llama_print_timings:      sample time =    41.41 ms /    50 runs   (    0.83 ms per token,  1207.32 tokens per second)
llama_print_timings: prompt eval time = 38123.01 ms /    67 tokens (  569.00 ms per token,     1.76 tokens per second)
llama_print_timings:        eval time = 33838.50 ms /    49 runs   (  690.58 ms per token,     1.45 tokens per second)
llama_print_timings:       total time = 72018.47 ms

@TheBloke
Contributor Author

Wonderful work @mj-shifu! My Stable Beluga 2 GGMLs are uploading, and I will soon do Airoboros 70B and Guanaco 70B.

Thanks so much for getting this done. Amazing first contribution! :)
