Re-quantization of a split gguf file produces "invalid split file" #6548

Closed
he29-net opened this issue Apr 8, 2024 · 11 comments · Fixed by #6688
Labels
bug (Something isn't working) · good first issue (Good for newcomers) · split (GGUF split model sharding)

Comments

he29-net commented Apr 8, 2024

Hi, while testing the #6491 branch, I downloaded a Q8_0 quant (split into 3 files) from dranger003 and re-quantized it to Q2_K_S to make it more digestible for my museum hardware:

./quantize --allow-requantize --imatrix ../models/ggml-c4ai-command-r-plus-104b-f16-imatrix.dat ../models/ggml-c4ai-command-r-plus-104b-q8_0-00001-of-00003.gguf ../models/command-r-plus-104b-Q2_K_S.gguf Q2_K_S 2

I only passed the first shard, but ./quantize processed it correctly and produced a single file of the expected size. However, it apparently did not update some metadata, and ./main still thinks the result is a split file:

./main -m ../models/command-r-plus-104b-Q2_K_S.gguf -t 15 --color -p "this is a test" -c 2048 -ngl 25 -ctk q8_0
...
llama_model_load: error loading model: invalid split file: ../models/command-r-plus-104b-Q2_K_S.gguf
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../models/command-r-plus-104b-Q2_K_S.gguf'
main: error: unable to load model

As a workaround, it is possible to "reset" the metadata by doing a "dummy pass" of gguf-split:

./gguf-split --split-max-tensors 999 --split ../models/command-r-plus-104b-Q2_K_S.gguf ../models/command-r-plus-104b-Q2_K_S.gguf.split

The resulting file then seems to be working fine.
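In case it helps with debugging, here is a minimal sketch (not part of llama.cpp) that dumps the leftover split metadata using the gguf C API from ggml.h; the key name "split.count" and its u16 type are assumptions based on what gguf-split writes:

```cpp
// Minimal sketch: print the leftover split metadata of a .gguf file.
// Assumes the "split.count" key written by gguf-split, stored as u16.
#include <cstdio>
#include "ggml.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = {
        /* .no_alloc = */ true,   // only read metadata, do not load tensor data
        /* .ctx      = */ NULL,
    };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to open %s\n", argv[1]);
        return 1;
    }

    const int kid = gguf_find_key(ctx, "split.count");
    if (kid >= 0) {
        printf("split.count = %u\n", (unsigned) gguf_get_val_u16(ctx, kid));
    } else {
        printf("no split metadata found\n");
    }

    gguf_free(ctx);
    return 0;
}
```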

It's probably an easy fix, but after a quick grep through the source and a look at quantize.cpp, I realized I don't even know where to start, so it would probably be much easier and faster for someone who knows the code base to fix it.

ggerganov added the bug (Something isn't working) and good first issue (Good for newcomers) labels and removed the bug-unconfirmed label on Apr 8, 2024
AlexsCode (Contributor) commented:

From a newbie perspective, it appears that LLM_KV_SPLIT_COUNT retains the value from when the model was split.

https://github.com/ggerganov/llama.cpp/blob/cc4a95426d17417d3c83f12bdb514fbe8abe2a88/llama.cpp#L2942-L2956

In this instance LLM_KV_SPLIT_COUNT is clearly greater than 1.

We then see that https://github.com/ggerganov/llama.cpp/blob/cc4a95426d17417d3c83f12bdb514fbe8abe2a88/llama.cpp#L2954 checks that the file name ends in the format "-%05d-of-%05d.gguf". Since quantization produced a single output, the file is no longer named that way (command-r-plus-104b-Q2_K_S.gguf), so the check fails.
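A simplified sketch of that check, for illustration only (not the actual llama.cpp code; the key name, its u16 type, and the suffix handling are assumptions based on the lines linked above):

```cpp
// Simplified sketch of the loader-side check described above; not the actual
// llama.cpp implementation. Assumes "split.count" (LLM_KV_SPLIT_COUNT) is a
// u16 and that the first shard must end in "-00001-of-%05d.gguf".
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>
#include "ggml.h"

static void check_split_suffix(struct gguf_context * ctx, const std::string & fname) {
    uint16_t n_split = 1;
    const int kid = gguf_find_key(ctx, "split.count");
    if (kid >= 0) {
        n_split = gguf_get_val_u16(ctx, kid);
    }
    if (n_split <= 1) {
        return; // single file, nothing to check
    }

    // Build the suffix expected for the first shard of an n_split model.
    char expected[64];
    snprintf(expected, sizeof(expected), "-%05d-of-%05d.gguf", 1, (int) n_split);
    const size_t len = strlen(expected);

    // A re-quantized single file keeps split.count > 1 but has lost the shard
    // suffix, so this comparison fails and the loader reports the error above.
    if (fname.size() < len || fname.compare(fname.size() - len, len, expected) != 0) {
        throw std::runtime_error("invalid split file: " + fname);
    }
}
```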

I'll endeavour to look into why LLM_KV_SPLIT_COUNT retains its count when I get the chance.

phymbert added the split (GGUF split model sharding) label on Apr 8, 2024
phymbert (Collaborator) commented Apr 8, 2024

Yes, I see two solutions:

  • quantize should also generate shards if the model is loaded from a split
  • quantize must remove the split metadata (a sketch of this option follows below)
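A minimal sketch of the second option, assuming the gguf API offers a key-removal helper named gguf_remove_key and that gguf-split writes the keys listed below; both are assumptions for illustration, not the actual fix:

```cpp
// Minimal sketch of option 2 (strip split metadata in quantize). Assumes a
// gguf_remove_key() helper exists in the gguf API and that gguf-split writes
// the keys "split.no", "split.count" and "split.tensors.count".
#include "ggml.h"

static void strip_split_metadata(struct gguf_context * ctx_out) {
    const char * split_keys[] = {
        "split.no",            // LLM_KV_SPLIT_NO
        "split.count",         // LLM_KV_SPLIT_COUNT
        "split.tensors.count", // LLM_KV_SPLIT_TENSORS_COUNT
    };
    for (const char * key : split_keys) {
        if (gguf_find_key(ctx_out, key) >= 0) {
            gguf_remove_key(ctx_out, key); // assumed helper; drop stale split info
        }
    }
}
```

In quantize, something like this could run on the output gguf_context just before the header is written, so a single-file output no longer claims to be one shard of many.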

phymbert (Collaborator) commented Apr 9, 2024

Probably a duplicate of:

zj040045 (Contributor) commented:

@phymbert I'm working on it. Is it better to support both?

  • for models quantized from splits to a single file, quantize removes the split metadata
  • for models quantized from splits to splits, quantize generates shards and everything works out of the box

4cecoder commented:

How do I combine the shards?

phymbert (Collaborator) commented:

> How do I combine the shards?

You can use the --merge operation of gguf-split, but it is no longer necessary: loading a model from shards is now built in.

zj040045 (Contributor) commented:

The fix has been merged in #6591.

he29-net (Author) commented:

Thanks; closing the issue as fixed then. 👍

phymbert reopened this on Apr 12, 2024
phymbert (Collaborator) commented:

No, reopening because I think the target should be a split version after quantize.

he29-net (Author) commented:

OK, no problem. I thought of that solution more as a new feature, while this issue was more about resolving the bug (producing invalid files).

As for splitting during quantization: I would guess that most splits are currently done only to fit shards under the 50 GB Hugging Face upload limit, and after quantization the output will often already fit within that single-file limit. So I would argue the default behavior should be no splitting after quantization, since a) the split is probably unnecessary, or b) the user will probably want a different number of shards anyway.

zj040045 (Contributor) commented:

@he29-net I'll create another PR to generate "a split version after quantize". It will be optional, so it won't affect the default behavior.
