
ggml : PoC for normalizing weights for better quantization packing #2434

Draft · ggerganov wants to merge 3 commits into master

Conversation

@ggerganov (Owner) commented Jul 28, 2023

This is a proof-of-concept for an alternative packing of the quantization scaling factors. The idea is to pre-normalize the model tensors with row-wise scaling factors that transform the weights in each row into the range [-1 .. 1]. Knowing that the weights are bounded to this range, the delta and min factors of the quantizations can be represented with fewer bits.
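For illustration, here is a minimal standalone sketch of such a pre-normalization step (hypothetical code, not the PR's actual implementation; the function and variable names are made up):

```c
// Sketch: scale each row by its absolute maximum so the weights end up in
// [-1 .. 1], and keep one F32 scaling factor per row to undo this later.
#include <math.h>
#include <stddef.h>

static void normalize_rows(float * w, float * row_scale, size_t n_rows, size_t n_cols) {
    for (size_t i = 0; i < n_rows; ++i) {
        float amax = 0.0f;
        for (size_t j = 0; j < n_cols; ++j) {
            const float v = fabsf(w[i*n_cols + j]);
            if (v > amax) {
                amax = v;
            }
        }
        row_scale[i] = amax; // stored alongside the quantized tensor in F32
        const float inv = amax > 0.0f ? 1.0f/amax : 0.0f;
        for (size_t j = 0; j < n_cols; ++j) {
            w[i*n_cols + j] *= inv; // now bounded in [-1 .. 1]
        }
    }
}
```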

This PR demonstrates the technique for Q4_0, Q4_1 and Q5_1. CPU + CUDA + Metal implementations are provided.
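As a rough illustration of where the size savings come from, a hypothetical Q4_1-style block could store d and m as 8-bit values relative to the per-row scale instead of two F16s. This layout is an assumption made for illustration, not necessarily the exact packing used in the PR:

```c
#include <stdint.h>

#define QK4_1 32

// Existing Q4_1 block: 2 x F16 (d, m) + 16 bytes of nibbles = 20 bytes.
// Hypothetical row-normalized variant: with the row bounded in [-1 .. 1],
// d and m can be stored in 8 bits each (their bit widths sum to 16),
// giving 18 bytes per block plus one F32 scale per row.
typedef struct {
    uint8_t d;             // delta, reduced precision relative to the row scale
    uint8_t m;             // min,   reduced precision relative to the row scale
    uint8_t qs[QK4_1 / 2]; // 4-bit quants, two per byte, as in regular Q4_1
} block_q4_1_rownorm;      // hypothetical name
```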

Here are some sample results:

| Model | Type | Branch | Size | PPL    |
| ----- | ---- | ------ | ---- | ------ |
| 7Bv1  | Q4_0 | master | 3.6G | 6.1249 |
| 7Bv1  | Q4_0 | PR     | 3.5G | 6.1276 |
| 7Bv1  | Q4_1 | master | 4.0G | 6.0667 |
| 7Bv1  | Q4_1 | PR     | 3.6G | 6.0716 |
| 7Bv2  | Q4_1 | master | 4.0G | 6.0019 |
| 7Bv2  | Q4_1 | PR     | 3.6G | 5.9701 |
| 7Bv1  | Q5_1 | master | 4.8G | 5.9432 |
| 7Bv1  | Q5_1 | PR     | 4.2G | 5.9618 |
• x64 CPU

master:

| model                 | size     | params | backend | threads | test   | t/s          |
| --------------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     | 8       | tg 128 | 15.85 ± 0.02 |
| llama2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CPU     | 8       | tg 128 | 10.91 ± 0.02 |
| llama2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CPU     | 8       | tg 128 | 10.15 ± 0.01 |

build: 71d6975 (1129)

PR:

| model                 | size     | params | backend | threads | test   | t/s          |
| --------------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama2 7B mostly Q4_0 | 3.41 GiB | 6.74 B | CPU     | 8       | tg 128 | 15.71 ± 0.10 |
| llama2 7B mostly Q4_1 | 3.60 GiB | 6.74 B | CPU     | 8       | tg 128 | 15.05 ± 0.02 |
| llama2 7B mostly Q5_1 | 4.16 GiB | 6.74 B | CPU     | 8       | tg 128 | 13.06 ± 0.01 |
• CUDA

master:

| model                 | size     | params | backend | ngl | threads | test   | t/s            |
| --------------------- | -------- | ------ | ------- | --- | ------- | ------ | -------------- |
| llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 3456.53 ± 3.70 |
| llama2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2979.64 ± 3.96 |
| llama2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2737.19 ± 2.68 |
| llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 129.65 ± 0.02  |
| llama2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 120.66 ± 0.02  |
| llama2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 103.79 ± 0.01  |

build: 71d6975 (1129)

PR:

| model                 | size     | params | backend | ngl | threads | test   | t/s            |
| --------------------- | -------- | ------ | ------- | --- | ------- | ------ | -------------- |
| llama2 7B mostly Q4_0 | 3.41 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 3258.49 ± 1.55 |
| llama2 7B mostly Q4_1 | 3.60 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2727.44 ± 5.07 |
| llama2 7B mostly Q5_1 | 4.16 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2536.55 ± 5.60 |
| llama2 7B mostly Q4_0 | 3.41 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 124.90 ± 0.03  |
| llama2 7B mostly Q4_1 | 3.60 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 119.31 ± 0.04  |
| llama2 7B mostly Q5_1 | 4.16 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 106.89 ± 0.01  |

build: 8c2b881 (1131)

M2 Ultra (Metal):

• master

| model                 | size     | params  | backend | ngl | threads | test   | t/s           |
| --------------------- | -------- | ------- | ------- | --- | ------- | ------ | ------------- |
| LLaMA 7B mostly Q4_0  | 3.56 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 629.99 ± 0.39 |
| LLaMA 7B mostly Q4_1  | 3.95 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 631.83 ± 0.48 |
| LLaMA 7B mostly Q4_0  | 3.56 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 86.83 ± 0.01  |
| LLaMA 7B mostly Q4_1  | 3.95 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 82.44 ± 0.02  |
| LLaMA 13B mostly Q4_0 | 6.86 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 367.60 ± 0.13 |
| LLaMA 13B mostly Q4_1 | 7.61 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 370.46 ± 0.10 |
| LLaMA 13B mostly Q4_0 | 6.86 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 54.71 ± 0.02  |
| LLaMA 13B mostly Q4_1 | 7.61 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 50.94 ± 0.02  |
• PR

| model                 | size     | params  | backend | ngl | threads | test   | t/s           |
| --------------------- | -------- | ------- | ------- | --- | ------- | ------ | ------------- |
| LLaMA 7B mostly Q4_0  | 3.41 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 631.35 ± 0.44 |
| LLaMA 7B mostly Q4_1  | 3.60 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 629.73 ± 0.23 |
| LLaMA 7B mostly Q4_0  | 3.41 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 82.53 ± 0.06  |
| LLaMA 7B mostly Q4_1  | 3.60 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 81.37 ± 0.04  |
| LLaMA 13B mostly Q4_0 | 6.54 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 370.89 ± 0.12 |
| LLaMA 13B mostly Q4_1 | 6.91 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 369.56 ± 0.02 |
| LLaMA 13B mostly Q4_0 | 6.54 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 52.99 ± 0.02  |
| LLaMA 13B mostly Q4_1 | 6.91 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 51.60 ± 0.03  |

PPL Q4_1 with Metal: 6.0702 +/- 0.03387

In these examples all tensors are quantized with the specified quantization, except output.weight and tok_embeddings.weight, which are quantized with Q6_K.

The technique is compatible with all quantization methods. It is quite cheap in terms of computation - the extra ggml_mul() operation is applied to the matrix multiplication results with a broadcast across rows. The implementation requires very few changes to the existing code - we only compute d and m in a different way, everything else stays the same. For each tensor we add an extra tensor with the row scaling factors stored in F32. The number of bits used for d and m can be easily adjusted, as long as they sum to 8 or 16.
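A minimal sketch of how the extra broadcasted multiply could look in a ggml graph (the tensor names here are hypothetical; ggml_mul_mat() and ggml_mul() are the existing ops referred to above):

```c
// w_q        : quantized weight tensor whose rows were pre-normalized to [-1 .. 1]
// w_row_scale: the extra F32 tensor holding one scaling factor per row
// x          : activations
struct ggml_tensor * cur = ggml_mul_mat(ctx, w_q, x); // same matmul as before
cur = ggml_mul(ctx, cur, w_row_scale);                // broadcast across rows to undo the normalization
```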

It is also possible to extend the normalization step to transform the weights into the [0 .. 1] range, using an extra normalization tensor containing row-wise offset factors. I tried it and it works, though I don't think it is worth the extra effort.
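For completeness, a sketch of that optional variant under the same assumptions as the earlier snippet (again hypothetical code, not the PR's): each row gets both a scale and an offset so the weights land in [0 .. 1].

```c
#include <stddef.h>

// Affine row normalization: w' = (w - min) / (max - min), so w' is in [0 .. 1].
// The per-row offset (min) and scale (max - min) are kept in F32 to undo this
// after the matrix multiplication, analogous to the [-1 .. 1] case.
static void normalize_rows_01(float * w, float * row_scale, float * row_offset,
                              size_t n_rows, size_t n_cols) {
    for (size_t i = 0; i < n_rows; ++i) {
        float min = w[i*n_cols], max = w[i*n_cols];
        for (size_t j = 1; j < n_cols; ++j) {
            const float v = w[i*n_cols + j];
            if (v < min) min = v;
            if (v > max) max = v;
        }
        row_offset[i] = min;
        row_scale[i]  = max - min;
        const float inv = max > min ? 1.0f/(max - min) : 0.0f;
        for (size_t j = 0; j < n_cols; ++j) {
            w[i*n_cols + j] = (w[i*n_cols + j] - min) * inv; // now in [0 .. 1]
        }
    }
}
```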

@ggerganov added the "demo" label (Demonstrate some concept or idea, not intended to be merged) on Jul 28, 2023
@ikawrakow (Contributor):
Do you use LLAMA_CUDA_FORCE_DMMV=ON on CUDA? If I use LLAMA_CUDA_FORCE_DMMV=OFF, I get

CUDA error 716 at /home/iwan/other/llama.cpp/tmp/llama.cpp/ggml-cuda.cu:3526: misaligned address

on this branch with Q4_1. If I set LLAMA_CUDA_FORCE_DMMV=ON, for 7B I get 66.6 tokens per second for Q4_0 and 70 tokens per second for Q4_1 on a 4080. In comparison, I get 126 t/s for Q4_0 and 114 t/s for Q4_1 on master, so this would be a 60-90% performance regression. Q4_0 still works with LLAMA_CUDA_FORCE_DMMV=OFF; with that I get 111 t/s, so ~14% slower compared to master. Is this 14% performance penalty due to the extra ggml_mul() operation, or is it due to something else?

@JohannesGaessler (Collaborator):
In #2160 I am changing how 8-bit integers are loaded as 32-bit integers. The gist of it is that if the int8 array is aligned to at least 2 bytes, then it's faster to load it via two 16-bit integers than it is to use memcpy (1-2% total token generation speedup; the fastest way is still to directly cast the int8 array to an int32 pointer). This would not work with block sizes that are not aligned, but I think the VRAM savings would be worth it. At some point it may be worthwhile to look into rearranging the blocks into larger blocks that are aligned to at least 4 bytes in order to improve effective memory bandwidth.
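As a rough sketch of the loading trick described above (illustrative only, not the actual code from #2160):

```c
#include <stdint.h>
#include <string.h>

// Load 4 consecutive int8 values as one 32-bit integer.
// Generic version: memcpy works for any alignment.
static inline int32_t get_int_memcpy(const int8_t * x8, int i32) {
    int32_t x;
    memcpy(&x, x8 + 4*i32, sizeof(x));
    return x;
}

// Faster version when the int8 data is known to be aligned to at least
// 2 bytes: assemble the 32-bit value from two 16-bit loads (little-endian).
static inline int32_t get_int_aligned(const int8_t * x8, int i32) {
    const uint16_t * x16 = (const uint16_t *)(x8 + 4*i32); // requires 2-byte alignment
    return (int32_t)x16[0] | ((int32_t)x16[1] << 16);
}
```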

@JohannesGaessler (Collaborator):
@ggerganov what is the intended time scale for this change?

@ggerganov (Owner, Author) commented Jul 28, 2023

@ikawrakow

I noticed that Q4_0 CUDA is slower with this PR - not sure why. I suspect the ggml_mul() is not optimal. For Q4_1 and Q5_1 I didn't run text generation with CUDA at all due to the misaligned address error - I forgot about the LLAMA_CUDA_FORCE_DMMV option. I only ran PPL, since it does not rely on the problematic kernels.

I did a brief speed test on the CPU, both ARM and x64, and there was no speed difference with Q4_0. I think on x64 this branch was even faster than master with 8 threads.

I don't think the ggml_mul() will be a bottleneck if optimized properly, but it is definitely something to confirm in more detail. I could be wrong.

@JohannesGaessler

I don't plan to prioritize this. I want to first finish the GGUF change, which should happen in the second half of August.
After that, if this approach is viable at all (we still need to do more tests), we can start thinking about upstreaming it. I just had the idea of how to implement the weight normalization and was curious to give it a quick try before the vacation.

@ikawrakow (Contributor):
@JohannesGaessler Yes, I understand why Q4_1 does not work with this PR and LLAMA_CUDA_FORCE_DMMV=OFF, and it is easy to fix. With the fix I get 107 t/s for Q4_1, so in this case only a ~7% performance penalty. And this is basically my question: is the 14% or 7% performance penalty due to the unaligned memory access, or is it due to the extra ggml_mul() operation, or a combination of the two?

@ikawrakow (Contributor):
@ggerganov I just pushed a fix for the misaligned memory access, so you can now run TG with Q4_1.

@ggerganov (Owner, Author):
Perfect! Thanks

@JohannesGaessler (Collaborator) commented Jul 28, 2023

> is the 14% or 7% performance penalty due to the unaligned memory access, or is it due to the extra ggml_mul() operation, or a combination of the two?

Probably both but I suspect the main factor is memory alignment and maybe slightly less optimal data types.

@klosax (Contributor) commented Aug 5, 2023

I suggest keeping old quantization formats for compatibility and user choice. Users have their own hardware constraints and preferences about model quality vs speed vs memory usage.

In PR #1508 the Q4_0 format could have been kept and renamed to something like Q4_0_fp32d, and the new one named Q4_0_fp16d. The format in this PR could be named Q4_0_rownorm. And if new formats get new identifiers instead of replacing the old ones, couldn't the quantization version metadata be dropped?

Since the NN weights can be distributed quite differently across different models (not only LLMs), you have to test different quantization formats to see which one works best for a given model - the more formats you can test, the better.

Maybe there should be no more breaking changes to the model data once GGUF is released?

@JohannesGaessler (Collaborator):
While that's understandable from a user perspective, you should keep in mind that backwards compatibility is not free. Each additional quantization format increases the amount of work required to roll out new features. And I think the time of developers is much more valuable than the time of users, especially for a project that is still relatively new. I also think that an optimized version of this approach will be universally better than the current implementation.

@klosax (Contributor) commented Aug 5, 2023

> And I think the time of developers is much more valuable than the time of users, especially for a project that is still relatively new.

Even if this project is relatively new, the number of users has exploded at the same rate as the general use of LLMs. A quick search for "ggml" on Hugging Face returns some 1000 model folders, and many of them, like @TheBloke's, contain the models in several quantization formats.

As @philpax commented here #1408 (comment), it is a bad idea to reuse the old quantization names, since it is confusing and frustrating for end-users trying to find compatible model files. A typical model file could be named llama-65b-ggml-q8_0.bin, and the user could wait hours for it to download only to find out that the file is somehow no longer compatible. @LostRuins even decided to keep 100% backwards compatibility for user convenience in kobold.cpp #1408 (comment).

@JohannesGaessler (Collaborator):
The number of users is not an argument. The exact same number of users will have to deal with a breaking change as would benefit from developers being more productive. In fact, if the number of users continues to grow, this only weighs more heavily on the side of making a breaking change that causes trouble once but permanently reduces the amount of work that needs to be spent on maintenance.

@philpax commented Aug 5, 2023

I don't mind if you break the format/lose compatibility, but please use a different name. As @klosax says, it is extremely confusing and frustrating for users to download a model for hours or even days only to discover that it's not compatible with the version of GGML they're using, especially for older models.

We've seen this play out several times already, and it leads to a lot of user frustration in old and new GitHub issues. Most people are not paying attention to the vagaries of llama.cpp versioning. TheBloke is doing a fantastic job of including the file format version in the model filenames, but not everyone is, and if it's not in the name of the quantisation format, people are unlikely to know there's a difference.

Call it q4_0-v2 or something and/or drop support for q4_0, but please don't reuse the names again!

@klosax (Contributor) commented Aug 5, 2023

> In fact, if the number of users continues to grow, this only weighs more heavily on the side of making a breaking change that causes trouble once but permanently reduces the amount of work that needs to be spent on maintenance.

If the project needs increasingly more maintenance for each quantization format added, maybe the layout is not optimal. A start could be to separate the quantization formats from ggml.c, like what @ikawrakow did when adding the k-quants.

@LostRuins (Collaborator):
I agree with @philpax that disambiguation is probably the most important thing, since correctly supporting two different formats without being able to tell them apart makes things that much more difficult. Dropping support for ancient formats due to the maintenance burden, as @JohannesGaessler mentioned, is perfectly understandable - just make sure these incompatibilities are easy to detect and handle for future clients (so definitely do not reuse deprecated IDs or identifiers, e.g. Q4_3).

I think when evaluating our options, it is helpful to examine other software that faces similar problems, such as video codecs, which have gone through decades of evolution. You have VP2 all the way to VP9, each version extending and iterating on the previous, dropping and adding capabilities as things change, and a good video player is able to differentiate and handle all of them because they are properly versioned.

```
@@ -17200,7 +17230,13 @@ struct ggml_cplan ggml_graph_plan(struct ggml_cgraph * cgraph, int n_threads) {
                    }
                } break;
            case GGML_OP_SILU_BACK:
                {
                    n_tasks = n_threads;
                } break;
```
Review comment (Collaborator):

You could move GGML_OP_SILU_BACK to after GGML_OP_MUL to avoid the duplicate code.
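A minimal standalone illustration of that suggestion (not the actual ggml code): grouping the two case labels lets them share one body instead of duplicating it.

```c
// Hypothetical demo of the case fall-through pattern being suggested.
enum demo_op { DEMO_OP_MUL, DEMO_OP_SILU_BACK, DEMO_OP_OTHER };

static int demo_n_tasks(enum demo_op op, int n_threads) {
    int n_tasks = 1;
    switch (op) {
        case DEMO_OP_MUL:
        case DEMO_OP_SILU_BACK: // falls through: same task count as MUL
            {
                n_tasks = n_threads;
            } break;
        case DEMO_OP_OTHER:
            {
                n_tasks = 1;
            } break;
    }
    return n_tasks;
}
```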

@KerfuffleV2 (Collaborator):
I don't suppose there would be a way to convert from the old format (even with a slight quality loss)?
