
ggml : PoC for normalizing weights for better quantization packing #2434

Draft · ggerganov wants to merge 3 commits into master

Conversation

@ggerganov (Owner) commented Jul 28, 2023

This is a proof-of-concept for an alternative packing of the quantization scaling factors. The idea is to pre-normalize the model tensors with row-wise scaling factors that transform the weights in each row into the range [-1 .. 1]. Knowing that the weights are bounded to this range, the delta and min factors of the quantizations can be represented with fewer bits.
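For illustration, here is a minimal standalone sketch of such a pre-normalization step (hypothetical code, not the PR's actual implementation; the function and variable names are made up):

```c
// Sketch: scale each row by its absolute maximum so the weights end up in
// [-1 .. 1], and keep one F32 scaling factor per row to undo this later.
#include <math.h>
#include <stddef.h>

static void normalize_rows(float * w, float * row_scale, size_t n_rows, size_t n_cols) {
    for (size_t i = 0; i < n_rows; ++i) {
        float amax = 0.0f;
        for (size_t j = 0; j < n_cols; ++j) {
            const float v = fabsf(w[i*n_cols + j]);
            if (v > amax) {
                amax = v;
            }
        }
        row_scale[i] = amax; // stored alongside the quantized tensor in F32
        const float inv = amax > 0.0f ? 1.0f/amax : 0.0f;
        for (size_t j = 0; j < n_cols; ++j) {
            w[i*n_cols + j] *= inv; // now bounded in [-1 .. 1]
        }
    }
}
```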

This PR demonstrates the technique for Q4_0, Q4_1 and Q5_1. CPU + CUDA + Metal implementations are provided.
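As a rough illustration of where the size savings come from, a hypothetical Q4_1-style block could store d and m as 8-bit values relative to the per-row scale instead of two F16s. This layout is an assumption made for illustration, not necessarily the exact packing used in the PR:

```c
#include <stdint.h>

#define QK4_1 32

// Existing Q4_1 block: 2 x F16 (d, m) + 16 bytes of nibbles = 20 bytes.
// Hypothetical row-normalized variant: with the row bounded in [-1 .. 1],
// d and m can be stored in 8 bits each (their bit widths sum to 16),
// giving 18 bytes per block plus one F32 scale per row.
typedef struct {
    uint8_t d;             // delta, reduced precision relative to the row scale
    uint8_t m;             // min,   reduced precision relative to the row scale
    uint8_t qs[QK4_1 / 2]; // 4-bit quants, two per byte, as in regular Q4_1
} block_q4_1_rownorm;      // hypothetical name
```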

Here are some sample results:

| Model | Type | Branch | Size | PPL    |
| ----- | ---- | ------ | ---- | ------ |
| 7Bv1  | Q4_0 | master | 3.6G | 6.1249 |
| 7Bv1  | Q4_0 | PR     | 3.5G | 6.1276 |
| 7Bv1  | Q4_1 | master | 4.0G | 6.0667 |
| 7Bv1  | Q4_1 | PR     | 3.6G | 6.0716 |
| 7Bv2  | Q4_1 | master | 4.0G | 6.0019 |
| 7Bv2  | Q4_1 | PR     | 3.6G | 5.9701 |
| 7Bv1  | Q5_1 | master | 4.8G | 5.9432 |
| 7Bv1  | Q5_1 | PR     | 4.2G | 5.9618 |
• x64 CPU

master:

| model                 | size     | params | backend | threads | test   | t/s          |
| --------------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     | 8       | tg 128 | 15.85 ± 0.02 |
| llama2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CPU     | 8       | tg 128 | 10.91 ± 0.02 |
| llama2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CPU     | 8       | tg 128 | 10.15 ± 0.01 |

build: 71d6975 (1129)

PR:

| model                 | size     | params | backend | threads | test   | t/s          |
| --------------------- | -------- | ------ | ------- | ------- | ------ | ------------ |
| llama2 7B mostly Q4_0 | 3.41 GiB | 6.74 B | CPU     | 8       | tg 128 | 15.71 ± 0.10 |
| llama2 7B mostly Q4_1 | 3.60 GiB | 6.74 B | CPU     | 8       | tg 128 | 15.05 ± 0.02 |
| llama2 7B mostly Q5_1 | 4.16 GiB | 6.74 B | CPU     | 8       | tg 128 | 13.06 ± 0.01 |
• CUDA

master:

| model                 | size     | params | backend | ngl | threads | test   | t/s            |
| --------------------- | -------- | ------ | ------- | --- | ------- | ------ | -------------- |
| llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 3456.53 ± 3.70 |
| llama2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2979.64 ± 3.96 |
| llama2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2737.19 ± 2.68 |
| llama2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 129.65 ± 0.02  |
| llama2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 120.66 ± 0.02  |
| llama2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 103.79 ± 0.01  |

build: 71d6975 (1129)

PR:

| model                 | size     | params | backend | ngl | threads | test   | t/s            |
| --------------------- | -------- | ------ | ------- | --- | ------- | ------ | -------------- |
| llama2 7B mostly Q4_0 | 3.41 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 3258.49 ± 1.55 |
| llama2 7B mostly Q4_1 | 3.60 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2727.44 ± 5.07 |
| llama2 7B mostly Q5_1 | 4.16 GiB | 6.74 B | CUDA    | 99  | 1       | pp 512 | 2536.55 ± 5.60 |
| llama2 7B mostly Q4_0 | 3.41 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 124.90 ± 0.03  |
| llama2 7B mostly Q4_1 | 3.60 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 119.31 ± 0.04  |
| llama2 7B mostly Q5_1 | 4.16 GiB | 6.74 B | CUDA    | 99  | 1       | tg 128 | 106.89 ± 0.01  |

build: 8c2b881 (1131)

M2 Ultra (Metal):

• master

| model                 | size     | params  | backend | ngl | threads | test   | t/s           |
| --------------------- | -------- | ------- | ------- | --- | ------- | ------ | ------------- |
| LLaMA 7B mostly Q4_0  | 3.56 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 629.99 ± 0.39 |
| LLaMA 7B mostly Q4_1  | 3.95 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 631.83 ± 0.48 |
| LLaMA 7B mostly Q4_0  | 3.56 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 86.83 ± 0.01  |
| LLaMA 7B mostly Q4_1  | 3.95 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 82.44 ± 0.02  |
| LLaMA 13B mostly Q4_0 | 6.86 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 367.60 ± 0.13 |
| LLaMA 13B mostly Q4_1 | 7.61 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 370.46 ± 0.10 |
| LLaMA 13B mostly Q4_0 | 6.86 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 54.71 ± 0.02  |
| LLaMA 13B mostly Q4_1 | 7.61 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 50.94 ± 0.02  |
• PR

| model                 | size     | params  | backend | ngl | threads | test   | t/s           |
| --------------------- | -------- | ------- | ------- | --- | ------- | ------ | ------------- |
| LLaMA 7B mostly Q4_0  | 3.41 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 631.35 ± 0.44 |
| LLaMA 7B mostly Q4_1  | 3.60 GiB | 6.74 B  | Metal   | 1   | 4       | pp 512 | 629.73 ± 0.23 |
| LLaMA 7B mostly Q4_0  | 3.41 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 82.53 ± 0.06  |
| LLaMA 7B mostly Q4_1  | 3.60 GiB | 6.74 B  | Metal   | 1   | 4       | tg 128 | 81.37 ± 0.04  |
| LLaMA 13B mostly Q4_0 | 6.54 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 370.89 ± 0.12 |
| LLaMA 13B mostly Q4_1 | 6.91 GiB | 13.02 B | Metal   | 1   | 4       | pp 512 | 369.56 ± 0.02 |
| LLaMA 13B mostly Q4_0 | 6.54 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 52.99 ± 0.02  |
| LLaMA 13B mostly Q4_1 | 6.91 GiB | 13.02 B | Metal   | 1   | 4       | tg 128 | 51.60 ± 0.03  |

PPL Q4_1 with Metal: 6.0702 +/- 0.03387

In these examples all tensors are quantized with the specified quantization, except output.weight and tok_embeddings.weight, which are quantized with Q6_K.

The technique is compatible with all quantization methods. It is quite cheap in terms of computation - the extra ggml_mul() operation is applied to the matrix multiplication results with a broadcast across rows. The implementation requires very few changes to the existing code - we only compute d and m in a different way, everything else stays the same. For each tensor we add an extra tensor with the row scaling factors stored in F32. The number of bits used for d and m can be easily adjusted, as long as they sum to 8 or 16.
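A minimal sketch of how the extra broadcasted multiply could look in a ggml graph (the tensor names here are hypothetical; ggml_mul_mat() and ggml_mul() are the existing ops referred to above):

```c
// w_q        : quantized weight tensor whose rows were pre-normalized to [-1 .. 1]
// w_row_scale: the extra F32 tensor holding one scaling factor per row
// x          : activations
struct ggml_tensor * cur = ggml_mul_mat(ctx, w_q, x); // same matmul as before
cur = ggml_mul(ctx, cur, w_row_scale);                // broadcast across rows to undo the normalization
```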

It is also possible to extend the normalization step to transform the weights into the [0 .. 1] range, using an extra normalization tensor containing row-wise offset factors. I tried it and it works, though I don't think it is worth the extra effort.
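For completeness, a sketch of that optional variant under the same assumptions as the earlier snippet (again hypothetical code, not the PR's): each row gets both a scale and an offset so the weights land in [0 .. 1].

```c
#include <stddef.h>

// Affine row normalization: w' = (w - min) / (max - min), so w' is in [0 .. 1].
// The per-row offset (min) and scale (max - min) are kept in F32 to undo this
// after the matrix multiplication, analogous to the [-1 .. 1] case.
static void normalize_rows_01(float * w, float * row_scale, float * row_offset,
                              size_t n_rows, size_t n_cols) {
    for (size_t i = 0; i < n_rows; ++i) {
        float min = w[i*n_cols], max = w[i*n_cols];
        for (size_t j = 1; j < n_cols; ++j) {
            const float v = w[i*n_cols + j];
            if (v < min) min = v;
            if (v > max) max = v;
        }
        row_offset[i] = min;
        row_scale[i]  = max - min;
        const float inv = max > min ? 1.0f/(max - min) : 0.0f;
        for (size_t j = 0; j < n_cols; ++j) {
            w[i*n_cols + j] = (w[i*n_cols + j] - min) * inv; // now in [0 .. 1]
        }
    }
}
```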

@ggerganov added the "demo" label (Demonstrate some concept or idea, not intended to be merged) on Jul 28, 2023
@ikawrakow (Contributor):
Do you use LLAMA_CUDA_FORCE_DMMV=ON on CUDA? If I use LLAMA_CUDA_FORCE_DMMV=OFF, I get

CUDA error 716 at /home/iwan/other/llama.cpp/tmp/llama.cpp/ggml-cuda.cu:3526: misaligned address

on this branch with Q4_1. If I set LLAMA_CUDA_FORCE_DMMV=ON, for 7B I get 66.6 tokens per second for Q4_0 and 70 tokens per second for Q4_1 on a 4080. In comparison, I get 126 t/s for Q4_0 and 114 t/s for Q4_1 on master, so this would be a 60-90% performance regression. Q4_0 still works with LLAMA_CUDA_FORCE_DMMV=OFF; with that I get 111 t/s, so ~14% slower compared to master. Is this 14% performance penalty due to the extra ggml_mul() operation, or is it due to something else?

@JohannesGaessler (Collaborator):
In #2160 I am changing how 8-bit integers are loaded as 32-bit integers. The gist of it is that if the int8 array is aligned to at least 2 bytes, then it's faster to load it via two 16-bit integers than it is to use memcpy (1-2% total token generation speedup; the fastest way is still to directly cast the int8 array to an int32 pointer). This would not work with block sizes that are not aligned, but I think the VRAM savings would be worth it. At some point it may be worthwhile to look into rearranging the blocks into larger blocks that are aligned to at least 4 bytes in order to improve effective memory bandwidth.
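As a rough sketch of the loading trick described above (illustrative only, not the actual code from #2160):

```c
#include <stdint.h>
#include <string.h>

// Load 4 consecutive int8 values as one 32-bit integer.
// Generic version: memcpy works for any alignment.
static inline int32_t get_int_memcpy(const int8_t * x8, int i32) {
    int32_t x;
    memcpy(&x, x8 + 4*i32, sizeof(x));
    return x;
}

// Faster version when the int8 data is known to be aligned to at least
// 2 bytes: assemble the 32-bit value from two 16-bit loads (little-endian).
static inline int32_t get_int_aligned(const int8_t * x8, int i32) {
    const uint16_t * x16 = (const uint16_t *)(x8 + 4*i32); // requires 2-byte alignment
    return (int32_t)x16[0] | ((int32_t)x16[1] << 16);
}
```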

@JohannesGaessler (Collaborator):
@ggerganov what is the intended time scale for this change?

@ggerganov (Owner, Author) commented Jul 28, 2023

@ikawrakow

I noticed that Q4_0 CUDA is slower with this PR - not sure why. I suspect the ggml_mul() is not optimal. For Q4_1 and Q5_1 I didn't run text generation with CUDA at all due to the misaligned address error - I forgot about the LLAMA_CUDA_FORCE_DMMV option. I only ran PPL, since it does not rely on the problematic kernels.

I did a brief speed test on the CPU, both ARM and x64, and there was no speed difference with Q4_0. I think on x64 this branch was even faster than master with 8 threads.

I don't think the ggml_mul() will be a bottleneck if optimized properly, but it is definitely something to confirm in more detail. I could be wrong.

@JohannesGaessler

I don't plan to prioritize this. I want to first finish the GGUF change, which should happen in the second half of August.
After that, if this approach is viable at all (we still need to do more tests), we can start thinking about upstreaming it. I just had the idea of how to implement the weight normalization and was curious to give it a quick try before the vacation.

@ikawrakow (Contributor):
@JohannesGaessler Yes, I understand why Q4_1 does not work with this PR and LLAMA_CUDA_FORCE_DMMV=OFF, and it is easy to fix. With the fix I get 107 t/s for Q4_1, so in this case only a ~7% performance penalty. And this is basically my question: is the 14% or 7% performance penalty due to the unaligned memory access, or is it due to the extra ggml_mul() operation, or a combination of the two?

@ikawrakow (Contributor):
@ggerganov I just pushed a fix for the misaligned memory access, so you can now run TG with Q4_1.

@ggerganov (Owner, Author):
Perfect! Thanks

@JohannesGaessler (Collaborator) commented Jul 28, 2023

> is the 14% or 7% performance penalty due to the unaligned memory access, or is it due to the extra ggml_mul() operation, or a combination of the two?

Probably both but I suspect the main factor is memory alignment and maybe slightly less optimal data types.

@klosax (Contributor) commented Aug 5, 2023

I suggest keeping old quantization formats for compatibility and user choice. Users have their own hardware constraints and preferences about model quality vs speed vs memory usage.

In PR #1508 the Q4_0 format could have been kept and renamed to something like Q4_0_fp32d, and the new one named Q4_0_fp16d. The format in this PR could be named Q4_0_rownorm. And if new formats get new identifiers instead of replacing the old ones, couldn't the quantization version metadata be dropped?

Since the NN weights can be distributed quite differently across different models (not only LLMs), you have to test different quantization formats to see which one works best for a given model - the more formats you can test, the better.

Maybe there should be no more breaking changes to the model data once GGUF is released?

@JohannesGaessler (Collaborator):
While that's understandable from a user perspective, you should keep in mind that backwards compatibility is not free. Each additional quantization format increases the amount of work required to roll out new features. And I think the time of developers is much more valuable than the time of users, especially for a project that is still relatively new. I also think that an optimized version of this approach will be universally better than the current implementation.

@klosax (Contributor) commented Aug 5, 2023

> And I think the time of developers is much more valuable than the time of users, especially for a project that is still relatively new.

Even if this project is relatively new, the number of users has exploded at the same rate as the general use of LLMs. A quick search for "ggml" on Hugging Face returns some 1000 model folders, and many of them, like @TheBloke's, contain the models in several quantization formats.

As @philpax commented here #1408 (comment), it is a bad idea to reuse the old quantization names, since it is confusing and frustrating for end-users trying to find compatible model files. A typical model file could be named llama-65b-ggml-q8_0.bin, and the user could wait hours for it to download only to find out that the file is somehow no longer compatible. @LostRuins even decided to keep 100% backwards compatibility for user convenience in kobold.cpp #1408 (comment).

@JohannesGaessler (Collaborator):
The number of users is not an argument. The exact same number of users will have to deal with a breaking change as would benefit from developers being more productive. In fact, if the number of users continues to grow, this only weighs more heavily on the side of making a breaking change that causes trouble once but permanently reduces the amount of work that needs to be spent on maintenance.

@philpax commented Aug 5, 2023

I don't mind if you break the format/lose compatibility, but please use a different name. As @klosax says, it is extremely confusing and frustrating for users to download a model for hours or even days only to discover that it's not compatible with the version of GGML they're using, especially for older models.

We've seen this play out several times already, and it leads to a lot of user frustration in old and new GitHub issues. Most people are not paying attention to the vagaries of llama.cpp versioning. TheBloke is doing a fantastic job of including the file format version in the model filenames, but not everyone is, and if it's not in the name of the quantisation format, people are unlikely to know there's a difference.

Call it q4_0-v2 or something and/or drop support for q4_0, but please don't reuse the names again!

@klosax (Contributor) commented Aug 5, 2023

> In fact, if the number of users continues to grow, this only weighs more heavily on the side of making a breaking change that causes trouble once but permanently reduces the amount of work that needs to be spent on maintenance.

If the project needs increasingly more maintenance for each quantization format added, maybe the layout is not optimal. A start could be to separate the quantization formats from ggml.c, like what @ikawrakow did when adding the k-quants.

@LostRuins (Collaborator):
I agree with @philpax that disambiguation is probably the most important thing, since correctly supporting two different formats without being able to tell them apart makes things that much more difficult. Dropping support for ancient formats due to the maintenance burden, as @JohannesGaessler mentioned, is perfectly understandable - just make sure these incompatibilities are easy to detect and handle for future clients (so definitely do not reuse deprecated IDs or identifiers, e.g. Q4_3).

I think when evaluating our options, it is helpful to examine other software that faces similar problems, such as video codecs, which have gone through decades of evolution. You have VP2 all the way to VP9, each version extending and iterating on the previous, dropping and adding capabilities as things change, and a good video player is able to differentiate and handle all of them because they are properly versioned.

```
@@ -17200,7 +17230,13 @@ struct ggml_cplan ggml_graph_plan(struct ggml_cgraph * cgraph, int n_threads) {
                    }
                } break;
            case GGML_OP_SILU_BACK:
                {
                    n_tasks = n_threads;
                } break;
```
Review comment (Collaborator):

You could move GGML_OP_SILU_BACK to after GGML_OP_MUL to avoid the duplicate code.
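A minimal standalone illustration of that suggestion (not the actual ggml code): grouping the two case labels lets them share one body instead of duplicating it.

```c
// Hypothetical demo of the case fall-through pattern being suggested.
enum demo_op { DEMO_OP_MUL, DEMO_OP_SILU_BACK, DEMO_OP_OTHER };

static int demo_n_tasks(enum demo_op op, int n_threads) {
    int n_tasks = 1;
    switch (op) {
        case DEMO_OP_MUL:
        case DEMO_OP_SILU_BACK: // falls through: same task count as MUL
            {
                n_tasks = n_threads;
            } break;
        case DEMO_OP_OTHER:
            {
                n_tasks = 1;
            } break;
    }
    return n_tasks;
}
```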

@KerfuffleV2 (Collaborator):
I don't suppose there would be a way to convert from the old format (even with a slight quality loss)?
