SOTA 2-bit quants #4773

Merged
merged 17 commits into from
Jan 8, 2024

ikawrakow
Contributor

TL;DR

This PR adds a new "true" 2-bit quantization (because it is implemented within the block-wise quantization approach of ggml/llama.cpp, it ends up using 2.0625 bpw; see below for more details). The approach achieves perplexities similar to QuIP# (see this discussion). Inference performance is reasonable, but not great. E.g., for a 7B LLaMA model for TG-128 I measure 155 t/s on an RTX-4080 using CUDA, 54 t/s on a 30-core M2 Max using Metal, and 24.3 t/s on a Ryzen 5975WX CPU. In comparison, for Q4_0, I have 130 t/s (CUDA), 63.5 t/s (Metal), and 15.4 t/s (Ryzen).

Caveats

  • Quantization is not provided (but see heavily commented quantization source code below). For now I'll publish quantized models on HF. Adding the quantization functions is a fairly big change, so this may be added later.
  • Not all back-ends are provided by this PR. I have implemented the necessary functions/kernels for CPU without SIMD, AVX2, ARM_NEON, Metal, and CUDA (but only tested with cuBLAS on RTX-4080).
  • Quantized matrix multiplications on CUDA are not (yet) implemented, and neither is the LLAMA_CUDA_FORCE_DMMV option (is this still being used?)

Perplexities

The table gives some sample results for quantized model sizes and perplexities. Note that a context length of 4096 has been used for these results (and not 512, as typically done when reporting PPL in this repo). Model sizes are in GiB, not GB, as this is more relevant to see if a model will load into a device with limited RAM/VRAM.

| Model | File size (GiB) | PPL |
|---|---|---|
| Mistral-7B | 1.855 | 6.446 |
| LLaMA-v2-7B | 1.728 | 7.067 |
| LLaMA-v2-13B | 3.295 | 5.728 |
| LLaMA-v2-70B | 17.03 | 4.079 |

Some details

I have borrowed 2 ideas from QuIP#:

  • They force an even number of positive (or negative) quant signs in a group of 8 quants. This allows using 7 bits to record the signs of 8 quants. This looked strange to me at first sight (as one will obviously always have cases where the number of positive/negative weights in a group of 8 is odd), but after some more thought it is actually quite brilliant: one can always flip the sign of the most "unimportant" quant in a group of 8, if necessary, to ensure an even number of positive/negative signs. As usual, the devil is in the details. The QuIP# authors do not specify how they choose the quant to sign-flip. After some experimentation, I have settled on using the quant with minimum w * x^2, where x is the model weight and w is the importance of this weight derived from embedding statistics obtained in a calibration run (the "importance matrix").
  • They use the E8 lattice (see https://en.wikipedia.org/wiki/E8_lattice) to encode the magnitude of the quants in a group of 8. For 3 allowed quant values (1/2, 3/2, 5/2) there are 3^8 = 6561 grid points. After taking into account the E8 requirement that the sum of the grid point coordinates is even, one is left with 3281 possible choices. For a 2-bit quantization one has at most 512 slots available (8 quants x 2 bits = 16 bits minus 7 bits spent on encoding signs, so 9 bits left), so one needs to select a subset of the possible E8 lattice points. The QuIP# authors use all points within an 8D-sphere with radius sqrt(10) (227 grid points) plus another 29 hand-picked points within radius sqrt(12), for a total of 256 grid points. They then use the remaining one bit to record a +/- 1/4 shift for the group of 8. Instead, I have used the following approach to pick the 256 grid points, which gives a completely different set of points from theirs (they call it a "codebook"):
    • Perform quantization using all possible E8 lattice points on a bunch of models
    • Count how many times each of the 3281 points occurred in the quantized models
    • Select the 256 points such that a) The total count of selected points is maximized and b) The maximum distance of not selected points to the closest selected point is minimized
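
As a sanity check on the counting in the second bullet, here is a minimal sketch that enumerates all 3^8 magnitude combinations and keeps those with an even coordinate sum. The function names are hypothetical and not part of the PR; coordinates are stored doubled (1, 3, 5 instead of 1/2, 3/2, 5/2), so "the sum is even" becomes "the doubled sum is divisible by 4".

```c
#include <stdio.h>

// Count the points of {1/2, 3/2, 5/2}^8 whose coordinate sum is even.
// Coordinates are kept doubled (1, 3, 5), so the condition becomes
// "doubled sum divisible by 4".
int count_even_sum_points(void) {
    static const int vals[3] = {1, 3, 5};
    int count = 0;
    for (int code = 0; code < 6561; ++code) {   // 3^8 combinations
        int c = code, sum2 = 0;
        for (int i = 0; i < 8; ++i) {           // decode 8 base-3 digits
            sum2 += vals[c % 3];
            c /= 3;
        }
        if (sum2 % 4 == 0) ++count;
    }
    return count;
}

// Bits left for the magnitude index of a group of 8 quants after
// spending sign_bits on the signs.
int index_bits_per_group(int bits_per_quant, int sign_bits) {
    return 8 * bits_per_quant - sign_bits;
}
```

Calling `count_even_sum_points()` reproduces the 3281 candidates, and `index_bits_per_group(2, 7)` gives the 9 bits (512 slots) mentioned above, which is why a subset of the candidates has to be selected.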

I have not used any of the "fancy" QuIP# stuff (incoherent processing, Hadamard transforms, and such). Instead, I utilize the battle-tested block scaling that is used for all other ggml quants. Recall that we have spent 7 bits to record the signs and 8 bits to record the grid point index ("codebook") for a group of 8 quants, so we have one spare bit, which gives 4 spare bits in a block of 32. I use this for a 4-bit scale, ending up with exactly 64 bits per block of 32, so a "true" 2-bit quantization. I have verified in my private development repo that using only one additional floating point scale per tensor row is sufficient; this adds less than 0.01 bpw. But the change required in ggml for a row-wise quantization is quite significant, so for now I'm adding this new quantization type within the existing ggml block-wise framework. As with k-quants, there are super-blocks of 256 weights, and each such super-block has one fp16 super-block scale, so this adds 16/256 = 0.0625 bits per weight to what is achievable in a row-wise quantization.
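
To make the bit budget concrete, here is a rough sketch of how one block of 32 weights fits into two 32-bit words. The struct and function names are hypothetical; the field positions follow the description above and the shifts used in the quantization code in this PR (8-bit grid indices in the first word, 7-bit sign fields plus the 4-bit scale in the second).

```c
#include <stdint.h>

// Hypothetical container for the fields of one block of 32 weights.
typedef struct {
    uint8_t grid_index[4];  // codebook index per group of 8 (0..255)
    uint8_t signs[4];       // 7-bit sign field per group of 8 (0..127)
    uint8_t scale;          // 4-bit block scale (0..15)
} block32_fields;

// Word 0: four 8-bit grid indices. Word 1: four 7-bit sign fields
// in bits 0..27 and the 4-bit scale in bits 28..31.
void pack_block32(const block32_fields *f, uint32_t out[2]) {
    out[0] = 0;
    out[1] = 0;
    for (int k = 0; k < 4; ++k) {
        out[0] |= (uint32_t)f->grid_index[k] << (8 * k);
        out[1] |= (uint32_t)(f->signs[k] & 127) << (7 * k);
    }
    out[1] |= (uint32_t)(f->scale & 15) << 28;
}

void unpack_block32(const uint32_t in[2], block32_fields *f) {
    for (int k = 0; k < 4; ++k) {
        f->grid_index[k] = (uint8_t)((in[0] >> (8 * k)) & 255);
        f->signs[k]      = (uint8_t)((in[1] >> (7 * k)) & 127);
    }
    f->scale = (uint8_t)(in[1] >> 28);
}

// Bits per weight: block payload plus the share of the super-block scale.
double bits_per_weight(int block_bits, int block_size,
                       int superblock_scale_bits, int superblock_size) {
    return (double)block_bits / block_size
         + (double)superblock_scale_bits / superblock_size;
}
```

With 64 bits per block of 32 and one 16-bit scale per 256 weights, `bits_per_weight(64, 32, 16, 256)` evaluates to 2.0625, matching the figure in the TL;DR.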

For reference I'm also adding the quantization function from my private development repo that is being used to generate quantized models for this quantization type:

Quantization source code `

// * x points to the model weights to be quantized (input)
// * vy points to the buffer where the quantized weights will be stored (output)
// * n is the number of weights to be quantized
// * quant_weights is the importance matrix. In my case it is very simple, it contains just diagonal elements
//   stored consecutively
static void quantize_row_iq2_xxs_impl(const float * restrict x, void * restrict vy, int n, const float * restrict quant_weights) {

// kgrid_q2xs, kmap_q2xs, kneighbors_q2xs need to be initialized before the first call to this function
// Some are quite big (especially kneighbors_q2xs), so it does not make sense to have a separate
// copy for each thread in a multi-threaded quantization run.
// * kgrid_q2xs contains the 256 selected E8-lattice points (so 256 x 8 uint8_t's)
// * kmap_q2xs maps a 16-bit index for 8 quants, 2 bits per quant, to a corresponding grid point, or -1 if these 16
//   bits do not represent a point on the grid
// * kneighbors_q2xs contains the set of closest neighbors that are in the set of selected points for each of the 3281
//   possible E8-lattice points
GGML_ASSERT(kgrid_q2xs);   
GGML_ASSERT(kmap_q2xs);
GGML_ASSERT(kneighbors_q2xs);
GGML_ASSERT(n%QK_K == 0);

const int kMaxQ = 3;

const int nbl = n/256;  // number of super-blocks

block_iq2_xxs * y = vy;

float scales[QK_K/32];  // stores the  scales of 32-weight blocks in a super-block
float weight[32];            // stores importances
float xval[32];                //  model values after sign flips in a block of 32
int8_t L[32];                  //  quants in a block of 32
int8_t Laux[32];            //  tmp quants
float  waux[32];            //  auxiliary weight
bool   is_on_grid[4];         // flag for each of the 4 groups of 8 in a block of 32 indicating if the point was on the grid
bool   is_on_grid_aux[4]; // same as above for tmp usage
uint8_t block_signs[4];    //  sign bits for each group of 8 in a block of 32
uint32_t q2[2*(QK_K/32)]; // we will record the quantized data here before copying into the block_iq2_xxs struct

for (int ibl = 0; ibl < nbl; ++ibl) { // for each super-block

    y[ibl].d = GGML_FP32_TO_FP16(0.f);
    memset(q2, 0, QK_K/4);

    float max_scale = 0;

    const float * xbl = x + QK_K*ibl;  // the model weights for this super block
    float sumx2 = 0;
    for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
    float sigma2 = sumx2/QK_K;  // the weight variance in this super-block

    for (int ib = 0; ib < QK_K/32; ++ib) { // for each block of 32 in this super-block 
        const float * xb = xbl + 32*ib;       // model weights for this block
        if (quant_weights) {
            // it is helpful to augment the importance matrix in the following way  
            const float * qw = quant_weights + QK_K*ibl + 32*ib;
            for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
        } else {
            // OK, that's a version without an importance matrix, but don't even think about using it.
            // For 2-bit quantization you will get total garbage without the importance matrix
            for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
        }
        // We use this augmented weight when searching for closest neighbors of points not on the grid
        for (int i = 0; i < 32; ++i) waux[i] = sqrtf(weight[i]);
        // Sign flips
        for (int k = 0; k < 4; ++k) {
            int nflip = 0;
            uint8_t s = 0;
            for (int i = 0; i < 8; ++i) {
                if (xb[8*k + i] >= 0) xval[8*k + i] = xb[8*k + i];
                else {
                    xval[8*k + i] = -xb[8*k + i]; ++nflip; s |= (1 << i);
                }
            }
            if (nflip%2) {
                // We have an odd number of negative weights. Need to flip one sign, so lets find the least important quant
                int imin = 0; float min = weight[8*k+imin]*xb[8*k+imin]*xb[8*k+imin];
                for (int i = 1; i < 8; ++i) {
                    float ax = weight[8*k+i]*xb[8*k+i]*xb[8*k+i];
                    if (ax < min) {
                        min = ax; imin = i;
                    }
                }
                // flip the sign
                xval[8*k+imin] = -xval[8*k+imin];
                s ^= (1 << imin);
            }
            block_signs[k] = s & 127;
        }
        // after the above loop, we have decided on sign flips and xval contains the "signless" model weights
        // for the block against which we will be comparing from here on.
        // Find max value in the block
        float max = xval[0];
        for (int i = 1; i < 32; ++i) max = MAX(max, xval[i]);
        if (!max) {
            // all zeros - nothing to do
            scales[ib] = 0;
            memset(L, 0, 32);
            continue;
        }
        // Now lets try to find the best scale for this block
        // We try a bunch of scales around max/5. Why 5 ? Because our possible quants are 1, 3, 5 
        float best = 0;
        float scale = max/(2*kMaxQ-1);
        for (int is = -9; is <= 9; ++is) {
            float id = (2*kMaxQ-1+is*0.1f)/max; // inverse scale
            float this_scale = 1/id;
            for (int k = 0; k < 4; ++k) { // for each group of 8
                // get the quants via RTN
                for (int i = 0; i < 8; ++i) {
                    int l = nearest_int(0.5f*(id*xval[8*k+i]-1));
                    Laux[8*k+i] = MAX(0, MIN(kMaxQ-1, l));
                }
                // convert the quants to a 16-bit integer and lookup the corresponding grid point 
                uint16_t u = 0;
                for (int i = 0; i < 8; ++i) u |= (Laux[8*k+i] << 2*i);
                int grid_index = kmap_q2xs[u];
                is_on_grid_aux[k] = true;
                if (grid_index < 0) {
                    // Not on the grid of selected points. Need to find a neighbouring grid point
                    is_on_grid_aux[k] = false;
                    // the set of closest neighbors we will use.
                    // We could check all points on the grid, but that would be much too slow
                    // in a loop over 19 scale choices
                    const uint16_t * neighbours = kneighbors_q2xs - kmap_q2xs[u] - 1;
                    int num_neighbors = neighbours[0];
                    GGML_ASSERT(num_neighbors > 0);
                    // find the "closest" neighbor
                    float best_d2 = FLT_MAX;
                    for (int j = 1; j <= num_neighbors; ++j) {
                        const int8_t * pg = (const int8_t *)(kgrid_q2xs + neighbours[j]);
                        float d2 = 0;
                        for (int i = 0; i < 8; ++i) {
                            float q = pg[i];
                            float diff = this_scale*q - xval[8*k + i];
                            d2 += waux[8*k+i]*diff*diff;
                        }
                        if (d2 < best_d2) {
                            best_d2 = d2; grid_index = neighbours[j];
                        }
                    }
                    const int8_t * pg = (const int8_t *)(kgrid_q2xs + grid_index);
                    // store the closest neighbor quant values into Laux
                    for (int i = 0; i < 8; ++i) Laux[8*k+i] = (pg[i] - 1)/2;
                }
            }
            // OK, now we have in Laux quants that are all one of the 256 selected points from the E8 lattice
            // time to find the best scale via weighted RMSE minimization   
            float sumqx = 0, sumq2 = 0;
            for (int i = 0; i < 32; ++i) {
                float w = weight[i];
                float q = 2*Laux[i] + 1;
                sumqx += w*xval[i]*q;
                sumq2 += w*q*q;
            }
            if (sumq2 > 0 && sumqx*sumqx > best*sumq2) {
                // this set of quants is better than what we had so far => store the quants/scale
                scale = sumqx/sumq2; best = scale*sumqx;
                for (int i = 0; i < 32; ++i) L[i] = Laux[i];
                for (int k = 0; k <  4; ++k) is_on_grid[k] = is_on_grid_aux[k];
            }
        }
        int n_not_ongrid = 0;
        for (int k = 0; k < 4; ++k) if (!is_on_grid[k]) ++n_not_ongrid;
        if (n_not_ongrid > 0 && scale > 0) {
            // some of the points we selected weren't on the grid and we are using their closest grid neighbor
            // it is a good idea to re-quantize those with our final block scale 
            float id = 1/scale;
            for (int k = 0; k < 4; ++k) {
                if (is_on_grid[k]) continue;
                uint16_t u = 0;
                for (int i = 0; i < 8; ++i) {
                    int l = nearest_int(0.5f*(id*xval[8*k+i]-1));
                    l = MAX(0, MIN(kMaxQ-1, l));
                    u |= (l << 2*i);
                }
                int grid_index = kmap_q2xs[u];
                if (grid_index < 0) {
                    // still not on the grid, so find the closest neighbor (same as above)
                    const uint16_t * neighbours = kneighbors_q2xs - kmap_q2xs[u] - 1;
                    int num_neighbors = neighbours[0];
                    GGML_ASSERT(num_neighbors > 0);
                    float best_d2 = FLT_MAX;
                    for (int j = 1; j <= num_neighbors; ++j) {
                        const int8_t * pg = (const int8_t *)(kgrid_q2xs + neighbours[j]);
                        float d2 = 0;
                        for (int i = 0; i < 8; ++i) {
                            float q = pg[i];
                            float diff = scale*q - xval[8*k + i];
                            d2 += waux[8*k+i]*diff*diff;
                        }
                        if (d2 < best_d2) {
                            best_d2 = d2; grid_index = neighbours[j];
                        }
                    }
                }
                const int8_t * pg = (const int8_t *)(kgrid_q2xs + grid_index);
                // update this group of 8 quants
                for (int i = 0; i < 8; ++i) L[8*k+i] = (pg[i] - 1)/2;
            }
            // determine the best scale again via weighted RMSE minimization
            float sumqx = 0, sumq2 = 0;
            for (int i = 0; i < 32; ++i) {
                float w = weight[i];
                float q = 2*L[i] + 1;
                sumqx += w*xval[i]*q;
                sumq2 += w*q*q;
            }
            if (sumq2 > 0) scale = sumqx/sumq2;
        }
        // This should not actually happen, but just in case: flip the scale (so it is positive) and
        // correspondingly flip the quant signs in the block
        if (scale < 0) {
            scale = -scale;
            for (int k = 0; k < 4; ++k) block_signs[k] = (~block_signs[k]) & 127;
        }
        // encode the quants
        for (int k = 0; k < 4; ++k) {
            uint16_t u = 0;
            for (int i = 0; i < 8; ++i) u |= (L[8*k+i] << 2*i);
            int grid_index = kmap_q2xs[u];
            if (grid_index < 0) {
                printf("Oops: found point %u not on grid:", u);
                for (int i = 0; i < 8; ++i) printf(" %d", L[8*k+i]);
                printf("\n");
                GGML_ASSERT(false);
            }
            q2[2*ib+0] |= (grid_index << 8*k);
            q2[2*ib+1] |= (block_signs[k] << 7*k);
        }
        GGML_ASSERT(scale >= 0);
        scales[ib] = scale;
        max_scale = MAX(max_scale, scale);
    }

    if (!max_scale) {
        // all weights in the super-block were zero, so nothing to do
        memset(y[ibl].qs, 0, QK_K/4);
        continue;
    }

    // super-block scale
    float d = max_scale/31;
    y[ibl].d = GGML_FP32_TO_FP16(d);
    float id = 1/d;
    // Now quantize the block scales to 4 bits
    float sumqx = 0, sumq2 = 0;
    for (int ib = 0; ib < QK_K/32; ++ib) {
        int l = nearest_int(0.5f*(id*scales[ib]-1));
        l = MAX(0, MIN(15, l));
        // add the quantized scale to the encoded quant data
        q2[2*ib+1] |= ((uint32_t)l << 28);
        // In principle we are done here. We get a minor improvement by re-quantizing after
        // scaling with the quantized scales and re-optimizing the super-block scale
        // To do so, we need again the block importances
        const float * xb = xbl + 32*ib;
        if (quant_weights) {
            const float * qw = quant_weights + QK_K*ibl + 32*ib;
            for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
        } else {
            for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
        }
        const uint8_t * aux8 = (const uint8_t *)(q2 + 2*ib);
        // The actual scale for this block
        const float db = d * (1 + 2*l);
        uint32_t u = 0;
        for (int k = 0; k < 4; ++k) {
            const int8_t * signs = keven_signs_q2xs + 8*((q2[2*ib+1] >> 7*k) & 127);
            const float * xk = xb + 8*k;
            const float * wk = weight + 8*k;
            const uint8_t * grid = (const uint8_t *)(kgrid_q2xs + aux8[k]);
            float best_mse = 0; int best_index = aux8[k];
            for (int j = 0; j < 8; ++j) {
                float diff = db * grid[j] * signs[j] - xk[j];
                best_mse += wk[j] * diff * diff;
            }
            // As we are doing this just once, we can afford to check all 256 points
            for (int idx = 0; idx < 256; ++idx) {
                grid = (const uint8_t *)(kgrid_q2xs + idx);
                float mse = 0;
                for (int j = 0; j < 8; ++j) {
                    float diff = db * grid[j] * signs[j] - xk[j];
                    mse += wk[j] * diff * diff;
                }
                if (mse < best_mse) {
                    best_mse = mse; best_index = idx;
                }
            }
            u |= (best_index << 8*k);
            grid = (const uint8_t *)(kgrid_q2xs + aux8[k]);
            for (int j = 0; j < 8; ++j) {
                float q = db * grid[j] * signs[j];
                sumqx += wk[j] * q * xk[j];
                sumq2 += wk[j] * q * q;
            }
        }
        q2[2*ib] = u;
        if (sumq2 > 0) y[ibl].d = GGML_FP32_TO_FP16(d*sumqx/sumq2);
    }
    memcpy(y[ibl].qs, q2, QK_K/4);
}

}
`
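
The scale updates in the code above (`scale = sumqx/sumq2`) are the closed-form solution of a weighted least-squares fit: for fixed quants q_i and importances w_i, the sum of w_i * (x_i - s*q_i)^2 is minimized by s = sum(w*x*q) / sum(w*q*q). The following is a stripped-down sketch of just that step, with hypothetical function names and double accumulation for clarity; it is not a replacement for the function above.

```c
#include <stddef.h>

// Scale s minimizing sum_i w[i] * (x[i] - s*q[i])^2 for fixed quants q.
double weighted_ls_scale(const float *x, const float *q,
                         const float *w, size_t n) {
    double sumqx = 0, sumq2 = 0;
    for (size_t i = 0; i < n; ++i) {
        sumqx += (double)w[i] * x[i] * q[i];
        sumq2 += (double)w[i] * q[i] * q[i];
    }
    return sumq2 > 0 ? sumqx / sumq2 : 0.0;
}

// Weighted squared error of approximating x[i] by s*q[i].
double weighted_sq_error(const float *x, const float *q,
                         const float *w, size_t n, double s) {
    double err = 0;
    for (size_t i = 0; i < n; ++i) {
        double diff = x[i] - s * q[i];
        err += (double)w[i] * diff * diff;
    }
    return err;
}
```

At the optimum the error equals sum(w*x*x) - sumqx*sumqx/sumq2, so comparing `sumqx*sumqx > best*sumq2` across candidate quant sets, as the loop over scales does, picks the set with the smallest weighted error.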

@ggerganov added the "high priority" label on Jan 4, 2024
@slaren
Collaborator

slaren commented Jan 4, 2024

Very nice! A 70b quant that fits in 24gb is awesome.

You can use test-backend-ops to compare the GPU implementations with the CPU implementation, by adding the new quant type to all_types:

const ggml_type all_types[] = {
    GGML_TYPE_F32, GGML_TYPE_F16,
    GGML_TYPE_Q4_0, GGML_TYPE_Q4_1,
    GGML_TYPE_Q5_0, GGML_TYPE_Q5_1,
    GGML_TYPE_Q8_0,
    GGML_TYPE_Q2_K, GGML_TYPE_Q3_K,
    GGML_TYPE_Q4_K, GGML_TYPE_Q5_K,
    GGML_TYPE_Q6_K
};

@ikawrakow
Contributor Author

Thank you @slaren for pointing out the automation in testing. I cannot quite use it yet because of the missing quantization capability (unless the automation is so clever as to pull quantized models from somewhere else). The automated testing is also failing because there is no quantization function provided.

@JohannesGaessler
Collaborator

> Quantized matrix multiplications on CUDA are not (yet) implemented, and neither is the LLAMA_CUDA_FORCE_DMMV option (is this still being used?)

The dequantize_mul_mat_vec kernel is almost never used. On all devices with compute capability 6.1 or higher, mul_mat_vec_q is used instead. It would only be used on, e.g., a P100 or a Maxwell card. I think it would be fine if those cards are simply not supported though.

Regarding mul_mat_q: do you intend to do an implementation? In my experience MMQ is more difficult to work with than MMVQ so I would be willing to do the implementation instead (or just provide help).

> But the change required in ggml for a row-wise quantization is quite significant, so for now I'm adding this new quantization type within the existing ggml block-wise framework.

I'm still prototyping but per-row/per-column scales may also be necessary for an efficient int8 tensor core implementation. The problem with the current block formats is that every time the scale changes you need to unload the tensor core accumulator which is very slow. In my case the necessary changes would be minor and isolated to ggml-cuda.cu though.

Collaborator

@JohannesGaessler JohannesGaessler left a comment

Looks pretty good, would be amazing to have if NVIDIA releases another RTX ??90 with 24 GB VRAM.

Can the multiple instances of the hardcoded values be deduplicated? Perhaps by defining macros like IQ2XXS_GRID_VALUES?

Also consider #4755 if you haven't yet seen it. There seem to be issues with numerical precision when quantizing the hidden state for short context sizes. These issues will not show up in perplexity calculations because they exclude the first few tokens of a chunk. Though this does not necessarily mean that per-row scales for the weights are also problematic.

@@ -1292,6 +1300,128 @@ static __global__ void dequantize_block_q6_K(const void * __restrict__ vx, dst_t
#endif
}

static const __device__ uint64_t kgrid_iq2xxs[256] = {

Collaborator

Does this have the same effect as __constant__? In other words, does it actually put these values into constant memory? (Should be faster than if it is in global memory.)

Contributor Author

Just tried. Replacing

static const __device__ uint8_t ksigns_iq2xs[128]

with

static __constant__ __device__ uint8_t ksigns_iq2xs[128]

makes it massively slower (108 t/s vs 155 t/s)

Contributor Author

> Can the multiple instances of the hardcoded values be deduplicated? Perhaps by defining macros like IQ2XXS_GRID_VALUES?

I thought about this, but did not do it (yet). Mainly because:

  • The different files require different qualifiers (__device__ on CUDA, constexpr on Metal, etc.), so one either needs to work with pre-processor trickery, or needs to define the actual content as a macro. I did not like either option much.
  • This would add an extra file, and my understanding is that @ggerganov likes having as few files as possible.

But yes, absolutely, this is something one should consider.

Collaborator

I see. The compiler is probably not copying the data from constant memory to registers. So for frequently used data it's slower as long as there is no register spilling.

Contributor Author

> Also consider #4755 if you haven't yet seen it. There seem to be issues with numerical precision when quantizing the hidden state for short context sizes.

Yes, I know about this. It is context and model dependent. For some models (e.g. Falcon-7B) the difference between quantizing and not quantizing the hidden state can be quite dramatic. This is why the dequantize_mul_mat_vec kernels are actually useful, and I'm somewhat surprised they have fallen out of favor.

Contributor Author

> Regarding mul_mat_q: do you intend to do an implementation? In my experience MMQ is more difficult to work with than MMVQ so I would be willing to do the implementation instead (or just provide help).

So, the MMQ kernels are mostly useful for contexts ranging from a few to a few tens of tokens. I know this is an important use case for stuff such as speculative sampling, but in my private repo I have an MMQ implementation based on plain vector dot products that outperforms MMQ for, say, up to 16 tokens. If one could extend this up, and/or extend the dequantize/cuBLAS performance superiority down to fewer tokens, the MMQ kernels would become unnecessary. This is the main reason I'm somewhat reluctant to add them, especially considering the amount of code and the compilation time increase each new MMQ kernel adds.

Collaborator

Originally the MMQ kernels were intended for large matrices. But it later turned out that the FP32 cuBLAS GEMM does not actually use tensor cores and that FP16 cuBLAS GEMM is still faster for Volta or newer. Georgi then repurposed the MMQ kernels for small batch sizes by changing the tile sizes. They were never intended or optimized by me for this use case and in my testing they still perform worse than FP16 cuBLAS even for small batch sizes:

[Figure: cublas_vs_mmq_q8_0]

However, because you do not need to dequantize the weight matrix MMQ should still be more efficient in terms of VRAM. Also on Pascal/RDNA2 or older there are no tensor cores so it is also faster than cuBLAS GEMM by a factor of ~2.

> in my private repo I have an MMQ implementation based on plain vector dot products that outperforms MMQ for, say, up to 16 tokens.

I was thinking that you could extend MMVQ to allow for >1 y columns and probably get better performance than with MMQ/cuBLAS GEMM. Presumably this is very similar to what you have.

Contributor Author

But to follow up on your __constant__ comment, I did copy the data into shared memory on Metal. This boosted TG from about 48 t/s to ~54 t/s, so 12.5% speedup. There it was easy to implement without changing any other kernel. I was thinking that one could gain some performance on CUDA too by copying the grid/sign data to shared memory, but I didn't see how to do it without changing the MMVQ template and touching every single dot product kernel, so left it as is for now.

Collaborator

@JohannesGaessler JohannesGaessler Jan 4, 2024

I don't know how the shared memory on Metal works but with CUDA there are by default 32 shared memory banks with a size of 4 byte each. The values here are uint64 = 8 byte so unless you can ensure that the threads in a warp access different memory banks ((pointer % (32*4 bytes))/4 bytes needs to be different) I would expect you to get a lot of memory bank conflicts which drastically reduces the memory bandwidth.

From what I can tell you have not yet published any models using the new format so I cannot test this myself but with NVIDIA NSight Compute under the occupancy section you can see the number of registers needed per thread. If that number is not limiting occupancy I would not expect much of a performance gain from moving the values to shared memory (but doing that may still reduce cache evictions of other data so I could be wrong).
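
The bank formula in the comment above can be tried out on the host with a small sketch. This is a deliberately simplified model under stated assumptions: 32 banks of 4 bytes, only the first 4-byte word of each element is considered, and hardware details such as broadcasts of identical addresses or special handling of 64-bit accesses are ignored. The function names are hypothetical.

```c
#include <stdint.h>

// Bank index of a byte address, assuming 32 banks of 4 bytes each:
// (pointer % (32*4 bytes)) / 4 bytes.
unsigned shared_bank(uintptr_t byte_addr) {
    return (unsigned)((byte_addr % (32u * 4u)) / 4u);
}

// For a warp of 32 lanes where lane i reads element index[i] (non-negative)
// of an array with elem_size bytes per element, return how many lanes land
// on the busiest bank. A result of 1 means no conflict in this model.
int max_lanes_per_bank(const int index[32], unsigned elem_size) {
    int hits[32] = {0};
    int worst = 0;
    for (int lane = 0; lane < 32; ++lane) {
        unsigned b = shared_bank((uintptr_t)index[lane] * elem_size);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

In this model, 32 lanes reading consecutive 4-byte elements each hit a different bank, while consecutive 8-byte elements (such as the uint64 grid entries) map lanes i and i+16 to the same bank. With data-dependent indices, as in a codebook lookup, the pattern is irregular and has to be measured rather than predicted.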

Comment on lines +3976 to +4000
#else
    // iqs is 0...15
    const int ib32 = iqs/2;
    const int il = iqs%2;
    const uint16_t * q2 = bq2->qs + 4*ib32;
    const uint8_t * aux8 = (const uint8_t *)q2;
    const uint8_t * grid1 = (const uint8_t *)(kgrid_iq2xxs + aux8[2*il+0]);
    const uint8_t * grid2 = (const uint8_t *)(kgrid_iq2xxs + aux8[2*il+1]);
    const uint32_t aux32 = q2[2] | (q2[3] << 16);
    const float d = (float)bq2->d * (0.5f + (aux32 >> 28)) * (float)bq8_1[ib32].ds.x * 0.25f;
    const uint8_t signs1 = ksigns_iq2xs[(aux32 >> 14*il) & 127];
    const uint8_t signs2 = ksigns_iq2xs[(aux32 >> (14*il + 7)) & 127];
    const int8_t * q8 = bq8_1[ib32].qs + 16*il;
    int sumi1 = 0, sumi2 = 0;
    for (int j = 0; j < 8; ++j) {
        sumi1 += q8[j+0] * grid1[j] * (signs1 & kmask_iq2xs[j] ? -1 : 1);
        sumi2 += q8[j+8] * grid2[j] * (signs2 & kmask_iq2xs[j] ? -1 : 1);
    }
    return d * (sumi1 + sumi2);
#endif

Collaborator

I assume this is for testing different versions.

Contributor Author

Yes. The one currently active is slightly faster on my RTX-4080, but I left the other version behind (which was the initial implementation) just in case. You never know with all these different cards that are being supported.
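
The kernel quoted above turns a 7-bit sign field into an 8-bit mask via the ksigns_iq2xs lookup table, whose contents are not shown in this thread. A plausible reconstruction, assuming the eighth bit is simply the parity of the lower seven (which is what the "even number of sign flips" rule in the PR description implies), could look like the sketch below. The function names are hypothetical.

```c
#include <stdint.h>

// Expand a 7-bit sign field to 8 sign bits. Bit 7 is set so that the
// total number of set bits (negative signs) is even.
uint8_t expand_signs(uint8_t s7) {
    uint8_t s = s7 & 127;
    uint8_t parity = 0;
    for (int i = 0; i < 7; ++i) parity ^= (s >> i) & 1;
    return (uint8_t)(s | (parity << 7));
}

// Sign (+1 or -1) of element j (0..7) for an expanded 8-bit mask,
// mirroring the (signs & kmask[j] ? -1 : 1) expression in the kernel.
int sign_of(uint8_t signs8, int j) {
    return (signs8 & (1u << j)) ? -1 : 1;
}
```

Under this assumption a 128-entry table would just hold `expand_signs(i)` for i = 0..127, so the quantizer only ever needs to store the low 7 bits.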

@Dampfinchen

On a side note, are Mixtral GGUFs "finalized" now to the point a breaking change is not to be expected? I remember that warning Slaren wrote in the first PR.

@ikawrakow
Contributor Author

ikawrakow commented Jan 5, 2024

I have posted some IQ2_XXS quantized models, including Mixtral-8x7B, on Huggingface. See https://huggingface.co/ikawrakow/various-2bit-sota-gguf/tree/main

@Dampfinchen

Dampfinchen commented Jan 5, 2024

> I have posted some IQ2_XXS quantized models, including Mixtral-8x7B, on Huggingface. See https://huggingface.co/ikawrakow/various-2bit-sota-gguf/tree/main

Wow, only 12 GB for Mixtral? That is super impressive. Thank you for your amazing work. Do you plan to update the other quants as well?

@ikawrakow
Contributor Author

ikawrakow commented Jan 5, 2024

> Do you plan to update the other quants as well?

I started playing with Mixtral this morning, so yes, I will post quantizations for the other ggml quants in the next few days. I just finished a perplexity calculation for Q4_K_S, and I'm getting PPL = 4.1764 for context of 512. According to PR #4739, Q4_K_S perplexity is 4.5136 on current master. Is this possible? I have not yet seen such a massive difference between official k-quants and my activation-aware quantization.

Update: I see that the perplexity of 4.5136 quoted in PR #4739 is simply wrong. I get PPL = 4.2523 for Q4_K_S with current llama.cpp master.

@JianbangZ

Do you plan to update the other quants as well?

I started playing with Mixtral this morning, so yes, I will post quantizations for the other ggml quants in the next few days. I just finished a perplexity calculation for Q4_K_S, and I'm getting PPL = 4.1764 for a context of 512. According to PR #4739, Q4_K_S perplexity is 4.5136 on current master. Is this possible? I have not yet seen such a massive difference between official k-quants and my activation-aware quantization.

I think you are onto something. I will take a look at your repo and do some PPL and KL divergence calculations today.

I have posted some IQ2_XXS quantized models, including Mixtral-8x7B, on Huggingface. See https://huggingface.co/ikawrakow/various-2bit-sota-gguf/tree/main

Can you talk more about how you quantize Mixtral? Do you follow the same principles as the others and leave the gating part in fp16 or int8? Is the code to quantize Mixtral pushed to your repo? I will run some extensive tests over the weekend for PPL and KL divergence.

@ikawrakow
Contributor Author

ikawrakow commented Jan 5, 2024

Can you talk more about how you quantize Mixtral? Do you follow the same principles as the others and leave the gating part in fp16 or int8?

Yes, the ffn_gate_inp.weight tensor is left as fp16 just like in mainline llama.cpp. Otherwise, the selection of quants depends on the quantization type. For Q4_K_S I have restored the use of Q5_K for ffn_down for all experts in the first 4 layers (this has been lost in mainline llama.cpp), plus I use Q5_K for all attn_v tensors (these are just 4096 x 1024, so using 1 extra bit leads to a negligible increase in size). I see someone has added a change to mainline llama.cpp to use Q8_0 for attn_k. I don't use this as in my experience quantization errors in the K and Q tensors have the least impact on quantization quality after token embeddings. For the "true" 2-bit model added via this PR (IQ2_XXS) I use Q2_K (so, 2.5625 bpw) for all attn_v, for token embeddings, and for all expert ffn_down tensors in the first 4 layers; everything else is IQ2_XXS.
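
The IQ2_XXS mixture described above can be sketched as a small selection rule. This is only an illustration: `pick_quant_type` is a hypothetical helper, the tensor names in the usage lines are assumed GGUF-style names, and the real selection logic in llama.cpp is written in C++ and may differ in detail.

```python
def pick_quant_type(tensor_name, layer, n_first_layers=4):
    """Hypothetical helper mirroring the IQ2_XXS mixture described in the comment."""
    if "ffn_gate_inp" in tensor_name:
        return "F16"        # expert router is left as fp16
    if "attn_v" in tensor_name or "token_embd" in tensor_name:
        return "Q2_K"       # 2.5625 bpw
    if "ffn_down" in tensor_name and layer is not None and layer < n_first_layers:
        return "Q2_K"       # expert ffn_down tensors in the first 4 layers
    return "IQ2_XXS"        # 2.0625 bpw for everything else

# Assumed tensor names, for illustration only.
router = pick_quant_type("blk.0.ffn_gate_inp.weight", 0)
early_down = pick_quant_type("blk.3.ffn_down.5.weight", 3)
late_down = pick_quant_type("blk.4.ffn_down.5.weight", 4)
```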

@sakura-umi
Contributor

quant_weights is the importance matrix. In my case it is very simple, it contains just diagonal elements

Can you explain more about how to get the importance matrix for a certain model (such as Qwen)?

I also see that the quantize_row_iq2_xxs_reference function is not implemented yet. How can I use the quantize_row_iq2_xxs_impl code above to make examples/quantize work?

@sorasoras

server.exe doesn't seem to support the SOTA 2-bit quants yet.
No matter what I input, it keeps outputting:
Llama: #���#�!������������������������!��$�$!����$#$������$������� #������� ����!�!�����������������$���#�����#$�$## �!����� ##��$ ��$#����� ����!�� #����!����$��#��� ������!

@Dampfinchen

Dampfinchen commented Jan 6, 2024

Can you talk more about how you quantize Mixtral? Do you follow the same principles as the others and leave the gating part in fp16 or int8?

Yes, the ffn_gate_inp.weight tensor is left as fp16 just like in mainline llama.cpp. Otherwise, the selection of quants depends on the quantization type. For Q4_K_S I have restored the use of Q5_K for ffn_down for all experts in the first 4 layers (this has been lost in mainline llama.cpp), plus I use Q5_K for all attn_v tensors (these are just 4096 x 1024, so using 1 extra bit leads to a negligible increase in size). I see someone has added a change to mainline llama.cpp to use Q8_0 for attn_k. I don't use this as in my experience quantization errors in the K and Q tensors have the least impact on quantization quality after token embeddings. For the "true" 2-bit model added via this PR (IQ2_XXS) I use Q2_K (so, 2.5625 bpw) for all attn_v, for token embeddings, and for all expert ffn_down tensors in the first 4 layers; everything else is IQ2_XXS.

On another note, @TheBloke found out that when quantizing Mixtral, Q4_K_S and Q4_K_M currently have the exact same size, so he decided to just upload the Q4_K_M quants for now. Do you have any idea why that is, and do you think your new quants could improve that?

@ggerganov
Owner

@Dampfinchen The Q4_K_S and Q4_K_M refer to the quantization mixtures used to quantize a model. A mixture can utilize various quantization types (see the log when loading the model), but the name of the mixture usually indicates the most prominent quantization type used - Q4_K in this case. The current quantization mixtures for Mixtral models are a first iteration (it was stated in the original PR #4406). As a first quick iteration, the current Q4_K_S and Q4_K_M quantization mixtures use the same quantization types as stated in #4406:

image

@ikawrakow

I see someone has added a change to mainline llama.cpp to use Q8_0 for attn_k. I don't use this as in my experience quantization errors in the K and Q tensors have the least impact on quantization quality after token embeddings.

The only reason for bumping both attn_k and attn_v to Q8_0 was that, due to GQA, these tensors are relatively small (4x smaller than attn_q), so using Q8_0 adds just ~200-300 MB to the model, and I thought it wouldn't hurt. I haven't tested at all how this affects PPL; likely not much, as you have observed. So it's something that can probably be improved, and there are comments to remind us:

llama.cpp/llama.cpp

Lines 8917 to 8928 in c75ca5d

    if (qs.model.hparams.n_expert == 8) {
        // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
        // TODO: explore better strategies
        new_type = GGML_TYPE_Q8_0;
    }
    ++qs.i_attention_wv;
} else if (name.find("attn_k.weight") != std::string::npos) {
    if (qs.model.hparams.n_expert == 8) {
        // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
        // TODO: explore better strategies
        new_type = GGML_TYPE_Q8_0;
    }
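
A rough back-of-the-envelope check of the size cost, as a sketch under stated assumptions: 32 layers for Mixtral-8x7B, the 4096 x 1024 tensor shape quoted earlier in the thread, 8.5 bpw for Q8_0 (32 int8 quants plus one fp16 scale per block), and Q4_K at 4.5 bpw as the baseline being replaced. The baseline is an assumption; with a different baseline type the numbers change.

```python
N_LAYERS = 32          # assumed Mixtral-8x7B layer count
ELEMS = 4096 * 1024    # attn_k / attn_v shape quoted in the thread
BPW_Q8_0 = 8.5         # 32 int8 quants + one fp16 scale per block of 32
BPW_Q4_K = 4.5         # assumed baseline type for the comparison

def extra_mib(bpw_new, bpw_old, n_tensors):
    """Extra size in MiB when n_tensors tensors move from bpw_old to bpw_new."""
    extra_bits = (bpw_new - bpw_old) * ELEMS * n_tensors
    return extra_bits / 8 / 2**20

one_kind = extra_mib(BPW_Q8_0, BPW_Q4_K, N_LAYERS)        # attn_k only: 64 MiB
both_kinds = extra_mib(BPW_Q8_0, BPW_Q4_K, 2 * N_LAYERS)  # attn_k + attn_v: 128 MiB
```

Under these assumptions the combined cost comes out at 128 MiB, which is in the same range as the figures mentioned in the comment and in the code above, though it is not necessarily how those figures were derived.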

@ikawrakow
Contributor Author

quant_weights is the importance matrix. In my case it is very simple, it contains just diagonal elements

Can you explain more about how to get the importance matrix for a certain model (such as Qwen)?

This will become available soon. I will either make my repo public, or I will make a PR to mainline llama.cpp to add the capability to compute the importance matrix used for these quantizations.

I also see that the quantize_row_iq2_xxs_reference function is not implemented yet. How can I use the quantize_row_iq2_xxs_impl code above to make examples/quantize work?

You cannot use the existing implementation in the quantize example and llama.cpp. The importance matrix needs to be loaded, passed along to llama_model_quantize_internal, and from there propagated to the actual quantization functions. This requires changes to the ggml quantization interface (ggml_quantize_chunk) and implementation. Once you make this change, you don't need to call quantize_row_iq2_xxs_reference but can call any function of your choice.
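
To illustrate why the importance matrix has to reach the quantization functions, here is a minimal sketch of how diagonal importances change the error being minimized for one block. The helper names and data are hypothetical, and this is not the quantize_row_iq2_xxs_impl code; it only shows a weighted least-squares scale for fixed integer quants.

```python
def weighted_error(x, q, d, w):
    """Importance-weighted squared error: sum_i w[i] * (x[i] - d * q[i])**2."""
    return sum(wi * (xi - d * qi) ** 2 for xi, qi, wi in zip(x, q, w))

def best_scale(x, q, w):
    """Scale d that minimizes the weighted error for fixed integer quants q."""
    num = sum(wi * xi * qi for xi, qi, wi in zip(x, q, w))
    den = sum(wi * qi * qi for qi, wi in zip(q, w))
    return num / den if den > 0 else 0.0

x = [0.9, -2.1, 3.2, -0.4]       # made-up model weights of one block
q = [1, -2, 3, 0]                # made-up integer quants
uniform = [1.0] * 4              # no importance information
skewed = [10.0, 1.0, 1.0, 1.0]   # the first weight matters most

d_uniform = best_scale(x, q, uniform)
d_skewed = best_scale(x, q, skewed)
```

With the skewed importances, the scale moves toward the value that reproduces the first weight well, at the expense of the others. Without access to the importances, the quantization function can only optimize the unweighted error.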

@he29-net

he29-net commented Jan 8, 2024

On an RX 6800 with hipBLAS I'm getting:
GGML_ASSERT: ggml-cuda.cu:7556: false

I assumed it's related to the not yet implemented MMQ (mul_mat_q) so I tried to add -nommq to avoid using it, but it makes no difference, so maybe I'm misunderstanding what the parameter does. Just wanted to mention it in case it's not expected. CPU-only inference on an old CPU (no AVX2) worked fine.

@ikawrakow
Contributor Author

On a RX 6800 with hipBLAS I'm getting: GGML_ASSERT: ggml-cuda.cu:7556: false

I assumed it's related to the not yet implemented MMQ (mul_mat_q) so I tried to add -nommq to avoid using it, but it makes no difference, so maybe I'm misunderstanding what the parameter does. Just wanted to mention it in case it's not expected. CPU-only inference on an old CPU (no AVX2) worked fine.

The -nommq option has been lost. To work around that I have introduced ggml_supports_mmq, which returns true for all existing quants and false for the new IQ2_XXS. The problem was that the call to ggml_supports_mmq was at the wrong pre-processor nesting level, so it did not work for hipBLAS. I have pushed a fix; can you try now?

@he29-net

he29-net commented Jan 8, 2024

Just tested it and it works perfectly, thanks. 👍

Mixtral-8x7b-2.10bpw now fully fits into the 16 GB VRAM with 4096 token context and gets about 29 tokens/s. Just a month ago it didn't even cross my mind this could be possible. :)
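
A rough sanity check of why this fits, as a sketch only. The parameter count (about 46.7B), the layer count (32), and an fp16 KV cache are assumptions; the 1024-wide K/V tensors follow from the 4096 x 1024 shape mentioned earlier in the thread. Compute buffers and other overhead are ignored.

```python
N_PARAMS = 46.7e9        # approximate Mixtral-8x7B parameter count (assumption)
BPW = 2.10               # bits per weight, from the model file name
N_LAYERS = 32            # assumed layer count
KV_DIM = 1024            # K/V width per token (attn_k / attn_v are 4096 x 1024)
N_CTX = 4096             # context length used above
KV_BYTES_PER_VALUE = 2   # fp16 KV cache (assumption)

GIB = 2**30
weights_gib = N_PARAMS * BPW / 8 / GIB
kv_cache_gib = 2 * N_LAYERS * N_CTX * KV_DIM * KV_BYTES_PER_VALUE / GIB
total_gib = weights_gib + kv_cache_gib
```

Under these assumptions the weights take a bit over 11 GiB and the KV cache half a GiB, which leaves some headroom on a 16 GB card.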

@jxy
Contributor

jxy commented Jan 9, 2024

looks fine to me

./main -m models/mixtral-instruct-8x7b-2.10bpw.gguf --temp 0 --repeat-penalty 1.0 --no-penalize-nl -p '[INST] what is 12*8+7 [/INST]' --log-disable
 [INST] what is 12*8+7 [/INST] To calculate the expression 12*8+7, you need to follow the order of operations, which is often remembered by the acronym PEMDAS: Parentheses, Exponents, Multiplication and Division (from left to right), Addition and Subtraction (from left to right).

In this case, you should perform multiplication first:

12 * 8 = 96

Then, add 7:

96 + 7 = 103

So, the result of the expression 12*8+7 is 103.

@jxy
Contributor

jxy commented Jan 9, 2024

@x4080 I thought you meant Mixtral. But the Mistral 7B from @ikawrakow's HF repo is not instruct-tuned, so you need to give it a few-shot prompt. This works:

./main -m models/mistral-7b-2.20bpw.gguf --temp 0 --repeat-penalty 1.0 --no-penalize-nl -p 'Q: What is 2*4?
A: 8

Q: What is 3+6?
A: 9

Q: What is 3*4+7?
A: 3 * 4 = 12
   12 + 7 = 19
   So 3 * 4 + 7 = 19

Q: What is 13*6+5?
A: 13 * 6 = 78
   78 + 5 = 83
   So 13 * 6 + 5 = 83

Q: What is 12*8+7?
A:' --log-disable -r '
Q:'

The output is

 Q: What is 2*4?
A: 8

Q: What is 3+6?
A: 9

Q: What is 3*4+7?
A: 3 * 4 = 12
   12 + 7 = 19
   So 3 * 4 + 7 = 19

Q: What is 13*6+5?
A: 13 * 6 = 78
   78 + 5 = 83
   So 13 * 6 + 5 = 83

Q: What is 12*8+7?
A: 12 * 8 = 96
   96 + 7 = 103
   So 12 * 8 + 7 = 103

Q:

Overall, these 2-bit quants perform quite well! Thanks @ikawrakow

@Ttl
Contributor

Ttl commented Jan 9, 2024

It looks like the output logit norm of llama-v2-7b-2.20bpw is slightly lower than the fp16 model's output norm. After subtracting the mean, the output norm of the quantized model is 1.035 times smaller. Multiplying that factor into the output weights and biases, or just multiplying the output logits by it, reduces the perplexity on wiki.test from 7.0681 to 7.0080. It's unlikely to matter much in practice, as it just applies a very small temperature change to the output softmax, but for optimizing numbers it does help a little.
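
The equivalence between scaling the logits and changing the softmax temperature can be checked with a minimal sketch. The logits are made up; only the 1.035 factor comes from the comment above.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((v - m) / temperature) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

SCALE = 1.035                    # factor reported in the comment
logits = [2.0, 0.5, -1.0, 0.0]   # made-up logits

scaled = softmax([SCALE * v for v in logits])        # multiply the logits
tempered = softmax(logits, temperature=1.0 / SCALE)  # or lower the temperature
baseline = softmax(logits)
```

Both variants give the same distribution, slightly sharper than the baseline, so the correction amounts to sampling at a temperature of about 0.966 instead of 1.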

Testing on wiki.test, 2.8% of the time the top token of llama-v2-7b-2.20bpw was not in the top-10 of the fp16 model's output tokens. For Q2_K it's 0.85%, and for Q4_K_S it's 0.10%.

There are 2100 instances in wiki.test, out of 280k tokens in the text, where the difference in probability of the fp16 model's top-1 token is larger than 90%. The same number is 0 for Q4_K_S, so there is definitely something lost in the 2-bit quantization. Looking at the differences, most of them are dates, names and other facts that the 2-bit model seems to have forgotten.
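
The first metric above (how often the quantized model's top token falls outside the fp16 model's top-k) can be sketched as follows. The helper names and the tiny logit tables are made up for illustration; this is not the script used to produce the numbers in this comment.

```python
def top_k_ids(logits, k):
    """Indices of the k largest logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def top1_miss_rate(ref_logits, quant_logits, k=10):
    """Fraction of positions where the quantized top-1 token is outside the reference top-k."""
    misses = 0
    for ref, quant in zip(ref_logits, quant_logits):
        if top_k_ids(quant, 1)[0] not in top_k_ids(ref, k):
            misses += 1
    return misses / len(ref_logits)

# Two made-up positions over a 4-token vocabulary, with k=2 to keep it small.
ref = [[3.0, 2.0, 0.1, 0.0], [0.0, 0.1, 2.0, 3.0]]
quant = [[2.5, 2.6, 0.0, 0.1], [2.9, 0.0, 0.1, 0.2]]
rate = top1_miss_rate(ref, quant, k=2)  # the second position is a miss
```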

Some cherry-picked examples with 0 temperature:

  • "American Beauty is a 1999 American drama film directed by"
      • fp16 model: " Sam Mendes and written by Alan Ball.". 99% probability for "Sam". (This is correct)
      • 2.20 bpw: " Allan F.reviewed by the critics". 5% probability for "All".
  • "QuackShot , known in Japan as I Love Donald Duck : Georgia Ou no Hihou ( Japanese : アイラブドナルドダック グルジア王の秘"
      • fp16 model: "宝, ". 99% probability. (This is correct)
      • 2.20 bpw: "増, ". 50% probability.
  • "Undertaker then returned and chokeslam"
      • fp16 model: "med Kane through the announce table.". 98% for "med".
      • 2.20 bpw: "'d him, but he was able to escape." 5% for "'".

The last example generates reasonable text on CPU but garbage when run with CUDA: ./main -m llama-v2-7b-2.20bpw.gguf -p "Undertaker then returned and chokeslam" --top-k 1 -ngl 99. I'm not sure if this is a problem with this PR or #4755.

Still very good results for this extreme quantization.

@x4080

x4080 commented Jan 9, 2024

@jxy Thanks for the tip. I'm glad that the 2-bit quant works well. How much VRAM do you have for running the Mixtral 2-bit model?

@x4080

x4080 commented Jan 9, 2024

Is the conversion to 2-bit available anywhere?

I tried nous-hermes-2-34b-2.69bpw.gguf on my M2 with 16 GB and it won't even start. Maybe there is not enough memory?

@JianbangZ

Tested some Mixtral 8x7B models. Generation speed is great, about 60% of a single 7B model. But what is up with the prompt evaluation speed? Why is it 1/7 of the 7B model? Some implementation inefficiency?

I have an RTX 6000 Ada; gen speed is about 80 t/s, and prompt evaluation speed is 480 t/s without 1k input (2k window), while the 7B model yields 3300 t/s prompt evaluation speed and 120 t/s gen speed.

teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request Jan 9, 2024
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slighty faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slighty faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in qequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@joseph777111

joseph777111 commented Jan 10, 2024

@ikawrakow Will the techniques used to make pure 2 quants be used to improve the larger quantizations as well? 🤔🤩

@ikawrakow
Contributor Author

Tested some Mixtral 8x7B models. Generation speed is great, about 60% of a single 7B model. But what is up with the prompt evaluation speed? Why is it 1/7 of the 7B model? Some implementation inefficiency?

I have an RTX 6000 Ada; gen speed is about 80 t/s, and prompt evaluation speed is 480 t/s without 1k input (2k window), while the 7B model yields 3300 t/s prompt evaluation speed and 120 t/s gen speed.

For prompt processing, Mixtral is equivalent to a 46B model rather than a 13B model: each token from the prompt is given to a different set of 2 "experts", so one always ends up with all 8 "experts" being involved. The model therefore has effectively 46B parameters, so yes, about 1/7 the speed of a 7B model. I was confused about this myself due to the misleading use of "Mixture Of Experts". The 8 different feed forward networks are not really "experts" that are turned on and remain on based on the context being processed. Instead, this architecture allows one to enlarge the feed forward network, use all of it when evaluating a prompt, but use only 1/4 of it when generating tokens one by one.
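
A quick way to see this is to count how many distinct experts a batch is expected to touch. The sketch below assumes uniform, independent top-2 routing per token, which is a simplification (real routing depends on the token); `expected_active_experts` is a hypothetical helper.

```python
from math import comb

N_EXPERTS = 8
TOP_K = 2

def expected_active_experts(n_tokens):
    """Expected number of distinct experts used by a batch under uniform random routing."""
    # Probability that one token's top-2 choice avoids a given expert: 21/28 = 0.75.
    p_skip = comb(N_EXPERTS - 1, TOP_K) / comb(N_EXPERTS, TOP_K)
    return N_EXPERTS * (1.0 - p_skip ** n_tokens)

single = expected_active_experts(1)    # token generation: 2 of 8 experts
batch = expected_active_experts(512)   # prompt processing: effectively all 8
```

Even a batch of 16 tokens is expected to touch almost all 8 experts under this assumption, so prompt processing has to read essentially the whole feed forward network.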

@JianbangZ

Tested some Mixtral 8x7B models. Generation speed is great, about 60% of a single 7B model. But what is up with the prompt evaluation speed? Why is it 1/7 of the 7B model? Some implementation inefficiency?
I have an RTX 6000 Ada; gen speed is about 80 t/s, and prompt evaluation speed is 480 t/s without 1k input (2k window), while the 7B model yields 3300 t/s prompt evaluation speed and 120 t/s gen speed.

For prompt processing, Mixtral is equivalent to a 46B model rather than a 13B model: each token from the prompt is given to a different set of 2 "experts", so one always ends up with all 8 "experts" being involved. The model therefore has effectively 46B parameters, so yes, about 1/7 the speed of a 7B model. I was confused about this myself due to the misleading use of "Mixture Of Experts". The 8 different feed forward networks are not really "experts" that are turned on and remain on based on the context being processed. Instead, this architecture allows one to enlarge the feed forward network, use all of it when evaluating a prompt, but use only 1/4 of it when generating tokens one by one.

I get it now. Basically, it's because of batching: all the experts are used for the entire batch of tokens (256, 512, etc.). It might be worth doing some experiments on "batched top-2 experts" to fix the top-2/3 experts for the entire batch. Other ideas to improve the prompt eval speed are welcome.

@x4080

x4080 commented Jan 10, 2024

Hi, I just downloaded mistral-instruct-7b-2.43bpw.gguf but I got an error when I tried to run it, so I ran it again with a minimal config, but still got the same error:

./main -m ./models/mistral-instruct-7b-2.43bpw.gguf -p "Hello"

the error

...
llm_load_tensors: system memory used  = 6577.58 MiB
................................................................................................GGML_ASSERT: ggml-backend.c:1274: (char *)addr + ggml_backend_buffer_get_alloc_size(buffer, tensor) <= (char *)ggml_backend_buffer_get_base(buffer) + ggml_backend_buffer_get_size(buffer)
zsh: abort      ./main -m ./models/mistral-instruct-7b-2.43bpw.gguf -p "Hello"

System: M2, 16 GB

jbochi added a commit to jbochi/gguf-tools that referenced this pull request Jan 21, 2024
These were added in ggerganov/llama.cpp#4773

It's annoying that I8 used to be 16 and it's now 18. I16 and I32 also changed.
@ikawrakow ikawrakow mentioned this pull request Jan 29, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
@tsengalb99

tsengalb99 commented Feb 13, 2024

@ikawrakow I just saw this - exciting to see some of the stuff from QuIP# being ported over! A few comments:

  • We don't actually "force" a number of sign flips during quantization. The shifted E8 lattice has the property that an even number of sign flips keeps you in it, and an odd number of sign flips moves you out of the lattice. All lattice points can be represented as either an even number of sign flips from an all-positive vector or an even number of sign flips from a vector with only one negative entry. Since the absv codebook is all positive, this means that every entry corresponds to either an even or an odd number of flips (but not both), and so we can get away with storing 7/8 sign flips.
  • You may want to try the fine-tuning approach I posted about yesterday (more details in our arXiv paper). It seems to make a pretty big difference for smaller models. For example, with our perplexity measuring method, 7B 2-bit goes from 8.2 to 6.2 ppl (fp16 is 5.x).
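
The "7/8 sign flips" point in the first bullet can be illustrated with a minimal sketch: if the parity of the number of flips is known, the 8th sign bit is redundant and can be recomputed from the other 7. The code below handles only the even-parity case and uses hypothetical helper names; it is not the QuIP# or llama.cpp implementation.

```python
def pack_signs(signs):
    """Keep 7 of 8 sign bits; requires an even number of flips (1 = flipped)."""
    assert len(signs) == 8 and sum(signs) % 2 == 0
    bits = 0
    for i in range(7):
        bits |= signs[i] << i
    return bits  # fits in 7 bits

def unpack_signs(bits):
    """Recover all 8 sign bits; the 8th is the parity of the first 7."""
    signs = [(bits >> i) & 1 for i in range(7)]
    signs.append(sum(signs) % 2)
    return signs

signs = [1, 0, 0, 1, 1, 0, 0, 1]   # four flips, so the parity is even
packed = pack_signs(signs)
restored = unpack_signs(packed)
```

For codebook entries that correspond to an odd number of flips, the same idea applies with the recovered bit inverted, since the entry itself determines which parity to expect.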

@ikawrakow
Contributor Author

@tsengalb99 Thanks for the update. The results of your latest paper are very impressive. Congratulations!

Once fine-tuning/training gets involved in the quantization, as is the case in your latest paper or the AQLM paper, we are out of the competition in this repository due to the limited llama.cpp capabilities in that regard. But we did give you a good run for your money, I think (despite being consistently ignored in quantization-related publications).

Just out of curiosity, how long does it take to quantize one of these models with QuIP#?

@tsengalb99

tsengalb99 commented Feb 13, 2024 via email

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen added a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
readme : update hot topics

common : add `--version` option to show build info in CLI (#4433)

build : detect host compiler and cuda compiler separately (#4414)

sync : ggml (SD ops, tests, kernels) (#4444)

* sync : ggml (SD ops, tests, kernels)

ggml-ci

* cuda : restore im2col

ggml-ci

* metal : fix accuracy of dequantization kernels

ggml-ci

* cuda : restore correct im2col

ggml-ci

* metal : try to fix moe test by reducing expert size

ggml-ci

* cuda : fix bin bcast when src1 and dst have different types

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>

server : fix handling of characters that span multiple tokens when streaming (#4446)

readme : update supported model list (#4457)

convert : support loading vocab from fast tokenizer config (#3633)

* Add HFVocab into convert.py

* Update convert.py

* Update convert.py

* add bytes_to_unicode function

* change add_meta_vocab fucntion

* remove debug code

* remove byte_encoder

* Add newline between classes

* Check tokenizer.json when tokenizer.model is not exist.

* Move transformers dependency to local code

* Add error context with 'raise from'

* Add fast tokenizer option to BpeVocab

* Update convert.py

* Add VocabLoader and remove *Vocab class

* Add transformers dependency

* remove added tokens and check newline token to decide spm or bpe

* Update convert.py

* Add special token type

* Update convert.py

* Update convert.py

* Update convert.py

* Fix typo in convert.py

* Fix when params.n_vocab < tokenizer vocab size

* update vocab class

* change funtion name

* Remove unused variable/functions, add types to class variable and methods, delete blank liens

* fix flake8 warnings

* code style cleanup

* make mypy happy

* change exception

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453)

ggml : add ggml_row_size() (fixes llama out of space) (#4461)

* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

* do not cast to size_t, instead just use doubles

* ggml : add ggml_row_size(), deprecate ggml_type_sizef()

* ggml : fix row size compute to avoid overflows

* tests : fix sizey -> sizez

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

py : add protobuf dependency (#4466)

ggml : remove n_dims from ggml_tensor (#4469)

ggml-ci

ggml : use ggml_row_size where possible (#4472)

* ggml : use ggml_row_size where possible

ggml-ci

* ggml : move ggml_nbytes_split to ggml-cuda.cu

ggml : group mul_mat_id rows by matrix (cpu only) (#4480)

* ggml : group mul_mat_id rows by matrix (cpu only)

* remove mmid parameters from mm forward

* store row groups in wdata and calculate only once in GGML_TASK_INIT

ggml-ci

server : add optional API Key Authentication example (#4441)

* Add API key authentication for enhanced server-client security

* server : to snake_case

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

llama : sanity checks for access to logits (#4274)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

lora : add support for non-llama models (#3333)

* lora : add support for non-llama models

ggml-ci

* avoid leaking ggml_context on failure
cleanup

ggml-ci

* lora : allow 1d tensors

* lora : include embd and output layers in size calculation

* fix style

Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506)

server : allow requests larger than 8K (#4500)

server : fix possible ambiguity in content type charset (#4501)

server : fix grammar being ignored (#4494)

Fix bug in identifying the grammar.

server : disable llm logs if SERVER_VERBOSE is off (#3792)

finetune : keep allocs alive until all allocations are done (#4486)

build : Check the ROCm installation location (#4485)

* build : Check the ROCm installation location

* more generic approach

* fixup! It was returning the path instead of the command output

* fixup! Trailing whitespace

gguf-py : fail fast on nonsensical special token IDs (#4489)

llama.swiftui : add bench functionality (#4483)

* llama.swiftui : add bench button

* llama.swiftui : initial bench functionality

* force to use n_gpu_layers on simulator

* add download buttons & expose llamaState.loadModel

* update project.pbxproj

* comment #Preview & fix editorconfig check

* gitignore : xcode stuff

* llama.swiftui : UX improvements

* llama.swiftui : avoid data copy via "downloadTask"

* llama.swiftui : remove model from project

* llama : remove "mostly" from model infos

* llama.swiftui : improve bench

---------

Co-authored-by: jhen <developer@jhen.me>

readme : update hot topics

decode : fix logits_valid for legacy API (#4516)

llama : fix try_override for bool_value which always return true (#4519)

llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)

* phi2 implementation

* fix breaking change

* phi-2 : various fixes

* phi-2 : use layer norm eps

* py : whitespaces

* llama : fix meta KV override bug

* convert : phi don't add BOS token

* convert : revert "added_tokens_decoder" change

* phi-2 : scale Q instead of KQ for better precision

* ggml : fix NeoX rope to rotate just first n_dims

* cuda : less diff in the rope_neox kernel

* ggml : add ggml_mul_mat_set_prec

ggml-ci

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision

* cuda : remove oboslete comment

---------

Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

llama.swiftui : add more models

llama.swiftui : add tinyllama 1.1B F16

ggml-cuda: Fix HIP build (#4528)

regression of #4490
Adds defines for two new datatypes
cublasComputeType_t, cudaDataType_t.

Currently using deprecated hipblasDatatype_t since newer ones very recent.

ggml : fixed check for _MSC_VER (#4535)

Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>

CUDA: Faster Mixtral prompt processing (#4538)

* CUDA: make MoE tensors contiguous for batch size>1

* Update ggml-cuda.cu

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554)

llama : disable per-tensor info prints on model load (#4562)

cuda : replace asserts in wrong architecture checks with __trap (#4556)

* cuda : replace asserts in wrong architecture checks with __trap

* make bad_arch noreturn, remove returns

cuda : better error message for ggml_get_rows (#4561)

* Update ggml-cuda.cu

* Update ggml-cuda.cu

* Update ggml-cuda.cu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

py : open merges file as 'utf-8' (#4566)

Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error:

```
Traceback (most recent call last):
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
    model_instance.set_vocab()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
    self._set_vocab_gpt2()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
    special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
    self._load(Path(path))
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
    self._try_load_merges_txt(path)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
    for line in fp:
  File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```

readme : update coding guidelines

CUDA: mul_mat_id always on GPU for batches >= 32 (#4553)

common : remove incorrect --model-draft default (#4568)

ggml-cuda: Fix HIP build by adding define for __trap (#4569)

Regression of 139882392258671ffe5acdfcadc0bc08572d6eef
HIP doesn't have trap, only abort

cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449)

* AMD ROCm: handle UMA memory VRAM expansions

This resolves #2797 by allowing ROCm AMD GPU users with a UMA to
dynamically expand the VRAM allocated to the GPU.

Without this, AMD ROCm users with shared CPU/GPU memory usually are
stuck with the BIOS-set (or fixed) framebuffer VRAM, making it
impossible to load more than 1-2 layers.

Note that the model is duplicated in RAM because it's loaded once for
the CPU and then copied into a second set of allocations that are
managed by the HIP UMA system. We can fix this later.

* clarify build process for ROCm on linux with cmake

* avoid using deprecated ROCm hipMallocHost

* keep simplifying the change required for UMA

* cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON

metal : fix `ggml_metal_log` vargs (#4373)

llama : allow getting n_batch from llama_context in c api (#4540)

* allowed getting n_batch from llama_context in c api

* changed to use `uint32_t` instead of `int`

* changed to use `uint32_t` instead of `int` in `llama_n_ctx`

* Update llama.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

llama : initial ggml-backend integration (#4520)

* llama : initial ggml-backend integration

* add ggml-metal

* cuda backend can be used through ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
access all tensor data with ggml_backend_tensor_get/set

* add ggml_backend_buffer_clear
zero-init KV cache buffer

* add ggml_backend_buffer_is_host, used to avoid copies if possible when accessing tensor data

* disable gpu backends with ngl 0

* more accurate mlock

* unmap offloaded part of the model

* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap

* update quantize and lora

* update session copy/set to use ggml-backend

ggml-ci

* use posix_fadvise instead of posix_fadvise64

* ggml_backend_alloc_ctx_tensors_from_buft : remove old print

* llama_mmap::align_offset : use pointers instead of references for out parameters

* restore progress_callback behavior

* move final progress_callback call to load_all_data

* cuda : fix fprintf format string (minor)

* do not offload scales

* llama_mmap : avoid unmapping the same fragments again in the destructor

* remove unnecessary unmap

* metal : add default log function that prints to stderr, cleanup code

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ci : add `jlumbroso/free-disk-space` to docker workflow (#4150)

* [github][workflows][docker]: removes hardcoded `ggerganov` from `ghcr` repo

* [github][workflows][docker]: adds `jlumbroso/free-disk-space`

gguf : simplify example dependencies

gguf-py : fix broken link

ggml : change ggml_scale to take a float instead of tensor (#4573)

* ggml : change ggml_scale to take a float instead of tensor

* ggml : fix CPU implementation

* tests : fix test-grad0

ggml-ci

llama : add ability to cancel model loading (#4462)

* llama : Add ability to cancel model load

Updated llama_progress_callback so that if it returns false, the model
loading is aborted.

* llama : Add test for model load cancellation

* Fix bool return in llama_model_load, remove std::ignore use

* Update llama.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Fail test if model file is missing

* Revert "Fail test if model file is missing"

This reverts commit 32ebd525bf7e5a87ee8a3dbaab3d92ce79fbf23d.

* Add test-model-load-cancel to Makefile

* Revert "Revert "Fail test if model file is missing""

This reverts commit 2796953257ee5383fa7c8fe8fa8fc888c048fb0b.

* Simplify .gitignore for tests, clang-tidy fixes

* Label all ctest tests

* ci : ctest uses -L main

* Attempt at writing ctest_with_model

* ci : get ci/run.sh working with test-model-load-cancel

* ci : restrict .github/workflows/build.yml ctest to -L main

* update requirements.txt

* Disable test-model-load-cancel in make

* Remove venv before creation

* Restructure requirements.txt

Top-level now imports the specific additional requirements for each
python file. Using `pip install -r requirements.txt` will fail if
versions become mismatched in the per-file requirements.

* Make per-python-script requirements work alone

This doesn't break the main requirements.txt.

* Add comment

* Add convert-persimmon-to-gguf.py to new requirements.txt scheme

* Add check-requirements.sh script and GitHub workflow

* Remove shellcheck installation step from workflow

* Add nocleanup special arg

* Fix merge

see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573

* reset to upstream/master

* Redo changes for cancelling model load

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
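
The cancellation rule from this commit (a progress callback that returns false aborts the load) can be sketched in Python as below. `load_all_data` here is a hypothetical stand-in that only borrows the name; it is not the llama.cpp function, and the "tensors" are plain list items.

```python
def load_all_data(tensors, progress_callback=None):
    """Load items one by one and stop early if the callback returns False.

    Hypothetical stand-in for a model loader. Returns (ok, loaded), where
    ok is False when the callback asked to cancel.
    """
    loaded = []
    total = len(tensors)
    for i, tensor in enumerate(tensors, start=1):
        loaded.append(tensor)  # placeholder for the real read/copy work
        if progress_callback is not None and not progress_callback(i / total):
            return False, loaded  # caller asked to abort
    return True, loaded


# Cancel as soon as progress reaches 50%.
ok, loaded = load_all_data(["a", "b", "c", "d"], lambda progress: progress < 0.5)
```

A callback that always returns True keeps the old behaviour of loading to completion.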

ggml : extend `enum ggml_log_level` with `GGML_LOG_LEVEL_DEBUG` (#4579)

readme : add zig bindings (#4581)

ci : tag docker image with build number (#4584)

make : add LLAMA_HIP_UMA option (#4587)

NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA

ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203)

llama : fix platforms without mmap (#4578)

* llama : fix platforms without mmap

* win32 : limit prefetch size to the file size

* fix win32 error clobber, unnecessary std::string in std::runtime_error

Fix CudaMemcpy direction (#4599)

cuda : fix jetson compile error (#4560)

* fix old jetson compile error

* Update Makefile

* update jetson detect and cuda version detect

* update cuda macro define

* update makefile and cuda, fix some issues

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update Makefile

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

sync : ggml (fix im2col) (#4591)

* cuda : fix im2col_f32_f16 (ggml/#658)

ggml-ci

* ggml-alloc : fix ggml_tallocr_is_own

---------

Co-authored-by: leejet <leejet714@gmail.com>

lookup : add prompt lookup decoding example (#4484)

* initial commit, going through initializations

* main loop finished, starting to debug

* BUG: generates gibberish/repeating tokens after a while

* kv_cache management

* Added colors to distinguish drafted tokens (--color). Updated README

* lookup : fix token positions in the draft batch

* lookup : use n_draft from CLI params

* lookup : final touches

---------

Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

CUDA: fixed row rounding for 0 tensor splits (#4594)

grammar : check the full vocab only if necessary (opt) (#4306)

* Check the full vocab for grammar only if necessary

* Fix missing logit restoration step (?)

Does this matter, actually?

* Fix whitespace / formatting

* Adjust comment

* Didn't mean to push test gbnf

* Split sampling into the helper function (?)

And also revert the changes made to the header

* common : fix final newline

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

server : allow to specify custom prompt for penalty calculation (#3727)

ci(docker): fix tags in "Build and push docker image (tagged)" (#4603)

fallback to CPU buffer if host buffer alloc fails (#4610)

cuda : improve cuda pool efficiency using virtual memory (#4606)

* cuda : improve cuda pool efficiency using virtual memory

* fix mixtral

* fix cmake build

* check for vmm support, disable for hip

ggml-ci

* fix hip build

* clarify granularity

* move all caps to g_device_caps

* refactor error checking

* add cuda_pool_alloc, refactor most pool allocations

ggml-ci

* fix hip build

* CUBLAS_TF32_TENSOR_OP_MATH is not a macro

* more hip crap

* llama : fix msvc warnings

* ggml : fix msvc warnings

* minor

* minor

* cuda : fallback to CPU on host buffer alloc fail

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ensure allocations are always aligned

* act_size -> actual_size

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

llama : add PLaMo model (#3557)

* add plamo mock

* add tensor loading

* plamo convert

* update norm

* able to compile

* fix norm_rms_eps hparam

* runnable

* use inp_pos

* seems ok

* update kqv code

* remove develop code

* update README

* shuffle attn_q.weight and attn_output.weight for broadcasting

* remove plamo_llm_build_kqv and use llm_build_kqv

* fix style

* update

* llama : remove obsolete KQ_scale

* plamo : fix tensor names for correct GPU offload

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

simplify bug issue template (#4623)

Adding Emeltal reference to UI list (#4629)

Fix new CUDA10 compilation errors (#4635)

Update comment for AdamW implementation reference. (#4604)

Co-authored-by: Will Findley <findley@gmail.com>

cuda : fix vmm pool with multi GPU (#4620)

* cuda : fix vmm pool with multi GPU

* hip

* use recommended granularity instead of minimum

* better error checking

* fix mixtral

* use cudaMemcpy3DPeerAsync

* use cuda_pool_alloc in ggml_cuda_op_mul_mat

* consolidate error checking in ggml_cuda_set_device

* remove unnecessary inlines

ggml-ci

* style fixes

* only use vmm for the main device

* fix scratch buffer size, re-enable vmm pool for all devices

* remove unnecessary check id != g_main_device

Add byte token type when tokenizer.model does not exist (#4641)

* Add byte token type to hf format

* remove unused variable

ggml : fix dot product for ARM (#4630)

ggml-ci

scripts : add sync-ggml-am.sh

finetune : fix output formatting in print_params (#4653)

This commit fixes the output formatting in the print_params function
which currently looks like this:
```console
print_params: n_vocab:   32000
print_params: n_ctx:     128
print_params: n_embd:    4096
print_params: n_ff:      11008
print_params: n_head:    32
print_params: n_head_kv: 32
print_params: n_layer:   32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```
With this commit the output will look like this:
```console
print_params: n_vocab               : 32000
print_params: n_ctx                 : 128
print_params: n_embd                : 4096
print_params: n_ff                  : 11008
print_params: n_head                : 32
print_params: n_head_kv             : 32
print_params: n_layer               : 32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```
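
The aligned output amounts to padding every name to one fixed column width before the colon. Here is a rough Python sketch of that idea. The width of 22 is read off the sample output above and is an assumption, and `format_param` is a hypothetical helper, not the finetune code.

```python
def format_param(name, value, width=22):
    """Format one line with the name left-aligned in a fixed-width column.

    width=22 is inferred from the sample output, not taken from the source.
    """
    # Floats use fixed notation with six decimals, like C's "%f".
    text = f"{value:f}" if isinstance(value, float) else str(value)
    return f"print_params: {name:<{width}}: {text}"


for name, value in [("n_vocab", 32000), ("n_ctx", 128), ("norm_rms_eps", 1e-5)]:
    print(format_param(name, value))
```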

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)

* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for both mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

---------

Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

gpt2 : Add gpt2 architecture integration (#4555)

Fix OpenAI server sampling w.r.t. temp and seed (#4668)

The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.

The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.

See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
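
To see why a zero threshold collapses the candidate list, consider a simplified cumulative-cutoff filter. This is only a stand-in: tail-free and locally typical sampling rank candidates differently, but both stop keeping candidates once a running total passes the threshold. The function name and the probabilities are invented for illustration.

```python
def cumulative_cutoff(probs, threshold, min_keep=1):
    """Keep the most probable candidates until their total exceeds threshold.

    Simplified stand-in for threshold-style samplers, not the llama.cpp code.
    A threshold >= 1.0 disables the filter in this sketch.
    """
    if threshold >= 1.0:
        return list(probs)
    kept, total = [], 0.0
    for p in sorted(probs, reverse=True):
        kept.append(p)
        total += p
        if total > threshold and len(kept) >= min_keep:
            break
    return kept


probs = [0.5, 0.3, 0.2]
only_one = cumulative_cutoff(probs, 0.0)    # zero threshold: one candidate left
everything = cumulative_cutoff(probs, 1.0)  # filter disabled: all candidates kept
```

With a single candidate left there is nothing to sample from, which matches the behaviour described above.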

scripts : do not sync commits from this repo

ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669)

* fixed mul-mat error for old GPUs

* style fixes

* add mul mat src1 f16 test cases, fix more cases

ggml-ci

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

sync : ggml

ci : build with CLBlast + ggml-opencl use GGML_API (whisper/1576)

* Build with CLBlast

* Declare GGML_API

After rebasing, examples/talk-llama failed:

"D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) ->
"D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) ->
(Link target) ->
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context *,void (__cdecl*)(float,void *),void *,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]

scripts : print list of sync commits

llama.swiftui : fix infinite loop, output timings, buff UI (#4674)

* fix infinite loop

* slight UI simplification, clearer UX

* clearer UI text, add timings to completion log

main-cmake-pkg : fix build issue (#4665)

* Fix main-cmake-pkg compilation

* Use glob to load common files

* cmake : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

server : allow to generate multimodal embeddings (#4681)

server : fix OpenAI server sampling w.r.t. penalty. (#4675)

server : replace sleep with condition variables (#4673)

The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency, since most sleep implementations round up to
the system scheduling quantum (usually 10ms). Other libc sleep
implementations spin for smaller time intervals, which results in the
server's busy loop consuming all available CPU. The explicit notify() /
wait() code also makes the server code easier to read.

See mozilla-Ocho/llamafile@711344b
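
The notify() / wait() pattern can be sketched with Python's `threading.Condition`. This illustrates the general technique under simplified assumptions; `TaskQueue` and the task string are hypothetical, and the server's own implementation is in C++.

```python
import threading
from collections import deque


class TaskQueue:
    """Queue whose consumer blocks on a condition variable instead of polling."""

    def __init__(self):
        self._tasks = deque()
        self._cond = threading.Condition()

    def post(self, task):
        with self._cond:
            self._tasks.append(task)
            self._cond.notify()  # wake one waiting consumer right away

    def take(self, timeout=None):
        """Block until a task is available; return None if the timeout expires."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._tasks) > 0, timeout=timeout):
                return None
            return self._tasks.popleft()


queue = TaskQueue()
results = []
worker = threading.Thread(target=lambda: results.append(queue.take(timeout=5)))
worker.start()
queue.post("completion-request")
worker.join()
```

The consumer sleeps inside `wait_for` and uses no CPU until `post` notifies it, so there is no fixed polling interval to add latency.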

llava-cli : refactor to use sampling library (#4669)

This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup, deleting a
long-standing TODO comment.

This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.

See Mozilla-Ocho/llamafile@1cd334f

cmake : fix ld warning duplicate libraries libllama.a (#4671)

* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'"

* fix warning in example.

flake.nix : rewrite (#4605)

* flake.lock: update to hotfix CUDA::cuda_driver

Required to support https://github.com/ggerganov/llama.cpp/pull/4606

* flake.nix: rewrite

1. Split into separate files per output.

2. Added overlays, so that this flake can be integrated into others.
   The names in the overlay are `llama-cpp`, `llama-cpp-opencl`,
   `llama-cpp-cuda`, and `llama-cpp-rocm` so that they fit into the
   broader set of Nix packages from [nixpkgs](https://github.com/nixos/nixpkgs).

3. Use [callPackage](https://summer.nixos.org/blog/callpackage-a-tool-for-the-lazy/)
   rather than `with pkgs;` so that there's dependency injection rather
   than dependency lookup.

4. Add a description and meta information for each package.
   The description includes a bit about what's trying to accelerate each one.

5. Use specific CUDA packages instead of cudatoolkit on the advice of SomeoneSerge.

6. Format with `serokell/nixfmt` for a consistent style.

7. Update `flake.lock` with the latest goods.

* flake.nix: use finalPackage instead of passing it manually

* nix: unclutter darwin support

* nix: pass most darwin frameworks unconditionally

...for simplicity

* *.nix: nixfmt

nix shell github:piegamesde/nixfmt/rfc101-style --command \
    nixfmt flake.nix .devops/nix/*.nix

* flake.nix: add maintainers

* nix: move meta down to follow Nixpkgs style more closely

* nix: add missing meta attributes

nix: clarify the interpretation of meta.maintainers

nix: clarify the meaning of "broken" and "badPlatforms"

nix: passthru: expose the use* flags for inspection

E.g.:

```
❯ nix eval .#cuda.useCuda
true
```

* flake.nix: avoid re-evaluating nixpkgs too many times

* flake.nix: use flake-parts

* nix: migrate to pname+version

* flake.nix: overlay: expose both the namespace and the default attribute

* ci: add the (Nix) flakestry workflow

* nix: cmakeFlags: explicit OFF bools

* nix: cuda: reduce runtime closure

* nix: fewer rebuilds

* nix: respect config.cudaCapabilities

* nix: add the impure driver's location to the DT_RUNPATHs

* nix: clean sources more thoroughly

...this way outPaths change less frequently,
and so there are fewer rebuilds

* nix: explicit mpi support

* nix: explicit jetson support

* flake.nix: darwin: only expose the default

---------

Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>

python : add check-requirements.sh and GitHub workflow (#4585)

* python: add check-requirements.sh and GitHub workflow

This script and workflow force package versions to remain compatible
across all convert*.py scripts, while allowing secondary convert scripts
to import dependencies not wanted in convert.py.

* Move requirements into ./requirements

* Fail on "==" being used for package requirements (but can be suppressed)

* Enforce "compatible release" syntax instead of ==

* Update workflow

* Add upper version bound for transformers and protobuf

* improve check-requirements.sh

* small syntax change

* don't remove venvs if nocleanup is passed

* See if this fixes docker workflow

* Move check-requirements.sh into ./scripts/

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
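
The rule "compatible release instead of ==" can be illustrated with a small Python checker. This is a sketch of the idea only, not a port of check-requirements.sh. The helper name and the package lines are made up, and the suppression mechanism mentioned above is left out.

```python
import re

# Matches "name==version" but not "name===version" (arbitrary equality).
PINNED = re.compile(r"^\s*([A-Za-z0-9_.\-\[\]]+)\s*==(?!=)\s*(\S+)")


def suggest_compatible(lines):
    """Return (original, suggestion) pairs for every '==' requirement line."""
    suggestions = []
    for line in lines:
        requirement = line.split("#", 1)[0].strip()  # drop trailing comments
        match = PINNED.match(requirement)
        if match:
            suggestions.append((requirement, f"{match.group(1)}~={match.group(2)}"))
    return suggestions


found = suggest_compatible([
    "numpy==1.24.4",
    "torch~=2.1.1",
    "# a comment line",
    "sentencepiece==0.1.98  # pinned",
])
```

Under PEP 440, `numpy~=1.24.4` accepts any 1.24.x release at or above 1.24.4, so patch updates stay installable while the per-script requirement files remain mutually compatible.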

cuda: fix vmm oom issue on NVIDIA AGX Orin (#4687)

Signed-off-by: hydai <hydai@secondstate.io>

clip : enable gpu backend (#4205)

* clip: enable CUDA backend

* add missing kernels

* add enough padding for alignment

* remove ggml_repeat of clip.cpp

* add metal backend

* llava : fixes

- avoid ggml_repeat
- use GGML_USE_ instead of CLIP_USE_ macros
- remove unused vars

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

clip : use ggml_backend_buffer_is_host (#4205)

CUDA: fix tensor core logic for Pascal and HIP (#4682)

ggml : add ggml_cpu_has_avx_vnni() (#4589)

* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation update

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

CUDA: fixed tensor cores not being used on RDNA3 (#4697)

clip : refactor + bug fixes (#4696)

* clip : refactor + bug fixes

ggml-ci

* server : add log message

ggml : add ggml_vdotq_s32 alias (#4715)

ggml-ci

flake.nix: expose full scope in legacyPackages

flake.nix: rocm not yet supported on aarch64, so hide the output

flake.nix: expose checks

workflows: nix-ci: init; build flake outputs

workflows: nix-ci: add a job for eval

workflows: weekly `nix flake update`

workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

workflows: nix-ci: add a qemu job for jetsons

flake.nix: suggest the binary caches

flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

metal : enable shader debugging (cmake option) (#4705)

* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (#4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

ggml-ci

finetune: fix typo in README.md (#4733)

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

py : re-enable mmap in convert hf (#4732)

* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for both mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

* fix: update torch version

---------

Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

server : add --override-kv parameter (#4710)

* Changes to server to allow metadata override

* documentation

* flake.nix: expose full scope in legacyPackages

* flake.nix: rocm not yet supported on aarch64, so hide the output

* flake.nix: expose checks

* workflows: nix-ci: init; build flake outputs

* workflows: nix-ci: add a job for eval

* workflows: weekly `nix flake update`

* workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

* workflows: nix-ci: add a qemu job for jetsons

* flake.nix: suggest the binary caches

* flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

---------

Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>

editorconfig : fix whitespace and indentation #4710

llama : differentiate the KV dims in the attention (#4657)

* Add n_key_dim and n_value_dim

Some models use values that are not derived from `n_embd`.
Also remove `n_embd_head` and `n_embd_gqa` because it is not clear
which "head" is referred to (key or value).

Fix issue #4648.

* Fix `llm_build_kqv` to use `n_value_gqa`

* Rebase

* Rename variables

* Fix llm_build_kqv to be more generic wrt n_embd_head_k

* Update default values for n_embd_head_k and n_embd_head_v

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix llm_load_tensors: the asserts were not backcompat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
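
A rough Python sketch of keeping the key and value head sizes separate is shown below. The field names follow the commit messages above. The fallback to `n_embd // n_head`, the helper methods, and all numbers are assumptions made for illustration, not the llama.cpp definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HParams:
    n_embd: int
    n_head: int
    n_head_kv: int
    n_embd_head_k: Optional[int] = None  # per-head key dimension
    n_embd_head_v: Optional[int] = None  # per-head value dimension

    def __post_init__(self):
        # Assumed default: derive both head sizes from n_embd unless overridden.
        derived = self.n_embd // self.n_head
        if self.n_embd_head_k is None:
            self.n_embd_head_k = derived
        if self.n_embd_head_v is None:
            self.n_embd_head_v = derived

    def n_embd_k_gqa(self):
        """Total key width across all KV heads."""
        return self.n_embd_head_k * self.n_head_kv

    def n_embd_v_gqa(self):
        """Total value width across all KV heads."""
        return self.n_embd_head_v * self.n_head_kv


classic = HParams(n_embd=4096, n_head=32, n_head_kv=32)
custom = HParams(n_embd=2048, n_head=8, n_head_kv=1,
                 n_embd_head_k=256, n_embd_head_v=128)
```

In the second configuration neither width can be recovered from `n_embd` alone, which is the situation the commit describes.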

llama : replace all API facing `int`'s with `int32_t` (#4577)

* replaced all API facing `int`'s with `int32_t`

* formatting and missed `int` in `llama_token_to_piece`

llama : llama_model_desc print number of experts

server : add token counts to html footer (#4738)

* server: add token counts to stats

* server: generate hpp

---------

Co-authored-by: phiharri <ph@got-root.co.uk>

metal : optimize ggml_mul_mat_id (faster Mixtral PP) (#4725)

* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (#4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

* metal : optimizing ggml_mul_mat_id (wip)

* metal : minor fix

* metal : opt mul_mm_id

server : throw an error when `slot unavailable` (#4741)

ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639)

* add more int ops

* ggml_compute_forward_dup_bytes

* add tests

* PR comments

* tests : minor indentations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

scripts : fix sync order + metal sed

metal : add kernel_get_rows_i32

ggml-ci

sync : ggml

ggml-ci

cuda : mark I16 and I32 ops as unsupported

ggml-ci

cuda : simplify expression

Co-authored-by: slaren <slarengh@gmail.com>

swift : update Package.swift to use ggml as dependency (#4691)

* updates the package.swift to use ggml as dependency

* changes the ggml package url src to ggerganov

train : fix typo in overlapping-samples help msg (#4758)

This commit fixes a typo in the help message for the
--overlapping-samples option.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

llama.swiftui : fix build of ggml.metallib (#4754)

* metal: fix metal backend init failure in swiftui

* metal: build ggml.metallib instead of copy src

* llama.swift : remove debug flags from metallib build

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggml : include stdlib.h before intrin.h (#4736)

server : fix options in README.md (#4765)

* fix examples/server/README.md

* minor : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

llama.swiftui : support loading custom model from file picker (#4767)

* swiftui: support load model from file picker

* swiftui: remove trailing whitespace

Print backend name on test-backend-ops failure (#4751)

server : send token probs for "stream == false" (#4714)

finetune : remove unused includes (#4756)

This commit removes unused includes from finetune.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

examples : add few-shot translation example (#4783)

ggml : do not sched_yield when calling BLAS (#4761)

* ggml : do not sched_yield when calling BLAS

ggml-ci

* ggml : fix do_yield logic

ggml-ci

* ggml : simplify do_yield logic

ggml-ci

ggml : add error handling to graph_compute (whisper/1714)

ggml : fix q2_k bpw in comments (ggml/680)

metal : switch back to default.metallib (ggml/681)

ggml-ci

flake.nix : fix typo (#4700)

betwen -> between

cmake : check for openblas64 (#4134)

openblas v0.3.22 64-bit pkg-config file is named openblas64.pc
https://github.com/OpenMathLib/OpenBLAS/issues/3790

examples : improve base-translate.sh script (#4783)

llama.swiftui : use correct pointer for llama_token_eos (#4797)

server : fix n_predict check (#4798)

ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (#4787)

llama.swiftui : add visionOS target (#4805)

llama : print tensor meta for debugging

llama.swiftui : use llama.cpp as SPM package (#4804)

llama : remove redundant GQA check (#4796)

llama : remove unused vars (#4796)

CUDA: fixed redundant value dequantization (#4809)

llama-bench : add no-kv-offload parameter (#4812)

readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)

examples : add passkey test (#3856)

* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* make : add passkey target

* passkey : add "self-extend"-like context extension (#4810)

* llama : "self-extend"-like context extension

* passkey : add comment

* passkey : add readme

main : add self-extend support (#4815)

* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* llama : "self-extend"-like context extension

* passkey : add comment

* main : add Self-Extend support

* llama : add comment about llama_kv_cache_seq_div

llama.swiftui : update readme

swift : exclude ggml-metal.metal from the package (#4822)

SOTA 2-bit quants (#4773)

* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slightly faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in dequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
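
The Q8_K change above restricts quants to the symmetric range -127...127. The Python sketch below shows generic symmetric 8-bit quantization with one shared scale. It illustrates the range only and is not the ggml Q8_K routine; the function name and sample values are hypothetical.

```python
def quantize_symmetric_int8(values):
    """Quantize floats to integers in [-127, 127] with one shared scale.

    Generic illustration, not the ggml implementation. Returns (scale, quants)
    such that value is approximately scale * quant.
    """
    amax = max((abs(v) for v in values), default=0.0)
    if amax == 0.0:
        return 0.0, [0] * len(values)
    scale = amax / 127.0
    # Clamp after rounding so floating-point error cannot leave the range.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


scale, quants = quantize_symmetric_int8([-2.0, 0.0, 0.5, 2.0])
```

Because -128 is never produced, every quant can be negated without leaving the signed 8-bit range.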

readme : add link to SOTA models

common : fix the short form of `--grp-attn-w`, not `-gat` (#4825)

See https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp#L230C53-L230C57

CUDA: faster softmax via shared memory + fp16 math (#4742)

ggml : fix vld1q_s8_x4 32-bit compat (#4828)

* ggml : fix vld1q_s8_x4 32-bit compat

ggml-ci

* ggml : fix 32-bit ARM compat (cont)

ggml-ci

server : add api-key flag to documentation (#4832)

Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441

server : update readme about token probs (#4777)

* updated server readme to reflect the gg/server-token-probs-4088 commit

added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`.

* simplified the `completion_probabilities` JSON schema

It's now easier to understand what the structure of `completion_probabilities` looks like.

* minor : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

scripts : script to get Paul Graham essays in txt format (#4838)

readme : add 3rd party collama reference to UI list (#4840)

Add a VSCode extension for llama.cpp reference to UI list

scripts : improve get-pg.sh (#4838)

metal : improve dequantize precision to match CPU (#4836)

ggml-ci

llava-cli : don't crash if --image flag is invalid (#4835)

This change fixes an issue where supplying `--image missing-file` would
result in a segfault due to a null pointer being dereferenced. This can
result in distracting info being printed if robust crash analysis tools
are being used.

convert.py : fix vanilla LLaMA model conversion (#4818)

* Update Imports and Add Notes for Future Reference

- Updated import statements in `convert.py`.
- Added import for `AutoTokenizer` from `transformers` module.
- Added conditional import for `gguf` from the local directory.
- Added comments and notes for future reference.

Additional Notes:

- Noted removal of a redundant `TypeAlias` import.
- Noted the removal of a `gguf` debug statement.
- Commented on the presence of `ARCH` and `NDArray` definitions.
- Commented on cleaning up and refactoring data type definitions.

* Refine Model Hyperparameters and Params Class

- Updated type annotations to use `Optional` for clarity.
- Improved method names and attribute consistency.
- Removed unnecessary variables for better code readability.

Additional Notes:

- Highlighted the use of `Optional` for clearer intent.
- Ensured backward and forward compatibility.

* Restore BpeVocab and SentencePieceVocab classes

- Restored the BpeVocab class for handling BPE tokenization.
- Restored the SentencePieceVocab class for SentencePiece tokenization.

These classes are essential for maintaining the original behavior of the codebase.

* refactor: Standardize vocabulary handling with HfVocab

- Replaced VocabLoader with HfVocab, aligning vocabulary handling across classes.
- Updated initialization of HfVocab with local_files_only=True for AutoTokenizer.
- Introduced optional parameter fname_added_tokens for flexible added token management.
- Streamlined added token handling for clarity and conciseness.
- Maintained special tokens and IDs, enhancing token management.
- Simplified token processing methods for improved readability.
- Added a placeholder for score computation with a default value of -1000.0.
- Optimized newline token check for efficiency.
- Updated __repr__ function for clarity in representation.
- Adjusted type alias Vocab to include BpeVocab, SentencePieceVocab, and HfVocab.
- Removed redundant code related to special token handling, reverse vocabulary mapping, and vocabulary file detection.

This refactoring promotes a standardized and modular approach to vocabulary management, facilitating future integration with a VocabFactory and improving code maintainability and scalability.

* refactor: Enhance readability, functionality, and code quality

- Improved code formatting and readability for better maintainability.
- Refactored LazyUnpickler's CLASSES dictionary for clarity.
- Added print statements and warnings in check_vocab_size for user feedback.
- Removed find_vocab_file_path, as it's superseded by VocabFactory.
- Preparatory changes for upcoming classes: OutputFile and VocabFactory.
- Overall focus on code quality, error handling, and consistency.

These changes reflect a continuous effort to refine the codebase, ensuring it meets best practices and prepares for future enhancements, such as the VocabFactory.

* refactor: Update OutputFile class for enhanced model vocabulary management

- Restructured the constructor for improved readability.
- Updated `add_meta_arch` method for flexible model name determination.
- Introduced `handle_tokenizer_model` for mapping vocab types to supported tokenizer models.
- Streamlined vocabulary extraction with `extract_vocabulary_from_model`.
- Simplified vocabulary metadata addition using `add_meta_vocab`.
- Refactored `add_tensor_info` for clarity and consistency.
- Improved error handling for better user feedback.

These changes signify the development of a versatile and comprehensive `OutputFile` class, enabling efficient management of model conversion output, metadata, vocabulary, and tensor information.

* feat: Introduce VocabFactory for flexible vocabulary management in model conversion

- The VocabFactory class is added to facilitate modular vocabulary handling.
- The constructor initializes a directory path and detects vocabulary-related files.
- The _select_file method provides file paths based on vocabulary type (e.g., BPE, SentencePiece).
- _create_special_vocab generates special vocabularies, accommodating different types.
- The load_vocab method loads vocabularies, handling BPE, SentencePiece, and Hugging Face Fast Tokenizer.
- Error handling and logging enhance debugging and user feedback.
- The modular and flexible design simplifies vocabulary management and supports future extensions.

The VocabFactory class enhances code modularity and maintainability, allowing versatile vocabulary handling in the model conversion process.

* refactor: Improve code organization, argument parsing, and user interface

- Renamed 'default_outfile' to 'default_output_file' for clarity.
- Refactored argument parser setup into 'get_argument_parser' function.
- Introduced descriptive comments for each argument in the parser.
- Added '--vocab-type' argument with choices ["spm", "bpe", "hfft"] for vocabulary processing.
- Improved flag naming consistency: '--outfile' to '--out-file' and '--bigendian' to '--big-endian'.
- Enhanced error handling to prevent overwriting input data in 'default_output_file'.
- Made 'argv' in 'main' an optional parameter for flexibility.
- Introduced dynamic import for 'awq.apply_awq' based on 'args.awq_path' for conditional dependency.

These changes enhance code clarity, organization, and the user interface of the script, aligning it with Python best practices and improving maintainability.

* refactor: Further refine functionality, improve user interaction, and streamline vocabulary handling

- Renamed command-line arguments for clarity and consistency.
- Improved path resolution and import adjustments for robustness.
- Thoughtfully handled 'awq-path' and conditional logic for the weighted model.
- Enhanced model and vocabulary loading with the 'VocabFactory' class for structured and adaptable loading.
- Strengthened error handling and user feedback for a more user-friendly experience.
- Structured output file handling with clear conditions and defaults.
- Streamlined and organized the 'main' function for better logic flow.
- Passed 'sys.argv[1:]' to 'main' for adaptability and testability.

These changes solidify the script's functionality, making it more robust, user-friendly, and adaptable. The use of the 'VocabFactory' class is a notable enhancement in efficient vocabulary handling, reflecting a thoughtful and iterative approach to script development.

* chore: Apply ruff formatting to convert.py

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* Revert to commit 0614c33

* chore: Apply flake8 formatting rules

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* refactor: Revise `check_vocab_size` for Enhanced Clarity and Correctness

- Resolved an unreachable branch issue by reorganizing the conditional structure.
- Moved the special case check for `params.n_vocab == -1` to the top for immediate assertion.
- Flattened the conditional logic for improved clarity and predictability of the function's behavior.

These changes enhance the readability and functional correctness of the `check_vocab_size` function without altering its intended functionality.

* py : fix outfile and outtype

* py : suggest hint for missing vocab size

---------

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
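
The command-line changes listed in the `convert.py` commit above can be illustrated with a small `argparse` sketch. Only the flag spellings, the `--vocab-type` choices, the `get_argument_parser` name, and the optional `argv` parameter come from the commit message. The positional `model` argument, the help strings, and the default vocab type are assumptions made for illustration, and the later "py : fix outfile and outtype" bullet suggests the final flag spellings may differ.

```python
import argparse
from typing import Optional, Sequence


def get_argument_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction: only flag names and choices are taken from the commit message.
    parser = argparse.ArgumentParser(description="Model conversion (illustrative sketch)")
    parser.add_argument("model", help="path to the model (assumed positional argument)")
    parser.add_argument("--vocab-type", choices=["spm", "bpe", "hfft"], default="spm",
                        help="vocabulary format to load (the default is an assumption)")
    parser.add_argument("--out-file", help="path to write the converted model to")
    parser.add_argument("--big-endian", action="store_true",
                        help="write the output in big-endian byte order")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    # With argv=None, argparse falls back to sys.argv[1:]; tests can pass a list instead.
    return get_argument_parser().parse_args(argv)


args = main(["models/7B", "--vocab-type", "bpe", "--big-endian"])
print(args.vocab_type, args.big_endian, args.out_file)
```

Making `argv` optional is what lets the parser be exercised from a test without touching `sys.argv`.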

Python script to compare commits with llama-bench (#4844)

clip : support more quantization types (#4846)

Uses ggml functions instead of hardcoded names and adds support to quantize into the modern Q-K variants.
This is just the bare minimum to get k-types working - a more refined choice of types would be needed to get best quality on low quantizations.

I ran a few tests, it doesn't break anything I could notice and a Q6_K ViT works almost as well as Q8_0 but 3 times the inference speed.

llama : recognize 1B phi models (#4847)

This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with different Phi model variants without impacting existing Phi-2 model functionality.

llama : add additional suffixes for model params (#4834)

* llm_load_print_meta: Add additional suffixes for model params

* Update llama.cpp model param log

remove unneeded comments and convert from > to >=

server : add a `/health` endpoint (#4860)

* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line
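
The server-state handling described in the `/health` commit above can be sketched as a tiny state machine. The state names `LOADING_MODEL`, `READY`, and `ERROR` come from the commit message; the HTTP status codes and JSON bodies below are assumptions for illustration, not the server's documented responses.

```python
from enum import Enum
from typing import Dict, Tuple


class ServerState(Enum):
    LOADING_MODEL = "loading model"
    READY = "ready"
    ERROR = "error"


def after_model_load(load_ok: bool) -> ServerState:
    # The server starts in LOADING_MODEL and moves to READY or ERROR once loading finishes.
    return ServerState.READY if load_ok else ServerState.ERROR


def health(state: ServerState) -> Tuple[int, Dict[str, str]]:
    # Status codes and bodies are assumed values chosen for this sketch.
    if state is ServerState.READY:
        return 200, {"status": "ok"}
    if state is ServerState.LOADING_MODEL:
        return 503, {"status": "loading model"}
    return 500, {"status": "error"}


state = ServerState.LOADING_MODEL
print(health(state))
state = after_model_load(load_ok=True)
print(health(state))
```

Starting the HTTP server before the model is initialized, as one of the bullets notes, is what makes the `LOADING_MODEL` response observable at all.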

server : fix build + rename enums (#4870)

server : update readme to document the new `/health` endpoint (#4866)

* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line

* updated `server` readme to document the `/health` endpoint too

fix : cuda order of synchronization when setting a buffer (ggml/679)

* fix : cuda order of synchronization when setting a buffer

* also sync before memcpy

---------

Co-authored-by: slaren <slarengh@gmail.com>

Fix execlp call (ggml/689)

NULL can be an integer constant expression with the value zero; in that case the behavior would be undefined because an argument of the wrong type is passed to the variable arguments.

ggml : change GGML_MAX_NAME at compile time (ggml/682)

* change GGML_MAX_NAME to 128

* allow controlling the value of GGML_MAX_NAME through external macro definitions

metal : wrap each operation in debug group (ggml/690)

ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693)

metal : fix deprecation warning (ggml/690)

sync : ggml

metal : put encoder debug group behind a define (#4873)

server : fix typo in model name (#4876)

main : print total token count and tokens consumed so far (#4874)

* Token count changes

* Add show token count

* Updating before PR

* Two requested changes

* Move param def posn

ci: nix-flake-update: new token with pr permissions (#4879)

* ci: nix-flake-update: new token with pr permissions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

server : add `LOG_INFO` when model is successfully loaded (#4881)

* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line

* updated `server` readme to document the `/health` endpoint too

* used LOG_INFO after successful model loading

server : support for multiple api keys (#4864)

* server: added support for multiple api keys, added loading api keys from file

* minor: fix whitespace

* added file error handling to --api-key-file, changed code to better
reflect current style

* server: update README.md for --api-key-file

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
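
A rough sketch of how multiple API keys loaded from a file might be checked. The `validate_api_key` name appears in the CORS commit that follows; everything else here is an assumption made for illustration: the one-key-per-line file format, the `Bearer` header scheme, the helper name `load_api_keys`, and the rule that an empty key list disables authentication.

```python
import hmac
from typing import Iterable, List


def load_api_keys(lines: Iterable[str]) -> List[str]:
    # Assumed file format: one key per line, blank lines ignored.
    return [line.strip() for line in lines if line.strip()]


def validate_api_key(auth_header: str, api_keys: List[str]) -> bool:
    # Assumption: with no keys configured, every request is accepted.
    if not api_keys:
        return True
    prefix = "Bearer "
    if not auth_header.startswith(prefix):
        return False
    received = auth_header[len(prefix):].encode()
    # compare_digest avoids leaking how many leading characters matched via timing.
    return any(hmac.compare_digest(received, key.encode()) for key in api_keys)


keys = load_api_keys(["key-one\n", "\n", "key-two\n"])
print(validate_api_key("Bearer key-two", keys))
print(validate_api_key("Bearer wrong", keys))
```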

server : implement credentialed CORS (#4514)

* Implement credentialed CORS according to MDN

* Fix syntax error

* Move validate_api_key up so it is defined before its first usage

swift : pin ggml commit + remove ggml.h from spm-headers (#4878)

ggml-ci

ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)

* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dot product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

llama : restore intended k-quants mixes for MoE models (#4872)

* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

swift : track ggml release branch (#4867)

main : disable token count by default (#4874)

main : better name for variable n_print (#4874)

server : fix infill when prompt is empty (#4833)

Importance Matrix calculation (#4861)

* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

llama : fix llm_build_k_shift to use correct n_rot (#4889)

* llama : fix llm_build_k_shift to use correct n_rot

ggml-ci

* llama : always use hparams.n_rot for ggml_rope_custom

ggml-ci

* convert : fix persimmon conversion to write correct n_rot

py : fix lint (#4889)

common : streamline the formatting of help (#4890)

* common : streamline the formatting of help

- Separate alternative parameters by a comma

- Do not indent `--version` differently

* Update common/common.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

llama : fix typo "imp_embd" -> "inp_embd"

CUDA: fix softmax compile for old CUDA versions (#4862)

gitignore : imatrix

llama.swiftui : update models layout (#4826)

* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout

export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)

This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

llama : remove redundant assert for StableLM (#4901)

llama : ggml-backend integration (#4766)

* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

CUDA: faster q8_0 -> f16 dequantization (#4895)

examples : add pydantic models to GBNF grammar generator (#4883)

* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.

backend_sched : fix assignments

ggml-ci

ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)

* ggml : fix 32-bit ARM compat

* ggml : fix fix

* ggml : fix fix fix

sync : ggml

convert : update phi-2 to latest HF repo (#4903)

* convert : update phi-2 to latest HF repo

ggml-ci

* py : try to fix flake stuff

server : fix crash with multimodal models without BOS token (#4904)

server : fix deadlock that occurs in multi-prompt scenarios (#4905)

* fix deadlock

* don't ruin all whitespace

compare-llama-bench: tweak output format (#4910)

metal : refactor kernel loading code (#4794)

* metal : detect more GPU families

* metal : refactor kernel loading

* metal : set kernel family requirements

* metal : fix kernel init + fix compile options

* metal : take into account simdgroup reduction support

* metal : print only skipped kernels

* metal : fix check for simdgroup reduction support

* metal : check for Metal 3

* metal : free allocations

* metal : normalize encoder:setComputePipelineStatus calls

ggml-ci

* metal : fix Metal3 family check

ggml-ci

* metal : check for simdgroup matrix mul. feature

ggml-ci

gguf : fix potential infinite for-loop (#4600)

Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>

main : add parameter --no-display-prompt (#4541)

* add the parameter --no-display-prompt; combined with --log-disable, it displays only the generated tokens

* remove empty line

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

workflows: unbreak nix-build-aarch64, and split it out (#4915)

The fix should be just the `sudo apt-get update`

llama : minimize size used for state save/load (#4820)

* examples : save-load-state: save only required state

* llama : only reserve n_vocab * n_batch at most for logits

llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.

* llama : always reserve n_vocab * n_batch for logits

llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.

* llama : only save and restore used logits

for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.

* llama : use ostringstream and istringstream for save and load

* llama : serialize rng into minimum amount of space required

* llama : break session version due to serialization changes
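
The "around 62 MB" figure quoted in the state save/load commit above can be reproduced with back-of-envelope arithmetic. The sketch assumes 32-bit float logits and a 32000-token vocabulary (typical for LLaMA models); neither number is stated in the commit message, and the best case is taken to be a state that keeps only one row of logits.

```python
def logits_buffer_bytes(n_vocab: int, n_batch: int, bytes_per_logit: int = 4) -> int:
    # Size of a logits buffer holding one row of n_vocab values per token in the batch.
    return n_vocab * n_batch * bytes_per_logit


N_VOCAB = 32000  # assumed vocabulary size
N_BATCH = 512    # batch size named in the commit message

full = logits_buffer_bytes(N_VOCAB, N_BATCH)
one_row = logits_buffer_bytes(N_VOCAB, 1)
saved_mib = (full - one_row) / 2**20

print(f"full buffer: {full / 2**20:.1f} MiB, best-case saving: {saved_mib:.1f} MiB")
```

Under these assumptions the full buffer is 62.5 MiB, which is consistent with the saving reported in the commit.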

metal : disable log for loaded kernels (#4794)

llama : fix detokenization of non-special added-tokens (#4916)

Co-authored-by: goerch <jhr.walter@t-online.de>

server : fix prompt caching with system prompt (#4914)

metal : remove old API (#4919)

ggml-ci

ggml: cache sin/cos for RoPE (#4908)
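
To illustrate the idea of caching sin/cos values for RoPE, here is a minimal sketch that precomputes the rotation table once and reuses it for every vector at a given position. The commit message does not say what ggml caches or at which granularity; the frequency base of 10000, the adjacent-pair layout, and the function names are assumptions.

```python
import math
from typing import List, Tuple

CacheRow = List[Tuple[float, float]]


def build_rope_cache(n_pos: int, n_dims: int, base: float = 10000.0) -> List[CacheRow]:
    # One (cos, sin) pair per rotated dimension pair, for every position.
    cache = []
    for pos in range(n_pos):
        row = []
        for i in range(0, n_dims, 2):
            theta = pos * base ** (-i / n_dims)
            row.append((math.cos(theta), math.sin(theta)))
        cache.append(row)
    return cache


def rotate(vec: List[float], cache_row: CacheRow) -> List[float]:
    # Apply the cached 2-D rotations to adjacent pairs (x[2k], x[2k+1]).
    out = list(vec)
    for k, (c, s) in enumerate(cache_row):
        x0, x1 = vec[2 * k], vec[2 * k + 1]
        out[2 * k] = x0 * c - x1 * s
        out[2 * k + 1] = x0 * s + x1 * c
    return out


cache = build_rope_cache(n_pos=8, n_dims=4)
vec = [1.0, 2.0, 3.0, 4.0]
rotated = rotate(vec, cache[5])
print(rotated)
```

The trigonometric calls happen once per position and dimension pair instead of once per vector, which is where the saving comes from when many heads share the same positions.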

sync : ggml

Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B (#4906)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2-bit quantizations (#4897)

* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

llama : support WinXP build with MinGW 8.1.0 (#3419)

metal : correctly set SIMD support flags on iOS (#4923)

* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS

Fix ffn_down quantization mix for MoE models (#4927)

* Fix ffn_down quantization mix for MoE models

In #4872 I did not consider the part where every third
tensor is quantized with more bits. For MoE this leads to tensors
of the same layer being quantized with a different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).

* Fix the fix

* Review suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

llama : use LLAMA_LOG_ macros for logging

scripts : sync-ggml-am.sh option to skip commits

llama : check LLAMA_TRACE env for extra logging (#4929)

* llama : minor fix indent

* llama : check LLAMA_TRACE env for extra logging

ggml-ci

Add ability to use importance matrix for all k-quants (#4930)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

llama : fix missing quotes (#4937)

CUDA: faster dequantize kernels for Q4_0 and Q4_1 (#4938)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

llama : check for 256 divisibility for IQ2_XS, IQ2_XXS (#4950)

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
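
The 256-divisibility requirement follows from the super-block layout: the IQ2 types pack weights in blocks of 256, so a row whose length is not a multiple of 256 cannot be split into whole blocks. The sketch below checks this and derives bits per weight. The per-block byte counts are assumptions about the ggml layouts (an fp16 scale plus 32 16-bit entries for IQ2_XXS, with 8 extra scale bytes for IQ2_XS). What llama.cpp does when the check fails is not stated in the commit, so the sketch simply raises.

```python
QK_K = 256  # weights per super-block

# Assumed bytes per super-block for each type.
BLOCK_BYTES = {
    "IQ2_XXS": 2 + (QK_K // 8) * 2,               # 2 + 64 = 66 bytes
    "IQ2_XS": 2 + (QK_K // 8) * 2 + QK_K // 32,   # 2 + 64 + 8 = 74 bytes
}


def bits_per_weight(quant: str) -> float:
    return BLOCK_BYTES[quant] * 8 / QK_K


def row_bytes(quant: str, n_per_row: int) -> int:
    # A row must consist of whole super-blocks.
    if n_per_row % QK_K != 0:
        raise ValueError(f"{quant} needs a row size divisible by {QK_K}, got {n_per_row}")
    return (n_per_row // QK_K) * BLOCK_BYTES[quant]


print(bits_per_weight("IQ2_XXS"), bits_per_weight("IQ2_XS"))
print(row_bytes("IQ2_XXS", 4096))
```

Under the assumed layout, IQ2_XXS comes out at the 2.0625 bpw quoted in the PR description.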

cuda : fix dequantize kernel names (#4938)

awq-py : fix typo in awq-py/README.md (#4947)

llama : apply classifier-free guidance to logits directly (#4951)
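
For readers unfamiliar with classifier-free guidance, a minimal sketch of the usual blend applied at the logit level: the guided score is the unconditional score plus `scale` times the difference between the conditional and unconditional scores. Normalizing both inputs with log-softmax first is an assumption here, as are the function names; the commit message does not spell out the exact formula llama.cpp uses.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def apply_cfg(cond_logits: List[float], uncond_logits: List[float], scale: float) -> List[float]:
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    # scale = 1 reproduces the conditional scores; larger values push away from the unconditional ones.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


guided = apply_cfg([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], scale=1.5)
print(guided)
```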

pass cpu-architecture arguments only to host code (C;C++) (#4943)

speculative : threading options (#4959)

* speculative: expose draft threading

* fix usage format

* accept -td and -tbd args

* speculative: revert default behavior when -td is unspecified

* fix trailing whitespace

finetune : use LLAMA_FILE_MAGIC_GGLA (#4961)

This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

ggml : introduce GGML_CALL function annotation (#4850)

This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.

This change does nothing unless the build defines -DGGML_MULTIPLATFORM,
which causes back-references and function pointers to conform to the MS ABI,
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms.

examples : fix and improve docs for the grammar generator (#4909)

* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.

* Fixed some issues and bugs of the grammar generator. Improved Documentation

* Update pydantic_models_to_grammar.py

metal : log `recommendedMaxWorkingSetSize` on iOS 16+ (#4936)

* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+

* Only log on iOS and macOS, ignoring tvOS and other platforms

* Check for Xcode version before using recommendedMaxWorkingSetSize

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

metal : replace loop of dispatch_async with dispatch_apply (#4934)

* Replace loop of dispatch_async with dispatch_apply

* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

android : introduce starter project example (#4926)

* Introduce starter project for Android

Based on examples/llama.swiftui.

* Add github workflow

* Set NDK version

* Only build arm64-v8a in CI

* Sync bench code

* Rename CI prop to skip-armeabi-v7a

* Remove unused tests

metal : localized logic in `ggml_metal_graph_compute` (#4924)

* Metal: Localized logic in `ggml_metal_graph_compute`, minor performance improvement

* Whitespace

* Collecting command buffer completions on single t…
@mofosyne added the `Tensor Encoding Scheme` and `Review Complexity : High` labels on May 25, 2024
Labels: high priority, Review Complexity : High, Tensor Encoding Scheme