Clamp out of range values in K quantizer
This assertion fails when quantizing Mixtral 8x7b as Q5_K_M, because I used `convert.py --outtype f32` and the Mixtral weights use bf16, which has a much larger exponent range than the K quantizer expects. If `--outtype f16` is used instead, the assert doesn't fail. See ggerganov/llama.cpp#2982 cc: @JohannesGaessler
ef0307e
Alright, it seems my assumptions about model weight ranges were incorrect. I really did not expect individual weights to be this large.