Quantize the prompt when it's longer than quantized_kv_start #105
base: main
Conversation
mlx_engine/model_kit.py
Outdated
@@ -100,23 +108,38 @@ def __init__(
         )

     @staticmethod
-    def _validate_kv_cache_quantization_params(
+    def _set_kv_cache_quantization_params(
Suggested change:
-    def _set_kv_cache_quantization_params(
+    def _get_kv_cache_quantization_params(
As far as I can tell this doesn't actually set anything, but rather does a light round of processing on kv_bits, kv_group_size, and quantized_kv_start to get the final params for usage.
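For illustration, a minimal sketch of a helper doing that kind of light processing; the default values below are assumptions for the example, not necessarily what this PR uses:

    from typing import Optional

    def _get_kv_cache_quantization_params(
        kv_bits: Optional[int],
        kv_group_size: Optional[int],
        quantized_kv_start: Optional[int],
    ) -> tuple:
        # kv_bits acts as the on/off switch: without it, KV cache quantization
        # is disabled and the other parameters are irrelevant.
        if kv_bits is None:
            return None, None, None
        # Fill in defaults (assumed values) for anything the caller left unset.
        if kv_group_size is None:
            kv_group_size = 64
        if quantized_kv_start is None:
            quantized_kv_start = 0
        return kv_bits, kv_group_size, quantized_kv_start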
Renamed to _get_kv_cache_quantization_params
mlx_engine/model_kit.py
Outdated
         kv_bits: Optional[int],
         kv_group_size: Optional[int],
         quantized_kv_start: Optional[int],
-    ):
+    ) -> tuple:
Can tuple be stronger/more specific?
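For example (hypothetical shapes, not necessarily what the PR settled on), the element types could be spelled out, or the three values could be wrapped in a named tuple:

    from typing import NamedTuple, Optional

    # Option 1: annotate the exact element types.
    #     ) -> tuple[Optional[int], Optional[int], Optional[int]]:

    # Option 2: a named tuple that documents the shape at call sites
    # (illustrative names only).
    class KVCacheQuantizationParams(NamedTuple):
        kv_bits: Optional[int]
        kv_group_size: Optional[int]
        quantized_kv_start: Optional[int]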
mlx_engine/model_kit.py
Outdated
-    ):
+    ) -> tuple:
+        """
+        Helper function to set KV cache quantization parameters
Would change this description to match my comment on the name of this function: it takes raw parameters and validates/sets defaults.
Curious if there was an observable difference after making this change? Any numbers or anything?
I noticed a slight slowdown (~10%) when quantizing a 37k-token prompt at 8 bits with quantized_kv_start=0 on Llama 3.2 1B. This slowdown is probably expected, since 8-bit operations are less efficient than 16-bit, and the context-size-to-model-size ratio of Llama 3.2 1B isn't large enough for the memory savings to outweigh the inefficiencies in calculation.
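For context, a sketch of the check the PR title describes, using a hypothetical helper name rather than the PR's actual code: with quantized_kv_start=0, as in the benchmark above, the condition holds from the very first cached token, so the whole 37k-token prompt is processed in quantized form.

    from typing import Optional

    # Hypothetical helper, for illustration only.
    def should_quantize_prompt_cache(
        cached_tokens: int,
        kv_bits: Optional[int],
        quantized_kv_start: int,
    ) -> bool:
        if kv_bits is None:
            return False  # KV cache quantization disabled
        # Quantize once the cached prompt is longer than the configured start offset.
        return cached_tokens > quantized_kv_start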
Closes #82