Quantize the prompt when it's longer than quantized_kv_start #105
base: main
Conversation
mlx_engine/model_kit.py
Outdated
@@ -100,23 +108,38 @@ def __init__(
         )

     @staticmethod
-    def _validate_kv_cache_quantization_params(
+    def _set_kv_cache_quantization_params(
Suggested change:
-    def _set_kv_cache_quantization_params(
+    def _get_kv_cache_quantization_params(
As far as I can tell this doesn't actually set anything, but rather does a light round of processing on kv_bits, kv_group_size, and quantized_kv_start to get the final params for usage.
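For illustration, a minimal sketch of a helper doing that kind of light processing; the default values below are assumptions for the example, not necessarily what this PR uses:

    from typing import Optional

    def _get_kv_cache_quantization_params(
        kv_bits: Optional[int],
        kv_group_size: Optional[int],
        quantized_kv_start: Optional[int],
    ) -> tuple:
        # kv_bits acts as the on/off switch: without it, KV cache quantization
        # is disabled and the other parameters are irrelevant.
        if kv_bits is None:
            return None, None, None
        # Fill in defaults (assumed values) for anything the caller left unset.
        if kv_group_size is None:
            kv_group_size = 64
        if quantized_kv_start is None:
            quantized_kv_start = 0
        return kv_bits, kv_group_size, quantized_kv_start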
Renamed to _get_kv_cache_quantization_params
mlx_engine/model_kit.py
Outdated
         kv_bits: Optional[int],
         kv_group_size: Optional[int],
         quantized_kv_start: Optional[int],
-    ):
+    ) -> tuple:
Can tuple be stronger/more specific?
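For example (hypothetical shapes, not necessarily what the PR settled on), the element types could be spelled out, or the three values could be wrapped in a named tuple:

    from typing import NamedTuple, Optional

    # Option 1: annotate the exact element types.
    #     ) -> tuple[Optional[int], Optional[int], Optional[int]]:

    # Option 2: a named tuple that documents the shape at call sites
    # (illustrative names only).
    class KVCacheQuantizationParams(NamedTuple):
        kv_bits: Optional[int]
        kv_group_size: Optional[int]
        quantized_kv_start: Optional[int]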
mlx_engine/model_kit.py
Outdated
-    ):
+    ) -> tuple:
+        """
+        Helper function to set KV cache quantization parameters
Would change this description to match my comment on the name of this function: it takes raw parameters and validates/sets defaults.
Curious if there was an observable difference after making this change? Any numbers or anything?
I noticed a slight slowdown (~10%) when quantizing a 37k-token prompt at 8 bits with quantized_kv_start=0 on Llama 3.2 1B. This slowdown is probably expected, since 8-bit operations are less efficient than 16-bit, and the context-size-to-model-size ratio of Llama 3.2 1B isn't large enough for the memory savings to outweigh the inefficiencies in calculation.
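For context, a sketch of the check the PR title describes, using a hypothetical helper name rather than the PR's actual code: with quantized_kv_start=0, as in the benchmark above, the condition holds from the very first cached token, so the whole 37k-token prompt is processed in quantized form.

    from typing import Optional

    # Hypothetical helper, for illustration only.
    def should_quantize_prompt_cache(
        cached_tokens: int,
        kv_bits: Optional[int],
        quantized_kv_start: int,
    ) -> bool:
        if kv_bits is None:
            return False  # KV cache quantization disabled
        # Quantize once the cached prompt is longer than the configured start offset.
        return cached_tokens > quantized_kv_start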
Closes #82