Quantize the prompt when it's longer than quantized_kv_start #105

Open · wants to merge 2 commits into main

Conversation

neilmehta24 (Member):

Closes #82

```diff
@@ -100,23 +108,38 @@ def __init__(
         )

     @staticmethod
-    def _validate_kv_cache_quantization_params(
+    def _set_kv_cache_quantization_params(
```
mattjcly (Member):

Suggested change:

```diff
-    def _set_kv_cache_quantization_params(
+    def _get_kv_cache_quantization_params(
```

As far as I can tell this doesn't actually set anything; rather, it does a light round of processing on kv_bits, kv_group_size, and quantized_kv_start to get the final params for usage.
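For readers skimming the thread, here is a minimal sketch of the behavior this comment describes: validate the raw arguments, fill in defaults, and return the final values rather than mutating state. The name mirrors the PR, but the body, defaults, and the supported bit-width check are illustrative assumptions, not the repository's actual code:

```python
from typing import Optional

def _get_kv_cache_quantization_params(
    kv_bits: Optional[int],
    kv_group_size: Optional[int],
    quantized_kv_start: Optional[int],
) -> tuple:
    # Sketch only: the exact defaults and validation rules are assumptions.
    if kv_bits is None:
        # KV cache quantization disabled; the companion params are unused.
        return None, None, None
    if not 2 <= kv_bits <= 8:
        raise ValueError(f"Unsupported kv_bits: {kv_bits}")
    # Fill in assumed defaults when quantization is enabled but a
    # companion parameter was left unset.
    kv_group_size = 64 if kv_group_size is None else kv_group_size
    quantized_kv_start = 0 if quantized_kv_start is None else quantized_kv_start
    return kv_bits, kv_group_size, quantized_kv_start
```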

neilmehta24 (Member, Author):

Renamed to _get_kv_cache_quantization_params

```diff
         kv_bits: Optional[int],
         kv_group_size: Optional[int],
         quantized_kv_start: Optional[int],
-    ):
+    ) -> tuple:
```
mattjcly (Member):

can tuple be stronger/more specific?
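One way to tighten the annotation along the lines this comment asks about (a sketch; the alias name is mine, not the PR's):

```python
from typing import Optional, Tuple

# Each element is None when KV cache quantization is disabled.
KVCacheQuantParams = Tuple[Optional[int], Optional[int], Optional[int]]

def _get_kv_cache_quantization_params(
    kv_bits: Optional[int],
    kv_group_size: Optional[int],
    quantized_kv_start: Optional[int],
) -> KVCacheQuantParams:
    ...
```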

```diff
-    ):
+    ) -> tuple:
         """
         Helper function to set KV cache quantization parameters
```
mattjcly (Member) commented on Feb 24, 2025:

Would change this description to reflect something more similar to my comment on the name of this function: Takes raw parameters and validates/sets defaults
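Rewritten along those lines, the docstring might read (illustrative wording, not the PR's final text):

```python
"""
Validate raw KV cache quantization parameters and fill in defaults.

Returns the final (kv_bits, kv_group_size, quantized_kv_start);
all three are None when KV cache quantization is disabled.
"""
```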

mattjcly (Member):

Curious if there was an observable difference after making this change? Any numbers or anything?

neilmehta24 (Member, Author):

> Curious if there was an observable difference after making this change? Any numbers or anything?

I noticed a slight slowdown (~10%) when quantizing a 37k-token prompt to 8 bits with quantized_kv_start=0 on Llama 3.2 1B. The slowdown is probably expected: 8-bit operations are less efficient than 16-bit ones, and the context-size to model-size ratio of Llama 3.2 1B isn't large enough for the memory savings to outweigh the computational overhead.
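For anyone who wants to reproduce a comparison like this, a rough timing harness might look as follows. It assumes a recent mlx_lm where generate forwards kv_bits / kv_group_size / quantized_kv_start through to the decoding step, and the model ID and prompt are stand-ins; verify the keyword arguments against your installed version:

```python
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
# Stand-in for a long (tens-of-thousands-of-tokens) prompt.
prompt = "All work and no play makes Jack a dull boy. " * 4000

def timed_generate(**kv_kwargs) -> float:
    start = time.perf_counter()
    generate(model, tokenizer, prompt=prompt, max_tokens=64, **kv_kwargs)
    return time.perf_counter() - start

fp16_s = timed_generate()                                 # fp16 KV cache
int8_s = timed_generate(kv_bits=8, quantized_kv_start=0)  # 8-bit from token 0
print(f"fp16 KV cache: {fp16_s:.2f}s, 8-bit KV cache: {int8_s:.2f}s")
```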

neilmehta24 requested a review from mattjcly on February 24, 2025 16:59
Successfully merging this pull request may close: Quantize KV cache for first prompt tokens (#82)