[Experimental] Prefix Caching Support #1669
Conversation
@DouHappy Can you try with calling
My test script:
I found a bug: using two GPUs is slower than a single GPU. When running on two GPUs, the prefix's on_gpu state is always False before prepare_inputs(), while it works fine on a single GPU. This means multi_query_cached_kv_attention is never used when running on multiple GPUs. My last test also passes on a single GPU, but it takes about 60 s on two GPUs.
vllm/worker/worker.py
```python
if sampling_params.prompt_logprobs is not None:
    selected_token_indices.extend(
        range(selected_token_start_idx,
              selected_token_start_idx + prompt_len - 1))
selected_token_indices.append(selected_token_start_idx +
                              prompt_len - 1)
selected_token_start_idx += max_seq_len

# set the prefix state
```
When tp > 1, seq_group_metadata.prefix here is copied by the Ray workers, so setting on_gpu=True won't take effect on multiple GPUs.
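(A minimal standalone sketch, not vLLM code, of the behavior described above: Ray pickles task arguments, so a flag set inside a worker never reaches the driver's object. The Prefix class and worker_step function here are hypothetical stand-ins.)

```python
import ray

ray.init(num_cpus=2)

class Prefix:
    def __init__(self):
        self.on_gpu = False  # stand-in for the cached-prefix state

@ray.remote
def worker_step(prefix):
    # Ray serializes `prefix`, so this mutates a per-worker copy only.
    prefix.on_gpu = True
    return prefix.on_gpu

p = Prefix()
print(ray.get(worker_step.remote(p)))  # True, but only in the worker's copy
print(p.on_gpu)                        # still False on the driver
```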
cc. @DouHappy
Thank you for your reply. This helps me a lot.
Co-authored-by: DouHappy <2278958187@qq.com>
Thanks for the great work! Can you also merge with the latest main branch? I will test the PR after the merge.
examples/api_client.py
Consider adding another example just for prefix caching?
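(A rough sketch of what such an example could look like, assuming the experimental prefix_pos argument introduced by this PR; the model name, prompts, and the way the prefix length is computed are placeholders, not the final example shipped with vLLM.)

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# A long system prompt shared by every request; only its KV cache should be reused.
prefix = (
    "You are a helpful assistant for an online electronics store. "
    "Answer the customer's question politely and concisely.\n\n"
)
questions = [
    "Question: What is the return policy? Answer:",
    "Question: Do you ship internationally? Answer:",
]
prompts = [prefix + q for q in questions]

# prefix_pos marks how many leading tokens of each prompt form the shared prefix.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
prefix_len = len(tokenizer.encode(prefix))

llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(prompts, sampling_params,
                       prefix_pos=[prefix_len] * len(prompts))
for out in outputs:
    print(out.outputs[0].text)
```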
Thanks a lot for this great feature! Hi @DouHappy, did you observe any speed improvement afterwards?
Yes, I did observe a speedup. Could you show me your test script? Maybe you forgot the warmup? BTW, I am writing an introduction to prefix caching, but only a Chinese version for now. See this: vLLM-prefix浅析 (System Prompt, 大模型推理加速) @franklyd
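(For reference, a minimal timing sketch of the warmed-up measurement suggested above, again assuming the experimental prefix_pos argument from this PR; the model, prompt, and prefix are placeholders. The first generate call computes and stores the prefix KV cache, so only later calls should be timed.)

```python
import time

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

prefix = "You are a helpful assistant.\n\n"
prompt = prefix + "Question: What is vLLM? Answer:"
prefix_len = len(AutoTokenizer.from_pretrained("facebook/opt-125m").encode(prefix))

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=32)

# Warmup: this call computes and caches the prefix KV states.
llm.generate([prompt], params, prefix_pos=[prefix_len])

start = time.perf_counter()
llm.generate([prompt], params, prefix_pos=[prefix_len])
print(f"cached run: {time.perf_counter() - start:.3f}s")
```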
LGTM! Thanks for the great work! I pushed some style refactors myself.
Co-authored-by: DouHappy <2278958187@qq.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
I think this merge now invalidates the FP8 KV cache (#2279). Looking at the kernels in prefix_prefill.py: when the FP8 KV cache is on, K/V and K_cache/V_cache now have different types. Please let me know what the best way forward is, thanks!
Could you provide a test script for the speedup?
+1
Hi @HaiShaw, Triton doesn't seem to support mixed-precision dot products, so this kernel fails when the cached K/V dtype differs from the query/key dtype.
Hi @AlpinDale, are you using prefix caching with an FP8 KV cache? The PyTorch and Triton versions used by vLLM cannot support an FP8 KV cache here. There is more information about prefix caching and the FP8 KV cache in #3234.
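(Until the kernel handles mixed dtypes, an upfront guard along these lines could reject the unsupported combination. This is a hypothetical sketch, not vLLM's actual validation code; the function name and the fp8 dtype strings are assumptions.)

```python
def check_prefix_caching_config(enable_prefix_caching: bool,
                                kv_cache_dtype: str) -> None:
    """Reject prefix caching together with an FP8 KV cache.

    The prefix-prefill Triton kernel takes dot products between the query/key
    blocks and the cached K/V blocks, and tl.dot needs both operands to share
    a dtype, so an fp8 cache cannot be mixed with fp16/bf16 activations here.
    """
    if enable_prefix_caching and kv_cache_dtype.startswith("fp8"):
        raise ValueError(
            "Prefix caching is not supported with an FP8 KV cache; "
            "use kv_cache_dtype='auto' or disable prefix caching.")

# Example usage:
check_prefix_caching_config(enable_prefix_caching=True, kv_cache_dtype="auto")
```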
brilliant feature, thx!
add prefix caching support
Section 1 (Basic Functionality):
Todo:
Automatic Prefix Caching Support -- SGLang RadixAttention