[Bug]: Chunked prefill doesn't seem to work when --kv-cache-dtype fp8 #4381
Comments
Yep, can confirm. I think it's undocumented that using both together is not supported? I get this error on a dual 4090 machine:
Some other engine args that I used, in case they're relevant:
Let me make a PR to raise an error for now. cc @comaniac I believe you made this work before. Did you use kv-cache dtype fp8?
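A rough sketch of such a guard during argument validation (not vLLM's actual code; the config attributes `enable_chunked_prefill` and `kv_cache_dtype` are named after the CLI flags purely for illustration):

```python
from dataclasses import dataclass


@dataclass
class EngineArgsSketch:
    """Illustrative stand-in for the real engine/config arguments."""
    enable_chunked_prefill: bool = False
    kv_cache_dtype: str = "auto"

    def verify(self) -> None:
        # Fail fast on the unsupported combination instead of crashing
        # later inside the attention kernels during prefill.
        if self.enable_chunked_prefill and self.kv_cache_dtype.startswith("fp8"):
            raise ValueError(
                "Chunked prefill is currently not supported with "
                "--kv-cache-dtype fp8; please disable one of the two."
            )
```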
It should work with the xFormers backend with paged attention, but I'm not sure if that works with GPTQ.
Same issue here. I am using Llama 3.1 8B, which has a context length of 128k. Chunked prefill is automatically enabled for models over a certain sequence length (128k is over it), and I found that I had to set
That's not expected. I'll file a PR to automatically disable chunked prefill for now if fp8 kv-cache is enabled.
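The auto-disable behaviour described here, as opposed to the hard error sketched above, could look roughly like this (again a sketch with illustrative attribute names, not the actual vLLM config classes):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_disable_chunked_prefill(args) -> None:
    """Warn and turn chunked prefill off when an fp8 KV cache is
    requested, instead of raising an error (sketch only)."""
    if args.enable_chunked_prefill and args.kv_cache_dtype.startswith("fp8"):
        logger.warning(
            "Chunked prefill is not yet supported with an fp8 KV cache; "
            "disabling chunked prefill."
        )
        args.enable_chunked_prefill = False
```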
I know it's super long, but here's the full trace:
(on a Tesla V100, hence fp16 instead of bf16)
Seems like a Triton kernel issue; looks fixable. Let me take a look. Also: vllm/vllm/attention/backends/xformers.py, line 600 at commit 1f26efb
Also, is this comment still relevant? vllm/vllm/worker/model_runner.py, line 765 at commit 1f26efb
Btw, why is this not on the testing path? Where should such a test be included as a regression test?
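One way to put it on the testing path would be an end-to-end test parametrized over the KV-cache dtype, so the fp8 case is exercised together with chunked prefill. The sketch below is only an assumption about how such a test could look; the model choice, parameter values, and location are not taken from the actual vLLM test suite:

```python
import pytest

from vllm import LLM, SamplingParams


@pytest.mark.parametrize("kv_cache_dtype", ["auto", "fp8"])
def test_chunked_prefill_with_kv_cache_dtype(kv_cache_dtype: str) -> None:
    """Generation with chunked prefill should work for every KV-cache dtype."""
    llm = LLM(
        model="facebook/opt-125m",   # small model to keep the test cheap
        kv_cache_dtype=kv_cache_dtype,
        enable_chunked_prefill=True,
        max_num_batched_tokens=64,   # small chunk size so the prefill is actually split
        max_num_seqs=4,
        enforce_eager=True,
    )
    prompt = "San Francisco is a city in " * 40  # long enough to need several chunks
    outputs = llm.generate([prompt], SamplingParams(max_tokens=8))
    assert outputs and outputs[0].outputs[0].text
```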
Your current environment
H100 (but I believe it happens on any machine)
🐛 Describe the bug
Chunked prefill combined with --kv-cache-dtype fp8 seems to be broken, failing with a type-incompatibility error.
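For completeness, a minimal reproduction sketch, under the assumption that simply turning on both options is enough to hit the error (the model and prompt here are arbitrary stand-ins; the report itself ran on an H100):

```python
from vllm import LLM, SamplingParams

# Enabling chunked prefill together with an fp8 KV cache is the combination
# that appears to trigger the type-incompatibility error during prefill.
llm = LLM(
    model="facebook/opt-125m",   # arbitrary small model for the repro
    kv_cache_dtype="fp8",
    enable_chunked_prefill=True,
)

# A prompt long enough that the prefill gets split into multiple chunks.
prompt = "The quick brown fox jumps over the lazy dog. " * 100
outputs = llm.generate([prompt], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```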