Enable HPU support in vLLM #1
Conversation
1. Move some input metadata from CUDA to CPU (see the sketch below)
2. Update the HPU paged attention API (for HPU graph compatibility)
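A minimal sketch of the metadata change, assuming vLLM's usual field names (`input_tokens`, `input_positions`, `slot_mapping`); the exact fields differ by version, so treat this as illustrative only, not the PR's code:

```python
import torch

# Illustrative only: build input metadata tensors on the host (CPU) instead of
# the accelerator, and let the model runner move them to the HPU right before
# execution. Field names mirror vLLM's model runner but are not this PR's code.
def build_input_metadata_on_cpu(token_ids, positions, slot_mapping):
    input_tokens = torch.tensor(token_ids, dtype=torch.long, device="cpu")
    input_positions = torch.tensor(positions, dtype=torch.long, device="cpu")
    slot_mapping_t = torch.tensor(slot_mapping, dtype=torch.long, device="cpu")
    return input_tokens, input_positions, slot_mapping_t
```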
Taking note of some things that need to be addressed here
@@ -217,7 +217,7 @@ def _init_cache(self) -> None:
         # Since we use a shared centralized controller, we take the minimum
         # number of blocks across all workers to make sure all the memory
         # operators can be applied to all workers.
-        num_gpu_blocks = min(b[0] for b in num_blocks)
+        num_gpu_blocks = min(10500, min(b[0] for b in num_blocks))
@mdvoretc-intel any reason why we specifically set 10500 as the min block count?
10500 is a max block count, since we take the minimum between it and whatever we get from the workers. I do not know why this extra value is introduced, and on 1xHPU the block count is actually 9250.
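For reference, a hedged sketch of how the cap composes with the per-worker counts; `VLLM_HPU_MAX_BLOCKS` is a made-up name for illustration, not something introduced by this PR:

```python
import os

# num_blocks is a list of (num_gpu_blocks, num_cpu_blocks) tuples, one per
# worker, as in _init_cache(). The cap is applied after taking the minimum
# across workers; on 1xHPU the measured value (e.g. 9250) is already lower,
# so the 10500 cap is effectively a no-op there.
def clamp_gpu_blocks(num_blocks, default_cap=10500):
    cap = int(os.environ.get("VLLM_HPU_MAX_BLOCKS", default_cap))  # hypothetical override
    return min(cap, min(b[0] for b in num_blocks))
```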
@@ -414,7 +414,7 @@ def capture_model(self, kv_caches: List[KVCache]) -> None:
         input_tokens = torch.zeros(max_batch_size, 1, dtype=torch.long).cuda()
         input_positions = torch.zeros(max_batch_size, 1,
                                       dtype=torch.long).cuda()
-        slot_mapping = torch.empty(max_batch_size, 1, dtype=torch.long).cuda()
+        slot_mapping = torch.zeros(max_batch_size, 1, dtype=torch.long).cuda()  # FIXME (kzawora): revert this to torch.empty after bridge bug is fixed
is it fixed now?
Was a ticket submitted for this, or do I have to check from time to time whether torch.empty works now?
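Until that is confirmed, a hedged sketch of how the workaround could be gated so it is easy to retest; the environment variable name is invented for illustration:

```python
import os
import torch

# torch.zeros stays the safe default while the bridge bug is open; setting the
# (made-up) env var switches back to torch.empty to retest once a fix ships.
USE_EMPTY_SLOT_MAPPING = os.environ.get("VLLM_HPU_EMPTY_SLOT_MAPPING", "0") == "1"

def make_slot_mapping(max_batch_size: int, device: str) -> torch.Tensor:
    alloc = torch.empty if USE_EMPTY_SLOT_MAPPING else torch.zeros
    return alloc(max_batch_size, 1, dtype=torch.long, device=device)
```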
vllm/hpu/xops.py
###############################################################################
# Copyright (C) 2023 Habana Labs, Ltd. an Intel Company
# All Rights Reserved.
#
# Unauthorized copying of this file or any element(s) within it, via any medium
# is strictly prohibited.
# This file contains Habana Labs, Ltd. proprietary and confidential information
# and is subject to the confidentiality and license agreements under which it
# was provided.
#
###############################################################################
I'm not sure how relevant that copyright header is in open-source code.
so leave only "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company"?
yes
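The header in vllm/hpu/xops.py would then shrink to a single line along these lines:

```python
# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company
```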
I see a few statements that issues have been resolved which aren't backed up by code inspection. Are the changes (such as the removal of benchmarks/run_benchmark_bloom560m.sh) still local and not yet pushed?
numpy
#torch == 2.1.2
transformers >= 4.36.0  # Required for Mixtral.
#xformers == 0.0.23.post1  # Required for CUDA 12.1.
nit: May want to rephrase the comment here to mention that required functionality is integrated for HPU.
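For example, something along these lines (the wording is only a suggestion, and it assumes the xformers functionality is covered by the HPU-specific ops under vllm/hpu):

```text
numpy
#torch == 2.1.2  # Provided by the Habana PyTorch (SynapseAI) build on Gaudi.
transformers >= 4.36.0  # Required for Mixtral.
#xformers == 0.0.23.post1  # Not needed on HPU; the required functionality is integrated in vllm/hpu.
```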
Yes, that's true. I have some changes local because I'm still testing compatibility in all possible places (tests, benchmarks).
Can the comment thread resolutions be held off until the changes land on the PR? The current state makes it harder to track which issues are still open, since comments on them may be resolved without a visible change.
* Enable HPU support in vLLM (HabanaAI#1)
* Enable cache ops for beam search (HabanaAI#3)
This PR introduces basic support for Intel Gaudi accelerators in vLLM.
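On the usage side, the standard vLLM offline-inference API is unchanged; a minimal sketch, assuming a Gaudi host with the Habana PyTorch bridge (habana_frameworks) installed so this fork's HPU backend is used (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Standard vLLM offline inference; on a Gaudi host with habana_frameworks
# installed, this fork routes model execution to the HPU worker.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # example model
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```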