
Enable HPU support in vLLM #1

Merged: 44 commits merged into habana_main from mdvoretc/prototype on Feb 19, 2024
Conversation

kzawora-intel commented:

This PR introduces basic support for Intel Gaudi accelerators in vLLM.

kzawora-intel changed the title from "Enable HPU for vLLM" to "Enable HPU support in vLLM" on Jan 30, 2024
kzawora-intel (Author) left a comment:
Taking note of some things that need to be addressed here

@@ -217,7 +217,7 @@ def _init_cache(self) -> None:
         # Since we use a shared centralized controller, we take the minimum
         # number of blocks across all workers to make sure all the memory
         # operators can be applied to all workers.
-        num_gpu_blocks = min(b[0] for b in num_blocks)
+        num_gpu_blocks = min(10500, min(b[0] for b in num_blocks))

kzawora-intel (Author):
@mdvoretc-intel any reason why we specifically set 10500 as the minimum block count?

Reply:

10500 is a max block count, since we take the minimum between it and whatever we get from the workers. I do not know why this extra value was introduced; on 1xHPU the block count is actually 9250.
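
A minimal sketch of how the cap could be made configurable rather than hard-coded; the VLLM_MAX_HPU_BLOCKS variable name and the no-cap default are assumptions for illustration, not part of this PR:

```python
import os

def _capped_num_gpu_blocks(num_blocks):
    # Minimum block count reported across all workers, as in the original code.
    num_gpu_blocks = min(b[0] for b in num_blocks)
    # Hypothetical knob: apply an upper bound only when explicitly requested,
    # instead of hard-coding 10500.
    cap = os.environ.get("VLLM_MAX_HPU_BLOCKS")
    if cap is not None:
        num_gpu_blocks = min(int(cap), num_gpu_blocks)
    return num_gpu_blocks
```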

@@ -414,7 +414,7 @@ def capture_model(self, kv_caches: List[KVCache]) -> None:
         input_tokens = torch.zeros(max_batch_size, 1, dtype=torch.long).cuda()
         input_positions = torch.zeros(max_batch_size, 1,
                                       dtype=torch.long).cuda()
-        slot_mapping = torch.empty(max_batch_size, 1, dtype=torch.long).cuda()
+        slot_mapping = torch.zeros(max_batch_size, 1, dtype=torch.long).cuda()  # FIXME (kzawora): revert this to torch.empty after bridge bug is fixed

kzawora-intel (Author):
is it fixed now?

Reply:

was there a ticket submitted for this, or do I have to check it myself from time to time to see whether torch.empty works?
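
To make the workaround easy to revert once the bridge bug is confirmed fixed, the choice between torch.zeros and torch.empty could be kept behind a single switch. This is only a sketch; the USE_EMPTY_SLOT_MAPPING flag and the helper function are hypothetical, not part of the PR:

```python
import torch

# Hypothetical toggle: flip to True once torch.empty is confirmed safe again
# (i.e. the bridge bug with uninitialized tensors is fixed).
USE_EMPTY_SLOT_MAPPING = False

def make_slot_mapping(max_batch_size: int, device: str = "cuda") -> torch.Tensor:
    # Use uninitialized memory when allowed, zero-filled memory otherwise.
    alloc = torch.empty if USE_EMPTY_SLOT_MAPPING else torch.zeros
    return alloc(max_batch_size, 1, dtype=torch.long, device=device)
```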

vllm/hpu/xops.py (comment on lines 1 to 11; outdated)
###############################################################################
# Copyright (C) 2023 Habana Labs, Ltd. an Intel Company
# All Rights Reserved.
#
# Unauthorized copying of this file or any element(s) within it, via any medium
# is strictly prohibited.
# This file contains Habana Labs, Ltd. proprietary and confidential information
# and is subject to the confidentiality and license agreements under which it
# was provided.
#
###############################################################################

kzawora-intel (Author):

I'm not sure how relevant that copyright header is in open-source code.

Reply:

so leave only "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company"?

kzawora-intel (Author):

yes


mdvoretc-intel left a comment:

I see a few statements that issues have been resolved which are not backed up by code inspection. Are the changes (such as the removal of benchmarks/run_benchmark_bloom560m.sh) still local and not yet pushed?

numpy
#torch == 2.1.2
transformers >= 4.36.0 # Required for Mixtral.
#xformers == 0.0.23.post1 # Required for CUDA 12.1.

Comment:

nit: May want to rephrase the comment here to mention that required functionality is integrated for HPU.

Reply:

Yes, that's true. I have some changes local because I'm still testing compatibility in all possible places (tests, benchmarks).

Comment:

Can the comment thread resolutions be withheld until the changes land on the PR? The current state makes it harder to track which issues are known, since comments on them may be resolved without a visible change.

kzawora-intel merged commit 512c414 into habana_main on Feb 19, 2024
tkrupa-intel pushed a commit to tkrupa-intel/vllm-fork that referenced this pull request on Mar 4, 2024:

* Enable HPU support in vLLM (HabanaAI#1)
* Enable cache ops for beam search (HabanaAI#3)
kzawora-intel added the "habana" label (Issues or PRs submitted by Habana Labs) on Sep 20, 2024
kzawora-intel deleted the mdvoretc/prototype branch on October 7, 2024 at 13:13
iboiko-habana added a commit that referenced this pull request on Feb 21, 2025
Labels: habana (Issues or PRs submitted by Habana Labs)
3 participants