Enable HPU support in vLLM #1
Conversation
1. Move some input metadata from CUDA to CPU (see the sketch below)
2. Update the HPU paged attention API (for HPU graph compatibility)
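A minimal sketch of the metadata change, assuming vLLM's usual field names (`input_tokens`, `input_positions`, `slot_mapping`); the exact fields differ by version, so treat this as illustrative only, not the PR's code:

```python
import torch

# Illustrative only: build input metadata tensors on the host (CPU) instead of
# the accelerator, and let the model runner move them to the HPU right before
# execution. Field names mirror vLLM's model runner but are not this PR's code.
def build_input_metadata_on_cpu(token_ids, positions, slot_mapping):
    input_tokens = torch.tensor(token_ids, dtype=torch.long, device="cpu")
    input_positions = torch.tensor(positions, dtype=torch.long, device="cpu")
    slot_mapping_t = torch.tensor(slot_mapping, dtype=torch.long, device="cpu")
    return input_tokens, input_positions, slot_mapping_t
```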
Taking note of some things that need to be addressed here
@@ -217,7 +217,7 @@ def _init_cache(self) -> None:
         # Since we use a shared centralized controller, we take the minimum
         # number of blocks across all workers to make sure all the memory
         # operators can be applied to all workers.
-        num_gpu_blocks = min(b[0] for b in num_blocks)
+        num_gpu_blocks = min(10500, min(b[0] for b in num_blocks))
@mdvoretc-intel any reason why we specifically set 10500 as the min block count?
10500 is a max block count, since we take the minimum between it and whatever we get from the workers. I do not know why this extra value is introduced, and on 1xHPU the block count is actually 9250.
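For reference, a hedged sketch of how the cap composes with the per-worker counts; `VLLM_HPU_MAX_BLOCKS` is a made-up name for illustration, not something introduced by this PR:

```python
import os

# num_blocks is a list of (num_gpu_blocks, num_cpu_blocks) tuples, one per
# worker, as in _init_cache(). The cap is applied after taking the minimum
# across workers; on 1xHPU the measured value (e.g. 9250) is already lower,
# so the 10500 cap is effectively a no-op there.
def clamp_gpu_blocks(num_blocks, default_cap=10500):
    cap = int(os.environ.get("VLLM_HPU_MAX_BLOCKS", default_cap))  # hypothetical override
    return min(cap, min(b[0] for b in num_blocks))
```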
@@ -414,7 +414,7 @@ def capture_model(self, kv_caches: List[KVCache]) -> None:
         input_tokens = torch.zeros(max_batch_size, 1, dtype=torch.long).cuda()
         input_positions = torch.zeros(max_batch_size, 1,
                                       dtype=torch.long).cuda()
-        slot_mapping = torch.empty(max_batch_size, 1, dtype=torch.long).cuda()
+        slot_mapping = torch.zeros(max_batch_size, 1, dtype=torch.long).cuda()  # FIXME (kzawora): revert this to torch.empty after bridge bug is fixed
is it fixed now?
Was a ticket submitted for this, or do I have to check from time to time whether torch.empty works now?
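Until that is confirmed, a hedged sketch of how the workaround could be gated so it is easy to retest; the environment variable name is invented for illustration:

```python
import os
import torch

# torch.zeros stays the safe default while the bridge bug is open; setting the
# (made-up) env var switches back to torch.empty to retest once a fix ships.
USE_EMPTY_SLOT_MAPPING = os.environ.get("VLLM_HPU_EMPTY_SLOT_MAPPING", "0") == "1"

def make_slot_mapping(max_batch_size: int, device: str) -> torch.Tensor:
    alloc = torch.empty if USE_EMPTY_SLOT_MAPPING else torch.zeros
    return alloc(max_batch_size, 1, dtype=torch.long, device=device)
```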
vllm/hpu/xops.py
###############################################################################
# Copyright (C) 2023 Habana Labs, Ltd. an Intel Company
# All Rights Reserved.
#
# Unauthorized copying of this file or any element(s) within it, via any medium
# is strictly prohibited.
# This file contains Habana Labs, Ltd. proprietary and confidential information
# and is subject to the confidentiality and license agreements under which it
# was provided.
#
###############################################################################
I'm not sure how relevant that copyright header is in open-source code.
so leave only "# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company"?
yes
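The header in vllm/hpu/xops.py would then shrink to a single line along these lines:

```python
# Copyright (C) 2024 Habana Labs, Ltd. an Intel Company
```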
I see a few statements that issues have been resolved which aren't backed up by code inspection. Are the changes (such as the removal of benchmarks/run_benchmark_bloom560m.sh) still local and not yet pushed?
numpy
#torch == 2.1.2
transformers >= 4.36.0  # Required for Mixtral.
#xformers == 0.0.23.post1  # Required for CUDA 12.1.
nit: May want to rephrase the comment here to mention that required functionality is integrated for HPU.
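For example, something along these lines (the wording is only a suggestion, and it assumes the xformers functionality is covered by the HPU-specific ops under vllm/hpu):

```text
numpy
#torch == 2.1.2  # Provided by the Habana PyTorch (SynapseAI) build on Gaudi.
transformers >= 4.36.0  # Required for Mixtral.
#xformers == 0.0.23.post1  # Not needed on HPU; the required functionality is integrated in vllm/hpu.
```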
Yes, that's true. I have some changes local because I'm still testing compatibility in all possible places (tests, benchmarks).
Can the comment thread resolutions be held off until the changes land on the PR? The current state makes it harder to track which issues are still open, since comments on them may be resolved without a visible change.
* Enable HPU support in vLLM (HabanaAI#1)
* Enable cache ops for beam search (HabanaAI#3)
This PR introduces basic support for Intel Gaudi accelerators in vLLM.
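On the usage side, the standard vLLM offline-inference API is unchanged; a minimal sketch, assuming a Gaudi host with the Habana PyTorch bridge (habana_frameworks) installed so this fork's HPU backend is used (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Standard vLLM offline inference; on a Gaudi host with habana_frameworks
# installed, this fork routes model execution to the HPU worker.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # example model
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```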