[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend #6143

kzawora-intel · 2024-07-04T14:32:08Z

This PR adds initial support for the Intel Gaudi (HPU) backend to vLLM.

Requirements

OS: Ubuntu 22.04 LTS
Python: 3.10
Intel Gaudi accelerator
Intel Gaudi software version 1.18.0

Supported Features

Offline batched inference
Online inference via OpenAI-Compatible Server
HPU autodetection
Paged KV cache with algorithms enabled for Intel Gaudi accelerators
Custom Intel Gaudi implementations of Paged Attention, KV cache ops, prefill attention, Root Mean Square Layer Normalization, Rotary Positional Encoding, FusedMoE
Tensor parallelism support for multi-card inference
Inference with HPU Graphs for accelerating low-batch latency and throughput

Unsupported Features

Beam search
LoRA adapters
Attention with Linear Biases (ALiBi)
Quantization
Prefill chunking (mixed-batch inferencing)

Supported Configurations

The following configurations have been validated to function on Gaudi devices. Configurations that are not listed may or may not work.

meta-llama/Llama-2-7b on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Llama-2-7b-chat-hf on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3-8B on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3-8B-Instruct on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3.1-8B on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3.1-8B-Instruct on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Llama-2-70b with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Llama-2-70b-chat-hf with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3-70B with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3-70B-Instruct with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3.1-70B with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
meta-llama/Meta-Llama-3.1-70B-Instruct with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
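
As a rough illustration of how one of the validated configurations above might be launched for offline batched inference, the sketch below uses vLLM's public `LLM` and `SamplingParams` classes. The helper names `engine_kwargs_for` and `run_offline` are made up for this example and are not part of the PR; actually running `run_offline` assumes a Gaudi machine with an HPU-enabled vLLM build installed.

```python
def engine_kwargs_for(model: str, num_hpus: int) -> dict:
    """Build LLM(...) keyword arguments for a BF16 run on `num_hpus` cards."""
    if num_hpus < 1:
        raise ValueError("num_hpus must be >= 1")
    return {
        "model": model,
        "dtype": "bfloat16",
        # tensor_parallel_size=1 means single-HPU inference.
        "tensor_parallel_size": num_hpus,
    }


def run_offline(model: str, num_hpus: int, prompts: list[str]) -> list[str]:
    """Generate greedy completions for `prompts` (needs vLLM and HPUs)."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs_for(model, num_hpus))
    greedy = SamplingParams(temperature=0.0, max_tokens=32)
    outputs = llm.generate(prompts, greedy)
    return [out.outputs[0].text for out in outputs]
```

For example, `run_offline("meta-llama/Meta-Llama-3-70B", 8, ["Hello"])` would correspond to the 8x HPU tensor-parallel row of the list.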

Map of changes

vllm/executor/hpu_executor.py - 1xHPU executor inheriting from ExecutorBase
vllm/executor/ray_hpu_executor.py - Multi-HPU (2x, 4x, 8x) executor inheriting from DistributedGPUExecutor
vllm/attention/backends/hpu_attn.py - Gaudi-specific backend for handling prefill attention and paged attention
vllm/attention/ops/hpu_paged_attn.py - Gaudi backend for handling paged attention
vllm/worker/hpu_worker.py - Gaudi-specific worker for handling distributed inference and executing models, inherited from WorkerBase
vllm/worker/hpu_model_runner.py - Gaudi-specific model executor, inherited from ModelRunnerBase, with input class ModelInputForHPU inherited from ModelRunnerInputBase
vllm/model_executor/layers/*.py - Gaudi-specific forward passes for some operators, otherwise falling back to forward_native
vllm/engine/*.py - Routing logic for dispatching proper Gaudi executors
Other changes are either minor, non-invasive workarounds for Gaudi specifically, or utilities.
PR was passed through format.sh with no issues.

WoosukKwon

@kzawora-intel Thanks for submitting the PR, and sorry for the delays in the review! Overall, the PR looks clean and is pretty up to date. Really appreciate it.

I did some preliminary reviews, mostly on the changes to the existing files. While I could see that you worked hard to make the PR less intrusive to the current codebase, I feel it can be further modularized. Small if statements that might look trivial to someone can be intractable and annoying to others.

Will follow up with more reviews later.

vllm/worker/cache_engine.py

docs/source/getting_started/gaudi-installation.rst

WoosukKwon · 2024-07-24T09:44:44Z

vllm/model_executor/models/mixtral.py

+            if is_hpu():
+                torch.hpu.synchronize()


QQ: Why do we need this?

This is a workaround - without synchronization here, loading Mixtral weights resulted in unusually high HPU memory usage, which could end up in OOM before warmup.

vllm/model_executor/layers/vocab_parallel_embedding.py

vllm/model_executor/layers/layernorm.py

vllm/hpu/utils.py

vllm/model_executor/layers/logits_processor.py

Use all possible slot values for dummy blocks to avoid caching issues.

With PT_COMPILE_ONLY_MODE flag, graphs can be compiled without performing synLaunch. The flag has been added to the warmup phase to decrease its execution time.

This fixes a very silly issue where mismatching values of `warmup_mode` flag could cause graph recompilations and eventually memory leaks.

This PR fixes crashes observed on older Synapse builds introduced with #227. Setting PT_COMPILE_ONLY_MODE is not supported in current or older public Synapse builds, but we should not crash because of it, rather we should advise user to use the latest build.

Previous behavior:

```
...
INFO 09-06 17:08:37 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
INFO 09-06 17:08:37 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -159.6 MiB of host memory (414.9 GiB/1007 GiB used)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/software/users/kzawora/vllm-utils/vllm_hpu_simple_test.py", line 9, in <module>
[rank0]:     llm = LLM(model="facebook/opt-125m")
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 456, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 266, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 378, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/executor/habana_executor.py", line 89, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 202, in initialize_cache
[rank0]:     self._warm_up_model()
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 220, in _warm_up_model
[rank0]:     self.model_runner.warmup_model(self.hpu_cache[0])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_model_runner.py", line 1412, in warmup_model
[rank0]:     with compile_only_mode_context():
[rank0]:   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]:     return next(self.gen)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/internal/bridge_config.py", line 20, in env_setting
[rank0]:     get_func = globals()['get_' + var.lower()]
[rank0]: KeyError: 'get_pt_compile_only_mode'
inc shutdown
inc shutdown
inc shutdown
inc shutdown
```

Current behavior:

```
...
INFO 09-06 17:06:42 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
INFO 09-06 17:06:43 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -143.7 MiB of host memory (415 GiB/1007 GiB used)
WARNING 09-06 17:06:43 habana_model_runner.py:1419] Cannot use PT_COMPILE_ONLY_MODE. Warmup time will be negatively impacted. Please update Gaudi Software Suite.
INFO 09-06 17:06:43 habana_model_runner.py:1336] [Warmup][Prompt][1/23] batch_size:2 seq_len:1024 free_mem:40.28 GiB
...
```
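
The fallback described above (warn instead of crashing when the compile-only context cannot be entered) can be sketched roughly as follows. This is not the code from the PR: `warmup` and `unsupported_compile_only_mode` are hypothetical names, and the stand-in context manager only imitates the `KeyError` seen in the traceback so the pattern can be run without Gaudi software.

```python
import contextlib
import logging

logger = logging.getLogger("hpu_warmup_sketch")


@contextlib.contextmanager
def unsupported_compile_only_mode():
    """Stand-in for an older bridge: entering the context raises KeyError."""
    raise KeyError("get_pt_compile_only_mode")
    yield  # unreachable; only here to make this a generator function


def warmup(run_warmup, compile_only_ctx) -> bool:
    """Run warmup in compile-only mode if possible, otherwise run it normally.

    Returns True when the compile-only context was entered successfully.
    """
    with contextlib.ExitStack() as stack:
        try:
            stack.enter_context(compile_only_ctx())
            compile_only = True
        except KeyError:
            # Only the failure to enter the context is tolerated here;
            # errors raised by the warmup itself still propagate.
            logger.warning(
                "Cannot use PT_COMPILE_ONLY_MODE. "
                "Warmup time will be negatively impacted.")
            compile_only = False
        run_warmup()
    return compile_only
```

Using `ExitStack` keeps a single `run_warmup()` call site for both paths, so the fallback cannot drift from the fast path.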

Fixes serving mode issue; due to error in fastapi

This PR contains mask based BGMV implementation for LoRA embedding instead of index-select of LoRA-B weights. Removing special handling in no LoRA case also.
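
To illustrate the general idea of replacing an index-select with a mask, the pure-Python sketch below computes the same per-token LoRA-B product in both ways. It is an assumption about the technique, not the PR's kernel: the function names are invented here, and a real implementation would operate on device tensors rather than nested lists.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_b_index_select(weights, indices, inputs):
    """Gather each token's LoRA-B matrix by adapter index, then apply it."""
    return [matvec(weights[i], x) for i, x in zip(indices, inputs)]


def lora_b_masked(weights, indices, inputs):
    """Apply every adapter to every token, then keep one result via a mask."""
    num_adapters = len(weights)
    outputs = []
    for idx, x in zip(indices, inputs):
        # One-hot mask over adapters for this token.
        mask = [1.0 if k == idx else 0.0 for k in range(num_adapters)]
        per_adapter = [matvec(w, x) for w in weights]
        out_dim = len(per_adapter[0])
        outputs.append([
            sum(mask[k] * per_adapter[k][d] for k in range(num_adapters))
            for d in range(out_dim)
        ])
    return outputs
```

The masked form does more arithmetic, but it avoids a data-dependent gather, which is the trade-off the commit title ("Avoiding torch.index_select") points at.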

Eliminate two graph breaks for torch.compile mode:
1. [__graph_breaks] torch._dynamo.exc.Unsupported: builtin: eq [<class 'torch._dynamo.variables.misc.GetAttrVariable'>, <class 'torch._dynamo.variables.constant.EnumVariable'>] False
2. [__graph_breaks] torch._dynamo.exc.Unsupported: Tensor.item

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>

Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com>
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai>

RuntimeErrors are not observed anymore on habana_main when disable_tensor_cache is used. This PR enables disable_tensor_cache.

On habana_main the slots are calculated by adding an offset to the block which breaks the check for _PAD_SLOT_ID. Reworked it so that in case of _PAD_BLOCK_ID we're automatically inserting the right value.
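
One way to read the slot fix above is sketched below. The constant values and function names are assumptions chosen only to make the arithmetic visible; the PR's actual `_PAD_BLOCK_ID` and `_PAD_SLOT_ID` values and code are not shown in this thread.

```python
PAD_BLOCK_ID = -1  # hypothetical sentinel for a padding block
PAD_SLOT_ID = -1   # hypothetical sentinel for a padding slot
BLOCK_SIZE = 128   # hypothetical block size


def naive_slot(block_id: int, offset: int) -> int:
    """Slot computed purely as block * block_size + offset."""
    return block_id * BLOCK_SIZE + offset


def slot_for(block_id: int, offset: int) -> int:
    """Same computation, but padding blocks map straight to the pad slot."""
    if block_id == PAD_BLOCK_ID:
        return PAD_SLOT_ID
    return naive_slot(block_id, offset)
```

With these values, `naive_slot(PAD_BLOCK_ID, 5)` yields -123, so a later comparison against `PAD_SLOT_ID` would miss it, while `slot_for` returns the sentinel directly.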

Porting PT Profiler from: 81a23a7 and e805b88

…a/vllm_v0_6_0_rebase

…m_v0_6_0_rebase

WoosukKwon

LGTM! Thanks for addressing all the comments! Let me do the final checks on whether the PR affects the perf/functionality of other hardware backends, before the PR is merged.

WoosukKwon · 2024-10-30T18:59:18Z

@kzawora-intel Can you please rebase with the main branch again? The PR here cannot run TP > 1 on Nvidia GPUs, because of a recent bug in custom all reduce kernels. The bug is fixed in the main branch.

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

kzawora-intel · 2024-10-31T09:23:40Z

@WoosukKwon I rebased the code; it is now up to date with main as of 5608e61.

zhouyuan

Just one nit on the supported Intel devices

docs/source/index.rst

mergify · 2024-11-04T06:41:48Z

This pull request has merge conflicts that must be resolved before it can be
merged. @kzawora-intel please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Co-authored-by: Yuan <yuan.zhou@outlook.com>

kzawora-intel · 2024-11-05T13:52:01Z

@WoosukKwon PR #9938 introduced some API mismatches in workers/model_runners, so I've updated this PR to reflect that change. HPU now works properly on the latest main branch (up to 93dee88) with TP=1 and TP=2.

WoosukKwon · 2024-11-06T09:09:52Z

@kzawora-intel Just merged the PR. Really appreciate all of you & your team's efforts on the PR. Congrats!! 🎉

This PR adds all commits before vllm-project#6143 without vllm-project#6143.

vllm-project#6143 got merged, but it's based on an older revision of HPU components. This PR aligns the two.

…-project#6143) Signed-off-by: yuwenzho <yuwen.zhou@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Signed-off-by: Bob Zhu <bob.zhu@intel.com> Signed-off-by: zehao-intel <zehao.huang@intel.com> Signed-off-by: Konrad Zawora <kzawora@habana.ai> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> Co-authored-by: Michal Adamczyk <madamczyk@habana.ai> Co-authored-by: Marceli Fylcek <mfylcek@habana.ai> Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com> Co-authored-by: Vivek Goel <vgoel@habana.ai> Co-authored-by: yuwenzho <yuwen.zhou@intel.com> Co-authored-by: Dominika Olszewska <dolszewska@habana.ai> Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com> Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com> Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai> Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com> Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai> Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com> Co-authored-by: Ilia Taraban <tarabanil@gmail.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai> Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai> Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com> Co-authored-by: Sun Choi <schoi@habana.ai> Co-authored-by: Iryna Boiko <iboiko@habana.ai> Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com> Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com> Co-authored-by: Zehao Huang <zehao.huang@intel.com> Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com> Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com> Co-authored-by: Nir David <ndavid@habana.ai> Co-authored-by: Yu-Zhou <yu.zhou@intel.com> Co-authored-by: Ruheena Suhani Shaik 
<rsshaik@habana.ai> Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Jacek Czaja <jacek.czaja@intel.com> Co-authored-by: Jacek Czaja <jczaja@habana.ai> Co-authored-by: Yuan <yuan.zhou@outlook.com> Signed-off-by: Loc Huynh <jc1da.3011@gmail.com>

warlock135 · 2024-11-26T06:01:58Z

@kzawora-intel The file vllm/model_executor/sampling_metadata.py should also be merged (or vllm-fork PR246 applied).
Currently, vLLM raises the following error when using the HPU inference backend: "RuntimeError: Device type HPU is not supported for torch.Generator() API"


WoosukKwon added the Gaudi label Jul 5, 2024

WoosukKwon self-assigned this Jul 24, 2024

WoosukKwon reviewed Jul 24, 2024

WoosukKwon mentioned this pull request Jul 26, 2024

[Misc][TPU] Support TPU in initialize_ray_cluster #6812

Merged

jikunshang and others added 26 commits September 6, 2024 01:30

fix rotary embedding

d2e2854

Avoiding torch.index_select for embedding LoRA-B

97bd0fd

Remove special handling of no-LoRA case

ededdaf

Update test

b507cc4

Fix formatting

016f343

Dispersed dummy slots (#243)

d9fa7cf

Use all possible slot values for dummy blocks to avoid caching issues.
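A minimal sketch of the idea, not the code from this commit: padding entries cycle through every slot of a dummy block instead of reusing one slot over and over. The names `PAD_BLOCK_ID`, `BLOCK_SIZE`, and `pad_slot_mapping` are hypothetical and chosen only for illustration.

```python
import itertools

# Hypothetical values for illustration; not taken from the PR.
PAD_BLOCK_ID = 0
BLOCK_SIZE = 4


def pad_slot_mapping(real_slots, target_len):
    """Pad a slot mapping to target_len, spreading the padding
    over all slots of the dummy block rather than a single slot."""
    first = PAD_BLOCK_ID * BLOCK_SIZE
    dummy_slots = itertools.cycle(range(first, first + BLOCK_SIZE))
    missing = max(0, target_len - len(real_slots))
    return list(real_slots) + [next(dummy_slots) for _ in range(missing)]


padded = pad_slot_mapping([17, 18], 8)
print(padded)  # [17, 18, 0, 1, 2, 3, 0, 1]
```

With a single repeated padding slot, every dummy write would target the same location; cycling spreads those writes across the whole dummy block.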

Use PT_COMPILE_ONLY_MODE during warmup (#227)

7488c58

With the PT_COMPILE_ONLY_MODE flag, graphs can be compiled without performing synLaunch. The flag has been added to the warmup phase to decrease its execution time.

Do not pass warmup_mode to execute_model_kwargs (#229)

17447ed

This fixes a very silly issue where mismatching values of the `warmup_mode` flag could cause graph recompilations and eventually memory leaks.
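The failure mode can be illustrated with a toy cache, assuming (this is an assumption, not something the PR states) that compiled graphs are keyed by the keyword arguments they were captured with. `ToyGraphCache` and the `execute_model` wrapper below are hypothetical stand-ins, not vLLM or HPU Graph APIs.

```python
class ToyGraphCache:
    """Hypothetical cache: one compiled 'graph' per distinct kwargs set."""

    def __init__(self):
        self.graphs = {}

    def execute(self, **kwargs):
        key = tuple(sorted(kwargs.items()))
        if key not in self.graphs:
            # Stands in for an expensive compilation that is never freed.
            self.graphs[key] = object()
        return self.graphs[key]


# Forwarding warmup_mode makes otherwise identical calls look different.
leaky = ToyGraphCache()
leaky.execute(batch_size=8, warmup_mode=True)
leaky.execute(batch_size=8, warmup_mode=False)
print(len(leaky.graphs))  # 2


def execute_model(cache, warmup_mode=False, **execute_model_kwargs):
    """Consume warmup_mode here instead of forwarding it to the cache."""
    return cache.execute(**execute_model_kwargs)


fixed = ToyGraphCache()
execute_model(fixed, warmup_mode=True, batch_size=8)
execute_model(fixed, warmup_mode=False, batch_size=8)
print(len(fixed.graphs))  # 1
```

Keeping the flag out of the forwarded kwargs means the warmup call and the real call hit the same cache entry.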

Hardcode fastapi version due to pydantic error (#255)

00f1333

Fixes a serving mode issue caused by an error in fastapi.

Mask based BGMV implementation for LoRA Embedding (#247)

b764610

This PR contains a mask-based BGMV implementation for LoRA embedding, replacing the index-select of LoRA-B weights. It also removes the special handling of the no-LoRA case.
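As a rough illustration of how a mask can replace a per-token gather, the pure-Python sketch below compares the two approaches on tiny integer matrices. It is not the kernel from this PR; `matvec`, `gather_variant`, `mask_variant`, and the data layout are all assumptions made for the example.

```python
def matvec(x, w):
    """Multiply row vector x (length r) by matrix w (r rows, d columns)."""
    return [sum(x[j] * w[j][c] for j in range(len(x))) for c in range(len(w[0]))]


def gather_variant(xs, adapter_ids, lora_b):
    """Select each token's LoRA-B matrix by index, then multiply."""
    return [matvec(x, lora_b[k]) for x, k in zip(xs, adapter_ids)]


def mask_variant(xs, adapter_ids, lora_b):
    """Stack all LoRA-B matrices once; zero out the lanes of other adapters."""
    n, r = len(lora_b), len(lora_b[0])
    stacked = [row for b in lora_b for row in b]  # (n * r) rows, d columns
    out = []
    for x, k in zip(xs, adapter_ids):
        # Tile x across all adapters, keeping only the lanes of adapter k.
        masked = [x[j % r] if j // r == k else 0 for j in range(n * r)]
        out.append(matvec(masked, stacked))
    return out


lora_b = [
    [[1, 0], [0, 1]],  # adapter 0
    [[2, 3], [4, 5]],  # adapter 1
]
xs = [[1, 2], [3, 4]]
adapter_ids = [1, 0]

print(gather_variant(xs, adapter_ids, lora_b))  # [[10, 13], [3, 4]]
print(mask_variant(xs, adapter_ids, lora_b))    # [[10, 13], [3, 4]]
```

The mask variant does more arithmetic, but it uses one fixed-shape multiply against the stacked weights instead of a data-dependent selection per token.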

Merge remote-tracking branch 'upstream/main' into HEAD

2fed15b

Merge remote-tracking branch 'origin/habana_main' into HEAD

f74fe23

format.sh

e2c8b5a

i did not drink my afternoon coffee and made an oopsie

4194195

Add disable_tensor_cache=True to HPUGraph capture (#252)

4052bdb

RuntimeErrors are not observed anymore on habana_main when disable_tensor_cache is used. This PR enables disable_tensor_cache.

do not build core ext on hpu

c9bf908

Fix dispersed slots (#261)

69df1e7

On habana_main, the slots are calculated by adding an offset to the block, which breaks the check for _PAD_SLOT_ID. This is reworked so that the right value is inserted automatically when the block is _PAD_BLOCK_ID.

Skip compilation warnings during warmup phase (#262)

53f96b7

fix tensor parallelism

d436d38

add missing functions

61b6fbb

Port PT Profiler to habana_main (#256)

2091161

Porting PT Profiler from: 81a23a7 and e805b88

Merge remote-tracking branch 'origin/habana_main' into private/kzawora/vllm_v0_6_0_rebase

c9bdcbe

Merge remote-tracking branch 'upstream/main' into private/kzawora/vllm_v0_6_0_rebase

8e41fb5

mergify bot added the documentation Improvements or additions to documentation label Oct 29, 2024

Merge branch 'main' into habana_upstream

acec97b

WoosukKwon approved these changes Oct 30, 2024

mergify bot added the ci/build label Oct 30, 2024

Merge remote-tracking branch 'upstream/main' into HEAD

bc0bf43

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

zhouyuan reviewed Nov 4, 2024

docs/source/index.rst (outdated, resolved)

mergify bot added the needs-rebase label Nov 4, 2024

kzawora-intel and others added 2 commits November 4, 2024 11:52

Update docs/source/index.rst

01b190e

Co-authored-by: Yuan <yuan.zhou@outlook.com>

Merge branch 'main' into habana_upstream

ede1280

mergify bot removed the needs-rebase label Nov 4, 2024

kzawora-intel added 2 commits November 5, 2024 15:35

Merge remote-tracking branch 'upstream/main' into HEAD

bb512dd

Conform to new worker/model_runner APIs

c9ce231

WoosukKwon merged commit a02a50e into vllm-project:main Nov 6, 2024
30 of 31 checks passed

WoosukKwon mentioned this pull request Nov 6, 2024

[Hotfix] Fix ruff errors #10073

Merged

This was referenced Nov 6, 2024

Align fork with HPU upstream code HabanaAI/vllm-fork#465

Merged

Nov 6 rebase (sans vllm-project#6143) HabanaAI/vllm-fork#468

Merged

kzawora-intel added a commit to HabanaAI/vllm-fork that referenced this pull request Nov 6, 2024

Nov 6 rebase (sans vllm-project#6143) (#468)

5eb7f3d

This PR adds all commits before vllm-project#6143 without vllm-project#6143.

michalkuligowski added a commit to HabanaAI/vllm-fork that referenced this pull request Nov 6, 2024

Align fork with HPU upstream code (#465)

60b981e

vllm-project#6143 got merged, but it's based on an older revision of HPU components. This PR aligns the two.
