Align fork with HPU upstream code (#465)
vllm-project#6143 got merged, but it's
based on an older revision of HPU components. This PR aligns the two.
michalkuligowski authored Nov 6, 2024
2 parents 0a17a2e + 843ae37 commit 60b981e
Showing 5 changed files with 27 additions and 18 deletions.
33 changes: 21 additions & 12 deletions docs/source/getting_started/gaudi-installation.rst
@@ -1,5 +1,5 @@
vLLM with Intel® Gaudi® AI Accelerators
=========================================
Installation with Intel® Gaudi® AI Accelerators
===============================================

This README provides instructions on running vLLM with Intel Gaudi devices.

@@ -22,22 +22,22 @@ Requirements


Quick start using Dockerfile
============================
----------------------------
.. code:: console
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
.. tip::
If you're observing the following error: ``docker: Error response from daemon: Unknown runtime specified habana.``, please refer to "Install Using Containers" section of `Intel Gaudi Software Stack and Driver Installation <https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html`__. Make sure you have ``habana-container-runtime`` package installed and that ```habana`` container runtime is registered.
If you're observing the following error: ``docker: Error response from daemon: Unknown runtime specified habana.``, please refer to "Install Using Containers" section of `Intel Gaudi Software Stack and Driver Installation <https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html>`__. Make sure you have ``habana-container-runtime`` package installed and that ``habana`` container runtime is registered.


Build from source
=================
-----------------

Environment verification
------------------------
~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the Intel Gaudi software was correctly installed, run:

@@ -49,11 +49,11 @@ To verify that the Intel Gaudi software was correctly installed, run:
$ pip list | grep neural # verify that neural_compressor is installed
Refer to `Intel Gaudi Software Stack
Verification <https://docs.habana.ai/en/latest/Installation_Guide/Platform_Upgrade_and_Unboxing.html#system-verifications-and-final-tests>`__
Verification <https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade>`__
for more details.

Run Docker Image
----------------
~~~~~~~~~~~~~~~~

It is highly recommended to use the latest Docker image from Intel Gaudi
vault. Refer to the `Intel Gaudi
@@ -68,7 +68,16 @@ Use the following commands to run a Docker image:
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
Build and Install vLLM
---------------------------
~~~~~~~~~~~~~~~~~~~~~~

To build and install vLLM from source, run:

.. code:: console
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python setup.py develop
Currently, the latest features and performance optimizations are developed in Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__ and are periodically upstreamed to the main vLLM repository. To install the latest `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:

@@ -77,16 +77,16 @@ Currently, the latest features and performance optimizations are developed in Ga
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -e .
$ python setup.py develop
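A minimal smoke test of the resulting installation, assuming the build above succeeded; the model name and sampling settings below are illustrative only:

.. code:: python

# Hedged sketch: quick offline-inference check after installing the fork.
# "facebook/opt-125m" is only an example model choice.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=16)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)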
Supported Features
==================

- `Offline batched
inference <https://github.com/HabanaAI/vllm-fork/blob/habana_main/docs/source/getting_started/quickstart.rst#offline-batched-inference>`__
inference <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference>`__
- Online inference via `OpenAI-Compatible
Server <https://github.com/HabanaAI/vllm-fork/blob/habana_main/docs/source/getting_started/quickstart.rst#openai-compatible-server>`__
Server <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server>`__
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
3 changes: 1 addition & 2 deletions docs/source/index.rst
@@ -43,8 +43,7 @@ vLLM is flexible and easy to use with:
* Tensor parallelism and pipeline parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
* (Experimental) Support for Intel® Gaudi® 2 accelerators
* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
* Prefix caching support
* Multi-lora support

2 changes: 1 addition & 1 deletion requirements-hpu.txt
@@ -2,7 +2,7 @@
-r requirements-common.txt

# Dependencies for HPU code
ray == 2.32.0
ray
triton
pandas
tabulate
2 changes: 1 addition & 1 deletion setup.py
@@ -418,7 +418,7 @@ def get_vllm_version() -> str:
neuron_version = str(get_neuronxcc_version())
if neuron_version != MAIN_CUDA_VERSION:
neuron_version_str = neuron_version.replace(".", "")[:3]
version += f"+neuron{neuron_version_str}"
version += f"{sep}neuron{neuron_version_str}"
elif _is_hpu():
# Get the Intel Gaudi Software Suite version
gaudi_sw_version = str(get_gaudi_sw_version())
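The change above swaps the hard-coded ``+`` for the shared ``sep`` separator when appending the platform tag to the base version. A hedged sketch of the intended behaviour, assuming ``sep`` resolves to ``+`` for the first local-version segment and ``.`` for any later one:

.. code:: python

# Hedged sketch of the version-suffix logic; not the literal setup.py code.
def append_platform_suffix(version: str, suffix: str) -> str:
    # The first platform tag opens the PEP 440 local version with "+";
    # any further tag is appended with "." so the version stays valid.
    sep = "+" if "+" not in version else "."
    return f"{version}{sep}{suffix}"

print(append_platform_suffix("0.6.3", "neuron215"))        # 0.6.3+neuron215
print(append_platform_suffix("0.6.3+gaudi1180", "extra"))  # 0.6.3+gaudi1180.extra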
5 changes: 3 additions & 2 deletions vllm/model_executor/layers/logits_processor.py
@@ -111,8 +111,9 @@ def _prune_hidden_states(
hidden_states: torch.Tensor,
sampling_metadata: SamplingMetadata,
) -> torch.Tensor:
# NOTE(kzawora): This is needed for Gaudi - in some scenarios (warmup,
# profile_run) we might not have selected_token_indices, so we skip pruning.
# NOTE(kzawora): The if guard is needed for Gaudi - in some scenarios
# (warmup, profile_run) we might not have selected_token_indices,
# so we skip pruning.
if sampling_metadata.selected_token_indices is not None:
return hidden_states.index_select(
0, sampling_metadata.selected_token_indices)
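A self-contained, hedged sketch of the guarded pruning path described by the updated comment; the fall-through of returning the hidden states unchanged is assumed from context, since the diff is truncated here.

.. code:: python

# Hedged sketch of the Gaudi-safe pruning guard; signature simplified.
from typing import Optional
import torch

def prune_hidden_states(hidden_states: torch.Tensor,
                        selected_token_indices: Optional[torch.Tensor]) -> torch.Tensor:
    # In HPU warmup/profile runs the indices may be None, so pruning is
    # skipped and the tensor is returned as-is (assumed fallback).
    if selected_token_indices is not None:
        return hidden_states.index_select(0, selected_token_indices)
    return hidden_states

hidden = torch.randn(4, 8)
print(prune_hidden_states(hidden, torch.tensor([0, 2])).shape)  # torch.Size([2, 8])
print(prune_hidden_states(hidden, None).shape)                  # torch.Size([4, 8])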
