Align fork with HPU upstream code #465

Merged (4 commits) on Nov 6, 2024
33 changes: 21 additions & 12 deletions docs/source/getting_started/gaudi-installation.rst
@@ -1,5 +1,5 @@
vLLM with Intel® Gaudi® AI Accelerators
=========================================
Installation with Intel® Gaudi® AI Accelerators
===============================================

This README provides instructions on running vLLM with Intel Gaudi devices.

@@ -22,22 +22,22 @@ Requirements


Quick start using Dockerfile
============================
----------------------------
.. code:: console

$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env


.. tip::
If you're observing the following error: ``docker: Error response from daemon: Unknown runtime specified habana.``, please refer to "Install Using Containers" section of `Intel Gaudi Software Stack and Driver Installation <https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html`__. Make sure you have ``habana-container-runtime`` package installed and that ```habana`` container runtime is registered.
If you're observing the following error: ``docker: Error response from daemon: Unknown runtime specified habana.``, please refer to "Install Using Containers" section of `Intel Gaudi Software Stack and Driver Installation <https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html>`__. Make sure you have ``habana-container-runtime`` package installed and that ``habana`` container runtime is registered.
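
A quick way to confirm the registration (this assumes Docker's default configuration layout) is:

.. code:: console

$ docker info | grep -i runtimes    # "habana" should appear in the list of runtimes
$ cat /etc/docker/daemon.json       # the "runtimes" section should contain a "habana" entry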


Build from source
=================
-----------------

Environment verification
------------------------
~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the Intel Gaudi software was correctly installed, run:

@@ -49,11 +49,11 @@ To verify that the Intel Gaudi software was correctly installed, run:
$ pip list | grep neural # verify that neural_compressor is installed

Refer to `Intel Gaudi Software Stack
Verification <https://docs.habana.ai/en/latest/Installation_Guide/Platform_Upgrade_and_Unboxing.html#system-verifications-and-final-tests>`__
Verification <https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade>`__
for more details.
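
As a rough sanity check of the stack (these particular commands are illustrative; the verification guide linked above is authoritative), you can run:

.. code:: console

$ hl-smi                                           # the driver should list the Gaudi devices
$ python -c "import habana_frameworks.torch.core"  # the PyTorch bridge should import cleanly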

Run Docker Image
----------------
~~~~~~~~~~~~~~~~

It is highly recommended to use the latest Docker image from Intel Gaudi
vault. Refer to the `Intel Gaudi
@@ -68,7 +68,16 @@ Use the following commands to run a Docker image:
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

Build and Install vLLM
---------------------------
~~~~~~~~~~~~~~~~~~~~~~

To build and install vLLM from source, run:

.. code:: console

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python setup.py develop
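
To confirm the build is importable (the exact version string will vary with your checkout), a quick check is:

.. code:: console

$ python -c "import vllm; print(vllm.__version__)"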


Currently, the latest features and performance optimizations are developed in Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__ and we periodically upstream them to the vLLM main repo. To install the latest `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:

@@ -77,16 +86,16 @@ Currently, the latest features and performance optimizations are developed in Ga
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -e .
$ python setup.py develop
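
Once built, a minimal offline smoke test can be run; the model below is only an example, and any small supported checkpoint works:

.. code:: python

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model, swap in any supported checkpoint
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)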


Supported Features
==================

- `Offline batched
inference <https://github.com/HabanaAI/vllm-fork/blob/habana_main/docs/source/getting_started/quickstart.rst#offline-batched-inference>`__
inference <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference>`__
- Online inference via `OpenAI-Compatible
Server <https://github.com/HabanaAI/vllm-fork/blob/habana_main/docs/source/getting_started/quickstart.rst#openai-compatible-server>`__
Server <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server>`__
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
3 changes: 1 addition & 2 deletions docs/source/index.rst
@@ -43,8 +43,7 @@ vLLM is flexible and easy to use with:
* Tensor parallelism and pipeline parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
* (Experimental) Support for Intel® Gaudi® 2 accelerators
* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
* Prefix caching support
* Multi-lora support

2 changes: 1 addition & 1 deletion requirements-hpu.txt
@@ -2,7 +2,7 @@
-r requirements-common.txt

# Dependencies for HPU code
ray == 2.32.0
ray
triton
pandas
tabulate
2 changes: 1 addition & 1 deletion setup.py
@@ -418,7 +418,7 @@ def get_vllm_version() -> str:
neuron_version = str(get_neuronxcc_version())
if neuron_version != MAIN_CUDA_VERSION:
neuron_version_str = neuron_version.replace(".", "")[:3]
version += f"+neuron{neuron_version_str}"
version += f"{sep}neuron{neuron_version_str}"
elif _is_hpu():
# Get the Intel Gaudi Software Suite version
gaudi_sw_version = str(get_gaudi_sw_version())
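
For context, the pattern this hunk adopts can be sketched as follows; ``sep`` is assumed to be set earlier in ``get_vllm_version()``, typically ``"+"`` so the suffix forms a PEP 440 local version (sketch only, not the actual setup.py):

.. code:: python

def with_platform_suffix(base: str, tag: str, sep: str = "+") -> str:
    # Compose a platform tag onto the base version through a configurable
    # separator instead of a hard-coded "+".
    return f"{base}{sep}{tag}"

print(with_platform_suffix("0.6.3", "neuron215"))  # -> 0.6.3+neuron215
print(with_platform_suffix("0.6.3", "gaudi118"))   # -> 0.6.3+gaudi118
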
5 changes: 3 additions & 2 deletions vllm/model_executor/layers/logits_processor.py
@@ -111,8 +111,9 @@ def _prune_hidden_states(
hidden_states: torch.Tensor,
sampling_metadata: SamplingMetadata,
) -> torch.Tensor:
# NOTE(kzawora): This is needed for Gaudi - in some scenarios (warmup,
# profile_run) we might not have selected_token_indices, so we skip pruning.
# NOTE(kzawora): The if guard is needed for Gaudi - in some scenarios
# (warmup, profile_run) we might not have selected_token_indices,
# so we skip pruning.
if sampling_metadata.selected_token_indices is not None:
return hidden_states.index_select(
0, sampling_metadata.selected_token_indices)
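
A standalone sketch of what the guarded pruning does (shapes here are made up for illustration):

.. code:: python

import torch

hidden_states = torch.randn(8, 4096)           # [num_tokens, hidden_size]
selected_token_indices = torch.tensor([3, 7])  # e.g. the last token of each sequence

# Keep only the rows whose logits are needed for sampling; during Gaudi
# warmup/profile_run the indices are absent, so pruning is skipped.
if selected_token_indices is not None:
    hidden_states = hidden_states.index_select(0, selected_token_indices)

print(hidden_states.shape)  # torch.Size([2, 4096])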