Prepare for open sourcing (#80)
Initial code drop with Spyre support

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Nikolaos Papandreou <npo@zurich.ibm.com>
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nikolaos Papandreou <npo@zurich.ibm.com>
Co-authored-by: TRAVIS JOHNSON <tsjohnso@us.ibm.com>
Co-authored-by: Burkhard Ringlein <NGL@zurich.ibm.com>
Co-authored-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
Co-authored-by: Maximilien Philippe Marie de Bayser <mbayser@br.ibm.com>
9 people authored and GitHub Enterprise committed Dec 19, 2024
1 parent 772a667 commit 3e43bb2
Showing 39 changed files with 3,748 additions and 20 deletions.
2 changes: 2 additions & 0 deletions .yapfignore
@@ -1 +1,3 @@
collect_env.py

vllm/model_executor/model_loader/spyre_setup.py
28 changes: 28 additions & 0 deletions Dockerfile.spyre
@@ -0,0 +1,28 @@
# Global Args #################################################################
ARG BASE_UBI_IMAGE_TAG=9.4
ARG PYTHON_VERSION=3.12

# Base Layer ##################################################################
FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} AS base
ARG PYTHON_VERSION
ENV PYTHON_VERSION=${PYTHON_VERSION}
WORKDIR /workspace/vllm

# Install some basic utilities ##################################################################
RUN microdnf update -y && microdnf install -y \
    python${PYTHON_VERSION}-devel python${PYTHON_VERSION}-pip python${PYTHON_VERSION}-wheel git vim gcc g++ \
&& microdnf clean all

# Install build dependencies ##################################################################
RUN --mount=type=bind,source=requirements-build.txt,target=requirements-build.txt \
python3.12 -m pip install --upgrade pip && \
pip install -r requirements-build.txt

# Build vLLM ##################################################################
COPY . .

ENV VLLM_TARGET_DEVICE=spyre
RUN --mount=type=bind,source=.git,target=.git \
pip install --no-build-isolation -v -e .

CMD ["/bin/bash"]
70 changes: 70 additions & 0 deletions README.md
@@ -15,6 +15,76 @@ Easy, fast, and cheap LLM serving for everyone

---

## What is the purpose of this fork?

This is a private fork of vLLM that we are using to develop support for IBM Research's AI accelerator (Spyre).
The idea is that the main branch of this repo should not diverge significantly from upstream beyond changes required to enable Spyre.
We will try to rebase against upstream frequently and we plan to contribute these changes to the upstream repository in the future.

---
## Supported IBM Granite models on Spyre

| Model | 3b | 7b | 8b | 13b | 20b |
|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|
| **llama** | NO<sup>1</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-3b-code-base) | YES<sup>2</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-7b-base) | YES<sup>3</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-8b-code-base) | X | X |
| **gpt big code** | YES<sup>4</sup> <br> [-](tom) | X | X | YES<sup>5</sup> <br> [-](tom) | YES<sup>6</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-20b-code-base) |



YES &nbsp;= &nbsp;working on Spyre<br>
NO&nbsp;&nbsp;&nbsp;= &nbsp;not yet working on Spyre<br>
X &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;= &nbsp;no weights available


#### Path to models

1 : ```/models/granite-3b-code-base```<br>
2 : ```/models/granite-7b-base```<br>
3 : ```/models/granite-8b-code-base```<br>
4 : ```/models/granite-3b-base```<br>
5 : ```/models/granite-13b-base```<br>
6 : ```/models/granite-20b-code-base```<br><br>
(PVC in dev pod)

## Running ***offline*** demo on Spyre

```bash
python3 examples/offline_inference_spyre.py
```
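
The script itself is not reproduced in this diff, but the flow it wraps can be sketched from `examples/offline_inference_multi_spyre.py` (added in this commit): set the Spyre warmup shapes via environment variables, construct an `LLM` with `device="spyre"`, and call `generate`. The sketch below is illustrative only, a single-card variant of that example rather than the exact contents of `offline_inference_spyre.py`; the model path is just one of the paths listed above.

```python
import os

from vllm import LLM, SamplingParams

# Spyre warmup shapes; set these before the LLM engine is constructed.
os.environ["VLLM_SPYRE_WARMUP_PROMPT_LENS"] = "64"  # prompt padding, a multiple of 64
os.environ["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = "20"   # max output tokens covered by warmup
os.environ["VLLM_SPYRE_WARMUP_BATCH_SIZES"] = "1"

llm = LLM(
    model="/models/granite-7b-base",  # any of the model paths listed above
    max_model_len=2048,
    block_size=2048,
    device="spyre",
)

sampling_params = SamplingParams(max_tokens=20, temperature=0.0)
outputs = llm.generate(
    ["Provide a list of instructions for preparing chicken soup."],
    sampling_params)
print(outputs[0].outputs[0].text)
```
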
## Running ***online*** demo on Spyre

### Batch size 1
Log in to the same pod from two terminal windows: launch the server in one and submit requests from the other.

**1st terminal window**: Start the server with a model provided at \<path> [above](#path-to-models) (slow: the initial Spyre compilation during warmup takes a long time):
```bash
python3 -m vllm.entrypoints.openai.api_server --model <path> --max-model-len=2048 --block-size=2048
```
Optionally, before starting the server, set the desired prompt padding (*default 64*) to any multiple of 64 and the maximum number of generated output tokens (*default 20*) via **VLLM_SPYRE_WARMUP_PROMPT_LENS** and **VLLM_SPYRE_WARMUP_NEW_TOKENS**:
```bash
export VLLM_SPYRE_WARMUP_PROMPT_LENS=64
export VLLM_SPYRE_WARMUP_NEW_TOKENS=20
```
**2nd terminal window**: Once the above warmup has completed, submit sample prompts for LLM completion (fast):
```bash
python3 examples/spyre_warmup_online_client.py
```
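
The client script above is the simplest way to submit prompts, but any OpenAI-compatible client can talk to the server started in the 1st terminal window. The sketch below is illustrative only (it is not the contents of `spyre_warmup_online_client.py`) and assumes the `openai` Python package, the server's default port 8000, and no API key configured on the server.

```python
from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="<path>",  # the same model path that was passed to --model
    prompt="Provide a list of instructions for preparing chicken soup.",
    max_tokens=20,   # stay within VLLM_SPYRE_WARMUP_NEW_TOKENS
    temperature=0.0,
)
print(completion.choices[0].text)
```
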
### Batch size 4/8

Before launching the server, specify the batch size to use (set to 4 below) via the environment variable **VLLM_SPYRE_WARMUP_BATCH_SIZES** (*default 1*):
```bash
export VLLM_SPYRE_WARMUP_BATCH_SIZES=4
```

Then continue as described [above](#batch-size-1) by launching the server in the 1st terminal window.
Before submitting prompts from the 2nd terminal window, make sure the batch size set in the [client script](./examples/spyre_warmup_online_client.py) (line 44) matches the value of **VLLM_SPYRE_WARMUP_BATCH_SIZES**.
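
For illustration, matching the warmup batch size simply means submitting prompts in groups of that size; the sketch below (not the actual client script) assumes the same server and `openai` client as above and uses a batch of 4 to match the export shown earlier.

```python
from openai import OpenAI

BATCH_SIZE = 4  # must match VLLM_SPYRE_WARMUP_BATCH_SIZES used when starting the server

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# /v1/completions also accepts a list of prompts, so one request carries the whole batch.
prompts = [
    f"Write one sentence about the number {i}." for i in range(BATCH_SIZE)
]

completion = client.completions.create(
    model="<path>",  # the same model path that was passed to --model
    prompt=prompts,
    max_tokens=20,
    temperature=0.0,
)
for choice in completion.choices:
    print(choice.text)
```
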
### Example notebooks

- [./examples/online_inference_spyre.ipynb](./examples/online_inference_spyre.ipynb)
- [./examples/offline_inference_spyre.ipynb](./examples/offline_inference_spyre.ipynb)


---
*Latest News* 🔥
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
60 changes: 60 additions & 0 deletions examples/offline_inference_multi_spyre.py
@@ -0,0 +1,60 @@
import gc
import os
import time

from vllm import LLM, SamplingParams

max_tokens = 3

os.environ["VLLM_SPYRE_WARMUP_PROMPT_LENS"] = '64'
os.environ["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = str(max_tokens)
os.environ['VLLM_SPYRE_WARMUP_BATCH_SIZES'] = '1'

# Configuration for multi-Spyre (tensor parallel) execution
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
os.environ["DISTRIBUTED_STRATEGY_IGNORE_MODULES"] = "WordEmbedding"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"

# Sample prompts.
template = (
"Below is an instruction that describes a task. Write a response that "
"appropriately completes the request. Be polite in your response to the "
"user.\n\n### Instruction:\n{}\n\n### Response:")
prompt1 = template.format(
"Provide a list of instructions for preparing chicken soup for a family "
"of four.")
prompts = [
prompt1,
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=max_tokens,
temperature=0.0,
ignore_eos=True)
# Create an LLM.
llm = LLM(
model="/models/llama-194m",
tokenizer="/models/llama-194m",
max_model_len=2048,
block_size=2048,
device="spyre",
tensor_parallel_size=2,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
print("=============== GENERATE")
t0 = time.time()
outputs = llm.generate(prompts, sampling_params)
print("Time elaspsed for %d tokens is %.2f sec" %
(len(outputs[0].outputs[0].token_ids), time.time() - t0))
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print(output.outputs[0])

# needed to prevent an ugly stack dump caused by SIGTERM at exit
del llm
gc.collect()
