Prepare for open sourcing (#80)
Initial code drop with Spyre support

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Nikolaos Papandreou <npo@zurich.ibm.com>
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nikolaos Papandreou <npo@zurich.ibm.com>
Co-authored-by: TRAVIS JOHNSON <tsjohnso@us.ibm.com>
Co-authored-by: Burkhard Ringlein <NGL@zurich.ibm.com>
Co-authored-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
Co-authored-by: Maximilien Philippe Marie de Bayser <mbayser@br.ibm.com>
9 people authored and GitHub Enterprise committed Dec 19, 2024
1 parent 772a667 commit 3e43bb2
Showing 39 changed files with 3,748 additions and 20 deletions.
2 changes: 2 additions & 0 deletions .yapfignore
@@ -1 +1,3 @@
collect_env.py

vllm/model_executor/model_loader/spyre_setup.py
28 changes: 28 additions & 0 deletions Dockerfile.spyre
@@ -0,0 +1,28 @@
# Global Args #################################################################
ARG BASE_UBI_IMAGE_TAG=9.4
ARG PYTHON_VERSION=3.12

# Base Layer ##################################################################
FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} AS base
ARG PYTHON_VERSION
ENV PYTHON_VERSION=${PYTHON_VERSION}
WORKDIR /workspace/vllm

# Install some basic utilities ##################################################################
RUN microdnf update -y && microdnf install -y \
    python${PYTHON_VERSION}-devel python${PYTHON_VERSION}-pip python${PYTHON_VERSION}-wheel git vim gcc g++ \
&& microdnf clean all

# Install build dependencies ##################################################################
RUN --mount=type=bind,source=requirements-build.txt,target=requirements-build.txt \
python3.12 -m pip install --upgrade pip && \
pip install -r requirements-build.txt

# Build vLLM ##################################################################
COPY . .

ENV VLLM_TARGET_DEVICE=spyre
RUN --mount=type=bind,source=.git,target=.git \
pip install --no-build-isolation -v -e .

CMD ["/bin/bash"]
70 changes: 70 additions & 0 deletions README.md
@@ -15,6 +15,76 @@ Easy, fast, and cheap LLM serving for everyone

---

## What is the purpose of this fork?

This is a private fork of vLLM that we are using to develop support for IBM Research's AI accelerator (Spyre).
The idea is that the main branch of this repo should not diverge significantly from upstream beyond changes required to enable Spyre.
We will try to rebase against upstream frequently and we plan to contribute these changes to the upstream repository in the future.

---
## Supported IBM Granite models on Spyre

| Model | 3b | 7b | 8b | 13b | 20b |
|:------------:|:------------:|:------------:|:------------:|:------------:|:------------:|
| **llama** | NO<sup>1</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-3b-code-base) | YES<sup>2</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-7b-base) | YES<sup>3</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-8b-code-base) | X | X |
| **gpt big code** | YES<sup>4</sup> <br> [-](tom) | X | X | YES<sup>5</sup> <br> [-](tom) | YES<sup>6</sup> <br> [weights](https://huggingface.co/ibm-granite/granite-20b-code-base) |



YES &nbsp;= &nbsp;working on Spyre<br>
NO&nbsp;&nbsp;&nbsp;= &nbsp;not yet working on Spyre<br>
X &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;= &nbsp;no weights available


#### Path to models

1 : ```/models/granite-3b-code-base```<br>
2 : ```/models/granite-7b-base```<br>
3 : ```/models/granite-8b-code-base```<br>
4 : ```/models/granite-3b-base```<br>
5 : ```/models/granite-13b-base```<br>
6 : ```/models/granite-20b-code-base```<br><br>
(PVC in dev pod)

## Running ***offline*** demo on Spyre

```bash
python3 examples/offline_inference_spyre.py
```
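
The script itself is not reproduced in this diff, but the flow it wraps can be sketched from `examples/offline_inference_multi_spyre.py` (added in this commit): set the Spyre warmup shapes via environment variables, construct an `LLM` with `device="spyre"`, and call `generate`. The sketch below is illustrative only, a single-card variant of that example rather than the exact contents of `offline_inference_spyre.py`; the model path is just one of the paths listed above.

```python
import os

from vllm import LLM, SamplingParams

# Spyre warmup shapes; set these before the LLM engine is constructed.
os.environ["VLLM_SPYRE_WARMUP_PROMPT_LENS"] = "64"  # prompt padding, a multiple of 64
os.environ["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = "20"   # max output tokens covered by warmup
os.environ["VLLM_SPYRE_WARMUP_BATCH_SIZES"] = "1"

llm = LLM(
    model="/models/granite-7b-base",  # any of the model paths listed above
    max_model_len=2048,
    block_size=2048,
    device="spyre",
)

sampling_params = SamplingParams(max_tokens=20, temperature=0.0)
outputs = llm.generate(
    ["Provide a list of instructions for preparing chicken soup."],
    sampling_params)
print(outputs[0].outputs[0].text)
```
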
## Running ***online*** demo on Spyre

### Batch size 1
Log in to the same pod from two terminal windows: launch the server in one and submit requests from the other.

**1st terminal window**: Start the server with a model provided at \<path> [above](#path-to-models) (slow: the initial Spyre compilation during warmup takes a long time):
```bash
python3 -m vllm.entrypoints.openai.api_server --model <path> --max-model-len=2048 --block-size=2048
```
Optionally, before starting the server, set the desired prompt padding (*default 64*) to any multiple of 64 and the maximum number of generated output tokens (*default 20*) via **VLLM_SPYRE_WARMUP_PROMPT_LENS** and **VLLM_SPYRE_WARMUP_NEW_TOKENS**:
```bash
export VLLM_SPYRE_WARMUP_PROMPT_LENS=64
export VLLM_SPYRE_WARMUP_NEW_TOKENS=20
```
**2nd terminal window**: Once the above warmup has completed, submit sample prompts for LLM completion (fast):
```bash
python3 examples/spyre_warmup_online_client.py
```
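
The client script above is the simplest way to submit prompts, but any OpenAI-compatible client can talk to the server started in the 1st terminal window. The sketch below is illustrative only (it is not the contents of `spyre_warmup_online_client.py`) and assumes the `openai` Python package, the server's default port 8000, and no API key configured on the server.

```python
from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="<path>",  # the same model path that was passed to --model
    prompt="Provide a list of instructions for preparing chicken soup.",
    max_tokens=20,   # stay within VLLM_SPYRE_WARMUP_NEW_TOKENS
    temperature=0.0,
)
print(completion.choices[0].text)
```
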
### Batch size 4/8

Before launching the server, specify the batch size to use (set to 4 below) via the environment variable **VLLM_SPYRE_WARMUP_BATCH_SIZES** (*default 1*):
```bash
export VLLM_SPYRE_WARMUP_BATCH_SIZES=4
```

Then continue as described [above](#batch-size-1) by launching the server in the 1st terminal window.
Before submitting prompts from the 2nd terminal window, make sure the batch size set in the [client script](./examples/spyre_warmup_online_client.py) (line 44) matches the value of **VLLM_SPYRE_WARMUP_BATCH_SIZES**.
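
For illustration, matching the warmup batch size simply means submitting prompts in groups of that size; the sketch below (not the actual client script) assumes the same server and `openai` client as above and uses a batch of 4 to match the export shown earlier.

```python
from openai import OpenAI

BATCH_SIZE = 4  # must match VLLM_SPYRE_WARMUP_BATCH_SIZES used when starting the server

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# /v1/completions also accepts a list of prompts, so one request carries the whole batch.
prompts = [
    f"Write one sentence about the number {i}." for i in range(BATCH_SIZE)
]

completion = client.completions.create(
    model="<path>",  # the same model path that was passed to --model
    prompt=prompts,
    max_tokens=20,
    temperature=0.0,
)
for choice in completion.choices:
    print(choice.text)
```
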
### Example notebooks

- [./examples/online_inference_spyre.ipynb](./examples/online_inference_spyre.ipynb)
- [./examples/offline_inference_spyre.ipynb](./examples/offline_inference_spyre.ipynb)


---
*Latest News* 🔥
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
60 changes: 60 additions & 0 deletions examples/offline_inference_multi_spyre.py
@@ -0,0 +1,60 @@
import gc
import os
import time

from vllm import LLM, SamplingParams

max_tokens = 3

os.environ["VLLM_SPYRE_WARMUP_PROMPT_LENS"] = '64'
os.environ["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = str(max_tokens)
os.environ['VLLM_SPYRE_WARMUP_BATCH_SIZES'] = '1'

# Configuration for multi-Spyre (tensor parallel) execution
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
os.environ["DISTRIBUTED_STRATEGY_IGNORE_MODULES"] = "WordEmbedding"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"

# Sample prompts.
template = (
"Below is an instruction that describes a task. Write a response that "
"appropriately completes the request. Be polite in your response to the "
"user.\n\n### Instruction:\n{}\n\n### Response:")
prompt1 = template.format(
"Provide a list of instructions for preparing chicken soup for a family "
"of four.")
prompts = [
prompt1,
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=max_tokens,
temperature=0.0,
ignore_eos=True)
# Create an LLM.
llm = LLM(
model="/models/llama-194m",
tokenizer="/models/llama-194m",
max_model_len=2048,
block_size=2048,
device="spyre",
tensor_parallel_size=2,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
print("=============== GENERATE")
t0 = time.time()
outputs = llm.generate(prompts, sampling_params)
print("Time elaspsed for %d tokens is %.2f sec" %
(len(outputs[0].outputs[0].token_ids), time.time() - t0))
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print(output.outputs[0])

# needed to prevent an ugly stack dump caused by SIGTERM at exit
del llm
gc.collect()
