Add GPU example for Janus-Pro (#12869)
* Add example for Janus-Pro

* Update model link

* Fixes

* Fixes

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
3 people authored Feb 21, 2025
1 parent 21d6a78 commit 1e00bed
Showing 5 changed files with 318 additions and 6 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -336,6 +336,7 @@ Over 70 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM
| MiniCPM-Llama3-V-2_5 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5) | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal) |
| MiniCPM-V-2_6 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6) | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6) | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal) |
| MiniCPM-o-2_6 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-o-2_6/) |
| Janus-Pro | | [link](python/llm/example/GPU/HuggingFace/Multimodal/janus-pro/) |
| StableDiffusion | | [link](python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion) |
| Bce-Embedding-Base-V1 | | | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Embedding) |
| Speech_Paraformer-Large | | | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal) |
1 change: 1 addition & 0 deletions README.zh-CN.md
@@ -336,6 +336,7 @@ See the demo of running [*Text-Generation-WebUI*](https://ipex-llm.readthedocs.i
| MiniCPM-Llama3-V-2_5 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-Llama3-V-2_5) | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal) |
| MiniCPM-V-2_6 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm-v-2_6) | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-V-2_6) | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal) |
| MiniCPM-o-2_6 | | [link](python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-o-2_6/) |
| Janus-Pro | | [link](python/llm/example/GPU/HuggingFace/Multimodal/janus-pro/) |
| StableDiffusion | | [link](python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion) |
| Bce-Embedding-Base-V1 | | | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Embedding) |
| Speech_Paraformer-Large | | | [Python link](python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal) |
12 changes: 6 additions & 6 deletions python/llm/example/GPU/HuggingFace/Multimodal/MiniCPM-o-2_6/README.md
@@ -58,12 +58,12 @@ pip install moviepy
> [!NOTE]
> We will update the runtime configuration instructions for more Intel GPU platforms.
-### 1. Example: Chat in Omni Mode
+## 1. Example: Chat in Omni Mode
In [omni.py](./omni.py), we show a use case for a MiniCPM-o-2_6 model to chat in omni mode with IPEX-LLM INT4 optimizations on Intel GPUs. In this example, the model will take a video as input and conduct inference based on the images and audio of this video.

For example, the video input shows a clip of an athlete swimming, with background audio asking "What the athlete is doing?". The model in omni mode should then run inference based on the frames of the video and the question in the audio.

-#### 1.1 Running example
+### 1.1 Running example

```bash
python omni.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --video-path VIDEO_PATH
@@ -80,10 +80,10 @@ Arguments info:
> [!TIP]
> You could just ignore the warning regarding `Some weights of the model checkpoint at xxx were not used when initializing MiniCPMO`.
-### 2. Example: Chat with text/audio/image input
+## 2. Example: Chat with text/audio/image input
In [chat.py](./chat.py), we show a use case for a MiniCPM-o-2_6 model to chat based on text/audio/image, or a combination of two of them, with IPEX-LLM INT4 optimizations on Intel GPUs.

-#### 2.1 Running example
+### 2.1 Running example

- Chat with text input
```bash
@@ -126,9 +126,9 @@ Arguments info:
> [!TIP]
> You could just ignore the warning regarding `Some weights of the model checkpoint at xxx were not used when initializing MiniCPMO`.
-#### 2.2 Sample Outputs
+### 2.2 Sample Outputs

-##### [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6)
+#### [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6)

The sample input image (which is fetched from the [COCO dataset](https://cocodataset.org/#explore?id=264959)) is:

180 changes: 180 additions & 0 deletions python/llm/example/GPU/HuggingFace/Multimodal/janus-pro/README.md
@@ -0,0 +1,180 @@
# Janus-Pro
In this directory, you will find examples of how to apply IPEX-LLM low-bit optimizations to Janus-Pro models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [deepseek-ai/Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B) and [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B) as reference Janus-Pro models.

In the following examples, we will guide you through applying IPEX-LLM optimizations to Janus-Pro models with text/image inputs.

## 0. Requirements & Installation

To run these examples with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.

### 0.1 Install IPEX-LLM

- For **Intel Core™ Ultra Processors (Series 2) with processor number 2xxV (code name Lunar Lake)** on Windows:
```cmd
conda create -n llm python=3.11 libuv
conda activate llm
:: or --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/lnl/cn/
pip install --pre --upgrade ipex-llm[xpu_lnl] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/lnl/us/
```
- For **Intel Arc B-Series GPU (code name Battlemage)** on Linux:
```bash
conda create -n llm python=3.11
conda activate llm
# or --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
pip install --pre --upgrade ipex-llm[xpu-arc] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

> [!NOTE]
> We will update the installation instructions for more Intel GPU platforms.
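
After installation, you could optionally run a quick sanity check to confirm that PyTorch can see your Intel GPU. The snippet below is only a minimal sketch (not part of the official example); it assumes the `llm` environment created above is active and that the installed build exposes the `torch.xpu` device API.

```python
# Minimal sanity check (a sketch, not part of the official example).
# Assumption: the `llm` environment above is active and the installed build
# exposes the `torch.xpu` device API.
import torch
import ipex_llm  # loads IPEX-LLM; the XPU-enabled PyTorch stack is installed alongside it

if torch.xpu.is_available():
    print("XPU available:", torch.xpu.get_device_name(0))
else:
    print("No XPU device detected; please check your GPU driver setup.")
```
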
### 0.2 Install Required Packages for Janus-Pro

First, you need to clone `deepseek-ai/Janus` from GitHub.

```bash
git clone https://github.com/deepseek-ai/Janus.git
```

Then you can install the requirements for Janus-Pro models.

```bash
conda activate llm
cd Janus

# refer to https://github.com/deepseek-ai/Janus?tab=readme-ov-file#janus-pro
pip install -e .

pip install transformers==4.45.0
pip install accelerate==0.33.0
pip install "trl<0.12.0"

cd ..
```
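
As an optional sanity check (not part of the official example), you could verify that the Janus package and IPEX-LLM import cleanly in the `llm` environment before moving on; these are the same imports used by [generate.py](./generate.py), and the `transformers` version should match the pin above.

```python
# Optional sanity check: confirm the Janus package (installed via `pip install -e .`)
# and IPEX-LLM can be imported. These are the same imports used by generate.py.
from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images
from ipex_llm.transformers import AutoModelForCausalLM

import transformers
print("transformers version:", transformers.__version__)  # expected: 4.45.0 (pinned above)
print("Janus-Pro dependencies look ready.")
```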

### 0.3 Runtime Configuration

- For **Intel Core™ Ultra Processors (Series 2) with processor number 2xxV (code name Lunar Lake)** on Windows:
```cmd
set SYCL_CACHE_PERSISTENT=1
```
- For **Intel Arc B-Series GPU (code name Battlemage)** on Linux:
```bash
unset OCL_ICD_VENDOR
export SYCL_CACHE_PERSISTENT=1
```

> [!NOTE]
> We will update the runtime configuration instructions for more Intel GPU platforms.
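
If you prefer to keep everything in a single script, the variables above can also be set from Python before any model loading happens. This is only a convenience sketch based on our assumption that setting them in-process, before the XPU runtime initializes, behaves the same as exporting them in the shell; the shell commands above remain the reference configuration.

```python
# Convenience sketch (assumption: setting these in-process before any XPU work
# behaves the same as exporting them in the shell, since the SYCL runtime reads
# them when it initializes).
import os

os.environ.setdefault("SYCL_CACHE_PERSISTENT", "1")

# Linux with Intel Arc B-Series only: the shell instructions above also unset OCL_ICD_VENDOR.
os.environ.pop("OCL_ICD_VENDOR", None)
```
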
## 1. Example: Predict Tokens using `generate()` API
In [generate.py](./generate.py), we show a use case for a Janus-Pro model to predict the next N tokens using the `generate()` API based on text input, image input, or a combination of the two, with IPEX-LLM low-bit optimizations on Intel GPUs.

### 1.1 Running example

- Generate with text input
- [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)
```bash
python generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
- [deepseek-ai/Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)
```bash
python generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --low-bit "sym_int8" --prompt PROMPT --n-predict N_PREDICT
```

- Generate with text + image inputs
- [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)
```bash
python generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --image-path IMAGE_PATH --n-predict N_PREDICT
```
- [deepseek-ai/Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)
```bash
python generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --low-bit "sym_int8" --prompt PROMPT --image-path IMAGE_PATH --n-predict N_PREDICT
```

> [!NOTE]
> For `deepseek-ai/Janus-Pro-1B`, we recommend IPEX-LLM INT8 (`sym_int8`) optimizations.

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the Janus-Pro model (e.g. `deepseek-ai/Janus-Pro-7B` or `deepseek-ai/Janus-Pro-1B`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'deepseek-ai/Janus-Pro-7B'`.
- `--prompt PROMPT`: argument defining the text input. It defaults to `'Describe the image in detail.'` when `--image-path` is provided, and to `'What is AI?'` otherwise.
- `--image-path IMAGE_PATH`: argument defining the image input.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
- `--low-bit LOW_BIT`: argument defining the low-bit optimization that will be applied to the model. It defaults to `'sym_int4'`.

### 1.2 Sample Outputs
The sample input image (which is fetched from the [COCO dataset](https://cocodataset.org/#explore?id=264959)) is:

<a href="http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg"><img width=400px src="http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg" ></a><br>
http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg


#### [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)


- Chat with text + image inputs
```log
Inference time: xxxx s
-------------------- Input Image Path --------------------
5602445367_3504763978_z.jpg
-------------------- Input Prompt (Formatted) --------------------
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
<|User|>: <image_placeholder>
Describe the image in detail.
<|Assistant|>:
-------------------- Chat Output --------------------
The image shows a young child holding a small plush toy. The child is wearing a pink and white striped dress with a red and white bow on the shoulder.
```

- Chat with only text input:
```log
Inference time: xxxx s
-------------------- Input Image Path --------------------
None
-------------------- Input Prompt (Formatted) --------------------
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
<|User|>: What is AI?
<|Assistant|>:
-------------------- Chat Output --------------------
AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think, learn, and make decisions like humans.
```

#### [deepseek-ai/Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)


- Chat with text + image inputs
```log
Inference time: xxxx s
-------------------- Input Image Path --------------------
5602445367_3504763978_z.jpg
-------------------- Input Prompt (Formatted) --------------------
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
<|User|>: <image_placeholder>
Describe the image in detail.
<|Assistant|>:
-------------------- Chat Output --------------------
The image shows a young child holding a small plush teddy bear. The teddy bear is dressed in a pink outfit with a polka-dotted tutu
```

- Chat with only text input:
```log
Inference time: xxxx s
-------------------- Input Image Path --------------------
None
-------------------- Input Prompt (Formatted) --------------------
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
<|User|>: What is AI?
<|Assistant|>:
-------------------- Chat Output --------------------
AI stands for Artificial Intelligence. It is a branch of computer science that aims to create intelligent machines that can perform tasks that typically require human intelligence, such as learning
```
130 changes: 130 additions & 0 deletions python/llm/example/GPU/HuggingFace/Multimodal/janus-pro/generate.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
import time
import torch
import argparse
from ipex_llm.transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using generate() API for Janus-Pro model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="deepseek-ai/Janus-Pro-7B",
                        help='The Hugging Face repo id for the Janus-Pro model to be downloaded'
                             ', or the path to the checkpoint folder')
    parser.add_argument('--image-path', type=str,
                        help='The path to the image for inference.')
    parser.add_argument('--prompt', type=str,
                        help='Prompt for inference.')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    parser.add_argument('--low-bit', type=str, default="sym_int4",
                        help='Low bit optimizations that will be applied to the model.')

    args = parser.parse_args()

    model_path = args.repo_id_or_model_path
    model_name = os.path.basename(model_path)
    prompt = args.prompt
    image_path = args.image_path
    if prompt is None:
        if image_path is not None and os.path.exists(image_path):
            prompt = "Describe the image in detail."
        else:
            prompt = "What is AI?"

    # The following code is adapted from
    # https://github.com/deepseek-ai/Janus?tab=readme-ov-file#multimodal-understanding
    vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
    tokenizer = vl_chat_processor.tokenizer

    model_vl = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit=args.low_bit,
        optimize_model=True,
        trust_remote_code=True,
        modules_to_not_convert=["vision_model"]  # keep the vision encoder out of low-bit conversion
    ).eval()

    model_vl = model_vl.half().to('xpu')

    if image_path is not None and os.path.exists(image_path):
        conversation = [
            {
                "role": "<|User|>",
                "content": f"<image_placeholder>\n{prompt}",
                "images": [image_path],
            },
            {"role": "<|Assistant|>", "content": ""},
        ]
    else:
        conversation = [
            {
                "role": "<|User|>",
                "content": f"{prompt}",
            },
            {"role": "<|Assistant|>", "content": ""},
        ]

    # load images and prepare for inputs
    pil_images = load_pil_images(conversation)
    prepare_inputs = vl_chat_processor(
        conversations=conversation, images=pil_images, force_batchify=True
    )

    prepare_inputs = prepare_inputs.to(device='xpu', dtype=torch.half)

    # run image encoder to get the image embeddings
    inputs_embeds = model_vl.prepare_inputs_embeds(**prepare_inputs)

    with torch.inference_mode():
        # ipex_llm model needs a warmup, then inference time can be accurate
        outputs = model_vl.language_model.generate(
            inputs_embeds=inputs_embeds,
            attention_mask=prepare_inputs.attention_mask,
            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=args.n_predict,
            do_sample=False,
            use_cache=True,
        )

        st = time.time()
        # run the model to get the response
        outputs = model_vl.language_model.generate(
            inputs_embeds=inputs_embeds,
            attention_mask=prepare_inputs.attention_mask,
            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=args.n_predict,
            do_sample=False,
            use_cache=True,
        )
        ed = time.time()

        response = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)

        print(f'Inference time: {ed-st} s')
        print('-'*20, 'Input Image Path', '-'*20)
        print(image_path)
        print('-'*20, 'Input Prompt (Formatted)', '-'*20)
        print(f"{prepare_inputs['sft_format'][0]}")
        print('-'*20, 'Chat Output', '-'*20)
        print(response)
