
Commit

Merge remote-tracking branch 'origin/master' into lora_in_other_pipelines
slyalin committed Oct 3, 2024
2 parents 06f4ac6 + 9177756 commit bdce2a5
Showing 2 changed files with 115 additions and 89 deletions.
203 changes: 114 additions & 89 deletions llm_bench/python/README.md
@@ -1,140 +1,165 @@
# Benchmarking Script for Large Language Models

This script provides a unified approach to estimate performance for Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models using nearly identical code and pre-collected models.

## Usage

### 1. Prepare Python Virtual Environment for LLM Benchmarking

``` bash
python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
pip install -r requirements.txt
```

> Note:
> For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:
> `pip install -U --upgrade-strategy eager -r requirements.txt`
#### (Optional) Hugging Face Login:

Log in to Hugging Face if you want to use non-public models:

```bash
huggingface-cli login
```

### 2. Convert Model to OpenVINO IR Format

The `optimum-cli` tool simplifies converting Hugging Face models to OpenVINO IR format.
- Detailed documentation can be found in the [Optimum-Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export).
- To learn more about weight compression, see the [NNCF Weight Compression Guide](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
- For additional guidance on running inference with OpenVINO for LLMs, see the [OpenVINO LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).

**Usage:**

```bash
optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>

optimum-cli export openvino -h # For detailed information
```

* `--model <MODEL_ID>`: model ID for downloading from the [Hugging Face Hub](https://huggingface.co/models), or a path to a local directory containing the PyTorch model.
* `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32`, `fp16`, `int8`, `int4`, `mxfp4`.
* `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.

**NOTE:**
- Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.
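For instance, to keep full-precision weights for such a model, pass `--weight-format fp32` explicitly (the output directory name below is only an illustration):
```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp32 models/llama-2-7b-chat-fp32
```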

**Example:**
```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
```
**Resulting file structure:**

```console
models
└── llama-2-7b-chat
├── config.json
├── generation_config.json
├── openvino_detokenizer.bin
├── openvino_detokenizer.xml
├── openvino_model.bin
├── openvino_model.xml
├── openvino_tokenizer.bin
├── openvino_tokenizer.xml
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
```

### 3. Benchmark LLM Model

To benchmark the performance of the LLM, use the following command:

``` bash
python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/ -n 2
python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
```


**Parameters:**
- `-m`: Path to the model.
- `-d`: Inference device (default: CPU).
- `-r`: Path to the CSV report.
- `-f`: Framework (default: ov).
- `-p`: Interactive prompt text.
- `-pf`: Path to a JSONL file containing prompts (see the example after this list).
- `-n`: Number of benchmarking iterations (default: 0); if the value is greater than 0, the first iteration is excluded.
- `-ic`: Limit the output token size (default: 512) for text generation and code generation models.
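For illustration, the sketch below creates a minimal prompt file for `-pf` and shows a couple of extra invocations. It assumes each line of the JSONL file carries a single `prompt` field (check the files under `prompts/` for the exact schema); the file name, device, and report path are made up:
```bash
# Hypothetical example: one-line JSONL prompt file used via -pf
echo '{"prompt": "What is OpenVINO?"}' > my_prompts.jsonl
python benchmark.py -m models/llama-2-7b-chat/ -pf my_prompts.jsonl -n 2

# Hypothetical example: select a device and write a CSV report (requires a supported GPU)
python benchmark.py -m models/llama-2-7b-chat/ -n 2 -d GPU -r report.csv
```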

**Additional options:**
``` bash
python ./benchmark.py -h # for more information
```

#### Benchmarking the Original PyTorch Model:
To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch specified as the framework via the `-f pt` parameter:

```bash
# Download PyTorch Model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
# Benchmark with PyTorch Framework
python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
```

> **Note:** If needed, you can install a specific OpenVINO version using pip:
> ``` bash
> # e.g.
> pip install openvino==2024.4.0
> # Optional, install the openvino nightly package if needed.
> # OpenVINO nightly is pre-release software and has not undergone full release validation or qualification.
> pip uninstall openvino
> pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
> ```
## 4. Benchmark LLM with `torch.compile()`
The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels using a specified backend.

Before benchmarking, download the original PyTorch model locally:
```bash
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
```
To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` and `openvino` (default). Example:

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
```
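For comparison, a sketch of the same run with the native PyTorch backend (only the backend flag changes; the model path assumes the download step above):

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend pytorch
```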

> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:
>
> ```bash
> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
> ```
## 5. Running on 2-Socket Platforms
The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify this behavior.

| OpenVINO Version | Behaviors |
|:--------------------|:------------------------------------------------|
| Before 2024.0.0     | streams.num(1) <br>executes on 2 sockets.        |
| 2024.0.0            | streams.num(1) <br>executes on the same socket as the application is running on. |

For example, passing `--load_config config.json` with the following content results in `streams.num(1)` and execution on 2 sockets:
```json
{
"INFERENCE_NUM_THREADS": <NUMBER>
}
```
`<NUMBER>` is the total number of physical cores across the 2 sockets.
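Alternatively, on Linux you can pin the run to a single socket with `numactl`; a minimal sketch, assuming NUMA node 0 (the node IDs depend on your system topology):

```bash
# Hypothetical example: bind benchmark.py to the CPUs and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 python benchmark.py -m models/llama-2-7b-chat/ -n 2
```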
## 6. Additional Resources

- **Error Troubleshooting:** Check the [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.
1 change: 1 addition & 0 deletions llm_bench/python/convert.py
@@ -1464,6 +1464,7 @@ def main():
add_stateful_model_arguments(parser)

args = parser.parse_args()
log.warning("[DEPRECATED] Not for production use! Please use the 'optimum-intel' to generate the IRs. For details, please check: https://github.com/openvinotoolkit/openvino.genai/blob/master/llm_bench/python/README.md#2-convert-model-to-openvino-ir-format")
log.info(f"openvino runtime version: {get_version()}")
model_type = get_convert_model_type(args.model_id.lower())
converter = converters[model_type]
