
Commit

Merge remote-tracking branch 'origin/master' into lora_in_other_pipelines
slyalin committed Oct 3, 2024
2 parents 06f4ac6 + 9177756 commit bdce2a5
Showing 2 changed files with 115 additions and 89 deletions.
203 changes: 114 additions & 89 deletions llm_bench/python/README.md
@@ -1,140 +1,165 @@
# Benchmarking Script for Large Language Models

This script provides a unified approach to estimate performance for Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models using nearly identical code and pre-collected models.

## Usage

### 1. Prepare Python Virtual Environment for LLM Benchmarking

``` bash
python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
pip install -r requirements.txt
```

> Note:
> For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:
> `pip install -U --upgrade-strategy eager -r requirements.txt`
#### (Optional) Hugging Face Login:

Log in to Hugging Face if you want to use non-public models:

```bash
huggingface-cli login
```

### 2. Convert Model to OpenVINO IR Format

The `optimum-cli` tool simplifies converting Hugging Face models to OpenVINO IR format.
- Detailed documentation can be found in the [Optimum-Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export).
- To learn more about weight compression, see the [NNCF Weight Compression Guide](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
- For additional guidance on running inference with OpenVINO for LLMs, see the [OpenVINO LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).

**Usage:**

```bash
optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>

optimum-cli export openvino -h # For detailed information
```

* `--model <MODEL_ID>`: model ID for downloading from the [Hugging Face Hub](https://huggingface.co/models), or a path to a local directory containing the PyTorch model.
* `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32`, `fp16`, `int8`, `int4`, `mxfp4`.
* `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.

**NOTE:**
- Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.
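For instance, to keep full-precision weights for such a model, pass `--weight-format fp32` explicitly (the output directory name below is only an illustration):
```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp32 models/llama-2-7b-chat-fp32
```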

**Example:**
```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
```
**Resulting file structure:**

```console
models
└── llama-2-7b-chat
├── config.json
├── generation_config.json
├── openvino_detokenizer.bin
├── openvino_detokenizer.xml
├── openvino_model.bin
├── openvino_model.xml
├── openvino_tokenizer.bin
├── openvino_tokenizer.xml
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
```

### 3. Benchmark LLM Model

To benchmark the performance of the LLM, use the following command:

``` bash
python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/ -n 2
python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
```


**Parameters:**
- `-m`: Path to the model.
- `-d`: Inference device (default: CPU).
- `-r`: Path to the CSV report.
- `-f`: Framework (default: ov).
- `-p`: Interactive prompt text.
- `-pf`: Path to a JSONL file containing prompts (see the example after this list).
- `-n`: Number of benchmarking iterations (default: 0); if the value is greater than 0, the first iteration is excluded.
- `-ic`: Limit the output token size (default: 512) for text generation and code generation models.
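For illustration, the sketch below creates a minimal prompt file for `-pf` and shows a couple of extra invocations. It assumes each line of the JSONL file carries a single `prompt` field (check the files under `prompts/` for the exact schema); the file name, device, and report path are made up:
```bash
# Hypothetical example: one-line JSONL prompt file used via -pf
echo '{"prompt": "What is OpenVINO?"}' > my_prompts.jsonl
python benchmark.py -m models/llama-2-7b-chat/ -pf my_prompts.jsonl -n 2

# Hypothetical example: select a device and write a CSV report (requires a supported GPU)
python benchmark.py -m models/llama-2-7b-chat/ -n 2 -d GPU -r report.csv
```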

**Additional options:**
``` bash
python ./benchmark.py -h # for more information
```

#### Benchmarking the Original PyTorch Model:
To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch specified as the framework via the `-f pt` parameter:

```bash
# Download PyTorch Model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
# Benchmark with PyTorch Framework
python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
```

> **Note:** If needed, you can install a specific OpenVINO version using pip:
> ``` bash
> # e.g.
> pip install openvino==2024.4.0
> # Optional, install the openvino nightly package if needed.
> # OpenVINO nightly is pre-release software and has not undergone full release validation or qualification.
> pip uninstall openvino
> pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
> ```
## 4. Benchmark LLM with `torch.compile()`
The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels using a specified backend.

Before benchmarking, download the original PyTorch model locally:
```bash
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
```
To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` and `openvino` (default). Example:

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
```
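For comparison, a sketch of the same run with the native PyTorch backend (only the backend flag changes; the model path assumes the download step above):

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend pytorch
```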

> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:
>
> ```bash
> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
> ```
## 5. Running on 2-Socket Platforms
The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify this behavior.

| OpenVINO Version | Behaviors |
|:--------------------|:------------------------------------------------|
| Before 2024.0.0     | streams.num(1) <br>executes on 2 sockets.        |
| 2024.0.0            | streams.num(1) <br>executes on the same socket as the application is running on. |

For example, passing `--load_config config.json` with the following content results in `streams.num(1)` and execution on 2 sockets:
```json
{
"INFERENCE_NUM_THREADS": <NUMBER>
}
```
`<NUMBER>` is the total number of physical cores across the 2 sockets.
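Alternatively, on Linux you can pin the run to a single socket with `numactl`; a minimal sketch, assuming NUMA node 0 (the node IDs depend on your system topology):

```bash
# Hypothetical example: bind benchmark.py to the CPUs and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 python benchmark.py -m models/llama-2-7b-chat/ -n 2
```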
## 6. Additional Resources

- **Error Troubleshooting:** Check the [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.
1 change: 1 addition & 0 deletions llm_bench/python/convert.py
@@ -1464,6 +1464,7 @@ def main():
add_stateful_model_arguments(parser)

args = parser.parse_args()
log.warning("[DEPRECATED] Not for production use! Please use the 'optimum-intel' to generate the IRs. For details, please check: https://github.com/openvinotoolkit/openvino.genai/blob/master/llm_bench/python/README.md#2-convert-model-to-openvino-ir-format")
log.info(f"openvino runtime version: {get_version()}")
model_type = get_convert_model_type(args.model_id.lower())
converter = converters[model_type]
