
Perf Hints docs and General Opt Guide refactoring #10815

Merged
33 commits
c6e5d6c
Brushed the general optimization page
myshevts Mar 1, 2022
4d3b97b
Opt GUIDE, WIP
myshevts Mar 4, 2022
3acbc10
perf hints doc placeholder
myshevts Mar 4, 2022
c782010
WIP
myshevts Mar 5, 2022
cf9acf8
WIP2
myshevts Mar 5, 2022
8b474d0
WIP 3
myshevts Mar 9, 2022
21c0271
added streams and few other details
myshevts Mar 9, 2022
c0d403d
fixed titles, misprints etc
myshevts Mar 10, 2022
fe9ddc4
Perf hints
myshevts Mar 10, 2022
ae1e581
movin the runtime optimizations intro
myshevts Mar 10, 2022
e2dfcd3
fixed link
myshevts Mar 10, 2022
48ce613
Apply suggestions from code review
myshevts Mar 11, 2022
23a11b8
some details on the FIL and other means when pure inference time is n…
myshevts Mar 14, 2022
9624d27
shuffled according to general->use-case->device-specifics flow, minor…
myshevts Mar 14, 2022
0f2dc93
next iter
myshevts Mar 15, 2022
6dd9dbd
section on optimizing for tput and latency
myshevts Mar 15, 2022
3fd22c4
couple of links to the features support matrix
myshevts Mar 15, 2022
45c8d15
Links, brushing, dedicated subsections for Latency/FIL/Tput
myshevts Mar 15, 2022
c463e47
had to make the link less specific (otherwise docs compilations fails)
myshevts Mar 15, 2022
2223c51
removing the Temp/Should be moved to the Opt Guide
myshevts Mar 15, 2022
7dd51f7
shuffled the tput/latency/etc info into separated documents. also the…
myshevts Mar 16, 2022
0b8b1de
fixed toc for ov_dynamic_shapes.md
myshevts Mar 16, 2022
bbbdda2
referring the openvino_docs_IE_DG_Bfloat16Inference to avoid docs com…
myshevts Mar 16, 2022
9bd2d25
fixed main product TOC, removed ref from the second-level items
myshevts Mar 16, 2022
94d3935
reviewers remarks
myshevts Mar 16, 2022
a77c7e4
reverted the openvino_docs_OV_UG_NoDynamicShapes
myshevts Mar 16, 2022
895f5d5
reverting openvino_docs_IE_DG_Bfloat16Inference and openvino_docs_IE_…
myshevts Mar 16, 2022
76d3b08
"No dynamic shapes" to the "Dynamic shapes" as TOC
myshevts Mar 16, 2022
2537a54
removed duplication
myshevts Mar 16, 2022
5c6d649
minor brushing
myshevts Mar 16, 2022
caa90c5
Caching to the next level in TOC
myshevts Mar 16, 2022
a48210e
brushing
myshevts Mar 16, 2022
6bb649a
more on the perf counters ( for latency and dynamic cases)
myshevts Mar 16, 2022
2 changes: 1 addition & 1 deletion docs/IE_PLUGIN_DG/QuantizedNetworks.md
@@ -9,7 +9,7 @@ For more details about low-precision model representation please refer to this [
During the model load each plugin can interpret quantization rules expressed in *FakeQuantize* operations:
- Independently based on the definition of *FakeQuantize* operation.
- Using a special library of low-precision transformations (LPT) which applies common rules for generic operations,
such as Convolution, Fully-Connected, Eltwise, etc., and translates "fake-quantized" models into the models with low-precision operations. For more information about low-precision flow please refer to the following [document](@ref openvino_docs_IE_DG_Int8Inference).
such as Convolution, Fully-Connected, Eltwise, etc., and translates "fake-quantized" models into the models with low-precision operations. For more information about low-precision flow please refer to the following [document](../OV_Runtime_UG/Int8Inference.md).

Here we provide only a high-level overview of the interpretation rules of FakeQuantize.
At runtime each FakeQuantize can be split into two independent operations: **Quantize** and **Dequantize**.
105 changes: 23 additions & 82 deletions docs/MO_DG/prepare_model/Getting_performance_numbers.md
@@ -9,22 +9,19 @@ When evaluating performance of your model with the OpenVINO Runtime, you must me

- Track separately the operations that happen outside the OpenVINO Runtime, like video decoding.

> **NOTE**: Some image pre-processing can be baked into the IR and accelerated. For more information, refer to [Embedding Preprocessing Computation](Additional_Optimizations.md)
> **NOTE**: Some image pre-processing can be baked into the IR and accelerated accordingly. For more information, refer to [Embedding the Preprocessing](Additional_Optimizations.md). Also consider [_runtime_ preprocessing optimizations](../../optimization_guide/dldt_deployment_optimization_common).
[Review comment, Contributor] this link does not work. Looks like .md is missed at the end


## Tip 2. Getting Credible Performance Numbers

You need to build your performance conclusions on reproducible data. Do the performance measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, you can use an aggregated value for the execution time for final projections:

- If the warm-up run does not help or the execution time still varies, try running a large number of iterations and then averaging the results (see the example run right after this list).
- For time values that range too much, use geomean.
- For time values that range too much, consider geomean.
- Beware of throttling and other power oddities. A device can exist in one of several different power states. When optimizing your model, consider fixing the device frequency for better reproducibility of the performance data. However, the end-to-end (application) benchmarking should also be performed under real operational conditions.
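For illustration, a minimal `benchmark_app` run with a fixed, sufficiently large iteration count (a sketch; the device and `<model>` path are placeholders) could look like this:

```bash
# Sketch: fix the number of iterations so that the run is long enough
# to average out the slower first iterations; <model> is a placeholder.
$ ./benchmark_app -d CPU -m <model> -niter 200
```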

Refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for code examples for the performance measurements. Almost every sample, except interactive demos, has a `-ni` option to specify the number of iterations.
## Tip 3. Measure Reference Performance Numbers with OpenVINO's benchmark_app

## Getting performance numbers using OpenVINO tool

To get performance numbers use our Benchmark app.

[Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample is the best performance reference.
To get performance numbers, use the dedicated [Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample which is the best way to produce the performance reference.
It has a lot of device-specific knobs, but the primary usage is as simple as:
```bash
$ ./benchmark_app -d GPU -m <model> -i <input>
@@ -36,35 +33,25 @@
$ ./benchmark_app -d CPU -m <model> -i <input>
```
to execute on the CPU instead.

For example, for the CPU throughput mode from the previous section, you can play with number of streams (`-nstreams` command-line param).
Try different values of the `-nstreams` argument, from `1` to the number of CPU cores, and find the one that provides the best performance. For example, on an 8-core CPU, compare `-nstreams 1` (which is a latency-oriented scenario) to the `2`, `4` and `8` streams. Notice that `benchmark_app` automatically queries/creates/runs the number of requests required to saturate the given number of streams.
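As an illustration, a simple sweep over several stream counts (a sketch; `<model>` is a placeholder path) might look like:

```bash
# Sketch: compare several -nstreams values on an 8-core CPU and keep the best one.
$ for n in 1 2 4 8; do ./benchmark_app -d CPU -m <model> -nstreams $n; done
```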

Finally, notice that when you don't specify number of streams with `-nstreams`, "AUTO" value for the streams is used, e.g. for the CPU this is [CPU_THROUGHPUT_AUTO](../../OV_Runtime_UG/supported_plugins/CPU.md). You can spot the actual value behind "AUTO" for your machine in the application output.
Notice that the "AUTO" number is not necessarily the most optimal, so it is generally recommended to play either with the benchmark_app's "-nstreams" as described above, or via the [new Workbench tool](@ref workbench_docs_Workbench_DG_Introduction). This allows you to simplify the app logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance.
Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using Async API.
Each of the [OpenVINO supported devices](../../OV_Runtime_UG/supported_plugins/Supported_Devices.md) offers performance settings that have command-line equivalents in the [Benchmark App](../../../samples/cpp/benchmark_app/README.md).
While these settings provide really low-level control and allow leveraging the optimal model performance on the _specific_ device, we suggest always starting the performance evaluation with the [OpenVINO High-Level Performance Hints](../../OV_Runtime_UG/performance_hints.md) first (complete example commands are sketched right after this list):
- benchmark_app **-hint tput** -d 'device' -m 'path to your model'
- benchmark_app **-hint latency** -d 'device' -m 'path to your model'
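For reference, the two hint-driven runs above, written out as complete commands (a sketch; the device and `<model>` path are placeholders):

```bash
# Throughput-oriented run:
$ ./benchmark_app -hint tput -d CPU -m <model>
# Latency-oriented run:
$ ./benchmark_app -hint latency -d CPU -m <model>
```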

## Comparing Performance with Native/Framework Code

When comparing the OpenVINO Runtime performance with the framework or another reference code, make sure that both versions are as similar as possible:

- Wrap exactly the inference execution (refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for examples).
- Wrap exactly the inference execution (refer to the [Benchmark App](../../../samples/cpp/benchmark_app/README.md) for examples).
- Do not include model loading time.
- Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, Caffe\* allows auto-populating the input with random values. Notice that it might give different performance than on real images.
- Similarly, for correct performance comparison, make sure the access pattern, for example, input layouts, is optimal for OpenVINO Runtime (currently, it is NCHW).
- Any user-side pre-processing should be tracked separately.
- Make sure to try the same environment settings that the framework developers recommend, for example, for TensorFlow*. In many cases, things that are more machine friendly, like respecting NUMA (see <a href="#cpu-checklist">CPU Checklist</a>), might work well for the OpenVINO Runtime as well.
- If applicable, use batching.
- If possible, demand the same accuracy. For example, TensorFlow allows `FP16` support, so when comparing to that, make sure to test the OpenVINO Runtime with the `FP16` as well.

## Using Tools <a name="using-tools"></a>

Whether you are tuning for the first time or doing advanced performance optimization, you need a tool that provides accurate insights. Intel&reg; VTune&trade; Amplifier gives you the tools to collect and interpret the profiling data.

Alternatively, you can gather the raw profiling data that the samples report; the second chapter provides an example of how to interpret it.
- Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, beware of random values that can be used to populate the inputs.
- Consider [Image Pre-processing and Conversion](../../OV_Runtime_UG/preprocessing_overview.md), while any user-side pre-processing should be tracked separately.
- When applicable, leverage the [Dynamic Shapes support](../../OV_Runtime_UG/ov_dynamic_shapes.md)
- If possible, demand the same accuracy. For example, TensorFlow allows `FP16` execution, so when comparing to that, make sure to test the OpenVINO Runtime with the `FP16` as well.
[Review comment, Contributor] can / should we refer to inference_precision hint here?


### Internal Inference Performance Counters <a name="performance-counters"></a>

Almost every sample (inspect command-line options for a specific sample with `-h`) supports a `-pc` command that outputs internal execution breakdown. Refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for the actual OpenVINO Runtime API behind that.
## Internal Inference Performance Counters and Execution Graphs <a name="performance-counters"></a>
Further, finer-grained insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs.
Both the [C++](../../../samples/cpp/benchmark_app/README.md) and [Python](../../../tools/benchmark_tool/README.md) versions of the `benchmark_app` support a `-pc` command-line parameter that outputs the internal execution breakdown.
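For example, a run that prints the per-layer counters (a sketch; `<model>` is a placeholder) might be:

```bash
# Sketch: print the per-layer performance counters after the benchmarking run.
$ ./benchmark_app -d CPU -m <model> -pc
```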

Below is an example of the CPU plugin output for a network (since the device is CPU, the layers' wall-clock `realTime` and the `cpu` time are the same):

@@ -76,58 +63,12 @@
fc6_nChw8c_nchw EXECUTED layerType: Reorder realTime: 20
out_fc6 EXECUTED layerType: Output realTime: 3 cpu: 3 execType: unknown
relu5_9_x2 OPTIMIZED_OUT layerType: ReLU realTime: 0 cpu: 0 execType: undef
```
This contains the layers' names (as seen in the IR), the layer types, and the execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into the adjacent convolution.
Both benchmark_app versions also support the "exec_graph_path" command-line option that instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific, [Netron-viewable](https://netron.app/) graph, to the specified file.
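A sketch of such an invocation (the output file name here is arbitrary):

```bash
# Sketch: dump the plugin-specific execution graph for offline inspection, e.g. in Netron.
$ ./benchmark_app -d CPU -m <model> -exec_graph_path exec_graph.xml
```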

This contains layers name (as seen in IR), layers type and execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into adjacent convolution. Also, the `unknown` stands for the Inference Engine specific CPU (helper) primitives that are not part of the Intel MKL-DNN.

Notice that there are some helper layers in the CPU execution breakdown, which are not present in the original topology. These are automatically added by the plugin. For example, the `Reorder` re-packs the Intel MKL-DNN internal (blocked) layout to the regular plain NCHW (that the user expects as the output). As explained in the <a href="#device-specific-tips">Few Device-Specific Tips</a>, if your custom kernels introduce a lot of outstanding/expensive Reorders, consider a blocked implementation for the kernels.

Notice that in the heterogeneous cases, there will be additional information on which subgraph the statistics is about (the first subgraph is GPU, so its `cpu`/host time is really small compared to the actual `realTime`):

```
subgraph1: squeeze1x1 EXECUTED layerType: Convolution realTime: 227 cpu:3 execType: GPU
subgraph2: detection_out EXECUTED layerType: DetectionOutput realTime: 121 cpu:121 execType: unknown
```

As mentioned earlier, `unknown` here means CPU kernel with unknown (for example, not AVX2 or AVX512) acceleration path.
Since FPGA execution does not separate individual kernels, only bulk execution/data transfer statistics is available:

```
subgraph1: 1. input preprocessing (mean data/FPGA):EXECUTED layerType: preprocessing realTime: 129 cpu: 129
subgraph1: 2. input transfer to DDR:EXECUTED layerType: realTime: 201 cpu: 0
subgraph1: 3. FPGA execute time:EXECUTED layerType: realTime: 3808 cpu: 0 subgraph1: 4. output transfer from DDR:EXECUTED layerType: realTime: 55 cpu: 0
subgraph1: 5. FPGA output postprocessing:EXECUTED layerType: realTime: 7 cpu: 7
subgraph1: 6. softmax/copy: EXECUTED layerType: realTime: 2 cpu: 2
subgraph2: out_prob: NOT_RUN layerType: Output realTime: 0 cpu: 0
subgraph2: prob: EXECUTED layerType: SoftMax realTime: 10 cpu: 10
Total time: 4212 microseconds
```

The `softmax/copy` is a glue layer that connects the FPGA subgraph to the CPU subgraph (and copies the data).

### Intel&reg; VTune&trade; Examples <a name="vtune-examples"></a>

All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel&reg; VTune&trade; timelines and aggregations plus correlating them to the underlying APIs, like OpenCL. In turn, this enables careful per-layer execution breakdown.

When choosing the Analysis type in Intel&reg; VTune&trade; Amplifier, make sure to select the **Analyze user tasks, events, and counters** option:

![](vtune_option.png)

See the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-task-analysis) for details.

Example of Inference Engine calls:

- On the Intel VTune Amplifier timeline.
Notice that `Task_runNOThrow` is an Async API wrapper and it is executed in a different thread and triggers the Intel MKL-DNN execution:
Notice that on some devices, the execution graphs/counters may be pretty intrusive overhead-wise.
Also, especially when performance-debugging the [latency case](../../optimization_guide/dldt_deployment_optimization_latency.md), notice that the counters do not reflect the time spent in the plugin/device/driver/etc. queues. If the sum of the counters is too different from the latency of an inference request, consider testing with fewer inference requests. For example, running a single [OpenVINO stream](../../optimization_guide/dldt_deployment_optimization_tput.md) with multiple requests would produce nearly identical counters to running a single inference request, yet the actual latency can be quite different.
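For example (a sketch with a placeholder model path), comparing a single-stream run with one in-flight request against the same run with several requests can reveal such queuing effects:

```bash
# Sketch: same single CPU stream, different number of in-flight requests;
# the per-layer counters should be nearly identical, while the request latency may differ.
$ ./benchmark_app -d CPU -m <model> -nstreams 1 -nireq 1 -pc
$ ./benchmark_app -d CPU -m <model> -nstreams 1 -nireq 4 -pc
```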

![](vtune_timeline.png)

- In the Intel VTune Amplifier **Top-down view**, grouped by the **Task Domain**.
Notice the `Task_runNoThrow` and `MKLDNN _INFER` that are bracketing the actual Intel MKL-DNN kernels execution:

![](vtune_topdown_view.jpg)

Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get general correlation with Inference Engine API as well as the execution breakdown for OpenCL kernels.
Finally, the performance statistics in both the performance counters and the execution graphs are averaged, so such data for [dynamically-shaped inputs](../../OV_Runtime_UG/ov_dynamic_shapes.md) should be measured carefully (ideally by isolating the specific shape and executing it multiple times in a loop, to gather reliable data).
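One way to do that with the `benchmark_app` (a sketch; the shape value is an arbitrary example and the exact `-shape` syntax may vary between versions) is to reshape the model to a single static shape and loop over many iterations:

```bash
# Sketch: pin one concrete input shape and average many iterations of it.
$ ./benchmark_app -d CPU -m <model> -shape "[1,3,224,224]" -niter 100 -pc
```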

Just like with regular native application, further drill down in the counters is possible, however, this is mostly useful for <a href="#optimizing-custom-kernels">optimizing custom kernels</a>. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the [corresponding section in the Intel&reg; VTune&trade; Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-analyze-performance)).
OpenVINO in general, and the individual plugins, are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT), so another option is to compile OpenVINO from the source code with ITT enabled and use tools like the [Intel® VTune™ Profiler](https://software.intel.com/en-us/vtune) to get a detailed inference performance breakdown and additional insights into the application-level performance on the timeline view.
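A sketch of such a build, assuming the `ENABLE_PROFILING_ITT` CMake option of the OpenVINO source tree:

```bash
# Assumption: ENABLE_PROFILING_ITT is the CMake switch that enables the ITT instrumentation.
$ cmake -DENABLE_PROFILING_ITT=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
$ cmake --build . --parallel
```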
15 changes: 15 additions & 0 deletions docs/OV_Runtime_UG/multi_device.md
@@ -112,6 +112,21 @@ The Multi-Device plugin supports FP16 IR files. The CPU plugin automatically upc
### See Also
[Supported Devices](supported_plugins/Supported_Devices.md)

## Performance Considerations for the Multi-Device Execution
This section covers a few recommendations for the multi-device execution (applicable to both Python and C++):
- MULTI usually performs best when the fastest device is specified first in the list of the devices (see the example command right after this list).
This is particularly important when the request-level parallelism is not sufficient
(e.g. the number of requests in flight is not enough to saturate all devices).
- Just like with any throughput-oriented execution, it is highly recommended to query the optimal number of inference requests directly from the instance of the `ov::CompiledModel`.
Please refer to the code of the `benchmark_app`, which exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md), for more details.
- Notice that, for example, the CPU+GPU execution performs better with certain knobs,
which you can find in the code of the same [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample.
One specific example is disabling the GPU driver polling, which in turn requires multiple GPU streams to amortize the slower
communication of the inference completion from the device to the host.
- The multi-device logic always attempts to save on the data copies (e.g. of the inputs) between the device-agnostic, user-facing inference requests
and the device-specific 'worker' requests that are actually scheduled behind the scenes.
To facilitate these copy savings, it is recommended to run the requests in the order in which they were created.
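For illustration, a throughput-oriented `benchmark_app` run on MULTI with the (presumably faster) GPU listed first (a sketch; `<model>` is a placeholder):

```bash
# Sketch: GPU is listed first in the MULTI device list; when -nireq is omitted,
# benchmark_app queries the optimal number of requests from the compiled model.
$ ./benchmark_app -d MULTI:GPU,CPU -m <model> -hint tput
```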

## Introducing the Multi-Device Plugin (Python)

@sphinxdirective
4 changes: 2 additions & 2 deletions docs/OV_Runtime_UG/openvino_intro.md
@@ -16,12 +16,12 @@
openvino_docs_IE_DG_supported_plugins_AUTO
openvino_docs_OV_UG_Running_on_multiple_devices
openvino_docs_OV_UG_Hetero_execution
openvino_docs_OV_UG_Performance_Hints
openvino_docs_OV_UG_Automatic_Batching
openvino_docs_IE_DG_network_state_intro
openvino_docs_OV_Runtime_UG_Python_API_exclusives
openvino_2_0_transition_guide
openvino_docs_OV_Should_be_in_performance


@endsphinxdirective

## Introduction
19 changes: 0 additions & 19 deletions docs/OV_Runtime_UG/openvino_temporary.md

This file was deleted.

11 changes: 10 additions & 1 deletion docs/OV_Runtime_UG/ov_dynamic_shapes.md
@@ -1,10 +1,19 @@
# Dynamic Shapes {#openvino_docs_OV_UG_DynamicShapes}

@sphinxdirective

.. toctree::
:maxdepth: 1
:hidden:

openvino_docs_OV_UG_NoDynamicShapes

@endsphinxdirective

As it was demonstrated in the [Changing Input Shapes](ShapeInference.md) article, there are models that support changing their input shapes before model compilation in `Core::compile_model`.
Reshaping models provides the ability to customize the model input shape to exactly the size required by the end application.
This article explains how the reshaping ability of a model can be further leveraged in more dynamic scenarios.


## When to Apply Dynamic Shapes

Conventional "static" model reshaping works well when it can be done once per many model inference calls with the same shape.