Perf Hints docs and General Opt Guide refactoring #10815

Merged: ilya-lavrenov merged 33 commits into openvinotoolkit:releases/2022/1 from myshevts:perf-hints-docs on Mar 17, 2022.
Changes from all 33 commits (all by myshevts):

- `c6e5d6c` Brushed the general optimization page
- `4d3b97b` Opt GUIDE, WIP
- `3acbc10` perf hints doc placeholder
- `c782010` WIP
- `cf9acf8` WIP2
- `8b474d0` WIP 3
- `21c0271` added streams and few other details
- `c0d403d` fixed titles, misprints etc
- `fe9ddc4` Perf hints
- `ae1e581` movin the runtime optimizations intro
- `e2dfcd3` fixed link
- `48ce613` Apply suggestions from code review
- `23a11b8` some details on the FIL and other means when pure inference time is n…
- `9624d27` shuffled according to general->use-case->device-specifics flow, minor…
- `0f2dc93` next iter
- `6dd9dbd` section on optimizing for tput and latency
- `3fd22c4` couple of links to the features support matrix
- `45c8d15` Links, brushing, dedicated subsections for Latency/FIL/Tput
- `c463e47` had to make the link less specific (otherwise docs compilations fails)
- `2223c51` removing the Temp/Should be moved to the Opt Guide
- `7dd51f7` shuffled the tput/latency/etc info into separated documents. also the…
- `0b8b1de` fixed toc for ov_dynamic_shapes.md
- `bbbdda2` referring the openvino_docs_IE_DG_Bfloat16Inference to avoid docs com…
- `9bd2d25` fixed main product TOC, removed ref from the second-level items
- `94d3935` reviewers remarks
- `a77c7e4` reverted the openvino_docs_OV_UG_NoDynamicShapes
- `895f5d5` reverting openvino_docs_IE_DG_Bfloat16Inference and openvino_docs_IE_…
- `76d3b08` "No dynamic shapes" to the "Dynamic shapes" as TOC
- `2537a54` removed duplication
- `5c6d649` minor brushing
- `caa90c5` Caching to the next level in TOC
- `a48210e` brushing
- `6bb649a` more on the perf counters ( for latency and dynamic cases)
@@ -9,22 +9,19 @@ When evaluating performance of your model with the OpenVINO Runtime, you must me

- Track separately the operations that happen outside the OpenVINO Runtime, like video decoding.

-> **NOTE**: Some image pre-processing can be baked into the IR and accelerated. For more information, refer to [Embedding Preprocessing Computation](Additional_Optimizations.md)
+> **NOTE**: Some image pre-processing can be baked into the IR and accelerated accordingly. For more information, refer to [Embedding the Preprocessing](Additional_Optimizations.md). Also consider [_runtime_ preprocessing optimizations](../../optimization_guide/dldt_deployment_optimization_common).
## Tip 2. Getting Credible Performance Numbers

You need to build your performance conclusions on reproducible data. Do the performance measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, you can use an aggregated value for the execution time for final projections:

- If the warm-up run does not help or execution time still varies, you can try running a large number of iterations and then average or find a mean of the results.
-- For time values that range too much, use geomean.
+- For time values that range too much, consider geomean (see the sketch after this list).
- Beware of the throttling and other power oddities. A device can exist in one of several different power states. When optimizing your model, consider fixing the device frequency for better performance data reproducibility. However, the end-to-end (application) benchmarking should also be performed under real operational conditions.
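For illustration, a geometric mean of repeated timings can be computed with a one-liner like the following (a minimal sketch; the sample values are made up):

```bash
# Geometric mean of example latencies in milliseconds:
# exp(mean(log(x))), computed with awk; values are placeholders.
echo "12.1 11.8 35.0 12.3" | tr ' ' '\n' |
  awk '{ s += log($1); n++ } END { printf "geomean: %.2f ms\n", exp(s / n) }'
```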
-Refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for code examples for the performance measurements. Almost every sample, except interactive demos, has a `-ni` option to specify the number of iterations.
+## Tip 3. Measure Reference Performance Numbers with OpenVINO's benchmark_app

-## Getting performance numbers using OpenVINO tool
-
-To get performance numbers use our Benchmark app.
-
-[Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample is the best performance reference.
+To get performance numbers, use the dedicated [Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample, which is the best way to produce the performance reference.
It has a lot of device-specific knobs, but the primary usage is as simple as:
```bash
$ ./benchmark_app -d GPU -m <model> -i <input>
@@ -36,35 +33,25 @@ $ ./benchmark_app -d CPU -m <model> -i <input>
```
to execute on the CPU instead.

-For example, for the CPU throughput mode from the previous section, you can play with the number of streams (`-nstreams` command-line param).
-Try different values of the `-nstreams` argument from `1` to the number of CPU cores and find the one that provides the best performance. For example, on an 8-core CPU, compare `-nstreams 1` (which is a latency-oriented scenario) to the `2`, `4` and `8` streams. Notice that `benchmark_app` automatically queries/creates/runs the number of requests required to saturate the given number of streams.
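For reference, such a sweep can be scripted in a few lines (a minimal sketch; the model path and the core counts are placeholders):

```bash
# Hypothetical sweep over stream counts on an 8-core CPU;
# model.xml is a placeholder path.
for n in 1 2 4 8; do
  ./benchmark_app -d CPU -m model.xml -nstreams "$n"
done
```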
-Finally, notice that when you don't specify the number of streams with `-nstreams`, the "AUTO" value for the streams is used, e.g. for the CPU this is [CPU_THROUGHPUT_AUTO](../../OV_Runtime_UG/supported_plugins/CPU.md). You can spot the actual value behind "AUTO" for your machine in the application output.
-Notice that the "AUTO" number is not necessarily the most optimal, so it is generally recommended to play either with the benchmark_app's "-nstreams" as described above, or via the [new Workbench tool](@ref workbench_docs_Workbench_DG_Introduction). This allows you to simplify the app logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance.
-Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using the Async API.
+Each of the [OpenVINO supported devices](../../OV_Runtime_UG/supported_plugins/Supported_Devices.md) offers performance settings that have command-line equivalents in the [Benchmark App](../../../samples/cpp/benchmark_app/README.md).
+While these settings provide really low-level control and allow you to leverage the optimal model performance on the _specific_ device, we suggest always starting the performance evaluation with the [OpenVINO High-Level Performance Hints](../../OV_Runtime_UG/performance_hints.md) first:
+- benchmark_app **-hint tput** -d 'device' -m 'path to your model'
+- benchmark_app **-hint latency** -d 'device' -m 'path to your model'
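Spelled out as runnable commands (a sketch; the device and the model path are placeholders):

```bash
# Throughput-oriented run with the high-level performance hint:
./benchmark_app -hint tput -d CPU -m model.xml
# Latency-oriented run:
./benchmark_app -hint latency -d CPU -m model.xml
```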
## Comparing Performance with Native/Framework Code

When comparing the OpenVINO Runtime performance with the framework or another reference code, make sure that both versions are as similar as possible:

-- Wrap exactly the inference execution (refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for examples).
+- Wrap exactly the inference execution (refer to the [Benchmark App](../../../samples/cpp/benchmark_app/README.md) for examples).
- Do not include model loading time.
-- Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, Caffe\* allows auto-populating the input with random values. Notice that it might give different performance than on real images.
-- Similarly, for correct performance comparison, make sure the access pattern, for example, input layouts, is optimal for OpenVINO Runtime (currently, it is NCHW).
-- Any user-side pre-processing should be tracked separately.
-- Make sure to try the same environment settings that the framework developers recommend, for example, for TensorFlow*. In many cases, things that are more machine friendly, like respecting NUMA (see <a href="#cpu-checklist">CPU Checklist</a>), might work well for the OpenVINO Runtime as well.
-- If applicable, use batching.
-- If possible, demand the same accuracy. For example, TensorFlow allows `FP16` support, so when comparing to that, make sure to test the OpenVINO Runtime with the `FP16` as well.
-## Using Tools <a name="using-tools"></a>
-
-Whether you are tuning for the first time or doing advanced performance optimization, you need a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tool to mine and interpret the profiling data.
-
-Alternatively, you can gather the raw profiling data that samples report; the second chapter provides an example of how to interpret these.
+- Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, beware of random values that can be used to populate the inputs.
+- Consider [Image Pre-processing and Conversion](../../OV_Runtime_UG/preprocessing_overview.md), while any user-side pre-processing should be tracked separately.
+- When applicable, leverage the [Dynamic Shapes support](../../OV_Runtime_UG/ov_dynamic_shapes.md).
+- If possible, demand the same accuracy. For example, TensorFlow allows `FP16` execution, so when comparing to that, make sure to test the OpenVINO Runtime with the `FP16` as well.

Review comment: can / should we refer to inference_precision hint here?
-### Internal Inference Performance Counters <a name="performance-counters"></a>
-
-Almost every sample (inspect command-line options for a specific sample with `-h`) supports a `-pc` command that outputs an internal execution breakdown. Refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for the actual OpenVINO Runtime API behind that.
+## Internal Inference Performance Counters and Execution Graphs <a name="performance-counters"></a>
+Further, finer-grained insights into the inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs.
+Both the [C++](../../../samples/cpp/benchmark_app/README.md) and [Python](../../../tools/benchmark_tool/README.md) versions of the `benchmark_app` support a `-pc` command-line parameter that outputs the internal execution breakdown.
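For instance (a sketch; the model path is a placeholder):

```bash
# Print per-layer performance counters at the end of the run;
# model.xml is a placeholder path.
./benchmark_app -d CPU -m model.xml -pc
```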
Below is an example of the CPU plugin output for a network (since the device is CPU, the layers' wall clock `realTime` and the `cpu` time are the same):

@@ -76,58 +63,12 @@ fc6_nChw8c_nchw EXECUTED layerType: Reorder realTime: 20
out_fc6 EXECUTED layerType: Output realTime: 3 cpu: 3 execType: unknown
relu5_9_x2 OPTIMIZED_OUT layerType: ReLU realTime: 0 cpu: 0 execType: undef
```
+This contains the layer names (as seen in the IR), layer types, and execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into an adjacent convolution.

+Both benchmark_app versions also support the "exec_graph_path" command-line option that instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific [Netron-viewable](https://netron.app/) graph to the specified file.
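A sketch of dumping the execution graph (the file names are placeholders):

```bash
# Dump the runtime execution graph for viewing in Netron;
# model.xml and exec_graph.xml are placeholder paths.
./benchmark_app -d CPU -m model.xml -exec_graph_path exec_graph.xml
```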
-This contains the layer names (as seen in the IR), layer types, and execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into an adjacent convolution. Also, the `unknown` stays for the Inference Engine specific CPU (helper) primitives that are not part of the Intel MKL-DNN.
-
-Notice that there are some helper layers in the CPU execution breakdown, which were not presented in the original topology. These are automatically added by the plugin. For example, the `Reorder` re-packs the Intel MKL-DNN internal (blocked) layout to the regular plain NCHW (that the user expects as the output). As explained in the <a href="#device-specific-tips">Few Device-Specific Tips</a>, if your custom kernels introduce a lot of outstanding/expensive Reorders, consider a blocked implementation for the kernels.
-
-Notice that in the heterogeneous cases, there will be additional information on which subgraph the statistics are about (the first subgraph is GPU, so its `cpu`/host time is really small compared to the actual `realTime`):
-
-```
-subgraph1: squeeze1x1 EXECUTED layerType: Convolution realTime: 227 cpu: 3 execType: GPU
-…
-subgraph2: detection_out EXECUTED layerType: DetectionOutput realTime: 121 cpu: 121 execType: unknown
-…
-```
-
-As mentioned earlier, `unknown` here means a CPU kernel with an unknown (for example, not AVX2 or AVX512) acceleration path.
-Since FPGA execution does not separate individual kernels, only bulk execution/data transfer statistics are available:
-
-```
-subgraph1: 1. input preprocessing (mean data/FPGA): EXECUTED layerType: preprocessing realTime: 129 cpu: 129
-subgraph1: 2. input transfer to DDR: EXECUTED layerType: realTime: 201 cpu: 0
-subgraph1: 3. FPGA execute time: EXECUTED layerType: realTime: 3808 cpu: 0
-subgraph1: 4. output transfer from DDR: EXECUTED layerType: realTime: 55 cpu: 0
-subgraph1: 5. FPGA output postprocessing: EXECUTED layerType: realTime: 7 cpu: 7
-subgraph1: 6. softmax/copy: EXECUTED layerType: realTime: 2 cpu: 2
-subgraph2: out_prob: NOT_RUN layerType: Output realTime: 0 cpu: 0
-subgraph2: prob: EXECUTED layerType: SoftMax realTime: 10 cpu: 10
-Total time: 4212 microseconds
-```
-
-The `softmax/copy` is a glue layer that connects the FPGA subgraph to the CPU subgraph (and copies the data).
-### Intel® VTune™ Examples <a name="vtune-examples"></a>
-
-All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations, plus correlating them to the underlying APIs, like OpenCL. In turn, this enables careful per-layer execution breakdown.
-
-When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the **Analyze user tasks, events, and counters** option:
-
-
-
-See the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-task-analysis) for details.
-
-Example of Inference Engine calls:
-
-- On the Intel VTune Amplifier timeline. Notice that `Task_runNOThrow` is an Async API wrapper; it is executed in a different thread and triggers the Intel MKL-DNN execution:
+Notice that on some devices, the execution graphs/counters may be pretty intrusive overhead-wise.
+Also, especially when performance-debugging the [latency case](../../optimization_guide/dldt_deployment_optimization_latency.md), notice that the counters do not reflect the time spent in the plugin/device/driver/etc. queues. If the sum of the counters is too different from the latency of an inference request, consider testing with fewer inference requests. For example, running a single [OpenVINO stream](../../optimization_guide/dldt_deployment_optimization_tput.md) with multiple requests would produce nearly identical counters to running a single inference request, yet the actual latency can be quite different.
-
-
-- In the Intel VTune Amplifier **Top-down view**, grouped by the **Task Domain**. Notice the `Task_runNoThrow` and `MKLDNN _INFER` that are bracketing the actual Intel MKL-DNN kernels execution:
-
-
-
-Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get a general correlation with the Inference Engine API, as well as the execution breakdown for OpenCL kernels.
+Finally, the performance statistics with both performance counters and execution graphs are averaged, so such data for the [dynamically-shaped inputs](../../OV_Runtime_UG/ov_dynamic_shapes.md) should be measured carefully (ideally by isolating the specific shape and executing multiple times in a loop, to gather reliable data).
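One way to isolate a single shape for a dynamically-shaped model (a sketch; the exact shape syntax should be verified against the benchmark_app help, and the model path and shape values are placeholders):

```bash
# Pin one concrete input shape so the averaged counters all refer
# to the same shape; model path and shape values are placeholders.
./benchmark_app -d CPU -m model.xml -shape "[1,3,224,224]" -pc
```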
-Just like with a regular native application, a further drill-down in the counters is possible; however, this is mostly useful for <a href="#optimizing-custom-kernels">optimizing custom kernels</a>. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-analyze-performance)).
+OpenVINO in general and the individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT), so another option is to compile OpenVINO from the source code with ITT enabled and use tools like [Intel® VTune™ Profiler](https://software.intel.com/en-us/vtune) to get a detailed inference performance breakdown and additional insights into the application-level performance on the timeline view.
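A sketch of such a build; `ENABLE_PROFILING_ITT` is assumed to be the relevant CMake option, so verify it against the OpenVINO build documentation:

```bash
# Hypothetical source build with ITT instrumentation enabled;
# verify the option name in the OpenVINO build documentation.
git clone https://github.com/openvinotoolkit/openvino.git
cd openvino && mkdir build && cd build
cmake -DENABLE_PROFILING_ITT=ON ..
make -j"$(nproc)"
```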
Review comment: this link does not work. Looks like `.md` is missed at the end.