From cbfb8a1678117f22954b1469b9ea1f913d92e3a6 Mon Sep 17 00:00:00 2001 From: Maxim Shevtsov Date: Thu, 17 Mar 2022 11:09:13 +0300 Subject: [PATCH] Perf Hints docs and General Opt Guide refactoring (#10815) * Brushed the general optimization page * Opt GUIDE, WIP * perf hints doc placeholder * WIP * WIP2 * WIP 3 * added streams and few other details * fixed titles, misprints etc * Perf hints * movin the runtime optimizations intro * fixed link * Apply suggestions from code review Co-authored-by: Tatiana Savina * some details on the FIL and other means when pure inference time is not the only factor * shuffled according to general->use-case->device-specifics flow, minor brushing * next iter * section on optimizing for tput and latency * couple of links to the features support matrix * Links, brushing, dedicated subsections for Latency/FIL/Tput * had to make the link less specific (otherwise docs compilations fails) * removing the Temp/Should be moved to the Opt Guide * shuffled the tput/latency/etc info into separated documents. also the following docs moved from the temp into specific feature, general product desc or corresponding plugins - openvino_docs_IE_DG_Model_caching_overview - openvino_docs_IE_DG_Int8Inference - openvino_docs_IE_DG_Bfloat16Inference - openvino_docs_OV_UG_NoDynamicShapes * fixed toc for ov_dynamic_shapes.md * referring the openvino_docs_IE_DG_Bfloat16Inference to avoid docs compilation errors * fixed main product TOC, removed ref from the second-level items * reviewers remarks * reverted the openvino_docs_OV_UG_NoDynamicShapes * reverting openvino_docs_IE_DG_Bfloat16Inference and openvino_docs_IE_DG_Int8Inference * "No dynamic shapes" to the "Dynamic shapes" as TOC * removed duplication * minor brushing * Caching to the next level in TOC * brushing * more on the perf counters ( for latency and dynamic cases) Co-authored-by: Tatiana Savina --- docs/IE_PLUGIN_DG/QuantizedNetworks.md | 2 +- .../Getting_performance_numbers.md | 105 ++---- docs/OV_Runtime_UG/multi_device.md | 15 + docs/OV_Runtime_UG/openvino_intro.md | 4 +- docs/OV_Runtime_UG/openvino_temporary.md | 19 -- docs/OV_Runtime_UG/ov_dynamic_shapes.md | 11 +- docs/OV_Runtime_UG/performance_hints.md | 138 ++++++++ docs/OV_Runtime_UG/supported_plugins/CPU.md | 11 +- docs/OV_Runtime_UG/supported_plugins/GPU.md | 19 +- docs/documentation.md | 1 + docs/img/BATCH_device.PNG | 3 + docs/index.rst | 2 +- .../dldt_deployment_optimization_common.md | 51 +++ .../dldt_deployment_optimization_guide.md | 315 ++---------------- ...eployment_optimization_guide_additional.md | 70 ---- .../dldt_deployment_optimization_hints.md | 22 ++ .../dldt_deployment_optimization_latency.md | 35 ++ .../dldt_deployment_optimization_tput.md | 68 ++++ .../dldt_optimization_guide.md | 34 +- .../model_optimization_guide.md | 1 + docs/snippets/dldt_optimization_guide9.cpp | 1 - docs/snippets/ov_auto_batching.cpp | 9 + docs/snippets/ov_auto_batching.py | 8 + 23 files changed, 462 insertions(+), 482 deletions(-) delete mode 100644 docs/OV_Runtime_UG/openvino_temporary.md create mode 100644 docs/OV_Runtime_UG/performance_hints.md create mode 100644 docs/img/BATCH_device.PNG create mode 100644 docs/optimization_guide/dldt_deployment_optimization_common.md delete mode 100644 docs/optimization_guide/dldt_deployment_optimization_guide_additional.md create mode 100644 docs/optimization_guide/dldt_deployment_optimization_hints.md create mode 100644 docs/optimization_guide/dldt_deployment_optimization_latency.md create mode 100644 
docs/optimization_guide/dldt_deployment_optimization_tput.md diff --git a/docs/IE_PLUGIN_DG/QuantizedNetworks.md b/docs/IE_PLUGIN_DG/QuantizedNetworks.md index fb7880b66fce61..0c8ad29c234991 100644 --- a/docs/IE_PLUGIN_DG/QuantizedNetworks.md +++ b/docs/IE_PLUGIN_DG/QuantizedNetworks.md @@ -9,7 +9,7 @@ For more details about low-precision model representation please refer to this [ During the model load each plugin can interpret quantization rules expressed in *FakeQuantize* operations: - Independently based on the definition of *FakeQuantize* operation. - Using a special library of low-precision transformations (LPT) which applies common rules for generic operations, -such as Convolution, Fully-Connected, Eltwise, etc., and translates "fake-quantized" models into the models with low-precision operations. For more information about low-precision flow please refer to the following [document](@ref openvino_docs_IE_DG_Int8Inference). +such as Convolution, Fully-Connected, Eltwise, etc., and translates "fake-quantized" models into the models with low-precision operations. For more information about low-precision flow please refer to the following [document](../OV_Runtime_UG/Int8Inference.md). Here we provide only a high-level overview of the interpretation rules of FakeQuantize. At runtime each FakeQuantize can be split into two independent operations: **Quantize** and **Dequantize**. diff --git a/docs/MO_DG/prepare_model/Getting_performance_numbers.md b/docs/MO_DG/prepare_model/Getting_performance_numbers.md index 08fe7e6606d1c8..be253fc5709629 100644 --- a/docs/MO_DG/prepare_model/Getting_performance_numbers.md +++ b/docs/MO_DG/prepare_model/Getting_performance_numbers.md @@ -9,22 +9,19 @@ When evaluating performance of your model with the OpenVINO Runtime, you must me - Track separately the operations that happen outside the OpenVINO Runtime, like video decoding. -> **NOTE**: Some image pre-processing can be baked into the IR and accelerated. For more information, refer to [Embedding Preprocessing Computation](Additional_Optimizations.md) +> **NOTE**: Some image pre-processing can be baked into the IR and accelerated accordingly. For more information, refer to [Embedding the Preprocessing](Additional_Optimizations.md). Also consider [_runtime_ preprocessing optimizations](../../optimization_guide/dldt_deployment_optimization_common). ## Tip 2. Getting Credible Performance Numbers You need to build your performance conclusions on reproducible data. Do the performance measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, you can use an aggregated value for the execution time for final projections: - If the warm-up run does not help or execution time still varies, you can try running a large number of iterations and then average or find a mean of the results. -- For time values that range too much, use geomean. +- For time values that range too much, consider geomean. +- Beware of the throttling and other power oddities. A device can exist in one of several different power states. When optimizing your model, for better performance data reproducibility consider fixing the device frequency. However the end to end (application) benchmarking should be also performed under real operational conditions. -Refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for code examples for the performance measurements. 
Almost every sample, except interactive demos, has a `-ni` option to specify the number of iterations. +## Tip 3. Measure Reference Performance Numbers with OpenVINO's benchmark_app -## Getting performance numbers using OpenVINO tool - -To get performance numbers use our Benchmark app. - -[Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample is the best performance reference. +To get performance numbers, use the dedicated [Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample which is the best way to produce the performance reference. It has a lot of device-specific knobs, but the primary usage is as simple as: ```bash $ ./benchmark_app –d GPU –m -i @@ -36,35 +33,25 @@ $ ./benchmark_app –d CPU –m -i ``` to execute on the CPU instead. -For example, for the CPU throughput mode from the previous section, you can play with number of streams (`-nstreams` command-line param). -Try different values of the `-nstreams` argument from `1` to a number of CPU cores and find one that provides the best performance. For example, on a 8-core CPU, compare the `-nstreams 1` (which is a latency-oriented scenario) to the `2`, `4` and `8` streams. Notice that `benchmark_app` automatically queries/creates/runs number of requests required to saturate the given number of streams. - -Finally, notice that when you don't specify number of streams with `-nstreams`, "AUTO" value for the streams is used, e.g. for the CPU this is [CPU_THROUGHPUT_AUTO](../../OV_Runtime_UG/supported_plugins/CPU.md). You can spot the actual value behind "AUTO" for your machine in the application output. -Notice that the "AUTO" number is not necessarily most optimal, so it is generally recommended to play either with the benchmark_app's "-nstreams" as described above, or via [new Workbench tool](@ref workbench_docs_Workbench_DG_Introduction).This allows you to simplify the app-logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance. -Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using Async API. +Each of the [OpenVINO supported devices](../../OV_Runtime_UG/supported_plugins/Supported_Devices.md) offers performance settings that have command-line equivalents in the [Benchmark App](../../../samples/cpp/benchmark_app/README.md). +While these settings provide really low-level control and allow to leverage the optimal model performance on the _specific_ device, we suggest always starting the performance evaluation with the [OpenVINO High-Level Performance Hints](../../OV_Runtime_UG/performance_hints.md) first: + - benchmark_app **-hint tput** -d 'device' -m 'path to your model' + - benchmark_app **-hint latency** -d 'device' -m 'path to your model' ## Comparing Performance with Native/Framework Code When comparing the OpenVINO Runtime performance with the framework or another reference code, make sure that both versions are as similar as possible: -- Wrap exactly the inference execution (refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for examples). +- Wrap exactly the inference execution (refer to the [Benchmark App](../../../samples/cpp/benchmark_app/README.md) for examples). - Do not include model loading time. -- Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, Caffe\* allows to auto-populate the input with random values. Notice that it might give different performance than on real images. 
-- Similarly, for correct performance comparison, make sure the access pattern, for example, input layouts, is optimal for OpenVINO Runtime (currently, it is NCHW). -- Any user-side pre-processing should be tracked separately. -- Make sure to try the same environment settings that the framework developers recommend, for example, for TensorFlow*. In many cases, things that are more machine friendly, like respecting NUMA (see CPU Checklist), might work well for the OpenVINO Runtime as well. -- If applicable, use batching. -- If possible, demand the same accuracy. For example, TensorFlow allows `FP16` support, so when comparing to that, make sure to test the OpenVINO Runtime with the `FP16` as well. - -## Using Tools - -Whether you are tuning for the first time or doing advanced performance optimization, you need a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tool to mine it and interpret the profiling data. - -Alternatively, you can gather the raw profiling data that samples report, the second chapter provides example of how to interpret these. +- Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, beware of random values that can be used to populate the inputs. +- Consider [Image Pre-processing and Conversion](../../OV_Runtime_UG/preprocessing_overview.md), while any user-side pre-processing should be tracked separately. +- When applicable, leverage the [Dynamic Shapes support](../../OV_Runtime_UG/ov_dynamic_shapes.md) +- If possible, demand the same accuracy. For example, TensorFlow allows `FP16` execution, so when comparing to that, make sure to test the OpenVINO Runtime with the `FP16` as well. -### Internal Inference Performance Counters - -Almost every sample (inspect command-line options for a specific sample with `-h`) supports a `-pc` command that outputs internal execution breakdown. Refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for the actual OpenVINO Runtime API behind that. +## Internal Inference Performance Counters and Execution Graphs +Further, finer-grained insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs. +Both [C++](../../../samples/cpp/benchmark_app/README.md) and [Python](../../../tools/benchmark_tool/README.md) versions of the `benchmark_app` supports a `-pc` command-line parameter that outputs internal execution breakdown. Below is example of CPU plugin output for a network (since the device is CPU, the layers wall clock `realTime` and the `cpu` time are the same): @@ -76,58 +63,12 @@ fc6_nChw8c_nchw EXECUTED layerType: Reorder realTime: 20 out_fc6 EXECUTED layerType: Output realTime: 3 cpu: 3 execType: unknown relu5_9_x2 OPTIMIZED_OUT layerType: ReLU realTime: 0 cpu: 0 execType: undef ``` +This contains layers name (as seen in IR), layers type and execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into adjacent convolution. +Both benchmark_app versions also support "exec_graph_path" command-line option governing the OpenVINO to output the same per-layer execution statistics, but in the form of the plugin-specific [Netron-viewable](https://netron.app/) graph to the specified file. -This contains layers name (as seen in IR), layers type and execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into adjacent convolution. 
Also, the `unknown` stays for the Inference Engine specific CPU (helper) primitives that are not part of the Intel MKL-DNN. - -Notice that there are some helper layers in the CPU execution breakdown, which were not presented in the original topology. These are automatically added by the plugin. For example, the `Reorder` re-packs the Intel MKL-DNN internal (blocked) layout to the regular plain NCHW (that the user expects as the output). As explained in the Few Device-Specific Tips, if your custom kernels introduces a lot of outstanding/expensive Reorders, consider blocked implementation for the kernels. - -Notice that in the heterogeneous cases, there will be additional information on which subgraph the statistics is about (the first subgraph is GPU, so its `cpu`/host time is really small compared to the actual `realTime`): - -``` -subgraph1: squeeze1x1 EXECUTED layerType: Convolution realTime: 227 cpu:3 execType: GPU -… -subgraph2: detection_out EXECUTED layerType: DetectionOutput realTime: 121 cpu:121 execType: unknown -… -``` - -As mentioned earlier, `unknown` here means CPU kernel with unknown (for example, not AVX2 or AVX512) acceleration path. -Since FPGA execution does not separate individual kernels, only bulk execution/data transfer statistics is available: - -``` -subgraph1: 1. input preprocessing (mean data/FPGA):EXECUTED layerType: preprocessing realTime: 129 cpu: 129 -subgraph1: 2. input transfer to DDR:EXECUTED layerType: realTime: 201 cpu: 0 -subgraph1: 3. FPGA execute time:EXECUTED layerType: realTime: 3808 cpu: 0 subgraph1: 4. output transfer from DDR:EXECUTED layerType: realTime: 55 cpu: 0 -subgraph1: 5. FPGA output postprocessing:EXECUTED layerType: realTime: 7 cpu: 7 -subgraph1: 6. softmax/copy: EXECUTED layerType: realTime: 2 cpu: 2 -subgraph2: out_prob: NOT_RUN layerType: Output realTime: 0 cpu: 0 -subgraph2: prob: EXECUTED layerType: SoftMax realTime: 10 cpu: 10 -Total time: 4212 microseconds -``` - -The `softmax/copy` is a glue layer that connects the FPGA subgraph to the CPU subgraph (and copies the data). - -### Intel® VTune™ Examples - -All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations plus correlating them to the underlying APIs, like OpenCL. In turn, this enables careful per-layer execution breakdown. - -When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the **Analyze user tasks, events, and counters** option: - -![](vtune_option.png) - -See the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-task-analysis) for details. - -Example of Inference Engine calls: - -- On the Intel VTune Amplifier timeline. - Notice that `Task_runNOThrow` is an Async API wrapper and it is executed in a different thread and triggers the Intel MKL-DNN execution: +Notice that on some devices, the execution graphs/counters may be pretty intrusive overhead-wise. +Also, especially when performance-debugging the [latency case](../../optimization_guide/dldt_deployment_optimization_latency.md) notice that the counters do not reflect the time spent in the plugin/device/driver/etc queues. If the sum of the counters is too different from the latency of an inference request, consider testing with less inference requests. 
For example running single [OpenVINO stream](../../optimization_guide/dldt_deployment_optimization_tput.md) with multiple requests would produce nearly identical counters as running single inference request, yet the actual latency can be quite different. - ![](vtune_timeline.png) - -- In the Intel VTune Amplifier **Top-down view**, grouped by the **Task Domain**. - Notice the `Task_runNoThrow` and `MKLDNN _INFER` that are bracketing the actual Intel MKL-DNN kernels execution: - - ![](vtune_topdown_view.jpg) - -Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get general correlation with Inference Engine API as well as the execution breakdown for OpenCL kernels. +Finally, the performance statistics with both performance counters and execution graphs is averaged, so such a data for the [dynamically-shaped inputs](../../OV_Runtime_UG/ov_dynamic_shapes.md) should be measured carefully (ideally by isolating the specific shape and executing multiple times in a loop, to gather the reliable data). -Just like with regular native application, further drill down in the counters is possible, however, this is mostly useful for optimizing custom kernels. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-analyze-performance)). +OpenVINO in general and individual plugins are heavily instrumented with Intel® instrumentation and tracing technology (ITT), so another option is to compile the OpenVINO from the source code with the ITT enabled and using tools like [Intel® VTune™ Profiler](https://software.intel.com/en-us/vtune) to get detailed inference performance breakdown and additional insights in the application-level performance on the timeline view. \ No newline at end of file diff --git a/docs/OV_Runtime_UG/multi_device.md b/docs/OV_Runtime_UG/multi_device.md index 6415b159b08468..3445259711fedf 100644 --- a/docs/OV_Runtime_UG/multi_device.md +++ b/docs/OV_Runtime_UG/multi_device.md @@ -112,6 +112,21 @@ The Multi-Device plugin supports FP16 IR files. The CPU plugin automatically upc ### See Also [Supported Devices](supported_plugins/Supported_Devices.md) +## Performance Considerations for the Multi-Device Execution +This section covers few recommendations for the multi-device execution (applicable for both Python and C++): +- MULTI usually performs best when the fastest device is specified first in the list of the devices. + This is particularly important when the request-level parallelism is not sufficient + (e.g. the number of request in the flight is not enough to saturate all devices). +- Just like with any throughput-oriented execution, it is highly recommended to query the optimal number of inference requests directly from the instance of the `ov:compiled_model`. +Please refer to the code of the `benchmark_app`, that exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md), for more details. +- Notice that for example CPU+GPU execution performs better with certain knobs + which you can find in the code of the same [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample. + One specific example is disabling GPU driver polling, which in turn requires multiple GPU streams to amortize slower + communication of inference completion from the device to the host. +- Multi-device logic always attempts to save on the (e.g. 
inputs) data copies between device-agnostic, user-facing inference requests + and device-specific 'worker' requests that are being actually scheduled behind the scene. + To facilitate the copy savings, it is recommended to run the requests in the order that they were created. + ## Introducing the Multi-Device Plugin (Python) @sphinxdirective diff --git a/docs/OV_Runtime_UG/openvino_intro.md b/docs/OV_Runtime_UG/openvino_intro.md index 1322310f7f0b25..e5864a5f9d67d9 100644 --- a/docs/OV_Runtime_UG/openvino_intro.md +++ b/docs/OV_Runtime_UG/openvino_intro.md @@ -16,12 +16,12 @@ openvino_docs_IE_DG_supported_plugins_AUTO openvino_docs_OV_UG_Running_on_multiple_devices openvino_docs_OV_UG_Hetero_execution + openvino_docs_OV_UG_Performance_Hints openvino_docs_OV_UG_Automatic_Batching openvino_docs_IE_DG_network_state_intro openvino_docs_OV_Runtime_UG_Python_API_exclusives openvino_2_0_transition_guide - openvino_docs_OV_Should_be_in_performance - + @endsphinxdirective ## Introduction diff --git a/docs/OV_Runtime_UG/openvino_temporary.md b/docs/OV_Runtime_UG/openvino_temporary.md deleted file mode 100644 index 203f9dfd1e6377..00000000000000 --- a/docs/OV_Runtime_UG/openvino_temporary.md +++ /dev/null @@ -1,19 +0,0 @@ -# Should be moved to performance / extensibility {#openvino_docs_OV_Should_be_in_performance} - -@sphinxdirective - -.. _deep learning inference engine: - -.. toctree:: - :maxdepth: 1 - :hidden: - - openvino_docs_deployment_optimization_guide_dldt_optimization_guide - openvino_docs_IE_DG_Model_caching_overview - openvino_docs_IE_DG_Int8Inference - openvino_docs_IE_DG_Bfloat16Inference - openvino_docs_OV_UG_NoDynamicShapes - -@endsphinxdirective - -## TEMP: should be moved to performance / extensibility guides diff --git a/docs/OV_Runtime_UG/ov_dynamic_shapes.md b/docs/OV_Runtime_UG/ov_dynamic_shapes.md index 14f5d79f34c39f..0208c2fc9740bb 100644 --- a/docs/OV_Runtime_UG/ov_dynamic_shapes.md +++ b/docs/OV_Runtime_UG/ov_dynamic_shapes.md @@ -1,10 +1,19 @@ # Dynamic Shapes {#openvino_docs_OV_UG_DynamicShapes} +@sphinxdirective + +.. toctree:: + :maxdepth: 1 + :hidden: + + openvino_docs_OV_UG_NoDynamicShapes + +@endsphinxdirective + As it was demonstrated in the [Changing Input Shapes](ShapeInference.md) article, there are models that support changing of input shapes before model compilation in `Core::compile_model`. Reshaping models provides an ability to customize the model input shape for exactly that size that is required in the end application. This article explains how the ability of model to reshape can further be leveraged in more dynamic scenarios. - ## When to Apply Dynamic Shapes Conventional "static" model reshaping works well when it can be done once per many model inference calls with the same shape. diff --git a/docs/OV_Runtime_UG/performance_hints.md b/docs/OV_Runtime_UG/performance_hints.md new file mode 100644 index 00000000000000..5e81921854b5fa --- /dev/null +++ b/docs/OV_Runtime_UG/performance_hints.md @@ -0,0 +1,138 @@ +# High-level Performance Hints {#openvino_docs_OV_UG_Performance_Hints} + +Each of the OpenVINO's [supported devices](supported_plugins/Supported_Devices.md) offers low-level performance settings. Tweaking this detailed configuration requires deep architecture understanding. +Also, while the performance may be optimal for the specific combination of the device and the inferred model, the resulting configuration is not necessarily optimal for another device or model. 
+The OpenVINO performance hints are a new way to configure the performance with _portability_ in mind.
+
+The hints also "reverse" the direction of the configuration: rather than mapping the application needs to the low-level performance settings, and keeping the associated application logic to configure each possible device separately, the idea is to express a target scenario with a single config key and let the *device* configure itself in response.
+As the hints are supported by every OpenVINO device, this is a completely portable and future-proof solution.
+
+Previously, a certain level of automatic configuration came from the _default_ values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores when `ov::streams::AUTO` (`CPU_THROUGHPUT_AUTO` in the pre-OpenVINO 2.0 parlance) was set. However, the resulting number of streams did not account for the actual compute requirements of the model to be inferred.
+The hints, in contrast, respect the actual model, so the parameters for the optimal throughput are calculated for each model individually (based on its compute versus memory bandwidth requirements and the capabilities of the device).
+
+## Performance Hints: Latency and Throughput
+As discussed in the [Optimization Guide](../optimization_guide/dldt_optimization_guide.md), there are a few different metrics associated with inference speed.
+Throughput and latency are some of the most critical factors that influence the overall performance of an application.
+
+This is why, to ease the configuration of the device, OpenVINO offers two dedicated hints, namely `ov::hint::PerformanceMode::THROUGHPUT` and `ov::hint::PerformanceMode::LATENCY`.
+Every OpenVINO device supports these, which makes things portable and future-proof.
+This also allows a performance configuration that is fully compatible with the [automatic device selection](./auto_device_selection.md).
+A special `ov::hint::PerformanceMode::UNDEFINED` acts the same as specifying no hint.
+
+Please also see the last section of this document on conducting the performance measurements with the `benchmark_app`.
+
+Notice that when other performance factors (beyond pure inference time), such as memory footprint and model load/compilation time, are of concern, a typical model may take significantly more time to load with `ov::hint::PerformanceMode::THROUGHPUT` and consume much more memory, compared to `ov::hint::PerformanceMode::LATENCY`.
+
+## Performance Hints: How It Works?
+Internally, every device "translates" the value of the hint to the actual performance settings.
+For example, the `ov::hint::PerformanceMode::THROUGHPUT` selects the number of CPU or GPU streams.
+For the GPU, additionally, the optimal batch size is selected and the [automatic batching](../OV_Runtime_UG/automatic_batching.md) is applied whenever possible (and if the device supports it; [refer to the devices/features support matrix](./supported_plugins/Device_Plugins.md)).
+
+The resulting (device-specific) settings can be queried back from the instance of the `ov::CompiledModel`.
+Notice that the `benchmark_app` outputs the actual settings for the THROUGHPUT hint; see the bottom of the output example:
+
+ ```
+ $benchmark_app -hint tput -d CPU -m 'path to your favorite model'
+ ...
+ [Step 8/11] Setting optimal runtime parameters
+ [ INFO ] Device: CPU
+ [ INFO ] { PERFORMANCE_HINT , THROUGHPUT }
+ ...
+ [ INFO ] { OPTIMAL_NUMBER_OF_INFER_REQUESTS , 4 } + [ INFO ] { NUM_STREAMS , 4 } + ... + ``` + +## Using the Performance Hints: Basic API +In the example code-snippet below the `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for the compile_model: +@sphinxdirective + +.. tab:: C++ + + .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp + :language: cpp + :fragment: [compile_model] + +.. tab:: Python + + .. doxygensnippet:: docs/snippets/ov_auto_batching.py + :language: python + :fragment: [compile_model] + +@endsphinxdirective + +## Additional (Optional) Hints from the App +Let's take an example of an application that processes 4 video streams. The most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4. +As discussed previosly, for the GPU this will limit the batch size, for the CPU - the number of inference streams, so each device uses the `ov::hint::num_requests` while converting the hint to the actual device configuration options: +@sphinxdirective + +.. tab:: C++ + + .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp + :language: cpp + :fragment: [hint_num_requests] + +.. tab:: Python + + .. doxygensnippet:: docs/snippets/ov_auto_batching.py + :language: python + :fragment: [hint_num_requests] + +@endsphinxdirective + +## Optimal Number of Inference Requests +Using the hints assumes that the application queries the `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously: +@sphinxdirective + +.. tab:: C++ + + .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp + :language: cpp + :fragment: [query_optimal_num_requests] + +.. tab:: Python + + .. doxygensnippet:: docs/snippets/ov_auto_batching.py + :language: python + :fragment: [query_optimal_num_requests] + +@endsphinxdirective + +While an application is free to create more requests if needed (for example to support asynchronous inputs population) **it is very important to at least run the `ov::optimal_number_of_infer_requests` of the inference requests in parallel**, for efficiency (device utilization) reasons. + +Also, notice that `ov::hint::PerformanceMode::LATENCY` does not necessarily imply using single inference request. For example, multi-socket CPUs can deliver as high number of requests (at the same minimal latency) as there are NUMA nodes the machine features. +To make your application fully scalable, prefer to query the `ov::optimal_number_of_infer_requests` directly. + +## Prefer Async API +The API of the inference requests offers Sync and Async execution. While the `ov::InferRequest::infer()` is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread), the Async "splits" the `infer()` into `ov::InferRequest::start_async()` and use of the `ov::InferRequest::wait()` (or callbacks). Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md). + Although the Synchronous API can be somewhat easier to start with, in the production code always prefer to use the Asynchronous (callbacks-based) API, as it is the most general and scalable way to implement the flow control for any possible number of requests (and hence both latency and throughput scenarios). 
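Below is a minimal C++ sketch that ties these recommendations together: compiling with the THROUGHPUT hint, querying `ov::optimal_number_of_infer_requests`, and running that many requests through the asynchronous, callback-based API. It is an illustrative outline rather than one of the snippets referenced above; the `"model.xml"` path and the `"CPU"` device are placeholders.

```cpp
#include <openvino/openvino.hpp>

#include <atomic>
#include <iostream>
#include <vector>

int main() {
    ov::Core core;
    // "model.xml" and "CPU" are placeholders for your model and target device.
    auto model = core.read_model("model.xml");
    auto compiled_model = core.compile_model(
        model, "CPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // Ask the device how many requests it wants to run in parallel for this model.
    const uint32_t nireq = compiled_model.get_property(ov::optimal_number_of_infer_requests);

    std::vector<ov::InferRequest> requests;
    std::atomic<uint32_t> completed{0};
    for (uint32_t i = 0; i < nireq; ++i) {
        requests.push_back(compiled_model.create_infer_request());
        // Keep the callback lightweight: record completion and let the app thread do the heavy work.
        requests.back().set_callback([&completed](std::exception_ptr ex) {
            if (!ex)
                ++completed;
        });
    }

    for (auto& request : requests) {
        // Populate the inputs here, e.g. via request.get_tensor(...), before submitting.
        request.start_async();
    }
    for (auto& request : requests) {
        request.wait();  // Drain all in-flight requests before exiting.
    }
    std::cout << completed.load() << " of " << nireq << " requests completed" << std::endl;
}
```

When latency is the target, the same structure applies with `ov::hint::PerformanceMode::LATENCY`; only the number of requests returned by the query changes.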
+ +## Combining the Hints and Individual Low-Level Settings +While sacrificing the portability at a some extent, it is possible to combine the hints with individual device-specific settings. +For example, you can let the device prepare a configuration `ov::hint::PerformanceMode::THROUGHPUT` while overriding any specific value: +@sphinxdirective + +.. tab:: C++ + + .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp + :language: cpp + :fragment: [hint_plus_low_level] + +.. tab:: Python + + .. doxygensnippet:: docs/snippets/ov_auto_batching.py + :language: python + :fragment: [hint_plus_low_level] + + +@endsphinxdirective +## Testing the Performance of The Hints with the Benchmark_App +The `benchmark_app`, that exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the performance of the performance hints for a particular device: + - benchmark_app **-hint tput** -d 'device' -m 'path to your model' + - benchmark_app **-hint latency** -d 'device' -m 'path to your model' +- Disabling the hints to emulate the pre-hints era (highly recommended before trying the individual low-level settings, such as the number of streams as below, threads, etc): +- - benchmark_app **-hint none -nstreams 1** -d 'device' -m 'path to your model' + + +### See Also +[Supported Devices](./supported_plugins/Supported_Devices.md) \ No newline at end of file diff --git a/docs/OV_Runtime_UG/supported_plugins/CPU.md b/docs/OV_Runtime_UG/supported_plugins/CPU.md index 950b46c2decbd8..71c3dab270eafa 100644 --- a/docs/OV_Runtime_UG/supported_plugins/CPU.md +++ b/docs/OV_Runtime_UG/supported_plugins/CPU.md @@ -1,5 +1,14 @@ # CPU device {#openvino_docs_OV_UG_supported_plugins_CPU} +@sphinxdirective + +.. toctree:: + :maxdepth: 1 + :hidden: + + openvino_docs_IE_DG_Bfloat16Inference + +@endsphinxdirective ## Introducing the CPU Plugin The CPU plugin was developed to achieve high performance of neural networks on CPU, using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). @@ -121,7 +130,7 @@ CPU-specific settings: | KEY_CPU_THREADS_NUM | positive integer values| 0 | Specifies the number of threads that CPU plugin should use for inference. Zero (default) means using all (logical) cores| | KEY_CPU_BIND_THREAD | YES/NUMA/NO | YES | Binds inference threads to CPU cores. 'YES' (default) binding option maps threads to cores - this works best for static/synthetic scenarios like benchmarks. The 'NUMA' binding is more relaxed, binding inference threads only to NUMA nodes, leaving further scheduling to specific cores to the OS. This option might perform better in the real-life/contended scenarios. Note that for the latency-oriented cases (number of the streams is less or equal to the number of NUMA nodes, see below) both YES and NUMA options limit number of inference threads to the number of hardware cores (ignoring hyper-threading) on the multi-socket machines. | | KEY_CPU_THROUGHPUT_STREAMS | KEY_CPU_THROUGHPUT_NUMA, KEY_CPU_THROUGHPUT_AUTO, or positive integer values| 1 | Specifies number of CPU "execution" streams for the throughput mode. Upper bound for the number of inference requests that can be executed simultaneously. All available CPU cores are evenly distributed between the streams. The default value is 1, which implies latency-oriented behavior for single NUMA-node machine, with all available cores processing requests one by one. 
On the multi-socket (multiple NUMA nodes) machine, the best latency numbers are usually achieved with the number of streams matching the number of NUMA nodes.
KEY_CPU_THROUGHPUT_NUMA creates as many streams as needed to accommodate NUMA and avoid associated penalties.
KEY_CPU_THROUGHPUT_AUTO creates the bare minimum of streams needed to improve the performance; this is the most portable option if you don't know how many cores your target machine has (and what the optimal number of streams would be). Note that your application should provide enough parallel slack (for example, run many inference requests) to leverage the throughput mode.
Non-negative integer value creates the requested number of streams. If a number of streams is 0, no internal streams are created and user threads are interpreted as stream master threads.| -| KEY_ENFORCE_BF16 | YES/NO| YES | The name for setting to execute in bfloat16 precision whenever it is possible. This option lets plugin know to downscale the precision where it sees performance benefits from bfloat16 execution. Such option does not guarantee accuracy of the network, you need to verify the accuracy in this mode separately, based on performance and accuracy results. It should be your decision whether to use this option or not. | +| KEY_ENFORCE_BF16 | YES/NO| YES | The name for setting to execute in [bfloat16 precision](../Bfloat16Inference.md) whenever it is possible. This option lets plugin know to downscale the precision where it sees performance benefits from bfloat16 execution. Such option does not guarantee accuracy of the network, you need to verify the accuracy in this mode separately, based on performance and accuracy results. It should be your decision whether to use this option or not. | > **NOTE**: To disable all internal threading, use the following set of configuration parameters: `KEY_CPU_THROUGHPUT_STREAMS=0`, `KEY_CPU_THREADS_NUM=1`, `KEY_CPU_BIND_THREAD=NO`. diff --git a/docs/OV_Runtime_UG/supported_plugins/GPU.md b/docs/OV_Runtime_UG/supported_plugins/GPU.md index d7d932d2e8c49a..7099ccc307bd16 100644 --- a/docs/OV_Runtime_UG/supported_plugins/GPU.md +++ b/docs/OV_Runtime_UG/supported_plugins/GPU.md @@ -83,7 +83,7 @@ See [low-precision optimization guide](@ref pot_docs_LowPrecisionOptimizationGui Floating-point precision of a GPU primitive is selected based on operation precision in IR except [compressed f16 IR form](../../MO_DG/prepare_model/FP16_Compression.md) which is executed in f16 precision. -> **NOTE**: Harware acceleration for i8/u8 precision may be unavailable on some platforms. In that case model is executed in floating-point precision taken from IR. Hardware support of u8/i8 acceleration can be queried via `ov::device::capabilities` property. +> **NOTE**: Hardware acceleration for i8/u8 precision may be unavailable on some platforms. In that case model is executed in floating-point precision taken from IR. Hardware support of u8/i8 acceleration can be queried via `ov::device::capabilities` property. [Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out supported data types for all detected devices. @@ -99,8 +99,8 @@ See [Multi-device execution page](../multi_device.md) for more details. ### Automatic batching GPU plugin is capable of reporting `ov::max_batch_size` and `ov::optimal_batch_size` metrics with respect to the current hardware platform and model, -thus automatic batching can be applied in cases when `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` is set -or device is specified as `"BATCH:GPU"`. +thus automatic batching is automatically enabled when `ov::optimal_batch_size` is > 1 and `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` is set. +Alternatively it can be enabled explicitly via the device notion, e.g. `"BATCH:GPU"`. @sphinxdirective @@ -110,7 +110,7 @@ or device is specified as `"BATCH:GPU"`. :language: cpp :fragment: [compile_model_batch_plugin] -.. tab:: Bacthing via throughput hint +.. tab:: Batching via throughput hint .. 
doxygensnippet:: docs/snippets/gpu/compile_model.cpp :language: cpp @@ -215,6 +215,17 @@ Below is the list of such operations: The behavior depends on specific parameters of the operations and hardware configuration. + +## GPU Performance Checklist: Summary +Since the OpenVINO relies on the OpenCL™ kernels for the GPU implementation. Thus, many general OpenCL tips apply: +- Prefer `FP16` inference precision over `FP32`, as the Model Optimizer can generate both variants and the `FP32` is default. Also, consider [int8 inference](../Int8Inference.md) +- Try to group individual infer jobs by using [automatic batching](../automatic_batching.md) +- Consider [caching](../Model_caching_overview.md) to minimize model load time +- If your application is simultaneously using the inference on the CPU or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. You can use [CPU configuration options](./CPU.md) to limit number of inference threads for the CPU plugin. +- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-looped polling for completion. If the _CPU_ utilization is a concern, consider the dedicated referenced in this document. Notice that this option might increase the inference latency, so consider combining with multiple GPU streams or [throughput performance hints](../performance_hints.md). +- When operating media inputs consider [remote tensors API of the GPU Plugin](./GPU_RemoteTensor_API.md). + + ## See Also * [Supported Devices](Supported_Devices.md) * [Optimization guide](@ref openvino_docs_optimization_guide_dldt_optimization_guide) diff --git a/docs/documentation.md b/docs/documentation.md index d5460f3a280ecc..e4a18481222c27 100644 --- a/docs/documentation.md +++ b/docs/documentation.md @@ -29,6 +29,7 @@ openvino_docs_optimization_guide_dldt_optimization_guide openvino_docs_MO_DG_Getting_Performance_Numbers openvino_docs_model_optimization_guide + openvino_docs_deployment_optimization_guide_dldt_optimization_guide openvino_docs_tuning_utilities openvino_docs_performance_benchmarks diff --git a/docs/img/BATCH_device.PNG b/docs/img/BATCH_device.PNG new file mode 100644 index 00000000000000..97245cef2826e2 --- /dev/null +++ b/docs/img/BATCH_device.PNG @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d1461f042894cd61c2812f12ffa566e1723fdd16a1ee8398321e58d309143475 +size 123115 diff --git a/docs/index.rst b/docs/index.rst index bd472b359b7c50..5aa299039aed2e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -78,7 +78,7 @@ OpenVINO™ Documentation

Tune & Optimize

-

Use quantization, pruning, and sparsity algorithms to make your application as efficient as possible.

+

Model-level (e.g. quantization) and runtime-level (i.e. application-level) optimizations to make your inference as fast as possible.

Performance
Benchmarks

diff --git a/docs/optimization_guide/dldt_deployment_optimization_common.md b/docs/optimization_guide/dldt_deployment_optimization_common.md new file mode 100644 index 00000000000000..5844230245d3e6 --- /dev/null +++ b/docs/optimization_guide/dldt_deployment_optimization_common.md @@ -0,0 +1,51 @@ +# General Optimizations {#openvino_docs_deployment_optimization_guide_common} + +## Inputs Pre-processing with OpenVINO + +In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code: +- Model Optimizer can efficiently bake the mean and normalization (scale) values into the model (for example, to the weights of the first convolution). Please see [relevant Model Optimizer command-line options](../MO_DG/prepare_model/Additional_Optimizations.md). +- Let the OpenVINO accelerate other means of [Image Pre-processing and Conversion](../OV_Runtime_UG/preprocessing_overview.md). +- Note that in many cases, you can directly share the (input) data with the OpenVINO, for example consider [remote tensors API of the GPU Plugin](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md). + +## Prefer OpenVINO Async API
+The API of the inference requests offers Sync and Async execution. While the `ov::InferRequest::infer()` is inherently synchronous and executes immediately (effectively serializing the execution flow in the current application thread), the Async "splits" the `infer()` into `ov::InferRequest::start_async()` and `ov::InferRequest::wait()`. Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md). + +A typical use-case for the `ov::InferRequest::infer()` is running a dedicated application thread per source of inputs (e.g. a camera), so that every step (frame capture, processing, results parsing and associated logic) is kept serial within the thread. +In contrast, the `ov::InferRequest::start_async()` and `ov::InferRequest::wait()` allow the application to continue its activities and poll or wait for the inference completion when really needed. So one reason for using asynchronous code is _efficiency_. + +**NOTE**: Although the Synchronous API can be somewhat easier to start with, in the production code always prefer to use the Asynchronous (callbacks-based, below) API, as it is the most general and scalable way to implement the flow control for any possible number of requests (and hence both latency and throughput scenarios). + +Let's see how the OpenVINO Async API can improve overall throughput rate of the application. The key advantage of the Async approach is as follows: while a device is busy with the inference, the application can do other things in parallel (e.g. populating inputs or scheduling other requests) rather than wait for the inference to complete. + +In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests, and while the current is processed, the input frame for the next is being captured. This essentially hides the latency of capturing, so that the overall frame rate is rather determined only by the slowest part of the pipeline (decoding IR inference) and not by the sum of the stages. + +You can compare the pseudo-codes for the regular and async-based approaches: + +- In the regular way, the frame is captured with OpenCV and then immediately processed:
+ +@snippet snippets/dldt_optimization_guide8.cpp part8 + +![Intel® VTune™ screenshot](../img/vtune_regular.png) + +- In the "true" async mode, the `NEXT` request is populated in the main (application) thread, while the `CURRENT` request is processed:
+ +@snippet snippets/dldt_optimization_guide9.cpp part9 + +![Intel® VTune™ screenshot](../img/vtune_async.png) + +The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames or run further inference, like emotion detection on top of the face detection results. +Refer to the [Object Detection С++ Demo](@ref omz_demos_object_detection_demo_cpp), [Object Detection Python Demo](@ref omz_demos_object_detection_demo_python)(latency-oriented Async API showcase) and [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) for complete examples of the Async API in action. + +### Notes on Callbacks +Notice that the Async's `ov::InferRequest::wait()` waits for the specific request only. However, running multiple inference requests in parallel provides no guarantees on the completion order. This may complicate a possible logic based on the `ov::InferRequest::wait`. The most scalable approach is using callbacks (set via the `ov::InferRequest::set_callback`) that are executed upon completion of the request. The callback functions will be used by the OpenVINO runtime to notify on the results (or errors. +This is more event-driven approach. + +Few important points on the callbacks: +- It is the application responsibility to ensure that any callback function is thread-safe +- Although executed asynchronously by a dedicated threads the callbacks should NOT include heavy operations (e.g. I/O) and/or blocking calls. Keep the work done by any callback to a minimum. + +## "get_tensor" Idiom + +`get_tensor` is a recommended way to populate the inference inputs (and read back the outputs), as it internally allocates the data with right padding/alignment for the device. For example, the GPU inputs/outputs tensors are mapped to the host (which is fast) only when the `get_tensor` is used, while for the `set_tensor` a copy into the internal GPU structures may happen. +Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md). +In contrast, the `set_tensor` is a preferable way to handle remote tensors, [for example with the GPU device](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md). diff --git a/docs/optimization_guide/dldt_deployment_optimization_guide.md b/docs/optimization_guide/dldt_deployment_optimization_guide.md index f26c0dd558aced..fe13deb6801823 100644 --- a/docs/optimization_guide/dldt_deployment_optimization_guide.md +++ b/docs/optimization_guide/dldt_deployment_optimization_guide.md @@ -1,303 +1,44 @@ -# Deployment Optimization Guide {#openvino_docs_deployment_optimization_guide_dldt_optimization_guide} +# Runtime Inference Optimizations {#openvino_docs_deployment_optimization_guide_dldt_optimization_guide} @sphinxdirective .. toctree:: :maxdepth: 1 :hidden: - - openvino_docs_deployment_optimization_guide_dldt_optimization_guide_additional + + openvino_docs_deployment_optimization_guide_common + openvino_docs_deployment_optimization_guide_latency + openvino_docs_deployment_optimization_guide_tput + openvino_docs_deployment_optimization_guide_hints @endsphinxdirective -To optimize your performance results during runtime step it is possible to experiment with: +## Deployment Optimizations Overview {#openvino_docs_deployment_optimization_guide_overview} +Runtime or deployment optimizations focus is tuning of the inference parameters (e.g. optimal number of the requests executed simultaneously) and other means of how a model is _executed_. 
-* Preprocess +Here, possible optimization should start with defining the use-case. For example, whether the target scenario emphasizes throughput over latency like processing millions of samples by overnight jobs in the data centers. +In contrast, real-time usages would likely trade off the throughput to deliver the results at minimal latency. +Often this is a combined scenario that targets highest possible throughput while maintaining a specific latency threshold. -* Throughput mode +Each of the [OpenVINO supported devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md) offers low-level performance configuration. This allows to leverage the optimal model performance on the _specific_ device, but may require careful re-tuning when the model or device has changed. +**If the performance portability is of concern, consider using the [OpenVINO High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) first.** -* Async API +Finally, how the full-stack application uses the inference component _end-to-end_ is important. +For example, what are the stages that needs to be orchestrated? In some cases a significant part of the workload time is spent on bringing and preparing the input data. As detailed in the section on the [general optimizations](./dldt_deployment_optimization_common.md), the inputs population can be performed asynchronously to the inference. Also, in many cases the (image) [pre-processing can be offloaded to the OpenVINO](../OV_Runtime_UG/preprocessing_overview.md). For variably-sized inputs, consider [dynamic shapes](../OV_Runtime_UG/ov_dynamic_shapes.md) to efficiently connect the data input pipeline and the model inference. +These are common performance tricks that help both latency and throughput scenarios. -* Lowering inference precision + Similarly, the _model-level_ optimizations like [quantization that unlocks the int8 inference](../OV_Runtime_UG/Int8Inference.md) are general and help any scenario. As referenced in the [performance introduction topic](./dldt_optimization_guide.md), these are covered in the [dedicated document](./model_optimization_guide.md). Additionally, the `ov::hint::inference_precision` allows the devices to trade the accuracy for the performance at the _runtime_ (e.g. by allowing the fp16/bf16 execution for the layers that remain in fp32 after quantization of the original fp32 model). + +Further documents cover the _runtime_ performance optimizations topics. Please also consider [matrix support of the features by the individual devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md). -* Device optimization +[General, application-level optimizations](./dldt_deployment_optimization_common.md): + +* Inputs Pre-processing with the OpenVINO -* Combination of devices +* Async API and 'get_tensor' Idiom -## Preprocess - -### Letting the Inference Engine Accelerate Image Pre-processing/Conversion - -In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code: -- Model Optimizer can efficiently bake the mean and normalization (scale) values into the model (for example, weights of the first convolution). See Model Optimizer Knobs Related to Performance. -- If regular 8-bit per channel images are your native media (for instance, decoded frames), do not convert to the `FP32` on your side, as this is something that plugins can accelerate. Use the `InferenceEngine::Precision::U8` as your input format:
- -@snippet snippets/dldt_optimization_guide1.cpp part1 - -Note that in many cases, you can directly share the (input) data with the Inference Engine. - -## Throughput Mode - -One way to increase computational efficiency is batching, which combines many (potentially tens) of input images to achieve optimal throughput. Internally, the execution resources are split/pinned into execution *streams*. Using this feature gains much better performance for the networks that originally are not scaled well with a number of threads (for example, lightweight topologies). This is especially pronounced for the many-core server machines. - -![](../img/THROUGHPUT.svg) - -Run the Benchmark App and play with number of infer requests running in parallel, next section. Try different values of the -nstreams argument from 1 to a number of CPU cores and find one that provides the best performance. - -The throughput mode relaxes the requirement to saturate the CPU by using a large batch: running multiple independent inference requests in parallel often gives much better performance, than using a batch only. This allows you to simplify the app-logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance. Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using Async API. - -## Inference Engine Async API - -Inference Engine Async API can improve overall frame rate of the application. While accelerator is busy with the inference, the application can continue doing things on the host rather than wait for the inference to complete. - -In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests, and while the current is processed, the input frame for the next is being captured. This essentially hides the latency of capturing, so that the overall frame rate is rather determined only by the slowest part of the pipeline (decoding IR inference) and not by the sum of the stages. - -You can compare the pseudo-codes for the regular and async-based approaches: - -- In the regular way, the frame is captured with OpenCV and then immediately processed:
- -@snippet snippets/dldt_optimization_guide8.cpp part8 - -![Intel® VTune™ screenshot](../img/vtune_regular.png) - -- In the "true" async mode, the `NEXT` request is populated in the main (application) thread, while the `CURRENT` request is processed:
- -@snippet snippets/dldt_optimization_guide9.cpp part9 - -![Intel® VTune™ screenshot](../img/vtune_async.png) - -The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames or run further inference, like emotion detection on top of the face detection results. - -There are important performance caveats though: for example, the tasks that run in parallel should try to avoid oversubscribing the shared compute resources. If the inference is performed on the HDDL and the CPU is essentially idle, it makes sense to do things on the CPU in parallel. However, multiple infer requests can oversubscribe that. Notice that heterogeneous execution can implicitly use the CPU, refer to Heterogeneity. - -Also, if the inference is performed on the graphics processing unit (GPU), it can take little gain to do the encoding, for instance, of the resulting video, on the same GPU in parallel, because the device is already busy. - -Refer to the [Object Detection С++ Demo](@ref omz_demos_object_detection_demo_cpp), [Object Detection Python Demo](@ref omz_demos_object_detection_demo_python)(latency-oriented Async API showcase) and [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) (which has both latency and throughput-oriented modes) for complete examples of the Async API in action. - -### Request-Based API and “GetBlob” Idiom - -Infer Request based API offers two types of request: Sync and Async. The Sync is considered below. The Async splits (synchronous) `Infer` into `StartAsync` and `Wait` (see Inference Engine Async API). - -More importantly, an infer request encapsulates the reference to the “executable” network and actual inputs/outputs. Now, when you load the network to the plugin, you get a reference to the executable network (you may consider that as a queue). Actual infer requests are created by the executable network: - -```sh - -@snippet snippets/dldt_optimization_guide6.cpp part6 -``` - -`GetBlob` is a recommend way to communicate with the network, as it internally allocates the data with right padding/alignment for the device. For example, the GPU inputs/outputs blobs are mapped to the host (which is fast) if the `GetBlob` is used. But if you called the `SetBlob`, the copy (from/to the blob you have set) into the internal GPU plugin structures will happen. - -### Performance Aspects of Running Multiple Requests Simultaneously - -If your application simultaneously executes multiple infer requests: - -- For the CPU, the best solution, you can use the CPU "throughput" mode. -- If latency is of more concern, you can try the `EXCLUSIVE_ASYNC_REQUESTS` [configuration option](../OV_Runtime_UG/supported_plugins/CPU.md) that limits the number of the simultaneously executed requests for all (executable) networks that share the specific device to just one: - -@snippet snippets/dldt_optimization_guide7.cpp part7 - -For more information on the executable networks notation, see Request-Based API and “GetBlob” Idiom. - -- The heterogeneous device uses the `EXCLUSIVE_ASYNC_REQUESTS` by default. - -- `KEY_EXCLUSIVE_ASYNC_REQUESTS` option affects only device queues of the individual application. - -- For GPU, the actual work is serialized by a plugin and/or a driver anyway. - -- Finally, for any VPU flavor, using multiple requests is a must for achieving good throughput. - -In the Inference Engine, there is no notion of requests priorities. 
It is left to the user side (for example, not queuing the low priority infer request, until another higher priority is waiting). Notice that it would require additional logic to synchronize between executable networks (queues) in your application code. - -## Automatic Lowering of the Inference Precision - -Inference precision directly affects the performance. - -Model Optimizer can produce an IR with different precision. For example, an FP16 IR initially targets VPU and GPU devices, while, for example, for the CPU, an FP16 IR is typically up-scaled to the regular FP32 automatically upon loading. But notice that further device-specific inference precision settings are available, -for example, [8-bit integer](../OV_Runtime_UG/Int8Inference.md) or [bfloat16](../OV_Runtime_UG/Bfloat16Inference.md), which is specific to the CPU inference, below. -Note that for the [Multi-Device execution](../OV_Runtime_UG/multi_device.md) that supports automatic inference on multiple devices in parallel, you can use an FP16 IR (no need for FP32). -You can find more information, including preferred data types for specific devices, in the -[Supported Devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) document. - - -By default, plugins enable the optimizations that allow lower precision if the acceptable range of accuracy is preserved. -For example, for the CPU that supports the AVX512_BF16 instructions, an FP16/FP32 model is converted to a [bfloat16](../OV_Runtime_UG/Bfloat16Inference.md) IR to accelerate inference. - -To compare the associated speedup, run the example command below to disable this feature on the CPU device with the AVX512_BF16 support and get regular FP32 execution: - -```sh -$ benchmark_app -m -enforcebf16=false - ``` - -Notice that for quantized (e.g. INT8) models the bfloat16 calculations (of the layers that remain in FP32) is disabled by default. -Refer to the [CPU Plugin documentation](../OV_Runtime_UG/supported_plugins/CPU.md) for more details. - -Similarly, the GPU device automatically executes FP16 for the layers that remain in FP16 in the quantized models (assuming that the FP16 model was quantized). -Refer to the ENABLE_FP16_FOR_QUANTIZED_MODELS key in the [GPU Plugin documentation](../OV_Runtime_UG/supported_plugins/GPU.md). - -## Device Optimizations - -The Inference Engine supports several target devices (CPU, GPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, Intel® Vision Accelerator Design with Intel® Movidius™ Vision Processing Units (VPU)), and each of them has a corresponding plugin. If you want to optimize a specific device, you must keep in mind the following tips to increase the performance. - -### CPU Checklist - -CPU plugin completely relies on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for major primitives acceleration, for example, Convolutions or FullyConnected. - -The only hint you can get from that is how the major primitives are accelerated (and you cannot change this). For example, on the Core machines, you should see variations of the `jit_avx2` when inspecting the internal inference performance counters (and additional '_int8' postfix for [int8 inference](../OV_Runtime_UG/Int8Inference.md)). If you are an advanced user, you can further trace the CPU execution with (see Intel® VTune™). 
- -Internally, the Inference Engine has a threading abstraction level, which allows for compiling the [open source version](https://github.com/opencv/dldt) with either Intel® Threading Building Blocks (Intel® TBB) which is now default, or OpenMP* as an alternative parallelism solution. When using inference on the CPU, this is particularly important to align threading model with the rest of your application (and any third-party libraries that you use) to avoid oversubscription. For more information, see Note on the App-Level Threading section. - - Since R1 2019, the OpenVINO™ toolkit comes pre-compiled with Intel TBB, - so any OpenMP* API or environment settings (like `OMP_NUM_THREADS`) has no effect. - Certain tweaks (like number of threads used for inference on the CPU) are still possible via [CPU configuration options](../OV_Runtime_UG/supported_plugins/CPU.md). - Finally, the OpenVINO CPU inference is NUMA-aware, please refer to the Tips for inference on NUMA systems section. - -Other general recommendations: -- Usually, batching improves CPU performance. However, the need to gather frames in the batch might complicate the application logic. Instead, you can keep a separate infer request per camera or other source of input and process the requests in parallel. For more information, see the next section. -- If your application simultaneously performs inference of multiple models on the same CPU, make sure you do not oversubscribe the machine. See Performance Aspects of Running Multiple Requests Simultaneously for more information. -- Notice that the heterogeneous execution might implicitly load the CPU. For details, refer to the Heterogeneity section. -- Consider [8-bit integer inference on the CPU](../OV_Runtime_UG/Int8Inference.md). - -#### Throughput Mode for CPU -Unlike most accelerators, CPU is perceived as an inherently latency-oriented device. -In fact, the OpenVINO does support the "throughput" mode for the CPU, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the overall throughput. - -Internally, the execution resources are split/pinned into execution "streams". -This feature usually provides much better performance for the networks than batching. This is especially true for the many-core server machines: -![](../img/cpu_streams_explained_1.png) - -Compared with the batching, the parallelism is somewhat transposed (i.e. performed over inputs, and much less within CNN ops): -![](../img/cpu_streams_explained.png) - -Try the [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample and play with number of streams running in parallel. The rule of thumb is tying up to a number of CPU cores on your machine. -For example, on an 8-core CPU, compare the `-nstreams 1` (which is a legacy, latency-oriented scenario) to the 2, 4, and 8 streams. - -In addition, you can play with the batch size to find the throughput sweet spot. - -If your application is hard or impossible to change in accordance with the multiple-requests logic, consider the "multiple-instance" trick to improve the throughput: -- For multi-socket execution, it is recommended to set [`KEY_CPU_THREADS_NUM`](../OV_Runtime_UG/supported_plugins/CPU.md) to the number of cores per socket, and run as many instances of the application as you have sockets. 
-- Similarly, for extremely lightweight networks (running faster than 1ms) and/or many-core machines (16+ cores), try limiting the number of CPU inference threads to just `#‍phys` cores and further, while trying to saturate the machine with running multiple instances of the application. - -### GPU Checklist - -Inference Engine relies on the [Compute Library for Deep Neural Networks (clDNN)](https://01.org/cldnn) for Convolutional Neural Networks acceleration on Intel® GPUs. Internally, clDNN uses OpenCL™ to implement the kernels. Thus, many general tips apply: - -- Prefer `FP16` over `FP32`, as the Model Optimizer can generate both variants and the `FP32` is default. -- Try to group individual infer jobs by using batches. -- Notice that using the GPU introduces one-time overhead (order of few seconds) of compiling the OpenCL kernels. The compilation happens upon loading the network to the GPU plugin and does not affect the inference time. -- If your application is simultaneously using the inference on the CPU or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. You can use [CPU configuration options](../OV_Runtime_UG/supported_plugins/CPU.md) to limit number of inference threads for the CPU plugin. -- In the GPU-only scenario, a GPU driver might occupy a CPU core with spin-looped polling for completion. If the _CPU_ utilization is a concern, consider the `KEY_CLDND_PLUGIN_THROTTLE` configuration option. - -> **NOTE**: See the [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) code for a usage example. -Notice that while disabling the polling, this option might reduce the GPU performance, so usually this option is used with multiple [GPU streams](../OV_Runtime_UG/supported_plugins/GPU.md). - - -### Intel® Movidius™ Myriad™ X Visual Processing Unit and Intel® Vision Accelerator Design with Intel® Movidius™ VPUs - -Since Intel® Movidius™ Myriad™ X Visual Processing Unit (Intel® Movidius™ Myriad™ 2 VPU) communicates with the host over USB, minimum four infer requests in flight are recommended to hide the data transfer costs. See Request-Based API and “GetBlob” Idiom and [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) for more information. - -Intel® Vision Accelerator Design with Intel® Movidius™ VPUs requires to keep at least 32 inference requests in flight to fully saturate the device. - -## Heterogeneity - -Heterogeneous execution (constituted by the dedicated Inference Engine [“Hetero” device](../OV_Runtime_UG/hetero_execution.md)) enables to schedule a network inference to the multiple devices. - -### Typical Heterogeneous Scenarios of Concern - -The primary points for executing a network in heterogeneous mode are as follows: - -- Calculate the heaviest pieces of the network with an accelerator while falling back to the CPU for the layers that are not supported by the accelerator.
- This is particularly useful when certain custom (user) kernels are implemented only for the CPU (and much harder or even impossible to implement for the accelerator). - -- Use all available compute devices more efficiently, for example, by running branches of the network on the different devices. - -### Heterogeneous Flow - -The execution through heterogeneous plugin has three distinct steps: - -1. **Applying affinity setting for the layers**, that is, binding them to the devices. - - - This can be done automatically using *fallback priorities*, or on the *per-layer* basis. - - - The affinity setting is made before loading the network to the (heterogeneous) plugin, so this is always a **static** setup with respect to execution. - -2. **Loading a network to the heterogeneous plugin**, which internally splits the network into subgraphs.
- You can check the decisions the plugin makes, see Analysing the Heterogeneous Execution. - -3. **Executing the infer requests**. From user’s side, this looks identical to a single-device case, while internally, the subgraphs are executed by actual plugins/devices. - -Performance benefits of the heterogeneous execution depend heavily on the communications granularity between devices. If transmitting/converting data from one part device to another takes more time than the execution, the heterogeneous approach makes little or no sense. Using Intel® VTune™ helps to visualize the execution flow on a timeline (see Intel® VTune™ Examples). - -Similarly, if there are too much subgraphs, the synchronization and data transfers might eat the entire performance. In some cases, you can define the (coarser) affinity manually to avoid sending data back and forth many times during one inference. - -The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and "glue" or helper kernels on the CPU. Notice that this includes the granularity considerations. For example, running some custom activation (that comes after every accelerator-equipped convolution) on the CPU might result in performance degradation due to too much data type and/or layout conversions, even though the activation itself can be extremely fast. In this case, it might make sense to consider implementing the kernel for the accelerator (see Optimizing Custom Kernels). The conversions typically manifest themselves as outstanding (comparing to CPU-only execution) 'Reorder' entries (see Internal Inference Performance Counters). - -For general details on the heterogeneous mode, refer to the [Heterogeneous execution guide](../OV_Runtime_UG/hetero_execution.md). - -### Trying the Heterogeneous Plugin with Inference Engine Samples - -Every Inference Engine sample supports the `-d` (device) option. - -For example, here is a command to run an [Classification Sample Async](../../samples/cpp/classification_sample_async/README.md): - -```sh -./classification_sample_async -m /Model.xml -i /picture.jpg -d HETERO:GPU,CPU -``` - -where: - -- `HETERO` stands for Heterogeneous plugin. -- `GPU,CPU` points to fallback policy with first priority on GPU and further fallback to CPU. - -You can point more than two devices: `-d HETERO:HDDL,GPU,CPU`. - -### General Tips on GPU/CPU Execution - -The following tips are provided to give general guidance on optimizing execution on GPU/CPU devices. - -- Generally, GPU performance is better on heavy kernels (like Convolutions) and large inputs. So if the network inference time is already too small (~1ms of execution time), using the GPU would unlikely give a boost. - -- A typical strategy to start with is to test the CPU-only and GPU-only scenarios first (with samples this is plain `-d CPU` or `-d GPU`). If there are specific kernels that are not supported by the GPU, the best option to try is the `HETERO:GPU,CPU` that automatically applies default splitting (based on the plugins layers support). Then, you can play with the manual affinity settings (for example, to further minimize the number of subgraphs). - -- The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and "glue" (or helper) kernels on the CPU. Notice that this includes the granularity considerations. For example, running some (custom) activation on the CPU would result in too many conversions. 
- -- It is advised to do performance analysis to determine “hotspot” kernels, which should be the first candidates for offloading. At the same time, it is often more efficient to offload some reasonably sized sequence of kernels, rather than individual kernels, to minimize scheduling and other run-time overheads. - -- Notice that GPU can be busy with other tasks (like rendering). Similarly, the CPU can be in charge for the general OS routines and other application threads (see Note on the App-Level Threading). Also, a high interrupt rate due to many subgraphs can raise the frequency of the one device and drag the frequency of another down. - -- Device performance can be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices stopping use of the Intel® Turbo Boost Technology. This might result in overall performance decrease, even comparing to single-device scenario. - -- Mixing the `FP16` (GPU) and `FP32` (CPU) execution results in conversions and, thus, performance issues. If you are seeing a lot of heavy outstanding (compared to the CPU-only execution) Reorders, consider implementing actual GPU kernels. Refer to Internal Inference Performance Counters for more information. - -### Analyzing Heterogeneous Execution - -There is a dedicated configuration option that enables dumping the visualization of the subgraphs created by the heterogeneous mode, please see code example in the [Heterogeneous execution guide](../OV_Runtime_UG/hetero_execution.md) - -After enabling the configuration key, the heterogeneous plugin generates two files: - -- `hetero_affinity.dot` - per-layer affinities. This file is generated only if default fallback policy was executed (as otherwise you have set the affinities by yourself, so you know them). -- `hetero_subgraphs.dot` - affinities per sub-graph. This file is written to the disk during execution of `Core::LoadNetwork` for the heterogeneous flow. - -You can use GraphViz\* utility or `.dot` converters (for example, to `.png` or `.pdf`), like xdot\*, available on Linux\* OS with `sudo apt-get install xdot`. - -You can also use performance data (in the [Benchmark App](../../samples/cpp/benchmark_app/README.md), it is an option `-pc`) to get performance data on each subgraph. Again, refer to the [Heterogeneous execution guide](../OV_Runtime_UG/hetero_execution.md) and to Internal Inference Performance Counters for a general counters information. - -## Multi-Device Execution -OpenVINO™ toolkit supports automatic multi-device execution, please see [Multi-Device execution](../OV_Runtime_UG/multi_device.md) description. -In the next chapter you can find the device-specific tips, while this section covers few recommendations -for the multi-device execution: -- MULTI usually performs best when the fastest device is specified first in the list of the devices. - This is particularly important when the parallelism is not sufficient - (e.g. the number of request in the flight is not enough to saturate all devices). -- It is highly recommended to query the optimal number of inference requests directly from the instance of the ExecutionNetwork - (resulted from the LoadNetwork call with the specific multi-device configuration as a parameter). -Please refer to the code of the [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample for details. 
-- Notice that for example CPU+GPU execution performs better with certain knobs - which you can find in the code of the same [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample. - One specific example is disabling GPU driver polling, which in turn requires multiple GPU streams (which is already a default for the GPU) to amortize slower - inference completion from the device to the host. -- Multi-device logic always attempts to save on the (e.g. inputs) data copies between device-agnostic, user-facing inference requests - and device-specific 'worker' requests that are being actually scheduled behind the scene. - To facilitate the copy savings, it is recommended to start the requests in the order that they were created - (with ExecutableNetwork's CreateInferRequest). - -Refer to [Deployment Optimization Guide Additional Configurations](dldt_deployment_optimization_guide_additional.md) to read more about performance during deployment step and learn about threading, working with multi-socket CPUs and Basic Interoperability with Other APIs. +Use-case specific optimizations along with some implementation details: + +* Optimizing for [throughput](./dldt_deployment_optimization_tput.md) and [latency](./dldt_deployment_optimization_latency.md) + +* [OpenVINO's high-level performance hints](./dldt_deployment_optimization_hints.md) as the portable, future-proof approach for performance configuration diff --git a/docs/optimization_guide/dldt_deployment_optimization_guide_additional.md b/docs/optimization_guide/dldt_deployment_optimization_guide_additional.md deleted file mode 100644 index fd70c080d6119a..00000000000000 --- a/docs/optimization_guide/dldt_deployment_optimization_guide_additional.md +++ /dev/null @@ -1,70 +0,0 @@ -# Deployment Optimization Guide Additional Configurations {#openvino_docs_deployment_optimization_guide_dldt_optimization_guide_additional} - -To optimize your performance results during runtime step, you can experiment with: - -* multi socket CPUs - -* threading - -* Basic Interoperability with Other APIs - - -## Best Latency on the Multi-Socket CPUs -Note that when latency is of concern, there are additional tips for multi-socket systems. -When input is limited to the single image, the only way to achieve the best latency is to limit execution to the single socket. -The reason is that single image is simply not enough -to saturate more than one socket. Also NUMA overheads might dominate the execution time. -Below is the example command line that limits the execution to the single socket using numactl for the best *latency* value -(assuming the machine with 28 phys cores per socket): -``` -limited to the single socket). -$ numactl -m 0 --physcpubind 0-27 benchmark_app -m -api sync -nthreads 28 - ``` -Note that if you have more than one input, running as many inference requests as you have NUMA nodes (or sockets) -usually gives the same best latency as a single request on the single socket, but much higher throughput. Assuming two NUMA nodes machine: -``` -$ benchmark_app -m -nstreams 2 - ``` -Number of NUMA nodes on the machine can be queried via 'lscpu'. -Please see more on the NUMA support in the [Optimization Guide](../OV_Runtime_UG/multi_device.md). - - - ## Threading - - - As explained in the CPU Checklist section, by default the Inference Engine uses Intel TBB as a parallel engine. Thus, any OpenVINO-internal threading (including CPU inference) uses the same threads pool, provided by the TBB. 
But there are also other threads in your application, so oversubscription is possible at the application level: -- The rule of thumb is that you should try to have the overall number of active threads in your application equal to the number of cores in your machine. Keep in mind the spare core(s) that the OpenCL driver under the GPU plugin might also need. -- One specific workaround to limit the number of threads for the Inference Engine is using the [CPU configuration options](../OV_Runtime_UG/supported_plugins/CPU.md). -- To avoid further oversubscription, use the same threading model in all modules/libraries that your application uses. Notice that third party components might bring their own threading. For example, using Inference Engine which is now compiled with the TBB by default might lead to [performance troubles](https://www.threadingbuildingblocks.org/docs/help/reference/appendices/known_issues/interoperability.html) when mixed in the same app with another computationally-intensive library, but compiled with OpenMP. You can try to compile the [open source version](https://github.com/opencv/dldt) of the Inference Engine to use the OpenMP as well. But notice that in general, the TBB offers much better composability, than other threading solutions. -- If your code (or third party libraries) uses GNU OpenMP, the Intel® OpenMP (if you have recompiled Inference Engine with that) must be initialized first. This can be achieved by linking your application with the Intel OpenMP instead of GNU OpenMP, or using `LD_PRELOAD` on Linux* OS. - -## Basic Interoperability with Other APIs - -The general approach for sharing data between Inference Engine and media/graphics APIs like Intel® Media Server Studio (Intel® MSS) is based on sharing the *system* memory. That is, in your code, you should map or copy the data from the API to the CPU address space first. - -For Intel MSS, it is recommended to perform a viable pre-processing, for example, crop/resize, and then convert to RGB again with the [Video Processing Procedures (VPP)](https://software.intel.com/en-us/node/696108). Then lock the result and create an Inference Engine blob on top of that. The resulting pointer can be used for the `SetBlob`: - -@snippet snippets/dldt_optimization_guide2.cpp part2 - -**WARNING**: The `InferenceEngine::NHWC` layout is not supported natively by most InferenceEngine plugins so internal conversion might happen. - -@snippet snippets/dldt_optimization_guide3.cpp part3 - -Alternatively, you can use RGBP (planar RGB) output from Intel MSS. This allows to wrap the (locked) result as regular NCHW which is generally friendly for most plugins (unlike NHWC). Then you can use it with `SetBlob` just like in previous example: - -@snippet snippets/dldt_optimization_guide4.cpp part4 - -The only downside of this approach is that VPP conversion to RGBP is not hardware accelerated (and performed on the GPU EUs). Also, it is available only on LInux. - -## OpenCV* Interoperability Example - -Unlike APIs that use dedicated address space and/or special data layouts (for instance, compressed OpenGL* textures), regular OpenCV data objects like `cv::Mat` reside in the conventional system memory. That is, the memory can be actually shared with the Inference Engine and only data ownership to be transferred. - -Again, if the OpenCV and Inference Engine layouts match, the data can be wrapped as Inference Engine (input/output) blob. 
Notice that by default, Inference Engine accepts the **planar** and **not interleaved** inputs in NCHW, so the NHWC (which is exactly the interleaved layout) should be specified explicitly:
-
-**WARNING**: The `InferenceEngine::NHWC` layout is not supported natively by most InferenceEngine plugins so internal conversion might happen.
-
-@snippet snippets/dldt_optimization_guide5.cpp part5
-
-Notice that original `cv::Mat`/blobs cannot be used simultaneously by the application and the Inference Engine. Alternatively, the data that the pointer references to can be copied to unlock the original data and return ownership to the original API.
-
-To learn more about optimizations during developing step, visit [Deployment Optimization Guide](dldt_deployment_optimization_guide.md) page.
diff --git a/docs/optimization_guide/dldt_deployment_optimization_hints.md b/docs/optimization_guide/dldt_deployment_optimization_hints.md
new file mode 100644
index 00000000000000..c06cfc4caa2e75
--- /dev/null
+++ b/docs/optimization_guide/dldt_deployment_optimization_hints.md
@@ -0,0 +1,22 @@
+# High-level Performance Hints (Presets) {#openvino_docs_deployment_optimization_guide_hints}
+
+Traditionally, each of the OpenVINO [supported devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) offers a number of low-level performance settings.
+Tweaking this detailed configuration requires a deep understanding of the device architecture.
+Also, while the resulting performance may be optimal for the specific combination of the device and the model being inferred, it is neither device- nor model-portable, nor future-proof:
+- Even within a family of devices (like various CPUs), a different number of CPU cores eventually makes a different execution configuration optimal.
+- Similarly, the optimal batch size is very much specific to the particular instance of the GPU.
+- Compute vs. memory-bandwidth requirements for the model, as well as the inference precision, possible model quantization and other factors, add more unknowns to the resulting performance equation.
+- Finally, the optimal execution parameters of one device do not transparently map to another device type. For example:
+   - Both the CPU and GPU devices support the notion of 'streams' (i.e. inference instances executed in parallel, see `ov::num_streams`), yet the optimal number of streams is deduced very differently.
+
+Beyond the execution _parameters_, there are many device-specific details like _scheduling_ that greatly affect the performance.
+Specifically, GPU-oriented optimizations like batching, which combines many (potentially tens of) inputs to achieve optimal throughput, do not always map well to the CPU, as detailed in the next sections.
+The hints hide the _execution_ specifics required to saturate the device. For example, there is no need to explicitly combine multiple inputs into a batch to achieve good GPU performance.
+Instead, it is possible to keep a separate infer request per camera or other source of input and process the requests in parallel using the OpenVINO Async API.
+
+The only requirement for the application to leverage the throughput is to **run multiple inference requests in parallel**.
+OpenVINO's device-specific implementation of the hints takes care of the rest. This greatly simplifies the application logic.
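+
+Below is a minimal C++ sketch of this pattern: the throughput hint plus several asynchronous requests. The model path, the device name and the omitted input handling are illustrative assumptions rather than a prescription:
+
+```cpp
+#include <openvino/openvino.hpp>
+#include <vector>
+
+int main() {
+    ov::Core core;
+    // Let the device derive its own throughput-oriented configuration (streams, batching, etc.)
+    auto compiled_model = core.compile_model("model.xml", "GPU",  // illustrative path and device
+        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
+
+    // Ask the device how many requests it can usefully run in parallel
+    auto nireq = compiled_model.get_property(ov::optimal_number_of_infer_requests);
+
+    std::vector<ov::InferRequest> requests;
+    for (uint32_t i = 0; i < nireq; ++i)
+        requests.push_back(compiled_model.create_infer_request());
+
+    // Run the requests asynchronously (inputs population is omitted for brevity)
+    for (auto& request : requests)
+        request.start_async();
+    for (auto& request : requests)
+        request.wait();
+    return 0;
+}
+```
+Under the throughput hint, the number of streams (and, for devices like the GPU, the batching) is selected by the device implementation, so the application code stays the same across devices.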
+
+In summary, when the performance _portability_ is of concern, consider the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md).
+Below you can find the implementation details (particularly how OpenVINO implements the 'throughput' approach) for the specific devices.
+Keep in mind that while different throughput-oriented scheduling approaches ([like batching or other means of executing individual inference requests](./dldt_deployment_optimization_tput.md)) can work together, the hints make these decisions transparent to the application.
\ No newline at end of file
diff --git a/docs/optimization_guide/dldt_deployment_optimization_latency.md b/docs/optimization_guide/dldt_deployment_optimization_latency.md
new file mode 100644
index 00000000000000..cf75edc6bc1598
--- /dev/null
+++ b/docs/optimization_guide/dldt_deployment_optimization_latency.md
@@ -0,0 +1,35 @@
+# Optimizing for Latency {#openvino_docs_deployment_optimization_guide_latency}
+
+@sphinxdirective
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   openvino_docs_IE_DG_Model_caching_overview
+
+@endsphinxdirective
+
+## Latency Specifics
+A significant fraction of applications is focused on situations where typically a single model is loaded (and a single input is used) at a time.
+This is a regular "consumer" use case and the default (also for legacy reasons) performance setup for any OpenVINO device.
+Notice that an application can create more than one request if needed (for example, to support asynchronous inputs population); the question is really how many requests are executed in parallel.
+
+Similarly, when multiple models are served on the same device, it is important whether the models are executed simultaneously or in a chain (for example, in an inference pipeline).
+As expected, the lowest latency is achieved with only one concurrent inference at a time; any additional concurrency usually makes the latency grow quickly.
+
+However, some configurations, like multi-socket CPUs, can deliver as many requests (at the same minimal latency) as there are NUMA nodes in the machine.
+Thus, human expertise is required to get the most out of the device even in the latency case; consider using the [OpenVINO high-level performance hints](../OV_Runtime_UG/performance_hints.md) instead.
+
+> **NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) are the recommended way of configuring the performance, being both device-agnostic and future-proof.
+
+When there are multiple models to be used simultaneously, consider running them on different devices. Finally, when multiple models are executed in parallel on one device, the `ov::hint::model_priority` property may help to define the relative priorities of the models (refer to the [feature support matrix for OpenVINO devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md) to check whether a specific device supports the feature).
+
+## First-Inference Latency and Model Load/Compile Time
+There are cases when model loading/compilation contributes heavily to the end-to-end latency.
+For example, when the model is used exactly once, or when, due to on-device memory limitations, the model is unloaded (to free the memory for another inference) and reloaded at some cadence.
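+
+A simple way to see whether this matters for your model is to time the compilation and the first inference separately from the steady-state inference; below is a rough C++ sketch of such a measurement (the model path and the device name are illustrative assumptions):
+
+```cpp
+#include <chrono>
+#include <iostream>
+#include <openvino/openvino.hpp>
+
+int main() {
+    using clk = std::chrono::steady_clock;
+    ov::Core core;
+
+    auto t0 = clk::now();
+    auto compiled_model = core.compile_model("model.xml", "GPU");  // illustrative path and device
+    auto t1 = clk::now();
+
+    auto request = compiled_model.create_infer_request();
+    request.infer();  // first inference, may include extra initialization
+    auto t2 = clk::now();
+    request.infer();  // steady-state inference
+    auto t3 = clk::now();
+
+    auto ms = [](auto d) { return std::chrono::duration_cast<std::chrono::milliseconds>(d).count(); };
+    std::cout << "compile: " << ms(t1 - t0) << " ms, first infer: " << ms(t2 - t1)
+              << " ms, next infer: " << ms(t3 - t2) << " ms" << std::endl;
+    return 0;
+}
+```
+If the compile and first-inference numbers dominate, the mitigations discussed below (model caching, the AUTO device) are worth considering.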
+ +Such a "first-inference latency" scenario however may pose an additional limitation on the model load\compilation time, as inference accelerators (other than the CPU) usually require certain level of model compilation upon loading. +The [model caching](../OV_Runtime_UG/Model_caching_overview.md) is a way to amortize the loading/compilation time over multiple application runs. If the model caching is not possible (as e.g. it requires write permissions for the applications), the CPU device is almost exclusively offers the fastest model load time. Also, consider using the [AUTO device](../OV_Runtime_UG/auto_device_selection.md). It allows to transparently use the CPU for inference, while the actual accelerator loads the model (upon that, the inference hot-swapping also happens automatically). + +Finally, notice that any [throughput-oriented options](./dldt_deployment_optimization_tput.md) may increase the model up time significantly. diff --git a/docs/optimization_guide/dldt_deployment_optimization_tput.md b/docs/optimization_guide/dldt_deployment_optimization_tput.md new file mode 100644 index 00000000000000..5fdfe20bc578a7 --- /dev/null +++ b/docs/optimization_guide/dldt_deployment_optimization_tput.md @@ -0,0 +1,68 @@ +# Optimizing for Throughput {#openvino_docs_deployment_optimization_guide_tput} + +## General Throughput Considerations +As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md) one possible use-case is delivering the every single request at the minimal delay. +Throughput on the other hand, is about inference scenarios in which potentially large number of inference requests are served simultaneously. +Here, the overall application throughput can be significantly improved with the right performance configuration. +Also, if the model is not already compute- or memory bandwidth-limited, the associated increase in latency is not linearly dependent on the number of requests executed in parallel. + +With the OpenVINO there two major means of running the multiple requests simultaneously: batching and "streams", explained in this document. +Yet, different GPUs behave differently with batch sizes, just like different CPUs require different number of execution streams to maximize the throughput. +Predicting inference performance is difficult and and finding optimal execution parameters requires direct experiments measurements. +One possible throughput optimization strategy is to set an upper bound for latency and then increase the batch size or number of the streams until that tail latency is met (or the throughput is not growing anymore). +Also, consider [Deep Learning Workbench](https://docs.openvino.ai/latest/workbench_docs_Workbench_DG_Introduction.html). + +Finally, the [automatic multi-device execution](../OV_Runtime_UG/multi_device.md) helps to improve the throughput, please also see the section below. +While the same approach of optimizing the parameters of each device separately does work, the resulting multi-device performance is a fraction (that is different for different models) of the “ideal” (plain sum) performance. + +Overall, the latency-throughput is not linearly dependent and very _device_ specific. It is also tightly integrated with _model_ characteristics. +As for the possible inference devices the scenery had already become pretty diverse, the OpenVINO has introduced the dedicated notion of the high-level performance configuration "hints" to describe the target application scenarios. 
+As the range of possible inference devices has become quite diverse, OpenVINO has introduced the dedicated notion of high-level performance configuration "hints" to describe the target application scenarios.
+The hints are described [here](./dldt_deployment_optimization_hints.md).
+
+> **NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) are the recommended way of configuring the performance, being both device-agnostic and future-proof.
+
+The rest of this document covers the low-level means that OpenVINO uses to optimize for throughput.
+
+## Low-Level Implementation Details
+### OpenVINO Streams
+As detailed in the section on the OpenVINO Async API, running multiple inference requests asynchronously is important for general application efficiency.
+Additionally, most devices support running multiple inference requests in parallel in order to improve the device utilization. The _level_ of this parallelism (i.e. how many requests are really executed in parallel on the device) is commonly referred to as the number of 'streams'. Some devices run several requests per stream to amortize the host-side costs.
+Notice that the streams (which can be considered independent queues) execute the requests truly in parallel, but not in lock step (as, e.g., batching does); this makes the streams much more compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), when individual requests can have different shapes.
+
+Also, notice that for efficient asynchronous execution, the streams handle the inference with a special pool of threads.
+So each time you start inference requests (potentially from different application threads), they are actually multiplexed into an inference queue of the particular `ov::CompiledModel`.
+If there is a vacant stream, it pops the request from the queue and expedites it to the on-device execution.
+
+Using multiple streams is an inherently throughput-oriented approach, as every stream requires dedicated memory to operate in parallel to the rest of the streams (read-only data like weights is usually shared between all streams).
+Also, the streams inflate the model load/compilation time.
+This is why the [latency hint](./dldt_deployment_optimization_hints.md) makes a device create a bare minimum of streams (usually just one).
+
+Finally, the streams are always preferable to creating multiple instances of the same model, as the weights memory is shared across the streams, reducing the overall memory consumption.
+
+### Throughput on the CPU: Internals
+In order to best serve multiple inference requests simultaneously, the inference threads are grouped/pinned to particular CPU cores, constituting the CPU streams.
+This usually provides much better performance for the networks than batching, especially on many-core machines:
+![](../img/cpu_streams_explained_1.png)
+
+Compared with the batching, the parallelism is somewhat transposed (i.e. performed over inputs, with much less synchronization within CNN ops):
+![](../img/cpu_streams_explained.png)
+
+Notice that the [high-level performance hints](../OV_Runtime_UG/performance_hints.md) allow the implementation to select the optimal number of streams, _depending on the model's compute demands_ and the CPU capabilities (including [int8 inference](../OV_Runtime_UG/Int8Inference.md) hardware acceleration, number of cores, etc.).
+
+### Automatic Batching Internals
+While the GPU plugin fully supports the general notion of the streams, the associated performance (throughput) improvements are usually modest.
+The primary reason is that, while the streams help to hide the communication overheads and certain bubbles in the device utilization, running multiple OpenCL kernels on the GPU simultaneously is less efficient than calling a kernel on multiple inputs at once.
+
+When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), using the streams for the GPU may suffice. Also, the streams are fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), when individual requests can have different shapes.
+Typically, for 4 and more requests the batching delivers better throughput for the GPUs. Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the most portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario.
+As explained in the section on the [automatic batching](../OV_Runtime_UG/automatic_batching.md), the feature performs on-the-fly grouping of the inference requests to improve device utilization.
+The Automatic Batching relaxes the requirement for an application to saturate devices like the GPU by _explicitly_ using a large batch. It performs transparent gathering of inputs from
+individual inference requests followed by the actual batched execution, with no programming effort from the user:
+![](../img/BATCH_device.PNG)
+
+Essentially, the Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches. Thus, for the execution to be efficient, it is very important that the requests arrive in a timely manner, without causing a batching timeout.
+Normally, the timeout should never be hit; it is rather a graceful way to handle the application exit (when the inputs are not arriving anymore, so a full batch cannot be collected).
+
+If your workload experiences the timeouts (resulting in a performance drop, as the timeout value adds itself to the latency of every request), consider balancing the timeout value against the batch size. For example, in many cases a smaller timeout value and batch size yield better performance than a large batch size coupled with a timeout value that cannot guarantee accommodating all the required requests.
+
+Finally, following the "get_tensor idiom" section from the [general optimizations](./dldt_deployment_optimization_common.md) helps the Automatic Batching to save on input/output copies. Thus, always prefer the "get" versions of the tensor data access APIs in your application.
diff --git a/docs/optimization_guide/dldt_optimization_guide.md b/docs/optimization_guide/dldt_optimization_guide.md
index 85b899faeea867..a90f744ff2bb86 100644
--- a/docs/optimization_guide/dldt_optimization_guide.md
+++ b/docs/optimization_guide/dldt_optimization_guide.md
@@ -1,28 +1,36 @@
-# Performance Optimization Guide {#openvino_docs_optimization_guide_dldt_optimization_guide}
+# Introduction to Performance Optimization {#openvino_docs_optimization_guide_dldt_optimization_guide}
+Before exploring possible optimization techniques, let us first define what inference performance is and how to measure it.
+Notice that reported inference performance often tends to focus on the speed of execution.
+In fact, it is a combination of at least four connected factors: accuracy, throughput, latency, and efficiency. The rest of the document discusses how to balance these key factors.
-Before exploring optimization techniques, let us first define what performance is and how it is measured.
+## What Is Inference Performance
+Generally, performance means how fast the model processes the live data. Two key metrics are used to measure it: latency and throughput; the two are fundamentally interconnected.
-## What Is Performance
+![](../img/LATENCY_VS_THROUGHPUT.svg)
-Performance means how fast the model is in deployment. Two key metrics are used to measure performance: latency and throughput.
+Latency measures the inference time (ms) required to process a single input. When multiple inputs are executed simultaneously (e.g. via batching), the overall throughput (inferences per second, or frames per second, FPS, in the specific case of visual processing) is usually of more concern.
+To calculate the throughput, divide the number of frames that were processed by the processing time.
-![](../img/LATENCY_VS_THROUGHPUT.svg)
+It is important to separate the "pure" inference time of a neural network from the end-to-end application performance. For example, data transfers between the host and a device may unintentionally affect the performance when a host input tensor is processed on an accelerator like a dGPU. Similarly, the image pre-processing may also contribute significantly to the inference time. As detailed in the [getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) section, when drilling into _inference_ performance, one option is to measure all such items separately.
+For the end-to-end scenario though, consider performing the image pre-processing through OpenVINO, and use the asynchronous execution as a way to amortize the communication costs like data transfers. You can find further details in the [general optimizations document](./dldt_deployment_optimization_common.md).
+
+"First-inference latency" is another specific case (e.g. when a fast application start-up is required) where the resulting performance may well be dominated by the model loading time. Consider [model caching](../OV_Runtime_UG/Model_caching_overview.md) as a way to improve the model loading/compilation time.
+
+Finally, memory footprint restrictions are another possible concern when designing an application. While this is a motivation for the _model_ optimization techniques referenced in the next section, notice that the throughput-oriented execution is usually much more memory-hungry, as detailed in the [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).
-Latency measures inference time (ms) required to process a single input. When it comes to batch input need to measure throughput (images per second or frames per second, FPS). To calculate throughput, divide the number of frames that were processed by the processing time.
-## How to measure performance
-To get performance numbers for OpenVINO, as well as tips how to measure it and compare with native framework, go to [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.
+> **NOTE**: To get performance numbers for OpenVINO, as well as tips on how to measure them and compare with a native framework, check the [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.
-## How to Improve Performance
+## Improving the Performance: Model vs Runtime Optimizations
-> **NOTE**: Make sure that your model can be successfully inferred with OpenVINO Inference Engine before reffering to the optimization topic.
+> **NOTE**: Make sure that your model can be successfully inferred with OpenVINO Runtime.
-Inside OpenVINO there are two ways how to get better performance numbers: optimize the model, which is called **model optimization** or tune parameters of execution, which is also **deployment optimization**. Note, that it is possible to combine both types of optimizations.
+With OpenVINO, there are two primary ways of improving the inference performance, namely model-level and runtime-level optimizations. **These two optimization directions are fully compatible**.
 - **Model optimization** includes model modification, such as quantization, pruning, optimization of preprocessing, etc. Fore more details, refer to this [document](./model_optimization_guide.md).
-- **Deployment optimization** includes tuning inference parameters and optimizing model execution. To read more visit [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).
+- **Runtime (Deployment) optimization** includes tuning of the model _execution_ parameters. To read more, visit the [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).
 ## Performance benchmarks
-To estimate the performance and compare performance numbers, measured on various supported devices, a wide range of public models are available at [Perforance benchmarks](../benchmarks/performance_benchmarks.md) section.
\ No newline at end of file
+To estimate the performance and compare performance numbers measured on various supported devices, a wide range of public models is available in the [Performance benchmarks](../benchmarks/performance_benchmarks.md) section.
\ No newline at end of file
diff --git a/docs/optimization_guide/model_optimization_guide.md b/docs/optimization_guide/model_optimization_guide.md
index 34985d447f9fe7..50469ea5acb1ee 100644
--- a/docs/optimization_guide/model_optimization_guide.md
+++ b/docs/optimization_guide/model_optimization_guide.md
@@ -8,6 +8,7 @@
    pot_README
    docs_nncf_introduction
+   openvino_docs_IE_DG_Int8Inference
 
 @endsphinxdirective
diff --git a/docs/snippets/dldt_optimization_guide9.cpp b/docs/snippets/dldt_optimization_guide9.cpp
index dfd746c4b44d37..bdab20e7326713 100644
--- a/docs/snippets/dldt_optimization_guide9.cpp
+++ b/docs/snippets/dldt_optimization_guide9.cpp
@@ -1,7 +1,6 @@
 #include
 int main() {
-using namespace InferenceEngine;
 //! [part9]
 while(true) {
     // capture frame
diff --git a/docs/snippets/ov_auto_batching.cpp b/docs/snippets/ov_auto_batching.cpp
index dc7716b972711b..4d74ee5be57f76 100644
--- a/docs/snippets/ov_auto_batching.cpp
+++ b/docs/snippets/ov_auto_batching.cpp
@@ -41,5 +41,14 @@ auto compiled_model = core.compile_model(model, "GPU",
 //! [hint_num_requests]
 }
+//! [hint_plus_low_level]
+{
+    // high-level performance hints are compatible with low-level device-specific settings
+auto compiled_model = core.compile_model(model, "CPU",
+    ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
+    ov::inference_num_threads(4));
+}
+//! [hint_plus_low_level]
+
+    return 0;
+}
diff --git a/docs/snippets/ov_auto_batching.py b/docs/snippets/ov_auto_batching.py
index 2c8fa6d701c08e..3f8316ced0ab4f 100644
--- a/docs/snippets/ov_auto_batching.py
+++ b/docs/snippets/ov_auto_batching.py
@@ -29,3 +29,11 @@
 # so that certain parameters (like selected batch size) are automatically accommodated accordingly
 compiled_model = core.compile_model(model, "GPU", config)
 # [hint_num_requests]
+
+# [hint_plus_low_level]
+config = {"PERFORMANCE_HINT": "THROUGHPUT",
+          "INFERENCE_NUM_THREADS": "4"}
+# the high-level performance hint is compatible with explicit low-level settings,
+# here limiting the number of inference threads used for the 'throughput' execution
+compiled_model = core.compile_model(model, "CPU", config)
+# [hint_plus_low_level]