WIP 3
myshevts committed Mar 9, 2022
1 parent bfe54f5 commit fd895b8
Showing 3 changed files with 9 additions and 13 deletions.
4 changes: 2 additions & 2 deletions docs/MO_DG/prepare_model/Getting_performance_numbers.md
@@ -19,7 +19,7 @@ You need to build your performance conclusions on reproducible data. Do the perf
- For time values that range too much, consider the geometric mean (geomean), as sketched below.
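
A quick illustration of the geomean suggestion above (a minimal C++ sketch; the latency values are purely hypothetical):

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Geometric mean is less sensitive to occasional outliers than the arithmetic mean.
double geomean(const std::vector<double>& values) {
    double log_sum = 0.0;
    for (double v : values)
        log_sum += std::log(v);
    return std::exp(log_sum / values.size());
}

int main() {
    std::vector<double> latencies_ms = {4.2, 4.5, 4.1, 9.8, 4.3};  // hypothetical measurements
    std::cout << "geomean latency: " << geomean(latencies_ms) << " ms\n";
}
```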


## Getting performance numbers using OpenVINO tool
## Getting performance numbers using OpenVINO's benchmark_app

To get performance numbers, use the dedicated [Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample, which is the best way to produce a performance reference.
It has a lot of device-specific knobs, but the primary usage is as simple as:
@@ -34,7 +34,7 @@ $ ./benchmark_app -d CPU -m <model> -i <input>
to execute on the CPU instead.

Each of the [OpenVINO supported devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md) offers a set of performance settings that have command-line equivalents in the [Benchmark App](../../../samples/cpp/benchmark_app/README.md).
While these settings provide really low-level control and allow to leverage the optimal model performance on the _specific_ device, we suggest to always start the performance evaluation with trying the [OpenVINO High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) first:
While these settings provide very low-level control and allow leveraging the optimal model performance on the _specific_ device, we suggest always starting the performance evaluation by trying the [OpenVINO High-Level Performance Hints](../../OV_Runtime_UG/performance_hints.md) first (an equivalent programmatic sketch follows the list):
- benchmark_app **-hint tput** -d 'device' -m 'path to your favorite model'
- benchmark_app **-hint latency** -d 'device' -m 'path to your favorite model'
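
The same hints can also be set programmatically when compiling the model; a minimal sketch, assuming the OpenVINO 2.0 C++ API and a hypothetical model path:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // hypothetical model path

    // THROUGHPUT lets the device pick its own optimal settings; use LATENCY for single-request scenarios.
    auto compiled = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    auto request = compiled.create_infer_request();
    return 0;
}
```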

17 changes: 7 additions & 10 deletions docs/optimization_guide/dldt_deployment_optimization_guide.md
@@ -57,17 +57,16 @@ Compared with the batching, the parallelism is somewhat transposed (i.e. perform
![](../img/cpu_streams_explained.png)
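
For reference, the number of streams can also be set explicitly at compile time; a minimal sketch, assuming the `ov::num_streams` property of the OpenVINO 2.0 C++ API (the model path and stream count are illustrative):

```cpp
#include <vector>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // hypothetical model path

    // Multiple streams let the plugin execute several inference requests in parallel.
    auto compiled = core.compile_model(model, "CPU", ov::num_streams(4));

    // One request per stream keeps all the streams busy.
    std::vector<ov::InferRequest> requests;
    for (int i = 0; i < 4; ++i)
        requests.push_back(compiled.create_infer_request());
    return 0;
}
```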

### Automatic Batching Internals <a name="ov-auto-batching"></a>
As explained in the section on the [automatic batching](../OV_Runtime_UG/automatic_batching.md), the feature performs on-the-fly grouping of the inference requests toimprove device utilization.
As explained in the section on the [automatic batching](../OV_Runtime_UG/automatic_batching.md), the feature performs on-the-fly grouping of the inference requests to improve device utilization.
The Automatic Batching relaxes the requirement for an application to saturate devices like GPU by _explicitly_ using a large batch. Essentially, it performs transparent gathering of inputs from
individual inference requests, followed by the actual batched execution, with no programming effort from the user:
![](../img/BATCH_device.PNG)

Essentially, the Automatic Batching shifts the asynchronousity from the individual requests to the groups of requests that constitute the batches. Thus, for the execution to be efficient it is very important that the requests arrive timely, without causing a timeout. Normally, the timeout should never be hit. It is rather a graceful way to handle the application exit (when the inputs are not arriving anymore, so the full batch is not possible to collect). So if your workload experiences the timeouts (which would result i the performance drop, as, when happened, the timeout value adds itself to the latency of every request), consider balancing the timeout value vs the batch size. For example in many cases having smaller timeout value/batch size may yield better performance than large batch size, but coupled with the timeout value that is cannot guarantee accommodating the full number of the required requests.
TBD
Essentially, the Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches. Thus, for the execution to be efficient it is very important that the requests arrive in a timely manner, without causing a batching timeout. Normally, the timeout should never be hit. It is rather a graceful way to handle the application exit (when the inputs are no longer arriving, so the full batch cannot be collected). So if your workload experiences timeouts (resulting in a performance drop, as the timeout value adds itself to the latency of every request), consider balancing the timeout value against the batch size. In many cases a smaller timeout value and batch size may yield better performance than a large batch size coupled with a timeout value that cannot guarantee accommodating all the required requests.
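
For example, both the batch size and the timeout could be set explicitly. The following is a sketch only; it assumes the `BATCH` virtual device notation and the `ov::auto_batch_timeout` property, with purely illustrative values:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // hypothetical model path

    // "BATCH:GPU(8)" caps the automatically collected batch at 8 requests, while
    // auto_batch_timeout (in milliseconds) bounds how long the plugin waits to fill a batch.
    auto compiled = core.compile_model(model, "BATCH:GPU(8)",
                                       ov::auto_batch_timeout(100));
    return 0;
}
```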

## OpenVINO Async API <a name="ov-async-api"></a>

OpenVINO Async API can improve overall throughput rate of the application. While a device is busy with the inference, the application can do other things in parallel rather than wait for the inference to complete.
OpenVINO Async API can improve the overall throughput of the application. While a device is busy with the inference, the application can do other things in parallel (e.g. populating inputs or even scheduling other requests) rather than wait for the current inference to complete.

In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests: while the current one is processed, the input frame for the next one is being captured. This essentially hides the latency of capturing, so that the overall frame rate is determined only by the slowest part of the pipeline (decoding or inference) and not by the sum of the stages.
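
A minimal C++ sketch of such double-buffered pipelining, assuming the OpenVINO 2.0 API; the model path, the dummy frame shape, and the `capture_frame`/`fill_input` helpers are hypothetical stand-ins for the application's decoding code:

```cpp
#include <utility>
#include <openvino/openvino.hpp>

// Hypothetical stand-ins for the application's video-decoding code.
ov::Tensor capture_frame() {
    return ov::Tensor(ov::element::u8, {1, 3, 224, 224});  // dummy frame
}
void fill_input(ov::InferRequest& req, const ov::Tensor& frame) {
    req.set_input_tensor(frame);
}

int main() {
    ov::Core core;
    auto compiled = core.compile_model("model.xml", "CPU");  // hypothetical model path
    auto current = compiled.create_infer_request();
    auto next = compiled.create_infer_request();

    fill_input(current, capture_frame());
    current.start_async();
    while (true) {
        // While 'current' runs on the device, capture and prepare the next frame.
        fill_input(next, capture_frame());
        current.wait();            // results of 'current' are ready here
        next.start_async();        // kick off the next inference immediately
        std::swap(current, next);  // reuse the request objects
    }
}
```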

@@ -88,11 +87,11 @@ You can compare the pseudo-codes for the regular and async-based approaches:
The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames or run further inference, like emotion detection on top of the face detection results.
Refer to the [Object Detection C++ Demo](@ref omz_demos_object_detection_demo_cpp), [Object Detection Python Demo](@ref omz_demos_object_detection_demo_python) (latency-oriented Async API showcase) and [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) for complete examples of the Async API in action.

## Request-Based API and “GetBlob” Idiom <a name="new-request-based-api"></a>
## Request-Based API and the "get_tensor" Idiom <a name="new-request-based-api"></a>

Infer Request based API offers two types of request: Sync and Async. The Sync is considered below. The Async splits (synchronous) `Infer` into `StartAsync` and `Wait` (see <a href="#ie-async-api">OpenVINO Async API</a>).
The API of the inference requests offers Sync and Async execution. While the Sync serializes the execution flow, the Async splits (synchronous) `Infer` into `StartAsync` and `Wait` (see <a href="#ie-async-api">OpenVINO Async API</a>).

More importantly, an infer request encapsulates the reference to the “executable” network and actual inputs/outputs. Now, when you load the network to the plugin, you get a reference to the executable network (you may consider that as a queue). Actual infer requests are created by the executable network:
An infer request encapsulates the reference to the "compiled" model and actual inputs/outputs:

```sh

@@ -144,6 +143,4 @@ relu5_9_x2 OPTIMIZED_OUT layerType: ReLU realTime: 0
```

This contains the layer names (as seen in the IR), layer types, and execution statistics. Notice the `OPTIMIZED_OUT` status, which indicates that the particular activation was fused into the adjacent convolution.
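
Programmatically, similar per-layer statistics can be retrieved from an infer request; a minimal sketch, assuming the OpenVINO 2.0 C++ API with profiling enabled at compile time (the model path is illustrative):

```cpp
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // hypothetical model path

    // Per-layer counters are collected only when profiling is explicitly enabled.
    auto compiled = core.compile_model(model, "CPU", ov::enable_profiling(true));
    auto request = compiled.create_infer_request();
    request.infer();

    // Each entry describes an executed node and its timings; the status field
    // (not printed here) marks e.g. the OPTIMIZED_OUT nodes.
    for (const auto& info : request.get_profiling_info()) {
        std::cout << info.node_name << " (" << info.node_type << "): "
                  << info.real_time.count() << " us" << std::endl;
    }
    return 0;
}
```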


TODO:execution graphs
Both benchmark_app versions also support the "exec_graph_path" command-line option, which instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific, [Netron-viewable](https://netron.app/) graph written to the specified file.
1 change: 0 additions & 1 deletion docs/snippets/dldt_optimization_guide9.cpp
@@ -1,7 +1,6 @@
#include <ie_core.hpp>

int main() {
using namespace InferenceEngine;
//! [part9]
while(true) {
// capture frame
