Fix broken links #4
natke committed Oct 1, 2020
1 parent 9d89cf4 commit df75b19
Showing 14 changed files with 198 additions and 84 deletions.
13 changes: 9 additions & 4 deletions docs/how-to/tune-performance.md
@@ -7,7 +7,7 @@ nav_order: 1
# ONNX Runtime Performance Tuning
{: .no_toc }

ONNX Runtime gives high performance across a range of hardware options by providing "Execution Providers" to interface to different execution environments. See: [design overview](../resources/high-level-design.md), [supported execution providers](https://github.com/microsoft/onnxruntime#supported-accelerators).
ONNX Runtime gives high performance across a range of hardware options by providing "Execution Providers" to interface to different execution environments. See: [design overview](../resources/high-level-design.md), [supported execution providers](../resources/execution-providers).

Along with this flexibility come decisions about tuning and usage. For each model running with each execution provider, there are settings that can be tuned (e.g. thread count, wait policy) to improve performance.
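As a rough illustration, here is a minimal Python sketch of adjusting some of these settings through `SessionOptions` (the model path is a placeholder, and the right values are workload-specific):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()

# Number of threads used to parallelize execution within an operator.
sess_options.intra_op_num_threads = 4

# Run operators sequentially (or ORT_PARALLEL to run independent ops concurrently).
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Apply all available graph optimizations.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "model.onnx" is a placeholder path.
session = ort.InferenceSession("model.onnx", sess_options)
```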

@@ -172,27 +172,32 @@ The most widely used environment variables are:
* ACTIVE will not yield the CPU; instead it spins in a while loop, checking whether the next task is ready
* Use PASSIVE if your CPU usage is already high, and use ACTIVE when you want to trade CPU for lower latency (see the sketch below)
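As a minimal sketch (assuming an OpenMP-enabled build that honours the standard `OMP_NUM_THREADS`/`OMP_WAIT_POLICY` variables), the environment variables must be set before the onnxruntime package is imported:

```python
import os

# OpenMP reads these variables at load time, so set them before importing onnxruntime.
os.environ["OMP_NUM_THREADS"] = "4"        # number of OpenMP threads
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"  # yield the CPU while waiting for work

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # placeholder model path
```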


## Troubleshooting model performance issues

The answers below are troubleshooting suggestions based on common, previously filed user issues and questions. This list is by no means exhaustive, and there is a lot of case-by-case variation depending on the model and the specific usage scenario. Please use this information to guide your troubleshooting, search through previously filed issues for related topics, and/or file a new issue if your problem is still not resolved.

### Performance Troubleshooting Checklist

Here is a list of things to check through when assessing performance issues.
* Are you using OpenMP? OpenMP parallelizes some of the code for potential performance improvements. This is not recommended when running single-threaded.
* Have you enabled all [graph optimizations](../resources/graph-optimizations.md)? The official published packages do enable all optimizations by default, but when building from source, check that they are enabled in your build.
* Have you searched through previously filed [GitHub issues](https://github.com/microsoft/onnxruntime/issues) to see if your problem has been discussed? Please do this before filing new issues.
* If using CUDA or TensorRT, do you have the right versions of the dependent libraries installed?

### I need help performance tuning for BERT models.
For BERT models, sometimes ONNX Runtime cannot apply the best optimization due to reasons such as framework version updates. We recommend trying out the [BERT optimization tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert), which reflects the latest changes in graph pattern matching and model conversions, and a set of [notebooks](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert/notebooks) to help get started.
### I need help performance tuning for BERT models

For BERT models, ONNX Runtime sometimes cannot apply the best optimization, for reasons such as framework version updates. We recommend trying the [BERT optimization tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers), which reflects the latest changes in graph pattern matching and model conversions, along with a set of [notebooks](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers/notebooks) to help you get started.
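As a hedged sketch of how the tool is typically invoked from Python (the module path, model path, and the BERT-base `num_heads`/`hidden_size` values are assumptions; check the tool's README for your installed version):

```python
from onnxruntime.transformers import optimizer

# num_heads/hidden_size correspond to BERT-base; adjust for other variants.
optimized_model = optimizer.optimize_model(
    "bert_model.onnx",          # placeholder path to the exported model
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
optimized_model.save_model_to_file("bert_model_optimized.onnx")
```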

### Why is the model graph not optimized even with graph_optimization_level set to ORT_ENABLE_ALL?

Starting with IR_VERSION 4, ONNX models treat initializers that appear in the graph inputs as non-constant. This can prevent some graph optimizations, such as constant folding and operator fusion. If there is no need to override the initializers, move them out of the graph inputs, either by regenerating the model with the latest exporter/converter or with the tool [remove_initializer_from_input.py](https://github.com/microsoft/onnxruntime/tree/master/tools/python/remove_initializer_from_input.py).
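If regenerating the model is not an option, the following sketch (using the `onnx` Python package, with placeholder paths) shows the same idea as the tool — dropping graph inputs that are backed by initializers:

```python
import onnx

model = onnx.load("model.onnx")  # placeholder path

# Inputs that are also initializers are treated as overridable, which blocks
# optimizations such as constant folding; remove them from the graph inputs.
initializer_names = {init.name for init in model.graph.initializer}
for graph_input in list(model.graph.input):
    if graph_input.name in initializer_names:
        model.graph.input.remove(graph_input)

onnx.save(model, "model_fixed.onnx")
```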

### Why is my model running slower on GPU than CPU?

Depending on which execution provider you're using, it may not have full support for all the operators in your model. Falling back to CPU ops can hurt performance. Moreover, even if an op is implemented by the CUDA execution provider, ORT may not necessarily assign/place the op to the CUDA EP, for performance reasons. To see the placement decided by ORT, turn on verbose logging and look at the console output.
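For example, a minimal Python sketch of turning on verbose logging for a session (placeholder model path):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = verbose; node placement decisions appear in the log output

session = ort.InferenceSession("model.onnx", so)  # placeholder model path
```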

### My converted TensorFlow model is slow - why?

NCHW and NHWC are two different memory layouts for 4-D tensors.

Most TensorFlow operations used by a CNN support both the NHWC and NCHW data formats. The TensorFlow team suggests that NCHW is faster on GPU, but NHWC is sometimes faster on CPU. However, ONNX only supports NCHW. As a result, if the original model is in NHWC format, extra transposes may be added when the model is converted. The [tensorflow-onnx](https://github.com/onnx/tensorflow-onnx) and [keras-onnx](https://github.com/onnx/keras-onnx) converters do remove many of these transposes, but if this doesn't help sufficiently, consider retraining the model using NCHW.
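As a sketch of the conversion step (the SavedModel directory and input name are placeholders; check the tf2onnx documentation for the exact flags in your version), the converter can be asked to re-layout named inputs as NCHW:

```python
import subprocess

# Invoke the tf2onnx command-line converter, requesting NCHW layout for the
# named input so that fewer Transpose nodes are inserted around conv ops.
subprocess.run(
    [
        "python", "-m", "tf2onnx.convert",
        "--saved-model", "my_saved_model",   # placeholder SavedModel directory
        "--output", "model_nchw.onnx",
        "--inputs-as-nchw", "input:0",       # placeholder input name
    ],
    check=True,
)
```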
9 changes: 4 additions & 5 deletions docs/reference/api/csharp-api.md
@@ -17,13 +17,14 @@ The ONNX runtime provides a C# .Net binding for running inference on ONNX models
{:toc}

## NuGet Package

The Microsoft.ML.OnnxRuntime NuGet package includes the precompiled binaries for ONNX Runtime, with libraries for Windows and Linux platforms with x64 CPUs. The APIs conform to .NET Standard 1.1.

## Sample Code

The unit tests contain several examples of loading models, inspecting input/output node shapes and types, as well as constructing tensors for scoring.

* [../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L166](../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L166)
* [Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs](https://github.com/microsoft/onnxruntime/tree/master/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L166)

## Getting Started
Here is a simple tutorial for getting started with running inference on an existing ONNX model for given input data. The model is typically trained using one of the well-known training frameworks and exported to the ONNX format. To start scoring using the model, open a session using the `InferenceSession` class, passing in the file path to the model as a parameter.
@@ -96,9 +97,10 @@ using (var outputs1 = session1.Run(inputs1))
If the model has fixed-sized inputs and outputs of numeric tensors, you can use `FixedBufferOnnxValue` to accelerate inference. By using `FixedBufferOnnxValue`, the container objects only need to be allocated/disposed once across multiple InferenceSession.Run() calls. This avoids some overhead, which may be beneficial for smaller models where this time is noticeable in the overall running time.

An example can be found at `TestReusingFixedBufferOnnxValueNonStringTypeMultiInferences()`:
* [../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047](../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047)
* [Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047](https://github.com/microsoft/onnxruntime/tree/master/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047)

## Running on GPU (Optional)

If using the GPU package, simply use the appropriate SessionOptions when creating an InferenceSession.

```cs
Expand Down Expand Up @@ -253,6 +255,3 @@ class OnnxRuntimeException: Exception;
```

The type of exception that is thrown in most of the error conditions related to ONNX Runtime.



2 changes: 1 addition & 1 deletion docs/reference/api/winrt-api.md
@@ -16,7 +16,7 @@ The WinML API is a WinRT API that shipped inside the Windows OS starting with bu

Many customers have asked for a way to use this offering as an application redistributable package.

With our new [layered architecture](InferenceHighLevelDesign.md#the-onnx-runtime-and-windows-os-integration) you can now do this, with some limitations. The WinML APIs have been lifted and mirrored into the Microsoft.AI.MachineLearning namespace in the redistributable.
With our [layered architecture](../../resources/high-level-design.md#the-onnx-runtime-and-windows-os-integration) you can now do this, with some limitations. The WinML APIs have been lifted and mirrored into the Microsoft.AI.MachineLearning namespace in the redistributable.

## Contents
{: .no_toc }
97 changes: 87 additions & 10 deletions docs/reference/execution-providers/DNNL-ExecutionProvider.md
@@ -21,20 +21,26 @@ For information on how DNNL optimizes subgraphs, see [Subgraph Optimization](./M
{:toc}

## Build

For build instructions, please see the [BUILD page](../../how-to/build.md#dnnl-and-mklml).

## Supported OS

* Ubuntu 16.04
* Windows 10
* Mac OS X

## Supported backend

* CPU

## Using the DNNL Execution Provider

### C/C++

The DNNLExecutionProvider needs to be registered with ONNX Runtime to enable it in the inference session.

```c++
string log_id = "Foo";
auto logging_manager = std::make_unique<LoggingManager>
(std::unique_ptr<ISink>{new CLogSink{}},
@@ -47,35 +53,38 @@ InferenceSession session_object{so,env};
session_object.RegisterExecutionProvider(std::make_unique<onnxruntime::DNNLExecutionProvider>());
status = session_object.Load(model_file_name);
```
The C API details are [here](../api/c-api.md).
### Python
When using the Python wheel from an ONNX Runtime build that includes the DNNL execution provider, it will be automatically prioritized over the CPU execution provider. Python API details are [here](https://aka.ms/onnxruntime-python).
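As a quick sketch (placeholder model path), you can confirm which providers were registered for a session:

```python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # placeholder model path

# With a DNNL-enabled wheel, the DNNL EP is listed ahead of the CPU EP,
# e.g. ['DnnlExecutionProvider', 'CPUExecutionProvider'].
print(session.get_providers())
```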
## Performance Tuning
For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](../../how-to/tune-performance.md)
## Subgraph Optimization
DNNL uses a blocked layout (for example, NHWC with channels blocked by 16 – nChw16c) to take advantage of vector operations using AVX512. To get the best performance, we avoid reorders (for example, nChw16c to nchw) and propagate the blocked layout to the next primitive.
Subgraph optimization achieves this in the following steps:
1. Parse the ONNX Runtime graph and create an internal representation (IR) of the subgraph.
2. The subgraph operator (DnnlFuncKernel) iterates through the DNNL nodes and creates a vector of DNNL kernels.
3. The Compute function of DnnlFuncKernel iterates over and binds data to the DNNL primitives in the vector and submits the vector for execution.

### Subgraph (IR) Internal Representation
DnnlExecutionProvider::GetCapability() parses the ONNX model graph and creates an IR (internal representation) of subgraphs of DNNL operators.
Each subgraph contains a vector of DnnlNodes, along with the inputs, outputs, and attributes for all its DnnlNodes. There can be attributes with the same name, so we prefix attribute names with the node name and its index. A unique id for the subgraph is set as an attribute.
DnnlNode has an index to its inputs and outputs and a pointer to its parent nodes. A DnnlNode directly reads blocked memory from its parent to avoid data reordering.
<p align="left"><img src="/images/mkl-dnn_node.png" /></p>

### Subgraph Classes
Primitives like DnnlConv, DnnlPool, etc. are derived from the DnnlKernel base class.
The following UML diagram captures the Subgraph classes.
@@ -87,11 +96,78 @@ The following UML diagram captures the Subgraph classes.
The DnnlExecutionProvider::Compute() function creates a DnnlFuncKernel and calls its Compute function.

The DnnlFuncKernel::Compute function creates a SubgraphPrimitve pool and adds the object to a map.
The SubgraphPrimitve constructor calls the following member functions:
```c++
SubgraphPrimitve::CreatePrimitives()
for (auto& mklnode : mklnodes) {
if (mklnode.name == "Conv") {
kernel.reset(new DnnlConv());
kernels.push_back(kernel);
} else if (mklnode.name == "BatchNormalization-Relu") {
kernel.reset(new DnnlBatchNorm());
context_.kernels.push_back(kernel);
} else if (mklnode.name == "MaxPool") {
kernel.reset(new DnnlPool());
context_.kernels.push_back(kernel);
}
.
.
.
```

In the CreatePrimitives method, we iterate over the DnnlNodes, create DnnlKernel objects, and add the DNNL primitives to a vector. It also reads attributes. This is done only once, at the first iteration.

```c++
SubgraphPrimitve::Compute()
for (auto& kernel : kernels) {
kernel->Bind(input_tensors, output_tensors);
}
stream->submit(net);
```
In the SubgraphPrimitve::Compute() method, we iterate through the DNNL kernels and bind input data. Then we submit the vector of primitives to the DNNL stream.
### Subgraph Optimization
DNNL uses a blocked layout (for example, NHWC with channels blocked by 16 – nChw16c) to take advantage of vector operations using AVX512. To get the best performance, we avoid reorders (for example, nChw16c to nchw) and propagate the blocked layout to the next primitive.
Subgraph optimization achieves this in the following steps:
1. Parse the ONNX Runtime graph and create an internal representation (IR) of the subgraph.
2. The subgraph operator (DnnlFuncKernel) iterates through the DNNL nodes and creates a vector of DNNL kernels.
3. The Compute function of DnnlFuncKernel iterates over and binds data to the DNNL primitives in the vector and submits the vector for execution.
#### Subgraph (IR) Internal Representation
DnnlExecutionProvider::GetCapability() parses the ONNX model graph and creates an IR (internal representation) of subgraphs of DNNL operators.
Each subgraph contains a vector of DnnlNodes, along with the inputs, outputs, and attributes for all its DnnlNodes. There can be attributes with the same name, so we prefix attribute names with the node name and its index. A unique id for the subgraph is set as an attribute.
DnnlNode has an index to its inputs and outputs and a pointer to its parent nodes. A DnnlNode directly reads blocked memory from its parent to avoid data reordering.
<p align="left"><img src="images/mkl-dnn_node.png" /></p>
#### Subgraph Classes
Primitives like DnnlConv, DnnlPool, etc. are derived from the DnnlKernel base class.
The following UML diagram captures the Subgraph classes.
<p align="left"><img src="images/mkl-dnn_subgraph.png" /></p>
#### Subgraph Execution
The DnnlExecutionProvider::Compute() function creates a DnnlFuncKernel and calls its Compute function.
The DnnlFuncKernel::Compute function creates a SubgraphPrimitve pool and adds the object to a map.
The SubgraphPrimitve constructor calls the following member functions:
```c++
SubgraphPrimitve::CreatePrimitives()
for (auto& mklnode : mklnodes) {
if (mklnode.name == "Conv") {
@@ -107,10 +183,11 @@ SubgraphPrimitve::CreatePrimitives()
.
.
.
```

In the CreatePrimitives method, we iterate over the DnnlNodes, create DnnlKernel objects, and add the DNNL primitives to a vector. It also reads attributes. This is done only once, at the first iteration.

```c++
SubgraphPrimitve::Compute()
for (auto& kernel : kernels) {
kernel->Bind(input_tensors, output_tensors);
