diff --git a/docs/how-to/tune-performance.md b/docs/how-to/tune-performance.md index 6005adb480641..febd844399742 100644 --- a/docs/how-to/tune-performance.md +++ b/docs/how-to/tune-performance.md @@ -7,7 +7,7 @@ nav_order: 1 # ONNX Runtime Performance Tuning {: .no_toc } -ONNX Runtime gives high performance across a range of hardware options by providing "Execution Providers" to interface to different execution environments. See: [design overview](../resources/high-level-design.md), [supported execution providers](https://github.com/microsoft/onnxruntime#supported-accelerators). +ONNX Runtime gives high performance across a range of hardware options by providing "Execution Providers" to interface to different execution environments. See: [design overview](../resources/high-level-design.md), [supported execution providers](../resources/execution-providers). Along with this flexibility comes decisions for tuning and usage. For each model running with each execution provider, there are settings that can be tuned (e.g. thread number, wait policy, etc) to improve performance. @@ -172,27 +172,32 @@ The most widely used environment variables are: * ACTIVE will not yield CPU, instead it will have a while loop to check whether the next task is ready * Use PASSIVE if your CPU usage already high, and use ACTIVE when you want to trade CPU with latency - ## Troubleshooting model performance issues + The answers below are troubleshooting suggestions based on common previous user-filed issues and questions. This list is by no means exhaustive and there is a lot of case-by-case fluctuation depending on the model and specific usage scenario. Please use this information to guide your troubleshooting, search through previously filed issues for related topics, and/or file a new issue if your problem is still not resolved. ### Performance Troubleshooting Checklist + Here is a list of things to check through when assessing performance issues. * Are you using OpenMP? OpenMP will parallelize some of the code for potential performance improvements. This is not recommended for running on single threads. * Have you enabled all [graph optimizations](../resources/graph-optimizations.md)? The official published packages do enable all by default, but when building from source, check that these are enabled in your build. * Have you searched through prior filed [Github issues](https://github.com/microsoft/onnxruntime/issues) to see if your problem has been discussed previously? Please do this before filing new issues. * If using CUDA or TensorRT, do you have the right versions of the dependent libraries installed? -### I need help performance tuning for BERT models. -For BERT models, sometimes ONNX Runtime cannot apply the best optimization due to reasons such as framework version updates. We recommend trying out the [BERT optimization tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert), which reflects the latest changes in graph pattern matching and model conversions, and a set of [notebooks](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert/notebooks) to help get started. +### I need help performance tuning for BERT models + +For BERT models, sometimes ONNX Runtime cannot apply the best optimization due to reasons such as framework version updates. 
We recommend trying out the [BERT optimization tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers), which reflects the latest changes in graph pattern matching and model conversions, and a set of [notebooks](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers/notebooks) to help get started. ### Why is the model graph not optimized even with graph_optimization_level set to ORT_ENABLE_ALL? + The ONNX model from IR_VERSION 4 only treats initializers that appear in graph input as non-constant. This may fail some of the graph optimizations, like const folding, operator fusion and etc. Move initializers out of graph inputs if there is no need to override them, by either re-generating the model with latest exporter/converter or with the tool [remove_initializer_from_input.py](https://github.com/microsoft/onnxruntime/tree/master/tools/python/remove_initializer_from_input.py). ### Why is my model running slower on GPU than CPU? + Depending on which execution provider you're using, it may not have full support for all the operators in your model. Fallback to CPU ops can cause hits in performance speed. Moreover even if an op is implemented by the CUDA execution provider, it may not necessarily assign/place the op to the CUDA EP due to performance reasons. To see the placement decided by ORT, turn on verbose logging and look at the console output. ### My converted Tensorflow model is slow - why? + NCHW and NHWC are two different memory layout for 4-D tensors. Most TensorFlow operations used by a CNN support both NHWC and NCHW data format. The Tensorflow team suggests that on GPU NCHW is faster but on CPU NHWC is sometimes faster in Tensorflow. However, ONNX only supports NCHW. As a result, if the original model is in NHWC format, when the model is converted extra transposes may be added. The [tensorflow-onnx](https://github.com/onnx/tensorflow-onnx) and [keras-onnx](https://github.com/onnx/keras-onnx) converters do remove many of these transposes, but if this doesn't help sufficiently, consider retraining the model using NCHW. diff --git a/docs/reference/api/csharp-api.md b/docs/reference/api/csharp-api.md index 49018e1553941..231f6667a6b6b 100644 --- a/docs/reference/api/csharp-api.md +++ b/docs/reference/api/csharp-api.md @@ -17,13 +17,14 @@ The ONNX runtime provides a C# .Net binding for running inference on ONNX models {:toc} ## NuGet Package + The Microsoft.ML.OnnxRuntime Nuget package includes the precompiled binaries for ONNX runtime, and includes libraries for Windows and Linux platforms with X64 CPUs. The APIs conform to .Net Standard 1.1. ## Sample Code The unit tests contain several examples of loading models, inspecting input/output node shapes and types, as well as constructing tensors for scoring. -* [../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L166](../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L166) +* [Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs](https://github.com/microsoft/onnxruntime/tree/master/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L166) ## Getting Started Here is simple tutorial for getting started with running inference on an existing ONNX model for a given input data. The model is typically trained using any of the well-known training frameworks and exported into the ONNX format. 
To start scoring using the model, open a session using the `InferenceSession` class, passing in the file path to the model as a parameter. @@ -96,9 +97,10 @@ using (var outputs1 = session1.Run(inputs1)) If the model have fixed sized inputs and outputs of numeric tensors, you can use "FixedBufferOnnxValue" to accelerate the inference speed. By using "FixedBufferOnnxValue", the container objects only need to be allocated/disposed one time during multiple InferenceSession.Run() calls. This avoids some overhead which may be beneficial for smaller models where the time is noticeable in the overall running time. An example can be found at `TestReusingFixedBufferOnnxValueNonStringTypeMultiInferences()`: -* [../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047](../csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047) +* [Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047](https://github.com/microsoft/onnxruntime/tree/master/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs#L1047) ## Running on GPU (Optional) + If using the GPU package, simply use the appropriate SessionOptions when creating an InferenceSession. ```cs @@ -253,6 +255,3 @@ class OnnxRuntimeException: Exception; ``` The type of Exception that is thrown in most of the error conditions related to Onnx Runtime. - - - diff --git a/docs/reference/api/winrt-api.md b/docs/reference/api/winrt-api.md index 7d81bc6a24a2b..2469b2cd12d08 100644 --- a/docs/reference/api/winrt-api.md +++ b/docs/reference/api/winrt-api.md @@ -16,7 +16,7 @@ The WinML API is a WinRT API that shipped inside the Windows OS starting with bu Many customers have asked for a way to use this offering as an application redistributable package. -With our new [layered architecture](InferenceHighLevelDesign.md#the-onnx-runtime-and-windows-os-integration) you can now do this, with some limitations. The WinML APIs have been lifted and mirrored into the Microsoft.AI.MachineLearning namespace in the redistributable. +With our [layered architecture](../../resources/high-level-design.md#the-onnx-runtime-and-windows-os-integration) you can now do this, with some limitations. The WinML APIs have been lifted and mirrored into the Microsoft.AI.MachineLearning namespace in the redistributable. ## Contents {: .no_toc } diff --git a/docs/reference/execution-providers/DNNL-ExecutionProvider.md b/docs/reference/execution-providers/DNNL-ExecutionProvider.md index 57b4cf4b2ffb3..8e0f7d7b5165f 100644 --- a/docs/reference/execution-providers/DNNL-ExecutionProvider.md +++ b/docs/reference/execution-providers/DNNL-ExecutionProvider.md @@ -21,20 +21,26 @@ For information on how DNNL optimizes subgraphs, see [Subgraph Optimization](./M {:toc} ## Build + For build instructions, please see the [BUILD page](../../how-to/build.md#dnnl-and-mklml). ## Supported OS + * Ubuntu 16.04 -* Windows 10 +* Windows 10 * Mac OS X ## Supported backend + * CPU ## Using the DNNL Execution Provider + ### C/C++ + The DNNLExecutionProvider execution provider needs to be registered with ONNX Runtime to enable in the inference session. -``` + +```c++ string log_id = "Foo"; auto logging_manager = std::make_unique (std::unique_ptr{new CLogSink{}}, @@ -47,35 +53,38 @@ InferenceSession session_object{so,env}; session_object.RegisterExecutionProvider(std::make_unique<::onnxruntime:: DNNLExecutionProvider >()); status = session_object.Load(model_file_name); ``` + The C API details are [here](../api/c-api.md). 
### Python + When using the python wheel from the ONNX Runtime built with DNNL execution provider, it will be automatically prioritized over the CPU execution provider. Python APIs details are [here](https://aka.ms/onnxruntime-python). ## Performance Tuning + For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](../../how-to/tune-performance.md) ## Subgraph Optimization -DNNL uses blocked layout (example: nhwc with channels blocked by 16 – nChw16c) to take advantage of vector operations using AVX512. To get best performance, we avoid reorders (example. Nchw16c to nchw) and propagate blocked layout to next primitive. +DNNL uses blocked layout (example: nhwc with channels blocked by 16 – nChw16c) to take advantage of vector operations using AVX512. To get best performance, we avoid reorders (example. Nchw16c to nchw) and propagate blocked layout to next primitive. Subgraph optimization achieves this in the following steps. + 1. Parses ONNX Runtime graph and creates an Internal Representation of subgraph.. 2. Subgraph Operator (DnnlFunKernel) iterates through DNNL nodes and creates a vector DNNL Kernels 3. Compute Function of DnnlFunKernel iterates and binds data to DNNL primitives in the vector and submits vector for execution. - ### Subgraph (IR) Internal Representation + DnnlExecutionProvicer::GetCapability() parses ONNX model graph and creates IR (Internal Representation) of subgraphs of DNNL operators. -Each subgraph contains a vector DnnlNodes, inputs, outputs and attributes for all its DnnlNodes. There can be attributes of same name. So, we prefix attribute names with Node name and its index. -Unique id for subgraph is set as an attribute. +Each subgraph contains a vector DnnlNodes, inputs, outputs and attributes for all its DnnlNodes. There can be attributes of same name. So, we prefix attribute names with Node name and its index. Unique id for subgraph is set as an attribute. DnnlNode has an index to its inputs and outputs and pointer to its parent nodes. DnnlNode directly reads blocked memory from its parent to avoid data reordering.
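To make this internal representation more concrete, here is a minimal, hypothetical C++ sketch of the structures described above. The type and member names (`SketchDnnlSubgraph`, `SketchDnnlNode`, and so on) are illustrative stand-ins and do not appear in the ONNX Runtime source; they only mirror the layout the text describes.

```c++
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical node in the DNNL subgraph IR.
struct SketchDnnlNode {
  std::string op_name;                         // e.g. "Conv", "BatchNormalization-Relu", "MaxPool"
  std::vector<std::size_t> input_indices;      // indices into the subgraph's inputs/outputs
  std::vector<std::size_t> output_indices;
  std::vector<const SketchDnnlNode*> parents;  // read blocked memory directly from parents
};

// Hypothetical subgraph: a vector of nodes plus shared inputs, outputs and attributes.
struct SketchDnnlSubgraph {
  std::string unique_id;                       // unique id stored as an attribute
  std::vector<SketchDnnlNode> nodes;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
  // Attribute keys are prefixed with the owning node's name and index
  // (e.g. "Conv_0.strides") so attributes with the same name do not collide.
  std::unordered_map<std::string, std::string> attributes;
};

int main() {
  SketchDnnlSubgraph sg;
  sg.unique_id = "subgraph_0";
  sg.inputs = {"X", "W"};
  sg.nodes.push_back({"Conv", {0, 1}, {0}, {}});
  sg.attributes["Conv_0.strides"] = "1,1";
  return 0;
}
```

Keeping a pointer back to each node's parents is what lets a node consume its parent's blocked output directly, avoiding a reorder between primitives.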
- ### Subgraph Classes + Primitive like DnnlConv, DnnlPool, etc are derived from DnnlKernel base class. The following UML diagram captures Subgraph classes. @@ -87,11 +96,78 @@ The following UML diagram captures Subgraph classes. DnnlExecutionProvicer::Compute() function creates DnnlFuncKernel and call it’s Compute Function. - DnnlFuncKernel::Compute function creates SubgraphPrimitve pool and add the object to a map. SubgraphPrimitve constructor calls the following member functions + +```c++ +SubgraphPrimitve::CreatePrimitives() + for (auto& mklnode : mklnodes) { + if (mklnode.name == "Conv") { + kernel.reset(new DnnlConv()); + kernels.push_back(kernel); + } else if (mklnode.name == "BatchNormalization-Relu") { + kernel.reset(new DnnlBatchNorm()); + context_.kernels.push_back(kernel); + } else if (mklnode.name == "MaxPool") { + kernel.reset(new DnnlPool()); + context_.kernels.push_back(kernel); + } + . + . + . +``` + +In CreatePrimitives method, we iterate DnnlNodes and creates DnnlKernel objects and add DNNL primitive to a vector. It also reads attributes. This is done only once, at first iteration. + +```c++ +SubgraphPrimitve::Compute() + for (auto& kernel : kernels) { + kernel->Bind(input_tensors, output_tensors); + } + stream->submit(net); ``` + +In SubgraphPrimitve::Compute() method, we iterate thru Dnnl Kernels and bind input data. Then we submit the vector of Primitives to DNNL stream. + + +### Subgraph Optimization + +DNNL uses blocked layout (example: nhwc with channels blocked by 16 – nChw16c) to take advantage of vector operations using AVX512. To get best performance, we avoid reorders (example. Nchw16c to nchw) and propagate blocked layout to next primitive. + +Subgraph optimization achieves this in the following steps. + +1. Parses ONNX Runtime graph and creates an Internal Representation of subgraph.. +2. Subgraph Operator (DnnlFunKernel) iterates through DNNL nodes and creates a vector DNNL Kernels +3. Compute Function of DnnlFunKernel iterates and binds data to DNNL primitives in the vector and submits vector for execution. + +#### Subgraph (IR) Internal Representation + +DnnlExecutionProvicer::GetCapability() parses ONNX model graph and creates IR (Internal Representation) of subgraphs of DNNL operators. +Each subgraph contains a vector DnnlNodes, inputs, outputs and attributes for all its DnnlNodes. There can be attributes of same name. So, we prefix attribute names with Node name and its index. +Unique id for subgraph is set as an attribute. + +DnnlNode has an index to its inputs and outputs and pointer to its parent nodes. DnnlNode directly reads blocked memory from its parent to avoid data reordering. + +
+ +#### Subgraph Classes + +Primitive like DnnlConv, DnnlPool, etc are derived from DnnlKernel base class. + +The following UML diagram captures Subgraph classes. + +
*(figure: Subgraph classes UML diagram)*
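As a rough stand-in for the UML diagram, the class relationships can be summarized with the minimal, hypothetical C++ sketch below. The names are simplified (`DnnlKernelSketch` instead of the real `DnnlKernel`, etc.) and the exact method set is an assumption based on the surrounding text, not the actual onnxruntime implementation.

```c++
#include <memory>
#include <vector>

// Hypothetical base class standing in for DnnlKernel.
class DnnlKernelSketch {
 public:
  virtual ~DnnlKernelSketch() = default;
  virtual void CreatePrimitive() = 0;  // build the DNNL primitive for this node
  virtual void Bind() = 0;             // bind input/output memory before submission
};

// Hypothetical concrete kernels standing in for DnnlConv / DnnlPool.
class DnnlConvSketch : public DnnlKernelSketch {
 public:
  void CreatePrimitive() override { /* create a DNNL convolution primitive */ }
  void Bind() override { /* bind convolution inputs/outputs */ }
};

class DnnlPoolSketch : public DnnlKernelSketch {
 public:
  void CreatePrimitive() override { /* create a DNNL pooling primitive */ }
  void Bind() override { /* bind pooling inputs/outputs */ }
};

// Hypothetical subgraph primitive: owns one kernel per DnnlNode, mirroring the
// CreatePrimitives()/Compute() pattern shown in the code excerpts below.
class SubgraphPrimitiveSketch {
 public:
  void AddKernel(std::unique_ptr<DnnlKernelSketch> kernel) {
    kernel->CreatePrimitive();               // done once, at first iteration
    kernels_.push_back(std::move(kernel));
  }
  void Compute() {
    for (auto& kernel : kernels_) {
      kernel->Bind();                        // bind input/output tensors
    }
    // ... submit the vector of primitives to the DNNL stream ...
  }
 private:
  std::vector<std::unique_ptr<DnnlKernelSketch>> kernels_;
};

int main() {
  SubgraphPrimitiveSketch subgraph;
  subgraph.AddKernel(std::make_unique<DnnlConvSketch>());
  subgraph.AddKernel(std::make_unique<DnnlPoolSketch>());
  subgraph.Compute();
  return 0;
}
```

Putting every kernel behind a common base class is what allows the `Compute()` loop shown below to iterate a single vector, bind each primitive's data, and submit the whole batch to the DNNL stream uniformly.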
+ +#### Subgraph Execution + +DnnlExecutionProvicer::Compute() function creates DnnlFuncKernel and call it’s Compute Function. + +DnnlFuncKernel::Compute function creates SubgraphPrimitve pool and add the object to a map. + +SubgraphPrimitve constructor calls the following member functions + +```c++ SubgraphPrimitve::CreatePrimitives() for (auto& mklnode : mklnodes) { if (mklnode.name == "Conv") { @@ -107,10 +183,11 @@ SubgraphPrimitve::CreatePrimitives() . . . -``` +``` + In CreatePrimitives method, we iterate DnnlNodes and creates DnnlKernel objects and add DNNL primitive to a vector. It also reads attributes. This is done only once, at first iteration. -``` +```c++ SubgraphPrimitve::Compute() for (auto& kernel : kernels) { kernel->Bind(input_tensors, output_tensors); diff --git a/docs/reference/execution-providers/DirectML-ExecutionProvider.md b/docs/reference/execution-providers/DirectML-ExecutionProvider.md index 8988e0c40a324..a28d8cd8257b4 100644 --- a/docs/reference/execution-providers/DirectML-ExecutionProvider.md +++ b/docs/reference/execution-providers/DirectML-ExecutionProvider.md @@ -32,8 +32,6 @@ The DirectML execution provider requires any DirectX 12 capable device. Almost a DirectML is compatible with Windows 10, version 1709 (10.0.16299; RS3, "Fall Creators Update") and newer. - - ## Building from source For general information about building onnxruntime, see [BUILD.md](../../how-to/build.md). @@ -44,38 +42,42 @@ Requirements for building the DirectML execution provider: To build onnxruntime with the DML EP included, supply the `--use_dml` parameter to `build.bat`. e.g. - build.bat --config RelWithDebInfo --build_shared_lib --parallel --use_dml +```powershell +build.bat --config RelWithDebInfo --build_shared_lib --parallel --use_dml +``` The DirectML execution provider supports building for both x64 (default) and x86 architectures. Note that building onnxruntime with the DirectML execution provider enabled causes the the DirectML redistributable package to be automatically downloaded as part of the build. Its use is governed by a license whose text may be found as part of the NuGet package. - - ## Using the DirectML execution provider -When using the [C API](../C_API.md) with a DML-enabled build of onnxruntime (see [Building from source](#building-from-source)), the DirectML execution provider can be enabled using one of the two factory functions included in `include/onnxruntime/core/providers/dml/dml_provider_factory.h`. +When using the [C API](../api/c-api.md) with a DML-enabled build of onnxruntime (see [Building from source](#building-from-source)), the DirectML execution provider can be enabled using one of the two factory functions included in `include/onnxruntime/core/providers/dml/dml_provider_factory.h`. ### `OrtSessionOptionsAppendExecutionProvider_DML` function Creates a DirectML Execution Provider which executes on the hardware adapter with the given `device_id`, also known as the adapter index. The device ID corresponds to the enumeration order of hardware adapters as given by [IDXGIFactory::EnumAdapters](https://docs.microsoft.com/windows/win32/api/dxgi/nf-dxgi-idxgifactory-enumadapters). A `device_id` of 0 always corresponds to the default adapter, which is typically the primary display GPU installed on the system. A negative `device_id` is invalid. 
- OrtStatus* OrtSessionOptionsAppendExecutionProvider_DML( - _In_ OrtSessionOptions* options, - int device_id - ); +```c +OrtStatus* OrtSessionOptionsAppendExecutionProvider_DML( + _In_ OrtSessionOptions* options, + int device_id + ); +``` ### `OrtSessionOptionsAppendExecutionProviderEx_DML` function Creates a DirectML Execution Provider using the given DirectML device, and which executes work on the supplied D3D12 command queue. The DirectML device and D3D12 command queue must have the same parent [ID3D12Device](https://docs.microsoft.com/windows/win32/api/d3d12/nn-d3d12-id3d12device), or an error will be returned. The D3D12 command queue must be of type `DIRECT` or `COMPUTE` (see [D3D12_COMMAND_LIST_TYPE](https://docs.microsoft.com/windows/win32/api/d3d12/ne-d3d12-d3d12_command_list_type)). If this function succeeds, the inference session once created will maintain a strong reference on both the `dml_device` and `command_queue` objects. - OrtStatus* OrtSessionOptionsAppendExecutionProviderEx_DML( - _In_ OrtSessionOptions* options, - _In_ IDMLDevice* dml_device, - _In_ ID3D12CommandQueue* cmd_queue - ); +```c +OrtStatus* OrtSessionOptionsAppendExecutionProviderEx_DML( + _In_ OrtSessionOptions* options, + _In_ IDMLDevice* dml_device, + _In_ ID3D12CommandQueue* cmd_queue + ); +``` -**See Also** +### See Also [DMLCreateDevice function](https://docs.microsoft.com/windows/win32/api/directml/nf-directml-dmlcreatedevice) [ID3D12Device::CreateCommandQueue method](https://docs.microsoft.com/windows/win32/api/d3d12/nf-d3d12-id3d12device-createcommandqueue) @@ -91,7 +93,7 @@ The DirectML execution provider does not support the use of memory pattern optim If using the onnxruntime C API, you must call `DisableMemPattern` and `SetSessionExecutionMode` functions to set the options required by the DirectML execution provider. -See [onnxruntime\include\onnxruntime\core\session\onnxruntime_c_api.h](../.https://github.com/microsoft/onnxruntime/tree/master/include//onnxruntime/core/session/onnxruntime_c_api.h). +See [onnxruntime\include\onnxruntime\core\session\onnxruntime_c_api.h](https://github.com/microsoft/onnxruntime/tree/master/include//onnxruntime/core/session/onnxruntime_c_api.h). OrtStatus*(ORT_API_CALL* DisableMemPattern)(_Inout_ OrtSessionOptions* options)NO_EXCEPTION; @@ -103,7 +105,7 @@ Additionally, as the DirectML execution provider does not support parallel execu ## Samples -A complete sample of onnxruntime using the DirectML execution provider can be found under [samples/c_cxx/fns_candy_style_transfer](../.https://github.com/microsoft/onnxruntime/tree/master/samples//c_cxx/fns_candy_style_transfer). +A complete sample of onnxruntime using the DirectML execution provider can be found under [samples/c_cxx/fns_candy_style_transfer](https://github.com/microsoft/onnxruntime/tree/master/samples//c_cxx/fns_candy_style_transfer). ## Performance best practices The DirectML execution provider works most efficiently when tensor shapes are known at the time a session is created. This provides a few performance benefits: @@ -119,7 +121,6 @@ In this case, there are three options: - Specify values of named dimensions within model inputs when creating the session using the OnnxRuntime *AddFreeDimensionOverrideByName* ABI. - Edit the model to ensure that an input's free dimension has a [denotation](https://github.com/onnx/onnx/blob/master/docs/DimensionDenotation.md) (such as "DATA_BATCH," or a custom denotation). 
Then when creating the session, specify the dimension size for each denotation. This can be done using the OnnxRuntime *AddFreeDimensionOverride* ABI. - ## See also [DirectML documentation \(docs.microsoft.com\)](https://docs.microsoft.com/en-us/windows/win32/direct3d12/dml) diff --git a/docs/reference/execution-providers/MIGraphX-ExecutionProvider.md b/docs/reference/execution-providers/MIGraphX-ExecutionProvider.md index 76b274f02c408..3e22fb60a7ff8 100644 --- a/docs/reference/execution-providers/MIGraphX-ExecutionProvider.md +++ b/docs/reference/execution-providers/MIGraphX-ExecutionProvider.md @@ -42,7 +42,7 @@ The C API details are [here](../api/c-api.md). ### Python When using the Python wheel from the ONNX Runtime build with MIGraphX execution provider, it will be automatically prioritized over the default GPU or CPU execution providers. There is no need to separately register the execution -provider. Python APIs details are [here](../python/api_summary.rst#api-summary). +provider. Python APIs details are [here](/python/api_summary). You can check [here](https://github.com/scxiao/ort_test/tree/master/python/run_onnx) for a python script to run an model on either the CPU or MIGraphX Execution Provider. diff --git a/docs/reference/execution-providers/Nuphar-ExecutionProvider.md b/docs/reference/execution-providers/Nuphar-ExecutionProvider.md index a956bbd6c66b3..7b9ede1f33207 100644 --- a/docs/reference/execution-providers/Nuphar-ExecutionProvider.md +++ b/docs/reference/execution-providers/Nuphar-ExecutionProvider.md @@ -10,7 +10,7 @@ nav_order: 8 NUPHAR stands for Neural-network Unified Preprocessing Heterogeneous ARchitecture. As an execution provider in the ONNX Runtime, it is built on top of [TVM](https://github.com/dmlc/tvm) and [LLVM](https://llvm.org) to accelerate ONNX models by compiling nodes in subgraphs into optimized functions via JIT. It also provides JIT caching to save compilation time at runtime. -Developers can tap into the power of Nuphar through ONNX Runtime to accelerate inferencing of ONNX models. The Nuphar execution provider comes with a common ONNX to TVM lowering [library](../../onnxruntime/core/codegen) that can potentially be reused by other execution providers to leverage TVM. With the Nuphar execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic X64 CPU acceleration, especially for quantized recurrent neural networks. Various products at Microsoft have seen up to a 5x improvement in performance with no loss of accuracy, by running quantized LSTMs via the Nuphar execution provider in the ONNX Runtime. +Developers can tap into the power of Nuphar through ONNX Runtime to accelerate inferencing of ONNX models. The Nuphar execution provider comes with a common ONNX to TVM lowering [library](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/codegen) that can potentially be reused by other execution providers to leverage TVM. With the Nuphar execution provider, the ONNX Runtime delivers better inference performance on the same hardware compared to generic X64 CPU acceleration, especially for quantized recurrent neural networks. Various products at Microsoft have seen up to a 5x improvement in performance with no loss of accuracy, by running quantized LSTMs via the Nuphar execution provider in the ONNX Runtime. 
## Contents {: .no_toc } @@ -26,19 +26,21 @@ For build instructions, please see the [BUILD page](../../how-to/build.md#nuphar The Nuphar execution provider needs to be registered with ONNX Runtime to enable in the inference session. The C API details are [here](../api/c-api.md). ### Python -You can use the Nuphar execution provider via the python wheel from the ONNX Runtime build. The Nuphar execution provider will be automatically prioritized over the default CPU execution providers, thus no need to separately register the execution provider. Python APIs details are [here](../python/api_summary.rst#api-summary). + +You can use the Nuphar execution provider via the python wheel from the ONNX Runtime build. The Nuphar execution provider will be automatically prioritized over the default CPU execution providers, thus no need to separately register the execution provider. Python APIs details are [here](/python/api_summary). ## Performance and Accuracy Testing -You can test your ONNX model's performance with [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/test/perftest/README.md), or test accuracy with [onnx_test_runner](../../onnxruntime/test/onnx/README.txt). To run these tools with the Nuphar execution provider, please pass `-e nuphar` in command line options. + +You can test your ONNX model's performance with [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/test/perftest/README.md), or test accuracy with [onnx_test_runner](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/test/onnx). To run these tools with the Nuphar execution provider, please pass `-e nuphar` in command line options. Please note that Nuphar uses TVM thread pool and parallel schedule for multi-thread inference performance. When building with OpenMP or MKLML, TVM thread pool would use gomp or iomp as its implementation; otherwise, TVM creates its own thread pool. Because of this, the current default parallel schedule policy is: - Default to on for USE_OPENMP or USE_MKLML. User can use OMP_NUM_THREADS/MKL_NUM_THREADS to control TVM thread pool, as well as TVM_NUM_THREADS - Default to off for none of above. User can use TVM_NUM_THREADS to control TVM thread pool. -This choice is to ensure to get ideal performance with the different build options. When build with USE_OPENMP or USE_MKLML, users would have to avoid thread confliction from OpenMP or MKL with their inference invocations anyway, so parallel schedule is enable to leverage existing thread pool. When not building with gomp or iomp, TVM thread pool is turned off to avoid confliction with user threads. If needed, user can set env or settings with [NUPHAR_PARALLEL_MIN_WORKLOADS](../../onnxruntime/core/providers/nuphar/common/nuphar_settings.cc#L61) to 0 to disable parallel schedule, or to some non-zero value to enable parallel schedule. The non-zero value indicates the minimal number of elements being computed per thread when parallel schedule would be turned on. +This choice is to ensure to get ideal performance with the different build options. When build with USE_OPENMP or USE_MKLML, users would have to avoid thread confliction from OpenMP or MKL with their inference invocations anyway, so parallel schedule is enable to leverage existing thread pool. When not building with gomp or iomp, TVM thread pool is turned off to avoid confliction with user threads. 
If needed, user can set env or settings with [NUPHAR_PARALLEL_MIN_WORKLOADS](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/providers/nuphar/common/nuphar_settings.cc#L61) to 0 to disable parallel schedule, or to some non-zero value to enable parallel schedule. The non-zero value indicates the minimal number of elements being computed per thread when parallel schedule would be turned on. ## Model Conversion and Quantization -You may use Python script [model_editor.py](../../onnxruntime/core/providers/nuphar/scripts/model_editor.py) to turn LSTM/GRU/RNN ops to Scan ops for a given model, and then use [model_quantizer.py](../../onnxruntime/core/providers/nuphar/scripts/model_quantizer.py) to quantize MatMul ops into MatMulInteger ops. +You may use Python script [model_editor.py](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/providers/nuphar/scripts/model_editor.py) to turn LSTM/GRU/RNN ops to Scan ops for a given model, and then use [model_quantizer.py](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/providers/nuphar/scripts/model_quantizer.py) to quantize MatMul ops into MatMulInteger ops. We use dynamic per-row quantization for inputs of LSTM MatMul, so MatMul becomes three parts: quantization, MatMulInteger and dequantization. Weights for MatMulInteger are statically quantized per-column to int8. We have observed good speed-up and no loss of accuracy with this quantization scheme inside Scan for various LSTM models. @@ -57,9 +59,11 @@ As an experiment, you may test conversion and quantization on [the BiDAF model]( Speed-up in this model is ~20% on Intel Xeon E5-1620v4 (Note that AVX2 is required for Nuphar int8 GEMV performance), when comparing CPU execution provider with the floating point model with LSTM ops, vs. the Nuphar execution provider with quantized MatMulInteger inside Scan ops. Profile shows that most of the cost is in input projection outside of Scan ops, which uses MKL SGEMM. It's worth noting that MKL int8 GEMM is about the same speed as SGEMM in this model, so quantization of SGEMMs outside of Scan won't help performance. We are looking at ways to speedup int8 GEMM for better performance on quantized models. ## JIT caching -You may cache JIT binaries to reduce model loading time spent in JIT, using [create_shared.cmd](../../onnxruntime/core/providers/nuphar/scripts/create_shared.cmd) on Windows with Visual Studio 2017, or [create_shared.sh](../../onnxruntime/core/providers/nuphar/scripts/create_shared.sh) on Linux with gcc. + +You may cache JIT binaries to reduce model loading time spent in JIT, using [create_shared.cmd](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/providers/nuphar/scripts/create_shared.cmd) on Windows with Visual Studio 2017, or [create_shared.sh](../../onnxruntime/core/providers/nuphar/scripts/create_shared.sh) on Linux with gcc. Windows + ``` REM You need to have Visual Studio 2017 for compile and link. Optionally, you can save model checksum to the output dll with FCIV tool from https://support.microsoft.com/en-us/help/841290 set NUPHAR_CACHE_PATH=\path\to\jit\cache @@ -72,6 +76,7 @@ REM Run Nuphar inference again with cached JIT dll ``` Linux + ```bash # You need to have GCC of the same version Nuphar is built with, for compile and link. 
Optionally, you can save model checksum to jit.so with md5sum export NUPHAR_CACHE_PATH=/path/to/jit/cache @@ -83,27 +88,31 @@ create_shared.sh -c /path/to/jit/cache/NUPHAR_CACHE_VERSION [-m optional_model_f # run Nuphar inference again with cached JIT dll ``` - ## Debugging ### NGEMM + NGEMM (Nuphar GEMM) is an optimized low-precision GEMM implementation based on compiler techniques. Please refer to our paper for more details of NGEMM: ["NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques"](https://arxiv.org/abs/1910.00178). #### NGEMM Tiling / Permutation Configuration + NGEMM has default tiling parameters, but users can overwrite them through environment variables: + * NUPHAR_IGEMM_TILE_M / NUPHAR_IGEMM_TILE_N / NUPHAR_IGEMM_TILE_K These 3 parameters are the tiling sizes for the corresponding dimensions of GEMM ([M x K] x [K x N]). + Setting them to different values will generate GEMM with different tiling sizes. * NUPHAR_IGEMM_PERMUTE - This enviornment variable is to control the loop permutation in GEMM. + This environment variable is to control the loop permutation in GEMM. + The default is to not apply any loop permutation. Other options are "inner/outer/all",referring to apply permutations to only inner tile loops / only outer loops / both inner and outer loops, respectively. + There are several [environment variables](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/codegen/common/settings.h) to dump debug information during code generation, plus [some more environment variables](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/providers/nuphar/common/nuphar_settings.h) to dump/control the Nuphar execution provider. You can set environment variables prior to inference to dump debug info to the console. To list some most useful ones: -There are several [environment variables](../../onnxruntime/core/codegen/common/settings.h) to dump debug information during code generation, plus [some more environment variables](../../onnxruntime/core/providers/nuphar/common/nuphar_settings.h) to dump/control the Nuphar execution provider. You can set environment variables prior to inference to dump debug info to the console. To list some most useful ones: * CODEGEN_DUMP_LOWER Dumps the lowered function from TVM. @@ -129,13 +138,14 @@ There are several [environment variables](../../onnxruntime/core/codegen/common/ Set it to "1" to dump partitions. ## Settings + When there are conflicts of environment variables running Nuphar in multiple processes, user can specify settings string when creating the Nuphar execution provider. The string comprises of comma separated key:value pairs. Keys should be lower cased environment variable names as shown above, and separated from corresponding values with colon. For example, the equivalent string of setting environment variables of NUPHAR_CACHE_PATH/NUPHAR_CACHE_MODEL_CHECKSUM would be "nuphar_cache_path:, nuphar_cache_model_checksum:". 
* Using in C/C++ Settings string could be specified when creating execution provider to specify JIT cache path, as well as model checksum: -``` +```c++ OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_Nuphar(session_options, 1, "nuphar_cache_path:/path/to/cache, nuphar_cache_model_checksum:")); ``` @@ -143,7 +153,7 @@ OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_Nuphar(session_opti Settings string could be specified when creating session options: -``` +```csharp SessionOptions.MakeSessionOptionWithNupharProvider("nuphar_cache_path:/path/to/cache, nuphar_cache_model_checksum:") ``` @@ -151,29 +161,31 @@ SessionOptions.MakeSessionOptionWithNupharProvider("nuphar_cache_path:/path/to/c Settings string should be passed in before InferenceSession is created, as providers are not currently exposed yet. Here's an example in Python to set cache path and model checksum: -``` +```python nuphar_settings = 'nuphar_cache_path:{}, nuphar_cache_model_checksum:{}'.format(cache_dir, model_checksum) onnxruntime.capi._pybind_state.set_nuphar_settings(nuphar_settings) sess = onnxruntime.InferenceSession(model_path) ``` ## Known issues + * ONNX shape inference dependency - To save runtime JIT cost, Nuphar requires models to have shape inference information from ONNX after model is loaded. Some nodes in ONNX can generate dynamic output tensor shapes from input data value, i.e. ConstantOfShape, Tile, Slice in opset 10, Compress, etc. Those ops may block ONNX shape inference and make the part of graph after such nodes not runnable in Nuphar. +To save runtime JIT cost, Nuphar requires models to have shape inference information from ONNX after model is loaded. Some nodes in ONNX can generate dynamic output tensor shapes from input data value, i.e. ConstantOfShape, Tile, Slice in opset 10, Compress, etc. Those ops may block ONNX shape inference and make the part of graph after such nodes not runnable in Nuphar. - User may use Python script [symbolic_shape_infer.py](../../onnxruntime/core/providers/nuphar/scripts/symbolic_shape_infer.py) to run symbolic shape inference in ONNX model. This script adds output tensor shapes in the model in graph.value_info field, by doing symbolic dimension computation using sympy when there are Shape ops in model. Besides, running symbolic shape inference on ONNX model would make the graph more readable. Note that when using [model_editor.py](../../onnxruntime/core/providers/nuphar/scripts/model_editor.py) to convert models with LSTM/GRU/RNN to Scan, the resulting model may have incomplete shape inference. Running symbolic_shape_infer.py is needed to get the Scan ops in the model to run in Nuphar. Besides, please note that quantization should be the last step, after verified accuracy and performance of the edited floating point model. +User may use Python script [symbolic_shape_infer.py](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/providers/nuphar/scripts/symbolic_shape_infer.py) to run symbolic shape inference in ONNX model. This script adds output tensor shapes in the model in graph.value_info field, by doing symbolic dimension computation using sympy when there are Shape ops in model. Besides, running symbolic shape inference on ONNX model would make the graph more readable. 
Note that when using [model_editor.py](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/providers/nuphar/scripts/model_editor.py) to convert models with LSTM/GRU/RNN to Scan, the resulting model may have incomplete shape inference. Running symbolic_shape_infer.py is needed to get the Scan ops in the model to run in Nuphar. Besides, please note that quantization should be the last step, after verified accuracy and performance of the edited floating point model. - In addition, user may also manually add shapes to graph.value_info using [onnx.helper.make_tensor_value_info](https://github.com/onnx/onnx/blob/v1.5.0/onnx/helper.py#L290) with model specific knowledge. For example, if you have Hardmax output casted to bool as Compress input condition, then the unknown dimension of the output of Compress is actually 1. +In addition, user may also manually add shapes to graph.value_info using [onnx.helper.make_tensor_value_info](https://github.com/onnx/onnx/blob/v1.5.0/onnx/helper.py#L290) with model specific knowledge. For example, if you have Hardmax output casted to bool as Compress input condition, then the unknown dimension of the output of Compress is actually 1. * Performance benchmark - Current Nuphar's speed-up in quantized RNNs is optimized for AVX2, when running in single thread and batch size is 1. To help understand RNN performance in different configurations, please use Python script [rnn_benchmark.py](../../onnxruntime/core/providers/nuphar/scripts/rnn_benchmark.py). For older X64 CPUs that do not support AVX2, quantized model may have worse performance than non-quantized ones. +Current Nuphar's speed-up in quantized RNNs is optimized for AVX2, when running in single thread and batch size is 1. To help understand RNN performance in different configurations, please use Python script [rnn_benchmark.py](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/providers/nuphar/scripts/rnn_benchmark.py). For older X64 CPUs that do not support AVX2, quantized model may have worse performance than non-quantized ones. * Patches to TVM - There are some changes/bug fixes in TVM for Nuphar to work properly. We are in the process of contributing them back to TVM, but for now patches are used in [our forked TVM](https://github.com/microsoft/onnxruntime-tvm). To build cleanly from scratch, please run following commands before running build.bat or build.sh: -``` +There are some changes/bug fixes in TVM for Nuphar to work properly. We are in the process of contributing them back to TVM, but for now patches are used in [our forked TVM](https://github.com/microsoft/onnxruntime-tvm). To build cleanly from scratch, please run following commands before running build.bat or build.sh: + +```bash git submodule sync git submodule foreach --recursive git stash git submodule foreach --recursive git clean -fd diff --git a/docs/reference/execution-providers/TensorRT-ExecutionProvider.md b/docs/reference/execution-providers/TensorRT-ExecutionProvider.md index 78bdc2e10b0c2..e70a9acfef61e 100644 --- a/docs/reference/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/reference/execution-providers/TensorRT-ExecutionProvider.md @@ -10,7 +10,7 @@ nav_order: 12 The TensorRT execution provider in the ONNX Runtime makes use of NVIDIA's [TensortRT](https://developer.nvidia.com/tensorrt) Deep Learning inferencing engine to accelerate ONNX model in their family of GPUs. 
Microsoft and NVIDIA worked closely to integrate the TensorRT execution provider with ONNX Runtime. -With the TensorRT execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic GPU acceleration. +With the TensorRT execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic GPU acceleration. ## Contents {: .no_toc } @@ -18,16 +18,19 @@ With the TensorRT execution provider, the ONNX Runtime delivers better inferenci * TOC placeholder {:toc} - ## Build -For build instructions, please see the [BUILD page](../../how-to/build.md#tensorrt). + +For build instructions, please see the [BUILD page](../../how-to/build.md#tensorrt). The TensorRT execution provider for ONNX Runtime is built and tested with TensorRT 7.1.3.4. ## Using the TensorRT execution provider + ### C/C++ + The TensorRT execution provider needs to be registered with ONNX Runtime to enable in the inference session. -``` + +```c++ string log_id = "Foo"; auto logging_manager = std::make_unique (std::unique_ptr{new CLogSink{}}, @@ -40,38 +43,47 @@ InferenceSession session_object{so,env}; session_object.RegisterExecutionProvider(std::make_unique<::onnxruntime::TensorrtExecutionProvider>()); status = session_object.Load(model_file_name); ``` + The C API details are [here](../api/c-api.md). #### Shape Inference for TensorRT Subgraphs + If some operators in the model are not supported by TensorRT, ONNX Runtime will partition the graph and only send supported subgraphs to TensorRT execution provider. Because TensorRT requires that all inputs of the subgraphs have shape specified, ONNX Runtime will throw error if there is no input shape info. In this case please run shape inference for the entire model first by running script [here](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/providers/nuphar/scripts/symbolic_shape_infer.py). #### Sample + This example shows how to run Faster R-CNN model on TensorRT execution provider, First, download Faster R-CNN onnx model from onnx model zoo [here](https://github.com/onnx/models/tree/master/vision/object_detection_segmentation/faster-rcnn). Second, infer shapes in the model by running shape inference script [here](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/providers/nuphar/scripts/symbolic_shape_infer.py), -``` + +```bash python symbolic_shape_infer.py --input /path/to/onnx/model/model.onnx --output /path/to/onnx/model/new_model.onnx --auto_merge ``` Third, replace original model with the new model and run onnx_test_runner tool under ONNX Runtime build directory, -``` + +```bash ./onnx_test_runner -e tensorrt /path/to/onnx/model/ ``` ### Python + When using the Python wheel from the ONNX Runtime build with TensorRT execution provider, it will be automatically prioritized over the default GPU or CPU execution providers. There is no need to separately register the execution provider. Python APIs details are . -#### Sample -Please see [this Notebook](../python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) for an example of running a model on GPU using ONNX Runtime through Azure Machine Learning Services. +#### Python Sample + +Please see [this Notebook](https://github.com/microsoft/onnxruntime/blob/master/docs/python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) for an example of running a model on GPU using ONNX Runtime through Azure Machine Learning Services. 
## Performance Tuning + For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](../../how-to/tune-performance.md) When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e tensorrt` ## Configuring environment variables + There are four environment variables for TensorRT execution provider. ORT_TENSORRT_MAX_WORKSPACE_SIZE: maximum workspace size for TensorRT engine. diff --git a/docs/resources/compatibility.md b/docs/resources/compatibility.md index 00c681e8348b3..7fe7e5b48db71 100644 --- a/docs/resources/compatibility.md +++ b/docs/resources/compatibility.md @@ -11,8 +11,8 @@ Supporting models based on the standard [ONNX](https://onnx.ai) format, the runt * [Getting ONNX models - tutorials](https://github.com/onnx/tutorials#getting-onnx-models) -ONNX Runtime is up to date and backwards compatible with all operators (both DNN and traditional ML) since ONNX v1.2.1+. [(ONNX compatibility details)](docs/Versioning.md). Newer versions of ONNX Runtime support all models that worked with prior versions, so updates should not break integrations. +ONNX Runtime is up to date and backwards compatible with all operators (both DNN and traditional ML) since ONNX v1.2.1+. [(ONNX compatibility details)](docs/Versioning.md). Newer versions of ONNX Runtime support all models that worked with prior versions, so updates should not break integrations. -* [Supported operators/types](resources/operators/OperatorKernels.md) - * *Operators not supported in the current ONNX spec may be available as a [Contrib Operator](resource/operators/ContribOperators.md)* -* [Extensibility: Add a custom operator/kernel](docs/AddingCustomOp.md) +* [Supported operators/types](https://github.com/microsoft/onnxruntime/blob/master/docs/OperatorKernels.md) + * *Operators not supported in the current ONNX spec may be available as a [Contrib Operator](https://github.com/microsoft/onnxruntime/blob/master/docs/ContribOperators.md)* +* [Extensibility: Add a custom operator/kernel](../how-to/add-custom-op.md) diff --git a/docs/resources/high-level_design.md b/docs/resources/high-level-design.md similarity index 98% rename from docs/resources/high-level_design.md rename to docs/resources/high-level-design.md index ed3548ae6c2bf..eba475ae4a16b 100644 --- a/docs/resources/high-level_design.md +++ b/docs/resources/high-level-design.md @@ -73,8 +73,9 @@ the default execution provider or other registered execution providers. The ONNXRuntime execution engine is responsible for running this graph. ## Key design decisions + * Multiple threads can invoke the Run() method on the same -inference session object. See [API doc](C_API.md) for more details. +inference session object. See [API doc](../reference/api/c-api.md) for more details. * To facilitate this, the Compute() function of all kernels is const implying the kernels are stateless. * Implementations of the operators by execution providers are called diff --git a/docs/tutorials/fasterrcnn_csharp.md b/docs/tutorials/fasterrcnn_csharp.md index c7b6044c04739..356a2435abe6f 100644 --- a/docs/tutorials/fasterrcnn_csharp.md +++ b/docs/tutorials/fasterrcnn_csharp.md @@ -8,7 +8,7 @@ nav_order: 3 The sample walks through how to run a pretrained Faster R-CNN object detection ONNX model using the ONNX Runtime C# API. -The source code for this sample is available [here](Program.cs). 
+The source code for this sample is available [here](https://github.com/microsoft/onnxruntime/blob/master/csharp/sample/Microsoft.ML.OnnxRuntime.FasterRcnnSample/Program.cs). ## Contents {: .no_toc } diff --git a/docs/tutorials/mnist_java.md b/docs/tutorials/mnist_java.md index 73058029a36d4..2e82479e2e863 100644 --- a/docs/tutorials/mnist_java.md +++ b/docs/tutorials/mnist_java.md @@ -6,43 +6,52 @@ nav_order: 5 # Character recognition with MNIST in Java {: .no_toc } -Here is simple tutorial for getting started with running inference on an existing ONNX model for a given input data. The model is typically trained using any of the well-known training frameworks and exported into the ONNX format. +Here is simple tutorial for getting started with running inference on an existing ONNX model for a given input data. The model is typically trained using any of the well-known training frameworks and exported into the ONNX format. + Note the code presented below uses syntax available from Java 10 onwards. The Java 8 syntax is similar but more verbose. To start a scoring session, first create the `OrtEnvironment`, then open a session using the `OrtSession` class, passing in the file path to the model as a parameter. - + +```java var env = OrtEnvironment.getEnvironment(); var session = env.createSession("model.onnx",new OrtSession.SessionOptions()); +``` Once a session is created, you can execute queries using the `run` method of the `OrtSession` object. At the moment we support `OnnxTensor` inputs, and models can produce `OnnxTensor`, `OnnxSequence` or `OnnxMap` outputs. The latter two are more likely when scoring models produced by frameworks like scikit-learn. The run call expects a `Map` where the keys match input node names stored in the model. These can be viewed by calling `session.getInputNames()` or `session.getInputInfo()` on an instantiated session. The run call produces a `Result` object, which contains a `Map` representing the output. The `Result` object is `AutoCloseable` and can be used in a try-with-resources statement to prevent references from leaking out. Once the `Result` object is closed, all it's child `OnnxValue`s are closed too. - + +```java OnnxTensor t1,t2; var inputs = Map.of("name1",t1,"name2",t2); try (var results = session.run(inputs)) { - // manipulate the results - } + // manipulate the results + } +``` You can load your input data into OnnxTensor objects in several ways. The most efficient way is to use a `java.nio.Buffer`, but it's possible to use multidimensional arrays too. If constructed using arrays the arrays must not be ragged. +```java FloatBuffer sourceData; // assume your data is loaded into a FloatBuffer long[] dimensions; // and the dimensions of the input are stored here var tensorFromBuffer = OnnxTensor.createTensor(env,sourceData,dimensions); float[][] sourceArray = new float[28][28]; // assume your data is loaded into a float array var tensorFromArray = OnnxTensor.createTensor(env,sourceArray); +``` -Here is a [complete sample program](../java/src/test/java/sample/ScoreMNIST.java) that runs inference on a pretrained MNIST model. +Here is a [complete sample program](https://github.com/microsoft/onnxruntime/blob/master/java/src/test/java/sample/ScoreMNIST.java) that runs inference on a pretrained MNIST model. ## Running on a GPU or with another provider (Optional) To enable other execution providers like GPUs simply turn on the appropriate flag on SessionOptions when creating an OrtSession. 
+```java int gpuDeviceId = 0; // The GPU device ID to execute on var sessionOptions = new OrtSession.SessionOptions(); sessionOptions.addCUDA(gpuDeviceId); var session = environment.createSession("model.onnx", sessionOptions); +``` The execution providers are preferred in the order they were enabled. diff --git a/docs/tutorials/samples_catalog.md b/docs/tutorials/samples_catalog.md index 954558edac76c..c896b0beb2b0f 100644 --- a/docs/tutorials/samples_catalog.md +++ b/docs/tutorials/samples_catalog.md @@ -30,8 +30,8 @@ This page catalogs code samples for ONNX Runtime, running locally, and on Azure, ## C/C++ * [C: SqueezeNet](https://github.com/microsoft/onnxruntime/tree/master/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp) -* [C++: model-explorer](https://github.com/microsoft/onnxruntime/tree/master/c_cxx/model-explorer) - single and batch processing -* [C++: SqueezeNet](https://github.com/microsoft/onnxruntime/tree/mastercsharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp) +* [C++: model-explorer](https://github.com/microsoft/onnxruntime/tree/master/samples/c_cxx/model-explorer) - single and batch processing +* [C++: SqueezeNet](https://github.com/microsoft/onnxruntime/tree/master/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp) ## Java @@ -41,7 +41,6 @@ This page catalogs code samples for ONNX Runtime, running locally, and on Azure, * [Inference with Nodejs](https://github.com/microsoft/onnxruntime/tree/master/samples/nodejs) - --- ## Azure Machine Learning @@ -58,7 +57,7 @@ This page catalogs code samples for ONNX Runtime, running locally, and on Azure, * Inferencing on **CPU** with model conversion for existing (CoreML) model: * [TinyYolo](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-convert-aml-deploy-tinyyolo.ipynb) * Inferencing on **GPU** with **TensorRT** Execution Provider (AKS): - * [FER+](.https://github.com/microsoft/onnxruntime/tree/master/docs/python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) + * [FER+](https://github.com/microsoft/onnxruntime/tree/master/docs/python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) ## Azure IoT Edge @@ -79,5 +78,4 @@ This page catalogs code samples for ONNX Runtime, running locally, and on Azure, ## ML.NET [Object Detection with ONNX Runtime in ML.NET](https://docs.microsoft.com/en-us/dotnet/machine-learning/tutorials/object-detection-onnx) ---- - +--- \ No newline at end of file diff --git a/images/mkl-dnn_node.png b/images/mkl-dnn_node.png new file mode 100644 index 0000000000000..d2863f8938143 Binary files /dev/null and b/images/mkl-dnn_node.png differ