python session.run() fallback to CPU/CUDA provider for EP failures. #1960

jywu-msft · 2019-09-30T21:22:41Z

Python API session.run() fallback mechanism.

Problem Statement: Execution Provider internal failures cause session.run() to fail.
An execution provider's GetCapability() implementation is supposed to report accurately which subgraphs of an ONNX model the EP can execute. For third-party EPs, however, the response is sometimes inaccurate, so ONNX Runtime assigns a subgraph to an EP that then fails at runtime.

To satisfy ORT's requirement to be able to execute all valid ONNX models, we introduce a fallback/retry mechanism for the session.run() API, which is enabled by default.

If run() fails due to an internal EP failure, the session is recreated and the call is retried using the default execution providers: ['CPUExecutionProvider'], or ['CUDAExecutionProvider', 'CPUExecutionProvider'] if GPU capable.
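The choice of default providers described above can be sketched as follows. This is an illustration only: the function name `default_fallback_providers` is hypothetical, and treating "GPU capable" as "CUDAExecutionProvider appears in the list of available providers" is an assumption, since the description does not say how the PR makes that check.

```python
def default_fallback_providers(available_providers):
    """Return the provider list to fall back to after an EP failure.

    Hypothetical helper: "GPU capable" is assumed here to mean that
    CUDAExecutionProvider is among the available providers.
    """
    if "CUDAExecutionProvider" in available_providers:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


# A TensorRT-enabled build falls back to CUDA first, then CPU.
gpu_fallback = default_fallback_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)

# An nGraph build without CUDA falls back to CPU only.
cpu_fallback = default_fallback_providers(
    ["NGRAPHExecutionProvider", "CPUExecutionProvider"]
)
```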

For example, if session.run() is invoked and a subgraph is assigned to TensorrtExecutionProvider but fails during execution, the session is recreated and the run is retried using CUDAExecutionProvider.
Similarly, if NGRAPHExecutionProvider fails, the session is recreated and session.run() is retried using CPUExecutionProvider.

The session is recreated at most once. The retry happens only when session.run() throws an EPFail exception; other exceptions thrown by ORT do not trigger it.
Once the session has been recreated with the default providers, subsequent run() invocations keep using those providers, which prevents the same failure from happening again.
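A minimal, self-contained sketch of the retry-once behaviour described above. It does not import onnxruntime: `EPFail`, `FallbackSession`, the `make_session` factory, and `_FakeSession` are stand-ins invented for illustration, not the code in session.py.

```python
class EPFail(Exception):
    """Stand-in for the exception raised on an internal EP failure."""


class FallbackSession:
    """Hypothetical wrapper that retries run() once on fallback providers."""

    def __init__(self, make_session, providers, fallback_providers,
                 enable_fallback=True):
        # make_session is an assumed factory: providers -> object with run().
        self._make_session = make_session
        self._providers = list(providers)
        self._fallback_providers = list(fallback_providers)
        self._enable_fallback = enable_fallback
        self._sess = make_session(self._providers)

    def run(self, output_names, input_feed, run_options=None):
        try:
            return self._sess.run(output_names, input_feed, run_options)
        except EPFail:
            if not self._enable_fallback:
                raise
            # Recreate the session with the default providers and keep it,
            # so later calls do not hit the failing EP again.
            self._providers = list(self._fallback_providers)
            self._sess = self._make_session(self._providers)
            # Recreate at most once: a second EPFail propagates to the caller.
            self._enable_fallback = False
            return self._sess.run(output_names, input_feed, run_options)


class _FakeSession:
    """Test double: fails whenever the nGraph provider is in use."""

    def __init__(self, providers):
        self.providers = list(providers)

    def run(self, output_names, input_feed, run_options=None):
        if "NGRAPHExecutionProvider" in self.providers:
            raise EPFail("simulated nGraph failure")
        return ["ok"]


created = []  # records the provider list of every session that gets built


def factory(providers):
    created.append(list(providers))
    return _FakeSession(providers)


sess = FallbackSession(
    factory,
    ["NGRAPHExecutionProvider", "CPUExecutionProvider"],
    ["CPUExecutionProvider"],
)
first = sess.run(None, {})   # fails on nGraph, then succeeds on CPU
second = sess.run(None, {})  # reuses the recreated CPU session
```

Only `EPFail` is caught, so any other exception raised by the underlying session propagates unchanged, matching the rule that other ORT errors do not trigger a retry.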

For this PR, NGRAPHExecutionProvider was updated to return the proper EPFail status code.
Other EPs will be updated in subsequent PRs.

Tested against previously failing ONNX backend tests (e.g. test_hardmax_negative_axis), which fail when using nGraph. With this PR, the run falls back, retries on CPU, and succeeds.

HectorSVC · 2019-10-01T18:00:36Z

onnxruntime/python/session.py

+        try:
+            return self._sess.run(output_names, input_feed, run_options)
+        except C.EPFail as err:
+            if self._enable_fallback:


Is it possible that, when only the CPU provider is in use and a run fails, it will be retried on CPU one more time?

No. The intent is to retry only on an EPFail status, which should only be returned from an EP's Compile() or compute_func.

jywu-msft added 7 commits September 30, 2019 09:09

py fallback initial commit.

5a3b995

fixes.

f98d1de

update NGRAPHCustomOp::Initialize() to return Status

b33f68c

fixes in session.py

707716b

FAIL status to EP_FAIL in ngraph custom op

e5520ef

disable fallback for backend api

7fe768a

Merge branch 'jywu_py_fallback' of https://github.com/Microsoft/onnxr…

c546361

…untime into jywu_py_fallback

jywu-msft requested a review from pranavsharma September 30, 2019 21:22

jywu-msft requested a review from a team as a code owner September 30, 2019 21:22

jywu-msft requested a review from HectorSVC October 1, 2019 16:24

HectorSVC reviewed Oct 1, 2019

HectorSVC approved these changes Oct 1, 2019

pranavsharma approved these changes Oct 2, 2019

jywu-msft merged commit f9bf546 into master Oct 2, 2019

jywu-msft deleted the jywu_py_fallback branch October 2, 2019 09:38

MedoX71T mentioned this pull request Apr 6, 2024

[Snyk] Upgrade protobufjs from 7.2.4 to 7.2.6 MedoX71T/onnxruntime#1

Open

Piyush-Bhor mentioned this pull request May 21, 2024

[Snyk] Upgrade protobufjs from 7.2.5 to 7.2.6 Piyush-Bhor/onnxruntime#4

Open
