Gracefully notify orchestrator in case of a panic in transcoder #2094

leszko · 2021-11-05T15:31:06Z

What does this pull request do? Explain your changes. (required)

Fix the transcoder so that it sends an error to the orchestrator before crashing. Previously, when the transcoder crashed after a panic(), the orchestrator had to wait for a timeout before trying another transcoder. This change applies only to the split O+T topology.

Specific updates (required)

Add recover() to the transcoder functions (for both Nvidia and standard CPU transcoding)
Handle UnrecoverableError in ot_rpc.go: send the error message to the orchestrator first, and only then panic and crash

How did you test each of these updates (required)

Add an artificial panic(errors.New("Some error")) inside the transcoder.go#Transcode() function
Build the project
Start an orchestrator + 2 transcoders
Start a broadcaster and start streaming a video
The first transcoder crashes, the orchestrator is notified and selects the second transcoder without waiting for the timeout

Without this PR, the same test makes the orchestrator wait for the timeout before selecting the second transcoder.

Does this pull request close any open issues?

fix #2088

Checklist:

Read the contribution guide
make runs successfully
All tests in ./test.sh pass
README and other documentation updated
Pending changelog updated

yondonfu

Looks great! Just a few minor nit comments.

Also, I believe the changes in LocalTranscoder and NvidiaTranscoder should be sufficient, but I wanted to just note that when the node does GPU transcoding, it actually uses the LoadBalancingTranscoder type which also implements the Transcoder interface. Under the hood, the LoadBalancingTranscoder uses NvidiaTranscoder [1]. When LoadBalancingTranscoder's Transcode() method is used, there is actually a nested NvidiaTranscoder.Transcode() call in a goroutine. I believe a panic in NvidiaTranscoder's Transcode call would be caught and returned as an error to LoadBalancingTranscoder and then it would be handled in the ot_rpc.go code the same way that it would be handled if the error was returned directly by NvidiaTranscoder (but feel free to correct me here if there is any disagreement on the expected behavior!).

[1] See this comment for more details #2070 (comment)

core/transcoder.go

server/ot_rpc.go

core/transcoder_test.go

leszko · 2021-11-08T09:27:11Z

I believe a panic in NvidiaTranscoder's Transcode call would be caught and returned as an error to LoadBalancingTranscoder and then it would be handled in the ot_rpc.go code the same way that it would be handled if the error was returned directly by NvidiaTranscoder (but feel free to correct me here if there is any disagreement on the expected behavior!).

Yes, you're right, it'll work exactly as you expected. I just double-checked it. We could consider adding a unit test in lb_test.go, however, I think we already test returning a custom error in lb_test.go#TestLB_ConcurrentSessionErrors(). So, IMO the scenario you described is already covered.

leszko

Thanks for the review @yondonfu. I addressed your comments. PTAL.

core/transcoder.go

core/transcoder_test.go

server/ot_rpc.go

yondonfu

Looks great! Last thing before merging...

We typically like to prefix commit messages with the package that the commit updates (see this doc - we loosely follow the guidelines described) as it's helpful when sifting through commit history later on. Would be great to update the commits accordingly and to also squash the commits updating CHANGELOG_PENDING into a single one [1]. Then we're good to go!

[1] We've typically saved this process for the end of the PR review process, and at that time rebasing/modifying commit history is considered fair game (generally we try to avoid doing so in the middle of the review process just for the reviewer's convenience, although there are exceptions).

leszko · 2021-11-09T09:00:43Z

We typically like to prefix commit messages with the package that the commit updates (see this doc - we loosely follow the guidelines described) as it's helpful when sifting through commit history later on. Would be great to update the commits accordingly and to also squash the commits updating CHANGELOG_PENDING into a single one [1]. Then we're good to go!

[1] We've typically saved this process for the end of the PR review process, and at that time rebasing/modifying commit history is considered fair game (generally we try to avoid doing so in the middle of the review process just for the reviewer's convenience, although there are exceptions).

Ok, good to know. Then, I think that this PR should be a single commit with the message [core+server] Gracefully notify orchestrator in case of panic in transcoder. When the PR is approved (by both @yondonfu and @jailuthra), I'll just click "Squash and merge" and update the commit message there. Is that ok?

jailuthra

LGTM! 🚢 🥳

When the PR is approved [..], I'll just click "Squash and merge" and update the commit message there.

Yeah 👍 I think it's okay to squash and merge using the GitHub UI and edit the commit message there, given the changes in this PR don't touch many things.

For more complicated changes it's preferred to run a local rebase+squash to split independent things into separate commits, and force-push to the PR branch before merging.

I think that this PR should be a single commit with the message [core+server] Gracefully notify orchestrator in case of panic in transcoder

The message can probably be formatted as core,server: Gracefully notify.. or core+server: Gracefully notify.. to be closer to the rest of the commit history. I don't think we've used the square brackets [] before.

yondonfu

LGTM! 🥳

leszko added 5 commits November 5, 2021 10:26

Add recovering from panic to transcoder

93e577a

Add recovering from panic to transcoder and OT RPC server

7af47dc

Update CHANGELOG_PENDING.md

5569cd1

Update CHANGELOG_PENDING.md

5106c43

Update CHANGELOG_PENDING.md

ade0f7d

leszko requested review from yondonfu and jailuthra November 5, 2021 15:35

yondonfu reviewed Nov 7, 2021

Update unit test in transcoder_test.go

75d34f5

leszko commented Nov 8, 2021

leszko requested a review from yondonfu November 8, 2021 10:11

yondonfu reviewed Nov 9, 2021

leszko requested a review from yondonfu November 9, 2021 09:00

jailuthra approved these changes Nov 9, 2021

yondonfu approved these changes Nov 9, 2021

leszko merged commit f161de2 into livepeer:master Nov 9, 2021

leszko deleted the 2088-return-error-before-panic branch November 9, 2021 14:22

leszko mentioned this pull request Nov 15, 2021

Split O/T: no transcoders available on O connected to multiple Ts after a single T restart due to CUDA_ERROR_ILLEGAL_ADDRESS #2079

Closed

leszko mentioned this pull request Apr 5, 2022

core: Fix standalone orchestrator not crashing under UnrecoverableError #2352

Merged

yondonfu mentioned this pull request Nov 14, 2022

Only mark CUDA_ERROR_ILLEGAL_ADDRESS errors as unrecoverable errors livepeer/lpms#356

Open
