-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gracefully notify orchestrator in case of a panic in transcoder #2094
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just a few minor nit comments.
Also, I believe the changes in LocalTranscoder
and NvidiaTranscoder
should be sufficient, but I wanted to just note that when the node does GPU transcoding, it actually uses the LoadBalancingTranscoder type which also implements the Transcoder
interface. Under the hood, the LoadBalancingTranscoder
uses NvidiaTranscoder
[1]. When LoadBalancingTranscoder
's Transcode() method is used, there is actually a nested NvidiaTranscoder.Transcode() call in a goroutine. I believe a panic in NvidiaTranscoder
's Transcode
call would be caught and returned as an error to LoadBalancingTranscoder
and then it would be handled in the ot_rpc.go
code the same way that it would be handled if the error was returned directly by NvidiaTranscoder
(but feel free to correct me here if there is any disagreement on the expected behavior!).
[1] See this comment for more details #2070 (comment)
Yes, you're right, it'll work exactly as you expected. I just double-checked it. We could consider adding a unit test in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @yondonfu . I addressed your comments. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Last thing before merging...
We typically like to prefix commit messages with the package that the commit updates (see this doc - we loosely follow the guidelines described) as its helpful when sifting through commit history later on. Would be great to update the commits accordingly and to also squash the commits updating CHANGELOG_PENDING into a single one [1]. Then we're good to go!
[1] We've typically saved this process for the end of the PR review process and at that time rebasing/modifying commit history is considered fair game (generally we try to avoid doing so in middle of the review process just for the reviewer's convenience although there are exceptions).
Ok, good to know. Then, I think that this PR should be a single commit with the message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚢 🥳
When the PR is approved [..], I'll just click "Squash and merge" and update the commit message there.
Yeah 👍 I think it's okay to squash and merge using github UI and edit the commit message there, given the changes in this PR don't touch many things.
For more complicated changes it's preferred to run a local rebase+squash to split independent things in separate commits, and force-push on the PR branch before merging.
I think that this PR should be a single commit with the message
[core+server] Gracefully notify orchestrator in case of panic in transcoder
The message can probably be formatted as core,server: Gracefully notify..
or core+server: Gracefully notify..
to be closer to rest of the commit history. I don't think we've used the square brackets []
before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🥳
What does this pull request do? Explain your changes. (required)
Fix transcoder to send an error to orchestrator before crashing. Before, when the transcoder crashed after
panic()
, the orchestrator had to time out before trying another transcoder. The change is related only to the split O+T topology.Specific updates (required)
recover()
in to thetranscoder
functions (both for Nvidia and the standard CPU transcoding)UnrecoverableError
inot_rpc.go
to send the error message to orchestrator and only then to panic and crashHow did you test each of these updates (required)
panic(errors.New("Some error"))
insidetranscoder.go#Transcode()
functionThe same test without this PR, makes orchestrator wait for the timeout before selecting the second transcoder.
Does this pull request close any open issues?
fix #2088
Checklist:
make
runs successfully./test.sh
pass