
Split O/T: no transcoders available on O connected to multiple Ts after a single T restart due to CUDA_ERROR_ILLEGAL_ADDRESS #2079

Closed
yondonfu opened this issue Oct 30, 2021 · 13 comments


@yondonfu (Member)

Had a transcoder crash on the newest version of livepeer due to CUDA_ERROR_ILLEGAL_ADDRESS. The transcoder restarted, but the Orchestrator could not recover and was stuck with err=no transcoders available. The other two Transcoders did not show any errors; they just stopped transcoding. Here are the relevant log parts from the O and Ts:

See https://discord.com/channels/423160867534929930/426114749370204170/903963246749446184 for attached logs.

@leszko (Contributor) commented Nov 15, 2021

Attaching the logs from Discord.
cuda_error.txt

@leszko (Contributor) commented Nov 15, 2021

I spent some time on the issue, but I was not able to reproduce it. If the issue happens again, could someone add the steps to reproduce?

It's possible that we won't see the issue after the next release because of fix #2094, but to be honest I'm not sure about that. There are actually two odd things happening in the logs.

1. The orchestrator is not able to recover after a transcoder crashes.

This points to the function selectTranscoder() in orchestrator.go.

@reubenr0d @darkdarkdragon do you have any clue why this function might return ErrNoTranscodersAvailable even though new transcoders register with the orchestrator?

2. Transcoders 1&2 keep hitting transcode loop timeouts.

Transcoder 3 fails; that's fine. However, the weird thing is that at this point Transcoders 1&2 keep printing these logs:

Oct 30 05:51:36  livepeer[2159602]: I1030 05:51:36.162803 2159602 lb.go:181] LB: Transcode loop timed out for 9043b617_0
Oct 30 05:51:36  livepeer[2159602]: I1030 05:51:36.228699 2159602 lb.go:123] LB: Deleted transcode session for 9043b617_0

It's as if they sent some data to transcode, it failed, and they are still waiting in the loop for the result...

@yondonfu (Member, Author)

do you have any clue why this function might return ErrNoTranscodersAvailable even though new transcoders register with the orchestrator?

I'm not sure if this could be the cause of the problem here, but it looks like the one scenario where selectTranscoder() does return ErrNoTranscodersAvailable even when there are transcoders available is here - if the transcoder at the head of the transcoder queue is at capacity, selectTranscoder() assumes that the rest of the transcoders in the queue are at capacity as well and just returns ErrNoTranscodersAvailable. This makes me wonder whether the load for transcoders isn't being updated correctly, so that it looks like all the transcoders are at capacity even when they're not.
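To make that failure mode concrete, here is a minimal Go sketch. It is not the actual go-livepeer code; the remoteTranscoder type, the queue ordering, and the load/capacity fields are illustrative assumptions. It only shows how a selection routine that inspects nothing but the head of the queue reports the whole pool as full whenever that one entry carries a stale load value:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoTranscodersAvailable = errors.New("no transcoders available")

// remoteTranscoder is an illustrative stand-in for the orchestrator's view of
// a connected T: its current load and its advertised capacity.
type remoteTranscoder struct {
	addr     string
	load     int
	capacity int
}

// selectTranscoderSketch mirrors the behavior described above: it inspects
// only the transcoder at the head of the queue, and if that entry reports
// load >= capacity it gives up instead of scanning the rest of the queue.
func selectTranscoderSketch(queue []*remoteTranscoder) (*remoteTranscoder, error) {
	if len(queue) == 0 {
		return nil, errNoTranscodersAvailable
	}
	head := queue[0]
	if head.load >= head.capacity {
		// If head.load was never decremented after its sessions died
		// (e.g. the T crashed mid-stream), every call lands here even
		// though other Ts in the queue have free slots.
		return nil, errNoTranscodersAvailable
	}
	head.load++
	return head, nil
}

func main() {
	queue := []*remoteTranscoder{
		{addr: "T1:59014", load: 10, capacity: 10}, // stale load from a crashed T
		{addr: "T2:34567", load: 0, capacity: 10},  // idle, but never even considered
	}
	if _, err := selectTranscoderSketch(queue); err != nil {
		fmt.Println("err =", err) // "no transcoders available" despite T2 being idle
	}
}
```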

However, the weird thing is that at this point Transcoders 1&2 keep printing these logs.

These logs are expected whenever a transcoder hasn't received any segments for a session for a period of time. At that point, the transcoder will consider the session as timed out and will remove the session and clean up any associated state/resources for it. So, in this case, those logs would indicate that the transcoders had not received any segments for a bunch of sessions in a while, so those sessions were all removed.
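For context, here is a hedged sketch of the per-session loop behavior behind the "LB: Transcode loop timed out" / "LB: Deleted transcode session" lines. The names (transcodeLoopSketch, segCh, sessionTimeout, cleanup) are illustrative rather than the real lb.go symbols; the point is only that a session with no incoming segments eventually times out and gets deleted, which is benign on its own:

```go
package main

import (
	"fmt"
	"time"
)

// transcodeLoopSketch waits for segments on segCh; if nothing arrives within
// sessionTimeout, the session is treated as idle, the loop exits, and the
// caller's cleanup deletes the session -- mirroring the two log lines above.
func transcodeLoopSketch(sessionID string, segCh <-chan string, sessionTimeout time.Duration, cleanup func(string)) {
	defer cleanup(sessionID)
	for {
		select {
		case seg, ok := <-segCh:
			if !ok {
				return
			}
			fmt.Printf("transcoding %s for session %s\n", seg, sessionID)
		case <-time.After(sessionTimeout):
			fmt.Printf("LB: Transcode loop timed out for %s\n", sessionID)
			return
		}
	}
}

func main() {
	segCh := make(chan string)
	done := make(chan struct{})
	go func() {
		defer close(done)
		transcodeLoopSketch("9043b617_0", segCh, 100*time.Millisecond, func(id string) {
			fmt.Printf("LB: Deleted transcode session for %s\n", id)
		})
	}()
	<-done // no segments are ever sent, so the loop times out and cleans up
}
```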

@leszko (Contributor) commented Nov 15, 2021

Thanks @yondonfu for the comment.

I'm not sure if this could be the cause of the problem here, but it looks like the one scenario where selectTranscoder() does return ErrNoTranscodersAvailable even when there are transcoders available is here - if the transcoder at the head of the transcoder queue is at capacity, selectTranscoder() assumes that the rest of the transcoders in the queue are at capacity as well and just returns ErrNoTranscodersAvailable. This makes me wonder whether the load for transcoders isn't being updated correctly, so that it looks like all the transcoders are at capacity even when they're not.

That may be the case. I spent some time checking it, but cannot find anything obvious.

These logs are expected whenever a transcoder hasn't received any segments for a session for a period of time. At that point, the transcoder will consider the session as timed out and will remove the session and clean up any associated state/resources for it. So, in this case, those logs would indicate that the transcoders had not received any segments for a bunch of sessions in a while, so those sessions were all removed.

Ok, I see. Then, the transcoder loop looks ok.

@0xVires commented Nov 21, 2021

Not sure if it's related, but I just had the following situation:

  • T1 (on the same machine as the O) failed with Fatal error with remote transcoder=T1:59014 taskId=14531293 fname=https://T1:8935/stream/2c3da27f/5.tempfile err=Remote transcoder took too long and got removed from the transcoder map.
  • This somehow triggered the no transcoders available error, and T2 & T3 got stuck with the Transcode loop timed out & Deleted transcode session issue
  • T1 restarted and started transcoding again
  • T2 & T3 remained stuck with the issue mentioned above; only a restart helped

Here's the full log of the O and the Ts (bottom)
errors.txt
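For readers following along, the "Remote transcoder took too long" error in the first bullet is, generically, a deadline that expires while the orchestrator waits for the remote T's result, after which the T is dropped from the map. The sketch below is an assumption-laden illustration, not the go-livepeer implementation; resultCh, transcodeTimeout, and removeFromMap are made-up names:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// transcodeSegmentSketch waits for the remote transcoder's result with a
// deadline. On timeout it removes the transcoder from the live map, which is
// roughly the sequence reported in the first bullet above.
func transcodeSegmentSketch(resultCh <-chan []byte, transcodeTimeout time.Duration, removeFromMap func()) ([]byte, error) {
	select {
	case res := <-resultCh:
		return res, nil
	case <-time.After(transcodeTimeout):
		removeFromMap()
		return nil, errors.New("Remote transcoder took too long")
	}
}

func main() {
	resultCh := make(chan []byte) // the remote T never answers in this example
	_, err := transcodeSegmentSketch(resultCh, 100*time.Millisecond, func() {
		fmt.Println("removing T1:59014 from the transcoder map")
	})
	fmt.Println("err =", err)
}
```

On its own, dropping the slow T is expected behavior; the open question in this thread is why the remaining Ts then stop receiving work.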

@leszko (Contributor) commented Nov 22, 2021

@0xVires Thanks for the comment and the logs. Were you able to repeat the same scenario, or did it just happen once? I'm looking for some steps to reproduce.

@0xVires commented Nov 22, 2021

No, I'm not able to reproduce it since I don't know what triggered the fatal error in the first place... My setup has been running without any issues for the past two weeks.

I have the following setup:
Machine 1: O and T
Machine 2: T
Machine 3: T

So maybe you can try setting up something similar and crashing one of the Ts to see whether the other Ts and/or the O are affected?

@leszko (Contributor) commented Nov 22, 2021

So maybe you can try setting up something similar and crashing one of the Ts to see whether the other Ts and/or the O are affected?

I already tried it, but I cannot reproduce the issue. I kill one transcoder, but the others keep working correctly...

@payton commented Nov 29, 2021

I've also experienced a similar issue. In this case, I had 3 of my 4 Ts actively running maybe 9 sessions. The intent was to downscale and check how Livepeer handled these cutover situations.

  1. I killed the LP process on T1, which wasn't transcoding at the time. The O successfully deregistered it, so there were no issues.
  2. I killed the LP process on T2, which was transcoding at the time. The O stated that it removed T2 from the live transcoders map and then continued to error on all sessions with no transcoders available. I believe the 9 active sessions were distributed across the transcoders and not only on that one, but I cannot back this up with anything at the moment.

The logs on my O at the time of step 2 were as follows:

Nov 28 03:03:59 ip-172-31-49-222 livepeer.sh[26422]: E1128 03:03:59.151886   26434 orchestrator.go:847] Stream closed for transcoder=172.31.59.215:54434, err=context canceled
Nov 28 03:03:59 ip-172-31-49-222 livepeer.sh[26422]: I1128 03:03:59.151918   26434 orchestrator.go:865] Got transcoder=172.31.59.215:54434 eof, removing from live transcoders map
...

Followed by

...
Nov 28 03:04:00 ip-172-31-49-222 livepeer.sh[26422]: E1128 03:04:00.173242   26434 orchestrator.go:559] Error transcoding manifestID=a445af74-9c4f-4dec-b972-16d567245607 sessionID=172e3af2 segNo=5814 segName=https://orchestrator.lib.community:8935/stream/172e3af2/5814.tempfile err=no transcoders available
Nov 28 03:04:00 ip-172-31-49-222 livepeer.sh[26422]: E1128 03:04:00.173551   26434 segment_rpc.go:214] Could not transcode manifestID=a445af74-9c4f-4dec-b972-16d567245607 sessionID=172e3af2 seqNo=5814 err=no transcoders availabl
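Reading these logs together with the earlier hypothesis about load accounting, one plausible (but unconfirmed) mechanism is that a slot reserved for a segment is only released on the happy path, so segments lost to a crashed or disconnected T leak load until the remaining pool looks permanently full. Below is a minimal sketch of that idea, with illustrative names, not the actual orchestrator.go logic:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNoTranscodersAvailable = errors.New("no transcoders available")

type transcoder struct {
	addr     string
	load     int
	capacity int
}

// transcodeSeg reserves a slot before dispatching a segment and releases it
// only when a result comes back. If the remote T dies mid-segment, the
// result never arrives and -- in this deliberately buggy sketch -- the slot
// is never released, so load creeps up with every lost segment.
func transcodeSeg(t *transcoder, resultCh <-chan []byte, timeout time.Duration) error {
	t.load++
	select {
	case <-resultCh:
		t.load-- // only the happy path releases the slot
		return nil
	case <-time.After(timeout):
		// A missing t.load-- here (or on stream EOF) is the kind of leak
		// that would make the pool look full forever.
		return errors.New("Remote transcoder took too long")
	}
}

// selectAny returns any transcoder that still has spare capacity.
func selectAny(pool []*transcoder) (*transcoder, error) {
	for _, t := range pool {
		if t.load < t.capacity {
			return t, nil
		}
	}
	return nil, errNoTranscodersAvailable
}

func main() {
	t := &transcoder{addr: "172.31.59.215:54434", capacity: 2}
	pool := []*transcoder{t}
	dead := make(chan []byte) // the remote T crashed; no results ever arrive
	for i := 0; i < 2; i++ {
		_ = transcodeSeg(t, dead, 10*time.Millisecond)
	}
	if _, err := selectAny(pool); err != nil {
		fmt.Println("err =", err) // the pool now looks permanently full
	}
}
```

If something along these lines is happening, it would also help explain why only a restart clears the condition, since restarting resets the counters.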

@leszko (Contributor) commented Nov 29, 2021

Thanks for the input @payton. I think this is an important issue to address; however, I tried a few times under load (like you did, 3 of 4 Ts and 9 sessions), but I still never encountered the issue.

Is it something you were able to reproduce, or did it happen only once?

@payton commented Nov 30, 2021

@leszko It only happened once, as far as I have observed. I'll try to set up some contained tests and reproduce it this week.

@leszko (Contributor) commented Nov 30, 2021

Thanks @payton, some steps to reproduce would be great!

@leszko (Contributor) commented Feb 4, 2022

Closing since I believe this is fixed in #2208. Please reopen if you still encounter this issue after the next release.

leszko closed this as completed Feb 4, 2022