
Split O/T: no transcoders available on O connected to multiple Ts after a single T restart due to CUDA_ERROR_ILLEGAL_ADDRESS #2079

Closed
yondonfu opened this issue Oct 30, 2021 · 13 comments


@yondonfu (Member)

Had a transcoder crash on the newest version of livepeer due to CUDA_ERROR_ILLEGAL_ADDRESS. The transcoder restarted, but the Orchestrator could not recover and was stuck with err=no transcoders available. The other two Transcoders did not show any errors; they just stopped transcoding. Here are the relevant log parts from the O and Ts:

See https://discord.com/channels/423160867534929930/426114749370204170/903963246749446184 for attached logs.

@leszko (Contributor) commented Nov 15, 2021

Attaching the logs from Discord.
cuda_error.txt

@leszko (Contributor) commented Nov 15, 2021

I spent some time on the issue, but I was not able to reproduce it. If the issue happens again, could someone add the steps to reproduce?

It's possible that we won't see the issue after the next release because of fix #2094, but to be honest I'm not sure about that. There are actually two odd things happening in the logs.

1. The orchestrator is not able to recover after a transcoder crashes.

This points to the function selectTranscoder() in orchestrator.go.

@reubenr0d @darkdarkdragon do you have any clue why this function might return ErrNoTranscodersAvailable even though new transcoders register with the orchestrator?

2. Transcoders 1&2 keep hitting transcode loop timeouts.

Transcoder 3 fails; that's fine. However, the weird thing is that at this point Transcoders 1&2 keep printing these logs:

Oct 30 05:51:36  livepeer[2159602]: I1030 05:51:36.162803 2159602 lb.go:181] LB: Transcode loop timed out for 9043b617_0
Oct 30 05:51:36  livepeer[2159602]: I1030 05:51:36.228699 2159602 lb.go:123] LB: Deleted transcode session for 9043b617_0

It's as if they sent some data to transcode, it failed, and they are still waiting in the loop for the result...

@yondonfu (Member, Author)

do you have any clue why this function might return ErrNoTranscodersAvailable even though new transcoders register with the orchestrator?

I'm not sure if this could be the cause of the problem here, but it looks like the one scenario where selectTranscoder() does return ErrNoTranscodersAvailable even when there are transcoders available is here - if the transcoder at the head of the transcoder queue is at capacity, selectTranscoder() assumes that the rest of the transcoders in the queue are at capacity as well and just returns ErrNoTranscodersAvailable. This makes me wonder whether the load for transcoders isn't being updated correctly, so that it looks like all the transcoders are at capacity even when they're not.
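To make that failure mode concrete, here is a minimal Go sketch. It is not the actual go-livepeer code; the remoteTranscoder type, the queue ordering, and the load/capacity fields are illustrative assumptions. It only shows how a selection routine that inspects nothing but the head of the queue reports the whole pool as full whenever that one entry carries a stale load value:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoTranscodersAvailable = errors.New("no transcoders available")

// remoteTranscoder is an illustrative stand-in for the orchestrator's view of
// a connected T: its current load and its advertised capacity.
type remoteTranscoder struct {
	addr     string
	load     int
	capacity int
}

// selectTranscoderSketch mirrors the behavior described above: it inspects
// only the transcoder at the head of the queue, and if that entry reports
// load >= capacity it gives up instead of scanning the rest of the queue.
func selectTranscoderSketch(queue []*remoteTranscoder) (*remoteTranscoder, error) {
	if len(queue) == 0 {
		return nil, errNoTranscodersAvailable
	}
	head := queue[0]
	if head.load >= head.capacity {
		// If head.load was never decremented after its sessions died
		// (e.g. the T crashed mid-stream), every call lands here even
		// though other Ts in the queue have free slots.
		return nil, errNoTranscodersAvailable
	}
	head.load++
	return head, nil
}

func main() {
	queue := []*remoteTranscoder{
		{addr: "T1:59014", load: 10, capacity: 10}, // stale load from a crashed T
		{addr: "T2:34567", load: 0, capacity: 10},  // idle, but never even considered
	}
	if _, err := selectTranscoderSketch(queue); err != nil {
		fmt.Println("err =", err) // "no transcoders available" despite T2 being idle
	}
}
```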

However, the weird thing is that at this point Transcoders 1&2 keep printing these logs.

These logs are expected whenever a transcoder hasn't received any segments for a session for a period of time. At that point, the transcoder will consider the session as timed out and will remove the session and clean up any associated state/resources for it. So, in this case, those logs would indicate that the transcoders had not received any segments for a bunch of sessions in a while, so those sessions were all removed.
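For context, here is a hedged sketch of the per-session loop behavior behind the "LB: Transcode loop timed out" / "LB: Deleted transcode session" lines. The names (transcodeLoopSketch, segCh, sessionTimeout, cleanup) are illustrative rather than the real lb.go symbols; the point is only that a session with no incoming segments eventually times out and gets deleted, which is benign on its own:

```go
package main

import (
	"fmt"
	"time"
)

// transcodeLoopSketch waits for segments on segCh; if nothing arrives within
// sessionTimeout, the session is treated as idle, the loop exits, and the
// caller's cleanup deletes the session -- mirroring the two log lines above.
func transcodeLoopSketch(sessionID string, segCh <-chan string, sessionTimeout time.Duration, cleanup func(string)) {
	defer cleanup(sessionID)
	for {
		select {
		case seg, ok := <-segCh:
			if !ok {
				return
			}
			fmt.Printf("transcoding %s for session %s\n", seg, sessionID)
		case <-time.After(sessionTimeout):
			fmt.Printf("LB: Transcode loop timed out for %s\n", sessionID)
			return
		}
	}
}

func main() {
	segCh := make(chan string)
	done := make(chan struct{})
	go func() {
		defer close(done)
		transcodeLoopSketch("9043b617_0", segCh, 100*time.Millisecond, func(id string) {
			fmt.Printf("LB: Deleted transcode session for %s\n", id)
		})
	}()
	<-done // no segments are ever sent, so the loop times out and cleans up
}
```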

@leszko (Contributor) commented Nov 15, 2021

Thanks @yondonfu for the comment.

I'm not sure if this could be the cause of the problem here, but it looks like the one scenario where selectTranscoder() does return ErrNoTranscodersAvailable even when there are transcoders available is here - if the transcoder at the head of the transcoder queue is at capacity, selectTranscoder() assumes that the rest of the transcoders in the queue are at capacity as well and just returns ErrNoTranscodersAvailable. This makes me wonder whether the load for transcoders isn't being updated correctly, so that it looks like all the transcoders are at capacity even when they're not.

That may be the case. I spent some time checking it, but cannot find anything obvious.

These logs are expected whenever a transcoder hasn't received any segments for a session for a period of time. At that point, the transcoder will consider the session as timed out and will remove the session and clean up any associated state/resources for it. So, in this case, those logs would indicate that the transcoders had not received any segments for a bunch of sessions in a while, so those sessions were all removed.

Ok, I see. Then, the transcoder loop looks ok.

@0xVires commented Nov 21, 2021

Not sure if it's related, but I just had the following situation:

  • T1 (on the same machine as the O) failed with Fatal error with remote transcoder=T1:59014 taskId=14531293 fname=https://T1:8935/stream/2c3da27f/5.tempfile err=Remote transcoder took too long and got removed from the transcoder map.
  • This somehow triggered the no transcoders available error, and T2 & T3 got stuck with the Transcode loop timed out & Deleted transcode session issue
  • T1 restarted and started transcoding again
  • T2 & T3 remained stuck with the issue mentioned above; only a restart helped

Here's the full log of the O and the Ts (bottom)
errors.txt
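For readers following along, the "Remote transcoder took too long" error in the first bullet is, generically, a deadline that expires while the orchestrator waits for the remote T's result, after which the T is dropped from the map. The sketch below is an assumption-laden illustration, not the go-livepeer implementation; resultCh, transcodeTimeout, and removeFromMap are made-up names:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// transcodeSegmentSketch waits for the remote transcoder's result with a
// deadline. On timeout it removes the transcoder from the live map, which is
// roughly the sequence reported in the first bullet above.
func transcodeSegmentSketch(resultCh <-chan []byte, transcodeTimeout time.Duration, removeFromMap func()) ([]byte, error) {
	select {
	case res := <-resultCh:
		return res, nil
	case <-time.After(transcodeTimeout):
		removeFromMap()
		return nil, errors.New("Remote transcoder took too long")
	}
}

func main() {
	resultCh := make(chan []byte) // the remote T never answers in this example
	_, err := transcodeSegmentSketch(resultCh, 100*time.Millisecond, func() {
		fmt.Println("removing T1:59014 from the transcoder map")
	})
	fmt.Println("err =", err)
}
```

On its own, dropping the slow T is expected behavior; the open question in this thread is why the remaining Ts then stop receiving work.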

@leszko (Contributor) commented Nov 22, 2021

@0xVires Thanks for the comment and the logs. Were you able to repeat the same scenario, or did it just happen once? I'm looking for some steps to reproduce.

@0xVires commented Nov 22, 2021

No, I'm not able to reproduce it since I don't know what triggered the fatal error in the first place... My setup has been running without any issues for the past two weeks.

I have the following setup:
Machine 1: O and T
Machine 2: T
Machine 3: T

So maybe you can try setting up something similar and crashing one of the Ts to see whether the other Ts and/or the O are affected?

@leszko (Contributor) commented Nov 22, 2021

So maybe you can try setting up something similar and crashing one of the Ts to see whether the other Ts and/or the O are affected?

I already tried it, but I cannot reproduce the issue. I kill one transcoder, but the others keep working correctly...

@payton commented Nov 29, 2021

I've also experienced a similar issue. In this case, I had 3 of my 4 Ts actively running maybe 9 sessions. The intent was to downscale and check how Livepeer handled these cutover situations.

  1. I killed the LP process on T1, which wasn't transcoding at the time. The O successfully deregistered it, so there were no issues.
  2. I killed the LP process on T2, which was transcoding at the time. The O stated that it removed T2 from the live transcoders map and then continued to error on all sessions with no transcoders available. I believe the 9 active sessions were distributed across the transcoders and not only on that one, but I cannot back this up with anything at the moment.

The logs on my O at the time of step 2 were as follows:

Nov 28 03:03:59 ip-172-31-49-222 livepeer.sh[26422]: E1128 03:03:59.151886   26434 orchestrator.go:847] Stream closed for transcoder=172.31.59.215:54434, err=context canceled
Nov 28 03:03:59 ip-172-31-49-222 livepeer.sh[26422]: I1128 03:03:59.151918   26434 orchestrator.go:865] Got transcoder=172.31.59.215:54434 eof, removing from live transcoders map
...

Followed by

...
Nov 28 03:04:00 ip-172-31-49-222 livepeer.sh[26422]: E1128 03:04:00.173242   26434 orchestrator.go:559] Error transcoding manifestID=a445af74-9c4f-4dec-b972-16d567245607 sessionID=172e3af2 segNo=5814 segName=https://orchestrator.lib.community:8935/stream/172e3af2/5814.tempfile err=no transcoders available
Nov 28 03:04:00 ip-172-31-49-222 livepeer.sh[26422]: E1128 03:04:00.173551   26434 segment_rpc.go:214] Could not transcode manifestID=a445af74-9c4f-4dec-b972-16d567245607 sessionID=172e3af2 seqNo=5814 err=no transcoders availabl
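Reading these logs together with the earlier hypothesis about load accounting, one plausible (but unconfirmed) mechanism is that a slot reserved for a segment is only released on the happy path, so segments lost to a crashed or disconnected T leak load until the remaining pool looks permanently full. Below is a minimal sketch of that idea, with illustrative names, not the actual orchestrator.go logic:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNoTranscodersAvailable = errors.New("no transcoders available")

type transcoder struct {
	addr     string
	load     int
	capacity int
}

// transcodeSeg reserves a slot before dispatching a segment and releases it
// only when a result comes back. If the remote T dies mid-segment, the
// result never arrives and -- in this deliberately buggy sketch -- the slot
// is never released, so load creeps up with every lost segment.
func transcodeSeg(t *transcoder, resultCh <-chan []byte, timeout time.Duration) error {
	t.load++
	select {
	case <-resultCh:
		t.load-- // only the happy path releases the slot
		return nil
	case <-time.After(timeout):
		// A missing t.load-- here (or on stream EOF) is the kind of leak
		// that would make the pool look full forever.
		return errors.New("Remote transcoder took too long")
	}
}

// selectAny returns any transcoder that still has spare capacity.
func selectAny(pool []*transcoder) (*transcoder, error) {
	for _, t := range pool {
		if t.load < t.capacity {
			return t, nil
		}
	}
	return nil, errNoTranscodersAvailable
}

func main() {
	t := &transcoder{addr: "172.31.59.215:54434", capacity: 2}
	pool := []*transcoder{t}
	dead := make(chan []byte) // the remote T crashed; no results ever arrive
	for i := 0; i < 2; i++ {
		_ = transcodeSeg(t, dead, 10*time.Millisecond)
	}
	if _, err := selectAny(pool); err != nil {
		fmt.Println("err =", err) // the pool now looks permanently full
	}
}
```

If something along these lines is happening, it would also help explain why only a restart clears the condition, since restarting resets the counters.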

@leszko (Contributor) commented Nov 29, 2021

Thanks for the input @payton. I think this is an important issue to address; however, I tried a few times under load (like you did, 3 of 4 Ts and 9 sessions), but I still never encountered the issue.

Is it something you were able to reproduce, or did it happen only once?

@payton commented Nov 30, 2021

@leszko It only happened once, as far as I have observed. I'll try to set up some contained tests and reproduce it this week.

@leszko (Contributor) commented Nov 30, 2021

Thanks @payton, some steps to reproduce would be great!

@leszko (Contributor) commented Feb 4, 2022

Closing since I believe this is fixed in #2208. Please reopen if you still encounter this issue after the next release.

leszko closed this as completed Feb 4, 2022