Split O/T: no transcoders available on O connected to multiple Ts after a single T restart due to CUDA_ERROR_ILLEGAL_ADDRESS #2079
Comments
Attaching the logs from Discord.
I spent some time on the issue, but I was not able to reproduce it. If the issue repeats, could someone add the steps to reproduce? It's possible that we won't see the issue after the next release because of this fix #2094, but to be honest I'm not sure about it. There are actually two weird things that happen in the logs:
1. The Orchestrator is not able to recover after the transcoder(s) crash. It points to the function orchestrator.go:selectTranscoder(). @reubenr0d @darkdarkdragon, do you maybe have any clue why this function may fail to return a transcoder? (See the sketch after this comment.)
2. Transcoders 1 & 2 loop on a timeout. Transcoder 3 fails, and that's fine. However, the weird thing is that at this point Transcoders 1 & 2 keep printing these logs. It's like they sent some data to transcode, it failed, and they keep waiting in the loop for the result.
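To illustrate point 1: this is not the actual go-livepeer code, just a minimal Go sketch, under the assumption that the orchestrator keeps a pool of remote transcoders with simple capacity/liveness accounting. The type names and the `ErrNoTranscodersAvailable` value are hypothetical. It shows how a selectTranscoder-style loop could keep reporting "no transcoders available" after a single T crash if the crashed entry's load or liveness is never cleaned up:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNoTranscodersAvailable mirrors the kind of error the orchestrator
// reports when selection fails. (Illustrative name, not the real one.)
var ErrNoTranscodersAvailable = errors.New("no transcoders available")

// remoteTranscoder is a hypothetical stand-in for a connected T.
type remoteTranscoder struct {
	addr     string
	capacity int  // max concurrent sessions
	load     int  // sessions currently assigned
	alive    bool // whether the connection is still considered live
}

type pool struct {
	mu          sync.Mutex
	transcoders []*remoteTranscoder
}

// selectTranscoder picks the first transcoder with spare capacity.
// The failure mode described above: if a crashed T stays in the list
// (liveness never cleared, or its in-flight load never released),
// every entry can look unusable and selection fails even though
// healthy Ts are still connected.
func (p *pool) selectTranscoder() (*remoteTranscoder, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, t := range p.transcoders {
		if t.alive && t.load < t.capacity {
			t.load++
			return t, nil
		}
	}
	return nil, ErrNoTranscodersAvailable
}

func main() {
	p := &pool{transcoders: []*remoteTranscoder{
		{addr: "t1:8935", capacity: 1, load: 1, alive: true},  // load never released
		{addr: "t2:8935", capacity: 1, load: 1, alive: true},  // load never released
		{addr: "t3:8935", capacity: 1, load: 0, alive: false}, // crashed, never pruned
	}}
	_, err := p.selectTranscoder()
	fmt.Println(err) // "no transcoders available" despite connected Ts
}
```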
I'm not sure if this could be the cause of the problem here, but it looks like the one scenario where
These logs are expected whenever a transcoder hasn't received any segments for a session for a period of time. At that point, the transcoder will consider the session timed out and will remove it and clean up any associated state/resources. So, in this case, those logs would indicate that the transcoders did not receive any segments for a bunch of sessions in a while, so they were all removed.
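For context, here is a minimal sketch (not the real transcoder code) of the kind of idle-session reaping described above, assuming hypothetical `session`/`sessionStore` types: sessions that haven't seen a segment within the timeout are removed and their state freed, which is what produces the timeout logs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// session is a hypothetical per-stream transcode session held by a T.
type session struct {
	id       string
	lastSeen time.Time
}

type sessionStore struct {
	mu       sync.Mutex
	sessions map[string]*session
	timeout  time.Duration
}

// touch records that a segment arrived for the given session.
func (s *sessionStore) touch(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if sess, ok := s.sessions[id]; ok {
		sess.lastSeen = time.Now()
		return
	}
	s.sessions[id] = &session{id: id, lastSeen: time.Now()}
}

// reap removes sessions that have not received segments within the
// timeout and frees their state; this is the point where the
// "session timed out" style logs would be printed.
func (s *sessionStore) reap() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, sess := range s.sessions {
		if time.Since(sess.lastSeen) > s.timeout {
			fmt.Printf("transcode session %s timed out, removing\n", id)
			delete(s.sessions, id)
		}
	}
}

func main() {
	store := &sessionStore{sessions: map[string]*session{}, timeout: 100 * time.Millisecond}
	store.touch("stream-1")
	store.touch("stream-2")
	time.Sleep(150 * time.Millisecond) // no segments arrive for a while
	store.reap()                       // both sessions removed, logs printed
}
```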
Thanks @yondonfu for the comment.
That may be the case. I spent some time checking it, but could not find anything obvious.
Ok, I see. Then the transcoder loop looks ok.
Not sure if it's related, but I just had the following situation:
Here's the full log of the O and the Ts (bottom).
@0xVires Thanks for the comment and the logs. Were you able to repeat the same scenario, or did it just happen once? I'm looking for some steps to reproduce.
No, I'm not able to reproduce it since I don't know what triggered the fatal error in the first place... My setup has been running without any issues for the past two weeks. I have the following setup: So maybe you can try setting up something similar and crash one of the Ts to see if the other Ts and/or the O is affected?
I already tried it, but I cannot reproduce the issue. I kill one transcoder, but the others keep working correctly...
I've also experienced a similar issue. In this case, I had 3/4 Ts actively running maybe 9 sessions. The intent was to downscale and check how Livepeer handled these cutover situations.
The logs on my O at the time of step 2 were as follows:
Followed by
Thanks for the input @payton. I think this is an important issue to address; however, I tried a few times with the same load (like you did, 3/4 Ts and 9 sessions), but I still never encountered the issue. Is it something you were able to reproduce, or did it happen only once?
@leszko It only happened once that I have observed. I'll try to set up some contained tests and reproduce it this week.
Thanks @payton, some steps to reproduce would be great!
Closing since I believe it's fixed in #2208. Reopen if you still encounter this issue after the next release.
See https://discord.com/channels/423160867534929930/426114749370204170/903963246749446184 for attached logs.