Disconnecting a remote T causes O to lose all transcoders #2605

Closed
eliteprox opened this issue Sep 23, 2022 · 2 comments
Assignees: cyberj0g
Labels: area: orchestrator QoL, need: investigation

eliteprox commented Sep 23, 2022

Describe the bug

When operating an orchestrator with multiple remote transcoders, disconnecting one remote T can sometimes cause the orchestrator to lose all of its transcoders and enter a "no transcoders available" state.

To Reproduce
Steps to reproduce the behavior:

  1. Run a standalone orchestrator.
  2. Run two transcoder processes on localhost, connected to the orchestrator.
  3. Run a third transcoder process on a separate IP address, connected to the same orchestrator.
  4. Stop the third transcoder process while the orchestrator has streams running on all three transcoders.
  5. Observe the orchestrator logs (attached):
    orchestratorlogs.txt
    local-transcoder2.txt
    remote-transcoder3.txt

Expected behavior
The stopped transcoder should disconnect without affecting the workflow of the other transcoders.

Desktop (please complete the following information):

  • OS: Ubuntu Linux 22.04.1 on both machines
  • Version: 0.5.34

Additional context
See attached logs.
orchestratorlogs.txt shows the orchestrator logs during this time period.

  • We intentionally stopped remote transcoder 70.132.135.145:34772 using systemctl stop:
Sep 21 14:39:39 whitebox systemd[1]: Stopping Livepeer Trancoder 1 Service...
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183235    3963 ot_rpc.go:116] Exiting Livepeer Transcoder: terminated
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183300    3963 ot_rpc.go:128] End of stream receive cycle because of err="execution interrupted", waiting for running transcode jobs to complete
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183372    3963 ot_rpc.go:60] Unregistering transcoder: execution interrupted
Sep 21 14:39:39 whitebox livepeer[3963]: I0921 14:39:39.183377    3963 ot_rpc.go:62] Terminating transcoder because of execution interrupted
Sep 21 14:41:09 whitebox systemd[1]: transcoder1.service: State 'stop-sigterm' timed out. Killing.
Sep 21 14:41:09 whitebox systemd[1]: transcoder1.service: Killing process 3963 (livepeer) with signal SIGKILL.
  • "no transcoders available" begins to appear at Sep 21 14:39:49:
Sep 21 14:39:48 blackbox livepeer[358249]: I0921 14:39:48.949666  358249 orchestrator.go:770] manifestID=e101ced7-65ea-4df6-96ae-31e2f143fcde seqNo=750 orchSessionID=da65b479 clientIP=0.0.0.130 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=127.0.0.1:59952 segments=2 taskId=1309469 fname=https://70.132.135.146:8935/stream/da65b479/750.tempfile dur=101.090851ms
Sep 21 14:39:49 blackbox livepeer[358249]: I0921 14:39:49.329143  358249 orchestrator.go:770] manifestID=147bb333-d2e6-4f0c-b6c6-de6eaf0aa93c seqNo=8357 orchSessionID=cf465ddb clientIP=0.0.0.98 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=127.0.0.1:59944 segments=1 taskId=1309470 fname=https://70.132.135.146:8935/stream/cf465ddb/8357.tempfile dur=53.886134ms
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.373572  358249 orchestrator.go:559] manifestID=d9918feb-bb09-41eb-8668-c8d1c43cd214 seqNo=5 orchSessionID=2aaaf010 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E clientIP=0.0.0.193 Error transcoding segName=https://70.132.135.146:8935/stream/2aaaf010/5.tempfile err="no transcoders available"

Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.374196  358249 segment_rpc.go:230] manifestID=d9918feb-bb09-41eb-8668-c8d1c43cd214 seqNo=5 orchSessionID=2aaaf010 clientIP=0.0.0.193 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745008  358249 orchestrator.go:559] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Error transcoding segName=https://70.132.135.146:8935/stream/2a965663/57.tempfile err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745111  358249 segment_rpc.go:230] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"
  • transcoder1.txt and transcoder2.txt show logs from two transcoders on localhost that stopped receiving work.
github-actions bot added the status: triage label Sep 23, 2022
cyberj0g self-assigned this Oct 3, 2022

cyberj0g commented Oct 3, 2022

Thanks for reporting. I couldn't reproduce this issue on a few initial attempts, trying both v0.5.34 and master.
There are only two places in the code that can return the "no transcoders available" error. I suspect a data race somewhere around the streamSessions and remoteTranscoders maps; it may only happen when transcoders are at full capacity.
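To make that concrete, here is a minimal, hypothetical sketch (simplified names, not go-livepeer's actual types) of a transcoder pool shared between the selection path and the unregister path. The point is that both paths need to hold the same lock, and removal must drop only the disconnected entry; a race or a faulty rebuild here would make every subsequent selection fail with "no transcoders available" even though other transcoders are still connected.

```go
// Hypothetical, simplified sketch; not go-livepeer's actual types or logic.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNoTranscoders = errors.New("no transcoders available")

type transcoder struct {
	addr     string
	capacity int
}

type pool struct {
	mu          sync.Mutex
	transcoders []*transcoder          // stand-in for the remoteTranscoders state
	sessions    map[string]*transcoder // stand-in for the streamSessions map
}

// selectTranscoder assigns a stream session to the first transcoder with
// spare capacity; this is the path that can report errNoTranscoders.
func (p *pool) selectTranscoder(sessionID string) (*transcoder, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, t := range p.transcoders {
		if t.capacity > 0 {
			t.capacity--
			p.sessions[sessionID] = t
			return t, nil
		}
	}
	return nil, errNoTranscoders
}

// unregister removes exactly one transcoder and its sessions. Filtering
// without the lock, or rebuilding the slice incorrectly, would drop the
// remaining transcoders as well.
func (p *pool) unregister(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	kept := p.transcoders[:0]
	for _, t := range p.transcoders {
		if t.addr != addr {
			kept = append(kept, t)
		}
	}
	p.transcoders = kept
	for id, t := range p.sessions {
		if t.addr == addr {
			delete(p.sessions, id)
		}
	}
}

func main() {
	p := &pool{
		transcoders: []*transcoder{
			{addr: "127.0.0.1:59944", capacity: 10},
			{addr: "127.0.0.1:59952", capacity: 10},
			{addr: "70.132.135.145:34772", capacity: 10},
		},
		sessions: make(map[string]*transcoder),
	}
	p.unregister("70.132.135.145:34772") // simulate the remote T disconnecting
	if _, err := p.selectTranscoder("2aaaf010"); err != nil {
		fmt.Println(err) // should never happen: two local transcoders remain
	} else {
		fmt.Println("selection still succeeds after one disconnect")
	}
}
```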

@eliteprox, I would appreciate some input to investigate further:

  1. Is this error intermittent, or consistently reproducible?
  2. Is there a difference in capabilities between local and remote transcoders (different GPUs)?
  3. Is it reproducible with only local transcoders?
  4. Could you share a full command line for O and Ts?

ArcAster added the need: investigation and area: orchestrator QoL labels and removed the status: triage label Oct 5, 2022
cyberj0g commented

I believe this has the same cause as #2706, which is now fixed.
