Too many connections between boostd and worker machines #732

Closed
haocai2868 opened this issue Aug 25, 2022 · 9 comments
@haocai2868

Checklist

  • This is not a question or a support request. If you have any boost related questions, please ask in the discussion forum.
  • This is not a new feature or enhancement request. If it is, please open a new idea discussion instead. New feature and enhancement requests would be entertained by the boost team after a thorough discussion only.
  • I have searched on the issue tracker and the discussion forum, and there is no existing related issue or discussion.
  • I am running the latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to boost.

Boost component

  • boost daemon - storage providers
  • boost client
  • boost UI
  • boost data-transfer
  • boost index-provider
  • Other

Boost Version

boostd version 1.4.0-rc1+git.cf9ae30

Describe the Bug

(screenshots attached)

There are too many connections between boostd and the worker machines. Every time a task is added, the number of connections grows and the connections are never released. Eventually this causes intermittent disconnections between boostd and lotus, the miner, and the workers, which in turn causes deals to fail.
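
For reference, a rough way to see how many established connections boostd holds toward each remote endpoint (a sketch only; run it on the boostd host, and note that 2345 and 1234 are simply the worker and lotus API ports that appear in the logs below and may differ on other setups):

```
# List ESTABLISHED TCP connections and count them per remote address:port.
# Peers such as 10.126.36.6:2345 (worker API) or 10.126.36.6:1234 (lotus API)
# accumulating here indicate RPC connections that are opened but never released.
ss -tn state established | awk 'NR > 1 {print $4}' | sort | uniq -c | sort -rn | head -n 20
```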

Logging Information

2022-08-17T17:38:56.106+0800    ERROR   rpc     go-jsonrpc@v0.1.5/websocket.go:667      Connection timeout      {"remote": "10.126.36.6:2345"}
2022-08-17T17:38:56.107+0800    WARN    dagstore        dagstore/miner_api.go:152       failed to check/retrieve unsealed sector: failed to check if sector 3406 for deal 9038184 was unsealed: acquiring read sector lock: handler: websocket connection closed
2022-08-17T17:38:56.107+0800    WARN    dagstore        dagstore/miner_api.go:152       failed to check/retrieve unsealed sector: failed to check if sector 3408 for deal 9038186 was unsealed: acquiring read sector lock: handler: websocket connection closed
2022-08-17T17:38:56.107+0800    WARN    dagstore        dagstore/miner_api.go:152       failed to check/retrieve unsealed sector: failed to check if sector 3407 for deal 9038195 was unsealed: acquiring read sector lock: handler: websocket connection closed
2022-08-17T17:38:56.107+0800    WARN    dagstore        dagstore/miner_api.go:152       failed to check/retrieve unsealed sector: failed to check if sector 3404 for deal 9038202 was unsealed: acquiring read sector lock: handler: websocket connection closed
2022-08-17T17:38:56.107+0800    WARN    dagstore        dagstore/miner_api.go:152       failed to check/retrieve unsealed sector: failed to check if sector 3405 for deal 9038192 was unsealed: acquiring read sector lock: handler: websocket connection closed
2022-08-17T17:38:56.111+0800    ERROR   rpc     go-jsonrpc@v0.1.5/websocket.go:667      Connection timeout      {"remote": "10.126.36.6:2345"}
2022-08-17T17:38:56.111+0800    WARN    boost-storage-deal      logs/log.go:46  failed to addPiece for deal, will-retry {"id": "5c0ae7f8-08e4-4bb7-a307-a29d453b679c", "err": "handler: websocket connection closed"}
2022-08-17T17:38:56.112+0800    ERROR   rpc     go-jsonrpc@v0.1.5/websocket.go:498      sending ping message: write tcp 10.126.38.10:43568->10.126.36.6:2345: use of closed network connection
2022-08-17T17:38:56.112+0800    ERROR   rpc     go-jsonrpc@v0.1.5/websocket.go:498      sending ping message: write tcp 10.126.38.10:43532->10.126.36.6:2345: use of closed network connection
2022-08-17T17:38:56.112+0800    ERROR   rpc     go-jsonrpc@v0.1.5/websocket.go:667      Connection timeout      {"remote": "10.126.36.6:1234"}
2022-08-17T17:38:56.112+0800    WARN    boost-storage-deal      logs/log.go:46  failed to addPiece for deal, will-retry {"id": "82169a39-fed9-4f67-9424-97f92bf47935", "err": "handler: websocket connection closed"}

Repo Steps

  1. Run '...'
  2. Do '...'
  3. See error '...'
    ...
@LexLuthr
Collaborator

This is currently being investigated in https://filecoinproject.slack.com/archives/C03CKDLEWG1/p1659290352038479
We have identified that these messages originate from retrieval queries. Further investigation is required to pin down in more detail where these requests are coming from.

@dirkmc
Contributor

dirkmc commented Aug 26, 2022

I'm not sure if this is the same issue as in the slack thread.

@haocai2868 thank you for the bug report. To help track down the problem, can you please give us some numbers:

  • how many connections do you see in total before you start getting errors?
  • approximately how many connections are being opened per minute?
  • are you sure that every connection remains open? or is it possible that some are closed, but there are many more being opened than being closed?
  • are you serving retrievals, or do you have retrievals disabled?

@haocai2868
Author

  1. When the errors started I saw more than 1,000 connections; shortly afterwards it was more than 3,000.
  2. The rate depends on how quickly boostd imports deals. Each time a deal is imported, boostd dispatches it to the workers for commP, and the number of connections between boostd and the workers grows and is never released, so it keeps increasing over time (a way to track this is sketched below).
  3. I am sure: every connection stays in the ESTABLISHED state.
  4. Retrieval is enabled, but nobody is retrieving from me (this is on the calibration network).
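
For reference, a minimal way to track how fast the count grows and confirm that the connections stay ESTABLISHED (a sketch only; 10.126.36.6:2345 is the worker endpoint taken from the logs above and will differ on other setups):

```
# Append a timestamped count of ESTABLISHED connections to the worker API
# endpoint once a minute; the difference between successive lines gives the
# growth per minute asked about above.
while true; do
  echo "$(date -Is) $(ss -tn state established | grep -c '10.126.36.6:2345')"
  sleep 60
done
```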

@dirkmc
Contributor

dirkmc commented Aug 29, 2022

@haocai2868 we think that the problem may be fixed by filecoin-project/lotus#9230
I will give you instructions on how to test it once the PR is approved by the lotus team.

@dirkmc
Contributor

dirkmc commented Aug 29, 2022

@haocai2868 are you familiar with how to cherry-pick a commit with git?

If so, maybe you could try the following (a rough sketch of the commands follows the list):

  1. cherry-pick the commit with the fix to your lotus repo
  2. compile and deploy the code to your worker node
  3. let us know if it fixes the problem with too many connections between boostd and worker machines
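
For reference, the steps above might look roughly like this (a sketch under assumptions, not exact instructions: <commit-hash> stands for the commit from filecoin-project/lotus#9230, which is not reproduced here, and the checkout path, remote name, and build targets are guesses that may not match your setup):

```
# In the lotus repository you build the worker from:
cd ~/lotus                       # assumed checkout location
git fetch origin                 # or whichever remote carries the fix
git cherry-pick <commit-hash>    # the commit from filecoin-project/lotus#9230

# Rebuild the worker binary, redeploy it to the worker machine, restart the
# worker, then watch whether the boostd<->worker connection count still grows
# after importing a deal.
make clean lotus-worker
```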

@TippyFlitsUK

@haocai2868 Just out of interest, what version of Ubuntu are you using?

@haocai2868
Author

18.04

@jacobheun
Contributor

@haocai2868 are you still experiencing this issue on new versions of lotus & boost?

@LexLuthr
Collaborator

LexLuthr commented Dec 6, 2023

I am closing this as there has been no new input from the author.

LexLuthr closed this as completed on Dec 6, 2023
github-project-automation bot moved this to Done in Boost on Dec 6, 2023