Finalizing Sectors from Workers #9033
Comments
Typo on the RemoteFinalize flag*
This relates to the implementation of #8710
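For context on the typo remark: a minimal sketch of checking the option in the miner config, assuming the config lives at ~/.lotusminer/config.toml and that the correctly spelled option is DisallowRemoteFinalize under the [Storage] section (both the path and the spelling are assumptions here; verify them against your Lotus version's config documentation):

```bash
# Assumed config location; substitute $LOTUS_MINER_PATH/config.toml if you use a custom repo path.
grep -n -A 5 '^\[Storage\]' ~/.lotusminer/config.toml
# The section is expected to contain a line along the lines of:
#   DisallowRemoteFinalize = true
# A differently spelled key may not have the intended effect, so double-check the exact name.
```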
Thanks for the report. I will add labels and assign it to the right team for analysis. 🙏
Hey @piknikSteven2021! Can you check that all workers have access to the long-term storage paths?
Can you also check the output of:
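(The exact command referenced here was not preserved in this copy of the thread. A hedged sketch of checks commonly used for this, using standard lotus-miner/lotus-worker subcommands; verify the flags against your version:)

```bash
# List every storage path the miner knows about, with its canSeal/canStore flags.
# The long-term storage should show canStore and be visible to the workers that
# are expected to finalize into it.
lotus-miner storage list

# On a worker host, attach the long-term storage path locally if it is not attached yet.
# The path below is a placeholder for your actual mount point.
lotus-worker storage attach --store /mnt/long-term-storage
```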
Would this apply to all workers attached to the lotus-miner? Normally, we would only have long-term storage access on the workers that hold the PC2 role, as this is where the sectors need to be moved from in the finalize phase. Wouldn't it be a bit odd if AP, PC1 and C2 workers also need that access? I mean, C2 can even run without any local storage, right...
@magik6k has some thoughts to be shared soooooon
Did some more tinkering. Workers were detached overnight for approximately 16 hours, running v1.17-rc2. The miner was reverted during this time from v1.17-rc2 to v1.16. As soon as this happened the workers began to finalize directly to storage and we saw new logs in the PC2 logs. I attempted to replicate this on another miner that has been seeing the same issue of workers only finalizing through the miner, by keeping the miner on v1.16 and upgrading workers to v1.17-rc2. This time, it had no effect on the behavior.
It would be really useful to see the output of the commands mentioned above. Seeing scheduler logs when there are stuck sectors could help as well.
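A hedged sketch of how to collect that kind of information (subsystem names and exact flags differ between Lotus versions, so treat these as starting points rather than quotes from the thread):

```bash
# Show what the scheduler has assigned to each worker, and which workers are known/enabled.
lotus-miner sealing jobs
lotus-miner sealing workers

# Turn up log verbosity for scheduler-related subsystems while sectors are stuck.
# Run `lotus-miner log list` first, since subsystem names vary by version.
lotus-miner log set-level --system advmgr debug
```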
@magik6k very likely that this is the case. Probably triggered by FinalizeEarly. I have one that's stuck;
and the cache files + sealed files are already on the long-term storage;
but the sector is still stuck in that state. This in turn is probably the reason for this log spam: #8783 (comment). This might indicate where the code starts to "loop", and this in turn probably uses a lot of CPU cycles on the scheduler - the sector is deadlocked.
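For anyone debugging a sector stuck like this, a minimal sketch of the usual inspection steps (the sector number 1000 is a placeholder):

```bash
# Dump the sealing state machine log for the stuck sector, including errors and retries.
lotus-miner sectors status --log 1000

# Ask the miner where it believes the sector's files currently live, to confirm the
# cache/sealed copies really are on the long-term storage path.
lotus-miner storage find 1000
```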
I have moved the cache + sealed file from the long-term storage back to a PC1 machine, and restarted the PC1 worker that has that storage locally. This triggered something;
After 2 failed C1's, it's now redoing the PC1.
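A rough sketch of that kind of manual move, with placeholder paths and a hypothetical sector name (s-t01000-1000); stop the worker that owns the destination path before moving anything, and keep backups:

```bash
# Placeholder locations; substitute your long-term storage mount and the PC1 worker's local path.
LONG_TERM=/mnt/long-term-storage
PC1_LOCAL=/srv/lotus-worker-storage

# Move the cache directory and sealed file for the stuck sector back to the PC1 worker's path.
mv "$LONG_TERM/cache/s-t01000-1000"  "$PC1_LOCAL/cache/"
mv "$LONG_TERM/sealed/s-t01000-1000" "$PC1_LOCAL/sealed/"

# Restart that PC1 worker afterwards so it re-indexes its local storage and the scheduler
# can pick the sector up again, as described above.
```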
Should be fixed by #9648
@piknikSteven2021 did you ever manage to fix this? Because keeping an older miner is obviously not an option :) In my opinion this is still broken on 1.20.1. Still have a bunch stuck in the same state.
Having the storage attached to workers always made the scheduler perform worse. Unfortunately no, we just work around the issue by not committing too much to a single miner.
Checklist
Latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
Lotus component
Lotus Version
Describe the Bug
We've been having an issue where the workers do not send sectors to long-term storage despite the storage being declared on the worker; this has happened on v1.15.2, v1.16.0 and v1.17-rc2.
DisableRemoteFinalize was turned on to stop the miner from pulling sectors and congesting its NFS connection to the storage. (v1.17-rc2)
The miner is tasking the worker to finalize but there is nothing happening on the worker.
Sectors are stuck in CommitFinalize and will not move no matter what we do.
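A quick way to gauge how many sectors are affected might look like the sketch below (state names differ slightly across versions, e.g. CommitFinalize vs. CommitFinalizing, so adjust the pattern as needed):

```bash
# Count sectors currently sitting in finalize-related states.
lotus-miner sectors list | grep -c 'Finalize'
```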
Logging Information
Repo Steps
DisableRemoteFinalize = true
Sectors get stuck in CommitFinalizing with no data transfers