feat: sched: Finalize* move selectors #8710

Merged
merged 5 commits into from
May 26, 2022


magik6k
Contributor

@magik6k magik6k commented May 23, 2022

Related Issues

On top of #8700 (to avoid conflicts and for easier testing - those PRs make most sense together)

Proposed Changes

  • Add a new mechanism to worker selectors that lets them decide that some workers are so much better than others that the others aren't even worth considering, even if that means the task might stay queued a bit longer
  • Add a new move worker selector, which strongly prefers workers that can perform a local move of data (they either have both storage paths locally, or already have the sector in the correct local path)
  • Add a new config option that makes it possible to force Finalize* tasks to perform local-storage moves
[Storage]
  # DisallowRemoteFinalize when set to true will force all Finalize tasks to
  # run on workers with local access to both long-term storage and the sealing
  # path containing the sector.
  # --
  # WARNING: Only set this if all workers have access to long-term storage
  # paths. If this flag is enabled, and there are workers without long-term
  # storage access, sectors will not be moved from them, and Finalize tasks
  # will appear to be stuck.
  # --
  # If you see stuck Finalize tasks after enabling this setting, check
  # 'lotus-miner sealing sched-diag' and 'lotus-miner storage find [sector num]'
  #
  # type: bool
  # env var: LOTUS_STORAGE_DISALLOWREMOTEFINALIZE
  DisallowRemoteFinalize = true
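
The selection behaviour described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Worker`, `Selector`, `MoveSelector`, `Candidates`, and the `AllowRemote` field are hypothetical names, and storage paths are reduced to plain string IDs. The assumption is that a selector reports both whether a worker is acceptable and whether it is preferred, and that the scheduler drops non-preferred workers whenever a preferred one exists.

```go
package main

import "fmt"

// Worker is a hypothetical, simplified stand-in for a scheduler worker.
type Worker struct {
	Name       string
	LocalPaths map[string]bool // IDs of storage paths the worker can access locally
}

// Selector reports whether a worker may run a task (ok) and whether it is so
// much better suited that non-preferred workers should not be considered.
type Selector interface {
	Ok(w Worker) (ok, preferred bool)
}

// MoveSelector prefers workers that can move the sector without a remote fetch.
// AllowRemote is an assumed knob mirroring DisallowRemoteFinalize = false.
type MoveSelector struct {
	SectorPath  string // path currently holding the sector
	DestPath    string // long-term storage path
	AllowRemote bool
}

func (s MoveSelector) Ok(w Worker) (ok, preferred bool) {
	// A local move is possible when the worker sees both paths locally.
	// If SectorPath == DestPath, the sector is already in the right place.
	local := w.LocalPaths[s.SectorPath] && w.LocalPaths[s.DestPath]
	return local || s.AllowRemote, local
}

// Candidates returns only preferred workers when at least one exists,
// otherwise every acceptable worker.
func Candidates(sel Selector, workers []Worker) []Worker {
	var acceptable, preferred []Worker
	for _, w := range workers {
		ok, pref := sel.Ok(w)
		if !ok {
			continue
		}
		acceptable = append(acceptable, w)
		if pref {
			preferred = append(preferred, w)
		}
	}
	if len(preferred) > 0 {
		return preferred
	}
	return acceptable
}

func main() {
	sealOnly := Worker{Name: "seal-only", LocalPaths: map[string]bool{"seal": true}}
	sealAndStore := Worker{Name: "seal-and-store", LocalPaths: map[string]bool{"seal": true, "store": true}}

	sel := MoveSelector{SectorPath: "seal", DestPath: "store", AllowRemote: true}
	for _, w := range Candidates(sel, []Worker{sealOnly, sealAndStore}) {
		fmt.Println("candidate:", w.Name)
	}
}
```

In this sketch, setting `AllowRemote` to false while only `seal-only` is available yields no candidates at all, which corresponds to the "Finalize tasks will appear to be stuck" warning in the config comment.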

@magik6k magik6k changed the title Feat/stor fin move selector feat: sched: Finalize* move selectors May 23, 2022
@magik6k magik6k force-pushed the feat/stor-fin-move-selector branch from 0fce142 to 8c6cba7 Compare May 23, 2022 21:36
@codecov

codecov bot commented May 23, 2022

Codecov Report

Merging #8710 (70f3b98) into feat/multi-sched (3de34ea) will decrease coverage by 0.03%.
The diff coverage is 69.23%.


@@                 Coverage Diff                  @@
##           feat/multi-sched    #8710      +/-   ##
====================================================
- Coverage             40.88%   40.85%   -0.04%     
====================================================
  Files                   691      692       +1     
  Lines                 75992    76046      +54     
====================================================
- Hits                  31068    31067       -1     
- Misses                39563    39617      +54     
- Partials               5361     5362       +1     
| Impacted Files | Coverage Δ | |
|---|---|---|
| extern/sector-storage/sched.go | 82.89% <ø> (ø) | |
| extern/sector-storage/selector_alloc.go | 68.96% <37.50%> (ø) | |
| extern/sector-storage/selector_existing.go | 73.33% <50.00%> (ø) | |
| extern/sector-storage/selector_move.go | 65.00% <65.00%> (ø) | |
| extern/sector-storage/selector_task.go | 61.11% <66.66%> (ø) | |
| extern/sector-storage/manager.go | 62.01% <100.00%> (+0.06%) | ⬆️ |
| extern/sector-storage/sched_assigner_common.go | 80.23% <100.00%> (+1.48%) | ⬆️ |
| itests/kit/ensemble.go | 92.08% <100.00%> (+0.03%) | ⬆️ |
| itests/kit/node_opts.go | 79.01% <100.00%> (+1.09%) | ⬆️ |
| node/config/storage.go | 74.35% <100.00%> (+0.67%) | ⬆️ |
... and 16 more

@magik6k magik6k marked this pull request as ready for review May 23, 2022 23:17
@magik6k magik6k requested a review from a team as a code owner May 23, 2022 23:17
Base automatically changed from feat/multi-sched to master May 26, 2022 19:20
@magik6k magik6k merged commit 7836e20 into master May 26, 2022
@magik6k magik6k deleted the feat/stor-fin-move-selector branch May 26, 2022 19:20
@piknikSteven2021

piknikSteven2021 commented Jul 13, 2022

Hello, we're currently testing the remote finalize feature against some Ubuntu w/ ZFS NAS systems. With the worker and miner both on v1.17-rc2, we can't find a way to push through the sectors that are stuck finalizing. We've restarted the worker and the miner, tried lotus-worker storage attach, and remounted the storage.
I've seen nothing in the worker debug logs to indicate that anything is being done with the sectors stuck in finalizing:
CommitFinalize: 87
Here are some logs, do let me know if more would be helpful:

2022/07/13 04:10:02 proto: duplicate proto type registered: pb.NoiseHandshakePayload
Worker version:  1.6.0
CLI version: lotus-worker version 1.17.0-rc2+mainnet+git.6aaacf8ef

Session: 92627aa1-a63c-47eb-a3fd-4c44d8992177
Enabled: true
Hostname: gpua-3371d1
CPUs: 48; GPUs: [NVIDIA RTX A5000]
RAM: 51.3 GiB/503.6 GiB; Swap: 1.464 GiB/2 GiB
Task types: FIN GET FRU C1 PC2 PR1 

6c5a6083-6252-4d48-b7c5-1f1daab360af:
        Weight: 10; Use: Seal 
        Local: /mnt/pc2/lotus-pc2-1
800fed86-5f9c-43dd-bb5c-506ba2689bbb:
        Weight: 10; Use: Store
        Local: /mnt/nas/nfs-las1-nas7c3b
khpy2020@gpua-3371d1:~$ sudo iftop -i enp
enp193s0f0  enp193s0f1  enp66s0f0   enp66s0f1   
khpy2020@gpua-3371d1:~$ sudo iftop -i enp
enp193s0f0  enp193s0f1  enp66s0f0   enp66s0f1   
khpy2020@gpua-3371d1:~$ sudo iftop -i enp193s0f0 -t
[sudo] password for khpy2020: 
interface: enp193s0f0
IP address is: 10.32.13.40
IPv6 address is: 2001:550:9c00:104:3eec:efff:fec9:d498
MAC address is: 3c:ec:ef:c9:d4:98
Listening on enp193s0f0
   # Host name (port/service if enabled)            last 2s   last 10s   last 40s cumulative
--------------------------------------------------------------------------------------------
   1 gpua-3371d1.las1.corp.piknik.com         =>     3.45Mb     3.45Mb     3.45Mb      883KB
     sd-ab47-miner.las1.corp.piknik.com     <=     4.08Mb     4.08Mb     4.08Mb     1.02MB
   2 gpua-3371d1.las1.corp.piknik.com         =>     4.41Kb     4.41Kb     4.41Kb     1.10KB
     zabbix1.las1.corp.piknik.com             <=     4.83Kb     4.83Kb     4.83Kb     1.21KB
   3 gpua-3371d1.las1.corp.piknik.com         =>     2.80Kb     2.80Kb     2.80Kb       718B
     ceph1-mon1.las1.corp.piknik.com          <=       712b       712b       712b       178B
   4 gpua-3371d1.las1.corp.piknik.com         =>     1.18Kb     1.18Kb     1.18Kb       302B
     10.32.30.60                              <=     2.20Kb     2.20Kb     2.20Kb       562B
   5 gpua-3371d1.las1.corp.piknik.com         =>       448b       448b       448b       112B
     10.32.30.16                              <=       208b       208b       208b        52B
   6 gpua-3371d1.las1.corp.piknik.com         =>       356b       356b       356b        89B
     ceph1-mon2.las1.corp.piknik.com          <=       196b       196b       196b        49B
--------------------------------------------------------------------------------------------
Total send rate:                                     3.46Mb     3.46Mb     3.46Mb
Total receive rate:                                  4.09Mb     4.09Mb     4.09Mb
Total send and receive rate:                         7.55Mb     7.55Mb     7.55Mb
--------------------------------------------------------------------------------------------
Peak rate (sent/received/total):                     3.46Mb     4.09Mb     7.55Mb
Cumulative (sent/received/total):                     885KB     1.02MB     1.89MB
============================================================================================

```
$ lotus-miner sealing sched-diag --force-sched | grep 25054 -A3 -B5
2022/07/13 04:14:19 proto: duplicate proto type registered: pb.NoiseHandshakePayload
    "Requests": [
      {
        "Priority": 0,
        "Sector": {
          "Miner": 1862225,
          "Number": 25054
        },
        "TaskType": "seal/v0/fetch"
      },

$ lotus-miner sectors status 25054
2022/07/13 04:14:29 proto: duplicate proto type registered: pb.NoiseHandshakePayload
SectorID:       25054
Status:         CommitFinalize
CIDcommD:       baga6ea4seaqao7s73y24kcutaosvacpdjgfe5pw76ooefnyqw4ynr3d2y6x2mpq
CIDcommR:       bagboea4b5abcbhf24um2hs3lj6fzzhk67uleucyqdc77tiwt63fqntzqzegothis
Ticket:         313fb7c6a3978add6fda217520d837a910ca4e7c846d73c9fe1ba0dc5a68bdfd
TicketH:        1977571
Seed:           e6593b06e755ecabc5e72297cb1fc29f4f3ec77c2d3cb1b760a5ac7fe7ff5ac2
SeedH:          1979139
Precommit:      bafy2bzacebfgclxoyd35nmufjwqclqj6ammantqaeh34ulezbemnmiccv4ivq
Commit:         <nil>
Deals:          [0]
Retries:        0
```
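
The queued request for a given sector, like the one grepped out of `sched-diag` above, could also be picked out programmatically. The following Go sketch only assumes the field names visible in the output above (`Requests`, `Priority`, `Sector.Miner`, `Sector.Number`, `TaskType`); it further assumes the caller passes a JSON object that directly contains the `Requests` array, which may not match the full nesting of the real command output. `SchedRequest` and `RequestsForSector` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchedRequest mirrors one entry of the "Requests" array shown above.
type SchedRequest struct {
	Priority int
	Sector   struct {
		Miner  uint64
		Number uint64
	}
	TaskType string
}

// RequestsForSector returns the queued requests that refer to the given
// sector number. raw must be a JSON object with a top-level "Requests" array.
func RequestsForSector(raw []byte, number uint64) ([]SchedRequest, error) {
	var doc struct {
		Requests []SchedRequest
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	var out []SchedRequest
	for _, r := range doc.Requests {
		if r.Sector.Number == number {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"Requests": [
		{"Priority": 0, "Sector": {"Miner": 1862225, "Number": 25054}, "TaskType": "seal/v0/fetch"}
	]}`)

	reqs, err := RequestsForSector(raw, 25054)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, r := range reqs {
		fmt.Printf("sector %d is queued for %s\n", r.Sector.Number, r.TaskType)
	}
}
```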

@piknikSteven2021

2022-07-13T04:04:53.174 INFO filcrypto::util::types > get_gpu_devices: end
2022-07-13T04:04:53.201Z        INFO    main    lotus-worker/main.go:588        Making sure no local tasks are running
2022-07-13T04:04:53.210 INFO filcrypto::util::types > get_gpu_devices: start
2022-07-13T04:04:53.210 INFO filcrypto::util::types > get_gpu_devices: end
2022-07-13T04:04:53.973Z        INFO    main    lotus-worker/main.go:611        Worker registered successfully, waiting for tasks
2022-07-13T04:04:54.493Z        INFO    main    lotus-worker/main.go:611        Worker registered successfully, waiting for tasks
2022-07-13T04:04:55.105Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25596} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25596  }
2022-07-13T04:04:55.108Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25054} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25054  }
2022-07-13T04:04:55.108Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25597} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25597  }
2022-07-13T04:04:55.115Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25596} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25596  }
2022-07-13T04:04:55.115 INFO filcrypto::util::types > clear_cache: start
2022-07-13T04:04:55.115 INFO filecoin_proofs::api::post_util > clear_cache:start
2022-07-13T04:04:55.116 INFO filecoin_proofs::api::post_util > clear_cache:finish
2022-07-13T04:04:55.116 INFO filcrypto::util::types > clear_cache: end
2022-07-13T04:04:55.117Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25597} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25597  }
2022-07-13T04:04:55.117 INFO filcrypto::util::types > clear_cache: start
2022-07-13T04:04:55.117Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25054} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25054  }
2022-07-13T04:04:55.117 INFO filecoin_proofs::api::post_util > clear_cache:start
2022-07-13T04:04:55.117 INFO filcrypto::util::types > clear_cache: start
2022-07-13T04:04:55.117 INFO filecoin_proofs::api::post_util > clear_cache:start
2022-07-13T04:04:55.117 INFO filecoin_proofs::api::post_util > clear_cache:finish
2022-07-13T04:04:55.117 INFO filcrypto::util::types > clear_cache: end
2022-07-13T04:04:55.117 INFO filecoin_proofs::api::post_util > clear_cache:finish
2022-07-13T04:04:55.117 INFO filcrypto::util::types > clear_cache: end
2022-07-13T04:04:55.260Z        INFO    main    lotus-worker/main.go:588        Making sure no local tasks are running
2022-07-13T04:04:55.266 INFO filcrypto::util::types > get_gpu_devices: start
2022-07-13T04:04:55.266 INFO filcrypto::util::types > get_gpu_devices: end
2022-07-13T04:04:55.643Z        INFO    main    lotus-worker/main.go:588        Making sure no local tasks are running
2022-07-13T04:04:55.650 INFO filcrypto::util::types > get_gpu_devices: start
2022-07-13T04:04:55.650 INFO filcrypto::util::types > get_gpu_devices: end
2022-07-13T04:04:56.089Z        INFO    main    lotus-worker/main.go:611        Worker registered successfully, waiting for tasks
2022-07-13T04:04:56.090Z        INFO    main    lotus-worker/main.go:611        Worker registered successfully, waiting for tasks
2022-07-13T04:04:56.092Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25441} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25441  }
2022-07-13T04:04:56.093Z        DEBUG   advmgr  sealer/worker_local.go:158      acquired sector {{1862225 25441} 8} (e:4; a:0): {{0 0}   /mnt/pc2/lotus-pc2-1/cache/s-t01862225-25441  }
2022-07-13T04:04:56.093 INFO filcrypto::util::types > clear_cache: start
2022-07-13T04:04:56.093 INFO filecoin_proofs::api::post_util > clear_cache:start
2022-07-13T04:04:56.094 INFO filecoin_proofs::api::post_util > clear_cache:finish
2022-07-13T04:04:56.094 INFO filcrypto::util::types > clear_cache: end

@benjaminh83 benjaminh83 mentioned this pull request Jul 14, 2022
18 tasks
@long568

long568 commented Aug 6, 2022

Same as you...

@xiaoliwe

xiaoliwe commented Aug 8, 2022

Have you solved this issue?
