
[BUG] Runtime Segregation causes panic and crash #6792

Closed
tmyuu opened this issue Jul 19, 2021 · 2 comments · Fixed by #6800
Labels: kind/bug

tmyuu commented Jul 19, 2021

Describe the bug
Runtime segregation causes a panic and crash in the mining/sealing/proving subsystem node when publishing deals.

Version (run lotus version):

Daemon:  1.11.1-dev+mainnet+git.1c30d07d6+api1.3.0
Local: lotus version 1.11.1-dev+mainnet+git.1c30d07d6

To Reproduce

  • Separate the processes according to the introduction.
  • Run both nodes on the same machine on different ports.
  • Accept a deal and then publish it.
  • Note: all sealing processes are still handled by lotus-worker, the same as before the separation.

Expected behavior
Only the lotus-miner mining/sealing/proving node shuts down.

Logs
miner with mining/sealing/proving log:

2021-07-19T19:27:31.794+0900  DEBUG  advmgr sector-storage/sched.go:450   SCHED Acceptable win: [[2 3]]
2021-07-19T19:27:31.794+0900  DEBUG  advmgr sector-storage/sched.go:465   SCHED try assign sqi:0 sector 672 to window 2
2021-07-19T19:27:31.794+0900  DEBUG  advmgr sector-storage/sched.go:472   SCHED ASSIGNED sqi:0 sector 672 task seal/v0/addpiece to window 2
2021-07-19T19:27:31.794+0900  DEBUG  advmgr sector-storage/sched_worker.go:368   assign worker sector 672
2021-07-19T19:27:31.794+0900  DEBUG  advmgr sector-storage/sched.go:354   SCHED 0 queued; 4 open windows
2021-07-19T19:27:31.795+0900  DEBUG  advmgr sector-storage/sched_worker.go:278   task done    {"workerid": "f2ba1640-f9ce-41b6-8146-1c85ae48404c"}
2021-07-19T19:27:31.795+0900  DEBUG  advmgr sector-storage/sched.go:354   SCHED 0 queued; 4 open windows
2021-07-19T19:27:36.092+0900  INFO  miner  miner/miner.go:478   completed mineOne    {"tookMilliseconds": 9, "forRound": 946136, "baseEpoch": 946135, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1041980, "lookbackEpochs": 900, "networkPowerAtLookback": "8931068333354450944", "minerPowerAtLookback": "26725724160000", "isEligible": true, "isWinner": false, "error": null}
2021-07-19T19:28:06.393+0900  INFO  miner  miner/miner.go:478   completed mineOne    {"tookMilliseconds": 20, "forRound": 946137, "baseEpoch": 946136, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1041981, "lookbackEpochs": 900, "networkPowerAtLookback": "8931360528569532416", "minerPowerAtLookback": "26725724160000", "isEligible": true, "isWinner": false, "error": null}
2021-07-19T19:28:36.027+0900  INFO  miner  miner/miner.go:478   completed mineOne    {"tookMilliseconds": 7, "forRound": 946138, "baseEpoch": 946137, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1041982, "lookbackEpochs": 900, "networkPowerAtLookback": "8931410934305718272", "minerPowerAtLookback": "26725724160000", "isEligible": true, "isWinner": false, "error": null}
2021-07-19T19:29:06.194+0900  INFO  miner  miner/miner.go:478   completed mineOne    {"tookMilliseconds": 9, "forRound": 946139, "baseEpoch": 946138, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1041983, "lookbackEpochs": 900, "networkPowerAtLookback": "8931410934305718272", "minerPowerAtLookback": "26725724160000", "isEligible": true, "isWinner": false, "error": null}
2021-07-19T19:29:36.244+0900  INFO  miner  miner/miner.go:478   completed mineOne    {"tookMilliseconds": 18, "forRound": 946140, "baseEpoch": 946139, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1041984, "lookbackEpochs": 900, "networkPowerAtLookback": "8931330463798460416", "minerPowerAtLookback": "26725724160000", "isEligible": true, "isWinner": false, "error": null}
panic: close of closed channel
goroutine 3627 [running]:
github.com/filecoin-project/lotus/lib/rpcenc.(*waitReadCloser).Close(0xc0111598d8, 0x0, 0x0)
    /usr/local/filecoin/.filecoin-project/lotus_master/lib/rpcenc/reader.go:95 +0x2f
net/http.(*readTrackingBody).Close(0xc00eeb7458, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:637 +0x37
net/http.(*transferWriter).writeBody(0xc0001ef400, 0x37561a0, 0xc0143a8e28, 0x0, 0x0)
    /usr/local/go/src/net/http/transfer.go:381 +0x356
net/http.(*Request).write(0xc000fa6d00, 0x3750380, 0xc019fe28c0, 0x0, 0xc0107e4210, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/request.go:697 +0x7c6
net/http.(*persistConn).writeLoop(0xc000166c60)
    /usr/local/go/src/net/http/transport.go:2385 +0x1a7
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1744 +0xc9c

miner with markets log:

2021-07-19T19:20:16.130+0900	INFO	storageadapter	storageadapter/dealpublisher.go:164	force publishing deals
2021-07-19T19:20:16.132+0900	INFO	storageadapter	storageadapter/dealpublisher.go:331	publishing 1 deals in publish deals queue with piece CIDs: baga6ea4seaqlqqlbs2w2eeepf6pjdeliaihs5cf4rh6pkg6zkizm3vmcqkrzcly
2021-07-19T19:20:17.091+0900	INFO	markets	loggers/loggers.go:20	storage provider even{"name": "ProviderEventDealPublishInitiated", "proposal CID": "bafyreibpmjchx2fj4vpbhhu3m6yv5bjzkelt5moqnlhwic23oxaztyghdi", "state": "StorageDealPublishing", "message": ""}
2021-07-19T19:22:58.032+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWQnuFbKRNpP2ZAm5Xp9LdPFHk2M9owd7nDhka5tMB8y3X", "error": "stream reset"}
2021-07-19T19:23:10.543+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWJCHP6B61BRtTpPVNDM4irARqadrwv4GVCeEsoeoA2jAd", "error": "stream reset"}
2021-07-19T19:23:44.980+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWK94gKwiBY57NjUZz1PeM1UciKkXe5W255nWTzySXezr3", "error": "stream reset"}
2021-07-19T19:23:52.756+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWHaFasS3TNEUZzmtserWzkP91ikzzJtnGhKDXiY5AeMst", "error": "stream reset"}
2021-07-19T19:24:44.531+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWRmMXQSV18YzEUJVu8XjQ3HHpep3gtLxbPhkDGeLf7nYQ", "error": "stream reset"}
2021-07-19T19:26:00.522+0900	INFO	markets	loggers/loggers.go:20	storage provider even{"name": "ProviderEventFundsReleased", "proposal CID": "bafyreibpmjchx2fj4vpbhhu3m6yv5bjzkelt5moqnlhwic23oxaztyghdi", "state": "StorageDealPublishing", "message": ""}
2021-07-19T19:26:00.523+0900	INFO	markets	loggers/loggers.go:20	storage provider even{"name": "ProviderEventDealPublished", "proposal CID": "bafyreibpmjchx2fj4vpbhhu3m6yv5bjzkelt5moqnlhwic23oxaztyghdi", "state": "StorageDealStaged", "message": ""}
2021-07-19T19:26:48.410+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWDSvFxhvKUCKFjwbAukNG2rPFYBxdppyDVbrLDrLQ6awn", "error": "stream reset"}
2021-07-19T19:26:50.343+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWNFDWovgU1qVR8iv9yWNRJ7oEBiBjmAMHF7qceWzNVCNe", "error": "stream reset"}
2021-07-19T19:27:57.758+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWHrpkfmeUpTuTJTrEe44P81a4736EuGMdfxQAc4TqHN1u", "error": "stream reset"}
2021-07-19T19:28:13.946+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWCRwrG3zJRZDMyqDUebU67ASnp2ZtS8DB54hbT4xZY7ZF", "error": "stream reset"}
2021-07-19T19:28:42.614+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWFSaMBwDwoiYihQ3hHtskg7XpCR274HSsJLXoiFWdKyYY", "error": "stream reset"}
2021-07-19T19:29:02.901+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWEE2pHhs25crDkR4gkviry7Gz21PUNZmSLtEfLw9vBHo6", "error": "stream reset"}
2021-07-19T19:29:35.962+0900	INFO	net/identify	identify/id.go:376	failed negotiate identify protocol with peer	{"peer": "12D3KooWFFAChKWmBWwcEfKGicLTneCmV8joKiXn3asWCfmnAbaG", "error": "stream reset"}
2021-07-19T19:29:42.345+0900	ERROR	storageadapter	storageadapter/provider.go:114	failed to addPiece for deal 2210328, err: handler: websocket connection closed
2021-07-19T19:29:42.346+0900	ERROR	rpcenc	rpcenc/reader.go:63	sending reader param: Post "http://192.168.0.10:2345/rpc/streams/v0/push/d2017cdb-0252-44f8-86a8-17c81a344202": EOF
2021-07-19T19:29:42.346+0900	INFO	markets	loggers/loggers.go:20	storage provider even{"name": "ProviderEventDealHandoffFailed", "proposal CID": "bafyreibpmjchx2fj4vpbhhu3m6yv5bjzkelt5moqnlhwic23oxaztyghdi", "state": "StorageDealFailing", "message": "handing off deal to node: packing piece <nil>: AddPiece failed: handler: websocket connection closed"}
2021-07-19T19:29:42.346+0900	WARN	providerstates	providerstates/provider_states.go:536deal bafyreibpmjchx2fj4vpbhhu3m6yv5bjzkelt5moqnlhwic23oxaztyghdi failed: handing off deal to node: packing piece <nil>: AddPiece failed: handler: websocket connection closed
2021-07-19T19:29:42.468+0900	INFO	markets	loggers/loggers.go:20	storage provider even{"name": "ProviderEventFailed", "proposal CID": "bafyreibpmjchx2fj4vpbhhu3m6yv5bjzkelt5moqnlhwic23oxaztyghdi", "state": "StorageDealError", "message": "handing off deal to node: packing piece <nil>: AddPiece failed: handler: websocket connection closed"}

worker log:

2021-07-19T19:13:51.474+0900	INFO	main	lotus-seal-worker/main.go:520	Worker registered successfully, waiting for tasks
2021-07-19T19:27:31.799+0900	DEBUG	advmgr	sector-storage/worker_local.go:137	acquired sector {{104671 672} 8} (e:0; a:1): {{0 0} /usr/local/filecoin/.attached/sealing_1/unsealed/s-t0104671-672 }
2021-07-19T19:27:40.856+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:27:50.858+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:28:00.859+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:28:10.860+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:28:20.861+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:28:30.862+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:28:40.864+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:28:50.865+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:29:00.866+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:29:10.867+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:29:20.867+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:29:30.869+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:29:40.870+0900	WARN	stores	stores/local.go:126	negative reserved storage: p.reserved=34359738368, reserved: -8192
2021-07-19T19:29:42.359 INFO filcrypto::proofs::api > generate_data_commitment: start
2021-07-19T19:29:42.359 INFO filecoin_proofs::api::seal > compute_comm_d:start
2021-07-19T19:29:42.359 INFO filecoin_proofs::pieces > verifying 8192 pieces
2021-07-19T19:29:42.362 INFO filecoin_proofs::api::seal > compute_comm_d:finish
2021-07-19T19:29:42.362 INFO filcrypto::proofs::api > generate_data_commitment: finish
2021-07-19T19:29:42.363+0900	ERROR	advmgr	sector-storage/worker_local.go:150	declare sector error: RPC client error: sendRequest failed: Post "http://192.168.0.10:2345/rpc/v0": dial tcp 192.168.0.10:2345: connect: connection refused
2021-07-19T19:29:42.363+0900	ERROR	advmgr	sector-storage/worker_local.go:282	return error, will retry in 5s: AddPiece: RPC client error: sendRequest failed: Post "http://192.168.0.10:2345/rpc/v0": dial tcp 192.168.0.10:2345: connect: connection refused
2021-07-19T19:29:47.364+0900	ERROR	advmgr	sector-storage/worker_local.go:282	return error, will retry in 5s: AddPiece: RPC client error: sendRequest failed: Post "http://192.168.0.10:2345/rpc/v0": dial tcp 192.168.0.10:2345: connect: connection refused

Additional context
lotus-miner info returns an error corresponding to the separated subsystem.

lotus-miner info with mining/sealing/proving:

ERROR: fatal error calling 'Filecoin.MarketListIncompleteDeals': panic in rpc method 'Filecoin.MarketListIncompleteDeals': runtime error: invalid memory address or nil pointer dereference

lotus-miner info with markets:

ERROR: getting actor address: fatal error calling 'Filecoin.ActorAddress': panic in rpc method 'Filecoin.ActorAddress': runtime error: invalid memory address or nil pointer dereference

raulk commented Jul 19, 2021

waitReadCloser looks like it has several bugs:

  • Close() is not idempotent, and will panic if called more than once.
  • Read() returning an error will close the channel, so a Close() after an unsuccessful Read() will also panic.
  • Read() will close the channel even if the error is an EOF (which will happen if we provide a buffer larger than the remaining bytes), as shown in the sketch after this list.

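For illustration, here is a minimal, hypothetical sketch (a toy stand-in, not the actual code in lib/rpcenc) of the pattern described above: a reader whose error path and Close() both close the same channel, which reproduces the "close of closed channel" panic from the trace.

```go
package main

import (
	"io"
	"strings"
)

// waitReadCloserSketch is a hypothetical stand-in for waitReadCloser: it
// signals completion by closing wait, both when Read hits an error (including
// io.EOF) and when Close is called.
type waitReadCloserSketch struct {
	io.Reader
	wait chan struct{}
}

func (w *waitReadCloserSketch) Read(p []byte) (int, error) {
	n, err := w.Reader.Read(p)
	if err != nil {
		close(w.wait) // closes on any error, including io.EOF
	}
	return n, err
}

func (w *waitReadCloserSketch) Close() error {
	close(w.wait) // not idempotent: panics if Read already closed the channel
	return nil
}

func main() {
	r := &waitReadCloserSketch{Reader: strings.NewReader("hi"), wait: make(chan struct{})}
	io.ReadAll(r) // reads until io.EOF, which closes wait inside Read
	r.Close()     // closes wait again -> panic: close of closed channel
}
```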
Note: we've been using this component and the ReaderParamDecoder in the JSON-RPC layer in the seal-worker for a long time, but what's new here is that we're now using it in the lotus-miner to receive pieces from the markets process and push them to the seal-workers.

Given that the JSON-RPC methods use io.Readers, I don't think we have a double-close case here. I think what might be happening is a side-effect from piping that io.Reader from the miner's JSON-RPC server side, to the corresponding seal-worker JSON-RPC client.

I believe the EOF bug was not triggered previously because we knew exactly how many bytes to consume from the io.Reader. However, when we bridge the JSON-RPC server (miner) with the JSON-RPC client (seal-worker), the client is probably reading until EOF, and closing the stream, and then something closes it again, causing the panic.

I think the solution here is to make waitReadCloser safe.
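One possible direction, sketched here as a hypothetical hardening of the toy type above (the actual fix landed in #6800 and may differ): guard the channel close with a sync.Once so that Read's error path and Close() can no longer double-close it, and stop treating a plain EOF as a reason to close.

```go
package main

import (
	"io"
	"strings"
	"sync"
)

// safeWaitReadCloser is a hypothetical, hardened variant of the sketch above:
// a sync.Once guards the channel close, and a plain io.EOF no longer closes it.
type safeWaitReadCloser struct {
	io.Reader
	wait      chan struct{}
	closeOnce sync.Once
}

func (w *safeWaitReadCloser) done() {
	w.closeOnce.Do(func() { close(w.wait) })
}

func (w *safeWaitReadCloser) Read(p []byte) (int, error) {
	n, err := w.Reader.Read(p)
	if err != nil && err != io.EOF {
		w.done() // only real errors signal completion; EOF is left to Close
	}
	return n, err
}

func (w *safeWaitReadCloser) Close() error {
	w.done() // idempotent: calling Close more than once is now a no-op
	return nil
}

func main() {
	r := &safeWaitReadCloser{Reader: strings.NewReader("hi"), wait: make(chan struct{})}
	io.ReadAll(r) // read until EOF
	r.Close()
	r.Close() // second Close no longer panics
}
```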

Thoughts, @magik6k?

@nonsense

We were reading up until EOF previously as well. I think the reason this is crashing now is that the JSON-RPC client is closing the stream, as @raulk suggests, and we had not previously tested this with a separate lotus-worker.
