Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node being stucked #1157

Closed
rjonczy opened this issue Aug 25, 2023 · 9 comments
Closed

Node being stucked #1157

rjonczy opened this issue Aug 25, 2023 · 9 comments

Comments

@rjonczy
Copy link

rjonczy commented Aug 25, 2023

Issue from archived repo: paritytech/cumulus#3052

Sometimes, our parachain node gets stucks at some block.

Parachain, related blocks looks like this:

2023-08-23 12:35:15.944  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (21 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 0.6kiB/s ⬆ 1.0kiB/s  
2023-08-23 12:35:20.944  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (21 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 3.0kiB/s ⬆ 0.2kiB/s  
2023-08-23 12:35:25.945  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 0.3kiB/s ⬆ 0.3kiB/s  
2023-08-23 12:35:30.945  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 0.2kiB/s ⬆ 0.2kiB/s  
2023-08-23 12:35:35.946  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 174.2kiB/s ⬆ 0.2kiB/s
2023-08-23 12:35:40.946  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 16.4kiB/s ⬆ 1.0kiB/s 
2023-08-23 12:35:45.947  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 3.0kiB/s ⬆ 0.2kiB/s  
2023-08-23 12:35:50.947  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 0.2kiB/s ⬆ 0.2kiB/s  
2023-08-23 12:35:55.947  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 6.3kiB/s ⬆ 0.2kiB/s  
2023-08-23 12:36:00.948  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 0.4kiB/s ⬆ 0.7kiB/s  
2023-08-23 12:36:05.948  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 75.6kiB/s ⬆ 0.2kiB/s 
2023-08-23 12:36:10.949  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (22 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 121.3kiB/s ⬆ 0.2kiB/s
2023-08-23 12:36:15.949  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (21 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 0.4kiB/s ⬆ 0.8kiB/s  
2023-08-23 12:36:20.950  INFO tokio-runtime-worker substrate: [Parachain] ⚙️  Preparing  0.0 bps, target=#3090498 (21 peers), best: #3089160 (0xeac1…b76e), finalized #3084883 (0x372a…d05f), ⬇ 3.0kiB/s ⬆ 0.2kiB/s  
@rjonczy
Copy link
Author

rjonczy commented Aug 25, 2023

@skunert - here are logs with tracing on sync enabled

picasso_node-03.log.gz

@skunert
Copy link
Contributor

skunert commented Aug 25, 2023

Hmm sadly this log is not too interesting, it does not show the start of the state sync.

But some things I noticed:

@altonen
Copy link
Contributor

altonen commented Aug 25, 2023

There's definitely something wrong but this looks like a different issue from the one that was originally reported since it's getting stuck at Downloading state and not Preparing.

@rjonczy could you capture -lsync=trace logs over a longer period of time, preferably before the node gets stuck if possible? As @skunert suggested, check if --sync warp works without getting stuck

@blasrodri
Copy link

@skunert which version would you recommend us to upgrade, coming from v0.9.39?

@skunert
Copy link
Contributor

skunert commented Aug 25, 2023

I mean its up to you how close to the tip you want to go. The fixes I mentioned in my earlier response should be included in v0.9.43 and v0.9.44, see also this comment. This should at least get rid of the issue initially described in this issue (Preparing 0.0 bps).

@rjonczy
Copy link
Author

rjonczy commented Aug 29, 2023

@skunert @altonen thanks guys. I've wiped up data and starting to synch from scratch with:

--sync warp for relaychain
--sync fast for parachain

Full command to run node:

/nix/store/bgwwc65f5b7j2zxrcpzdx37z4i4bqyji-composable/bin/composable \
-lsubstrate_basic_authorship=trace,txpool=trace,sync::import-queue=trace,sync=trace,txpool::api=trace \
--chain=picasso \
--name=composable-shared-artifacts-centauri-cosmos-prod-03 \
--listen-addr=/ip4/0.0.0.0/tcp/40334 \
--prometheus-external \
--prometheus-port 19615 \
--base-path /data \
--execution=wasm \
--pruning=100800 \
--database=paritydb \
--sync=fast \
--ws-external \
--rpc-external \
--unsafe-rpc-external \
--rpc-methods=unsafe \
--ws-port 8844 \
--unsafe-ws-external \
--rpc-cors=all \
--rpc-port 8833 \
--in-peers 1000 \
--out-peers 1000 \
--log=ibc_transfer=trace,pallet_ibc=trace \
--offchain-worker=always \
--enable-offchain-indexing=true \
--ws-max-connections 10000 \
--rpc-max-subscriptions-per-connection 10240 \
-- \
--execution=wasm \
--listen-addr=/ip4/0.0.0.0/tcp/40333 \
--ws-external \
--rpc-external \
--unsafe-rpc-external \
--rpc-methods=unsafe \
--ws-port 8845 \
--rpc-port 8834 \
--pruning=201600 \
--database=paritydb \
--sync=warp \
--ws-max-connections 10000 \
--rpc-max-subscriptions-per-connection 10240 \
--unsafe-ws-external \
--name=composable-shared-artifacts-centauri-cosmos-prod-03 \
--rpc-cors=all

I will get back with more logs / results later.

Btw, which engine do you recommend to run? paritydb or rocksdb ? Or are there some cases where partidb is more accurate than rocksdb? or opposite?

@skunert
Copy link
Contributor

skunert commented Aug 29, 2023

@rjonczy The command in your comment still shows --sync=fast for the parachain.

Afaik ParityDB should be a bit faster, is considered stable and will be the default in the future. However, there is no difference in correctness or something, both are supported and should work well.

@rjonczy
Copy link
Author

rjonczy commented Aug 30, 2023

@skunert sorry, initially I misunderstood your comment.

When I run node with both --sync=warp for parachain and relaychain, I am getting this error then:

picasso_node       | 2023-08-30 11:37:55.939  INFO main sc_sysinfo: [Relaychain] 💻 Target environment: gnu                                                         [15/4949]
picasso_node       | 2023-08-30 11:37:55.939  INFO main sc_sysinfo: [Relaychain] 💻 CPU: Intel(R) Xeon(R) CPU @ 2.20GHz                                                      
picasso_node       | 2023-08-30 11:37:55.939  INFO main sc_sysinfo: [Relaychain] 💻 CPU cores: 8                                                                             
picasso_node       | 2023-08-30 11:37:55.939  INFO main sc_sysinfo: [Relaychain] 💻 Memory: 64290MB                                                                          
picasso_node       | 2023-08-30 11:37:55.939  INFO main sc_sysinfo: [Relaychain] 💻 Kernel: 6.2.0-1011-gcp                                                                   
picasso_node       | 2023-08-30 11:37:55.939  INFO main sc_sysinfo: [Relaychain] 💻 Virtual machine: yes                                                                     
picasso_node       | 2023-08-30 11:37:55.940  INFO main sc_service::builder: [Relaychain] 📦 Highest known block at #0                                                       
picasso_node       | 2023-08-30 11:37:55.940  INFO main sc_rpc_server: [Relaychain] Running JSON-RPC HTTP server: addr=0.0.0.0:8834, allowed origins=["*"]                   
picasso_node       | 2023-08-30 11:37:55.940  INFO main sc_rpc_server: [Relaychain] Running JSON-RPC WS server: addr=0.0.0.0:8845, allowed origins=["*"]                     
picasso_node       | 2023-08-30 11:37:55.940  INFO tokio-runtime-worker substrate_prometheus_endpoint: [Relaychain] 〽️ Prometheus exporter started at 127.0.0.1:9616         
picasso_node       | 2023-08-30 11:37:55.945  INFO tokio-runtime-worker parachain::approval-voting: [Relaychain] Starting with an empty approval vote DB.                    
picasso_node       | 2023-08-30 11:37:55.946 DEBUG tokio-runtime-worker sync::import-queue: [Parachain] Stopping block import because result channel was closed!             
picasso_node       | 2023-08-30 11:37:55.946 ERROR tokio-runtime-worker sc_service::task_manager: [Parachain] Essential task `basic-block-import-worker` failed. Shutting dow
n service.                                                                                                                                                                   
picasso_node       | 2023-08-30 11:37:55.954 ERROR tokio-runtime-worker overseer: [Relaychain] Overseer exited with error err=QueueError(SendError { kind: Disconnected })   
picasso_node       | 2023-08-30 11:37:55.954 ERROR tokio-runtime-worker sc_service::task_manager: [Relaychain] Essential task `overseer` failed. Shutting down service.      
picasso_node       | Error: 
picasso_node       |    0: Other: Warp sync enabled, but no warp sync provider configured.
picasso_node       | 
picasso_node       |   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
picasso_node       |                                 ⋮ 3 frames hidden ⋮                               
picasso_node       |    4: composable::main::ha8288304ff5363d1
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       |    5: std::sys_common::backtrace::__rust_begin_short_backtrace::h805e5c944ae35c2e
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       |    6: std::rt::lang_start::{{closure}}::h0da2b81be274c3c4
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       |    7: std::rt::lang_start_internal::h0a3e2dc9776cccd4
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       |    8: main<unknown>
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       |    9: __libc_start_call_main<unknown>
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       |   10: __libc_start_main@@GLIBC_2.34<unknown>
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       |   11: _start<unknown>
picasso_node       |       at <unknown source file>:<unknown line>
picasso_node       | 
picasso_node       | Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
picasso_node       | Run with RUST_BACKTRACE=full to include source snippets.

@skunert
Copy link
Contributor

skunert commented Aug 30, 2023

This is interesting, parachain warp sync was introduced with v0.9.39 but seems to be disabled for this collator implementation, not sure why:
https://github.com/composablefi/composable/blob/eb2d88f274f2e49f9eb22a6e13ecbba6f6cf13fc/code/parachain/node/src/service.rs#L332

Sorry, I was not aware of that.

bkchr pushed a commit that referenced this issue Apr 10, 2024
* fix/ignore clippy

* --locked check && test

* RUSTC_WRAPPER="" for clippy-nightly, check-nightly, test-nightly
@bkchr bkchr closed this as completed May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants