Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Client occasionally stops importing new blocks #11758

Closed
mdben1247 opened this issue Jun 3, 2020 · 78 comments
Closed

Client occasionally stops importing new blocks #11758

mdben1247 opened this issue Jun 3, 2020 · 78 comments
Labels
F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust. P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible.

Comments

@mdben1247
Copy link

  • OpenEthereum version: v3.0.0-stable-b079d1746
  • Operating system: Linux
  • Installation: built from source
  • Fully synchronized: yes
  • Network: ethereum
  • Restarted: yes

Openethereum occasianally (1-3x per month) stops importing new blocks, simply goes silent without producing any error. Then we issue SIGTERM and we get "Finishing work, please wait..." message. But it does not finish in many minutes... so we kill the process with SIGKILL. Upon restarting everything works normally, until next such 'freeze'.

We first noticed this behaviour in Parity 2.7.2 several months ago and it is now present with Openethereum 3.0.0. We noticed this independently on two different hardware configurations, one fast synced, the other fatdb synced from scratch. There is a similar issue reported in #11737 and also #11539, but the latter is linked to slow block import. That is not the case here. Except for the 'freezes', everything is working responsively, fast and overall very well, block imports are very fast.

Any suggestions on how to debug this issue much appreciated. Thanks.

@dvdplm
Copy link
Collaborator

dvdplm commented Jun 4, 2020

This sounds concerning and debugging it seems non-trivial. The best thing would be a backtrace of what all the threads are doing when the node freezes up, which requires attaching gdb to the process when it's is in the bad state and executing thread apply all bt.
One way of maybe collecting a hint on what is going on could be to run the node with -l shutdown and see what is logged after SIGTERM: I'd expect you to see messages from the various subsystems shutting down (until one of them don't; try comparing what you see on a non-stuck node to a stuck one).
You could also turn on logging for the sync subsystem with -l sync=debug but that will produce a LOT of logs and might fill up your disk if you need it on for a month.

Can you post your configuration please, it might help us reproduce.

@mtbitcoin
Copy link

we are also experiencing similar issues after the upgrade to ver 3.0

@mdben1247
Copy link
Author

dvdplm, thanks for suggestions, will try all that and report back.

Attaching a config.toml from current hardware with 64gb ram that is fatdb synced. Previous hardware (that also exhibited the same problem) was fast synced and had 32gb, so configuration had smaller caches sizes, but otherwise it was similar.

config.toml.txt

@trailtacos
Copy link

@dvdplm There is a thread backtrace on https://github.com/openethereum/openethereum/issues/11737 which seems like the same issue as this one.

@palkeo
Copy link

palkeo commented Jun 5, 2020

I see the same issue, it matches the 1-3x per month originally reported. I saw it happen on parity 2.7.2 and saw it again on openethereum 3.0.0.

I noticed this only on nodes with tracing enabled, and #11737 also has tracing. But @Marko429 you don't have it do you?

@palkeo
Copy link

palkeo commented Jun 5, 2020

@dvdplm Could you maybe develop on what you would want exactly from gdb? Only the output of "threads apply all bt" ? I tried it once but it only spits out addresses as there is no debug symbols. That's still helpful to you ?
I need to know in advance to I can run that quickly before restarting the node if I see it happen again.

@mdben1247
Copy link
Author

@palkeo, I'll make the gdb trace the next time it happens. Can you please share your cache-sizes and ram size?

@palkeo
Copy link

palkeo commented Jun 6, 2020

@palkeo, I'll make the gdb trace the next time it happens. Can you please share your cache-sizes and ram size?

My cache size are here: https://github.com/openethereum/openethereum/issues/11735
Ram size is 16G.

@nysxah
Copy link

nysxah commented Jun 8, 2020

Experiencing both slow block import and periodic freeze on both full/fast and archive nodes since 2.7.2, tracing enabled.

2.5.13 runs fine.

#11737 / #11494 — same behavior.

Could this be because of “database format changes” introduced in 2.7.2?

@dvdplm
Copy link
Collaborator

dvdplm commented Jun 8, 2020

I tried it once but it only spits out addresses as there is no debug symbols. That's still helpful to you ?

You are right, without the debug symbols in there it's not useful. My bad.

@peterbitfly
Copy link

We are also experiencing those issues since version 2.7.2 on multiple independent nodes.

@denisgranha
Copy link

I can confirm this happens, without more data points / metrics we are blind to diagnose, @adria0 is working on having prometheus metrics and that might add more light here

@palkeo
Copy link

palkeo commented Jun 12, 2020

Is there anything you can think of that could be useful to you to troubleshoot it?
Maybe providing a debug build (with symbols and anything else that could help) could help in getting you interesting core dumps and backtraces? Any logging option you think could help?

@trailtacos
Copy link

Is there anything that is helpful for debugging while it is stuck ? I have a stuck node right now ...

@vorot93 vorot93 added F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust. P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible. labels Jun 29, 2020
@vorot93 vorot93 pinned this issue Jun 29, 2020
@PavelNiedoba
Copy link

I had this problem too, v3.0.1,

/opt/openethereum
--warp
--jsonrpc-port=60007
--logging="rpc=trace"
--accounts-refresh=0
--db-path=/disk600/ethereumBlockchain
--jsonrpc-server-threads=32
--jsonrpc-apis="all"
--log-file=/home/parity/parity.log

ssd disk, serverrom with 1GBps internet, restart didn't help, reboot didn't help. Before that RPC got really slow, 10 sec for account get balance, had about 3500 accounts

to recover, I started over, deleted accounts, whole blockchain, it failed to sync snapshots,

2020-06-29 19:17:49 IO Worker #0 INFO import Syncing snapshot 3664/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:17:50 IO Worker #2 INFO import Syncing snapshot 3665/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:17:57 IO Worker #3 INFO import Syncing snapshot 3670/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:01 IO Worker #2 INFO import Syncing snapshot 3673/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:05 IO Worker #1 INFO import Syncing snapshot 3675/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:10 IO Worker #3 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:15 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:20 IO Worker #3 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:25 IO Worker #2 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:30 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:35 IO Worker #0 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:40 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:45 IO Worker #3 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs 2020-06-29 19:18:50 IO Worker #1 INFO import Syncing snapshot 3676/5112 #0 15/25 peers 920 bytes chain 3 MiB db 0 bytes queue 298 KiB sync RPC: 0 conn, 0 req/s, 0 µs

very anoying, business down second day, switching to geth

@mdben1247
Copy link
Author

Had the issue again. Attaching openethereum shutdown log and gdb output.

gdb.txt
shutdown.txt

@rakita
Copy link

rakita commented Jul 1, 2020

Thank you @Marko429 !

I will analyze it more later, but till then what I can see now is that threads 46,47,48,49 get stuck on SyncSupplier::dispatch_packet. Four threads correspond to our four Sync/Network IoWorkers.

Threads 24,25,26,27 are Client IoWorkers.

Thread 39 is last one that is deadlocked, it is called from Importers thread.

@PavelNiedoba
Copy link

I did re-sync again, no hardware changes, no software changes. Snapshot got synced. Now syncing blocks. I suspect there is race condition dead lock problem.

@rakita
Copy link

rakita commented Jul 1, 2020

Putting link to important gdb log: https://github.com/openethereum/openethereum/issues/11737#issuecomment-637042123
Just to consolidate information that we have

From this log I can see that Network/Sync IoWorker threads 33,34,35,36 are stuck

And Client IoWorkers are:
Thread 22, 20 calls Informant timer and tries to lock
Thread 21 calls EthSync transactions_received
Thread 19 calls informant on timer but this time it locks mutex and it is calling EthSync ethcore_sync::api::SyncProvider::status

@trailtacos
Copy link

@rakita One thing there is in #11737 as well is a full debug logging up to the point of the hang - not sure if you have seen that (https://github.com/openethereum/openethereum/issues/11737#issuecomment-649679781) - I'm also happy to do any debugging on our node when it hangs again, or enable any extra logging you would like to try and track this issue down.

@rakita
Copy link

rakita commented Jul 1, 2020

@CrispinFlowerday I did see it and thank you for both gdb and debug log!

@rakita
Copy link

rakita commented Jul 2, 2020

Compiling the data:

  • In both gdb logs i can see that All four Network IoWorkers that receive msg from peer want to do ReadLock , probably on sync: &RwLock<ChainSync>, from supplier::SyncSupplier::dispatch_packet() function.

  • In the first log there are four Client IoWorkers and only one in the second log, thay want to do WriteLock by calling .sync.write(); on ChainSync, from function (EthSync as client_traits::ChainNotify)::transactions_received

  • In first log we can see that we want to WriteLock in (EthSync as client_traits::ChainNotify)::new_blocks by calling sync.write() on ChainSync. This call was not from IoWorker but from Importer::import_verified_blocks thread.

  • In second log there are left three IoWorkers and all of them are calling Informant. The first informant locks its internal mutex and goes on while the other two waits for first to release mutex. the first informant after locking internal mutex asks for ReadLock by calling .sync.read().status() on ChainSync from (EthSync as ethcore_sync::api::SyncProvider)::status

@trailtacos
Copy link

I can confirm our node is working fine as well with the rakita/11758_use_std_rwlock branch. I have also now seen a hang of a ropsten node which hasn't occurred for us previously (using the release code - although right now I'm not sure exactly what release we are using there)

@rakita
Copy link

rakita commented Aug 21, 2020

@CrispinFlowerday and @Marko429 thank you both for the help!

@cogmon
Copy link

cogmon commented Aug 24, 2020

Both our OpenEthereum nodes (16 core, 16GB memory, RAID-5 SSD Pool, Archiving+Tracing) are stuck since ~3 weeks.

Even restarting doesn't solve the problem anymore. The nodes are utilizing the SSD pool and just do nothing. Please raise the priority for this issue. OpenEthereum is pretty much useless at this point.

Edit:

std_rwlock version has been running for about three weeks now without an issue.

Oh and unfortunately this did not help in our case. We are now running one node with this rakita/11758_use_std_rwlock branch and one with the last official release (v3.0.1).

Happy to provide logs if it helps.

@adria0 adria0 unpinned this issue Aug 24, 2020
@rakita
Copy link

rakita commented Aug 24, 2020

Hello @cogmon if a node does not recover after restart that means this is a different problem.

On that note, we did notice a lot of regression introduced with big changes in 2.7.x, and 3.0.x is mostly based on 2.7.x with rebranding. In next release, we decided to move back to stable 2.5.x and make it ready for berlin fork. #11858

@ghost
Copy link

ghost commented Aug 24, 2020

Hi There, thanks for the info. I have started Parity v2.5.13 and it does not find any peers:

2020-08-24 12:43:11 UTC Starting Parity-Ethereum/v2.5.13-stable-253ff3f-20191231/x86_64-linux-gnu/rustc1.40.0
2020-08-24 12:43:11 UTC Keys path /data/parity/base/keys/ethereum
2020-08-24 12:43:11 UTC DB path /data/parity/db/ethereum/db/906a34e69aec8c0d
2020-08-24 12:43:11 UTC State DB configuration: fast +Trace
2020-08-24 12:43:11 UTC Operating mode: active
2020-08-24 12:43:11 UTC Configured for Ethereum using Ethash engine
2020-08-24 12:43:11 UTC Removed existing file '/data/parity/base/jsonrpc.ipc'.
2020-08-24 12:43:12 UTC Updated conversion rate to Ξ1 = US$406.73 (11707778 wei/gas)
2020-08-24 12:43:16 UTC Public node URL: enode://xxxxxxxxxxxxxx@<ip>:30303
2020-08-24 12:43:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs

Is there any special requirement to run this version?

@AlexSSD7
Copy link

@g574, same here

@islishude

This comment has been minimized.

@adria0
Copy link

adria0 commented Aug 24, 2020

Hi There, thanks for the info. I have started Parity v2.5.13 and it does not find any peers:

2020-08-24 12:43:11 UTC Starting Parity-Ethereum/v2.5.13-stable-253ff3f-20191231/x86_64-linux-gnu/rustc1.40.0
2020-08-24 12:43:11 UTC Keys path /data/parity/base/keys/ethereum
2020-08-24 12:43:11 UTC DB path /data/parity/db/ethereum/db/906a34e69aec8c0d
2020-08-24 12:43:11 UTC State DB configuration: fast +Trace
2020-08-24 12:43:11 UTC Operating mode: active
2020-08-24 12:43:11 UTC Configured for Ethereum using Ethash engine
2020-08-24 12:43:11 UTC Removed existing file '/data/parity/base/jsonrpc.ipc'.
2020-08-24 12:43:12 UTC Updated conversion rate to Ξ1 = US$406.73 (11707778 wei/gas)
2020-08-24 12:43:16 UTC Public node URL: enode://xxxxxxxxxxxxxx@<ip>:30303
2020-08-24 12:43:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  166 µs
2020-08-24 12:44:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:45:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:16 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs
2020-08-24 12:46:46 UTC    0/25 peers   832 bytes chain  512 KiB db  0 bytes queue 448 bytes sync  RPC:  0 conn,    0 req/s,  131 µs

Is there any special requirement to run this version?

Yes @g574, Parity 2.5.13 has harcoded list of bootnodes, but parity tech are not running them. You can switch to the Gnosis maintained bootnodes by using

--bootnodes enode://68f46370191198b71a1595dd453c489bbfe28036a9951fc0397fabd1b77462930b3c5a5359b20e99677855939be47b39fc8edcf1e9ff2522a922b86d233bf2df@144.217.153.76:30303,enode://ffed6382e05ee42854d862f08e4e39b8452c50a5a5d399072c40f9a0b2d4ad34b0eb5312455ad8bcf0dcb4ce969dc89a9a9fd00183eaf8abf46bbcc59dc6e9d5@51.195.3.238:30303

this should work.

@ghost
Copy link

ghost commented Aug 24, 2020

@adria0 Thanks a lot, it works.

@quietnan
Copy link

@islishude see #11858.

@mtbitcoin
Copy link

On that note, we did notice a lot of regression introduced with big changed in 2.7.x, and 3.0.x is mostly based on 2.7.x with rebranding. In next release, we decided to move back to stable 2.5.x and make it ready for berlin fork. #11858

@rakita 2.5.13 works fairly, so that looks like a good a strategy

@nysxah
Copy link

nysxah commented Aug 24, 2020

we're running 2.5.13 where possible as well, so yes, looking forward to it 👍

@ymonye
Copy link

ymonye commented Aug 24, 2020

I'm not at all surprised, yet somehow now relieved that my issue is no longer CentOS 7 related. I'd eventually closed the old issue after adding the --reserved-peers nodes.txt argument to sync to my local Go-Ethereum archive node.

https://github.com/openethereum/openethereum/issues/11297

@matkt
Copy link

matkt commented Aug 25, 2020

@BurtonQin yes, that is a big difference between parking_lot and glibc and musl that is effecting how app works. Both glibc and musl I think favors read over write lock, while parking_lot favors write over read. I think this is good explanation on this topic for others: link.

But why do I think that problem is with parking lot and not with who favors what is the test we did and logs we added (branch).
The problem where somebody takes read and does not release it is something I expected to see with test logs and parking_lot, but a strange thing happened, and we got an empty list. With this, I can see that nobody has locked that mutex (output) while some threads are waiting to lock the write lock and we got deadlock. There are two assumptions with this output, one is, logs that I put are not covering some case, and there is a bug in parking_lot. Artem additionally checked the code and couldn't find anything wrong and this is why my conclusion is that this is a problem with parking_lot.

It looks like this behavior? https://docs.rs/parking_lot/0.11.0/parking_lot/type.RwLock.html

"This lock uses a task-fair locking policy which avoids both reader and writer starvation. This means that readers trying to acquire the lock will block even if the lock is unlocked when there are writers waiting to acquire the lock. Because of this, attempts to recursively acquire a read lock within a single thread may result in a deadlock. "

@rakita
Copy link

rakita commented Aug 25, 2020

@matkt, to be honest, I expected something like that when I first looked at parking_lot, but when I looked at gdb logs and code this didn't fit. It was a strange conclusion, and that is why we added logs after every lock to try to deduce who is last one that locks the lock and we got empty list.

@ngotchac
Copy link
Contributor

I started running a node with https://github.com/openethereum/openethereum/pull/11769 and https://github.com/openethereum/openethereum/pull/11766 in, and I haven't seen the issue for the past ~30h (whereas it would occur multiple times per day before) ; it might be worth merging the PRs.

@rakita
Copy link

rakita commented Aug 27, 2020

Hello @ngotchac,
#11769 can potentially happen only if there are calls to parity_addReservedPeer/parity_removeReservedPeer API. I wrote bigger reasoning in the Issues.
#11766 is tied to restore_db that only happens when warping finishes.
Both of these commits do not explain how they can fix this issue or what gdb showed us. And information that for you this issue happens multiple times a day is inconsistent from what multiple users reported and from what I saw with testing.

@ngotchac
Copy link
Contributor

You might be right, I haven't looked to much into the PRs. However it might still be worth it to merge them if the code is OK, so that other people can try out.
Regarding the issue arising multiple times per day, it might be related to the fact that I'm connected to many peers (~500), which might trigger a deadlock more often than for other nodes.

@mdben1247
Copy link
Author

std_rwlock version is still working without fault. It's been almost a month now. I'd wager very strongly now that this issue is fixed by switching to standard rust locks.

@mdben1247
Copy link
Author

@CrispinFlowerday, is std_rwlock still working well for you? We are still running without a single lock up. I don't think we ever run this long since 2.7 has been out. It seems to me, the issue is solved.

@trailtacos
Copy link

Yep, still going strong for us - been running since July 29th with no issues.

@edobry
Copy link

edobry commented Sep 10, 2020

@rakita could we please consider merging these PRs and cutting a version? It would be great to reduce the uncertainty about the next version, specifically whether we'll have to do a full re-sync.

@vorot93
Copy link

vorot93 commented Sep 10, 2020

Please use 2.5 if this issue affects you. Next release (3.1) will be based on 2.5 which is not affected.

@vorot93 vorot93 closed this as completed Sep 10, 2020
@edobry
Copy link

edobry commented Sep 10, 2020

@vorot93 has this been 100% decided? and if so, is there any word on a downward migration script? considering the concerns expressed in #11858

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust. P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible.
Projects
None yet
Development

No branches or pull requests