Node stops syncing for a while with err="shutting down" spam in logs #24623
Thanks for the help. Moving to other software. |
You opened this issue on a Friday; today is Tuesday. Your node was synced before the log spam, so it shut down the downloader. But your node seems to be falling behind with block processing, so it starts the downloader again, which sees that the downloader is shutting down and thus prints the error. |
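As a rough illustration of the lifecycle described above, here is a minimal Go sketch; the names (Downloader, errShuttingDown, Synchronise) are stand-ins chosen for the example, not the actual go-ethereum code:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// errShuttingDown mimics the error a downloader-like component returns
// once its teardown has begun (the name is chosen for illustration only).
var errShuttingDown = errors.New("shutting down")

// Downloader is a toy stand-in: once Stop is called, every further
// Synchronise attempt fails immediately with errShuttingDown.
type Downloader struct {
	quit chan struct{}
}

func NewDownloader() *Downloader { return &Downloader{quit: make(chan struct{})} }

func (d *Downloader) Stop() { close(d.quit) }

func (d *Downloader) Synchronise() error {
	select {
	case <-d.quit:
		return errShuttingDown
	default:
		// ... real sync work would happen here ...
		return nil
	}
}

func main() {
	d := NewDownloader()
	d.Stop() // node considered synced, downloader torn down

	// The node later notices it is falling behind and keeps retrying,
	// producing repeated "Synchronisation failed, retrying" lines.
	for i := 0; i < 3; i++ {
		if err := d.Synchronise(); err != nil {
			log.Printf("Synchronisation failed, retrying err=%q", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```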
Apologies, I worked all weekend, so it felt like a long time. The logs I pasted are on the mild side; sometimes it gets 10 minutes behind. Do you think this could be a lack of processing power? Is there a recommended AWS instance type? |
We usually use |
I'm on an m5.2xl ... |
What are the exact command line parameters used for starting the node? |
We fixed a similar issue earlier, but that doesn't appear to be the same cause: #24202 (comment). If you can obtain a stack trace from when it is in this state (e.g. |
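One low-impact way to grab a full goroutine dump from a running node, assuming geth was started with --pprof (which serves Go's standard pprof handlers on 127.0.0.1:6060 by default), is to pull the goroutine profile over HTTP; attaching a console and calling debug.stacks() is an alternative. A minimal sketch, with the stock address and port assumed:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Dumps all goroutine stacks from a geth node started with --pprof.
// Assumes the default pprof listen address (127.0.0.1) and port (6060);
// the path is the standard net/http/pprof goroutine endpoint.
func main() {
	resp, err := http.Get("http://127.0.0.1:6060/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatalf("fetching goroutine dump: %v", err)
	}
	defer resp.Body.Close()

	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatalf("writing goroutine dump: %v", err)
	}
}
```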
Will work on getting the stacks to you ASAP. |
gethlogs.zip |
gethlogs1 has the stacks. One example:
So it's waiting for the fetcher to deliver something. It indicates that something is amiss in the dispatcher, similar to the bug fixed here |
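Reduced to a toy example, a goroutine stuck like that looks roughly as follows; this is not the actual dispatcher code, just an illustration of a requester parked on a channel whose delivery never comes:

```go
package main

import (
	"fmt"
	"time"
)

// request pairs an id with the channel its response must be delivered on.
type request struct {
	id   int
	resp chan string
}

func main() {
	reqs := make(chan request)

	// A dispatcher that (buggily) drops every request instead of
	// delivering a response on req.resp; nobody ever sends on it.
	go func() {
		for req := range reqs {
			_ = req // response never delivered
		}
	}()

	req := request{id: 1, resp: make(chan string)}
	reqs <- req

	// The requester now waits for a delivery that never comes; in a real
	// stack dump this shows up as a goroutine parked on a channel receive.
	select {
	case r := <-req.resp:
		fmt.Println("got response:", r)
	case <-time.After(2 * time.Second):
		fmt.Println("timed out waiting for the fetcher to deliver")
	}
}
```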
Does the CPU usage look high too? Or is that normal? |
Original error report was:
IMO this is fixed by #24652, so it should indeed be closed. |
No worries, thanks! |
Can't say much about that CPU usage. Looks a bit on the high side, unless it's a multi-core machine and e.g. 800% is the max. |
I'm seeing the exact same thing. Here's my setup:

My observations are that the problem develops over time. It will run fine for about a week and then slowly gets worse and worse. I set up metrics collection using Prometheus/Grafana to get a better idea of what is going on, which you can see below. The problem seems to be correlated with db compaction: during that time geth basically freezes. It stops providing metrics, stops responding to RPC calls, drops peers, etc.

I'm going to try moving my swap file to the Samsung, since SD cards generally have very poor performance. |
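For a quick spot check without a full Prometheus/Grafana stack, here is a small Go sketch that pulls compaction-related lines straight from geth's metrics endpoint. It assumes geth was started with --metrics --pprof so the Prometheus-format endpoint is exposed on the default 127.0.0.1:6060; adjust the address if your setup differs, and note that exact metric names vary between geth versions.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// Prints compaction-related metrics from a geth node's metrics endpoint.
// Assumes geth was started with --metrics --pprof, which (with stock
// settings) serves Prometheus-format metrics on 127.0.0.1:6060.
func main() {
	resp, err := http.Get("http://127.0.0.1:6060/debug/metrics/prometheus")
	if err != nil {
		log.Fatalf("fetching metrics: %v", err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		// Keep only lines mentioning compaction (metric names differ
		// between geth versions, so filter rather than hard-code them).
		if strings.Contains(line, "compact") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("reading metrics: %v", err)
	}
}
```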
When LevelDB starts compacting - if enough load piles up - it essentially blocks all database reads, eventually causing all subsystems to lock up. I think in general you should avoid using a swap file at all; it usually just makes things significantly worse, since you're using an already overloaded disk for memory accesses. Oh, and just don't use an SD card for anything. They are insanely slow. |
Thanks! I've been thinking about disabling the swap file, but was looking for some confirmation from someone with more expertise. |
@karalabe Just thought I would report back. I followed your suggestion to disable the swap file, but after about 24 hours geth crashed with a simple "Killed" message on the screen. Couldn't find anything more detailed in journalctl. So I put the swap file back on. I can file a separate issue if you want. |
msg="synchronisation failed,retrying" err= "shutting down"maybe because the other nodes are unstable? |
Did you fix it? |
System information
Geth version: 1.10.16-stable
OS & Version: Linux
Expected behaviour
Node to stay synced with the latest block
Actual behaviour
The node will be syncing happily.
Then, regularly, something is causing my node to go crazy, spamming
Synchronisation failed, retrying err="shutting down"
1-3000 times, then it falls behind. To reduce the spam in the example log below, I've changed the spammed lines to ... REPEAT X TIMES ...
Steps to reproduce the behaviour
Synced 1.10.16-stable on a fresh aws2 instance.