nodeos 3.x/4.x SHiP stops serving blocks #1358
Yes, I had 4.0.3 SHiP on EOS for this. Our EOS-EVM stopped syncing when eosusa disconnected the AA reader from our SHiP.
Do you have any process to repeatably reproduce this? I've been trying various combinations of connecting/disconnecting/killing clients and have not been able to reproduce this.
I don't have steps to reproduce. It only happened once for me, that I know of.
Keeping this open pending reproducible steps, but for now we can't reproduce the issue in order to fix it.
Just tried the same reconnection process with our Atomic indexer and Nation's StateHist node and it did NOT experience the same issue this time. So it's definitely not something easily reproducible, but I experience it from time to time on our internal nodes, so I will keep an eye out for it and reach out if/when I see it again (and try to leave it broken so we can troubleshoot it).
We have been finding similar issues.
In both scenarios, the SHIP reader has a healthy connection to nodeos but does not receive any blocks. The issue seems to be related to asking for the ABI, as that is the last call the SHIP reader makes. Potentially this is what nodeos gets stuck on, not syncing any blocks and caught in some loop? We are continuing to investigate and narrow down this issue, but I've been able to replicate both situations: nodeos refuses to sync until the websocket connection is terminated, then instantly recovers as if nothing happened. I will ask @pnx if he can add any further information, as he has investigated the network traffic between the client (SHIP reader) and nodeos.
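For reference, my understanding of the SHiP session flow (useful when reasoning about where it wedges): on connect, nodeos pushes the state_history ABI as the first text frame, and the reader then sends a binary get_blocks_request. A minimal sketch, assuming the Python `websocket-client` package; the hand-rolled serialization follows the state_history ABI as shipped with nodeos, but verify it against the ABI your node actually sends:

```python
import struct
import websocket  # pip install websocket-client

def serialize_get_blocks_request(start, end, max_in_flight):
    # varuint32 variant index 1 = get_blocks_request_v0, then the struct
    # fields in ABI order: three little-endian uint32s, an empty
    # have_positions vector, irreversible_only=false, and
    # fetch_block/fetch_traces/fetch_deltas=true.
    return (b"\x01"
            + struct.pack("<III", start, end, max_in_flight)
            + b"\x00"            # have_positions: empty vector
            + b"\x00"            # irreversible_only = false
            + b"\x01\x01\x01")   # fetch_block, fetch_traces, fetch_deltas

ws = websocket.create_connection("ws://127.0.0.1:8889")  # state-history-endpoint

abi_json = ws.recv()  # nodeos sends the state_history ABI first, as text

ws.send_binary(serialize_get_blocks_request(1, 0xFFFFFFFF, 0xFFFFFFFF))

while True:
    frame = ws.recv()  # binary get_blocks_result frames; in the failure
    print(len(frame))  # mode described here, this call simply never returns
```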
The issue seems to be (in this case) that a call to … So I believe that one (or both) of these events causes nodeos to hang.
@coachbjork and @pnx Thank you for documenting your findings related to this issue. Based on the information you shared, it appears to be an intermittent problem related to SHiP and ABI requests. To effectively address it, we'll need to: … All in an effort to determine the root cause of some requests causing nodeos to hang.
Got this issue again while running thalos. nodeos still has connections to peers but won't respond to API requests and won't receive any blocks. When the client (thalos) disconnects from the SHiP websocket, everything starts working again. Config:

block-log-retain-blocks = 1209600
state-history-log-retain-blocks = 1209600
# Antelope 4 settings
read-only-read-window-time-us = 50001
max-transaction-time = 30
# Database Mode
database-map-mode = mapped_private
#blocks-dir = "blocks"
state-dir = "state-dir"
# Endpoints
agent-name = "shufflingpixels_ship1"
p2p-listen-endpoint = 0.0.0.0:8887
http-server-address = 0.0.0.0:8888
state-history-endpoint = 0.0.0.0:8889
p2p-accept-transactions = false
api-accept-transactions = false
wasm-runtime = eos-vm-jit
eos-vm-oc-compile-threads = 2
eos-vm-oc-enable = 1
chain-threads = 2
validation-mode = light
chain-state-db-size-mb = 130072
resource-monitor-space-threshold = 99
contracts-console = false
read-mode = head
access-control-allow-origin = *
access-control-allow-headers = Origin, X-Requested-With, Content-Type, Accept
http-max-response-time-ms = -1
abi-serializer-max-time-ms = 990000
verbose-http-errors = true
http-validate-host = false
max-clients = 20
sync-fetch-span = 1000
# Plugins
trace-history = true
chain-state-history = true
plugin = eosio::http_plugin
plugin = eosio::state_history_plugin
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin
plugin = eosio::net_plugin
plugin = eosio::net_api_plugin
# waxswedenorg: SE, Sweden
#p2p-peer-address = p2p.waxsweden.org:35777
# waxhiveguild: FI, Finland
#p2p-peer-address = peer1.hivebp.io:9876
# waxhiveguild: DE, Germany
#p2p-peer-address = peer2.hivebp.io:9876
# sentnlagents: GB, United Kingdom
#p2p-peer-address = waxp2p.sentnl.io:9876
# guild.nefty: DE, Germany
#p2p-peer-address = p2p-node1.neftyblocks.com:9876
# amsterdamwax: NL, Amsterdam
#p2p-peer-address = wax.eu.eosamsterdam.net:9101
# TACO - EU
#p2p-peer-address = peer1.wax.tacocrypto.io:9999
# EOSDAC - NL
p2p-peer-address = wax-p2p.eosdac.io:29876
# wecan - NL
#p2p-peer-address = seed1-wax-mainnet.wecan.dev:14998
# Blacklusion - DE
p2p-peer-address = peer1.wax.blacklusion.io:4646
# Bountyblok - NL
p2p-peer-address = p2p.wax.bountyblok.io:29876

nodeos version: …
This sounds like you're experiencing a deadlock due to the state history threading approach used in Leap 4.0 and 5.0. While state history is in the process of streaming a block out of its websocket, it holds a mutex on the nodeos main thread, which prevents nodeos from doing much of anything: applying blocks, servicing RPC requests that require running on the main thread, etc. So if the state history client decides to make an RPC request and wait for its response, without reading the entirety of a state history message (to the point that the send buffer in nodeos doesn't have any free space), there will be a deadlock. There are no plans to fix this in 4.0 or 5.0. It is fixed in Spring 1.0. As a potential workaround you might be able to set …
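If that diagnosis is right, one defensive pattern on the client side is to keep the socket drained on a dedicated thread and do any blocking work (including RPC calls back to the same nodeos) off that path. A minimal sketch, again assuming `websocket-client`; `process()` is a hypothetical consumer:

```python
import queue
import threading
import websocket  # pip install websocket-client

frames = queue.Queue(maxsize=10_000)  # local buffer; size is illustrative

def drain(ws):
    # Never block this loop on anything except the socket itself: if it
    # stalls, nodeos's send buffer can fill and, per the explanation above,
    # the main thread deadlocks while holding its mutex.
    while True:
        frames.put(ws.recv())

def process(frame):
    pass  # placeholder for slow work: DB writes, RPC round-trips, etc.

ws = websocket.create_connection("ws://127.0.0.1:8889")
threading.Thread(target=drain, args=(ws,), daemon=True).start()

while True:
    process(frames.get())  # blocking work happens off the socket path
```

The local queue can of course still fill if the consumer blocks forever, so this only shifts the problem; the point is that an RPC round-trip to the same nodeos no longer has to wait behind an un-drained state history frame.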
I've had this issue many times in the past where one SHiP reader (Hyperion) is using the same SHiP node as another server/service (Atomic), and something triggered a state where the SHiP node just stops sending blocks to the readers for the requested range. It doesn't log disconnects, and even after restarting the reader the websocket connects (and can actually read the ABI), but the node refuses to send any requested block information. I thought this was just something with my nodes (either version or memory), but I experienced the issue using one of Nation's 4.0.3 SHiP nodes for EOS (my issues tended to be mostly on WAX using 3.1.4).
So in this scenario, I'm attempting to sync the AtomicAssets API on EOS, and it connected and indexed thousands of blocks properly. I needed to update the server's default gateway, so I stopped the reader, made the change, and then, no matter what, Nation's SHiP node would connect but refuse to send any blocks. I tried everything to get it working again (changing the gateway back, rebooting the server) but it just refused to get any blocks (while still successfully connecting). Apparently, at that same time, another of Nation's services that was reading from the same SHiP node immediately stopped receiving block information too. After checking the SHiP logs and seeing no errors, he eventually restarted the SHiP node, and immediately both his and my readers started receiving blocks again after reconnection.
Here's a snippet of the Atomic indexer logs from when it's launching, connecting, and then unable to get any blocks:
So it seems that something related to stopping the Atomic reader was able both to prevent me from receiving blocks on reconnection and, even worse, to cause the SHiP node to stop sending blocks to ANY of its connected clients. But a simple restart of the SHiP nodeos allowed the clients to properly reconnect and receive the block information being requested.
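Since terminating the stuck websocket is what consistently restores service in these reports, a client-side stall watchdog is one pragmatic mitigation while running on 3.x/4.x. A rough sketch, again assuming `websocket-client`; the timeout value is illustrative and reconnect handling is omitted:

```python
import threading
import time
import websocket  # pip install websocket-client

STALL_TIMEOUT = 30  # seconds without a frame before assuming a wedge

last_frame = time.monotonic()

def watchdog(ws):
    # Close the socket if it has gone quiet: per the reports above, the
    # connection stays "healthy" but silent, and terminating it is what
    # lets nodeos (and its other readers) resume.
    while True:
        time.sleep(5)
        if time.monotonic() - last_frame > STALL_TIMEOUT:
            ws.close()
            return

ws = websocket.create_connection("ws://127.0.0.1:8889")
threading.Thread(target=watchdog, args=(ws,), daemon=True).start()

while True:
    frame = ws.recv()  # raises once the watchdog closes the socket
    last_frame = time.monotonic()
```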