nodeos 3.x/4.x SHiP stops serving blocks #1358
Yes, I had 4.0.3 SHiP on EOS for this. Our EOS-EVM stopped syncing when eosusa disconnected the AA reader from our SHiP.
Do you have any process to repeatably reproduce this? I've been trying various combinations of connecting/disconnecting/killing clients and have not been able to reproduce this.
I don't have steps to reproduce. It only happened once for me, that I know of.
Keeping this open pending reproducible steps, but for now we can't reproduce the issue in order to fix it.
Just tried the same reconnection process with our Atomic indexer and Nation's StateHist node and it did NOT experience the same issue this time. So it's definitely not something easily reproducible, but I experience it from time to time on our internal nodes, so I will keep an eye out for it and reach out if/when I see it again (and try to leave it broken so we can troubleshoot it).
We have been finding similar issues.
In both scenarios, the SHIP reader has a healthy connection to nodeos but does not receive any blocks. The issue seems to be related to asking for the ABI, as that is the last call the SHIP reader makes. Potentially this is what nodeos gets stuck on, not syncing any blocks and caught in some loop? We are continuing to investigate and narrow down this issue, but I've been able to replicate both situations: nodeos refuses to sync until the websocket connection is terminated, then instantly recovers as if nothing happened. I will ask @pnx if he can add any further information, as he has investigated the network traffic between the client (SHIP reader) and nodeos.
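For reference, my understanding of the SHiP session flow (useful when reasoning about where it wedges): on connect, nodeos pushes the state_history ABI as the first text frame, and the reader then sends a binary get_blocks_request. A minimal sketch, assuming the Python `websocket-client` package; the hand-rolled serialization follows the state_history ABI as shipped with nodeos, but verify it against the ABI your node actually sends:

```python
import struct
import websocket  # pip install websocket-client

def serialize_get_blocks_request(start, end, max_in_flight):
    # varuint32 variant index 1 = get_blocks_request_v0, then the struct
    # fields in ABI order: three little-endian uint32s, an empty
    # have_positions vector, irreversible_only=false, and
    # fetch_block/fetch_traces/fetch_deltas=true.
    return (b"\x01"
            + struct.pack("<III", start, end, max_in_flight)
            + b"\x00"            # have_positions: empty vector
            + b"\x00"            # irreversible_only = false
            + b"\x01\x01\x01")   # fetch_block, fetch_traces, fetch_deltas

ws = websocket.create_connection("ws://127.0.0.1:8889")  # state-history-endpoint

abi_json = ws.recv()  # nodeos sends the state_history ABI first, as text

ws.send_binary(serialize_get_blocks_request(1, 0xFFFFFFFF, 0xFFFFFFFF))

while True:
    frame = ws.recv()  # binary get_blocks_result frames; in the failure
    print(len(frame))  # mode described here, this call simply never returns
```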
The issue seems to be (in this case) that a call to … So I believe that one (or both) of these events causes nodeos to hang.
@coachbjork and @pnx Thank you for documenting your findings related to this issue. Based on the information you shared, it appears to be an intermittent problem related to SHiP and ABI requests. To effectively address it, we'll need to: … All in an effort to determine the root cause of some requests causing nodeos to hang.
Got this issue again while running thalos. nodeos still has connections to peers but won't respond to API requests and won't receive any blocks. When the client (thalos) disconnects from the SHiP websocket, everything starts working again. Config:

block-log-retain-blocks = 1209600
state-history-log-retain-blocks = 1209600
# Antelope 4 settings
read-only-read-window-time-us = 50001
max-transaction-time = 30
# Database Mode
database-map-mode = mapped_private
#blocks-dir = "blocks"
state-dir = "state-dir"
# Endpoints
agent-name = "shufflingpixels_ship1"
p2p-listen-endpoint = 0.0.0.0:8887
http-server-address = 0.0.0.0:8888
state-history-endpoint = 0.0.0.0:8889
p2p-accept-transactions = false
api-accept-transactions = false
wasm-runtime = eos-vm-jit
eos-vm-oc-compile-threads = 2
eos-vm-oc-enable = 1
chain-threads = 2
validation-mode = light
chain-state-db-size-mb = 130072
resource-monitor-space-threshold = 99
contracts-console = false
read-mode = head
access-control-allow-origin = *
access-control-allow-headers = Origin, X-Requested-With, Content-Type, Accept
http-max-response-time-ms = -1
abi-serializer-max-time-ms = 990000
verbose-http-errors = true
http-validate-host = false
max-clients = 20
sync-fetch-span = 1000
# Plugins
trace-history = true
chain-state-history = true
plugin = eosio::http_plugin
plugin = eosio::state_history_plugin
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin
plugin = eosio::net_plugin
plugin = eosio::net_api_plugin
# waxswedenorg: SE, Sweden
#p2p-peer-address = p2p.waxsweden.org:35777
# waxhiveguild: FI, Finland
#p2p-peer-address = peer1.hivebp.io:9876
# waxhiveguild: DE, Germany
#p2p-peer-address = peer2.hivebp.io:9876
# sentnlagents: GB, United Kingdom
#p2p-peer-address = waxp2p.sentnl.io:9876
# guild.nefty: DE, Germany
#p2p-peer-address = p2p-node1.neftyblocks.com:9876
# amsterdamwax: NL, Amsterdam
#p2p-peer-address = wax.eu.eosamsterdam.net:9101
# TACO - EU
#p2p-peer-address = peer1.wax.tacocrypto.io:9999
# EOSDAC - NL
p2p-peer-address = wax-p2p.eosdac.io:29876
# wecan - NL
#p2p-peer-address = seed1-wax-mainnet.wecan.dev:14998
# Blacklusion - DE
p2p-peer-address = peer1.wax.blacklusion.io:4646
# Bountyblok - NL
p2p-peer-address = p2p.wax.bountyblok.io:29876

nodeos version: …
This sounds like you're experiencing a deadlock due to the state history threading approach used in Leap 4.0 and 5.0. While state history is in the process of streaming a block out of its websocket, it holds a mutex on the nodeos main thread, which prevents nodeos from doing much of anything: applying blocks, servicing RPC requests that require running on the main thread, etc. So if the state history client decides to make an RPC request and wait for its response, without reading the entirety of a state history message (to the point that the send buffer in nodeos doesn't have any free space), there will be a deadlock. There are no plans to fix this in 4.0 or 5.0. It is fixed in Spring 1.0. As a potential workaround you might be able to set …
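If that diagnosis is right, one defensive pattern on the client side is to keep the socket drained on a dedicated thread and do any blocking work (including RPC calls back to the same nodeos) off that path. A minimal sketch, again assuming `websocket-client`; `process()` is a hypothetical consumer:

```python
import queue
import threading
import websocket  # pip install websocket-client

frames = queue.Queue(maxsize=10_000)  # local buffer; size is illustrative

def drain(ws):
    # Never block this loop on anything except the socket itself: if it
    # stalls, nodeos's send buffer can fill and, per the explanation above,
    # the main thread deadlocks while holding its mutex.
    while True:
        frames.put(ws.recv())

def process(frame):
    pass  # placeholder for slow work: DB writes, RPC round-trips, etc.

ws = websocket.create_connection("ws://127.0.0.1:8889")
threading.Thread(target=drain, args=(ws,), daemon=True).start()

while True:
    process(frames.get())  # blocking work happens off the socket path
```

The local queue can of course still fill if the consumer blocks forever, so this only shifts the problem; the point is that an RPC round-trip to the same nodeos no longer has to wait behind an un-drained state history frame.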
I've had this issue many times in the past where one SHiP reader (Hyperion) is using the same SHiP node as another server/service (Atomic), and something triggered a state where the SHiP node just stops sending blocks to the readers for the requested range. It doesn't log disconnects, and even after restarting the reader the websocket connects (and can actually read the ABI), but the node refuses to send any requested block information. I thought this was just something with my nodes (either version or memory), but I experienced the issue using one of Nation's 4.0.3 SHiP nodes for EOS (my issues tended to be mostly on WAX using 3.1.4).
So in this scenario, I'm attempting to sync the AtomicAssets API on EOS, and it connected and indexed thousands of blocks properly. I needed to update the server's default gateway, so I stopped the reader, made the change, and then, no matter what, Nation's SHiP node would connect but refuse to send any blocks. I tried everything to get it working again (changing the gateway back, rebooting the server) but it just refused to get any blocks (while still successfully connecting). Apparently, at that same time, another of Nation's services that was reading from the same SHiP node immediately stopped receiving block information too. After checking the SHiP logs and seeing no errors, he eventually restarted the SHiP node, and immediately both his and my readers started receiving blocks again after reconnection.
Here's a snippet of the Atomic indexer logs from when it's launching, connecting, and then unable to get any blocks:
So it seems that something related to stopping the Atomic reader was able both to prevent me from receiving blocks on reconnection and, even worse, to cause the SHiP node to stop sending blocks to ANY of its connected clients. But a simple restart of the SHiP nodeos allowed the clients to properly reconnect and receive the block information being requested.
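Since terminating the stuck websocket is what consistently restores service in these reports, a client-side stall watchdog is one pragmatic mitigation while running on 3.x/4.x. A rough sketch, again assuming `websocket-client`; the timeout value is illustrative and reconnect handling is omitted:

```python
import threading
import time
import websocket  # pip install websocket-client

STALL_TIMEOUT = 30  # seconds without a frame before assuming a wedge

last_frame = time.monotonic()

def watchdog(ws):
    # Close the socket if it has gone quiet: per the reports above, the
    # connection stays "healthy" but silent, and terminating it is what
    # lets nodeos (and its other readers) resume.
    while True:
        time.sleep(5)
        if time.monotonic() - last_frame > STALL_TIMEOUT:
            ws.close()
            return

ws = websocket.create_connection("ws://127.0.0.1:8889")
threading.Thread(target=watchdog, args=(ws,), daemon=True).start()

while True:
    frame = ws.recv()  # raises once the watchdog closes the socket
    last_frame = time.monotonic()
```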