This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

ship stopped writing new blocks #9483

Open
cc32d9 opened this issue Sep 10, 2020 · 19 comments

Comments

@cc32d9
Contributor

cc32d9 commented Sep 10, 2020

eosio-2.0.7, Ubuntu 18.04. The node has resynced Telos from genesis. All of a sudden, it stopped writing state history logs, without any messages around the moment when it stopped (log level is set to error).

config.ini (producer_api_plugin was added several hours after this happened):

chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048

wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true

p2p-accept-transactions = false
api-accept-transactions = false

http-server-address = 0.0.0.0:8802
p2p-listen-endpoint = 0.0.0.0:9802
p2p-server-address = :9857

access-control-allow-origin = *

#verbose-http-errors = true
contracts-console = true

max-clients = 100

plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin

plugin = eosio::producer_api_plugin

plugin = eosio::state_history_plugin
trace-history = true
chain-state-history = true
trace-history-debug-mode = true
state-history-endpoint = 0.0.0.0:8082

sync-fetch-span = 10

p2p-peer-address = p2p.telosarabia.net:9876
p2p-peer-address = telos.cryptolions.io:9871
p2p-peer-address = telos.eosdublin.io:9876
root@eosio:/srv/telos# ls -al data/state-history/
total 291419499
drwxr-xr-x 2 root root            6 Sep  6 13:31 .
drwxr-xr-x 6 root root            6 Sep  6 13:31 ..
-rw-r--r-- 1 root root    831008728 Sep  9 08:12 chain_state_history.index
-rw-r--r-- 1 root root 117728914750 Sep  9 08:12 chain_state_history.log
-rw-r--r-- 1 root root    831008728 Sep  9 08:12 trace_history.index
-rw-r--r-- 1 root root 178821446038 Sep  9 08:12 trace_history.log
root@eosio:/srv/telos# ls -al data/blocks/
total 47089027
drwxr-xr-x 3 root root            5 Sep  6 13:31 .
drwxr-xr-x 6 root root            6 Sep  6 13:31 ..
-rw-r--r-- 1 root root    877516224 Sep 10 07:29 blocks.index
-rw-r--r-- 1 root root 128877141011 Sep 10 07:29 blocks.log
drwxr-xr-x 2 root root            3 Sep  6 13:31 reversible

After restarting nodeos, it prints the block range available in state history, but doesn't collect any new data. The websocket is responding, but obviously not delivering any traces.
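
A quick way to see that symptom from a client's side is to ask the plugin for its status over the websocket and compare the advertised end blocks with the head block. Below is a minimal probe, a sketch rather than the official client; it assumes the 2.0.x SHiP protocol, where the server first sends its ABI as a text frame and get_status_request_v0 serializes to a single 0x00 byte, and it uses the state-history-endpoint from the config above.

# Minimal SHiP status probe (sketch, not the official client). Assumes the
# eosio 2.0.x state history protocol: the first frame from the server is the
# ABI (JSON text); get_status_request_v0 is the request variant at index 0
# with no fields, i.e. a single 0x00 byte; the reply is a binary
# get_status_result_v0 with the layout unpacked below.
import asyncio
import struct
import websockets  # pip install websockets

SHIP_URL = "ws://127.0.0.1:8082"  # state-history-endpoint from config.ini

async def probe():
    async with websockets.connect(SHIP_URL, max_size=None) as ws:
        await ws.recv()          # ABI text frame; not needed for this check
        await ws.send(b"\x00")   # get_status_request_v0
        raw = await ws.recv()    # binary get_status_result_v0

        # Layout: variant index (byte 0), head {uint32 num, 32-byte id},
        # last_irreversible {uint32 num, 32-byte id}, then trace_begin/end
        # and chain_state_begin/end as little-endian uint32 each.
        assert raw[0] == 0, "unexpected result variant"
        head_num, = struct.unpack_from("<I", raw, 1)
        lib_num, = struct.unpack_from("<I", raw, 37)
        trace_begin, trace_end, state_begin, state_end = struct.unpack_from("<4I", raw, 73)

        print(f"head={head_num} lib={lib_num}")
        print(f"trace_history blocks:       [{trace_begin}, {trace_end})")
        print(f"chain_state_history blocks: [{state_begin}, {state_end})")
        if trace_end < head_num or state_end < head_num:
            print("WARNING: state history is behind the head block")

asyncio.run(probe())

A healthy node reports trace/chain-state end blocks that track the head; a node in the state described here keeps advancing its head while those end blocks stay frozen.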

In the attached log, I restarted nodeos with default logging at Sep 10 07:32:54. It then started printing "missed a block in trace_history.log".

telos.nodeos.log

@cc32d9
Contributor Author

cc32d9 commented Sep 10, 2020

The node went on with the head block, so it's difficult to reproduce or test a bugfix. But what it should have done is stop, rather than continue, after the failure.

@aclark-b1

Have you been able to do this successfully before? Just trying to determine if something may have changed recently that caused this.

@cc32d9
Contributor Author

cc32d9 commented Sep 16, 2020

No, it's the first time I've seen the state history data corrupted. But the systematic problem is that nodeos prints a warning and carries on, instead of aborting when the data is corrupted.

@cc32d9
Contributor Author

cc32d9 commented Feb 22, 2021

Just experienced it with 2.0.6 on WAX on two independent servers within a couple of days. It just stops updating state history silently. The network has been experiencing frequent forks. This behavior is quite random: I'm maintaining 4 ship servers on WAX, and only one falls out at a time.

@cc32d9
Contributor Author

cc32d9 commented Feb 22, 2021

It looks like it's related to the increased fork frequency, so #10080 might be related.

@n8d

n8d commented Feb 22, 2021

We have seen this several times on EOS recently where we get this warning with every block:

missed a block in chain_state_history.log
    {"name":"chain_state_history"}
    nodeos  state_history_log.hpp:92 write_entry

And we have to stop the node and restore from backup to fix it. This is on 2.0.9.

@cc32d9
Contributor Author

cc32d9 commented Feb 22, 2021

in my case, logging was limited to error level, so no logs here

@heifner
Contributor

heifner commented Feb 23, 2021

We have seen this several times on EOS recently where we get this warning with every block:

Do you have a full log available? I'm interested in whether there was a warn/error before it started warning on every block. Once in this state, it really can't recover without a restore from backup.

@heifner
Contributor

heifner commented Feb 23, 2021

in my case, logging was limited to error level, so no logs here

Would recommend you run with at least warn level logging.

@n8d

n8d commented Feb 23, 2021

We have seen this several times on EOS recently where we get this warning with every block:

Do you have a full log available? I'm interested in whether there was a warn/error before it started warning on every block. Once in this state, it really can't recover without a restore from backup.

Looking closer, it actually happened with a restart (we do auto stop + zfs snap + start). On stop it put this in the log:

corrupted size vs. prev_size

And then when it started it immediately had the missed block in state history issue.

@matthewdarwin

see "crash on exit with error "corrupted size vs. prev_size" #8450

@n8d

n8d commented Feb 23, 2021

see "crash on exit with error "corrupted size vs. prev_size" #8450

Yep, I thought that one was fixed long ago, but it doesn't look like it, or maybe it's a recent regression.

@cc32d9
Contributor Author

cc32d9 commented Feb 24, 2021

Whatever the cause is, nodeos should stop if it cannot keep writing the ship data. Currently this failure is difficult to detect, because it keeps sending empty ship data via the websocket.
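
Until that happens, one crude external check is to alert when the state history logs stop being touched while blocks.log keeps advancing, which is exactly the pattern in the directory listing earlier in this issue. This is a sketch under the assumption of that layout, not anything built into nodeos:

# External staleness check (sketch). Paths assume the layout shown earlier
# in this issue (data/blocks/blocks.log, data/state-history/*.log); adjust
# DATA_DIR and the lag threshold for your deployment.
import os
import sys

DATA_DIR = "/srv/telos/data"   # assumed path, taken from this report
MAX_LAG_SECONDS = 60           # how far SHiP files may trail blocks.log

blocks_mtime = os.path.getmtime(os.path.join(DATA_DIR, "blocks", "blocks.log"))

stale = []
for name in ("trace_history.log", "chain_state_history.log"):
    path = os.path.join(DATA_DIR, "state-history", name)
    if blocks_mtime - os.path.getmtime(path) > MAX_LAG_SECONDS:
        stale.append(name)

if stale:
    print("state history stalled:", ", ".join(stale))
    sys.exit(1)
print("state history looks current")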

@eosusa

eosusa commented Feb 25, 2021

Had a WAX SHiP node start missing blocks in the SHiP logs, and saw the following just before it started missing blocks:

fc: exception: 8 out_of_range_exception: Out of Range write datastream of length 11327 over by -51
{"method":"write","len":"11327","over":"-51"}
nodeos datastream.cpp:6 throw_datastream_range_error

@cc32d9
Contributor Author

cc32d9 commented Sep 14, 2021

Had this error occur again a few days ago, and that was the only node where I forgot to set read-mode=head. It was isolated from speculative transactions, though:

p2p-accept-transactions = false
api-accept-transactions = false

Could also be related to #9972
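
For reference, the option in question is the chain_plugin read-mode setting in config.ini; in head mode the database only reflects transactions that made it into blocks:

read-mode = head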

@cc32d9
Contributor Author

cc32d9 commented Oct 21, 2021

There's at least one occurrence of this error with read-mode=head: 2.0.13 (WAX) could not write to state history and stopped.

@cc32d9
Contributor Author

cc32d9 commented Nov 17, 2021

One more occurrence, and this time it's not at the head block: I was resyncing a Telos ship node from genesis, and it stopped about halfway through the history (last block in the blocks log: 103875765, while the current head block is above 180M).

@cc32d9
Contributor Author

cc32d9 commented Nov 17, 2021

Attached is the log from the Telos node (log level is set to error). The node started from near genesis on Nov 12, then stopped on Nov 14 12:06:20. Later on, I tried to start it again on Nov 17.
telos_ship.log

@cc32d9
Contributor Author

cc32d9 commented Dec 3, 2021

The error is quite definitely related to how busy the storage holding the ship files is. It seems like the write operation times out too early.
