
High Read Disk I/O recently??? #21809

Closed
fabdarice opened this issue Nov 10, 2020 · 11 comments

Comments

@fabdarice

fabdarice commented Nov 10, 2020

System information

Geth version: 1.9.23-stable
OS & Version: Arch Linux

Expected behaviour

For the past month, I have been seeing an average of under 1 MB/s for both read and write disk I/O.

Actual behaviour

Recently (the last 4 days or so), I have been seeing an average of 20 MB/s for read disk I/O.
Running sudo iotop clearly shows that geth is responsible for this.
The weird thing is that I've tried restarting geth with different parameters (such as turning off --metrics --pprof), but the same thing happens. However, roughly 1 out of every 15-20 restarts, the geth node does normal read and write I/O (under 1 MB/s), and as long as I don't restart the process, it keeps working fine.
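For reference, one way to watch this per process (plain iotop; the flags below are just one reasonable invocation, nothing geth-specific):

# Show only processes currently doing I/O, aggregated per process
# rather than per thread, refreshing every 5 seconds
sudo iotop -o -P -d 5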

[screenshot: disk I/O metrics graph]

@fabdarice fabdarice changed the title High Disk I/O recently??? High Read Disk I/O recently??? Nov 10, 2020
@ethereum ethereum deleted a comment from watsoncoders Nov 10, 2020
@ligi
Member

ligi commented Nov 12, 2020

Is your RPC open to the outside? Can you provide some logs? I see you have --metrics on - can you share them?

@fabdarice
Author

> Is your RPC open to the outside? Can you provide some logs? I see you have --metrics on - can you share them?

Hey @ligi, thanks for looking into this.
Nah, I'm running geth with the flags --http --http.addr localhost.
Metrics: https://gist.github.com/fabdarice/7fb03e6730aaffa7a7a6f8d0632eb2d6
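A sketch of how a dump like that can be pulled from a node running with --metrics --pprof, assuming the default pprof/metrics port of 6060 (adjust if you have changed it):

# JSON dump of geth's internal metrics registry
curl -s http://127.0.0.1:6060/debug/metrics

# Prometheus-formatted variant, if that endpoint is available in your build
curl -s http://127.0.0.1:6060/debug/metrics/prometheus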

Also, one thing I've noticed: every time I restart geth, I get a massive spike of disk I/O for about 2 minutes (write ~130 MB/s and read ~70 MB/s).
For unknown reasons, it always causes my internet to drop.

@holiman
Contributor

holiman commented Nov 12, 2020

> I get a massive spike of disk I/O (write ~130 MB/s and read ~70 MB/s).

That's geth flushing a lot of in-memory data structures (cached trie nodes) to disk; that's to be expected.
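The size of that flush roughly tracks the node's in-memory cache allowance, which the --cache family of flags controls. A minimal sketch (the 1024 below is purely illustrative, not a recommendation):

# A smaller total cache allowance means fewer dirty trie nodes are held in
# memory, so the flush on shutdown/restart is smaller (at the cost of more
# frequent writes during normal operation)
geth --cache 1024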

> For unknown reasons, it always causes my internet to drop.

That's.. not to be expected... :/

@fabdarice
Author

Do you guys know if those numbers are expected?

@holiman
Contributor

holiman commented Nov 22, 2020 via email

@holiman
Contributor

holiman commented Nov 22, 2020

Could you please give this one a test run: #21887? It should be quick to run, and will tell us whether the leveldb database is in a somewhat 'sane' state.

@fabdarice
Author

fabdarice commented Nov 22, 2020

@holiman thanks for looking into it.

Here are some results:
$ ./build/bin/geth dbstats

Compactions
 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   0   |          5 |    1145.36859 |       0.00000 |       0.00000 |       0.00000
   1   |         50 |      99.40603 |       0.00000 |       0.00000 |       0.00000
   2   |        609 |     999.01700 |       0.00000 |       0.00000 |       0.00000
   3   |       7786 |    9998.56247 |       0.00000 |       0.00000 |       0.00000
   4   |      50948 |   99998.97362 |       0.00000 |       0.00000 |       0.00000
   5   |      20982 |   41855.70432 |       0.00000 |       0.00000 |       0.00000
-------+------------+---------------+---------------+---------------+---------------
 Total |      80380 |  154097.03203 |       0.00000 |       0.00000 |       0.00000

$ ./build/bin/geth --nousb dbstats

Compactions
 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   0   |          1 |       9.62615 |       0.00000 |       0.00000 |       0.00000
   1   |         56 |     106.88867 |       0.00000 |       0.00000 |       0.00000
   2   |        627 |    1075.44599 |       0.00000 |       0.00000 |       0.00000
   3   |       8336 |   10753.07116 |       0.00000 |       0.00000 |       0.00000
   4   |      51095 |  100269.49947 |       0.00000 |       0.00000 |       0.00000
   5   |      20982 |   41855.70432 |       0.00000 |       0.00000 |       0.00000
-------+------------+---------------+---------------+---------------+---------------
 Total |      81097 |  154070.23576 |       0.00000 |       0.00000 |       0.00000

EDIT: Unfortunately, by the time I ran those commands, my disk I/O was back to normal.
I'll run it again the next time it's back up to 20-30 MB/s.
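Side note: the same leveldb statistics can also be pulled from the running node via the debug API over IPC, which avoids starting a second geth process against the data directory. A sketch, assuming the default data directory and that the debug namespace is exposed over IPC:

# Attach to the running node and dump leveldb's internal compaction stats
geth attach ~/.ethereum/geth.ipc --exec 'debug.chaindbProperty("leveldb.stats")'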

@fabdarice
Author

fabdarice commented Nov 22, 2020

Note that it fluctuates, and I haven't been able to find a correlation.
The only thing I'm sure of is that the high spikes are caused by geth.

See metrics below.
[screenshot: disk I/O metrics graph showing fluctuating read spikes]

@fabdarice
Author

Hey @holiman, I just ran the command again, as my disk I/O is back to an average of 20-30 MB/s:

 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   0   |          2 |     430.23841 |       0.00000 |       0.00000 |       0.00000
   1   |        477 |     926.86194 |       0.00000 |       0.00000 |       0.00000
   2   |        823 |    1293.13021 |       0.00000 |       0.00000 |       0.00000
   3   |       7873 |    9999.59098 |       0.00000 |       0.00000 |       0.00000
   4   |      50962 |   99999.91003 |       0.00000 |       0.00000 |       0.00000
   5   |      21537 |   42954.67393 |       0.00000 |       0.00000 |       0.00000
-------+------------+---------------+---------------+---------------+---------------
 Total |      81674 |  155604.40551 |       0.00000 |       0.00000 |       0.00000

@ca42

ca42 commented Nov 30, 2020

@holiman we are experiencing this issue as well across all of our full nodes, regardless of whether there are active client connections. We are regularly seeing sustained read throughput of around 100 MB/s:

[screenshot: sustained read throughput of around 100 MB/s]

This is happening on client versions 1.9.21 and 1.9.24, both compiled with go1.15.5 and running on Linux.

@holiman
Contributor

holiman commented Oct 25, 2021

We're always looking into how to reduce disk I/O, but since this bug is fairly old, I don't see any reason to keep it open any longer.

@holiman holiman closed this as completed Oct 25, 2021