Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive per-block write IO activity #2131

Closed
4 tasks
mdyring opened this issue Aug 23, 2018 · 28 comments
Closed
4 tasks

Excessive per-block write IO activity #2131

mdyring opened this issue Aug 23, 2018 · 28 comments
Assignees
Labels
T:Bug T: Performance Performance improvements
Milestone

Comments

@mdyring
Copy link

mdyring commented Aug 23, 2018

Summary of Bug

Excessing write IO activity for each block.

Appears to be upwards of 40 MB written per block, making it unnecessarily hard to spin up a full node and catch up with the network.

The activity seems to be centered around data/gaia.db, excerpt from the LOG file in same directory:

3:17:10.741444 table@remove removed @67320
13:17:10.741531 table@remove removed @67321
13:17:10.741609 table@remove removed @67322
13:17:10.741687 table@remove removed @67323
13:17:10.741765 table@remove removed @67324
13:17:10.741837 table@remove removed @67325
13:17:19.045515 table@compaction L1·1 -> L2·11 S·20MiB Q·78844571
13:17:19.076806 table@build created L2@67400 N·21455 S·2MiB "s/k..08D,v37526980":"s/k..335,v68374100"
13:17:19.109740 table@build created L2@67401 N·21422 S·2MiB "s/k..062,v36987389":"s/k..53A,v36279697"
13:17:19.145881 table@build created L2@67402 N·21389 S·2MiB "s/k..7F9,v61224864":"s/k..2B3,v32492752"
13:17:19.167930 table@build created L2@67403 N·21432 S·2MiB "s/k..B73,v37833306":"s/k..57E,v78692588"
13:17:19.189592 table@build created L2@67404 N·21431 S·2MiB "s/k..C64,v22931364":"s/k..4BC,v35049316"
13:17:19.213970 table@build created L2@67405 N·21449 S·2MiB "s/k..D82,v39794490":"s/k..49E,v8828758"
13:17:19.236666 table@build created L2@67406 N·21442 S·2MiB "s/k..01A,v60100601":"s/k..602,v941815"
13:17:19.259804 table@build created L2@67407 N·21471 S·2MiB "s/k..351,v20888038":"s/k..9B0,v67797389"
13:17:19.281040 table@build created L2@67408 N·21371 S·2MiB "s/k..7D8,v21824578":"s/k..F08,v68107408"
13:17:19.302025 table@build created L2@67409 N·21377 S·2MiB "s/k..734,v51262793":"s/k..7F2,v41321596"
13:17:19.302220 table@build created L2@67410 N·5 S·613B "s/k..FCC,v6161581":"s/k..5C2,v22347125"
13:17:19.302342 version@stat F·[0 8 118] S·220MiB[0B 2MiB 218MiB] Sc·[0.00 0.02 0.22]
13:17:19.302495 table@compaction committed F-1 S-156KiB Ke·0 D·2012 T·256.835847ms
13:17:19.302592 table@remove removed @67375
13:17:19.302711 table@remove removed @67272
13:17:19.302795 table@remove removed @67251
13:17:19.302876 table@remove removed @67252
13:17:19.302956 table@remove removed @67253
13:17:19.303030 table@remove removed @67254
13:17:19.303106 table@remove removed @67255
13:17:19.303193 table@remove removed @67256
13:17:19.303277 table@remove removed @67257
13:17:19.303349 table@remove removed @67258
13:17:19.303419 table@remove removed @67259
13:17:19.303488 table@remove removed @67260
13:17:21.111785 table@compaction L1·1 -> L2·10 S·20MiB Q·78846279
13:17:21.143318 table@build created L2@67411 N·21439 S·2MiB "s/k..1F7,v10462741":"s/k..7A8,v76676418"
13:17:21.178527 table@build created L2@67412 N·21438 S·2MiB "s/k..631,v68298584":"s/k..644,v14305475"
13:17:21.229220 table@build created L2@67413 N·21410 S·2MiB "s/k..A49,v36257895":"s/k..641,v66389798"
13:17:21.253648 table@build created L2@67414 N·21381 S·2MiB "s/k..AED,v53323503":"s/k..6B9,v8168541"
13:17:21.277524 table@build created L2@67415 N·21389 S·2MiB "s/k..C69,v49211289":"s/k..846,v23921805"
13:17:21.299747 table@build created L2@67416 N·21420 S·2MiB "s/k..793,v42554802":"s/k..307,v28286350"
13:17:21.320809 table@build created L2@67417 N·21439 S·2MiB "s/k..8D9,v33792045":"s/k..434,v24947960"
13:17:21.342135 table@build created L2@67418 N·21393 S·2MiB "s/k..DC8,v66679886":"s/k..7F0,v56705919"
13:17:21.366599 table@build created L2@67419 N·21433 S·2MiB "s/k..844,v19039774":"s/k..A50,v7869655"
13:17:21.394589 table@build created L2@67420 N·21441 S·2MiB "s/k..B0A,v45510309":"s/k..D6A,v66006804"
13:17:21.394992 table@build created L2@67421 N·57 S·5KiB "s/k..0E2,v36658606":"s/k..B4E,v27887217"
13:17:21.395196 version@stat F·[0 7 119] S·220MiB[0B 2MiB 218MiB] Sc·[0.00 0.02 0.22]
13:17:21.395404 table@compaction committed F~ S-158KiB Ke·0 D·2108 T·283.487393ms
13:17:21.395535 table@remove removed @67372
13:17:21.395743 table@remove removed @67283
13:17:21.395913 table@remove removed @67284
13:17:21.396043 table@remove removed @67285
13:17:21.396136 table@remove removed @67286
13:17:21.396230 table@remove removed @67287

The compaction above seems to coincide with a new block being created, I imagine after a lot of updates to the table.

Steps to Reproduce

Participate in gaia-8001 testnet.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ebuchman
Copy link
Member

Tagging @liamsi @jlandrews @mossid as this seems AVL related

@ebuchman ebuchman added core T: Performance Performance improvements labels Aug 23, 2018
@ValarDragon
Copy link
Contributor

ValarDragon commented Aug 29, 2018

Iavl seems likely to me, @jlandrews iavl benchmarks show that there is a large amount of memory allocated for a single key insert. (On the order of 23kb iirc for simple configurations)

We could benchmark over a simulation, time / memory performance with an iavl backend to determine if this is a prelaunch problem.

@UnitylChaos
Copy link
Contributor

If this is new behavior on 8001 I think it's unlikely that the cause is IAVL directly, since the dependency hasn't been changed from 700x, (v0.9.2 in both). It's entirely possible that some change in SDK code discovered a latent problem in IAVL though, or that it's related to pruning somehow.

@ValarDragon
Copy link
Contributor

I imagine noone had looked at the total writes previously, not that its actually new this release. Or it could be the staking changes have more IAVL writes, so this could have become a noticeable problem without IAVL changing.

@alexanderbez
Copy link
Contributor

Pruning could be something worthwhile to look at. I recall it causing majorrrr slowdowns in Ethermint profiles.

@mdyring
Copy link
Author

mdyring commented Aug 30, 2018

To diagnose this further, I'd suggest adding prometheus metrics #2169.

@mdyring
Copy link
Author

mdyring commented Aug 31, 2018

I just had a quick look at the iavl repo in relation to #2169.

I have no intimate knowledge of the codebase. With this disclaimer, it appears to me that if SaveVersion() is invoked it will flush the entire tree to leveldb.

Wouldn't it be more efficient/natural to just update the underlying database as the changes are made at the node level? So, basically, just doing incremental updates to the DB when necessary instead of flushing everything on block boundaries.

This might make versioning a harder, but perhaps leveldb snapshots could be used here.

Anyway, would be good to get some better Prometheus metrics first to determine what is causing it.

@alexanderbez
Copy link
Contributor

I had a similar concern during Ethermint development. I think this will be addressed in #1952. Correct me if I'm wrong @jlandrews

@UnitylChaos
Copy link
Contributor

To @mdyring's point, the current behavior of SaveVersion is that it will recursively flush new nodes to the db, but any nodes that haven't been changed since the last version won't be saved. This does create a lot of non-uniform usage on block boundaries though. #1952 doesn't change that, it's biggest performance related change is not loading all the roots from disk when the tree is loaded, but otherwise most operations are essentially unchanged.
The problem with updating the nodes as you go, is two fold. First it could lead to setting values which then have to be unset, for example if you're using a working tree which you then roll back rather than committing. This could still be done, but it would require more work in the rollback operation, since it would have to actually delete stuff from the DB. Second problem would be the set operation needing to actually write to the disk would likely increase per-operation latency.

It's definitely doable in theory, but I think it would end up being a question of batching vs streaming updates. Streaming updates could lead to less bursty writes on the block boundaries, but more usage overall. That could end up being better though, so it's really hard to say without more data. I'm currently working on some benchmarks the various operations, so hopefully that can help inform this a bit more.

@alexanderbez
Copy link
Contributor

Thanks for the clarification @jlandrews

@mdyring
Copy link
Author

mdyring commented Sep 6, 2018

I am not so concerned about batching/streaming, as the fact that LevelDB feels the need to compact/reorganize this often.

It just occured to me that the reason that it might be doing this because there are a boatload of changes on the primary key/index. At least that would be the typically behaviour on other DBs.

I haven't dived into the code, so just an idea. But primary key should be something immutable for a node (say, some monotonically incremented id) to avoid reorg/compaction.

@hleb-albau
Copy link
Contributor

hleb-albau commented Oct 6, 2018

Hi, we using cosmos-sdk 0.24.2 with tendermint 0.23.1 and GoLevelDb for our chain prototype. Facing absolutely same issue, as described here. Activity is for data/app-name.db db.
Starting new chain, generating about 500 tx each 40kb with overall 100k messages. After some point performance dropped to 1 tx/s. We observing heavy usage of disk - writes from 5 MiB/s to 30-40 MiB/s for periods up to 20 seconds. LevedDb size is about few mb. So, general chain workflow become :

for 30sec block generation:
1) propose,commit block few secs
2) up to 20sec disk heavy usage
3) only few seconds tendermint accepts new mempool txes
4) back to point 1

Starting new chain, gives next disk usage:
image
Here max value is 40MiB/sec. As you can see, most of time, chain spend on db staff.

Same disk usage stays even if we produce empty blocks for a long period of time. Stop sending txes, just create empty blocks.
image

@hleb-albau
Copy link
Contributor

@mdyring Do you have plans to fix this issue in near future? Hoping so, currently it so blocking....

@mdyring
Copy link
Author

mdyring commented Oct 12, 2018

@mdyring Do you have plans to fix this issue in near future? Hoping so, currently it so blocking....

I only reported the issue. I have, as you, high hopes that this will be fixed sooner rather than later. :-)

@hleb-albau
Copy link
Contributor

@mdyring In our case, compiling with cleveldb and fixing multistore usage fixed problem.
For each 30sec-block we previously rewrite about 5mb, that caused a lot of compactions.

@cwgoes
Copy link
Contributor

cwgoes commented Jan 16, 2019

@alessio Are we planning on doing anything here?

@ebuchman
Copy link
Member

Sounds like we just need to be building with clevledb, and/or replace GoLevelDB with something else.

@hleb-albau can you clarify what fixing multistore usage refers to?

@alexanderbez
Copy link
Contributor

Closing this issue as it seems dead -- please reopen if need be.

@mdyring
Copy link
Author

mdyring commented Feb 20, 2019

Would appreciate to keep it open until there is a proposed fix that is confirmed working.

@jackzampolin jackzampolin reopened this Feb 20, 2019
@alessio
Copy link
Contributor

alessio commented Mar 5, 2019

This is blocked on tendermint/tendermint#3378

@mdyring @hleb-albau I'd very much appreciate if you could do some testing with CLevelDB and report back your findings: make install WITH_CLEVELDB=yes

@mdyring
Copy link
Author

mdyring commented Mar 22, 2019

Just ran two nodes alongside each other with cleveldb and goleveldb.

Sync speed was comparable and cleveldb is exhibiting the same behavior with often occuring compactions in application.db:

2019/03/22-15:56:27.261531 7faceadad700 Compacting 1@1 + 6@2 files
2019/03/22-15:56:27.449944 7faceadad700 Generated table #562: 17209 keys, 2115380 bytes
2019/03/22-15:56:27.639600 7faceadad700 Generated table #563: 28571 keys, 2118110 bytes
2019/03/22-15:56:27.851285 7faceadad700 Generated table #564: 33023 keys, 2119570 bytes
2019/03/22-15:56:28.052745 7faceadad700 Generated table #565: 17981 keys, 2112931 bytes
2019/03/22-15:56:28.255814 7faceadad700 Generated table #566: 26925 keys, 2117347 bytes
2019/03/22-15:56:28.351457 7faceadad700 Generated table #567: 10669 keys, 679382 bytes
2019/03/22-15:56:28.351489 7faceadad700 Compacted 1@1 + 6@2 files => 11262720 bytes
2019/03/22-15:56:28.352512 7faceadad700 compacted to: files[ 0 4 7 0 0 0 0 ]
2019/03/22-15:56:30.459228 7faceadad700 Level-0 table #569: started
2019/03/22-15:56:30.647983 7faceadad700 Level-0 table #569: 2419528 bytes OK
2019/03/22-15:56:30.649610 7faceadad700 Delete type=2 #550
2019/03/22-15:56:30.650110 7faceadad700 Delete type=2 #549
2019/03/22-15:56:30.650526 7faceadad700 Delete type=2 #551
2019/03/22-15:56:30.650933 7faceadad700 Delete type=2 #552
2019/03/22-15:56:30.651333 7faceadad700 Delete type=2 #560
2019/03/22-15:56:30.651771 7faceadad700 Delete type=0 #555
2019/03/22-15:56:30.652036 7faceadad700 Delete type=2 #553
2019/03/22-15:56:30.652380 7faceadad700 Delete type=2 #554
2019/03/22-15:56:31.268140 7faceadad700 Compacting 1@0 + 4@1 files
2019/03/22-15:56:31.458295 7faceadad700 Generated table #570: 37278 keys, 2113815 bytes
2019/03/22-15:56:31.839561 7faceadad700 Generated table #571: 38749 keys, 2113477 bytes
2019/03/22-15:56:32.042603 7faceadad700 Generated table #572: 38975 keys, 2112931 bytes
2019/03/22-15:56:32.248191 7faceadad700 Generated table #573: 34536 keys, 2113370 bytes
2019/03/22-15:56:32.249421 7faceadad700 Generated table #574: 176 keys, 6933 bytes
2019/03/22-15:56:32.249507 7faceadad700 Compacted 1@0 + 4@1 files => 8460526 bytes
2019/03/22-15:56:32.250395 7faceadad700 compacted to: files[ 0 5 7 0 0 0 0 ]

It would be nice with some educated guess as to why compaction is needed this often, it must be due to a lot of changes in key values?

@jackzampolin
Copy link
Member

Maybe the next step here is to expose the compaction parameters and test with different settings. Is this possible?

@mdyring
Copy link
Author

mdyring commented Mar 23, 2019

@jackzampolin Before tweaking the compaction parameters I think it would be helpful to understand why compaction is needed this often. As the application.db grows, compaction will just get more and more IO intensive as it seems everything is (re-)written. So it should ideally happen very rarely.

I am not a LevelDB expert, but from my experience with relational databases, I would only expect compaction/reorganization to be needed if the index is fragmented (in this case there is only one index, the "key"). That could be due to pages being full (so tweakable perhaps via config) or because the key is not immutable per node.

Does IAVL ensure that:

  • modified nodes do not create a new node key?
  • modified nodes are not done as an "delete/insert" to LevelDB but as an actual update not affecting the index?
  • what about cosmos-sdk: Doing updates or delete/insert to IAVL?

@alexanderbez
Copy link
Contributor

@jlandrews or @liamsi is there any insight you can share to the above?

@jackzampolin
Copy link
Member

jackzampolin commented Apr 9, 2019

Relevant: syndtr/goleveldb#226 (comment)

@ebuchman
Copy link
Member

or because the key is not immutable per node.

I believe this is the case. If I'm not mistaken, the key in the LevelDB is the hash of the node, which will change if any of its children change, so this could be the reason for the constant changes.

Sounds like it could require some significant engineering work to address this, but I believe this is along the lines of the work Alexey was doing for the Ethermint store and in Turbo-Geth

@mdyring
Copy link
Author

mdyring commented May 13, 2019

Copying slack history here to capture some details, specifically how integer ID would be used and a "root" key used to reference the root node integer ID.

bucky   [28 minutes ago]
are these issues (ie. compaction time) somewhat mitigated using the cleveldb backend?

mdyring   [28 minutes ago]
nope, tested that a while back. no major effect.

Jack   [27 minutes ago]
@bucky this has to due with the IAVL indexing scheme

Jack   [27 minutes ago]
Which I don't really have great insight into

bucky   [24 minutes ago]
> I would not expect any reasonable db implementation to require reorgs/compactions this often if only the value part of the key-value pair changes.

so in a merkle tree, changing the value means changing the key because all the nodes are content addressable. so this could be part of the cause

bucky   [23 minutes ago]
curious that the other in-process dbs tried (boltdb, badgerdb) seem to have *worse* performance though (edited)

bucky   [22 minutes ago]
are there leveldb compaction settings that can be adjusted possibly ?

bucky   [20 minutes ago]
https://github.com/cosmos/cosmos-sdk/issues/2131#issuecomment-431683989 suggests that cleveldb fixes the compaction issue
GitHub
Excessive per-block write IO activity  · Issue #2131 · cosmos/cosmos-sdk
Summary of Bug Excessing write IO activity for each block. Appears to be upwards of 40 MB written per block, making it unnecessarily hard to spin up a full node and catch up with the network. The a...

bucky   [19 minutes ago]
hmm but looks like martin says it actually doesnt help

mdyring   [9 minutes ago]
@bucky yeah but key in the db could theoretically just be a integer value, assigned sequentially to nodes. Then store node "key" (hash) inside the value part of the db.. and store pointers to leafs/parent as the ids as well.

mdyring   [9 minutes ago]
basically as you'd have done in a relational db with integer primary keys.

mdyring   [8 minutes ago]
unless you need to do a lookup on the node hash for some reason.. but I'd think storing integer references should be fine

bucky   [8 minutes ago]
could be interesting. but there’s no canonical order for the nodes since they change on rotation

bucky   [8 minutes ago]
maybe thats ok if the parents are refering to their children via their permanent index

mdyring   [7 minutes ago]
doesn't matter.. just assign new nodes next sequential id..  and when needing to change tree layout/references

bucky   [7 minutes ago]
tree layout changes constantly. so as long as theres no expectation that assigned index corresponds to some position in the tree. which i guess theres no need for

mdyring   [6 minutes ago]
not sure a key-value store is optimal for this TBH, but could work.. personally i'd just go with something like sqlite

bucky   [6 minutes ago]
sounds like a lot of work needs to be done on the avl tree. Alexey was looking into paging so that nodes would be sotred together in a fixed page size structure, but im not sure exactly where he got and he went back to working on ETH 1.X

mdyring   [6 minutes ago]
but could work as well with kv store

mdyring   [5 minutes ago]
avoiding constantly changing keys should fix it AFAICS, assuming leveldb supports updating the value part only.

mdyring   [2 minutes ago]
there should just then be a single key named "root" which value is updated on commit to point to integer id key of tree root.

@jackzampolin
Copy link
Member

We have done a great deal work on both an Interblock Cache and another cache in the IAVL tree to help mitigate this issue. Going to go ahead and close this. If there are additional performance concerns please open additional issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T:Bug T: Performance Performance improvements
Projects
None yet
Development

No branches or pull requests

9 participants