-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support concurrency for IAVL and fix Racing conditions #805
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Only minor comments.
Do we have a good understanding of the performance changes under normal conditions? Might be useful to run this before and after the change and compare it with benchstat
// If next fast node from disk is to be removed, skip it. | ||
iter.fastIterator.Next() | ||
iter.Next() | ||
return | ||
} | ||
|
||
nextUnsavedKey := iter.unsavedFastNodesToSort[iter.nextUnsavedNodeIdx] | ||
nextUnsavedNode := iter.unsavedFastNodeAdditions[string(nextUnsavedKey)] | ||
nextUnsavedNodeVal, _ := iter.unsavedFastNodeAdditions.Load(nextUnsavedKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably check whether the second return was set to true before proceeding
@@ -196,7 +203,8 @@ func (iter *UnsavedFastIterator) Next() { | |||
// if only unsaved nodes are left, we can just iterate | |||
if iter.nextUnsavedNodeIdx < len(iter.unsavedFastNodesToSort) { | |||
nextUnsavedKey := iter.unsavedFastNodesToSort[iter.nextUnsavedNodeIdx] | |||
nextUnsavedNode := iter.unsavedFastNodeAdditions[string(nextUnsavedKey)] | |||
nextUnsavedNodeVal, _ := iter.unsavedFastNodeAdditions.Load(nextUnsavedKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worthwhile checking second return here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing changes! do you know if there were any performance regressions by using sync.Map?
Oh sweet! I'll get back tmrw with the benchstats @p0mvn suggested |
I just run this benchmark script locally, I get results like this:
so it seems |
@yihuang Previously, I have attempted to use good old mutexes instead of sync.Maps(prev attempt here mattverse@34d4565) , but I have changed to sync.Map for couple of reasons.
lmk your thoughts on this! |
I am still confused about the parallel use-case of iavl, the parallel writing will break the tree structure, so the root hash will not be deterministic. Actually, the IAVL doesn't provide concurrency now. |
I guess it's concurrent reading and writing, still single writer but multiple readers? |
this is due to the fastnode system, there is an issue with reading and writing that causes a concurrent hashmap error. This aims to fix the issue. Its less so about using the iavl tree in parallel. |
my concern is consistency, nodedb use mutex to protect some maps operations, but here we use |
yeah, I think |
ok, if that's the consensus, I'm down to change it to use sync.RWMutex and re-open PR! |
@mattverse any chance you can make the new pr? otherwise we can do it |
@tac0turtle Code is ready, currently testing, give me 1~2 more days! |
I'm half skeptical it would work, applying RW mutex was pain, alot of places caused deadlocks |
yep getting error
https://github.com/mattverse/iavl/tree/mattverse/rwmutex-iavl This was the attempt I have made, I would be more than happy if you guys want to take it over from this point! lmk on the updates |
id say for the current iavl this pr is fine as is. In iavl 2.0 we will remove the fastnode system so its only present for a little. @cool-develope @yihuang thoughts? |
I don't think the above error is related to RWMutex or sync.Map, it seems like trying parallel writing and breaking deterministic. |
no problem, since it's simpler. |
@Mergifyio backport release/v1.x.x |
✅ Backports have been created
|
Co-authored-by: Marko <marko@baricevic.me> (cherry picked from commit ba6beb1)
@tac0turtle Can we get this in for sdk v0.50? |
yup already backported, we will test it on existing chains for a bit then cut the final release alongside the sdk eden release |
Sweet, should I also create subsequent PRs for old major versions so that its compatible with old sdk versions as well? It's not state breaking so should be patchable |
we would need to backport it to v0.20 for older versions. We can do this and bump the old ones. |
ok so the step would be first backport to v0.20 and then we'd be able to bump the older ones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattverse seems like this version differs from the original a little. it is causing some dead locks in testing within the sdk.
@tac0turtle wdym by "original"? The one without changes? If so, which version of sdk are you testing with? My node seemed fine |
sorry different issue. git bisect led to tthis first but it was wrong |
cref: #696
This PR fixes the problem stated above by replacing maps that causes racing conditions to sync map to be thread safe.
Tested this on various chains' nodes that's having concurrency issue by applying patch manually, confirmed that after applying this patch concurrency issue is not happening any more. (special shout out to @keplr-wallet for helping out testing!)
Also confirmed and tested on a testnet node with the following procedure: