
Study alternative containers for Storage #2344

Closed
damip opened this issue Feb 26, 2022 · 16 comments

Comments

@damip
Member

damip commented Feb 26, 2022

The shared Storage has few concurrent writes by a couple of threads, heavy concurrent reads by many threads.

We are currently planning on using Arc<RwLock<PreHashMap<BlockId, Arc<RwLock<BlockInfo>>>>> as the underlying storage container.

Now I recently watched this: https://www.youtube.com/watch?v=HJ-719EGIts

Basically it could be possible to do most of the work we want to do lock-free, with much higher performance!

I then looked around and found crates offering this kind of structure:

It looks like those structures can yield an order-of-magnitude performance improvement under the right conditions (which look like the conditions we have).

However, depending on the implementation, there could be some unexpected problems such as a temporary lack of read/write consistency.

We need to:

  • study the proposed containers (and look for others)
  • get a more precise idea of the read/write concurrency and throughput expectations in our architecture; it may be best to wait until the first version of Storage is implemented so we can run an actual benchmark
  • try changing the underlying Storage container with the various options and bench the impact on performance
@gterzian
Contributor

gterzian commented Mar 1, 2022

Basically it could be possible to do most of the work we want to do lock-free, with much higher performance!

I am doubtful of the higher performance of these things.

For example, see the below excerpt from https://www.chromium.org/developers/lock-and-condition-variable/

Programmers assume that locking is expensive, and that using atomic
operations will be more efficient. But in reality, acquiring and releasing a
lock is cheaper than a cache miss; attention to cache behaviour is usually a
more fruitful way to improve performance. Furthermore, lock-free data
structures are often more expensive than using locks. This is because a lock
allows arbitrary changes to be made to a complex data structure; if the same
changes must be made without a lock, the result is likely to take more
atomic read-modify-write instructions, not fewer.

In our specific situation, we have the below structure:

Arc<RwLock<Map<BlockId, Arc<RwLock<StoredBlock>>>>>

Most of the "work" will occur while an individual StoredBlock is locked, so there should be little contention on the lock around the top-level map.

That lock around the top-level map could be replaced by using some sort of lock-free map, however we should consider:

  1. Probably losing the optimization of our current pre-hashed maps.
  2. That going lock-free there probably doesn't give any relevant performance boost.
  3. That we may also want the synchronization that comes with a lock. For example, if a block is pruned, we may not want it to be logically readable after that point. Lock-free structures may have looser synchronization, and if they do allow such "hard" synchronization, they might not be any faster than simply a lock around a map.

@AurelienFT
Contributor

AurelienFT commented Mar 1, 2022

I don't have a full view of the code, so I can't say whether synchronization will be a real problem.

But according to this SO answer (https://stackoverflow.com/a/5680988), it may depend on the algorithm and on the use case. If we look at this benchmark (https://github.com/xacrimon/conc-map-bench), with a lot of reads and few writes it seems we can get a huge improvement.

@AurelienFT
Contributor

AurelienFT commented Mar 1, 2022

Tried to run the benchmark, which is pretty cool: https://github.com/ibraheemdev/conc-map-bench (this fork has a fix for flurry), with sha256 from several crates (bitcoin_hashes and RustCrypto), but they don't implement the Hasher trait.

According to this issue (RustCrypto/hashes#49), implementing that trait for cryptographic hashes is not a good approach. Also, following the SO answer above (and my own view), the best way is to test in close-to-real conditions by updating the existing benchmark at https://github.com/xacrimon/conc-map-bench so that we can specify the number of writing and reading threads and use our own hashes.

@AurelienFT
Contributor

Btw @damip, left-right is a child project of evmap: jonhoo/left-right@abe3769

@AurelienFT
Contributor

I have forked and run the benchmark with Massa's hash and the pre-hash from massa_models. Here are the results:

Scenario Heavy Read :

read   98%
insert  1%
remove  1%

[plots: ReadHeavy latency, ReadHeavy throughput]

Scenario Exchange :

read    10%
insert  40%
remove  40%
update  10%

[plots: Exchange latency, Exchange throughput]

Scenario rapid grow :

read    5%
insert 80%
remove  5%
update 10%

[plots: RapidGrow latency, RapidGrow throughput]

I am working on a fork of bustle (crate used in the benchmark) to split the read and write operations in specific threads and be able to specify how many read/write threads we want.

Links:

@gterzian
Contributor

gterzian commented Mar 2, 2022

@AurelienFT Thank you for your investigations.

It is expected that those maps will show an improvement over locks, since these lock-free structures are specifically designed with that in mind.

The question is whether it will show an improvement in our specific application, which will depend on whether the reading/writing is on the critical path.

Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).

The currently proposed API of Storage would make such experimentation quite easy, since the use of the lock around the top-level map is hidden from the user.

@AurelienFT
Contributor

Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).

I totally agree. The real case situation is always the best and having metrics measurement around the node is pretty useful and can help us see real improvements or not.

@AurelienFT
Contributor

AurelienFT commented Mar 2, 2022

I have successfully modified the two crates to allow specifying the number of read and write threads. It's a bit "hacky", and if we want to contribute it upstream I will have to refactor it a bit, but I don't think that's our priority right now.

Here are the results:

Scenario Heavy Read :

read   98%
insert  1%
remove  1%

[plots: ReadHeavy throughput, ReadHeavy latency]

Scenario Exchange :

read    10%
insert  40%
remove  40%
update  10%

[plots: Exchange throughput, Exchange latency]

Scenario rapid grow :

read    5%
insert 80%
remove  5%
update 10%

[plots: RapidGrow throughput, RapidGrow latency]

If you want me to rerun with different data or thread counts, that's possible; you can also run it directly with my forks of the crates:

To run the benchmark on the fork, use this command:

./scripts/bench.bash --read-threads="9" --write-threads="2" --read-threads="8"  --write-threads="2" --read-threads="7"  --write-threads="2" --read-threads="7"  --write-threads="3" --read-threads="7"  --write-threads="4"

Each --read-threads flag must be paired with a --write-threads flag; each pair creates one benchmark configuration, and as shown above you can add multiple pairs. (I will try to change this to accept a comma-separated list directly.)

To render the plots:

./scripts/plot.bash

Conclusion

We can see that with a read-heavy workload, DashMap or Flurry could be a good fit, while with a lot of writes locks start to be more efficient (which is not our case).

@adrien-zinger
Contributor

Dashmap works with sharding, that's smart, not magic and scalable!! 🧔 🤟

@AurelienFT
Contributor

Dashmap works with sharding, that's smart, not magic and scalable!! 🧔 🤟

Flurry does too:

This map usually acts as a binned (bucketed) hash table. Each key-value mapping is held in a BinEntry

@AurelienFT
Contributor

New benchmark with random data as values instead of 0.

Here are the results:

Scenario Heavy Read :

read   98%
insert  1%
remove  1%

[plots: ReadHeavy throughput, ReadHeavy latency]

Scenario Exchange :

read    10%
insert  40%
remove  40%
update  10%

[plots: Exchange throughput, Exchange latency]

Scenario rapid grow :

read    5%
insert 80%
remove  5%
update 10%

[plots: RapidGrow throughput, RapidGrow latency]

@AurelienFT
Contributor

AurelienFT commented Mar 3, 2022

@gterzian Do you have a branch with the Storage implementation that has been used at least once and that can be run (not necessarily on the testnet, but on the labnet for example)?

Since there is an interface, changing the storage backend will be transparent to the rest of the code. I will try to make sub-branches of this one with implementations of Storage using the DashMap crate and the Flurry crate, so that we can test and benchmark them.

Also, if you have some tasks on the Storage integration, I'd like to help with them, because the more usage we have, the better the benchmark will be!

@gterzian
Contributor

gterzian commented Mar 4, 2022

@AurelienFT Thank you for your enthusiasm.

I would like to reiterate what I wrote above:

Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).

For clarity, this means that I don't think we should be making any efforts on this until the following has been reached:

  1. We have (largely) finished the integration of storage into the project.
  2. We have noticed that the application spends a significant amount of time waiting on the lock on the top-level map becoming available.

Point 2 is especially important: we can always optimize things, but most optimization is wasted effort if it does not materially affect the overall performance of the application.

As an example, we could get a 10x improvement on "something", but this would be meaningless if the app spends only 0.00001% of its time doing that "thing".

if you have some tasks to help you on the integration of the storage so that I can help you

Appreciated, the main discussion would be over at #2178.

I think a good start would be to familiarize yourself with the various issues and currently open PR. I think there will be more specific tasks next week when some initial work is finished.

because the more usage we have the more the bench will be

As per above, the goal is not to optimize this, unless we notice it becomes a problem :)

@AurelienFT
Contributor

I have run a simulator with 26 nodes and 100 tx/s on main with the Storage, and on a branch where I implemented DashMap and Flurry; the results are very close, so it doesn't seem to be a bottleneck:

RwLock

[plot: 100_operations_storage_26_nodes]

DashMap

[plot: 100_operations_storage_lockfree_26_nodes]

Flurry

[plot: 100_operations_storage_lockfree_flurry_26_nodes]

@AureliaDolo AureliaDolo removed the good first issue Good for newcomers label Mar 31, 2022
@damip
Member Author

damip commented Aug 24, 2022

This could be interesting to revisit after Storage is used across the whole codebase.

@damip
Member Author

damip commented Jun 5, 2023

The performance of the current version is good. Closing

@damip damip closed this as completed Jun 5, 2023