Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DMS-75] Store set size in root node #33

Closed

Conversation

Sereja313
Copy link
Member

@Sereja313 Sereja313 commented Oct 22, 2024

Profiling branch: https://github.com/serokell/canister-profiling/tree/sereja/set-profiling
persistentset_baseline: https://github.com/serokell/motoko-base/tree/sereja/specialize-persistentorderedset (9ebd08f)
persistentset_bh: https://github.com/serokell/motoko-base/tree/sereja/store-bh-in-type (1219b26)

Collection benchmarks

binary_size generate max mem batch_get 50 batch_put 50 batch_remove 50 upgrade
persistentset_baseline+100 196_855 196_106 37_784 49_615 128_635 130_668 210_758
persistentset_bh+100 198_472 202_891 40_856 49_622 133_081 145_795 165_064
persistentset+100 195_778 208_874 40_560 52_372 135_268 136_318 132_766
persistentset_baseline+1000 196_855 2_804_787 532_992 66_718 172_961 177_698 302_292
persistentset_bh+1000 198_472 2_901_910 577_624 66_725 179_379 200_287 251_326
persistentset+1000 195_778 2_943_402 560_968 69_475 180_224 183_299 150_897
persistentset_baseline+10000 196_855 44_279_117 360_508 82_339 214_958 227_419 374_743
persistentset_bh+10000 198_472 45_938_541 400_508 82_346 221_497 262_370 331_581
persistentset+10000 195_778 45_784_053 360_500 85_096 222_773 232_485 166_853
persistentset_baseline+100000 196_855 534_152_351 3_600_508 96_540 260_077 273_659 486_063
persistentset_bh+100000 198_472 553_354_610 4_000_508 96_547 267_902 316_477 419_377
persistentset+100000 195_778 550_390_097 3_600_500 99_297 268_485 278_896 180_468
persistentset_baseline+1000000 196_855 6_241_486_671 36_000_508 114_567 305_638 327_353 551_652
persistentset_bh+1000000 198_472 6_458_837_421 40_000_544 114_574 314_842 382_569 505_973
persistentset+1000000 195_778 6_415_073_281 36_000_536 117_324 314_599 332_258 196_186

set API

size intersect union diff equals isSubset
persistentset_baseline+100 100 157_677 170_099 219_483 156_755 157_147
persistentset_bh+100 100 137_470 137_822 163_025 157_191 156_752
persistentset+100 100 227_133 177_642 235_481 141_054 141_051
persistentset_baseline+1000 1000 339_879 516_679 946_277 1_851_541 1_851_538
persistentset_bh+1000 1000 257_226 255_676 359_488 1_851_541 1_851_538
persistentset+1000 1000 243_126 222_900 310_731 1_694_581 1_694_932
persistentset_baseline+10000 10000 450_437 1_056_800 2_154_884 28_585_151 28_585_107
persistentset_bh+10000 10000 366_799 388_139 565_939 29_005_151 29_005_107
persistentset+10000 10000 258_675 267_119 374_722 27_014_846 27_014_925
persistentset_baseline+100000 100000 589_221 1_922_651 3_769_853 317_013_739 317_013_654
persistentset_bh+100000 100000 464_937 530_657 757_591 321_213_698 321_213_531
persistentset+100000 100000 272_589 311_497 435_111 301_314_377 301_314_046
persistentset_baseline+1000000 1000000 675_690 3_032_485 6_064_414 3_473_642_146 3_473_641_487
persistentset_bh+1000000 1000000 571_282 683_965 950_978 3_515_640_465 3_515_640_175
persistentset+1000000 1000000 287_692 357_319 503_658 3_316_651_476 3_316_649_341

@GoPavel GoPavel added the experiment:to-merge Successful optimization that intented to moved to another PR label Oct 23, 2024
Store the set size in the root node and make set operations to use the
set size instead of calculating black height.
@Sereja313 Sereja313 changed the title Store Set size [DMS-75] Store set size in root node Oct 30, 2024
Comment on lines +33 to +34
public type S <T> = {
#node : (Color, S<T>, T, S<T>);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public type S <T> = {
#node : (Color, S<T>, T, S<T>);
public type Tree<T> = {
#node : (Color, Tree<T>, T, Tree<T>);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it should not be public type then
However, it causes problems with visibility in tests.. So leave it public for a while

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check how I did it in the #35
I can rename it instead of you if you want since I did it already for the Map

Comment on lines +38 to +39
/// Set type with a size attached to the root node of the rbtree.
public type Set <T> = (Nat, S<T>);
Copy link

@GoPavel GoPavel Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Set type with a size attached to the root node of the rbtree.
public type Set <T> = (Nat, S<T>);
/// Set type with a size attached to the root node of the rbtree.
public type Set <T> = { size: Nat, root: Tree<T>);

@GoPavel
Copy link

GoPavel commented Oct 31, 2024

Actually, doesn't matter I will steal changes from this branch into #36 and tweak it as I want :)

@GoPavel
Copy link

GoPavel commented Nov 4, 2024

Add into #36

@GoPavel GoPavel closed this Nov 4, 2024
crusso added a commit to dfinity/motoko-base that referenced this pull request Nov 13, 2024
This is an MR for the 3rd Milestone of the Serokell's grant about
improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation
of the set data structure to the' base' library. Also, it brings a few
changes to the new functional map that was added in #664 , #654 .

# General changes:

* rename `PersistentOrderedMap` to `OrderedMap` (same for the
`OrderedSet`)
* improve docs

# Functional Map changes:

## New functionality:
+ add `any`/`all` functions
+ add `contains` function
+ add `minEntry`/`maxEntry`

## Optimizations:
+ Store `size` in the Map, [benchmark
results](serokell#35)

## Fixup: 
+ add `entriesRev()`, remove `iter()`

# NEW functional Set:

The new data structure implements an ordered set interface using
Red-Black trees as well as the new functional map from the 1-2
Milestones.

## API implemented:
* Basic operations (based on the map): `put`, `delete`, `contains`,
`fromIter`, etc
* Maps and folds: `map`, `mapFilter`, `foldLeft`, `foldRight`
* Set operations: `union` , `intersect`, `diff`, `isSubset`, `equal`
* Additional operations (as for the `OrderedMap`): `min`/`max`,
`all`/`some`

## Maintainance support:
* Unit, property tests
* Documentation

## Applied optimizations:

* Same optimizations that were useful for the functional map:
   * inline node color
   * float-out exceeded matching in iteration
   * `map`/`filterMap` through `foldLeft`
   * direct recursion in `foldLeft`
* [Benchmark results for all four optimizations
together](serokell#27)
* store size in the root of the tree, [benchmark
results](serokell#36 (comment))
* Pattern matching order optimization, [benchmark
results](serokell#36 (comment))
 * Other optimizations:
* Inline code of `OrderedMap` instead of sharing it, [benchmark
results](serokell#25)
* `intersect` optimization: use order of output values to build the
resulting tree faster, see
serokell#39
* `isSubset`, `equal` optimization: use early exit and use order of
subtrees to reduce intermediate tree height, see
serokell#37

## Rejected optimizations:

* Nipkow's implementation of set operation [Tobias Nipkow's "Functional
Data Structures and Algorithms", 117].
Initially, we were planning to use an implementation of set operations
(`intersect`, `union`, `diff`) from Nipkow's book. However, the
experiment shows that naive implementation with a simple size heuristic
performs better. [The benchmark
results](serokell#33) are comparing
3 versions:
* persistentset_baseline -- original implementation that uses Nipkow's
algorithms. However, the black height is calculated before each set
operation (the book assumes it's stored).
* persistentset_bh -- the same as the baseline but the black height is
stored in each node.
* persistentset -- naive implementation that looks up in a smaller set
and modifies a bigger one (it gives us `O(min(n,m)log((max(n,m))` which
is very close to Nipkow's version). Sizes of sets are also stored but
only in the root.
The last one outperforms others and keeps a tree slim in terms of byte
size. Thus, we have picked it.

## Final benchmark results:

### Collection benchmarks

| |binary_size|generate|max mem|batch_get 50|batch_put 50|batch_remove
50|upgrade|
|--:|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|218_168|186_441|37_916|53_044|121_237|127_460|346_108|
|trieset+100|211_245|574_022|47_652|131_218|288_429|268_499|729_696|

|orderedset+1000|218_168|2_561_296|520_364|69_883|158_349|170_418|3_186_579|

|trieset+1000|211_245|7_374_045|633_440|162_806|383_594|375_264|9_178_466|

|orderedset+10000|218_168|40_015_301|320_532|84_660|192_931|215_592|31_522_120|

|trieset+10000|211_245|105_695_670|682_792|192_931|457_923|462_594|129_453_045|

|orderedset+100000|218_168|476_278_087|3_200_532|98_553|230_123|258_372|409_032_232|

|trieset+100000|211_245|1_234_038_235|6_826_516|222_247|560_440|549_813|1_525_692_388|

|orderedset+1000000|218_168|5_514_198_432|32_000_532|115_836|268_236|306_896|4_090_302_778|

|trieset+1000000|211_245|13_990_048_548|68_228_312|252_211|650_405|642_099|17_455_845_492|

### set API

| |size|intersect|union|diff|equals|isSubset|
|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|100|146_264|157_544|215_871|28_117|27_726|
|trieset+100|100|352_496|411_306|350_935|201_896|201_456|
|orderedset+1000|1000|162_428|194_198|286_747|242_329|241_938|
|trieset+1000|1000|731_650|1_079_906|912_629|2_589_090|4_023_673|
|orderedset+10000|10000|177_080|231_070|345_529|2_383_587|2_383_591|

|trieset+10000|10000|3_986_854|21_412_306|5_984_106|46_174_710|31_885_381|
|orderedset+100000|100000|190_727|267_008|402_081|91_300_348|91_300_393|

|trieset+100000|100000|178_863_894|209_889_623|199_028_396|521_399_350|521_399_346|

|orderedset+1000000|1000000|205_022|304_937|464_859|912_901_595|912_901_558|

|trieset+1000000|1000000|1_782_977_198|2_092_850_787|1_984_818_266|5_813_335_155|5_813_335_151|

### new set API

| |size|foldLeft|foldRight|mapfilter|map|
|--:|--:|--:|--:|--:|--:|
|orderedset|100|16_487|16_463|88_028|224_597|
|orderedset|1000|133_685|131_953|1_526_510|4_035_782|
|orderedset|10000|1_305_120|1_287_495|28_455_361|51_527_733|
|orderedset|100000|13_041_665|12_849_418|344_132_505|630_692_463|
|orderedset|1000000|130_428_573|803_454_777|4_019_592_041|7_453_944_902|

---------

Co-authored-by: Andrei Borzenkov <andreyborzenkov2002@gmail.com>
Co-authored-by: Andrei Borzenkov <root@sandwitch.dev>
Co-authored-by: Sergey Gulin <sergeygulin96@gmail.com>
Co-authored-by: Claudio Russo <claudio@dfinity.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experiment:to-merge Successful optimization that intented to moved to another PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants